"Read stuff from a file and chop it up to do stuff" code advice wanted.
- From: landspeedrecord <landspeedrecord@xxxxxxxxx>
- Date: Thu, 25 Oct 2007 00:57:44 -0000
I am writing code that reinvents the "read stuff from a file and chop
it up to do stuff" code wheel (of pain conan). If you know what I
mean... Why am I doing this?
a) to learn lisp and also how to program...
b) because I am too stupid (or it is too hard) to find code that does
the insanely simple crap I want to do as a newbie. I tried with
google... really. Plus I don't know how to use packages yet.
Anyway... I am seeking some advice... any advice about how to write
code better or with better style. Especially advice on how to write
more beautiful and/or simple code. Am I over documenting? Under
using useful functions that I am too stupid/ignorant to use? I assume
that there is a library that already does what my code is doing... but
I couldn't find it so I wrote my own. If so let me know what the
library is called. P.S. I am on a PC so path names are of the form C:
\blah.text.
If anyone finds the code useful, please let me know!
HERE IS MY CODE (that was so painful to create! aaaarrrgghhh!)
;; NOT-TEXT?
;; INPUT: a character
;; OUTPUT: T or NIL
;;
;; This function returns TRUE if any character
;; passed to it is below #\!... which is where
;; all the ascii (unicode as well?) control characters
;; are. Also below #\! is #\space etc...
(defun not-text? (char)
(if (char< char #\!)
T
NIL))
;; ALPHA-CHAR?
;; INPUT: a character
;; OUTPUT: T or NIL
;;
;; Only works with ASCII??? (I am NOT SURE.)
;; Returns TRUE if any character passed to it
;; is alphabetic. Problematically, Unicode
;; characters (I think it is Unicode at least! - not sure)
;; 91. #\[ 92. #\\ 93. #\] 94. #\^ 95. #\_ 96. #\`
;; are between 65. #\A and 122. #\z. and so they
;; would ALSO make the function return TRUE.
;; I don't know how to make Lisp change between different
;; character code sets so I can't fix/figure out how
;; to solve this issue.
(defun alpha-char? (char)
(if (AND (char-not-lessp char #\A)(char-not-greaterp char #\z))
T
NIL))
;; GET-TEXTCHUNK-FROM-STREAM
;; INPUT: a stream and an array to hold characters in temp memory.
;; OUTPUT: a string or NIL if at the absolute end of file and
;; the temp memory array is empty.
;;
;; This function grabs characters from a stream until it has
;; a chunk of text surrounded by white space
;; or linefeeds or carriage returns and returns the
;; resulting string. It works via recursion... IS THIS
;; INEFFECIENT or slower than a loop??? Don't know.
;;
;; Some notes about the steps of the "COND" part of the fuction:
;; 1 If new-char is nil - then end of file/stream! but if the
;; array contains characters then return the array so the final
;; characters in the stream aren't lost!
;; 2 new-char = NIL & the array has no characters.
;; end of file/stream! Return NIL.
;; 3 Any character below "!" is a control character like
;; #\newline or #\space etc... Text-chunk complete!
;; Throw out the control character and return the chunk.
;; 4 Still reading in legitimate characters. Push new-char
;; onto the array and then keep going
;; via recursion (passing in the updated char-array...)
(defun get-textchunk-from-stream (stream char-array)
;;the below NILs are important! for endoffile error avoidance.
(let ((new-char (read-char stream nil)))
;1
(cond ((AND (eq nil new-char) (> (length char-array) 0)) char-
array)
;2
((eq nil new-char) nil)
;3
((not-text? new-char) char-array)
;4
(T (progn (vector-push-extend new-char char-array)
(get-textchunk-from-stream stream char-array))))))
;; GET-TEXTCHUNK
;; INPUT: a stream
;; OUTPUT: a string
;;
;; This function is just a helper function that sets up
;; GET-TEXTCHUNK-FROM-STREAM to begin its recursive process
;; properly. I could probably have done without it but
;; I couldn't figure out how to make GET-TEXTCHUNK-FROM-STREAM
;; self contained.
(defun get-textchunk (stream)
(let ((char-catcher (make-array 0 :element-type 'character
:fill-pointer 0
:adjustable t)))
(get-textchunk-from-stream stream char-catcher)))
;; GET-ALL-TEXTCHUNKS
;; INPUT: a stream.
;; OUTPUT: a list of strings.
;;
;; This function loops GET-TEXTCHUNK over and over again
;; until the stream ends and GET-TEXTCHUNK returns NIL finally,
;; whereupon GET-ALL-TEXTCHUNKS
;; returns a list of strings.
(defun get-all-textchunks (stream)
(loop for word = (get-textchunk stream)
while word collect word))
;; SLURP-STREAM5
;; INPUT: a stream
;; OUTPUT: a very long string?
;;
;; Holy Crap this grabs all the text from a stream so fast!
;; I got this off the web at:
;; http://www.emmett.ca/~sabetts/slurp.html
;; My older code was grabbing one word from the file stream at a time.
;; It is MUCH faster to use this code to read in all the text at once
;; and then run my code on the text string that this code creates.
(defun slurp-stream5 (stream)
(let ((seq (make-array (file-length stream)
:element-type 'character
:fill-pointer t)))
(setf (fill-pointer seq) (read-sequence seq stream))
seq))
;; SUPER-TEXT-SLURP
;; INPUT: a file path
;; OUTPUT: the output of slurp-string5, i.e. a single long string.
;;
;; The functin uses slurp-stream5 to open a stream and
;; return all the text as a single long string. SUPER-TEXT-SLURP is
;; just some time saving code that saves me from having
;; to open and close strings just to read in text.
;; It also has the advantage of using WITH-OPEN-FILE
;; so the closing of the file is done automatically.
(defun super-text-slurp (file-location)
(with-open-file (temp-var file-location
:direction :input
:if-does-not-exist :error)
(slurp-stream5 temp-var)))
;; *TEMP-TEXT-HOLDER*
;;
;; This is just a global variable to temporarily hold
;; all the text being read in via SUPER-TEXT-SLURP
;; This is also serves as an example call to SUPER-TEXT-SLURP.
;; "Test.txt" is just a giant ASCII text file
;; that I got off of projectgutenburg.org.
;;
;; NOTE:
;; !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
;; Put in the path to any text file on your hard drive
;; to get the other code that follows this function to work.
;; I am using a windows file path here... watch out linux people!
(defparameter *temp-text-holder* (super-text-slurp "C:\\test.txt"))
;; *TEXT-ARRAY*
;;
;; This is an empty adjustable character array.
;; SHOULD I USE DEFVAR INSTEAD of DEFPARAMETER???
;; What is the difference between them anyway???
(defparameter *text-array* (make-array 0 :element-type 'character
:fill-pointer 0
:adjustable t))
;; This next bit is just some necessary code to make it all run...
;; I suppose I should embed it in a master function called
;; main... whateva.
;;
;; *TEXT-ARRAY* is an empty adjustable character array.
;; *temp-text-holder* is a gigantic book length string.
;; "streamz" is a stream created by with-input-from-string.
;; This is necessary because GET-ALL-TEXTCHUNKS only
;; accepts a stream.
;; GET-ALL-TEXTCHUNKS chops it all up into stings.
;; Each string is one word.
;; The contents of *text-array* is a list of strings.
(setq *text-array*
(with-input-from-string (streamz *temp-text-holder*)
(get-all-textchunks streamz)))
;; FIRST-ALPHACHAR-POS
;; INPUT: a character
;; OUTPUT: a non-negative integer
;;
;; This function returns the position
;; of the first alphabetic character in a string;
;; i.e. an "a" through a "z" irrespective
;; of case. Zero is position 1, 1 is position
;; 2 etc... I hate how Lisp does that but
;; whateva. Don't fight the system mang.
;; That is "\\\zing" would return 3 NOT 4.
(defun first-alphachar-pos (string)
(loop for i from 0 to (1- (length string))
do (if (alpha-char? (elt string i))
(return i))
finally (return i)))
;; LAST-ALPHACHAR-POS
;; INPUT: a character
;; OUTPUT: a non-negative integer or -1.
;;
;; This function returns the position
;; of the last alphabetic character in a string;
;; Zero is position 1, 1 is position
;; 2 etc... That is "shlorp&^%" would return 5 NOT 6.
(defun last-alphachar-pos (string)
(loop for i from (1- (length string)) downto 0
do (if (alpha-char? (elt string i))
(return i))
finally (return i)))
;; STRING-LIST-NONCHAR-FIXER (this is way too big of a function!)
;; INPUT: Nothing + a global variable included
;; as part of the function's internal code. This is
;; because I could never figure out how to
;; properly pass a global variable in Lisp.
;; OUTPUT: A list of properly parsed word and non-word
;; strings that retains the proper order from the
;; initial text (the strings stored in the global variable
;; must have been stored as a single list of strings
;; or this function will not work!)
;;
;; This function grabs a string from a global variable's list of
strings
;; (The GV is called "*text-array*" in the code right now but this can
;; change later if need be) and then cuts up the string into
;; non-alphabetic and alphebetic string chunks. It then pushes the
;; chunk(s) in the proper order onto a new list and grabs the
;; next string. When all the strings are processed the function
;; returns the new list in toto.
;;
;; The purpose of this function is to clean
;; up junk AND PUNCTUATION that gets included when "words"
;; are grabbed from a stream based on being surrounded by
;; spaces. For instance "---bob!" would get chuncked up
;; into "---", "bob", and "!". This allows for easy removal
;; of non-words and makes handling punctuation easier in the
;; future. Of course all this assumes that the best way to deal
;; with a large corpus or text is to seperate it into a list of
;; strings... maybe a hash table or an array is more efficient.
;; This function will not fix several problems:
;; 1) Words with non-alpha typos embedded in them or two words
;; stuck together with a non-alphabetic character with no space.
;; 2) Hacker speak type stuff like "7am3r" or "$hit".
;; 3) It doesn't seperate numbers from junk! I bet I have to modify
it
;; later to fix this!
(defun string-list-nonchar-fixer ()
(let ((newlist nil)
(listlen (1- (length *text-array*))))
(loop for i from listlen downto 0
for j = (nth i *text-array*)
;position of first alpha character.
do (let* ((firstcut (first-alphachar-pos j))
;position of first ending-group non-alpha
character.
(lastcut (1+ (last-alphachar-pos j)))
;total length of the string minus 1.
(stringlen (length j))
;only alphabetic characters in string?
(all-alpha (if (and (= firstcut 0) (= lastcut
stringlen))
T NIL)))
;This cond form pushes all the bits onto "newlist".
;1 - only alphabetic characters in string.
(cond (all-alpha (push j newlist))
;2 - non-alphabetic string - push it along...
; perhaps (= lastcut 0) would be more
effecient?
((> firstcut lastcut) (push j newlist))
;3 - junk before but not after.
((= lastcut stringlen)
(progn (push (subseq j firstcut stringlen)
newlist)
(push (subseq j 0 firstcut)
newlist)))
;4 - junk after but not before.
((= firstcut 0)
(progn (push (subseq j lastcut stringlen)
newlist)
(push (subseq j 0 lastcut)
newlist)))
;5 - junk at start and end.
(T (progn (push (subseq j lastcut stringlen)
newlist)
(push (subseq j firstcut lastcut)
newlist)
(push (subseq j 0 firstcut)
newlist))))))
newlist))
;; SAVE-TO-FILE
;; INPUT: a variable to write to file and a filepath with a filename
;; included.
;; OUTPUT: a textfile on your computers harddrive! YAY!
;;
;; This is a handy function for saving text into a file.
;; I don't know if it works with multi-byte characters or not.
(defun SAVE-TO-FILE (variable filename)
(with-open-file (out filename
:direction :output
:if-exists :supersede)
(with-standard-io-syntax
(print variable out))))
(string-list-nonchar-fixer)
.
- Follow-Ups:
- Prev by Date: Re: Common Lisp and RegressionTesting
- Next by Date: Re: Learning LISP from scratch
- Previous by thread: Why "Let*" and not "Let" after "do" in a loop??? Grrr.
- Next by thread: Re: "Read stuff from a file and chop it up to do stuff" code advice wanted.
- Index(es):
Relevant Pages
|