"Read stuff from a file and chop it up to do stuff" code advice wanted.



I am writing code that reinvents the "read stuff from a file and chop
it up to do stuff" code wheel (of pain conan). If you know what I
mean... Why am I doing this?

a) to learn lisp and also how to program...
b) because I am too stupid (or it is too hard) to find code that does
the insanely simple crap I want to do as a newbie. I tried with
google... really. Plus I don't know how to use packages yet.

Anyway... I am seeking some advice... any advice about how to write
code better or with better style. Especially advice on how to write
more beautiful and/or simple code. Am I over documenting? Under
using useful functions that I am too stupid/ignorant to use? I assume
that there is a library that already does what my code is doing... but
I couldn't find it so I wrote my own. If so let me know what the
library is called. P.S. I am on a PC so path names are of the form C:
\blah.text.

If anyone finds the code useful, please let me know!

HERE IS MY CODE (that was so painful to create! aaaarrrgghhh!)

;; NOT-TEXT?
;; INPUT: a character
;; OUTPUT: T or NIL
;;
;; This function returns TRUE if any character
;; passed to it is below #\!... which is where
;; all the ascii (unicode as well?) control characters
;; are. Also below #\! is #\space etc...

(defun not-text? (char)
(if (char< char #\!)
T
NIL))



;; ALPHA-CHAR?
;; INPUT: a character
;; OUTPUT: T or NIL
;;
;; Only works with ASCII??? (I am NOT SURE.)
;; Returns TRUE if any character passed to it
;; is alphabetic. Problematically, Unicode
;; characters (I think it is Unicode at least! - not sure)
;; 91. #\[ 92. #\\ 93. #\] 94. #\^ 95. #\_ 96. #\`
;; are between 65. #\A and 122. #\z. and so they
;; would ALSO make the function return TRUE.
;; I don't know how to make Lisp change between different
;; character code sets so I can't fix/figure out how
;; to solve this issue.

(defun alpha-char? (char)
(if (AND (char-not-lessp char #\A)(char-not-greaterp char #\z))
T
NIL))



;; GET-TEXTCHUNK-FROM-STREAM
;; INPUT: a stream and an array to hold characters in temp memory.
;; OUTPUT: a string or NIL if at the absolute end of file and
;; the temp memory array is empty.
;;
;; This function grabs characters from a stream until it has
;; a chunk of text surrounded by white space
;; or linefeeds or carriage returns and returns the
;; resulting string. It works via recursion... IS THIS
;; INEFFECIENT or slower than a loop??? Don't know.
;;
;; Some notes about the steps of the "COND" part of the fuction:
;; 1 If new-char is nil - then end of file/stream! but if the
;; array contains characters then return the array so the final
;; characters in the stream aren't lost!
;; 2 new-char = NIL & the array has no characters.
;; end of file/stream! Return NIL.
;; 3 Any character below "!" is a control character like
;; #\newline or #\space etc... Text-chunk complete!
;; Throw out the control character and return the chunk.
;; 4 Still reading in legitimate characters. Push new-char
;; onto the array and then keep going
;; via recursion (passing in the updated char-array...)

(defun get-textchunk-from-stream (stream char-array)
;;the below NILs are important! for endoffile error avoidance.
(let ((new-char (read-char stream nil)))
;1
(cond ((AND (eq nil new-char) (> (length char-array) 0)) char-
array)
;2
((eq nil new-char) nil)
;3
((not-text? new-char) char-array)
;4
(T (progn (vector-push-extend new-char char-array)
(get-textchunk-from-stream stream char-array))))))



;; GET-TEXTCHUNK
;; INPUT: a stream
;; OUTPUT: a string
;;
;; This function is just a helper function that sets up
;; GET-TEXTCHUNK-FROM-STREAM to begin its recursive process
;; properly. I could probably have done without it but
;; I couldn't figure out how to make GET-TEXTCHUNK-FROM-STREAM
;; self contained.

(defun get-textchunk (stream)
(let ((char-catcher (make-array 0 :element-type 'character
:fill-pointer 0
:adjustable t)))
(get-textchunk-from-stream stream char-catcher)))



;; GET-ALL-TEXTCHUNKS
;; INPUT: a stream.
;; OUTPUT: a list of strings.
;;
;; This function loops GET-TEXTCHUNK over and over again
;; until the stream ends and GET-TEXTCHUNK returns NIL finally,
;; whereupon GET-ALL-TEXTCHUNKS
;; returns a list of strings.

(defun get-all-textchunks (stream)
(loop for word = (get-textchunk stream)
while word collect word))



;; SLURP-STREAM5
;; INPUT: a stream
;; OUTPUT: a very long string?
;;
;; Holy Crap this grabs all the text from a stream so fast!
;; I got this off the web at:
;; http://www.emmett.ca/~sabetts/slurp.html
;; My older code was grabbing one word from the file stream at a time.
;; It is MUCH faster to use this code to read in all the text at once
;; and then run my code on the text string that this code creates.

(defun slurp-stream5 (stream)
(let ((seq (make-array (file-length stream)
:element-type 'character
:fill-pointer t)))
(setf (fill-pointer seq) (read-sequence seq stream))
seq))



;; SUPER-TEXT-SLURP
;; INPUT: a file path
;; OUTPUT: the output of slurp-string5, i.e. a single long string.
;;
;; The functin uses slurp-stream5 to open a stream and
;; return all the text as a single long string. SUPER-TEXT-SLURP is
;; just some time saving code that saves me from having
;; to open and close strings just to read in text.
;; It also has the advantage of using WITH-OPEN-FILE
;; so the closing of the file is done automatically.

(defun super-text-slurp (file-location)
(with-open-file (temp-var file-location
:direction :input
:if-does-not-exist :error)
(slurp-stream5 temp-var)))



;; *TEMP-TEXT-HOLDER*
;;
;; This is just a global variable to temporarily hold
;; all the text being read in via SUPER-TEXT-SLURP
;; This is also serves as an example call to SUPER-TEXT-SLURP.
;; "Test.txt" is just a giant ASCII text file
;; that I got off of projectgutenburg.org.
;;
;; NOTE:
;; !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
;; Put in the path to any text file on your hard drive
;; to get the other code that follows this function to work.
;; I am using a windows file path here... watch out linux people!

(defparameter *temp-text-holder* (super-text-slurp "C:\\test.txt"))



;; *TEXT-ARRAY*
;;
;; This is an empty adjustable character array.
;; SHOULD I USE DEFVAR INSTEAD of DEFPARAMETER???
;; What is the difference between them anyway???

(defparameter *text-array* (make-array 0 :element-type 'character
:fill-pointer 0
:adjustable t))


;; This next bit is just some necessary code to make it all run...
;; I suppose I should embed it in a master function called
;; main... whateva.
;;
;; *TEXT-ARRAY* is an empty adjustable character array.
;; *temp-text-holder* is a gigantic book length string.
;; "streamz" is a stream created by with-input-from-string.
;; This is necessary because GET-ALL-TEXTCHUNKS only
;; accepts a stream.
;; GET-ALL-TEXTCHUNKS chops it all up into stings.
;; Each string is one word.
;; The contents of *text-array* is a list of strings.

(setq *text-array*
(with-input-from-string (streamz *temp-text-holder*)
(get-all-textchunks streamz)))


;; FIRST-ALPHACHAR-POS
;; INPUT: a character
;; OUTPUT: a non-negative integer
;;
;; This function returns the position
;; of the first alphabetic character in a string;
;; i.e. an "a" through a "z" irrespective
;; of case. Zero is position 1, 1 is position
;; 2 etc... I hate how Lisp does that but
;; whateva. Don't fight the system mang.
;; That is "\\\zing" would return 3 NOT 4.

(defun first-alphachar-pos (string)
(loop for i from 0 to (1- (length string))
do (if (alpha-char? (elt string i))
(return i))
finally (return i)))


;; LAST-ALPHACHAR-POS
;; INPUT: a character
;; OUTPUT: a non-negative integer or -1.
;;
;; This function returns the position
;; of the last alphabetic character in a string;
;; Zero is position 1, 1 is position
;; 2 etc... That is "shlorp&^%" would return 5 NOT 6.

(defun last-alphachar-pos (string)
(loop for i from (1- (length string)) downto 0
do (if (alpha-char? (elt string i))
(return i))
finally (return i)))



;; STRING-LIST-NONCHAR-FIXER (this is way too big of a function!)
;; INPUT: Nothing + a global variable included
;; as part of the function's internal code. This is
;; because I could never figure out how to
;; properly pass a global variable in Lisp.
;; OUTPUT: A list of properly parsed word and non-word
;; strings that retains the proper order from the
;; initial text (the strings stored in the global variable
;; must have been stored as a single list of strings
;; or this function will not work!)
;;
;; This function grabs a string from a global variable's list of
strings
;; (The GV is called "*text-array*" in the code right now but this can
;; change later if need be) and then cuts up the string into
;; non-alphabetic and alphebetic string chunks. It then pushes the
;; chunk(s) in the proper order onto a new list and grabs the
;; next string. When all the strings are processed the function
;; returns the new list in toto.
;;
;; The purpose of this function is to clean
;; up junk AND PUNCTUATION that gets included when "words"
;; are grabbed from a stream based on being surrounded by
;; spaces. For instance "---bob!" would get chuncked up
;; into "---", "bob", and "!". This allows for easy removal
;; of non-words and makes handling punctuation easier in the
;; future. Of course all this assumes that the best way to deal
;; with a large corpus or text is to seperate it into a list of
;; strings... maybe a hash table or an array is more efficient.
;; This function will not fix several problems:
;; 1) Words with non-alpha typos embedded in them or two words
;; stuck together with a non-alphabetic character with no space.
;; 2) Hacker speak type stuff like "7am3r" or "$hit".
;; 3) It doesn't seperate numbers from junk! I bet I have to modify
it
;; later to fix this!

(defun string-list-nonchar-fixer ()
(let ((newlist nil)
(listlen (1- (length *text-array*))))
(loop for i from listlen downto 0
for j = (nth i *text-array*)
;position of first alpha character.
do (let* ((firstcut (first-alphachar-pos j))
;position of first ending-group non-alpha
character.
(lastcut (1+ (last-alphachar-pos j)))
;total length of the string minus 1.
(stringlen (length j))
;only alphabetic characters in string?
(all-alpha (if (and (= firstcut 0) (= lastcut
stringlen))
T NIL)))
;This cond form pushes all the bits onto "newlist".
;1 - only alphabetic characters in string.
(cond (all-alpha (push j newlist))
;2 - non-alphabetic string - push it along...
; perhaps (= lastcut 0) would be more
effecient?
((> firstcut lastcut) (push j newlist))
;3 - junk before but not after.
((= lastcut stringlen)
(progn (push (subseq j firstcut stringlen)
newlist)
(push (subseq j 0 firstcut)
newlist)))
;4 - junk after but not before.
((= firstcut 0)
(progn (push (subseq j lastcut stringlen)
newlist)
(push (subseq j 0 lastcut)
newlist)))
;5 - junk at start and end.
(T (progn (push (subseq j lastcut stringlen)
newlist)
(push (subseq j firstcut lastcut)
newlist)
(push (subseq j 0 firstcut)
newlist))))))
newlist))

;; SAVE-TO-FILE
;; INPUT: a variable to write to file and a filepath with a filename
;; included.
;; OUTPUT: a textfile on your computers harddrive! YAY!
;;
;; This is a handy function for saving text into a file.
;; I don't know if it works with multi-byte characters or not.

(defun SAVE-TO-FILE (variable filename)
(with-open-file (out filename
:direction :output
:if-exists :supersede)
(with-standard-io-syntax
(print variable out))))


(string-list-nonchar-fixer)

.



Relevant Pages

  • Re: "Read stuff from a file and chop it up to do stuff" code advice wanted.
    ... ;; This function returns TRUE if any character ... (if (char< char #\!) ... a stream and an array to hold characters in temp memory. ... ;; resulting string. ...
    (comp.lang.lisp)
  • [TOMOYO #15 3/8] Common functions for TOMOYO Linux.
    ... This file contains common functions (e.g. policy I/O, pattern matching). ... Since TOMOYO Linux is a name based access control, ... TOMOYO Linux's string manipulation functions make reviewers feel crazy, ... the Linux kernel accepts all characters but NUL character ...
    (Linux-Kernel)
  • Re: ReplacerStream
    ... string, do a replace on that string and create a stream again to be ... If those are problems, and you are looking just for a single string, it seems to me that you could just read the stream one character at a time, checking to see if it matches the current character in your search string. ...
    (microsoft.public.dotnet.framework)
  • Re: Extent of standard C/C++ library support in Visual C++
    ... Check if character is alphanumeric ... Reopen stream with different file or mode ... Write formatted data to string (function) ... Find element in range (function template) ...
    (comp.sources.d)
  • RfD: Escaped Strings version 4
    ... the S" string can only contain printable characters, ... the S" string cannot contain the '"' character, ... as an escape character for the entry of characters that cannot be ... \b BS (backspace, ASCII 8) ...
    (comp.lang.forth)