Re: massive data analysis with lisp




fofiko@xxxxxxxxxxxxxxx wrote:
Yes, an excellent additional trick!

To complete your thought, all one needs to do is:

(file-position data-stream indexed-position)
(read data-stream)

and you have the data almost instantly! This way you can fetch as many
or as few of the entries as you like, with fully random access.
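
The two forms above can be wrapped in a small helper. This is only a sketch: the name fetch-entry and its arguments are made up for illustration, and it assumes the position was recorded with file-position when the entry was written.

```lisp
;; Sketch only: FETCH-ENTRY is an illustrative name, not code from the thread.
;; POSITION is assumed to be a value previously returned by FILE-POSITION
;; while the data file was being written.
(defun fetch-entry (data-file position)
  "Return the Lisp object that starts at POSITION in DATA-FILE."
  (with-open-file (data-stream data-file)
    (file-position data-stream position)
    (read data-stream)))
```

Opening the file on every call keeps the sketch short; for many lookups in a row it would probably be cheaper to keep one stream open and only repeat the file-position and read calls.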

Thanks for the extension!

To finish up on the subject, here is the code for a second index
(customer->ratings).

(defun read-mov-idx ()
  (with-open-file (i "mov_index.txt")
    (let ((htable (make-hash-table)))
      (loop for res = (read i nil)
            sum 1 into k
            while res do
              (setf (gethash k htable) res))
      htable)))

(defun make-cust-idx (movidx)
  (let ((assoctab (make-hash-table)))
    (with-open-file (ifile "data.lisp")
      (loop for k being the hash-keys in movidx using (hash-value v) do
        (file-position ifile v)
        (let ((res (cdr (read ifile))))
          (loop for elem in res do
            (let ((custid (first elem)))
              (if (not (gethash custid assoctab))
                  (setf (gethash custid assoctab)
                        (make-array 1 :element-type 'fixnum
                                      :fill-pointer 0 :adjustable t)))
              (vector-push-extend k (gethash custid assoctab)))))))
    (with-open-file (ofidx "cust_index.txt" :direction :output
                                            :if-exists :supersede)
      (with-open-file (ofcust "cust.lisp" :direction :output
                                          :if-exists :supersede)
        (loop for k being the hash-keys in assoctab using (hash-value v) do
          (format ofidx "~A ~A~%" k (file-position ofcust))
          (format ofcust "~S~%" v))))))

It requires about 600 MB of memory to build the index, and after it's
done cust.lisp takes up 550 MB and cust_index.txt about 8.5 MB.

So with both indices in place and resident in memory, total memory
requirements are about 8.6 MB for a very fast way to get to your data
with minimal time wasted, no SQL, and fully integrated with Lisp ;-)
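
For completeness, reading the customer index back might look roughly like this. The function names read-cust-idx and customer-movies are invented for this sketch; it assumes cust_index.txt holds one "customer-id position" pair per line and that cust.lisp holds one printed vector per customer, as written by make-cust-idx.

```lisp
;; Sketch only: READ-CUST-IDX and CUSTOMER-MOVIES are illustrative names.
;; Assumes the file formats produced by MAKE-CUST-IDX.
(defun read-cust-idx (&optional (index-file "cust_index.txt"))
  "Load the customer index into a hash table: customer-id -> file position."
  (with-open-file (i index-file)
    (let ((htable (make-hash-table)))
      (loop for custid = (read i nil)
            while custid
            do (setf (gethash custid htable) (read i)))
      htable)))

(defun customer-movies (custid custidx &optional (data-file "cust.lisp"))
  "Return the vector of movie keys stored for CUSTID, or NIL if unknown."
  (let ((pos (gethash custid custidx)))
    (when pos
      (with-open-file (s data-file)
        (file-position s pos)
        (read s)))))
```

Only the index table stays in memory; each lookup costs one seek and one read on cust.lisp.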

Thanks, this is great. Is it right that the third entry in the movie
index does not correspond to the entry for movie-id=3, as the output of
directory is not sorted? I ended up doing something like this to make
sure that the movie-index was accurate.

;; MovieIDs range from 1 to 17770 sequentially

(defparameter *num-movies* 17770)

;; mv_0000026.txt

(defun make-filename-for-movie (mid)
  (format nil "~Amv_~7,'0D.txt" *trainingdir* mid))

(defun load-all-ratings-sequentially (o oindex dates? &optional
                                      (num-movies *num-movies*))
  (dotimes (i num-movies)
    (read-movie (make-filename-for-movie (+ i 1)) o oindex dates?)))
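
Another option, if one would rather keep the directory listing, might be to recover the movie id from each file name and key the index on that instead of on the position in the list. A minimal sketch, assuming names of the form mv_0000026.txt; movie-id-from-filename is a made-up helper name:

```lisp
;; Sketch only: MOVIE-ID-FROM-FILENAME is an illustrative helper.
;; Assumes the name part looks like mv_0000026, i.e. the digits start
;; at index 3, right after the "mv_" prefix.
(defun movie-id-from-filename (file)
  "Parse the numeric movie id out of a name such as mv_0000026.txt."
  (parse-integer (pathname-name file) :start 3))

;; (movie-id-from-filename "mv_0000026.txt") => 26
```

The same function could also serve as the :key when sorting the directory result numerically.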
