Re: Parallel Common-Lisp with at least 64 processors?

From: "John Thingstad" <jpth...@xxxxxxxxx>
"The Scieneer implementation of the Common Lisp language has been
developed to support enterprise and performance computing
applications, and features innovative multi-threading support for
symmetrical multi-processor systems which clearly differentiates it
from the competition."

Hmm, it looks pretty nice. Does anyone here have direct experience
with it sufficient to rate it as to how well it actually works in
practice?

Browsing links from there:

Linkname: UFFI: Universal Foreign Function Interface for Common Lisp
... Every Common Lisp implementation has a method
for interfacing to such libraries. Unfortunately, these method vary
widely amongst implementations. ^s
(typo, anybody here have authorization to fix the typo?)

... UFFI wraps this common subset of functionality with
it's own syntax ...
x (another typo)

It does not support vectorization if that is what you mean.

Vectorization, in computer science, is the process of converting a
computer program from a scalar implementation, which does an operation
on a pair of operands at a time, to a vectorized program where a
single instruction can perform multiple operations or a pair of vector
(series of adjacent values) operands. Vector processing is a major
feature of both conventional and modern supercomputers.
OK, clarification: When I use the term, I'm not referring to the
*automatic* conversion from a conventional program to a vectorized
version. I'm merely referring to the final result, a single CPU
instruction that processes a whole array of data with the same
function with memory access and actual computing overlapped as fast
as the internal buses in the CPU can accommodate. For example, a
single instruction might compute the pairwise difference of two
arrays writing the differences to a pre-allocated third array, and
a second instruction might compute the squares of those
differences, and a third instruction might compute the sum of those
squares, thereby computing the variance between two vectors in just
three machine instructions. A fourth, non-vectorized operation,
would compute the square root of that sum of squares of
differences, thereby computing the Cartesian distance between the
two original vectors.
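To make the three-step pattern concrete, here is a scalar Python sketch
of the sequence described above (pairwise difference, squaring,
summation). On the envisioned vector hardware each loop body would be a
single instruction applied to whole arrays; this is illustration only,
not vectorized code.

```python
# Scalar sketch of three "vectorized" steps: diff, square, sum.
# On a vector CPU, each function below would be one instruction.
def pairwise_diff(a, b):
    return [x - y for x, y in zip(a, b)]

def square(v):
    return [x * x for x in v]

def vector_sum(v):
    total = 0.0
    for x in v:
        total += x
    return total

a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 8.0]
diffs = pairwise_diff(a, b)     # step 1: [-3.0, -4.0, -5.0]
squares = square(diffs)         # step 2: [9.0, 16.0, 25.0]
total = vector_sum(squares)     # step 3: 50.0
```

A fourth, scalar step (sqrt) applied to `total` would then give the
Cartesian distance between the two vectors.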

But this is a completely separate topic from the multi-CPU question
I posed. For my current application, I have a set of records, each
of which is to be pre-processed in exactly the same way:

- Convert to list of words, all lower case.
- Convert each word to bigrams, trigrams, and tetragrams, separately,
and accumulate those results separately for each list-of-words.
I imagine all of that to be runnable on parallel processes, hence
the 64k query. No vectorization happens there.
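The per-record step might be sketched like this (Python rather than
Lisp, purely for illustration; the function names are mine, not from
any library):

```python
# Per-record preprocessing: lower-case word list, then character
# bigrams, trigrams, and tetragrams accumulated in three separate
# histograms for that record.
def to_words(record):
    return record.lower().split()

def ngrams(word, n):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def preprocess(record):
    counts = {2: {}, 3: {}, 4: {}}
    for word in to_words(record):
        for n in (2, 3, 4):
            for g in ngrams(word, n):
                counts[n][g] = counts[n].get(g, 0) + 1
    return counts  # the triple of histograms for this record

result = preprocess("The cat sat")
```

Since each record is processed independently, a pool of worker
processes could each run `preprocess` on its own slice of the corpus.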

- Accumulate all those bigram, trigram, and tetragram statistics
separately for the entire corpus, yielding three whole-corpus
histograms.
That would be done on the main computer after getting the many
individual triples from the sub-processes.
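The accumulation on the main computer is a straightforward merge of
the triples returned by the sub-processes; a sketch (again Python,
illustrative only):

```python
# Merge one record's triple of n-gram histograms into the running
# whole-corpus totals held on the main machine.
def accumulate(corpus_counts, record_counts):
    for n, hist in record_counts.items():
        dest = corpus_counts.setdefault(n, {})
        for g, c in hist.items():
            dest[g] = dest.get(g, 0) + c
    return corpus_counts

corpus = {}
accumulate(corpus, {2: {"th": 1, "at": 2}})
accumulate(corpus, {2: {"at": 1, "he": 1}})
# corpus[2] is now {"th": 1, "at": 3, "he": 1}
```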

- Divide each of the three single-record histograms for each record
by the whole-corpus histogram for that class among the three, to
yield the three frequency-ratio histograms for each such record.
- Merge those three frequency-ratio histograms for the record into
a single ratio histogram for that record.
- Normalize that merged ratio histogram to have Cartesian length 1.
- Compute the ProxHash, which is a 64-component vector, for the record.
I imagine all of that to be runnable on parallel processes, hence
the 64k query. No vectorization happens there.
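The divide/merge/normalize steps above might be sketched as follows
(illustrative Python; helper names are mine):

```python
import math

# Divide a record's histogram by the whole-corpus histogram of the
# same n-gram class, entry by entry.
def ratio_histogram(record_hist, corpus_hist):
    return {g: c / corpus_hist[g] for g, c in record_hist.items()}

# Merge the three ratio histograms; keys of different n-gram lengths
# can never collide, so a plain union suffices.
def merge(hists):
    merged = {}
    for h in hists:
        merged.update(h)
    return merged

# Scale the merged histogram so its Cartesian length is 1.
def normalize(hist):
    length = math.sqrt(sum(v * v for v in hist.values()))
    return {g: v / length for g, v in hist.items()}

unit = normalize({"th": 3.0, "cat": 4.0})
# unit is {"th": 0.6, "cat": 0.8}, with Cartesian length 1
```

As with the n-gram extraction, each record's divide/merge/normalize
(and its ProxHash) depends only on that record plus the shared corpus
histograms, so these steps also parallelize across records.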

- Perform various calculations of the difference between partial or
full ProxHash vectors in the process of building a nearest-neighbor
structure.
This is where vectorization would happen, but *not* on distributed
computers, merely on one (1) moderately vectorized computer, able
to compute Cartesian distance between two vectors (up to 64
elements in each vector) in four machine instructions
(diff,square,sum,sqrt). The first three operations would be
standard vectorized opcodes, whereas the sqrt would be specially
micro-coded to run extremely quickly by finite Newton's method
pipelined directly in the CPU internal bus structure. (It may be
that some commercial CPUs already include a built-in
vendor-supplied SQRT opcode, in which case of course no additional
micro-coding would be needed.)
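For concreteness, here is how the four-step distance (diff, square,
sum, sqrt) would be used in a nearest-neighbor scan over ProxHash-style
vectors. This is a scalar Python sketch with a brute-force scan; the
actual search structure the author has in mind is not specified here.

```python
import math

# The four operations: diff, square, sum (three vector ops on the
# envisioned hardware), then a scalar or micro-coded sqrt.
def cartesian_distance(a, b):
    diffs = [x - y for x, y in zip(a, b)]   # vector op 1
    squares = [d * d for d in diffs]        # vector op 2
    total = sum(squares)                    # vector op 3
    return math.sqrt(total)                 # sqrt

# Brute-force nearest neighbor among candidate vectors.
def nearest(query, candidates):
    return min(candidates, key=lambda v: cartesian_distance(query, v))

best = nearest([0.0, 0.0], [[3.0, 4.0], [1.0, 1.0], [5.0, 5.0]])
# best is [1.0, 1.0]
```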

Of course the CPU I speak of for the vectorized calculation might
be an auxiliary "vector processor", or it might be functionality
built into a high-performance main CPU.