Re: nonhomogenous structs (was: lisp performance questions and observations)

From: Duane Rettig (duane_at_franz.com)
Date: 10/26/04


Date: 26 Oct 2004 10:43:56 -0700

rif <rif@mit.edu> writes:

> Duane Rettig <duane@franz.com> writes:
>
> [many interesting comments deleted]
>
> > So you might read and understand this, and say "But I don't care about
> > bit fields; I care about the mixture of bytes, ints, longs, floats,
> > and double-floats". And if this is the case (as I suspect it is), I
> > will leave it to you to guess what how I might respond to that...
>
> Upon the first couple readings of your post, I thought that the thrust
> of your argument is that there is a speed tradeoff in using less space
> --- as I start to use fields that need to be bitmasked and shifted to
> be extracted, I will pay a speed penalty as compared to using machine
> word sized pieces. (I am certainly aware of this issue, as it has
> come up for me before in the "double-float floating point operations
> are often much faster than single-float, in addition to being more
> precise.")

Yes, that's _part_ of it.

> However, I am clearly still missing something, as I don't think I
> understand the difference between C and asm in this context. Perhaps
> you meant that in C, one might end up (naively?) using bit-field
> extraction operations on fields that were already properly aligned,
> which is therefore wasteful, and in assembly you'd be likely to notice
> this? It seems unlikely to me that that's what you meant, but I'm not
> sure what you did mean.

No. Actually, it is more important to understand the difference
between asm and "the machine level" (which is what I've been
emphasizing) than the differences between asm and C. In this
context asm is not likely to know any more about the hardware than
C.

We programmers tend to view the instruction level of an architecture
as atomic. This is reinforced by the limited hardware courses we
might take, where we might build some simple CPU which has a single
instruction decoder and which executes instructions one-per-cycle.
But in order to know what is really efficient, we must dive further
into the hadrware than just to the asm level.

We do get a smattering of this - terms like "pipeline stalls" and
"locality of reference" make their way into our vocabulary, even
as programmers. [Note that I include myself in this group, even
though my initial career had been in hardware - it is the _context_
of software which makes us ignore hardware concepts that might have
surprising effects on the software we write]. But the hardware field
continues to advance, and I notice some trends:

1. Data access is getting wider all the time. You can still access very
small amounts of data, it just becomes less and less efficient in
relative terms; wider-word accesses are what are being optimized more
than sub-word accesses. DEC [sic] engineers recognized this when they
first designed the Alpha architecture, and didn't even _have_ sub-word
accessors like load-byte or store-byte instructions; you had to do
shift-and-mask operations to get byte values (this ommission turned out
to be a mistake and they created new sub-word accessors after all; even
thought they were grossly inefficient, programmers rejected the lack of
byte-accessors because it was what they were used to).

2. Caches are expanding. When "virtual memory" (i.e. paging) was first
invented, the concept of "locality of reference" meant "not forcing a
page from amin memory out to swap disk". In practical terms, it became
important to do as much work as possible on one page, so that all of
the work is likely to stay in main memory. Nowadays, main memory is
much too slow; you want to do all of your work in "cache lines". This
means that if you have an array-of-structs with three elements in each
inner struct, and you are looping on each array element and doing work
with one or two of the three elements, it would actually give _better_
locality of reference to rearrange this data structure into three
arrays with one individual element in each - the cache lines are exhausted
at a rate of 1/3 or 2/3 the frequency that they are when the inner
structs are grouped together.

3. Problem sets are being optimized for their vectorizability. What
this means is that if you lay out your data in vectors of words,
vectors of bytes, vectors of floats, vectors of double-floats, then
a smarter-compiler (tm) might be able to vectorize operations on
these vectors. If you examine one of the SSE/SSE2 architecture manuals,
you'll see loads of instructions that perform the same operations on
same-size-same-type operands, packed as many as possible into 128-bit
words at a time. I don't know how many compilers can do this at this
time, but that is definitely the trend.

4. Vector concepts are being applied to instruction decode and execution.
This is one I've only recently learned, while doing optimizations on
our new AMD64 64-bit Allegro CL port; there is an optimization guide,
and one of the missives in it is that mixing different-size operations
(e.g., "movq rax,rbx" followed by "cmpb al,0") will cause pipeline
stalls and false-dependencies (in this conext the al/ax/eax/rax register
is viewed by the hardware as two different registers, even though they
share sets of bits).

The conclusion is that in general, vectors of non-homogenous structs
are becoming less and less efficient in a relative sense, which means
that homogenous data is becoming more and more efficient in newer
harwdare.

Analyzing this: Lisp never did non-homogenous well; but C does
non-homogenous in an increasingly pessimized way (unless the compiler
is able to reasom about the aliasing and physically break up the
inner struct, in sort of a pseudo-strength-reduction rewrite -
I strongly doubt that this is the case, or whether C as a language
would even allow such a restructuring). In other words, whether
by design or by accident, Lisp anticipated the direction that
modern hardware would take!

> In my application, all the fields are actually integers. (To be fair,
> I had not said this before.)

Yes; it is possible that you have a problem set that minimizes the
distinction I'm making. However, I suspect (even as yet not knowing
what your requirements are) that there is still some gain to be had in
going ahead and breaking up those vectors-of-structs into vectors.

  It's fine to represent them as 32-bit
> ints (or as CMUCL fixnums, actually). I'm not trying to squeeze every
> last bit out of each field, as I wouldn't want to pay the speed price
> for that in C. The big problem with my CL version is not about
> bit-packing, it's about the fact that if what I want seems to be
> naturally represented as an array of structs, I'm going to pay (under
> CMUCL) an extra 3 machine words/object: one word because instead of
> laying the objects directly into the array, I have an array of
> pointers, and two more words because each struct has to have the
> header that says what type it is, even though I know in advance that
> they're all the same type.
>
> (Note that in this application, because everything is actually an
> integer, I can represent the array of structs as a single big array of
> integers. I can develop syntax that hides this abstraction from the
> algorithm, and this is what I'm likely to do. If I had a mix of
> "individually well-aligned" types, things would be tougher --- I'd
> have to use parallel arrays to do it cleanly in CL, and my locality of
> reference would suck, even in the case where everything laid out
> nicely in C.)

I suspect that you will be surprised at the result of this exercise;
it may even get _faster_, depending on the precentage of the inner structs
you access in any particular loops. If you even have only one
loop that works on part of the cross-section of data at each index,
it would still probably be worth breaking the structs apart.

> So actually, I don't care about the mixture between bytes, longs,
> ints, and floats. I care about the fact that I'm paying 3 extra
> machine words for each struct before I even get to the "data" for the
> struct itself.

OK, but in today's hardware, unless your data is 128 bits long, it
can still stand optimization by rearranging into vectors. As you
obviously know (since you proposed the solution and rejected it for
other reasons) this only requires up to three extra words per
inner-struct-element-item which should get you much closer to your
goal of matching the C space requirement.

> I still don't know what "C-centric" means in the context of this
> conversation. I suspect I misunderstood your point; I apologize in
> advance.

Hopefully by now you've figured out that C-centric means "arrays of
non-homogenous structs (note that the subject line, which I had changed
last iteration, provides the clue). It is C-centric because of the
threelanguages under consideration: C, asm (which is agnostic), and
CL (which looks for word orientation), C is the only one which
considers packing nonhomogenous data into a struct and replecating
that concoction to be the most efficient way of doing things. In
fact, it is becoming more and more machine-eccentric and pessimized.

-- 
Duane Rettig    duane@franz.com    Franz Inc.  http://www.franz.com/
555 12th St., Suite 1450               http://www.555citycenter.com/
Oakland, Ca. 94607        Phone: (510) 452-2000; Fax: (510) 452-0182   


Relevant Pages

  • Re: C# and C++ (again)
    ... The answer is to use classes, not structs. ... No need to break apart a portion of an array ... >> language where I will be most productive. ... it would be very nice to treat this 27 element array ...
    (microsoft.public.dotnet.languages.csharp)
  • Re: array of struct from c++ to c#
    ... I currently have a single struct being passed from a C++ dll to a C# ... I'm now trying to pass an array of structs back to ...
    (microsoft.public.dotnet.framework.interop)
  • Re: * struct-like list *
    ... Age: 108 Birthday: 061095 SocialSecurity: 476892771999 ... I would like to have an array of "structs." ... I want to go through the file, filling up my list of structs. ... DOB: 061095 ...
    (comp.lang.python)
  • Re: Allocating structs on the stack
    ... > I am preparing for a class this fall that teaches C# programming. ... One big academic issue to me is that use of structs in ... > an array, in which case the rule seems to not be implemented). ... The compiler doesn't know whether an array element has been fully ...
    (microsoft.public.dotnet.languages.csharp)
  • Re: Problem with creating classes as runtime - PLEASE HELP
    ... instruction anyway... ... perhaps see if it disappears in release mode... ... are you always handling classes? ... in structs then the IL changes to accomodate boxing and unboxing; ...
    (microsoft.public.dotnet.languages.csharp)

Loading