Re: Population count in SSE2, again




"James Van Buskirk" wrote in message
"Maarten Kronenburg" wrote in message

For what it's worth below I give my 32-bit assembler for popcnt.

This is a crude version of the well-known code in AMD's optimization
manual, and is not as fast as the AMD code.


Yes it's the basic one, I'm now putting it in xmm and see where I get.
I work in 32-bit system so the number of registers is a limitation.

Also I peeked for popcnt in GMP 4.2.2, see:
http://gmplib.org/
They have an IA-64 Itanium popcnt with 1 cycle/limb (64-bit), using the
IA-64 popcnt instruction.

I couldn't find where you got that number from. In


Just go to:
http://gmplib.org
and download gmp-4.2.2.tar.gz.
Then look for the files popcount.asm.
I just copied their timings, perhaps they have already improved.

http://www.technovelty.org/code/arch/popcount.html

I see a result of 6 clocks per 8 bytes for Itanium 2.

Their table lookup popcnt runs 8 cycle/limb (32-bit).

Obviously one should be able to get close to 1 byte per cycle
with a Core 2 Duo and 8-bit LUT. Only crappy code would be
short of this mark by a factor of 2.

Clearly your xmm popcnt1 is also very fast, about 1 cycle/limb (32-bit)!

Also on a machine with throughput of one 64-bit hardware per clock
cycle it may be possible to get about 10 bytes per cycle, if the
other throughput problem I am experiencing is not also a problem in
the code I envision.

But because it seems more complicated you should also test it for all
possible 128-bit bit patterns.

No, the thing to do is to follow the references given and assure
oneself that the algorithm is mathematically sound. The compression
stage only needs a truth table for all combinations of three input
bits (2**3 antries) to be exhaustive and the wrap-up stage can
be thought of as 16 independent bytes in parallel. The only
aspects that I haven't seen in previous code were use PSADBW
for the final sweep stage (which is, if anything, more
transparent than the multiply and shift given by AMD, but integer
registers don't have SIMD operations in the x86 ISA) and the
negative logic in the compression stage (which can be extablished
by an 8-entry truth table if you flunked logic circuits.)


Yes I'm sure you are right, but a typo in assembler is easily made.

After that, testing with extreme conditions such as all bits set
and all bits clear as well as random bits to guard against gross
errors in coding, while not exhaustive of all possibilities, is
at least persuasive to me. Besides, the algorithm mashes
together 1024 bits in each loop trip, distilling out one 128-bit
drop to count up, not to mention that the dregs of previous
iterations is an input to each successive iteration, so even
an unattainable number like 128 bits doesn't even start to
describe the complexity of the algorithm. It's explained fairly
well in the second link I gave in the seminal post of this
thread. Check it out and see what's going on here.

The Wikipedia entry on Hamming weight is pretty good for the
wrap stage of the algorithm, but the code there is almost
identical to your example, so it probably doesn't help you.


Yes I found:
http://www.hackersdelight.org/HDcode/newCode/pop_arrayHS.c
To put those in assembler as you did is quite some work.
Perhaps you may find and eliminate dependency chains in your assembler code,
that's the normal way to optimize exisiting assembler, but sometimes it's
rather a puzzle.
My popcnt is part of an integer library, just like GMP, and it's linear so
its performance is not so critical, but for some other special applications
it may be critical. Amazing that you can get so far with optimizing tricks.
Regards, Maarten.

.



Relevant Pages

  • Re: Time this code, please!
    ... as you've created the term "HLL Pre-Parser" to specifically ... Now if you're comparing the speed of the HLA assembler against other ... magnitude faster than the division algorithm. ... What is way less funny, is that preaching Strategy Optimization, ...
    (alt.lang.asm)
  • Re: gcc optimization changes results
    ... my program comprises a new algorithm to determine the objects ... enabling optimization could screw it up. ... you can still enable debugging info while ... If you really understand assembler well and your code isn't ...
    (comp.os.linux.development.system)
  • The Case Against RosAsm (#7) (LONG)
    ... "Branch Displacement Optimization" is one of these ... The first assembler I recall that provided ... when a wanted short jump was longer than required, ...
    (alt.lang.asm)
  • Re: HLA v1.93 is now available
    ... Optimization is Strategy Optimization. ... You're claiming that an assembly language programmer using HLA ... Only in *your* assembler. ... FASM does branch displacement optimization. ...
    (alt.lang.asm)
  • The never ending assembly vs. HLL war
    ... > branch instruction excluded is not particularly effective optimization. ... > CPU, chances are pretty good it *won't* be optimal on a different CPU. ... > discussing the futility of optimizing it in *assembler* when the ... careful and consider the code the compiler is emitting (and adjust your ...
    (comp.lang.asm.x86)