Re: Implement memcmp function with help from gcc's inline asm



Netocrat wrote:
> On Mon, 27 Jun 2005 07:35:36 +0000, Matt wrote:
[...]
> Bottom line is then - no guarantees about how it will compile - if you
> want to fully optimise, get down to the post-compile assembly.

Exactly.

[...]
>> Unrolling loops, not functions. Unrolling a loop is reducing the
>> number of iterations by duplicating the loop body. There is a
>> tradeoff between size and speed. Eventually you reach the point of
>> diminishing returns, and the increase in code size causes more harm
>> than help. You would have to find that point by testing. I would
>> guess that duplicating the loop body 4 times and reducing the number
>> of iterations by a factor of 4
>> would be a good general point to start.
>
> OK, but it would seem that you don't gain very much - maybe one
> instruction * number of unrolls? I mean you still have to test for
> the end of the loop, it's just that you save a jump to the beginning
> if it hasn't ended... is that single jump statement saved 4 times
> really noticeable anyway?

There is much more involved than eliminating a couple jumps. For example:

int a = 1, b = 1;
for(int i = 0; i < n; i++)
{
int c = a + b;
a = b;
b = c;
}

The final 2 assignments in that loop (Fibonacci sequence) will be eliminated
when the loop is unrolled. That is one example of how loop unrolling can
eliminate more than just a couple jump statements. A similar example is
this:

for(int i = 0; i < n; i++)
{
array[i] = 0;
}

This loop is rewritten like so:

int i = 0;
while(i < n)
{
array[i] = 0;
i++;
}

The unrolled version eliminates an increment operation and becomes this:
int i = 0;
while(i < n)
{
array[i] = 0;
array[i + 1] = 0;
i += 2;
}

This would apply to your case. Also, scheduling can play a significant
factor as the memory loads could be hoisted, moved up to the top of the loop
body, which allows you to start a memory load earlier and thus spend less
time waiting on memory.

[...]
>> My indexing trick works regardless of sign so long
>> as the index is the same size as a pointer, and with size_t it is.
>
> No it doesn't work regardless of sign: it _only_ works when the
> variable in question can represent a negative value. The size of
> index is irrelevant; only the sign is relevant. I've looked at your
> new function and it's still wrong (I haven't compiled or tested it,
> but there's no point because it's broken).
>
> The relevant lines are:
> const unsigned int *pia = ((const unsigned int *) a) + ilen;
> ...
> size_t ilen = len / sizeof(int);
> ...
> for(ilen = 0 - ilen; ilen != 0; ilen++) {
> int ia = pia[ilen];
>
> Consider: ilen is unsigned - we both agree on that. Now your
> assignment in the for loop initialiser attempts to reverse the sign
> of ilen so that it is now negative. But this is impossible as it is
> unsigned, so instead of being set to negative ilen it will be set to
> a very large positive number. Now when you index pia with this
> variable you are in fact not indexing with a negative value as you
> intended, but with a very large postive value. This has nothing to
> do with assembly or machine-specific interpretations - it is purely
> due to how the C compiler interprets unsigned vs signed integers.

It is no coincidence that x86 add & subtract instructions are agnostic to
sign. Two's complement forms the basis of x86 processors (and obviously all
others) because the same adder which does unsigned arithmetic does signed
arithmetic too. Subtracting 0 - 1 produces the same result whether signed or
unsigned. Adding -1*4 to the pointer value likewise produces the same result
whether -1 is -1 or 4294957295 (0xFFFFFFFF). Due to arithmatic wrap-around,
adding 0x100000000 is the same as adding 0 since the result has only
32-bits. Therefore adding 0xFFFFFFFF and then adding 1 should be a no-op.
That makes 0xFFFFFFFF semantically identical to -1 which is why they are the
same value.

The only way this computation could be wrong is if the index variable were
smaller than the size of a pointer. An unsigned quantity would be
zero-extended while a signed quantity would be sign-extended. However, that
cannot be the case because sizeof(size_t) == sizeof(void *).

[...]
>> Indexing is generally faster than pointer arithmetic. In x86, all
>> memory references have the general form [base+index*scale+disp].
>> Most x86 processors have special hardware to compute this in
>> parallel, so it is generally recommended that you use
>> [base+index*scale] instead of [base] and incrementing it on each
>> iteration through the loop. It may differ on the P4 since, as I
>> understand, it lacks that hardware.
>
> Confirm that I have interpreted you correctly - you seem to be saying
> that for my method:
>
> int *p = somearray;
> int *pmax = p + count;
> while (p < pmax) {
> do something with *p;
> p++;
> }
>
> the p++ and the access of *p will be two synchronous instructions in
> general whereas for your method, which boils down to:
>
> int *p = somearray;
> p += count;
> count = -count; /* assumes count is a signed integral type */
> while (count != 0) {
> do something with p[count];
> count++;
> }
>
> the increment of count and the access of p[count] can occur
> simultaneously - though possibly not for P4. Yes? If this is what
> happens then you have a compelling argument.

No, what happens is that &p[count] is computed in parallel with other
computations. Neither increment is free. In memcmp(), you are using 2
buffers, and so your version of the loop has to perform 2 increments. The
advantage to this method is that it only needs to perform a single
increment. Every CPU since the original Pentium has been able to compute
&p[count] in 1 cycle completely in parallel. On Athlon, you incur a 1 cycle
latency for doing this. (I am not as familiar with other CPUs, though I
presume it's the same.) Because each pass through the loop is independent,
the additional cycle of latency does not impact the performance of the code.
Using the AGUs frees up the ALUs to do other computations and so this ends
up helping performance. It is a recommendation made by AMD for Athlon in
their optimization guide. Intel's wording is vague, but by my reading they
also recommend it for P4 (P4 Optimization Manual, Ch. 2, Instruction
Selection / Address Calculations).

Here are the annotated versions of the above C code in assembly, modified to
access 2 buffers:

; edi = p1
; ebx = p2
; esi = pmax
cmp edi, esi
jae .done

..top:
mov al, [edi]
inc edi
mov dl, [ebx]
inc ebx
cmp edi, esi
jb .top

..done:

; edi = p1
; ebx = p2
; ecx = count
test ecx, ecx
jz .done

..top:
mov al, [edi+ecx]
mov dl, [ebx+ecx]
inc ecx
jnz .top

..done:

My version of this code has 2 fewer instructions in the loop body. The
complex mov instructions still execute in the same amount of time as the
simple ones. The two increment instructions were replaced with a single
increment. The negative indexing shaves off the compare instruction since
inc sets the flags and they can be immediately used in a branch.

> I suspect that this is not the case, because there is also a test in
> there (count != 0). After count is incremented we cannot
> automatically use it as an index into memory as the description you
> have given suggests can occur - we first have to test that it is not
> zero. This would seem to require that _three_ instructions occur
> simultaneously for your description to be true...
>
>> If I were hand-optimizing the assembly code, I would treat the index
>> variable as a byte offset. Instead of dividing by 4 and multiplying
>> by 4 on each memory reference,
>
> This isn't occurring on each memory reference as far as I can see. The
> division by four occurs only at function entry.

Yes, I meant that I would simply mask off the bits instead of shifting
right. Here is the difference in assembly:

; The way GCC compiles the code
; ecx = len
mov edx, ecx
shr ecx, 2 ; ilen = len / 4
and edx, 3 ; len = len % 4
; ...
mov eax, [esi+ecx*4] ; eax = p1[ilen]
inc ecx ; ilen++
; ...

; The way I would write this code
; ecx = len
mov edx, ecx
and ecx, ~3 ; ilen = len - (len % 4)
and edx, 3 ; len = len % 4
; ...
mov eax, [esi+ecx] ; eax = p1[ilen]
add ecx, 4 ; ilen++
; ...

[...]
>> Be sure to use -march in your optimization options as it can
>> significantly affect the optimizer. Prior to using -march=athlon-mp
>> on
>> my machine, memequ() was beating memcmp() by 25%.
>
> Doesn't seem to have much effect on my machine using -march=i686.

You should be able to have it optimize for P4 while remaining binary
compatible with i686. It used to be that there were both -march and -mcpu
options. One restricted the use of newer instructions, and the other
affected instruction selection and scheduling. My GCC binary (3.4.2) says
something about -mcpu being deprecated, though. Anyway, the scheduling could
make a fair difference, and as I said it remains binary compatible with
older CPUs.

[...]
>> Generally bswap is fast: 1 cycle on Athlon and on Pentium-4E CPUs, 2
>> cycles (?) on Pentium-3. Older P4s like yours require 7 cycles.
>
> I find that hard to believe - really? 7 cycles just to swap 4 bytes?
> Why would mine be slower than Pentium-3? Isn't a P4 later than a
> Pentium-3?

P4 makes slow many operations which were very quick on P3. As with all
engineering fields, there are trade-offs, and Intel decided that they wanted
to optimize their architecture for certain cases. As a result, the cache
latency is 2 cycles instead of 3, and it can perform back-to-back arithmatic
and logical operations at an incredible speed.

When analyzing an algorithm to try and optimize the assembly implementation,
you look for the critical path through the algorithm: the longest chain of
dependent operations. You focus on completing that critical path as quickly
as possible, and you squeeze in other paths which can execute in parallel
where you can. Parallel execution of code can only reduce the total latency
of the algorithm to the latency of its critical path. P4 is able to execute
that path twice as quickly, assuming it only contains simple ops like
add/sub. P4 can do dependent adds nearly 3 times faster than the fastest
Pentium-M (same architecture as P3) or Athlon. However, other operations
like shifts and bswap had to suffer due to physical limitations.

[...]
>> This is not comparing apples to apples since we do not have a common
>> platform, but here are the results I get:
>
> Out of interest, what is your platform? Also, what are the inputs to
> the functions for the tests below? Are they the same as for my tests
> ie the two buffers are equal for len-1 bytes and the last byte is
> different?

Initially I was changing only the last byte, but IIRC these tests were all
run with buf1 == buf2. The timing difference should be negligible. My
platform is GCC 3.4.2 (MinGW) on Windows XP on a dual Athlon-MP 1600+ (1.4
GHz).

[...]
>>> Using a word-based comparison for the memcmp function on x86 is
>>> warranted when the length of initially equal bytes is expected to be
>>> greater than around 9 to 12 bytes. Of course this condition is
>>> unknown at function entry so word-based optimisation is better left
>>> out of a general library function and rather placed in a separate
>>> function which is only called based on the knowledge that generally
>>> comparisons will be equal below this number of bytes.
>>
>> Better yet, the library function (which on my platform is an inline
>> function in a header file) could be an inline function which checks
>> and does a byte compare at small sizes and calls the other routine
>> at large sizes. It is common enough, I think, for the length
>> parameter to be a constant, and the compiler could thus omit that
>> check.
>
> What check are you referring to?

int memcmp(const void *a, const void *b, size_t count)
{
if (count < 16)
return memcmp_small(a, b, count);
else
return memcmp_fast(a, b, count);
}

If you write memcmp(buf1, buf2, 1024) then the compiler can optimize that
into a call to memcmp_fast(). I would bet that in a significant number of
cases the count is fixed.

> In any case I don't think your suggestion is a good one, because the
> length does not necessarily relate to how many bytes are initially
> equal. eg. The length of the buffers may be 1Gigabyte yet they differ
> in the first byte. So calling the word-based routine in such a case
> would be unwarranted even though the length is large.

Well, that is true. I was thinking about that when I looked at
memcmp_matt(). It may be faster to compare the initial 4 bytes and only if
they are all equal start comparing words. The setup overhead is not too
large, though, so without testing I can't say. I would estimate the overhead
at around the same amount of time as a single iteration of a byte comparison
loop.

-Matt

.



Relevant Pages

  • Re: LOOP - Why so slow?
    ... of complicated ones with multiple side effects, so "loop" wasn't ... loop instructions that branch forward). ... This is no big deal for a compiler. ... BS that Intel made up. ...
    (comp.lang.asm.x86)
  • Re: Measurement Accuracy & ANOVA
    ... the number of instructions that had to be executed. ... and in the compiler -- reduce the worth of those direct measures, ... done in that loop that has any variables that are used later, ... For a practical benchmark, you need something that looks like a ...
    (sci.stat.math)
  • Re: Seymour remembered
    ... > manage to get a loop into this buffer, ... I wasn't certain from the article which Fortran compiler they were ... Instructions were 15, 30, or 60 bits long so theoretically the stack ...
    (comp.lang.fortran)
  • Re: Implement memcmp function with help from gccs inline asm
    ... Unrolling a loop is reducing the number ... >>> of iterations by duplicating the loop body. ... > test ecx, ecx ... > My version of this code has 2 fewer instructions in the loop body. ...
    (comp.lang.asm.x86)
  • Re: Automatically transform or expand do loop in a subroutine
    ... Compaq Visual Fortran. ... CONSTANT integer 'd' in its argument list when the compiler is doing ... loop length. ... instructions on installing the compiler as well. ...
    (comp.lang.fortran)