Re: Implement memcmp function with help from gcc's inline asm
- From: Netocrat <spamtrap@xxxxxxxxxx>
- Date: Mon, 27 Jun 2005 17:54:13 +0000 (UTC)
On Mon, 27 Jun 2005 07:35:36 +0000, Matt wrote:
> Netocrat wrote:
> [...]
>>> Now when you dereference the pointers, just write bswap(*ua) >
>>> bswap(*ub). Actually, the better thing to do is this: long cmp =
>>> bswap(*ua) - bswap(*ub);
>>> if (cmp < 0) return -1;
>>> else if (cmp > 0) return 1;
>>> // else continue
>>
>> Due to the above noted requirement for unsigned quantities, I can't use
>> this method. However, I would expect an optimising compiler to generate
>> the same code for what you presented as for:
>>
>> if (bswap(*ua) > bswap(*ub))
>> return 1;
>> else if (bswap(*ua) < bswap(*ub))
>> return -1;
>> else
>> return 0;
>
> Maybe, but the compiler may generate a load for each *ua and *ub even if
> it is optimizing. There are no guarantees, and pointer analysis is one of
> the hardest problems for C/C++ compilers.
Bottom line is then - no guarantees about how it will compile - if you
want to fully optimise, get down to the post-compile assembly.
>>> With GCC 3.4.2 at -O3 -march=athlon-xp -fomit-frame-pointer, memequ()
>>> is a good 10% faster than memcmp(). There is room for optimization,
>>> too, as both functions can be unrolled.
>>
>> Unrolling functions is what?
>
> Unrolling loops, not functions. Unrolling a loop is reducing the number
> of iterations by duplicating the loop body. There is a tradeoff between
> size and speed. Eventually you reach the point of diminishing returns,
> and the increase in code size causes more harm than help. You would have
> to find that point by testing. I would guess that duplicating the loop
> body 4 times and reducing the number of iterations by a factor of 4
> would be a good general point to start.
OK, but it would seem that you don't gain very much - maybe one
instruction * number of unrolls? I mean you still have to test for the
end of the loop, it's just that you save a jump to the beginning if it
hasn't ended... is that single jump statement saved 4 times really
noticeable anyway?
>> Again, memcmp needs to compare memory as unsigned. So the first
>> problem with your functions is that all memory comparison variables are
>> signed. Secondly you're relying on llen being signed for your negative
>> indexing trick; unfortunately size_t is unsigned. Who knows what
>> memory is actually being referenced. Anyhow there may or may not be
>> other problems but I didn't check further - modifying and then plugging
>> your excellent bswap function into the memcmp_word function that I
>> first presented worked as expected; compiling with -O3 is required for
>> full benefit.
>
> size_t is not signed.
Agreed - that's what I said.
> My indexing trick works regardless of sign so long
> as the index is the same size as a pointer, and with size_t it is.
No it doesn't work regardless of sign: it _only_ works when the variable
in question can represent a negative value. The size of index is
irrelevant; only the sign is relevant. I've looked at your new function
and it's still wrong (I haven't compiled or tested it, but there's no
point because it's broken).
The relevant lines are:
const unsigned int *pia = ((const unsigned int *) a) + ilen;
...
size_t ilen = len / sizeof(int);
...
for(ilen = 0 - ilen; ilen != 0; ilen++) {
int ia = pia[ilen];
Consider: ilen is unsigned - we both agree on that. Now your assignment
in the for loop initialiser attempts to reverse the sign of ilen so that
it is now negative. But this is impossible as it is unsigned, so instead
of being set to negative ilen it will be set to a very large positive
number. Now when you index pia with this variable you are in fact not
indexing with a negative value as you intended, but with a very large
postive value. This has nothing to do with assembly or machine-specific
interpretations - it is purely due to how the C compiler interprets
unsigned vs signed integers.
>> Other general comments re efficiency in your code. You are using
>> indexing on l1 and l2 whereas I was using pointer increments and a test
>> for a max pointer. My method requires an extra variable but I would
>> have thought that indexing was slower than incrementing pointers.
>> Perhaps compiler optimisations make them equivalent (untested comments
>> but observations lead me to believe that for -O3 this is true).
>
> Indexing is generally faster than pointer arithmetic. In x86, all memory
> references have the general form [base+index*scale+disp]. Most x86
> processors have special hardware to compute this in parallel, so it is
> generally recommended that you use [base+index*scale] instead of [base]
> and incrementing it on each iteration through the loop. It may differ on
> the P4 since, as I understand, it lacks that hardware.
Confirm that I have interpreted you correctly - you seem to be saying that
for my method:
int *p = somearray;
int *pmax = p + count;
while (p < pmax) {
do something with *p;
p++;
}
the p++ and the access of *p will be two synchronous instructions in
general whereas for your method, which boils down to:
int *p = somearray;
p += count;
count = -count; /* assumes count is a signed integral type */
while (count != 0) {
do something with p[count];
count++;
}
the increment of count and the access of p[count] can occur simultaneously
- though possibly not for P4. Yes? If this is what happens then you have
a compelling argument.
I suspect that this is not the case, because there is also a test in there
(count != 0). After count is incremented we cannot automatically use it
as an index into memory as the description you have given suggests can
occur - we first have to test that it is not zero. This would seem to
require that _three_ instructions occur simultaneously for your
description to be true...
> If I were hand-optimizing the assembly code, I would treat the index
> variable as a byte offset. Instead of dividing by 4 and multiplying by 4
> on each memory reference,
This isn't occurring on each memory reference as far as I can see. The
division by four occurs only at function entry.
> I would simply mask off the lower bits and
> then increment/decrement by 4 on each pass through the loop. Making that
> optimization, both snippets of code should be identical on P4. However,
> I don't think GCC can do that.
That sounds like a good description of what my code is doing...
simply incrementing by 4 on each pass. It is your code that requires
multiplication - that is one reason why I would suggest that assuming no
optimisations have been performed by the compiler, my code requires less
operations. Of course I am not as intimate with the hardware as most in
this newsgroup so I could be wrong.
>> Finally comments on the prototype: yours is: long memcmp(char *p1, char
>> *p2, size_t len); The actual library version is: int memcmp(const void
>> *s1, const void *s2, size_t n);
>>
>> Note the int return and const void * arguments.
>
> Yes, I see. I switched everything to long because I figure that "long"
> will typically be the largest atomic type on a platform. If you're only
> using x86, then they are the same. On x86-64, you would want to work
> with 64-bit quantities since it would improve performance by a factor of
> 2 when comparing cached data.
All valid points, however I am specifically looking at the library
function as defined by standard C, so I am not at liberty to change the
types.
>> Yes it looks very compact bar that extra mov. How I would go about
>> viewing this generated assembly on my system?
>
> gcc -S <optimization options> input.c -o input.s
Wonderful. :)
> Be sure to use -march in your optimization options as it can
> significantly affect the optimizer. Prior to using -march=athlon-mp on
> my machine, memequ() was beating memcmp() by 25%.
Doesn't seem to have much effect on my machine using -march=i686.
>> I would have expected that for small lengths the library function would
>> have been significantly faster than the asm-enhanced version since the
>> overhead of converting from bytes to words and reversing word order
>> would wipe out any performance gained by comparing words-at-a-time vs
>> bytes-at-a-time. Instead I found that in _all_ cases the asm-enhanced
>> version was faster than the library version.
>
> Generally bswap is fast: 1 cycle on Athlon and on Pentium-4E CPUs, 2
> cycles (?) on Pentium-3. Older P4s like yours require 7 cycles.
I find that hard to believe - really? 7 cycles just to swap 4 bytes? Why
would mine be slower than Pentium-3? Isn't a P4 later than a Pentium-3?
> You should check the implementation of memcmp() and see what it's doing. I
> would not be surprised if somebody made it an intrinsic rep cmpsd.
I may do that. Right now I don't have the motivation to unpack and
then hunt through the glibc source code, but if I do it later I'll report
back.
>> The naive implementation behaved as I expected both it and the library
>> version to behave with respect to the asm version - it was faster than
>> the asm version for small lengths but the optimisation kicked in for
>> greater lengths and the asm version pulled away.
>
> This is not comparing apples to apples since we do not have a common
> platform, but here are the results I get:
Out of interest, what is your platform? Also, what are the inputs to the
functions for the tests below? Are they the same as for my tests ie the
two buffers are equal for len-1 bytes and the last byte is different?
> 16 bytes 256 bytes 1024 bytes
> memcmp_word_asm: 123517 1401589 5391195
> memcmp_matt: 128157 1409557 5355056
> memcmp: 266286 2841574 11088633
> memcmp_naive: 282878 3590412 14046262
>
> (All 3 results are for 4KB)
> memcmp_matt: 5512565 5368664 5362735
> memcmp_word_asm: 6370227 6651406 6778280
> memcmp: 11015478 11079905 11151678
> memcmp_naive: 15475814 21041857 23042974
>
> I again did not really test memcmp_matt() to verify correctness, but it
> seems to do the right thing.
For the reasons I have previously explained, I very much doubt that it
does.
> It also consistently beats all the other
> routines except memcmp_word_asm() on small blocks, and on small blocks
> it still ties. I suspect that on your P4, memcmp_matt() will
> significantly outperform memcmp_word_asm() in all cases.
I suspect that it may for the reason that since it is indexing the wrong
memory, it is probably finding a non-equal comparison early and returning
much sooner than all the other functions. Of course it may still be
faster even when this problem is fixed. I won't presume to fix your
function; I'll leave it to you to decide how to fix it, but I'd be
interested to see the results once it is fixed.
> By the way,
> memcmp() on my platform is rep cmps.
>
> int memcmp_matt(const void *a, const void *b, size_t len) {
> size_t ilen = len / sizeof(int);
> const unsigned int *pia = ((const unsigned int *) a) + ilen; const
> unsigned int *pib = ((const unsigned int *) b) + ilen; const unsigned
> char *pca = (const unsigned char *) pia; const unsigned char *pcb =
> (const unsigned char *) pib;
>
> len %= sizeof(int);
>
> for(ilen = 0 - ilen; ilen != 0; ilen++) {
> int ia = pia[ilen];
> int ib = pib[ilen];
>
> if (ia != ib)
> {
> if (bswap(ia) < bswap(ib))
> return -1;
> else
> return 1;
> }
> }
> }
> while(len > 0)
> {
> int cmp = (int) *pca - (int) *pcb;
>
> if (cmp != 0)
> return (cmp < 0 ? -1 : 1);
>
> pca++;
> pcb++;
> len--;
> }
> }
> return 0;
> }
> }
>> I also expected that under all circumstances the library version would
>> be faster than my naive code; which after all uses only
>> standards-compliant C and uses no optimisation tricks which the library
>> implementors are free to avail themselves of. For large lengths this
>> is true but surprisingly for small lengths the naive code is much
>> faster than the library code.
>
> If the the buffers differ within the first few bytes, then the set-up
> overhead for comparing words *could* be more costly than simply
> comparing bytes.
Actually in the paragraph you quoted I was comparing the library version
to my naive version, not to my asm version, but assuming that you are
referring to the former comparison I agree - I'd even go so far as to
replace '*could*' with 'will for most architectures'.
>> Conclusion:
>> The glibc implementation of memcmp on x86 platforms could slightly
>> better optimised for the case where none or only a small number of
>> initial bytes compare equal between the two inputs.
>
> If they use rep cmps then it could be significantly optimized. In my
> results, memcmp_matt() is twice as fast in all cases. My figures are a
> bit surreal, however, since I did not run any test cases where the
> initial bytes of the buffers differed. I think that is probably the
> typical case, and in that case memcmp_naive() is probably the fastest.
> If not the fastest, then it is at least not much slower than
> memcmp_matt() or memcmp_word_asm().
Yes, that seems to be true and as I said it surprised me - surely the
implementors would have realised that the instruction they were using was
actually slower than a simpler approach?
>> Using a word-based comparison for the memcmp function on x86 is
>> warranted when the length of initially equal bytes is expected to be
>> greater than around 9 to 12 bytes. Of course this condition is unknown
>> at function entry so word-based optimisation is better left out of a
>> general library function and rather placed in a separate function which
>> is only called based on the knowledge that generally comparisons will
>> be equal below this number of bytes.
>
> Better yet, the library function (which on my platform is an inline
> function in a header file) could be an inline function which checks and
> does a byte compare at small sizes and calls the other routine at large
> sizes. It is common enough, I think, for the length parameter to be a
> constant, and the compiler could thus omit that check.
What check are you referring to?
In any case I don't think your suggestion is a good one, because the
length does not necessarily relate to how many bytes are initially equal.
eg. The length of the buffers may be 1Gigabyte yet they differ in the
first byte. So calling the word-based routine in such a case would be
unwarranted even though the length is large.
.
- Follow-Ups:
- References:
- Prev by Date: Re: 68x86 HLAOOP Fabio Bizzetti
- Next by Date: Re: 8253 Timer: Intel Celeron Processor/D865GBF Motherboard
- Previous by thread: Re: Implement memcmp function with help from gcc's inline asm
- Next by thread: Re: Implement memcmp function with help from gcc's inline asm
- Index(es):
Relevant Pages
|