Re: Implement memcmp function with help from gcc's inline asm
- From: "Matt" <spamtrap@xxxxxxxxxx>
- Date: Mon, 27 Jun 2005 07:35:36 +0000 (UTC)
Netocrat wrote:
[...]
>> Now when you dereference the pointers, just write bswap(*ua) >
>> bswap(*ub). Actually, the better thing to do is this: long cmp =
>> bswap(*ua) - bswap(*ub);
>> if (cmp < 0) return -1;
>> else if (cmp > 0) return 1;
>> // else continue
>
> Due to the above noted requirement for unsigned quantities, I can't
> use this method. However, I would expect an optimising compiler to
> generate the same code for what you presented as for:
>
> if (bswap(*ua) > bswap(*ub))
> return 1;
> else if (bswap(*ua) < bswap(*ub))
> return -1;
> else
> return 0;
Maybe, but the compiler may generate a load for each *ua and *ub even if it
is optimizing. There are no guarantees, and pointer analysis is one of the
hardest problems for C/C++ compilers.
[...]
>> With GCC 3.4.2 at -O3 -march=athlon-xp -fomit-frame-pointer,
>> memequ() is a good 10% faster than memcmp(). There is room for
>> optimization, too, as both functions can be unrolled.
>
> Unrolling functions is what?
Unrolling loops, not functions. Unrolling a loop is reducing the number of
iterations by duplicating the loop body. There is a tradeoff between size
and speed. Eventually you reach the point of diminishing returns, and the
increase in code size causes more harm than help. You would have to find
that point by testing. I would guess that duplicating the loop body 4 times
and reducing the number of iterations by a factor of 4 would be a good
general point to start.
[...]
> Again, memcmp needs to compare memory as unsigned. So the first
> problem with your functions is that all memory comparison variables
> are signed. Secondly you're relying on llen being signed for your
> negative indexing trick; unfortunately size_t is unsigned. Who knows
> what memory is actually being referenced. Anyhow there may or may
> not be other problems but I didn't check further - modifying and then
> plugging your excellent bswap function into the memcmp_word function
> that I first presented worked as expected; compiling with -O3 is
> required for full benefit.
size_t is not signed. My indexing trick works regardless of sign so long as
the index is the same size as a pointer, and with size_t it is.
> Other general comments re efficiency in your code. You are using
> indexing on l1 and l2 whereas I was using pointer increments and a
> test for a max pointer. My method requires an extra variable but I
> would have thought that indexing was slower than incrementing
> pointers. Perhaps compiler optimisations make them equivalent
> (untested comments but observations lead me to believe that for -O3
> this is true).
Indexing is generally faster than pointer arithmetic. In x86, all memory
references have the general form [base+index*scale+disp]. Most x86
processors have special hardware to compute this in parallel, so it is
generally recommended that you use [base+index*scale] instead of [base] and
incrementing it on each iteration through the loop. It may differ on the P4
since, as I understand, it lacks that hardware.
If I were hand-optimizing the assembly code, I would treat the index
variable as a byte offset. Instead of dividing by 4 and multiplying by 4 on
each memory reference, I would simply mask off the lower bits and then
increment/decrement by 4 on each pass through the loop. Making that
optimization, both snippets of code should be identical on P4. However, I
don't think GCC can do that.
> Finally comments on the prototype: yours is:
> long memcmp(char *p1, char *p2, size_t len);
> The actual library version is:
> int memcmp(const void *s1, const void *s2, size_t n);
>
> Note the int return and const void * arguments.
Yes, I see. I switched everything to long because I figure that "long" will
typically be the largest atomic type on a platform. If you're only using
x86, then they are the same. On x86-64, you would want to work with 64-bit
quantities since it would improve performance by a factor of 2 when
comparing cached data.
[...]
>> The mov at the end of the loop is spurrious, but GCC tends to do
>> stupid things like that. Anyway, it should still be quite fast. The
>> only way to achieve better performance would be to unroll the loop
>> or make more
>
> Unrolling loops is what?
>
>> assumptions as in memequ(). As I said earlier, memequ() as posted
>> above is about 10% faster on my Athlon, and it can be made faster
>> still through the use of MMX instructions.
>
> Yes it looks very compact bar that extra mov. How I would go about
> viewing this generated assembly on my system?
gcc -S <optimization options> input.c -o input.s
Be sure to use -march in your optimization options as it can significantly
affect the optimizer. Prior to using -march=athlon-mp on my machine,
memequ() was beating memcmp() by 25%.
[...]
> I would have expected that for small lengths the library function
> would have been significantly faster than the asm-enhanced version
> since the overhead of converting from bytes to words and reversing
> word order would wipe out any performance gained by comparing
> words-at-a-time vs bytes-at-a-time. Instead I found that in _all_
> cases the asm-enhanced version was faster than the library version.
Generally bswap is fast: 1 cycle on Athlon and on Pentium-4E CPUs, 2 cycles
(?) on Pentium-3. Older P4s like yours require 7 cycles. You should check
the implementation of memcmp() and see what it's doing. I would not be
surprised if somebody made it an intrinsic rep cmpsd.
> The naive implementation behaved as I expected both it and the library
> version to behave with respect to the asm version - it was faster
> than the asm version for small lengths but the optimisation kicked in
> for
> greater lengths and the asm version pulled away.
This is not comparing apples to apples since we do not have a common
platform, but here are the results I get:
16 bytes 256 bytes 1024 bytes
memcmp_word_asm: 123517 1401589 5391195
memcmp_matt: 128157 1409557 5355056
memcmp: 266286 2841574 11088633
memcmp_naive: 282878 3590412 14046262
(All 3 results are for 4KB)
memcmp_matt: 5512565 5368664 5362735
memcmp_word_asm: 6370227 6651406 6778280
memcmp: 11015478 11079905 11151678
memcmp_naive: 15475814 21041857 23042974
I again did not really test memcmp_matt() to verify correctness, but it
seems to do the right thing. It also consistently beats all the other
routines except memcmp_word_asm() on small blocks, and on small blocks it
still ties. I suspect that on your P4, memcmp_matt() will significantly
outperform memcmp_word_asm() in all cases. By the way, memcmp() on my
platform is rep cmps.
int memcmp_matt(const void *a, const void *b, size_t len)
{
size_t ilen = len / sizeof(int);
const unsigned int *pia = ((const unsigned int *) a) + ilen;
const unsigned int *pib = ((const unsigned int *) b) + ilen;
const unsigned char *pca = (const unsigned char *) pia;
const unsigned char *pcb = (const unsigned char *) pib;
len %= sizeof(int);
for(ilen = 0 - ilen; ilen != 0; ilen++)
{
int ia = pia[ilen];
int ib = pib[ilen];
if (ia != ib)
{
if (bswap(ia) < bswap(ib))
return -1;
else
return 1;
}
}
while(len > 0)
{
int cmp = (int) *pca - (int) *pcb;
if (cmp != 0)
return (cmp < 0 ? -1 : 1);
pca++;
pcb++;
len--;
}
return 0;
}
> I also expected that under all circumstances the library version would
> be faster than my naive code; which after all uses only
> standards-compliant C and uses no optimisation tricks which the
> library implementors are free to avail themselves of. For large
> lengths this is true but surprisingly for small lengths the naive
> code is much faster than the library code.
If the the buffers differ within the first few bytes, then the set-up
overhead for comparing words *could* be more costly than simply comparing
bytes.
> Conclusion:
> The glibc implementation of memcmp on x86 platforms could slightly
> better optimised for the case where none or only a small number of
> initial bytes compare equal between the two inputs.
If they use rep cmps then it could be significantly optimized. In my
results, memcmp_matt() is twice as fast in all cases. My figures are a bit
surreal, however, since I did not run any test cases where the initial bytes
of the buffers differed. I think that is probably the typical case, and in
that case memcmp_naive() is probably the fastest. If not the fastest, then
it is at least not much slower than memcmp_matt() or memcmp_word_asm().
> Using a word-based comparison for the memcmp function on x86 is
> warranted when the length of initially equal bytes is expected to be
> greater than around 9 to 12 bytes. Of course this condition is
> unknown at function entry so word-based optimisation is better left
> out of a general library function and rather placed in a separate
> function which is only called based on the knowledge that generally
> comparisons will be equal below this number of bytes.
Better yet, the library function (which on my platform is an inline function
in a header file) could be an inline function which checks and does a byte
compare at small sizes and calls the other routine at large sizes. It is
common enough, I think, for the length parameter to be a constant, and the
compiler could thus omit that check.
> All of this is predicated on my functions being correct, which anyone
> is free to attempt to prove is wrong, and on my machine representing
> general x86 behaviour, which is probably a long shot.
That is definitely a long shot. P4 is a huge aberration in the otherwise
uniform world of x86 processors. However, you could make the argument that,
because P4 is what most people have, it is the general case.
-Matt
.
- Follow-Ups:
- References:
- Prev by Date: Get the FAQs
- Next by Date: Re: 8253 Timer: Intel Celeron Processor/D865GBF Motherboard
- Previous by thread: Re: Implement memcmp function with help from gcc's inline asm
- Next by thread: Re: Implement memcmp function with help from gcc's inline asm
- Index(es):
Relevant Pages
|
|