Re: The never ending assembly vs. HLL war

jukka@xxxxxxxxxxxx wrote:
> Again, I don't see the problem with measuring the performance on one
> system, note that I am not drawing conclusions how the code shall
> perform on other, different systems. I am extrapolating, and with a
> good reason that the performance shall be relatively similiar or in
> similiar ballpark.

The problem is that you then generalize your results and say "see,
using assembly isn't worthwhile!" You understand the fallacy here, right?

> with this "information" or presenting anything groundbreaking.
> I was mostly interested in dispelling the "assembler myth",

And that's the argument that I'm not buying. The fact that in some
controlled situation you can cajole a C++ compiler into producing code
about as optimal as one can expect does not imply that the compiler
will do this all the time. You are dispelling no myth, I'm afraid.

> /*
> hat being said, all you're
> doing is saying that *some* HLL implementations beat *some* assembly
> optimizations. Do you really think this is something new?
> */
> In general? Nope. When replying to Hutch? Yeah, I did, actually.

Yet earlier in your post you talk about the results on your machine vs.
the results on Hutch's machine. Exactly how are you dispelling any
myths here? Each machine requires an independent optimization. The fact
that an optimization on Hutch's machine isn't as valid on your machine
should prove to be no surprise here. It's one of the main reasons I
quit "counting cycles" when the Pentium first arrived -- there's no
sense in it anymore.

> /*
> for a *different* processor, what do you expect? And, of course,
> CPU-dependent HLL code that does a decent job is going to walk all over
> some sample code that was written by someone who (1) isn't very good,
> (2) doesn't understand the characteristics of the CPU you're running
> with, or (3) both.
> */
> pssst... the assembly output from compiler for the innerloop was
> *identical* to the original assembly code. That sort of means no matter
> what x86 implementation the code being the same I am not surprised the
> timings being nearly same, too, the differences come from the constant
> overhead mostly from the code at the end.

And how does this dispel any myths? No one is claiming that a compiler
*never* produces code that could be as good as a human's code,
particularly for short code sequences. Just that as the programs get
larger, the compilers tend to fall flat on their faces. Again, it's the
issue of "brilliant code sometimes" plus "bonehead code other times"
equalling mediocre code overall. Sure, you can "reverse engineer" an
assembly algorithm in C++ (a perfectly fair thing to do), but how often
do you see this in practice? And is the result any better (more readable,
understandable, maintainable, robust, etc.) than the corresponding
assembly code? How many people, for example, will find the C++ strlen
function you've written to be any more understandable than the assembly
version (from an algorithmic point of view, obviously)?

> /*
> Unfortunately, your plan doesn't scale up very well. The problem with
> C++ is *not* that you can't write efficient code if you're *very*
> careful and consider the code the compiler is emitting (and adjust your
> C++ source code appropriately). The problem is that no one writes
> anything but trivial little (and often non-portable) code this way.
> */
> Mostly applies for instances where I expect to reuse the code and don't
> want worst possible case runtime characteristics to be easily invoked.
> If possible by design, not at all.

Ultimately, the way to write faster programs is to *skip* the C
mentality. IOW, if you want really fast programs that manipulate
strings, you don't get in the habit of using C standard library
routines (regardless of how well they are implemented). This is the HLL
trap, not the lack of compiler code-generation quality. If the compiler
produced the best possible code that could be generated and you turned
around and did things like "strlen" or any of a host of other stdlib
functions to achieve your goals, you'd wind up with slower running code
than would be possible if you were completely aware of what was going
on in the program. C++ (and the STL) take this problem to a new
extreme. It's so easy to use things like sets, lists, maps, or other
containers that people do so without thinking about the costs
associated with them. Even in plain C, you get performance problems
when people do things like:

strcpy( a, b );
strcat( a, c );
strcat( a, d );

The problem, of course, is that you wind up computing the length of the
strings over and over again when it is completely unnecessary (as each
string function call above internally produces a pointer to the end of
the string that could be used by the next function call). This may seem
like a trivial example, and easily worked around, but it's typical of
the kinds of problems that sap performance out of HLL programs; it's
also the kind of thing that you don't see in low-level assembly
programs because the programmer sees what is going on when writing
their code (by "low-level" assembly, I mean that you're not simply
writing the code with a HLL mindset and calling these same sorts of
functions from your assembly code).

> /*
> IOW, C++ really doesn't have much benefit over assembly when you go to
> all this trouble other than it might be able to run on different CPUs
> (but you often lose the optimizations when you do this).
> */
> A list of platforms I am working on omitted here, because you don't
> give a rat's ass.

When you put it that way, I guess I have to agree with you.

> /*
> I have no doubt that you can take a trivially small function like
> strlen and coerce your HLL code to emit stuff about as good as a
> hand-optimized assembly version of the same code. But as I mentioned
> earlier, this trick doesn't scale up to larger programs. From a
> */
> Nor it should.
> /*
> And the tricks you pull to get the good code emission often won't carry
> over to other
> architectures (or even different CPUs in the same family).
> */
> Which tricks might those be? I used addition, subtraction, bitwise or,
> bitwise and and other pretty fundamental operations and very small
> number of local variables, the compiler must work really hard to come
> up with more than 8 registers needed to hold all that up. Unless we go
> to compile out code for Commodore64, or maybe something z80 based we
> won't run out of registers for such trivial function very easily,
> atleast.
> I mention 8 above because it takes some effort to think what 32 bit
> architechture, for example, would have less registers and I might be
> actually someday compiling for it.
> /*
> What happens when you compile this on a 16-bit CPU?
> */
> God forbid, or 8-bit CPU! There are practical limitations on what I
> assume the system I will use the code on, will have. I don't assume
> this will work very well on PDP's either!

Precisely my point. The optimizations are not portable. For this
particular example, you're limited to 32-bit processors.

> /*
> On a 64-bit CPU? Maybe the code works fine on various 32-bit CPUs, but
> the
> optimization hardly portable across different CPUs.
> */
> It does, on some. MIPS, PPC and x86-86 spring to mind. If there is
> 64-bit CPU where sizeof(int) == sizeof(long long) == 8, where char is 8
> bits (sizeof(char) == 1, always) then it won't work and the
> configuration will be unknown or not supported, and the headers will
> fail compilation at #error ).

IOW, the "trick" isn't portable and you're suffering from some of the
same problems as assembly language.

> If it is just one or two functions that fail, such as this case, have
> to put #ifdef / #endif kludge there to always use the "char*" version,
> if that doesn't work either then no support. So far the codebase has
> been very useful, though.

Sure, and we can write multiple assembly routines for different
processors, too. Granted, more than you'd need with C (or any other
HLL), but the idea is just the same.

> I would surmise the code is order of magnitude more useful than x86
> specific assembly snip.

By what reasoning?
Given that about 90% of the world's computers today are x86 CPUs, I
don't see how having the code in portable C++ is going to make it an
order of magnitude more useful. Certainly we can find *some* people
for whom portability to other processors is necessary, but an order of
magnitude (IOW, they need to run the code on ten different processors)?
I don't question your claim; from a mathematical perspective I'm sure
we could find a group of people amongst whom the need to have a
portable strlen function that compiles on 10 different 32-bit (non-x86)
processors is important, but...

When you look at the number of people (end users) who will actually
benefit from the code, however, it becomes real clear that the choice
of HLL or x86 assembly is *mostly* irrelevant because most end-users
are running x86 boxes.

> > So you're proof is *one* example versus another?
> It's not proof, it's an example. Here is the post I replied to, maybe
> you should afterall read what is being discussed?

You are "debunking a myth". You don't debunk a myth (that is, prove
your point) with one example. You might be able to prove that assembly
isn't *always* better with one example, but you cannot make a claim
that there is no need to use assembly language on the basis of one
example.
> "Don't be overawed by compilers, assembler coding is not restricted to
> the architecture of a C compiler. The following code is a modification
> of Agner Fog's DWORD string length routine that aligns the start and
> tests the length 4 bytes at a time. It has no stack frame and conforms
> to the normal register preservation rules under windows so it preserves
> ESI and EDI but trashes the rest. "
> It is clearly stating that such optimization is exclusive to assembly,
> which apparently isn't the case. Now you know the *context* of the
> discussion, atleast.
> >Hmmm... Hardly convincing.
> Convincing enough to debunk the implication that such optimization is
> only achievable though the holy assembly.

And you're assuming that Hutch got it right? That his example is the
absolute pinnacle of what can be done in assembly?

> >I can provide you with a whole slew of strlen programs in assembly that
> >run much slower than:
> >
> >t = s;
> >while( *s ) ++s;
> >return s-t;
> Just one is plenty, please do by all means.

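mov edi, src    ; edi -> start of string
mov ecx, -1     ; no practical limit on the scan count
mov al, 0       ; byte to search for
cld             ; scan toward higher addresses
repne scasb     ; scan until the zero byte is found
not ecx
dec ecx         ; ecx = length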

That will probably run slower than the output of a good compiler for
the above C code. Yes, scasb is *that* bad.

> /*
> Indeed, if compilers are *so* great, why can't they convert this code
> into something as wonderful as what you've presented? After all, doing
> so is really just an induction step (albeit, a complex one).
> */
> Now it is all of a sudden "wonderful", do I sense sarcasm?

Sarcasm or not, it's quite clear that a compiler working on your code
produces a faster result than one might expect from a compiler working
on the simple C code above.
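
For anyone who hasn't seen the trick being argued about, here's a
minimal sketch of a four-bytes-at-a-time strlen. This is *not* the
xstrlen from this thread, nor Hutch's routine; wordwise_strlen is a
made-up name. The sketch assumes 8-bit chars, and it assumes that
reading the whole aligned 32-bit word containing the terminator is
harmless. That holds on typical flat-memory machines (an aligned word
never straddles a page), but the language standard doesn't promise it:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch of a word-at-a-time strlen.
std::size_t wordwise_strlen(const char* s)
{
    const char* p = s;

    // Byte at a time until p sits on a 4-byte boundary.
    while ((reinterpret_cast<std::uintptr_t>(p) & 3u) != 0) {
        if (*p == '\0')
            return static_cast<std::size_t>(p - s);
        ++p;
    }

    // Four bytes per iteration.
    for (;;) {
        std::uint32_t w;
        std::memcpy(&w, p, sizeof w);   // aligned 4-byte load
        // Nonzero exactly when at least one byte of w is zero.
        if (((w - 0x01010101u) & ~w & 0x80808080u) != 0)
            break;
        p += 4;
    }

    // The terminator is somewhere in this word; find it.
    while (*p != '\0')
        ++p;
    return static_cast<std::size_t>(p - s);
}
```

Note how much of this is setup and cleanup around the inner loop. For a
five-character string the fast path may never execute at all.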

> What I mean, is, that when you write assembly you generally use the
> fully qualified register names. You maintain register names manually.

Yes, one advantage of using assembly is that you have complete access
to the low-level machine facilities, including the registers.

> Labourous and error prone process.

Changing the subject?
I'm certainly not questioning the fact that, in general, writing
assembly language is more "labourous and error prone" than writing in a
HLL. OTOH, I'll also point out that if you write your HLL code the way
you've written that xstrlen function, you will find writing the code
fairly "labourous and error prone". Optimization is a painful process,
regardless of the language.

> Enter ANSI C. You can write your
> intention with named variables, which are then at compilation
> translated and assigned to real registers (add spilling for flavour in
> this so.-called register allocation stage). And on and on.

And sometimes the compiler is brilliant when doing this, and sometimes
it is real bone-headed. Your point?

> Because most microarchitechtures are different, the pragmatic approach
> taken is to find a common subset of operations the language supports. A
> no-brainer, as you well know, you just wanted to nit-pick, well good
> job! Congrats!

Again, we're back to the argument of "this makes life so much easier
for the programmer" rather than "the compiler does as good a job of
this as the programmer could do himself."

> > which makes it a jack of all trades but master of none.
> As you quoting this, I hope you also read this! Guess what!? The above
> quote means same thing just without all the flair and nitpicking going
> about!

Sorry, you've lost me along the road somewhere. Perhaps you could be
more articulate. It really seems to me that all you've done here is
switch from "compilers produce code as good as humans do" to "it isn't
cost-effective for humans to write code this way, so we live with what
the compilers produce." A very different argument. But for some reason,
that's where this argument always winds up. I guess that means we've
reached the end of the debate.

> > You make this claim with just one example?
> Well, mostly the claim was based on 10+ years of professional
> experience (and nearly 20 years of programming, total) and the opinion
> that comes with that. I'm sure you also have a lot of experience, so
> you know what I am talking about.

Well, maybe that's the difference between us. You see, I've got about
25 years' experience as a professional programmer and I've worked both
in the times when assembly was mandatory (to get any kind of
performance at all) and I've been around during the past 15 years when
compilers became efficient enough to be usable for the larger
percentage of projects. Most of my real (professional) work is done in
languages like C, C++, and Delphi. So it's not like I'm unaware of what
these languages are good for. OTOH, I don't go around claiming that
there is no reason to use assembly because compilers today are as good
as humans. I may very well say that it makes *economic* sense to use
HLLs, but it's not the case that compilers are as good as human beings.

> >Gee, I'd argue that it's going to be real hard for an assembly language
> >programmer to beat the code that a C compiler produces for the
> >following:
> >
> >i = 0;
> Okay. Gee, that can be completely eliminated if the result isn't ever
> used in the current scope, as I don't even see function call so any
> possible side effect can easily be determined to be non-existent in
> this case. It's random if assembler coder will "see" this or not, it is
> more deterministic if a compiler will see this or not. But if don't
> know the compiler in advance, then, it isn't.


> I dunno what to make of that. Was that a kind of ridiculous example to
> show my actions in a "different light", so to speak? If so, ummmm...
> right.

The point I'm making is that xstrlen is a ridiculous example to use to
debunk the myth that assembly isn't useful. Hutch may have overspoken,
but your attempts to show that assembly isn't worthwhile for this task
aren't quite making the point. xstrlen is actually one of the easier
things to code efficiently in C. Just like "i=0;" is pretty easy to
code efficiently in C. If you *really* want to debunk the myth that
assembly has no advantage over HLLs, you need to move beyond strlen.

As an aside:

A few years back (okay, maybe decades at this point) some research at
Berkeley, or thereabouts, demonstrated that most strings processed in
HLL programs (written by students, granted) were 10 characters or less
in length. If that still holds today, it's almost a no-brainer that the
trivial strlen function (byte at a time) will outperform the craziness
embodied in the examples in this thread, because of the intrinsic
overhead. Sure, you can feed your code thousands of long strings and
demonstrate how much better one algorithm is than another, but the
bottom line is that in the real world, the data sets in use may
completely invalidate the test set you're using. IOW, how well does
your test data model the real world data that an average program will
see? This is one reason why I argue that xstrlen is a ridiculous

So when Hutch talks about saving all the function setup and tear-down
code, this is not an insignificant matter. It reduces the overhead of
the function call, thus vastly improving performance for small strings
(which this older research suggests is a common situation). Now the
truth is, some compilers can generate code that doesn't require setting
up and tearing down stack frames too, so Steve's proclamations aren't
all *that* impressive, but for the common case, reducing function call
overhead can produce dramatic results (assuming, again, that short
strings are common).

> >That doesn't prove C compilers are as good as assembly programmers by
> >any stretch of the imagination. You're example is a bit more complex,
> >but nowhere near sufficient to "prove" the point.
> C compilers aren't better than assembly programmers, they are just more
> time and money -efficient. When there isn't choise, there isn't choise,
> ask Tom Duff.

Again, that's a different argument. Few people question the economic
aspects. Then again, if people wrote C code the way xstrlen has been
written, the economic advantage of C over assembly would be greatly
diminished. Again, *optimization* is an expensive process, regardless
of the language used. Assembly generally has a bad reputation in terms
of programmer efficiency because people who write assembly code tend to
write more (locally) optimized code than those working in HLLs. Ergo,
it's more expensive. If you write assembly code without regard to
minimizing resource use, then it's far less costly to use assembly.

> But that wasn't my point, even though you seem to have that
> disillusion... I was showing how that particular assembly code snip
> doesn't "beat" HLL code, not the other way around.. a subtle
> difference...maybe too subtle if haven't even read the thread...

Oh, I've read it. That's not what you said earlier. But I'll allow you
to back out of that gracefully. This is, after all, USENET and we have
to allow for considerable "unstateds" and "misreads".

> >for( myclass::iterator si = s.begin(); si != s.end(); si++ ) {...}
> >
> >And they have no idea what the compiler is doing with their code. Take,
> >for example, that innocuous "si++" at the end of the for argument list.
> ++i vs. i++, gee-whizz, now we're getting to the ABC's and 101's of C++
> programming.. and you blame me for going too basic? ;-)

Amazing, isn't it? Something so *basic* trips up 99% of the programmers
out there. Exactly the point I'm making. You won't see mistakes like this
made in a typical assembly program.

> Yeah yeah, i++ creates temporary object because it has to return the
> *current* value, before returning from ++ operator (postfix) we have to
> increase the current value, we cannot return it.. so we return
> temporary object created before the increment.
> /*
> someone who knows exactly what's going on behind the scenes probably
> wouldn't write code this way, but how often do you see people writing
> standard C++ programs the way you wrote your xstrlen?
> */
> I wouldn't know, I suppose been in a professional community for far too
> long. My attitude isn't professional as I am a bit childish, you might
> have noticed.. but that's my problem, thank you for not making funny
> remark about that in advance.

And when we look in your code, we'll never see an example of this, right?
That's the only point I'm making: HLL abstractions, the things that
make it easier and faster for programmers to write code, also hide the
things that can cost them dearly. Even when they've got the experience
to know better.
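
Here's a small sketch of the hidden cost in question. CountingIter is a
made-up iterator type whose copy constructor does nothing but count how
often it runs, and sum_prefix/sum_postfix are hypothetical names. Real
iterators are often simple enough that the optimizer throws the
temporary away, so treat this as an illustration of what the source
code asks for, not as a measurement:

```cpp
// Hypothetical iterator over an int array that counts its own copies.
struct CountingIter {
    static int copies;                  // copy constructions so far
    const int* p;

    explicit CountingIter(const int* q) : p(q) {}
    CountingIter(const CountingIter& o) : p(o.p) { ++copies; }

    int operator*() const { return *p; }
    bool operator!=(const CountingIter& o) const { return p != o.p; }

    // Prefix: advance in place, no temporary needed.
    CountingIter& operator++() { ++p; return *this; }

    // Postfix: must hand back the *old* value, so it copies first.
    CountingIter operator++(int)
    {
        CountingIter old(*this);
        ++p;
        return old;
    }
};
int CountingIter::copies = 0;

int sum_prefix(const int* first, const int* last)
{
    int total = 0;
    for (CountingIter it(first), end(last); it != end; ++it)
        total += *it;
    return total;
}

int sum_postfix(const int* first, const int* last)
{
    int total = 0;
    for (CountingIter it(first), end(last); it != end; it++)
        total += *it;
    return total;
}
```

With the prefix loop the copy count stays at zero. With the postfix
loop there is at least one copy per iteration (one or two, depending on
whether the compiler elides the copy on return), and nothing in the for
statement hints at it.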

> > Do you honestly write *all* your C++ code that way?
> I need clarification on this, what you mean "that way?",

As in the way you've written xstrlen.

> what
> specificly strikes odd in "that way"-- I don't get it, yes, I do write
> code "that way" a lot of times, it comes from the backbone. Is it that
> bad, if so, show me the error in my ways and I'll learn.
> What took me so much effort was that first I reverse engineered the asm
> snip, but I wasn't happy with it as I would *never* actually, go out,
> and write code that was off the bat. I got some idiosynchronies, I
> admit, which I follow as I found them a sound practise, and I keep
> myself trim and up-to-date what works, and what doesn't.

The point I'm making here is that writing code like xstrlen is a good
example of something that gets you into trouble down the road. Written
in assembly, we *expect* to rewrite it for later processors. Written in
C? No, we expect to be able to recompile it and have it work fine, no
matter what comes along. And we curse the guy who wrote C code like
that. Other than "why did this idiot use assembly?", few people would
question the use of that crazy strlen algorithm in assembly; indeed,
they would expect it to be written that way.

> > Or do you just write code that way
> >when you're trying to prove that C++ compilers can emit code that's as
> >good as assembly programmers? And when you *do* write code that way, is
> >it any faster or easier than using assembly?
> Well, shit,, go to Fusion page, download the "latest
> version", decompress the sourcecode. The sourcecode is 750 kB
> compressed (with some minor data inside), feel free to go through every
> line if you have to.

Well, I went through enough lines to know that you don't write your
code the way you wrote xstrlen. Which is *good* from a
readable/maintainable/robustness point of view, but it also means that
someone who writes assembly code to do the same job is generally going
to get much more efficient results. Whether this is important or not is
a different question, of course.

> And no, I don't write code "that way" to be faster than assembly.

Of course not. Most people writing C++ code don't write their code to
be faster than assembly. Indeed, "fast" is rarely a factor, other than
fast development or easy development.

> That's "the way" I write code.

And it's not a bad style (though I'd suggest more comments :-) ).
But people who write code that way (and I'm no different) are not
writing their C++ code in a manner that compilers can efficiently
translate into machine code. And if someone were doing the same
operations in assembly, even if they weren't the *greatest* assembly
programmer around, they'd probably produce better output than the
compiler. It all has to do with thinking in assembly language rather
than thinking in a HLL. That's the crucial difference. Your xstrlen
function is a good example of thinking in assembly (even when writing
in C). I've seen lots of assembly code where the author was thinking in
a HLL rather than assembly (and the result isn't very good). But when
someone thinks in assembly, the result is often quite good. This is why
assembly programs are generally better than HLL programs. Assembly
programmers often think in assembly whereas HLL programmers think in
their HLL.

> I don't know if you trying to insult, be
> polite or just being sceptic.

Label me a sceptic. I've heard it *many* times before. And the argument
always boils down to (as this one has) that it's more economical to
develop in a HLL, which is what makes the HLL better. No argument
there. But the economics don't imply that compilers can do a better
job than, or even as good a job as, the assembly programmer. Sure, in a
few specialized cases, they can. But the results don't scale up to large
systems (for, quite frankly, the same reasons using assembly language
doesn't scale up).

As for insulting, please check your own post. There are a few too many
profanities and inferences on your part for you to be able to play this
card here.

> Whatever, dude, if you don't have time or
> will to verify what I write here, good, I wouldn't care what you think
> about me.

That makes us even. I don't care what I think about you either :-)

> But that's some work I been doing. Want my resume? I don't
> have one. I always have job offers on my inbox and it been that way
> since 1996 or so.

Good for you. You escaped the problems of our industry over the past
four years. But discussions of your experience and how long you've been
employed are not particularly good supporting arguments for your
hypothesis that there is no need to use assembly language because
compilers generate code as good as a hand coder.

> >Bottom line is that most C++ programmers would just write:
> >
> >t = s;
> >while( *s ) ++s;
> >return s-t;
> >
> >(or something similar) and move on.
> Guess what? That's precisely what I wrote, too, and moved on.

And that's exactly the point I'm making. Most HLL programmers (myself
included) will often write code like this and just move on. Assembly
language programmers (myself included, when working in assembly)
generally *wouldn't* do this. Oh, they might do it on the first pass,
but then they'd see how ugly the result is and decide to do something
about it. Sometimes, particularly with inexperienced programmers, the
ugliness might not be discovered until someone points out that a HLL
call is faster than their assembly gem (witness this thread), sometimes
they can just tell that the solution isn't very good. But the bottom
line is that an assembly language programmer is more prone to do
something about the ugly code rather than thinking "well, that's the
best I can do" and move on. How many C/C++ programmers, for example, do
you think could come up with your xstrlen function on their own?

> /*
> clue what's going on. Before you get in a tiff, I *do* realize that
> *you* probably do know what's going on. But you don't write all the
> world's HLL code.
> */
> Most of the world's HLL code doens't need to be "fast", most of the
> times I would be glad if it "worked", which it generally does, if not
> before a patch or two atleast after.

And for code that doesn't need to be fast (or small, or otherwise
resource limited) there is no need to use assembly. We can agree on
that.
> "fast" is not a goal, "fast enough" is.

Of course. And just as in every other "assembly vs. HLL" thread that
has ever existed, we wind up with "okay, so what if HLL code isn't as
fast as assembly; CPUs are so fast we don't need it to be." The fact
that we may not need all programs to be efficient does not tell us that
compilers are doing as good a job as assembly programmers. It simply
tells us that the CPU manufacturers have been doing a decent job and we
can get away with a lot of sloppiness on the part of the compilers
without it affecting our ability to deliver code that meets performance
requirements.
> If code is "fast enough",
> that's it, job done.

Unfortunately, code often gets used (and reused) in ways the original
programmer (or specification) doesn't expect. How fast is "fast enough"
for the xstrlen function, for example? No doubt, it's great for the
application you're writing today. But how about tomorrow? Some
routines, like generic library routines, should be *as fast as
possible* because there is no way to predict how they will be used. If
they're overkill for a beginning student's "number guessing game" then
that's no big deal, but if they're too slow for a database application,
uh-oh. How many programmers have the time to go in and rewrite the
stdlib when their application runs a little too slowly?

> I seen some guy optimizing keyboard interrupt
> handles in assembler for MS-DOS, maybe he thought someone would press
> keys really, really fast and that would slow his program down, go
> figure. Or maybe he was scared that he would miss a few keystrokes,
> again, go figure. Such strange characters are not my specialty (you may
> say myself excluded... ?=)

Again, one example of idiocy does not imply that all attempts to write
fast code are worthless. And you never know -- It could turn out that
this person you're talking about has a real-time foreground application
that couldn't allow more than a few (hundred) microseconds' interruption.
In which case having a fast keyboard interrupt handler is a *very* good
idea.
Even if that person didn't need the performance for his/her current
app, perhaps the next user of that ISR would. I've got a *little* bit
of experience working in real-time systems, and I can assure you that
in most real-time OSes, minimizing time spent in an ISR is a *very*
critical thing (not that MS-DOS qualifies in this respect, but you get
the idea).

> >Yes, you do not have access to the low-level machine. As I said, C is
> >not an assembly language. Believe me, you don't have access to a *lot*
> >of things that might be useful on occasion.
> I don't have to believe YOU, I believe my own EXPERIENCE.

And, in your experience you call C an assembly language. That speaks
volumes about your experience, I'm afraid.

> >The #2 thread (after strlen) is memcpy. My alternative is to simply use
> >the movsb instruction.
> Are you trying to insult my intelligence?

No, I'm simply pointing out that this thread is second only to the
memcpy threads that pop up. What this would have to do with your
intelligence is beyond me.

> Look at the string.hpp, you
> might see std::memcpy() being invoked here, and there. I don't even
> *consider* the alternative!

Good for you.

> If you see meta::vcopy(), it is a different beast, it does check if
> type is pod (uses traits) and does memcpy, or object-by-object copy so
> that corresponding copy constructors and what not are invoked correctly
> in the process. Mostly I use that construct with templates.

I think you completely missed the point of my comment. Allow me to
explain it better and forgive me if I sound patronizing:

(In order of occurrence):

FAQ #1: What's the fastest block copy code we can write?
Answer #1: Take a look at the AMD optimization guide and quit posting
routine after routine here. Any attempt to do better than that is going
to fail on different architectures.

FAQ #2: Here's my strlen function, how can I make it faster?
Answer #2: Check out the AMD optimization guide (or Agner Fog's page)
and use that code. Again, unless you're writing the code for a specific
CPU, you aren't going to do substantially better.

FAQ #3: Aren't compilers as good at generating machine code as human
programmers?
Answer #3: No they are not. It's *easier* and more *economical* to
write code in a HLL, but the results are often much bigger and slower
than an equivalent program written in assembly. Most of the time,
slower and bigger is no problem, so go ahead and use your HLL. But
don't go around thinking that the code produced by your compiler is as
good as the stuff a decent assembly language programmer will write.

> >> Also, I could unroll the C++ innerloop but I won't do it because I
> >> think it is not a particularly good idea in this case.
> >
> >That depends entirely on the CPU and memory architecture.
> Context: competing with the specificly mentioned assembly code, which
> was unrolled. If you take that into consideration you have not-so-much
> to nitpick about.
> Since it is x86 assembly, I'm assuming some contemporary x86
> implementation will be running the code.

Okay, that's good for today. What about next week's CPU?
This is the thing that killed me 10-15 years ago. I was carefully
hand-optimizing code for the 486 and then the Pentium came out and
changed all the rules. Then the PII, then the PIII, then the PIV. Up to
the 486, whatever rules you applied on one CPU tended to work well on
the next generation. This stopped with the 486.

And "contemporary" doesn't even cut it. The optimization rules for the
PIV are quite a bit different from those for the AMD chips (and, the
PIII, upon which the PM is built). Better just to ignore all the
CPU-specific stuff and go with the general principles that work across
all CPUs. The differences you are talking about (e.g., loop unrolling)
are good examples of things that fall into the CPU-specific categories.

> I don't think unrolling the
> C++ code will do much good. Maybe on 386 or older processor it might
> pay off, hell, most likely it would. But I'm not too much interested in
> 386 these days...

I'm not suggesting that you unroll your code. I'm simply stating that
your argument that unrolling code is bad because of your experiences
with your particular CPU is not wise. On other contemporary CPUs, or on
future CPUs, the rules may be different. And the rules could also
change based on the memory alignment of the code (I've seen some
pretty big differences in performance based on the position of the code
in a program). And let us not forget caching effects. When you run
1,000,000 strings through your xstrlen function, you're hammering on
the same code over and over again and even the data access (usually
sequential) is pretty good as far as a cache is concerned. What happens
when you call xstrlen from within a real program when the code isn't
cached up and the data isn't in cache? You'll probably get quite
different results based on whether the code is unrolled or not. I don't
know which would be better (for a given CPU, of course), but I do know
that claims of "this isn't better" or "this is better" tend to melt
away when the environment changes on you. Bottom line is that code
working great on today's CPU is no guarantee that the same
will be true on tomorrow's CPUs. A lot of assembly language programmers
discovered this fallacy when going from the PIII to the PIV.

Randy Hyde