The never ending assembly vs. HLL war
- From: "randyhyde@xxxxxxxxxxxxx" <spamtrap@xxxxxxxxxx>
- Date: Thu, 27 Oct 2005 01:43:24 +0000 (UTC)
jukka@xxxxxxxxxxxx wrote:
> /*
> Unrolling code won't *kill* branch prediction; it might not take
> advantage of branch prediction, but it doesn't kill it (i.e., make it
> less effective when it is used).
> */
>
> Unrolling 8 times while having essentially verbatim the same innerloop,
> branch instruction excluded is not particularly effective optimization.
> The biggest optimization is that the code is bigger.
This is true, but it still doesn't *kill* branch prediction.
>
> /*
> Of course, the main problem I'm seeing here is that people are running
> their programs and measuring the results on different CPUs. And all
> */
>
> It's not really a problem, I'm running MIPS, PPC, x86-64 and IA32 and
> also working with ARM (various versions) and generally, higher-level
> optimizations are "portable", there is ofcourse small fluctuations
> based on architechtural differences but generally like I just wrote
> higher level code, which does less "work", is usually faster on any
> platform.
>
> Biggest differences come from floating point to integer and vice-versa
> interaction and dependent issues, also lack of floating point processor
> mixes things about but that isn't the case here. The code is fairly
> trivial integer-only very RISC-like ALU only is required. Number of
> registers needed is small here, too.
All the more reason that counting cycles isn't a good approach for a
strlen optimization.
>
> /*
> bets are off when you do that. This is why it's almost a worthless
> exercise to "count cycles" these days. If you optimize the code on one
> CPU, chances are pretty good it *won't* be optimal on a different CPU.
> */
>
> First, I am not counting cycles.
Yes, you are.
> I am measuring average performance
> over large number of runs on data that isn't same on each successive
> run. Slowest and fastest runs are always dropped so that random
> fluctiation is eliminated. Additionally, each test string is tested
> with different start offsets so that different alignment cases are
> tested and different "leftover" cases are tested.
And this is averaged across all possible x86 CPUs (and others, too),
right? If not, you're counting cycles on *one* particular CPU. And
that's a waste of time if you're trying to produce generic results.
>
> The same code works as regression test, too. :)
???
>
> /*
> So unless you're writing the code to run on a specific rev of a
> specific CPU, shaving one or two clock cycles off the execution time of
> an algorithm is a real waste of time.
> */
>
> In production code I rarely do this, I write pretty good code already
> off the bat anyway. But this is for fun, dude, you think Usenet is
> where my work is at? ;-)
Yes, rewriting strlen algorithms seems to be one of the funnest things
ever. Seems like someone new is doing it every other week.
>
> /*
> over and over again, the more reasonable question to ask is "why aren't
> all these people using a better data structure that makes string
> operations more efficient?" (e.g., including the length as part of the
> */
>
> Apperently you haven't read the whole thread.
Why bother?
Do you really think you're adding anything new to the discussion that
hasn't been said a thousand times already? Do you really think this is
new territory around here (or anywhere else)? The better question for
you to ask is "why bother responding to a thread that is probably the
#1 FAQ in a newsgroup such as this one?"
> I think it was me, who
> suggested that storing length of the string into the string object is
> way to go, multiple times infact. Here's the string class, actually:
>
> http://www.liimatta.org/misc/string.hpp
Do you think you're the first to make this suggestion? Do you think my
response is directly specifically at you? Do you think....
No matter what you said before, it's always important to inject reality
checks into threads such as this one. And this thread is definitely in
need of a reality check. And this isn't a direct response to you, it's
a comment to anyone reading this thread who thinks that there might be
something to what's being said here.
The truth is, *memory architecture* is the biggest impediment to memory
scanning algorithms like strlen. Even if *every* (x86) CPU had exactly
the same timing characteristics, measurements would still be all over
the map based on memory controllers and access times (unless, of
course, you cheat and assume that all your strings are in cache before
you run your strlen function, which isn't exactly fair).
[lots of unrelated stuff snipped]
>
> /*
> the idea of simply recompiling old programs that don't know any better
> and having them run faster, but the truth is that if people would
> simply stop using zstrings (except for interface with OSes and other
> code that uses them) and employ a better internal data structure, the
> */
>
> The internal data structure still needs initialization, string's
> lengths are very often needed.
Well, let's see. You initialize the string's length once. You use it
many times. Sounds like a pretty good tradeoff to me.
> Lazy Evaluation of them invokes branch
> for every instance of use of the information,
Who said anything about lazy evaluation? And even if we *do* use lazy
evaluation, it's still a *whole* lot cheaper than running strlen
everytime we want the length, no?
> which I already explained
> in this thread. There's tradeoff being made between different usage
> patterns.
Obviously, if you know something about your data, you can improve on
the algorithm in use. One thing you seem to know is that people use the
length quite frequently. So cache it rather than compute it (that is,
store the length as part of the data structure). While there may be
some *rare* cases where the extra work to save away this length is more
expensive than recomputing it each time, I'm not sure I can think of
any off-hand.
>
> /*
> whole issue of the "fastest strlen" algorithm would go away. If people
> would use reference counters and pointers, then worrying about the
> "fastest block move operation" would diminish, too.
> */
>
> I am not interested in "fast strlen" per-se, I am interested in
> discussing the futility of optimizing it in *assembler* when the
> performance difference is neglible.
Yes. It is futile to optimize bad algorithms to begin with. This is
true no matter *what* language you use. That being said, all you're
doing is saying that *some* HLL implementations beat *some* assembly
optimizations. Do you really think this is something new? The bottom
line is that on the variety of x86 processors available today, if
you've got a C compiler that generates code for *that specific*
processor and you compare this against assembly code that was optimized
for a *different* processor, what do you expect? And, of course,
CPU-dependent HLL code that does a decent job is going to walk all over
some sample code that was written by someone who (1) isn't very good,
(2) doesn't understand the characteristics of the CPU you're running
with, or (3) both.
> I first said that it's not such a
> bright idea, just using kinder words. Then I went out my usual trouble
> to put my code where my mouth is, done, as stated I don't talk *** and
> can backup my opinion with facts.
Why are you wasting your time? :-)
>
> /*
> As the old saying goes - get the algorithm (and, presumably, data
> structure) right *FIRST*.
> */
>
> Well, ofcourse, you are talking to the guy who proposed that three days
> before you..
No, I'm talking to the guy who proposed it in a repeat of a very long
running thread. I can assure you that my response to this question in
1996 was the same as it is today.
>
> After that is done, order of doubling the performance is still pretty
> cute for being just petty implementation detail. I want to emphasize
> the fact that I find it mostly cute that someone wants to stick to
> assembly code, which isn't portable and for all practical purposes
> equivalent performance to C++ code that does the same thing.
Unfortunately, your plan doesn't scale up very well. The problem with
C++ is *not* that you can't write efficient code if you're *very*
careful and consider the code the compiler is emitting (and adjust your
C++ source code appropriately). The problem is that no one writes
anything but trivial little (and often non-portable) code this way.
IOW, C++ really doesn't have much benefit over assembly when you go to
all this trouble other than it might be able to run on different CPUs
(but you often lose the optimizations when you do this).
>
> For the record, I'm not a big fan of writing software in assembly, I am
> a big fan in understanding assembly on platforms I work on so that I
> understand how to craft my code.
This is good.
> The assumption is that the
> optimizations the compiler implements don't suck, such as constant
> propagation, dead code elimination, common subexpression elimination,
> spilling and what not. That is, when such things matter.
Then the compilers turn around and generate brain-dead sequences for
other stuff. Believe me, I've spent a lot of time studying optimizing
compilers and their code emission (as I'm sure you have). I've seen
them produce some *brilliant* optimizations. I've *learned* some really
neat tricks by doing this. But for every brilliant optimization I've
seen in a particular compiler, I've also found a slew of bone-headed
code sequences that reduce brilliance to mediocrity.
I have no doubt that you can take a trivially small function like
strlen and coerce your HLL code to emit stuff about as good as a
hand-optimized assembly version of the same code. But as I mentioned
earlier, this trick doesn't scale up to larger programs. From a
software engineering perspective, it's just as hard to write C/C++ (or
other HLL code) this ways as it is to write assembly. And the tricks
you pull to get the good code emission often won't carry over to other
architectures (or even different CPUs in the same family). Though the
code *may* be portable (in the sense that the semantics are preserved
across compilations on different CPUs), the optimizations themselves
often are not. For example, IIRC your strlen function is working on
dwords. What happens when you compile this on a 16-bit CPU? On a 64-bit
CPU? Maybe the code works fine on various 32-bit CPUs, but the
optimization hardly portable across different CPUs. (BTW, just for the
record, I first saw this optimization in the BSD C standard library
code back around 1995, I wonder when the trick was first created?
Probably for the VAX?)
>
> They rarely do, as a matter of fact, as only fraction of code is time
> critical. I'd go as far as to say that if strlen() speed is time
> critical there is something seriously wrong with the design. However,
> that isn' the POINT of this thread. It is (for me) about higher level
> languages not necessarily implying inefficient, or slow.. or assembly
> automatically being synonymous with fast.
So you're proof is *one* example versus another?
Hmmm... Hardly convincing.
I can provide you with a whole slew of strlen programs in assembly that
run much slower than:
t = s;
while( *s ) ++s;
return s-t;
Indeed, if compilers are *so* great, why can't they convert this code
into something as wonderful as what you've presented? After all, doing
so is really just an induction step (albeit, a complex one).
>
> The way I see it is that C is portable assembler,
C provides no access to low-level machine facilities. Therefore, it is
not an assembly language. It may be lower-level than languages like C++
or Java, that does not make it an assembly language. In order for it to
be an assembly language, you need access to all the features of the
CPU.
> which makes it a jack
> of all trades but master of none. But in this case it doesn't hurt the
> performance, as can be seen the beginning of the code is nearly
> identical when compiled.
You make this claim with just one example?
Gee, I'd argue that it's going to be real hard for an assembly language
programmer to beat the code that a C compiler produces for the
following:
i = 0;
That doesn't prove C compilers are as good as assembly programmers by
any stretch of the imagination. You're example is a bit more complex,
but nowhere near sufficient to "prove" the point.
The real problem with HLLs like C++ is that they encourage people to
write code like:
for( myclass::iterator si = s.begin(); si != s.end(); si++ ) {...}
And they have no idea what the compiler is doing with their code. Take,
for example, that innocuous "si++" at the end of the for argument list.
Written in standard C style (++ as a suffix, which is bad style to
begin with), this simple statement may wind up creating a large
temporary object and then immediately destroy it. How ugly. Sure,
someone who knows exactly what's going on behind the scenes probably
wouldn't write code this way, but how often do you see people writing
standard C++ programs the way you wrote your xstrlen? Do you honestly
write *all* your C++ code that way? Or do you just write code that way
when you're trying to prove that C++ compilers can emit code that's as
good as assembly programmers? And when you *do* write code that way, is
it any faster or easier than using assembly?
Bottom line is that most C++ programmers would just write:
t = s;
while( *s ) ++s;
return s-t;
(or something similar) and move on. (This is assuming, of course, that
the function wasn't in the stdlib.) It doesn't matter how good the
compiler is (or should be), it's not going to be able to deal well with
code that is written in this fashion.
Though assembly programmers rarely do any better when writing a lot of
code (that is, they don't count cycles either), the bottom line is that
they are usually *forced* into looking at the abominations they create,
it's generally not hidden from them. As a result, they tend to do a
better job of implementing an algorithm that executes efficiently on
the underlying hardware than does a HLL programmer who doesn't have a
clue what's going on. Before you get in a tiff, I *do* realize that
*you* probably do know what's going on. But you don't write all the
world's HLL code.
> The only real difference is in the end,
> because I cannot express "subtract with borrow" concept in C,
Yes, you do not have access to the low-level machine. As I said, C is
not an assembly language. Believe me, you don't have access to a *lot*
of things that might be useful on occasion.
> therefore
> I crafted alternative which is nearly same performance in practise.. it
> ONLY adds little constant overhead, but the innerloop is, well, same.
The #2 thread (after strlen) is memcpy. My alternative is to simply use
the movsb instruction. When it's all said and done, it's not that much
slower than doing *really* fancy stuff involving SSE instructions,
unrolled loops, and other nonsense that doesn't help improve memory
bandwidth much. I won't argue that I've got the fastest memcpy around,
but the differences are rarely worth the effort.
>
> Also, I could unroll the C++ innerloop but I won't do it because I
> think it is not a particularly good idea in this case.
That depends entirely on the CPU and memory architecture.
Cheers,
Randy Hyde
.
- Follow-Ups:
- Re: The never ending assembly vs. HLL war
- From: jukka@xxxxxxxxxxxx
- Re: The never ending assembly vs. HLL war
- References:
- improve strlen
- From: Claudio Daffra
- Re: improve strlen
- From: spamtrap
- Re: improve strlen
- From: hutch--
- Re: improve strlen
- From: spamtrap
- Re: improve strlen
- From: jukka@xxxxxxxxxxxx
- Re: improve strlen
- From: jukka@xxxxxxxxxxxx
- Re: improve strlen
- From: jukka@xxxxxxxxxxxx
- Re: improve strlen
- From: hutch--
- Re: improve strlen
- From: jukka@xxxxxxxxxxxx
- Re: improve strlen
- From: randyhyde@xxxxxxxxxxxxx
- Re: improve strlen
- From: jukka@xxxxxxxxxxxx
- improve strlen
- Prev by Date: Re: Why gcc translate a c program into assemble as follow
- Next by Date: Re: improve strlen
- Previous by thread: Re: improve strlen
- Next by thread: Re: The never ending assembly vs. HLL war
- Index(es):