Re: The never ending assembly vs. HLL war
- From: "jukka@xxxxxxxxxxxx" <spamtrap@xxxxxxxxxx>
- Date: Thu, 27 Oct 2005 05:42:00 +0000 (UTC)
> This is true, but it still doesn't *kill* branch prediction.
Here's what I wrote:
"Unrolling like that kinda kills branch prediction, I suppose. And
burns
code/tracecache for no real gain.. "
See, that when I am not sure, I express it (", I suppose."). I don't
see the problem here, I didn't correct the correction now did I? I
clarified what I mean and did acknowledge the correction as due, still
it seems to be issue somehow. I'm missing something?
/*
All the more reason that counting cycles isn't a good approach for a
strlen optimization.
*/
What "metric" would you, sir, use to measure which code is more
"efficient", when it comes to runtime performance? Timing is a good
practise, Pentium M isn't precisely off-the-charts and/or wildly
different for the timings for other contemporary x86 implementations.
I wanted to know how each perform, on *my system*, which I mention
carefully nearly every single time I post what it is (Pentium M 1.8
GHz). I post the results and Hutch is free to post his, which he has
done. This gives already three different x86 implementations to compare
with.
> Yes, you are.
I see. I thought you meant as-in "count clockcycles", when it was still
possible, like, with Pentium or older x86 implementations, memory
performance excluded. I thought I were *timing* the code, which is
slightly different but apparently not Enough.
Again, I don't see the problem with measuring the performance on one
system, note that I am not drawing conclusions how the code shall
perform on other, different systems. I am extrapolating, and with a
good reason that the performance shall be relatively similiar or in
similiar ballpark.
/*
And this is averaged across all possible x86 CPUs (and others, too),
right? If not, you're counting cycles on *one* particular CPU. And
that's a waste of time if you're trying to produce generic results.
*/
And it wouldn't be waste of time to run out and use a day or two to
test this code on 20 or even 40 different x86 implementations? I think
it is a fair comparison, that's how the code performs on Pentium M 1.8
GHz. That was stated many times in this thread.
Ofcourse, you admit later in your reply that you haven't READ the whole
thread so how could have you known?
/*
Yes, rewriting strlen algorithms seems to be one of the funnest things
ever. Seems like someone new is doing it every other week.
*/
I didn't rewrite anything. I reverse engineered the C++ source from
assembly source. I believe Agner Fog wrote the assembly code in
question.
>> Apperently you haven't read the whole thread.
>
>Why bother?
To know what I was replying to, for starters? At first you seem to hold
misconception that I am somehow advocate of strlen(), etc.. you seem
grossly misinformed in your assumptions of what my stance was in the
first place. But why bother?
/*
Do you really think you're adding anything new to the discussion that
hasn't been said a thousand times already?
*/
Sir, I been seeing these "discussions" a thousand times already, too,
and I seen bizarre things like "100% Windows Assembler Programmers",
and what not. But this discussion amuses me and quite honestly, I don't
have better things to do at this time as I am traveling the world with
my laptop and credit card. It is not a waste of time in a sense that it
is a good way to pass time.
You think, that I think, that I am adding a lot of new technology and
research innovation into the topic? You think that I am so stupid, that
I think *I* am creating something new, when I am merely reverse
engineering asm source into C++ ? Excuse me, either you must think I am
an idiot, or you are just trying to teach me something, like, a lesson?
As rational guy, those are things that spring to mind first. I could be
wrong, which wouldn't be the first time in history.. but either way I'm
going to force myself thinking you are simply much more experienced
than I am. No problem there, really!
>Do you really think this is
>new territory around here (or anywhere else)?
Now that you mention it, please give me references into the topic as I
am curious what conclusions the previous thousand similiar debates came
into? Let me guess, either brilliant new techniques were invented, or,
everyone agreed that , ***, it doesn't make a frigging difference! (I
am inclined towards the later proposal but that's just me)
/*
The better question for
you to ask is "why bother responding to a thread that is probably the
#1 FAQ in a newsgroup such as this one?"
*/
You seem to know the answer to that one as you asked it from yourself,
then promptly proceed to reply to this thread. No, wait, you aren't
even interested in the topic and reply anyway, why would you do that?
(I'm assuming you're not interested as you give impression that this
has been beaten to death a thousand times already, etc... if I'm wrong
and you have interest my bad).
> Do you think you're the first to make this suggestion?
Well, in this thread, yeah. You *thought* *you* were first to make this
suggestion in this thread? Yes, you did.
> Do you think my response is directly specifically at you?
Ofcourse not, since this is a public forum.. however, when you quote
me, it is still a comment on something I wrote, no? Or what's your
point? That I don't know how to discuss in the Usenet? Maybe I don't,
so?
/*
a comment to anyone reading this thread who thinks that there might be
something to what's being said here.
*/
That's fair, I agree with that without any problem whatsoever. Not that
you need my approval but I let you know anyway.
/*
The truth is, *memory architecture* is the biggest impediment to memory
scanning algorithms like strlen. Even if *every* (x86) CPU had exactly
*/
That's why I was confident that no assembler tomfoolery won't be
substentially more efficient. Infact, if you go and scan 64 bits per
iteration with MMX you still get roughly the same performance. The
"application" is bandwidth limited there is no way around that.
You ask me before if I think I am bringing some new amazing innovation
into the table. I return the question to you.
>Who said anything about lazy evaluation? And even if we *do* use lazy
>evaluation, it's still a *whole* lot cheaper than running strlen
>everytime we want the length, no?
I did. In the first reply to this thread, I believe (why bother even
knowing what my stance on the issue is before jumping the gun, right?).
The alternative I had in mind was calling the strlen() every time
string was created from zstring, not every time the length was queried,
so no, wrong assumption.
>expensive than recomputing it each time, I'm not sure I can think of
>any off-hand.
Me neither, hence, I store the length. std::string, Pascal, etc. all
seem to come to the same conclusion, so no, I don't think I am "alone"
with this "information" or presenting anything groundbreaking.
I was mostly interested in dispelling the "assembler myth", you seem to
be more interested in setting the record straight from "misinformation"
I been spreading (or whatever, wasting my own time, other's
time...bandwidth... storage.. increasing signal-to-noise ratio, all of
the previous :).
/*
hat being said, all you're
doing is saying that *some* HLL implementations beat *some* assembly
optimizations. Do you really think this is something new?
*/
In general? Nope. When replying to Hutch? Yeah, I did, actually.
So something new to whom? You? I think not. To me? I think not.
/*
for a *different* processor, what do you expect? And, of course,
CPU-dependent HLL code that does a decent job is going to walk all over
some sample code that was written by someone who (1) isn't very good,
(2) doesn't understand the characteristics of the CPU you're running
with, or (3) both.
*/
pssst... the assembly output from compiler for the innerloop was
*identical* to the original assembly code. That sort of means no matter
what x86 implementation the code being the same I am not surprised the
timings being nearly same, too, the differences come from the constant
overhead mostly from the code at the end.
> Why are you wasting your time? :-)
I got time!
> 1996 was the same as it is today.
Mine in 2005 is that it doesn't make much of a difference to me, the
code snip could have been anyting under the sky. Doesn't make any
difference to me what it was, I was mainly keen on the ASM vs HLL
aspect.
/*
Unfortunately, your plan doesn't scale up very well. The problem with
C++ is *not* that you can't write efficient code if you're *very*
careful and consider the code the compiler is emitting (and adjust your
C++ source code appropriately). The problem is that no one writes
anything but trivial little (and often non-portable) code this way.
*/
Mostly applies for instances where I expect to reuse the code and don't
want worst possible case runtime characteristics to be easily invoked.
If possible by design, not at all.
/*
IOW, C++ really doesn't have much benefit over assembly when you go to
all this trouble other than it might be able to run on different CPUs
(but you often lose the optimizations when you do this).
*/
A list of platforms I am working on omitted here, because you don't
give a rat's ass.
/*
I have no doubt that you can take a trivially small function like
strlen and coerce your HLL code to emit stuff about as good as a
hand-optimized assembly version of the same code. But as I mentioned
earlier, this trick doesn't scale up to larger programs. From a
*/
Nor it should.
/*
And the tricks you pull to get the good code emission often won't carry
over to other
architectures (or even different CPUs in the same family).
*/
Which tricks might those be? I used addition, subtraction, bitwise or,
bitwise and and other pretty fundamental operations and very small
number of local variables, the compiler must work really hard to come
up with more than 8 registers needed to hold all that up. Unless we go
to compile out code for Commodore64, or maybe something z80 based we
won't run out of registers for such trivial function very easily,
atleast.
I mention 8 above because it takes some effort to think what 32 bit
architechture, for example, would have less registers and I might be
actually someday compiling for it.
/*
What happens when you compile this on a 16-bit CPU?
*/
God forbid, or 8-bit CPU! There are practical limitations on what I
assume the system I will use the code on, will have. I don't assume
this will work very well on PDP's either!
/*
On a 64-bit CPU? Maybe the code works fine on various 32-bit CPUs, but
the
optimization hardly portable across different CPUs.
*/
It does, on some. MIPS, PPC and x86-86 spring to mind. If there is
64-bit CPU where sizeof(int) == sizeof(long long) == 8, where char is 8
bits (sizeof(char) == 1, always) then it won't work and the
configuration will be unknown or not supported, and the headers will
fail compilation at #error ).
If it is just one or two functions that fail, such as this case, have
to put #ifdef / #endif kludge there to always use the "char*" version,
if that doesn't work either then no support. So far the codebase has
been very useful, though.
I would surmise the code is order of magnitude more useful than x86
specific assembly snip.
> So you're proof is *one* example versus another?
It's not proof, it's an example. Here is the post I replied to, maybe
you should afterall read what is being discussed?
"Don't be overawed by compilers, assembler coding is not restricted to
the architecture of a C compiler. The following code is a modification
of Agner Fog's DWORD string length routine that aligns the start and
tests the length 4 bytes at a time. It has no stack frame and conforms
to the normal register preservation rules under windows so it preserves
ESI and EDI but trashes the rest. "
It is clearly stating that such optimization is exclusive to assembly,
which apparently isn't the case. Now you know the *context* of the
discussion, atleast.
>Hmmm... Hardly convincing.
Convincing enough to debunk the implication that such optimization is
only achievable though the holy assembly.
>I can provide you with a whole slew of strlen programs in assembly that
>run much slower than:
>
>t = s;
>while( *s ) ++s;
>return s-t;
Just one is plenty, please do by all means.
/*
Indeed, if compilers are *so* great, why can't they convert this code
into something as wonderful as what you've presented? After all, doing
so is really just an induction step (albeit, a complex one).
*/
Now it is all of a sudden "wonderful", do I sense sarcasm?
>C provides no access to low-level machine facilities. Therefore, it is
>not an assembly language. It may be lower-level than languages like C++
>or Java, that does not make it an assembly language. In order for it to
>be an assembly language, you need access to all the features of the
>CPU.
Oh, gee-whizz, you found something to pick on, good for you! Maybe I
need to clarify my stance on this?
What I mean, is, that when you write assembly you generally use the
fully qualified register names. You maintain register names manually.
Labourous and error prone process. Enter ANSI C. You can write your
intention with named variables, which are then at compilation
translated and assigned to real registers (add spilling for flavour in
this so.-called register allocation stage). And on and on.
Because most microarchitechtures are different, the pragmatic approach
taken is to find a common subset of operations the language supports. A
no-brainer, as you well know, you just wanted to nit-pick, well good
job! Congrats!
> which makes it a jack of all trades but master of none.
As you quoting this, I hope you also read this! Guess what!? The above
quote means same thing just without all the flair and nitpicking going
about!
> You make this claim with just one example?
Well, mostly the claim was based on 10+ years of professional
experience (and nearly 20 years of programming, total) and the opinion
that comes with that. I'm sure you also have a lot of experience, so
you know what I am talking about.
>Gee, I'd argue that it's going to be real hard for an assembly language
>programmer to beat the code that a C compiler produces for the
>following:
>
>i = 0;
Okay. Gee, that can be completely eliminated if the result isn't ever
used in the current scope, as I don't even see function call so any
possible side effect can easily be determined to be non-existent in
this case. It's random if assembler coder will "see" this or not, it is
more deterministic if a compiler will see this or not. But if don't
know the compiler in advance, then, it isn't.
I dunno what to make of that. Was that a kind of ridiculous example to
show my actions in a "different light", so to speak? If so, ummmm...
right.
>That doesn't prove C compilers are as good as assembly programmers by
>any stretch of the imagination. You're example is a bit more complex,
>but nowhere near sufficient to "prove" the point.
C compilers aren't better than assembly programmers, they are just more
time and money -efficient. When there isn't choise, there isn't choise,
ask Tom Duff.
But that wasn't my point, even though you seem to have that
disillusion... I was showing how that particular assembly code snip
doesn't "beat" HLL code, not the other way around.. a subtle
difference...maybe too subtle if haven't even read the thread...
>for( myclass::iterator si = s.begin(); si != s.end(); si++ ) {...}
>
>And they have no idea what the compiler is doing with their code. Take,
>for example, that innocuous "si++" at the end of the for argument list.
++i vs. i++, gee-whizz, now we're getting to the ABC's and 101's of C++
programming.. and you blame me for going too basic? ;-)
Yeah yeah, i++ creates temporary object because it has to return the
*current* value, before returning from ++ operator (postfix) we have to
increase the current value, we cannot return it.. so we return
temporary object created before the increment.
/*
someone who knows exactly what's going on behind the scenes probably
wouldn't write code this way, but how often do you see people writing
standard C++ programs the way you wrote your xstrlen?
*/
I wouldn't know, I suppose been in a professional community for far too
long. My attitude isn't professional as I am a bit childish, you might
have noticed.. but that's my problem, thank you for not making funny
remark about that in advance.
> Do you honestly write *all* your C++ code that way?
I need clarification on this, what you mean "that way?", what
specificly strikes odd in "that way"-- I don't get it, yes, I do write
code "that way" a lot of times, it comes from the backbone. Is it that
bad, if so, show me the error in my ways and I'll learn.
What took me so much effort was that first I reverse engineered the asm
snip, but I wasn't happy with it as I would *never* actually, go out,
and write code that was off the bat. I got some idiosynchronies, I
admit, which I follow as I found them a sound practise, and I keep
myself trim and up-to-date what works, and what doesn't.
For example,
while ( x-- ) { ... }
vs.
do { ... } while ( --x );
vs.
for ( ; x<xmax; ++x ) { ... }
And so on and so on. I assume very little, I check, but I still keep in
mind the semantics and how they can be expected to behave if thinking
in serial way about how the IP moves over the code. Example, the first
while loop might not do what expected if x has certain value ;) ;) the
do-while is always executed atleast ONCE, might be a Bad Thing.. the
for-loop is more equivalent to the while than do-while if want to look
at the pass-through-branch aspect and so on... there is no end to these
things, really.
> Or do you just write code that way
>when you're trying to prove that C++ compilers can emit code that's as
>good as assembly programmers? And when you *do* write code that way, is
>it any faster or easier than using assembly?
Well, ***, www.liimatta.org, go to Fusion page, download the "latest
version", decompress the sourcecode. The sourcecode is 750 kB
compressed (with some minor data inside), feel free to go through every
line if you have to.
And no, I don't write code "that way" to be faster than assembly.
That's "the way" I write code. I don't know if you trying to insult, be
polite or just being sceptic. Whatever, dude, if you don't have time or
will to verify what I write here, good, I wouldn't care what you think
about me. But that's some work I been doing. Want my resume? I don't
have one. I always have job offers on my inbox and it been that way
since 1996 or so.
>Bottom line is that most C++ programmers would just write:
>
>t = s;
>while( *s ) ++s;
>return s-t;
>
>(or something similar) and move on.
Guess what? That's precisely what I wrote, too, and moved on. It was
until recently, I think October 24, 2005 that I, for fun, wrote the C++
version and as it proved to be alright added it to the source
repository. I wrot regression tests only for the reason that the code
was added and I don't take changes lightly, otherwise I'd move on
again. Infact, I moved on already, now what is left is this
discussion..
/*
clue what's going on. Before you get in a tiff, I *do* realize that
*you* probably do know what's going on. But you don't write all the
world's HLL code.
*/
Most of the world's HLL code doens't need to be "fast", most of the
times I would be glad if it "worked", which it generally does, if not
before a patch or two atleast after.
"fast" is not a goal, "fast enough" is. If code is "fast enough",
that's it, job done. I seen some guy optimizing keyboard interrupt
handles in assembler for MS-DOS, maybe he thought someone would press
keys really, really fast and that would slow his program down, go
figure. Or maybe he was scared that he would miss a few keystrokes,
again, go figure. Such strange characters are not my specialty (you may
say myself excluded... ?=)
>Yes, you do not have access to the low-level machine. As I said, C is
>not an assembly language. Believe me, you don't have access to a *lot*
>of things that might be useful on occasion.
I don't have to believe YOU, I believe my own EXPERIENCE.
>The #2 thread (after strlen) is memcpy. My alternative is to simply use
>the movsb instruction.
Are you trying to insult my intelligence? Look at the string.hpp, you
might see std::memcpy() being invoked here, and there. I don't even
*consider* the alternative!
If you see meta::vcopy(), it is a different beast, it does check if
type is pod (uses traits) and does memcpy, or object-by-object copy so
that corresponding copy constructors and what not are invoked correctly
in the process. Mostly I use that construct with templates.
>> Also, I could unroll the C++ innerloop but I won't do it because I
>> think it is not a particularly good idea in this case.
>
>That depends entirely on the CPU and memory architecture.
Context: competing with the specificly mentioned assembly code, which
was unrolled. If you take that into consideration you have not-so-much
to nitpick about.
Since it is x86 assembly, I'm assuming some contemporary x86
implementation will be running the code. I don't think unrolling the
C++ code will do much good. Maybe on 386 or older processor it might
pay off, hell, most likely it would. But I'm not too much interested in
386 these days...
.
- Follow-Ups:
- Re: The never ending assembly vs. HLL war
- From: randyhyde@xxxxxxxxxxxxx
- Re: The never ending assembly vs. HLL war
- References:
- improve strlen
- From: Claudio Daffra
- Re: improve strlen
- From: spamtrap
- Re: improve strlen
- From: hutch--
- Re: improve strlen
- From: spamtrap
- Re: improve strlen
- From: jukka@xxxxxxxxxxxx
- Re: improve strlen
- From: jukka@xxxxxxxxxxxx
- Re: improve strlen
- From: jukka@xxxxxxxxxxxx
- Re: improve strlen
- From: hutch--
- Re: improve strlen
- From: jukka@xxxxxxxxxxxx
- Re: improve strlen
- From: randyhyde@xxxxxxxxxxxxx
- Re: improve strlen
- From: jukka@xxxxxxxxxxxx
- The never ending assembly vs. HLL war
- From: randyhyde@xxxxxxxxxxxxx
- improve strlen
- Prev by Date: Re: compiler generated output
- Next by Date: Re: Combining two MMX registers into one SSE register?
- Previous by thread: The never ending assembly vs. HLL war
- Next by thread: Re: The never ending assembly vs. HLL war
- Index(es):