Re: improve strlen
- From: "jukka@xxxxxxxxxxxx" <spamtrap@xxxxxxxxxx>
- Date: Thu, 27 Oct 2005 00:05:58 +0000 (UTC)
What I don't understand is that if you are willing to write this in
assembler, why stick to 32-bit alu code? Why not write this doing 64 or
128 bits at a time, the change would be trivial if you can guarantee no
buffer overflow in the end.
If you align to 8 byte boundary, you can safely read 64 bits at any
given time as the blocks would be aligned so granularity of memory
allocation unit. I bet that would double the performance if you are so
inclined.
--> misc ranting ..
My goal was to demonstrate that a modern C++ compilers are pretty darn
good: they come to about same ballpark in performance as handwritten
assembly. The other message was to promote more sound programming
techniques (see the notes about storing string length to begin with, I
fail to see how you could beat THAT with the following usage:)
void foo(const string& bar)
{
int size = bar.length();
// do something with length...
}
Better software isn't about better implementation but better software
engineering practises. I on purpose don't say *faster* but *better*,
because performance critical areas are usually not in string
processing. There is just so much string data you can find to process
to begin with, unless it is very special application (I bet it' s not
your usual run of the mill desktop application, think Google, SQL, ..)
The string processing was just a good example of code, that can be
written in generic, portable C++ which can be compiled to nearly any
platform without severe performance penalties even on the x86 platform
where the code has been handcrafted with performance in mind (why else
assembly?)
That's why it is pointless to do performance comparison between
assembly and higher level language implementations, usually, the
assembly-heavy software is light on algorithms and data structures,
that is usually because such are a real pain to maintain in assembly.
Not impossible, been there, done that.. when I was a young&wild..
Here's example of my stupidity from 1997:
www.liimatta.org/misc/_alphablend.asm
More optimal implementation recognizes the pattern (well which this is
based on anyway):
xyzw
wwww
~x~y~z~w
~w~w~w~w
what we notice is that we basicly mean two mode bits:
1. w replicate
2. complement
These create four permutations we see above, so we could easily
re-construct the code sequence with four basic loop start blocks and
"build" the code on-fly without resorting to very comprehensive
techniques. Now we have 100 unique innerloops, which we could generate
(!) easily with a simple algorithm.
Even better, we could construct more complex sequences adding very
simple rules and still use fixed register allocation.
But I rather write this logic in C++, have the frontend and interface
to be C++ (or ANSI C) and then only use the (in this case, MMX
assembly) machine specific binary as output format for performance
reasons mainly.
The rationale for this has many aspects, here are a few:
- we cannot write all permutations at compilation time in ANY language
if number of permutations is too high
- so, we will use branching and possible also calls to subroutines to
mix simpler pieces for more complex "whole", but this is inefficient:
in the context this code is used, one function call per fragment is one
too many
- we could use templates to generate the code at compilation time
(which is approach I use i one library for pixelformat conversions and
blitting) but that is ONLY because the number of permutations is still
manageable
- we could use macros, and keep including same header with different
values for macros to generate different functions with different
parameters, but same problem as before: too many permutations and we
end up with 120 MB executable very easily
- realtime generated code doesn't even HAVE TO BE fully optimized, what
matters is that it is REASONABLY efficient as the alternative would be
very, very slow code
- even better if the designer and implementor of such system has
experience in compiler design and implementation (shouldn't be too
difficult hurdle, I think this is standard material for universities
around the world for programmer's)
- if no experience, it is possible to gain it my practise just slows
down the progress initially :)
Et cetera. That's just my thoughts in "assembler" -- bytecode is where
the action is at, and it's hard to find faster bytecode than platform
native binaries.. now the problem is twofold:
- writing a good optimizing compiler for intermediate code generation
- developing a good instruction selection implementation :)
That's where the action is, don't miss it by fumbling assembly by hand,
mostly waste of time. Just my $.02
.
- References:
- improve strlen
- From: Claudio Daffra
- Re: improve strlen
- From: spamtrap
- Re: improve strlen
- From: hutch--
- Re: improve strlen
- From: spamtrap
- Re: improve strlen
- From: jukka@xxxxxxxxxxxx
- Re: improve strlen
- From: jukka@xxxxxxxxxxxx
- Re: improve strlen
- From: jukka@xxxxxxxxxxxx
- Re: improve strlen
- From: hutch--
- Re: improve strlen
- From: jukka@xxxxxxxxxxxx
- Re: improve strlen
- From: hutch--
- improve strlen
- Prev by Date: Re: improve strlen
- Next by Date: Re: Why gcc translate a c program into assemble as follow
- Previous by thread: Re: improve strlen
- Next by thread: Re: improve strlen
- Index(es):
Relevant Pages
|