Re: improve strlen



What I don't understand is that if you are willing to write this in
assembler, why stick to 32-bit alu code? Why not write this doing 64 or
128 bits at a time, the change would be trivial if you can guarantee no
buffer overflow in the end.

If you align to 8 byte boundary, you can safely read 64 bits at any
given time as the blocks would be aligned so granularity of memory
allocation unit. I bet that would double the performance if you are so
inclined.

--> misc ranting ..

My goal was to demonstrate that a modern C++ compilers are pretty darn
good: they come to about same ballpark in performance as handwritten
assembly. The other message was to promote more sound programming
techniques (see the notes about storing string length to begin with, I
fail to see how you could beat THAT with the following usage:)

void foo(const string& bar)
{
int size = bar.length();
// do something with length...
}

Better software isn't about better implementation but better software
engineering practises. I on purpose don't say *faster* but *better*,
because performance critical areas are usually not in string
processing. There is just so much string data you can find to process
to begin with, unless it is very special application (I bet it' s not
your usual run of the mill desktop application, think Google, SQL, ..)

The string processing was just a good example of code, that can be
written in generic, portable C++ which can be compiled to nearly any
platform without severe performance penalties even on the x86 platform
where the code has been handcrafted with performance in mind (why else
assembly?)

That's why it is pointless to do performance comparison between
assembly and higher level language implementations, usually, the
assembly-heavy software is light on algorithms and data structures,
that is usually because such are a real pain to maintain in assembly.
Not impossible, been there, done that.. when I was a young&wild..

Here's example of my stupidity from 1997:

www.liimatta.org/misc/_alphablend.asm

More optimal implementation recognizes the pattern (well which this is
based on anyway):

xyzw
wwww
~x~y~z~w
~w~w~w~w

what we notice is that we basicly mean two mode bits:

1. w replicate
2. complement

These create four permutations we see above, so we could easily
re-construct the code sequence with four basic loop start blocks and
"build" the code on-fly without resorting to very comprehensive
techniques. Now we have 100 unique innerloops, which we could generate
(!) easily with a simple algorithm.

Even better, we could construct more complex sequences adding very
simple rules and still use fixed register allocation.

But I rather write this logic in C++, have the frontend and interface
to be C++ (or ANSI C) and then only use the (in this case, MMX
assembly) machine specific binary as output format for performance
reasons mainly.

The rationale for this has many aspects, here are a few:

- we cannot write all permutations at compilation time in ANY language
if number of permutations is too high

- so, we will use branching and possible also calls to subroutines to
mix simpler pieces for more complex "whole", but this is inefficient:
in the context this code is used, one function call per fragment is one
too many

- we could use templates to generate the code at compilation time
(which is approach I use i one library for pixelformat conversions and
blitting) but that is ONLY because the number of permutations is still
manageable

- we could use macros, and keep including same header with different
values for macros to generate different functions with different
parameters, but same problem as before: too many permutations and we
end up with 120 MB executable very easily

- realtime generated code doesn't even HAVE TO BE fully optimized, what
matters is that it is REASONABLY efficient as the alternative would be
very, very slow code

- even better if the designer and implementor of such system has
experience in compiler design and implementation (shouldn't be too
difficult hurdle, I think this is standard material for universities
around the world for programmer's)

- if no experience, it is possible to gain it my practise just slows
down the progress initially :)

Et cetera. That's just my thoughts in "assembler" -- bytecode is where
the action is at, and it's hard to find faster bytecode than platform
native binaries.. now the problem is twofold:

- writing a good optimizing compiler for intermediate code generation
- developing a good instruction selection implementation :)

That's where the action is, don't miss it by fumbling assembly by hand,
mostly waste of time. Just my $.02

.



Relevant Pages

  • Re: Catagories of Assembly Languages
    ... >>an assembler, despite the fact that all the other facilities are ... That's not what "preprocessing" has traditionally meant. ... "preprocessing" occurs *before* compilation. ... the *source input* language. ...
    (alt.lang.asm)
  • Re: "Faster, More Powerful . 64 Bits!"
    ... > efficient as native code compilation. ... > written in raw assembler by hand. ... Giving all this control to Microsoft ... that survey and it's questions though-- perhaps another future survey ...
    (borland.public.delphi.non-technical)
  • Re: moving from vb.net to c#
    ... I'm assembling a compilation of all the ... VB language hides a lot from the developer and allows for and automatically ... >>> Why not just code in Assembler then? ... > I bet there WERE programmers who were upset. ...
    (microsoft.public.dotnet.framework.aspnet)
  • Re: translating Python to Assembler
    ... program in machine code, which is all 1's and 0's. ... assembler has to be compiled in order to convert the opcodes to ... IMHO this "compilation" if trivial compared to HLL ... since it's just a translation from opcodes to numbers ...
    (comp.lang.python)
  • Re: improve strlen
    ... The biggest optimization is that the code is bigger. ... each test string is tested ... The object on right side is type tree, ... The way I see it is that C is portable assembler, ...
    (comp.lang.asm.x86)