Re: improve strlen



/*
Unrolling code won't *kill* branch prediction; it might not take
advantage of branch prediction, but it doesn't kill it (i.e., make it
less effective when it is used).
*/

Unrolling 8 times while having essentially verbatim the same innerloop,
branch instruction excluded is not particularly effective optimization.
The biggest optimization is that the code is bigger.

/*
Of course, the main problem I'm seeing here is that people are running
their programs and measuring the results on different CPUs. And all
*/

It's not really a problem, I'm running MIPS, PPC, x86-64 and IA32 and
also working with ARM (various versions) and generally, higher-level
optimizations are "portable", there is ofcourse small fluctuations
based on architechtural differences but generally like I just wrote
higher level code, which does less "work", is usually faster on any
platform.

Biggest differences come from floating point to integer and vice-versa
interaction and dependent issues, also lack of floating point processor
mixes things about but that isn't the case here. The code is fairly
trivial integer-only very RISC-like ALU only is required. Number of
registers needed is small here, too.

/*
bets are off when you do that. This is why it's almost a worthless
exercise to "count cycles" these days. If you optimize the code on one
CPU, chances are pretty good it *won't* be optimal on a different CPU.
*/

First, I am not counting cycles. I am measuring average performance
over large number of runs on data that isn't same on each successive
run. Slowest and fastest runs are always dropped so that random
fluctiation is eliminated. Additionally, each test string is tested
with different start offsets so that different alignment cases are
tested and different "leftover" cases are tested.

The same code works as regression test, too. :)

/*
So unless you're writing the code to run on a specific rev of a
specific CPU, shaving one or two clock cycles off the execution time of
an algorithm is a real waste of time.
*/

In production code I rarely do this, I write pretty good code already
off the bat anyway. But this is for fun, dude, you think Usenet is
where my work is at? ;-)

/*
over and over again, the more reasonable question to ask is "why aren't
all these people using a better data structure that makes string
operations more efficient?" (e.g., including the length as part of the
*/

Apperently you haven't read the whole thread. I think it was me, who
suggested that storing length of the string into the string object is
way to go, multiple times infact. Here's the string class, actually:

http://www.liimatta.org/misc/string.hpp

It is using template meta programming for string operations, I wrote
this as experiment on the topic in 2003, the basic principle is that
when we have expression:

string bla = a + b + c + d + e ...;

The object on right side is type tree (look at expr_string template),
the right and left nodes are trees, which store the operans of each
side. This tree will be "unbalanced" as we don't have parenthesis,
however that isn't a problem as the tree is traversed only once (the
assignment evaluates the tree) and inserts are always O(1).

What this enables is that the "length" of the whole tree is known at
evaluation time in linear time O(n) (this could be possible to improve
but I weren't arsed as I don't believe optimizing what doesn't
require).

After the length is known, each node in the tree is basicly
block-copied into the slot in the left sidehand argument object. No
temporary string objects are created for each successive + operator
like with traditional string (object) implementation.

/*
the idea of simply recompiling old programs that don't know any better
and having them run faster, but the truth is that if people would
simply stop using zstrings (except for interface with OSes and other
code that uses them) and employ a better internal data structure, the
*/

The internal data structure still needs initialization, string's
lengths are very often needed. Lazy Evaluation of them invokes branch
for every instance of use of the information, which I already explained
in this thread. There's tradeoff being made between different usage
patterns.

/*
whole issue of the "fastest strlen" algorithm would go away. If people
would use reference counters and pointers, then worrying about the
"fastest block move operation" would diminish, too.
*/

I am not interested in "fast strlen" per-se, I am interested in
discussing the futility of optimizing it in *assembler* when the
performance difference is neglible. I first said that it's not such a
bright idea, just using kinder words. Then I went out my usual trouble
to put my code where my mouth is, done, as stated I don't talk *** and
can backup my opinion with facts.

/*
As the old saying goes - get the algorithm (and, presumably, data
structure) right *FIRST*.
*/

Well, ofcourse, you are talking to the guy who proposed that three days
before you..

After that is done, order of doubling the performance is still pretty
cute for being just petty implementation detail. I want to emphasize
the fact that I find it mostly cute that someone wants to stick to
assembly code, which isn't portable and for all practical purposes
equivalent performance to C++ code that does the same thing.

For the record, I'm not a big fan of writing software in assembly, I am
a big fan in understanding assembly on platforms I work on so that I
understand how to craft my code. The assumption is that the
optimizations the compiler implements don't suck, such as constant
propagation, dead code elimination, common subexpression elimination,
spilling and what not. That is, when such things matter.

They rarely do, as a matter of fact, as only fraction of code is time
critical. I'd go as far as to say that if strlen() speed is time
critical there is something seriously wrong with the design. However,
that isn' the POINT of this thread. It is (for me) about higher level
languages not necessarily implying inefficient, or slow.. or assembly
automatically being synonymous with fast.

The way I see it is that C is portable assembler, which makes it a jack
of all trades but master of none. But in this case it doesn't hurt the
performance, as can be seen the beginning of the code is nearly
identical when compiled. The only real difference is in the end,
because I cannot express "subtract with borrow" concept in C, therefore
I crafted alternative which is nearly same performance in practise.. it
ONLY adds little constant overhead, but the innerloop is, well, same.

Also, I could unroll the C++ innerloop but I won't do it because I
think it is not a particularly good idea in this case.

.