Re: Hex to ascii



randyhyde@xxxxxxxxxxxxx wrote:
Terje Mathisen wrote:
a) Are you doing this in a loop, with non-random inputs? Are smaller
numbers much more likely than those that have 9-10 significant digits?

I do an exhaustive test, with all possible values for 32 bits. In the
real world, of course, smaller numbers appear much more often than
large numbers, but I don't see that running all $1_0000_0000 different
values is going to favor one algorithm over the other.

If you test the values in order, then it makes a _huge_ difference, in favor of your version, due to minimizing the number of branch misses.

If you run with (say) 1000 random 32-bit value, then your version should be significantly slower.

Anyway, how fast can your code be?

Subtracting to detect digits will average 5 iterations/digit, right? (The alternative is a binary search which is even less predictable, and which has about 3 X the number of branches in the code).

If all iterations except the final one is correctly predicted, then we get an absolute minimum of 5 cycles + a branch miss costing anywhere from 5 to 20 cycles (depending upon the current cpu architecture).

With 10 digits, this comes out to 50 + 10 * [5-20] = [100 - 250]

Since my code needs 4 MUL operations, each of which costs 10-11 cycles on a P4, the entire code will take about 60-80 cycles on the same P4, while on any AMD or P6 style cpu we're down in the 30-50 cycle range.

b) You time my code while introducing a Partial Register Stall, costing
10-20 cycles directly after the end, by branching on ECX when my code
have just updated CL.

I will switch to ESI and see if it makes a big difference. But quite
frankly I see a bigger gain to be made by refactoring all the
memory-copy code. I'm thinking of having 10 variants of your algorithm,
each optimized for one through ten digits of output.

I really don't know/understand how you can know up front how many digits a number will need? Are you only ever using this as part of a formatted, fixed-field, output function?

c) You run on a P4, which has by far the (relatively) slowest MUL speed
of any modern x86-compatible cpu, while running the inner logic core at
double rate.

Yep. But that's still something I have to take into account when
designing general-purpose library code. I can't hand-tweak the
instruction scheduling for a particular CPU and call it the fastest
code around. As a general rule, I stay away from the cycle counting
stuff that varies from CPU to CPU and try to stick mainly with the
algorithmic choices.

Since the P4 is the outlier, and all other x86 cpus run MUL quite fast, it is unfortunate for my code that you happen to have a P4. :-(

It is still interesting that really naive code can outperform the
textbook "div/mod by 10" approach by such a great margin. :-)

It blew me away. I don't know what motivated me to try it in the first
place other than something along the lines of "I wonder how slow this
one would run." As I said, my attempt at "divide by multiplication"
(which didn't work out well because I didn't think of first dividing by
100,000) turned out to be slower. That's my I started this line of
questioning in the first place.

When working with 64-bit numbers, my assumption would be that really big numbers are quite rare, and that an approach based on a higher power of 10, but maybe not 100000, could be useful:

I.e. a repeated reciprocal multiplication to generate 3 digits/iteration, as a fraction after division by 1000. Back-multiplication by 1000 could be handled as (1024 - 8*3).

The modulo-1000 value can be converted to three digits using a single MUL to convert it into a binary-scaled decimal fraction: 9.99 * 2^n.

This approach would require two MUL operations per three digits, and it would stop as soon as the most significant triplet had been output.

Instead of doing back-multiplication and decimal fraction scaling, a more precise reciprocal, of at least 76 bits or so, would allow us to calculate both the results of the division by 10^n, and the decimal fraction correctly rounded up so that it could be used to output the digits directly. This would be easy to do by having two MULs, then shifting the lower half down before adding to the top half.

Terje
--
- <Terje.Mathisen@xxxxxxxxxxxxx>
"almost all programming can be viewed as an exercise in caching"

.



Relevant Pages

  • Re: LibTomMath forked [SSE2 addons]
    ... Some more timing stuff... ... Timing Multiplying: ... digits: 1010 cycles ...
    (sci.crypt)
  • Re: LibTomMath forked [SSE2 addons]
    ... Tom St Denis wrote: ... |>Timing Multiplying: ... |> 4 digits: 400 cycles ...
    (sci.crypt)
  • Re: Array multiplication algorithm
    ... AEI with a pointer of 16-bit integer type and then multiply and carry ... Chances are that your CPU ... Is that the fixed number of digits for your number? ... modulo operation and the upper one the carry, ...
    (comp.lang.c.moderated)
  • Re: Powers and logic
    ... quasi wrote: ... The problem was to find the sum of its digits. ... a terabyte of cycles is what -- half an hour of computing time? ...
    (sci.math)
  • Re: LibTomMath forked [SSE2 addons]
    ... > Timing Multiplying: ... > Timing Squaring: ... digits: 464 cycles ...
    (sci.crypt)