Re: multibyte subtraction using AAS and ASCII Decimals




"cr88192" wrote:
[..]
the assembler versions are just horrors.
I don't think so...there is a lot of power in it.

well, I was referring mostly to "fully expanded" multiplies (ie: all the
loops are unrolled into a mass of fixed-purpose code...).
in C, this results in huge masses of code, and in assembler it is far
worse > (for an autogenerated version, several kloc for a single function,
....).
multiplies can be scary...

Loop unrolling may not always speed things up.
Many fetch and cache penalties may destroy the gained saving on
cc-branches.

[]
well, loops involve a mass of compares and conditional jumps, so if it is
flattened out into a single mass of instructions, it should go faster.
still, this is only reasonable for smaller sizes it seems.

64 bits, yes, fairly simple (4 multiplies for a 64->128 bit expansion).
128 bits, at least, still reasonable to type out.
256 bits, errm, a chore to type even in C.
512 bits, errm, no. this beast comes out at 3 kloc.
1024 bits, quite a bit worse.

also, for the 256 and greater, I end up loosing as much register density,
so it makes more sense to just do it in C and hope the compiler can do it
good enough.

Might be interesting to compare performance ...

some operations seem terribly dubious though, such as the large
multiplies:
they are so bulky I have doubt as to what the processor will do with
them
(like 3k lines of assembler for a 512-bit multiply, insane...).

Look at my homepage example of how to do fast MUL on any integer size.
It easy can be expanded to act on more than 512 bits.
ok.

it is dense and small, but would involve more operations/per multiply than
my versions do.
basically, my versions have 2 memory accesses per mul, wheras this version
has 4 (and a few other accesses to memory).

With only two memory accesses ?
I see you work just on 32-bit junks and add edx before the next mul.

it also has the overhead of implementing loops (IME, conditional jumps are
unreasonably expensive, and thus I prefer to avoid them whenever
reasonable).

then again, I begin to wonder if for large types (above 128 or 256 bits),
if it makes more sense to just use loops.

Right, I haven't checked the break-even point [fetch vs. branch penalties],
I just found it more convenient to have only one short routine to cover
all MUL above 64*64-bits.

problem with multiplies is that they get larger at an n^2 rate...
unless there is some factoring teqnique of which I am not aware
(ie: as there is with the DCT...).

I'm not familiar with the term "DCT".

just me remembering, the DCT cam be implemented with a complexity of O(n
log2 n) or such (even though cannonically, it takes O(n^2)).

part of the reason is that when expanded (and the coefficients are
evaluated), there are some amount of shared terms, which can be factored
giving a fairly dense calculation.

however, I am not sure if anything like this is possible for multiplies, I
doubt it.

I got a partial LOG LUT but it's limited to 256 bits resolution,
it's fine for num-Ascii->bin conversion, divide and a few more,
but it wont speed up large-by-large MUL.

[...]
ended up burning out on this.
[...]
but oh well...).
actually, I still kind of doubt that the code will work reliably.

Sooner or later you will find a solution which fits your goal :)

__
wolfgang



.



Relevant Pages

  • Re: Apple II graphics techniques
    ... Unrolling loops (using X,Y or a RAM location to loop), ... I can hardly recall even a handful of 3D apps/games for the Apple II. ... HBCC 3D has a both - the math was much harder than the code to draw the screen, but the screen code was a bunch of hand-unrolled loops for each column type. ... unrolling and self-modifying code types of optimizations are rarely beneficial on modern superscalar machines. ...
    (comp.sys.apple2.programmer)
  • Re: Spill code and unrolling
    ... You may be interested in this piece of work by Vivek Sarkar - it ... Optimized Unrolling of Nested Loops. ... Intuitively, I would assume that in a system with data caches, the ...
    (comp.compilers)
  • Re: cycles per instruction
    ... Did you make sure compiler optimization is off? ... those loops. ... With only one Div ... or Mul per loop, the overhead of the for-loop itself will severely ...
    (borland.public.delphi.language.basm)
  • Re: cycles per instruction
    ... I did some quick measurements with a big loops: ... With only one Div or ... Mul per loop, the overhead of the for-loop itself will severely distort the ...
    (borland.public.delphi.language.basm)
  • Re: Use aggregates (Was: Allocation question)
    ... (Did you try this with C99's variable length arrays?) ... at unrolling that into loops. ... point of using Ada, if you have to write C in Ada to match the performance of C. ...
    (comp.lang.ada)