Re: AMD's 3DNow! imprecise
- From: kuratkull <spamtrap@xxxxxxxxxx>
- Date: Thu, 3 Apr 2008 22:45:12 -0700 (PDT)
On Apr 4, 5:23 am, Waldek Hebisch <spamt...@xxxxxxxxxx> wrote:
kuratk...@xxxxxxxxx <spamt...@xxxxxxxxxx> wrote:
Anyway, I am trying to build the very-fast-prime-counter i assembly
and I succeeded when using only the regular instructions, but a 6.5
secs for a 100k primes is way too slow, so I looked around and
profiled my code...and of course, a major bottleneck was the idivl.
I am not sure what "prime-counter" mean -- I you want just generate
primes than sieve of Eratosthenes is the standard method (much faster
than method using division).
In case you really want to use division, you may consider storing
recipracals -- you are using the same divisor many times.
So
I decided to use the 3DNow! instructions.
ecx = ecx+2 (begins from 3; 2 is already in the stack)
edx = all found primes less than current ecx, are looped through it.
MOVD %ecx, %MM0 /* divisor */
PI2FD %MM0, %MM0 /* divisor to float */
MOVQ %MM0, %MM2 /* copy of divisor */
PFRCP %MM0, %MM0 /* 1/divisor */
MOVD %edx, %MM1 /* prime_candidate */
PI2FD %MM1, %MM1 /* prime. to float */
MOVQ %MM1, %MM3 /* copy of prime. */
PFMUL %MM0, %MM1 /* 1/divisor * prime */
PF2ID %MM1, %MM1 /* answer to int ...*/
PI2FD %MM1, %MM1 /* ... and back to float */ /* NOTE1 */
PFMUL %MM1, %MM2 /* roundeddown(1/divisor*prime.)*divisor
PFCMPEQ %MM2, %MM3
MOVD %MM3, %eax /* if prime check "good", then != */
It will run perfectly until ecx=9 and edx=3. After "NOTE1", the MM1
should be 3.0, but it seems to get rounded down to 2 on the toint-
tofloat round. It seems the previous calculation before the converting
stays a bit below 3, and then gets rounded down.
3DNow! PFRCP intruction give only approximate result (unlike 387 FPU
the result in _not_ correctly rounded). Worse, according to AMD
documentation PFRCP has only 12 bits of accuracy, so you are limited
to really tiny numbers. So, you probably should add PFRCPIT1 and
PFRCPIT2 instructions, which together should give you 24 bits of
accuracy (however, the documentation I have is pretty unclear what
accuraccy is guaranteed, so I would not trust in the last 1-2 bits).
Given inaccuracy of reciprocal and rounding of multiplicatin you
may get result which is smaller than true quotient. If the true
result is an integer and rounded result is smaller then rounding
down give you wrong result. To avoid this error you must make
sure that approximate result is _always_ bigger than true result.
Simple way to to multiply approximate a/b by a number slightly
bigger than 1, say 1+2^(-20) (this should be enough if the
approximate a/b has 21 bits of accuracy). Of course, there is
a danger that real a/b is smaller than an integer, but
after biasing up may get bigger and you may get the result
which is bigger than correct result. However, this can not
happen if the numbers are small enough. Namely, if a and b
are positive integer and a/b in not an integer, than distance
from a/b to the nearest integer is at least 1/b. As long as
absolute error of the calculation is smaller than 1/b there
is no risk that after rounding you get number bigger than
true result. Error bounds for floating point computations
are relative, so you really want relative error smaller than
1/a. Which means that you are limited to 18-19 bit a.
If you want bigger numbers than it is hard to beat IDIV.
One may try the same trick used by PFRCP, PFRCPIT1, PFRCPIT2 --
first get rough approximantion (say, via table lookup) and
refine using Newton iteration. But it is not clear if
table lookup and several aritmetic instructions needed
for Newton iteration will be faster than IDIV...
--
Waldek Hebisch
hebi...@xxxxxxxxxxxxxxxx
Blah, our posts are falling out of sync with eachother.
Sorry Sebastian, I mixed up: the gas doc. says:
"Currently, as does not support Intel's floating point SIMD, Katmai
(KNI)."
->
SSE works, but the examples I tried didn't work for some reason, after
I had "converted" them from Intel to AT&T. Enough about this :)
Waldek:
By prime-counter I meant trial division with every previously found
prime. (I have looked into number theory as a personal interest ).
Thanks for the detailed info on the workaround, but I am more
staggered to find that the 3DNow! actually IS imprecise :/ So it is
basically useless, because as I said, it generates ~500 primes more
for 1000,000 than it should. (I previously wrote 10 mil, sorry).
The solution you offered is very nice, but I was hoping the solution
to be without limits, though I appreciate your effort. It just looks
like I have to dump 3DNow! altogether. (I'm going to get myself a new
IntC2D laptop, so I will be able to use SSE upto SSSE3).
.
- Follow-Ups:
- Re: AMD's 3DNow! imprecise
- From: Sebastian Biallas
- Re: AMD's 3DNow! imprecise
- References:
- AMD's 3DNow! imprecise
- From: kuratkull@xxxxxxxxx
- Re: AMD's 3DNow! imprecise
- From: Waldek Hebisch
- AMD's 3DNow! imprecise
- Prev by Date: Re: AMD's 3DNow! imprecise
- Next by Date: Re: AMD's 3DNow! imprecise
- Previous by thread: Re: AMD's 3DNow! imprecise
- Next by thread: Re: AMD's 3DNow! imprecise
- Index(es):
Relevant Pages
|