Re: from elsewhere, an assembler




Wannabee wrote

Well. I find it hard to trust the manuals.
It reports :

              Intel                    AMD
         Latency  Throughput     Latency  Throughput
mul      10       1              3        -

I measure 11

Nothing wrong here; an off-by-one may come from RDTSC itself,
but in the examples below you added two 'push' instructions
within the test.


div 66-80 30 39 -

I measure 44

fild -- -- 16 -

FILD converts an integer into an 80-bit FP value

see below

fdiv 32 32 18/22/26? 6

I measure _16_ actually
if done like this

TestCode:
push edi
mov D$FPU_Mem32 0-1
fild D$FPU_Mem32
CPUID | rdtsc | push eax edx
mov D$FPU_Mem32 1000
fdiv D$FPU_Mem32
rdtsc | pop ecx ebx
sub eax ebx
sbb edx ecx ; not needed
int 3
pop edi
ret


but like this:

TestCode:
push edi
CPUID | rdtsc | push eax edx
mov D$FPU_Mem32 12345678
fild D$FPU_Mem32
mov D$FPU_Mem32 1000
fdiv D$FPU_Mem32
rdtsc | pop ecx ebx
sub eax ebx
sbb edx ecx ; not needed ***if NZ yet it's a (more) indicator.
int 3
pop edi
ret

now I measure 1800++ cycles! repeatable ! And it varies a lot! I saw even
6000 once.

Here you are again measuring windoze noise.
As said in the previous post: "don't CALL" tests,
the call itself may trigger a stack page fault (when crossing a 4KB boundary)
or at least one or more cache-burst penalties.

Same for the two 'push eax edx'

You modified my test pattern with this ...

What's that about?

To make sure your test crosses neither page nor cache-line boundaries,
you can use ALIGN 4096.

And why can't I find the specs for FILD in the Intel docs?

FPU timing depends on too many things, so you won't find a single
figure for it; find more details in the optimisation guides.

The only way to benchmark a code snippet correctly is:
* disable all IRQs ('cli/sti') during the test (needs master Admin on NT/XP)
* run it at least twice; the first run may fill the code and stack caches.
* test ALIGNed to avoid page faults and additional cache-burst penalties
* never run a test loop, as you then just measure windoze background
activities (for sure more cycles than your code under test..).
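
The "run it at least twice" point can be sketched in a higher-level language. This is only a rough illustration in Python, not the RDTSC harness used in this thread: `time.perf_counter_ns` stands in for the cycle counter, and `min_timing` and `work` are hypothetical names. Each run is timed on its own (no loop inside the timed region), the first cache-warming run is thrown away, and the minimum of the rest is kept, on the assumption that background activity can only add time.

```python
import time

def min_timing(func, runs=5):
    """Time func several times; drop the first run, return the fastest of the rest (ns)."""
    if runs < 2:
        raise ValueError("need at least two runs: the first only warms the caches")
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        func()  # only this single call sits between the two timer reads
        samples.append(time.perf_counter_ns() - start)
    # The first sample pays for cold code/data caches, so it is discarded.
    return min(samples[1:])

def work():
    # Hypothetical stand-in for the code under test.
    return sum(i * i for i in range(1000))

best = min_timing(work)
```

The minimum is used rather than the average because a single interrupted run would otherwise dominate the result.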

What do you measure for this code, running under KeSys?

DIV clock cycles depend heavily on the operand values ...
a DIV by 1 or by 2^n with matching figures is 'fast' (about 25 cycles),
while any odd divisor or a nonzero div/mod result may need up to 42 cycles
on K7.

AMD K7 reported latency:

DIV 24 to 40 +LD
FILD 4 + LD
FDIV 16/20/24 +LD (depends on precision/round setting)

And that's exactly what I see when single stepping it on HEXEDIT (K7)
Your XP (=K8 ?) FPU should/may be a bit faster at least for throughput.

I haven't mounted KESYS tools on AMD64 yet, I like to go long mode
there for all routines anyway, 16 registers just offer more ...


Now, in every test I do involving some useful FP operations, I get _very_
high cycle counts. But code that performs _a lot_ of them, so many that
adding up the test result of each should yield hundreds of thousands, if not
millions, of cycles, actually turns out to run in very "few" cycles on
average (1000-1500) when I perform the many operations many times and divide.
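
One possible reading of this gap, offered only as an assumption, is that every individual measurement carries a fixed overhead (serialising, cache misses, OS noise) that is paid once per measurement rather than once per operation. The sketch below uses purely illustrative numbers: 40 operations and an 1800-cycle overhead echo figures mentioned in this thread, while the 20-cycle operation cost and both function names are hypothetical.

```python
def summed_single_measurements(n_ops, op_cycles, overhead_cycles):
    """Total obtained by measuring each operation alone and adding the results."""
    return n_ops * (op_cycles + overhead_cycles)

def averaged_bulk_measurement(n_ops, op_cycles, overhead_cycles):
    """Per-operation figure obtained by measuring all operations at once and dividing."""
    return (n_ops * op_cycles + overhead_cycles) / n_ops

# Illustrative values only, not measurements.
summed = summed_single_measurements(40, 20, 1800)
per_op = averaged_bulk_measurement(40, 20, 1800)
print(summed)  # 72800
print(per_op)  # 65.0
```

Under this model the overhead is counted 40 times in the first figure but only once in the second, which would be enough to produce a large discrepancy without either number being "wrong".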

You can't perform a divide in the FPU unless you load something there
first.

Sure. Even I see this as a disadvantage ...

This I measure at 21 cycles....


TestCode:
push edi
mov D$FPU_Mem32 12345678
fild D$FPU_Mem32
CPUID | rdtsc | push eax edx
mov D$FPU_Mem32 1000
fdiv D$FPU_Mem32
fdiv D$FPU_Mem32
fdiv D$FPU_Mem32
fdiv D$FPU_Mem32
fdiv D$FPU_Mem32
fdiv D$FPU_Mem32
fdiv D$FPU_Mem32
fdiv D$FPU_Mem32
fdiv D$FPU_Mem32
fdiv D$FPU_Mem32
rdtsc | pop ecx ebx
sub eax ebx
sbb edx ecx
int 3
pop edi
ret

My CPU is an AMDXP. I find that none of my tests gives anything
conclusive that is actually useful.

I haven't got an AMDxp, so if you measure just 21 cycles,
I'd assume the FDIV has a higher throughput than on K7.
My minimum estimate would be at least 25+3 cycles.
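
As a rough cross-check, assuming the ten FDIVs form a dependent chain through st0 (each one divides the previous result), the total should not be less than ten times the latency. The sketch below applies the K7 FDIV latencies quoted in this post, leaving out the +LD part; the function name is hypothetical.

```python
def dependent_chain_cycles(n_ops, latency_cycles):
    """Lower bound for n_ops back-to-back operations that each need the previous result."""
    return n_ops * latency_cycles

# FDIV latencies as quoted in this post for K7 (the +LD load part is ignored here).
K7_FDIV_LATENCIES = (16, 20, 24)

bounds = [dependent_chain_cycles(10, lat) for lat in K7_FDIV_LATENCIES]
print(bounds)  # [160, 200, 240]
```

If that assumption holds, a reading of 21 for all ten would suggest the second RDTSC was executed before the divides had finished, since RDTSC is not a serialising instruction. That is an inference, not something measured here.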


Reality does not care about the Intel or AMD manual.
I have 1000-1500 cycles per particle, for code that does _A LOT_ of FP
operations:
40+ FP instructions per particle, and it also does a lot of integer
operations, plus drawing, plus plus plus.

(I think that is way too much, and that the speed of my particles sucks, but
I have not cared that much, because it still doesn't come close to the penalty
for using GDI; plus, I have other things I want to focus on now.)

What are your thoughts on this cycle counting?
I mean, to what extent can you actually trust these figures?
And how can you claim that your function runs in 45 cycles, when
I am 100% unable to measure it at such?

To compare two pieces of code, the test environment and test address
should be equal; otherwise we may just measure unwanted OS responses
and additional cache-burst penalties.

Testcode:
push edi
CPUID | rdtsc
*** | push eax edx
mov eax 0-1
WolfGang_BinToAscci
rdtsc
*** | pop ecx ebx
sub eax ebx
sbb edx ecx
int 3
pop edi
ret

This code inlines your function. And 308 cycles is the lowest I can find.

Testcode:
push edi
CPUID | rdtsc
*** | push eax edx***
mov eax 0-1
Betov_Hex
rdtsc
*** | pop ecx ebx
sub eax ebx
sbb edx ecx
int 3
pop edi
ret

this is 129-130 here.

***) you modified my test with the push eax edx !
this again may cause the same as a CALL.

Both measurements were made running at realtime priority in usermode,
with everything else closed down, including explorer.

Try to run both with ALIGN 4096,
or try commenting out the other while testing one.

__
wolfgang


