Re: No difference on my machine




Frank Kotler replied:

***______this two add to timing value
push edx
push eax
***______

Certainly. Got a way to avoid it?
The newly used stack space may not be cached yet.
In my experience, variables accessed beforehand [before CPUID/RDTSC]
give more reproducible timing values (perhaps just a paging issue).
When a Linux exe starts, "argc", and the parameters (program name, if
nothing else), and environment variables are on the stack... so it's
apparently been "touched" recently. We *could* hit a "new page"
boundary, but mostly not, it seems. I'm getting quite "regular" timing
using the stack. Haven't (yet) compared with static variables...

So L'unix may behave much better than windoze here.

[...second CPUID]
No, the serialising itself may take 190...+++ cycles,
that's why you may have measured ~380 cycles on an empty test.
Yeah. Apparently 300 cycles on my machine. I have removed it from the
current version, but I remain unconvinced that it's "always okay" to
omit...

If you don't like to measure code duration of CPUID ...

Why in hell Intel didn't make rdtsc "serializing" is a mystery to me.
Useless without it, isn't it?
Not at all; sometimes we want to measure the time for serializing itself,
or e.g. WBINV

I'm not sure what you mean. WBINV is another serializing instruction,
right? My understanding is that without a serializing instruction, an
"out-of-order execution" CPU (which???) can dispatch micro-ops which
come from after the rdtsc in our code, *before* the rdtsc actually
executes. Perhaps I misunderstand the issue. In any case, *this* code,
on *my* machine, seems to work fine without the second cpuid.
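
A minimal sketch, in C, of the "serialize, then read" idea discussed above. It assumes GCC or Clang inline assembly on an x86 target; the helper name tsc_serialized is made up for illustration and is not the code from this thread. Note that the CPUID in front only forces earlier instructions to finish before RDTSC runs; it does not by itself stop later instructions from starting early, which is the case the second CPUID is meant for.

```c
#include <stdint.h>

/* Hypothetical helper (not from the thread): serialize with CPUID,
   then read the time-stamp counter. GCC/Clang inline asm assumed. */
static inline uint64_t tsc_serialized(void)
{
#if defined(__x86_64__) || defined(__i386__)
    uint32_t eax = 0, ebx, ecx = 0, edx;
    uint32_t lo, hi;

    /* CPUID (leaf 0) is a serializing instruction: everything before
       it must complete before execution continues. */
    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                     :
                     : "memory");
    (void)ebx;
    (void)edx;

    /* RDTSC returns the 64-bit counter split across EDX:EAX. */
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
#else
    /* No TSC on this architecture; return 0 so the file still compiles. */
    return 0;
#endif
}
```

Calling it twice around the code under test and subtracting the first value from the second gives the raw delta, which still includes the cost of CPUID and RDTSC themselves.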

From the AMD docs (you see how rarely I use it: I missed one letter :)
WBINVD writes all modified cache lines in the internal caches back...

CPUID,IRET,LGDT,LLDT,LTR,INVD,INVLPG,MOV CRn,MOV DRn,RDMSR,WBINVD and
WRMSR are serializing instructions, while RDPMC and RDTSC aren't.
WBINVD and INVD do not invalidate TLB caches.

rdtsc
***: the RDTSC itself needs time
e.g. 11 cycles on a K7, 13 on K8 and AMD64, P4 ??
Looks like 80, on my machine (if this code's even "right").

I won't believe in 80 cycles for RDTSC alone ... not even together with
the few (3..9) cycles lost for the PUSHes

...find the machine constant in TFM ;)
I prefer doing the experiment to reading TFM. "Both" would be ideal.

Right. It's always better to prove than to rely ...

At which point in these 11 or so cycles do we actually "read" the
count?

I haven't checked, but isn't this a 'don't care' anyway ?
we measure the difference between two TSC reads ... :)

Right. It was sort of a hypothetical question...

:)

*** also SUB/SBB your machine constant for RDTSC
and the time for the two PUSHes as well
Okay... if it *is* constant. I thought of running "duplicate loops"
(shouldn't call 'em "loops"... "duplicate timed sections"), one "empty"
and one with "code to time", and subtracting one result from the other,
to get a "zero based" number.
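
A small sketch, in C, of the arithmetic being discussed: joining EDX:EAX into one 64-bit count, subtracting a fixed overhead (what SUB/SBB does on the two 32-bit halves), and the "empty section versus timed section" subtraction. All three function names are hypothetical, and the overhead value is whatever you determined for your own machine.

```c
#include <assert.h>
#include <stdint.h>

/* Join the two 32-bit halves that RDTSC leaves in EDX:EAX. */
static uint64_t join_edx_eax(uint32_t edx, uint32_t eax)
{
    return ((uint64_t)edx << 32) | eax;
}

/* Hypothetical helper: cycles left after removing the measurement
   overhead (one RDTSC plus the two PUSHes). A single 64-bit subtract
   here corresponds to a SUB/SBB pair on 32-bit registers. */
static uint64_t net_cycles(uint64_t start, uint64_t end, uint64_t overhead)
{
    assert(end >= start);
    return end - start - overhead;
}

/* "Duplicate timed sections": subtract the delta of the empty section
   from the delta of the section containing the code to time. The
   result is signed so a run that lands below the baseline shows up
   as a negative number instead of wrapping around. */
static int64_t zero_based(uint64_t empty_delta, uint64_t code_delta)
{
    return (int64_t)(code_delta - empty_delta);
}
```

The signed result in zero_based is a design choice made for this sketch: it keeps the occasional "less than zero" run visible instead of turning it into a huge unsigned value.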

Well, I didn't do the "measure an empty loop and subtract it", but I now
subtract an equate, determined empirically. This'll have to be adjusted
per machine, of course, but produces zero, exactly, *almost* all the
time. I see (rare, apparently) outliers... by a much smaller amount than
I'd expect. Might be interesting to run the thing in a loop, "sifting
out" the outliers for examination...
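
One possible way to do the "sifting", sketched in C under the assumption that the raw deltas from many runs have already been collected in an array. The struct and function names are made up for illustration; the calibration constant is the empirically determined equate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one run that did not come out as zero. */
struct outlier {
    size_t  run;     /* index of the run in the delta array */
    int64_t excess;  /* raw delta minus the calibration constant */
};

/* Subtract `overhead` from every raw delta and keep only the runs
   whose result is not exactly zero. Returns how many outliers were
   stored in `out` (never more than `max_out`). */
static size_t sift_outliers(const uint64_t *delta, size_t n,
                            uint64_t overhead,
                            struct outlier *out, size_t max_out)
{
    size_t found = 0;

    assert(delta != NULL && out != NULL);
    for (size_t i = 0; i < n && found < max_out; i++) {
        int64_t excess = (int64_t)(delta[i] - overhead);
        if (excess != 0) {
            out[found].run = i;
            out[found].excess = excess;
            found++;
        }
    }
    return found;
}
```

Keeping the run index next to each excess makes it possible to see afterwards whether the outliers cluster (say, at the start) or are spread at random.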

Yes, I also just use an empirically determined value from my debugger to
finally determine and/or interpret the difference in RDTSC values.

I always use the same (4K-aligned, page present) test-field for code
parts timing, so comparisons become more reproducible and reliable.

Now 'edx:eax' should contain zero for an empty test,
Right... unless...

(run it twice, because the first run will imply code fetch)

Which would give us a less-than-zero result for an empty test. Okay,
run it three times...
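
A sketch, in C, of "run it several times and ignore the first": skip a number of warm-up runs and take the smallest remaining delta. Taking the minimum is an assumption made for this sketch, not something settled in this thread, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: ignore the first `warmup` runs (they include
   code fetch and cold-cache effects) and return the smallest delta
   among the remaining runs. */
static uint64_t best_after_warmup(const uint64_t *delta, size_t n,
                                  size_t warmup)
{
    assert(delta != NULL && n > warmup);

    uint64_t best = delta[warmup];
    for (size_t i = warmup + 1; i < n; i++) {
        if (delta[i] < best)
            best = delta[i];
    }
    return best;
}
```

Using this value as the calibration constant, instead of the first run's delta, avoids the less-than-zero result on later runs.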

a deviation of +/-1 cannot be avoided due to the micro-steps,

This brings us back to a question that was raised here some time ago:
"Have you ever seen an odd value from rdtsc?" (I have not, I don't
think).

Yes, I've seen all possible values.

Must have taken a long time. :)

OK, you got me. So let me say it again: I've seen variations in timing
within a range of +/-1 to several bounds to the above :)

Still all even on my P4. (because it dispatches two micro-ops at a time,
I think we concluded, earlier...)

Cannot confirm, because I haven't bought Intel CPUs since the first AMD K7
was available. And this wasn't only a matter of price/cost ...

to not measure background noise (IRQ actions) I'd disable
interrupts for the whole test.
Yeah, perhaps Windows will allow you that option... or allow you to
believe you have that option. :) Probably KESYS... But Linux, no - not
from userland. Really should be doing this on "bare metal"...
In this case a rough estimate of the code duration may help to
figure out whether it was delayed by an interrupt or not.

Right. As I mentioned, the outliers I've *seen* are unexpectedly small.
My "zero calibration" reported 28 once, I think... (make it twice...
28???)

No. Subtract the 'overhead' only once per measurement ...
That means one RDTSC and the two PUSHes are all you need to subtract from
the final difference in RDTSC.

Don't know if your environment has a single 'RUN-until breakpoint' key
like KESYS-Hexedit and RosAsm have in their integrated debuggers.
There it's easy to run the same piece of code as often as we hit a key
and immediately see all regs and the time variables.

The command line is my environment. "Run until breakpoint" is available,
of course. I'm not aware of anything that displays any "time variables".
Do Hexedit/RosAsm do this? I would be *very* suspicious of any "timing
code" in this context!


So at least you can do the same ...
the KESYS debugger reports a delta TSC, while RosAsm needs a predefined
variable to show it as of immediate.


We should say, "lest the newbies be misled", that it's pretty pointless
to be doing it at all! The performance of an instruction or sequence of
instructions in *this* particular context gives us almost *no*
information about how it'll perform in its "native habitat". We just
wanna see what we *do* see.

Sure, with a "guess where the code will be at runtime"-OS
it's hard to check on true speed in advance :)

Need a crystal ball to do it "in advance". :) 0x08048076 for the first
cpuid. 0x0804807C for a single instruction in the "timing slot". Doesn't
tell us anything about physical address, paging, caching, etc., but I
know "where it is" at runtime.

There are predefined formulas for AMD (perhaps Intel's too)
to calculate exactly the timing of every single instruction before it is
executed. BUT this is an awful calculation and may still depend on
previously pending or already cached instructions.


But in general the RDTSC method helps a lot in evaluating code parts
and in comparing algos.
RDTSC is accurate in itself, but the OS may spoil the measurement,
and compilers the performance, by placing the code somewhere else.
Not to forget paging and cache issues, which shouldn't be involved
in an RDTSC comparison.

Right. Seems not to be an issue with *this* code, on *my* machine...

Yes, if you can make sure that any surrounding environment isn't part
of the measured values, everything may work fine.

Santosh asked if rdtsc values would be more accurate. My impression is
that rdtsc is "flukey". Herbert posted that example that showed rdtsc
"deltas" going down, the more "nop"s we added... I'll have to dig that
up and look at it again - I don't think I've tried it on my current
hardware.

Oh yeah, a few inserted NOPs may sometimes speed up the whole thing,
I think because NOPs can free CPU resources, you know there is more
than just one EAX on the chip :)

I found that post - haven't played with it yet. I remembered wrong - an
increasing number of "mov"s runs faster. Same explanation may apply. We
write our instructions, and the CPU does whatever the hell it wants.

My results so far (really haven't played with it too much) indicate
that two xor's take no cycles at all, but three take four cycles.
Ooookay...

Latency and throughput ... depends on what's busy or free.

For "real world" purposes, "gettimeofday" (or equivalent) may be a more
"meaningful" measure...

I hope my code parts never need 'Seconds' to perform :)

Crank up the loop count. They will. Of course, this *guarantees* that
we'll be interrupted.

call u64toda
[...]
is there no function in L'unix which can display a 64-bit unsigned
integer value as decimal ?

Depends on what you call a "function in L'unix". A system call? ***,
no! Of course "printf" is sitting in memory, waiting for us to call it.
I thought you'd "approve" of getting along without printf! :)
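
For readers wondering what such a no-library routine involves, here is a sketch in C of converting a 64-bit unsigned value to decimal text by repeated division by 10. It is a hypothetical stand-in, not the u64toda routine called above (whose source is not shown here); the name u64_to_dec and the buffer convention are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a routine like u64toda. Writes the decimal
   digits of `v` followed by a terminating NUL into `buf`, which must
   hold at least 21 bytes, and returns the number of digits written. */
static size_t u64_to_dec(uint64_t v, char *buf)
{
    char tmp[20];  /* 2^64 - 1 has 20 decimal digits */
    size_t n = 0;

    assert(buf != NULL);

    /* Peel off digits from the least significant end. */
    do {
        tmp[n++] = (char)('0' + v % 10);
        v /= 10;
    } while (v != 0);

    /* They came out in reverse order, so copy them back to front. */
    for (size_t i = 0; i < n; i++)
        buf[i] = tmp[n - 1 - i];
    buf[n] = '\0';
    return n;
}
```

The do/while form makes sure that zero still produces the single digit "0".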

Yes, I really like NoLib-solutions best ;)
Oh, I remember fprint/printf from PowerBasic times ...

Some people think well of PowerBasic. Never tried it. But I agree that I
like the "nolib" solutions better. Why let the library have all the fun?

KESYS has system calls for all supported variable types. My idea
to emulate Linux one day is still alive, but I have no clue yet how
to detect and translate library calls into KESYS system functions;
it looks like a horrible job with all the LIB variants around.

A "Linux subsystem" would be cool, but I doubt it's "worthwhile". I
suspect there are enough fundamental differences between KESYS and Linux
so that you'd either wind up with a poor emulation of Linux, or butcher
KESYS beyond recognition to squeeze it in.

The problem is just to determine and analyse (automatically) any desired
function and reduce it to the formal needs; still a nightmare, but possible.

...
Mmh, a new nick to expect? "FK44x86" ?

FBK44x86. Since my dad and grandfather had the same first name, the
middle initial is an important part of my "identity". :)

Oh, now I remember this,
Good Night FrankBoy!

or something ... :)
__
wolfgang
[copied code away..]




