Re: Optimize

From: The Wannabee (faq_at_.@.@SZMyggenPV.com)
Date: 06/07/04


Date: Mon, 07 Jun 2004 14:13:14 +0200

On Sun, 6 Jun 2004 23:48:11 +0200, wolfgang kern
<nowhere@nevernet.at> wrote:

>
> "The Wannabee" wrote:

First, I'd like to say thanks for such a nice explanation. IMO, you are
much cooler to read in your more verbose form :-))

> Yes, if you want to strip off the zero.
> No, if you use it like me to get 'the next string ptr' in addition.

Yes. Oki. Got it.

> 16 Kb plain text-strings are to be considered as wrong design here :)

Hey, I need 3 rewrites just to stop limping.

> | ok. And we could count add bl 4 if we settle for < 65536 strings ?
>
> No, (not even add bx,4) as the string address would not be aligned
> with 01_0000, So it would miss carry-overs.

OK. I meant bx, or 256. Oki, understood. This would have caused a nasty
bug if not done carefully? I remember PChar bugs were some of the nastiest
bugs to track down, especially on Win98 and earlier OSes.
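
To make the missed carry-over concrete, here is a minimal C sketch that mimics `add bx,4` versus `add ebx,4` on a 32-bit pointer value. It is only an illustration, not code from the thread; the function names and the sample address are made up.

```c
#include <stdint.h>

/* Mimics `add bx,n`: only the low 16 bits of the pointer are updated,
   so a carry out of bit 15 is silently dropped. */
static uint32_t add_low16(uint32_t ptr, uint16_t n)
{
    uint16_t low = (uint16_t)((ptr & 0xFFFFu) + n);   /* wraps at 64 KB */
    return (ptr & 0xFFFF0000u) | low;
}

/* Mimics `add ebx,n`: the whole 32-bit pointer is updated. */
static uint32_t add_full32(uint32_t ptr, uint32_t n)
{
    return ptr + n;
}
```

For a pointer just below a 64 KB boundary, such as the hypothetical 0x0001FFFE, the 16-bit add wraps back into the same 64 KB block while the full add moves on to the next one. The two only agree as long as the strings never cross such a boundary.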

> Align 64 would be best, but a waste of space too often.

After reading your whole post: is this because of those cache lines?

> Many questions.. I try without reading the docs again yet.

I now have more :-)

> Far from being a detailed answer (would need a whole book):

Frightening :-Ø)

> Cache-line size is fixed, my AMD got 64 bytes per line.
> Code prefetch queue is limited by instruction count, type and size,
> ie: 15 NOPs = one clock-cycle,
> predicted branches = zero cycles
> up to three SUB/ADD.. in one fetch
> ... and more.

My first question: you say cache line_s_. I guess this means that the
cache is divided into cache lines:

  cpu <-- | | cache | | | | <----- bus ----- |

(| = cache line?) Or is "cache line" a term for the separate lines for
code and data?

I don't understand why you say 64-byte cache lines, but 15 NOPs take one
cycle. Why don't 63 NOPs take one cycle? Is it possible to write code that
tests for the cache size and line size? I mean, code that verifies that the
sizes are correctly given in the BIOS? If it is BIOS info, then it will
also be available in Windows. I will go look for it.
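
One way to ask the CPU itself rather than the BIOS is the CPUID instruction. On AMD processors, extended leaf 0x80000005 returns the L1 data cache description in ECX. The sketch below only decodes such a register value; the struct and function names are made up for illustration, and actually executing CPUID (for example through a compiler helper such as GCC's `__get_cpuid` from `<cpuid.h>`) is left out so the block stays portable.

```c
#include <stdint.h>

/* Fields of ECX after CPUID leaf 0x80000005 on AMD processors
   (L1 data cache descriptor). Names are illustrative only. */
struct l1d_info {
    unsigned size_kb;        /* bits 31..24: cache size in KB          */
    unsigned associativity;  /* bits 23..16: raw associativity field   */
    unsigned lines_per_tag;  /* bits 15..8:  lines per tag             */
    unsigned line_size;      /* bits 7..0:   line size in bytes        */
};

static struct l1d_info decode_l1d(uint32_t ecx)
{
    struct l1d_info info;
    info.size_kb       = (ecx >> 24) & 0xFFu;
    info.associativity = (ecx >> 16) & 0xFFu;
    info.lines_per_tag = (ecx >> 8)  & 0xFFu;
    info.line_size     =  ecx        & 0xFFu;
    return info;
}
```

A register value of 0x40020140 would decode to a 64 KB cache with 64-byte lines, which matches the line size quoted above for the AMD. Other vendors report cache details through different CPUID leaves, so this is not a general solution.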

> There are actually two cache-regions:
> the fast but tiny (8..64KB) CPU-internal L1-cache, and
> the not so fast, external (up to 32MB) L2-cache (a fast RAM section).
> Data in L1-cache are operated almost as fast as REG/REG.
> L2-lines need external bus and are slower, but faster than the
> work-area RAM.

Oki. This is good. Does this mean that, if I know code to be in cache, I
have a sort of "extra registers" to help me? If they are almost equivalent
in speed, then variables can be used as a sort of cache registers in
special code, and this makes it easier to do proper pairing when one needs
it?

> The CPU contain several 'pipes' which may run in parallel, as long
> only one of them need to access the external bus at the same time.

Oki. I know this. But only certain mnemonics (instructions) can "pair"?

> It's not a good idea to fill the cache with all your data at once,
> first it needs time to do it, and next any memory access including
> stack operations and RD/WR local variables may overwrite a cache-line.

Hmm. Let's assume I don't use the stack in code meant to be fast. Can a RD
from a memory location that has been cached destroy cache lines?

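Look at this loop:

L0: ; RefreshCache (in this case it also means making sure the RAM is
    ; not paged out to disk on a paging OS)
mov eax LowerMemoryBound        ; eax = start of the region
mov ebx UpperMemoryBound        ; ebx = end of the region
while eax < ebx
    mov ecx D$eax               ; read one dword; the value is discarded
    add eax SizeOf_Record       ; step to the next record
End_While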

Let's assume that the memory between eax and ebx is 64KB. When the code
starts running, how does the CPU proceed with reading the memory into the
cache? Let's assume the macro is expanded so the jump is usually NOT taken
(I read somewhere that the CPU assumes not taken), so the mov instruction
will be cached?
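
As a rough model of what such a read pass asks of the cache, the following C sketch counts how many distinct 64-byte lines a strided walk touches. The 64-byte line size is the figure quoted for the AMD; the function name and the record sizes in the note are made up. It is a counting model only, not a measurement, and it ignores any hardware prefetching.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 64u   /* bytes per cache line, as quoted for the AMD */

/* Walks [lower, upper) in steps of record_size, like the loop in the
   post, and counts how many different cache lines the reads start in.
   Addresses are visited in ascending order, so comparing against the
   previously seen line number is enough to detect a new line. */
static size_t lines_touched(uintptr_t lower, uintptr_t upper,
                            size_t record_size)
{
    size_t count = 0;
    uintptr_t last_line = (uintptr_t)-1;    /* no line seen yet */

    assert(record_size > 0);                /* otherwise the walk never ends */
    for (uintptr_t addr = lower; addr < upper; addr += record_size) {
        uintptr_t line = addr / LINE_SIZE;
        if (line != last_line) {
            count++;
            last_line = line;
        }
    }
    return count;
}
```

For a 64 KB region starting at a line boundary, any record size up to 64 bytes touches all 1024 lines, while 128-byte records touch only every second line (512 of them). So the record size decides how much of the region a pass like this can bring in.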
>
> So the trick to work fast is have code and data already cached,
> this caching is done whenever the CPU reads code or data from the bus.

Oki. But how much is cached each time? Are you saying that this is done
transparently to the code, and that it's no use trying to help the CPU
with this? This is confusing to me.

> Writeback to memory needs also the external bus pins.
> So a write-action may occur later on the bus without delaying
> the next prefetched instructions (if not dependent to anyway).

I see. So the code above will never leave the cache, as it does not need
to?
If we _wrote_ ECX to D$eax, how many writes would be safe in cache,
assuming a 64K L1?
How much would be safe if L2 is 256K and we allow L2?

I assume that some of the data is code, so the 64K in L1 is used for other
things than my data as well?

> And while processing/calculating one cached set of data, the next set
> may be 'read-ahead' in time by 'touching' the next alignment boundary.

This was what I thought above. But didn't you actually say it was not so?
"> It's not a good idea to fill the cache with all your data at once,"
What I meant was not "all data", but only enough to fill a certain portion
(how much?) of the cache.

> I figured my AMD needs ~32 clock cycles for a cache-line read,
> so there is some time to perform some (non-bus) code during it.
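
Taking the figure of about 32 clock cycles per cache-line read at face value, a back-of-the-envelope estimate of the cost of pulling a region into the cache can be sketched as below. It assumes the lines are read one after another with no overlap and no prefetching, which is a simplification; the function name is made up.

```c
#include <assert.h>

/* Rough cost, in clock cycles, of reading `bytes` of memory into the
   cache one line at a time. A partly used last line still costs a
   full line read, hence the rounding up. */
static unsigned long fill_cycles(unsigned long bytes,
                                 unsigned long line_size,
                                 unsigned long cycles_per_line)
{
    assert(line_size > 0);
    unsigned long lines = (bytes + line_size - 1) / line_size;
    return lines * cycles_per_line;
}
```

With 64-byte lines and 32 cycles per line, a 64 KB region comes to 1024 lines and roughly 32768 cycles under these assumptions. That is the time budget the quoted advice suggests filling with work that does not need the bus.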

Oki. This seems very advanced. How do you know when you can safely assume
a cache read takes place? Not to mention all the thinking about data
structures to allow it. I see there's a lot for me to learn about, which is
good. I now use L0> instead of my long label jumps when I can, but I must
rewrite 500+KB of code to do this throughout. I should have read more
beforehand. I always took those jmp @LoadNextData to be automatically
short. I misunderstood :-(

> The CPU doesn't need to search for the cached lines, it gotta know it.

But how can it? How does it map a read from a memory location, like mov eax
D$ebx? Is the L1 cache memory in reality lots of hidden registers, or
cache indices, so that when mov eax D$ebx runs, it will look to the
processor like mov ebx CacheMemoryLocationXY? And as such be almost like
a register?
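
In a typical set-associative cache the hardware does not search: it cuts the address into fields. The low bits select the byte within a line, the middle bits select one set, and the remaining high bits (the tag) are compared against the tags stored in that set. A match is a hit. The C sketch below shows only the address split. The geometry (64 KB, 2-way, 64-byte lines) is an assumption chosen to match the sizes mentioned in this thread, and all names are made up.

```c
#include <stdint.h>

#define LINE_SIZE   64u                               /* bytes per line      */
#define NUM_WAYS    2u                                /* assumed 2-way       */
#define CACHE_SIZE  (64u * 1024u)                     /* assumed 64 KB L1    */
#define NUM_SETS    (CACHE_SIZE / (LINE_SIZE * NUM_WAYS))   /* = 512 sets    */

struct cache_addr {
    uint32_t tag;     /* compared against the tags stored in the set */
    uint32_t set;     /* which set the line must live in              */
    uint32_t offset;  /* byte position inside the 64-byte line        */
};

static struct cache_addr split_address(uint32_t addr)
{
    struct cache_addr a;
    a.offset = addr % LINE_SIZE;
    a.set    = (addr / LINE_SIZE) % NUM_SETS;
    a.tag    = addr / (LINE_SIZE * NUM_SETS);
    return a;
}
```

For the hypothetical address 0x00012345 this gives tag 2, set 141 and offset 5. The read only has to look at the two lines of set 141, which is why a hit can be almost as fast as a register access even though the program still names an ordinary memory address.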

> Cache-size test? My Bios does it for me, but IIRC there are
> some MSR's which can tell about in detail and also let you change
> the overall cache-behaviour. But this MSR's are CPU-specific.

OK. Many thanks for the good explanation.

> Enough for complete confusion? :)

:-) Yes. Enough for experiments, when I get around to it. I need to learn
to limp first. Having info like this from you is 1000 times better for
me than books.

> __
> wolfgang
>
>

-- 
Sent with M2, Opera's revolutionary e-mail client: http://www.opera.com/

