Re: Optimize
From: wolfgang kern (nowhere_at_nevernet.at)
Date: 06/06/04
- Previous message: The Wannabee: "Re: Optimize"
- In reply to: The Wannabee: "Re: Optimize"
- Next in thread: The Wannabee: "Re: Optimize"
- Reply: The Wannabee: "Re: Optimize"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Sun, 6 Jun 2004 23:48:11 +0200
"The Wannabee" wrote:
| I am too eager to post.... I like to click buttons. :-))
:)
[example..]
| This is like my previous solution, the one I used before yesterday.
| This one :
| [StringManager_GetPcharLengthOLD
| mov edi #1 | mov eax 0 | mov ecx 0_FFFF_FFFF | repnz Scasb | not
| ecx | dec ecx |mov #2 ecx]
| Shouldnt you dec ecx to get correct count?
Yes, if you want to strip off the zero.
No, if you use it like me to get 'the next string ptr' in addition.
| It is less then half the speed of the one you give below. Tested on
| 100_000 iterations of a string of 0_4000 bytes.
| But in reality, because of various stuff like cache ect,
| the results are propably nullified, most of the time,
| and just as betov says, pretty decent considering that most
| strings are much shorter.
16 Kb plain text-strings are to be considered as wrong design here :)
[other example..]
| Ok. Will test this one later.
I'd expect it may be a bit faster on short strings (<128).
| |Your version can be faster with (13 cycles for 4 bytes):
|
| |mov ebx eax ;dw aligned strings are faster
| |L0: ;speed gain by code alignment depends ...
| |mov edx D$ebx
| |cmp dl 0
| |jz L1>
| |cmp dh 0
| |jz L2>
| |test edx 0ff_0000
| |jz L4>
| |test edx 0_ff00_0000
| |jz L3>
| |add ebx 4
| |jmp L0<
| |L4:
| |inc ebx
| |L3:
| |inc ebx
| |L2:
| |inc ebx
| |L1:
| |sub ebx eax
| ok. And we could count add bl 4 if we settle for < 65536 strings ?
No, (not even add bx,4) as the string address would not be aligned
with 01_0000, So it would miss carry-overs.
| :-)). Ok. Thanks. I will try that one too. The following is what I ended
| up with yesterday after posting. Because I needed to be able to spesify
| the registers and make it work with other allready written code. Its FAT
| code, but it runs even sligtly better than the test dl, dl code. They are |
virtualy identical in speed. The align 4 directive, gains me some 10%
| speedup, that is not important(irl), but present in the tests.
Align 64 would be best, but a waste of space too often.
| [StringManager_GetPcharLength
| push #1
| mov #2 #1
| ALIGN 4
| L0:
| MOV edi D$#2
| TEST edi 0_FF | JZ L1>
| TEST edi 0_FF00 | JZ L2>
| TEST edi 0_FF0000 | JZ L3>
| TEST edi 0_FF000000 | JZ L4>
| add #2 4
| JMP L0<
| L4: INC #2
| L3: INC #2
| L2: INC #2
| L1: pop edi
| SUB #2 edi]
| |I think there is no fastest solution,
| |it depends on string-size/cache-bounds ratio.
| |Therefore I use the shortest code, this helps to speed up
| |the surrounding code.
| Oki. The cache is very important to think about?
| Lets assume I have a cache of 64K.
| If I need to perform some operation, on a chunck of 128K
| data, would it be best to split up the data,
| and "tough" the first part of the memory,
| then do stuff, then touch the last part, and then do more stuff ?
| How much is read into the cache each memory access ? 16 bytes ?
| Would it be enough to touch each 16th bytes in the 64K memory area
| to have the 64K moved into the cache ?
| How fast is a register to memory move, when the memory is cached?
| Is it comparable to reg reg move ?
| Is the cache sorted ? Or must the CPU go search for the data ?
| Is this search allways linear time or what?
| How can we test for the chache size ?
| If you want me to read it myself. I go back to beeing OT ;Ø))
Many questions.. I try without reading the docs again yet.
Far from beeing a detailed answer (would need a whole book):
Cache-line size is fixed, my AMD got 64 bytes per line.
Code prefetch queu is limited by instruction count, type and size,
ie: 15 NOPs = one clock-cycle,
predicted branches = zero cycles
up to three SUB/ADD.. in one fetch
... and more.
There are actually two cache-regions:
the fast but tiny (8..64KB) CPU-internal L1-cache, and
the not so fast, external (up to 32MB) L2-cache (a fast RAM section).
Data in L1-cache are operated almost as fast as REG/REG.
L2-lines need external bus and are slower, but faster than the
work-area RAM.
The CPU contain several 'pipes' which may run in parallel, as long
only one of them need to access the external bus at the same time.
It's not a good idea to fill the cache with all your data at once,
first it needs time to do it, and next any memory access including
stack operations and RD/WR local variables may overwrite a cache-line.
So the trick to work fast is have code and data already cached,
this caching is done whenever the CPU reads code or data from the bus.
Writeback to memory needs also the external bus pins.
So a write-action may occure later on the bus without delaying
the next prefetched instructions (if not dependent to anyway).
And while processing/calculating one cached set of data, the next set
may be 'read-ahead' in time by 'touching' the next alignment boundary.
I figured my AMD needs ~32 clock cycles for a cache-line read,
so there is some time to perform some (non-bus) code during it.
The CPU don't need to search for the cached lines, it gotta know it.
Cache-size test? My Bios does it for me, but IIRC there are
some MSR's which can tell about in detail and also let you change
the overall cache-behaviour. But this MSR's are CPU-specific.
Enough for complete confusion? :)
__
wolfgang
- Previous message: The Wannabee: "Re: Optimize"
- In reply to: The Wannabee: "Re: Optimize"
- Next in thread: The Wannabee: "Re: Optimize"
- Reply: The Wannabee: "Re: Optimize"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]