Re: 64-bit shared counter
From: Matt Taylor (para_at_tampabay.rr.com)
Date: 03/30/04
- Previous message: Matt Taylor: "Re: Can you use SSE2 in DOS?"
- In reply to: Terje Mathisen: "Re: 64-bit shared counter"
- Next in thread: Terje Mathisen: "Re: 64-bit shared counter"
- Reply: Terje Mathisen: "Re: 64-bit shared counter"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Tue, 30 Mar 2004 03:22:54 +0000 (UTC)
"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
news:c41lkr$so0$1@osl016lin.hda.hydro.com...
> Matt Taylor wrote:
>
> > "Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
> >>OK, in that case my code is definitely significantly faster, even if we
> >>allow a pure 2X speedup from 2.8 vs 1.4 GHz (definitely not the case
> >>here, with RAM speed being the limiter), it seems to run 7 times faster
> >>than the various versions where the core code handles the carry
> >
> > propagation.
> > <snip>
> >
> > I let it run for a couple days and then stopped. There was a huge
disparity
> > between run times after I inlined the functions, so it seems the
function
> > call overhead was more significant than I had at first thought. With
that in
> > mind, I dropped the run time to 2^36 locks and added your watchdog
routine.
> >
> > I have placed the C++ and asm versions here:
> > http://rabbithole.cc/ctr.cpp
> > http://rabbithole.cc/ctr.asm
> >
> > My results show your watchdog routine consistently being >50% slower
than
> > the other versions. The 5th is the fastest. I don't understand why.
>
> I do!
>
> Your version of my code is wrong, as well as slow:
>
> UINT64 __forceinline InterlockedIncrement64_1(volatile UINT64 *pn)
> {
> _asm
> {
> mov ecx, [pn]
> mov eax, 1
> lock xadd [ecx], eax
> mov edx, fs:[g_pvHigh]
> cmp eax, fs:[g_pvLow]
> adc edx, 0
> mov fs:[g_pvLow], eax
> mov fs:[g_pvHigh], edx
> }
> }
>
> Firstly, there is no need at all for the fs: segment overrides, the
> shared 64-bit variable can reside anywhere in memory as long as all the
> incrementing threads/processes can _read_ it!
>
> Secondly, you are _writing back_ the updated results on every iteration,
> which is both horribly slow (think ping-pong of the cache line), and
> simply wrong:
Each thread has a unique fs segment pointing to the TEB structure which
contains TLS. This shouldn't produce the ping-pong effect you describe
except when threads are moved between processors--relatively infrequently. I
wondered about the fs override causing decoder delays, but I doubt they
would be that great. Certainly the run time should be less than the other
routines which are both branchy and lengthy.
> This code depends critically on having a single watchdog thread that is
> responsible for all actual updates, and which does them by copying the
> updated 64-bit count to a new memory location, then doing an atomic
> write to the 32-bit pointer to the new location.
>
> Since the watchdog is read-only on every access until the wraparound
> happens, it can work in parallel with the incrementers with low to zero
> impact. On an update, it is the only writer to the 64-bit copies, so it
> will never suffer locking delays caused by the incrementer threads.
>
> With every incrementer writing back, you have a race condition between
> loading te previous 64-bit value, doing the compare (propagating any
> carry), and writing back the updated result.
> >
> > The latest results are:
> > Routine 1: 12698 seconds (Terje's)
>
> 2^36 is 16 times 2^32, taking my 30-38 seconds per epoch as a starting
> point, I'd get about 600 seconds or 10 minutes for the same job!
>
> > Routine 2: 8428 seconds (base)
> > Routine 3: 8121 seconds (Paul's)
> > Routine 4: 8674 seconds (TLS)
> > Routine 5: 8007 seconds (Paul's with early out spin)
> >
> > I have tried splitting the low & high parts into 2 different cache
lines.
> > The results are similar. Most routines are faster with them in the same
> > cache line due to the locked updates on both halves.
>
> That's one of the problems: You do need these numbers to be in different
> cache lines, but that is so that you can have exclusive access to the
> 32-bit counter while doing read-only accesses to the full 64-bit copy.
Yes, but it did not help the numbers. I tinkered with everything a bit more,
and here's what I have:
[Same cacheline]
Total time used was 17148 seconds
Total time used was 11607 seconds
Total time used was 10546 seconds
Total time used was 12946 seconds
Total time used was 13683 seconds
[Split cacheline]
Total time used was 20108 seconds
Total time used was 18300 seconds
Total time used was 17437 seconds
Total time used was 18290 seconds
Total time used was 18237 seconds
The third routine is the one using Paul Hsieh's optimization. I pulled out
the segment reference on fs so the counters are truly not shared. I don't
see how a watchdog thread would even work since it could experience a race.
The split cacheline is split on a 4-byte boundary:
ctr = (UINT64 *) VirtualAlloc(NULL, 128, MEM_COMMIT, PAGE_READWRITE);
// run tests
ctr = (UINT64 *)((BYTE *) pnCtr + 60);
// run tests again
Using VirtualAlloc guarantees 4K alignment.
-Matt
- Previous message: Matt Taylor: "Re: Can you use SSE2 in DOS?"
- In reply to: Terje Mathisen: "Re: 64-bit shared counter"
- Next in thread: Terje Mathisen: "Re: 64-bit shared counter"
- Reply: Terje Mathisen: "Re: 64-bit shared counter"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Relevant Pages
|