Re: 64-bit shared counter

From: Matt Taylor (para_at_tampabay.rr.com)
Date: 03/30/04

  • Next message: hutch--: "Re: A 233 byte PE-COFF executable"
    Date: Tue, 30 Mar 2004 03:22:54 +0000 (UTC)
    
    

    "Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
    news:c41lkr$so0$1@osl016lin.hda.hydro.com...
    > Matt Taylor wrote:
    >
    > > "Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
    > >>OK, in that case my code is definitely significantly faster, even if we
    > >>allow a pure 2X speedup from 2.8 vs 1.4 GHz (definitely not the case
    > >>here, with RAM speed being the limiter), it seems to run 7 times faster
    > >>than the various versions where the core code handles the carry
    > >
    > > propagation.
    > > <snip>
    > >
    > > I let it run for a couple days and then stopped. There was a huge
    disparity
    > > between run times after I inlined the functions, so it seems the
    function
    > > call overhead was more significant than I had at first thought. With
    that in
    > > mind, I dropped the run time to 2^36 locks and added your watchdog
    routine.
    > >
    > > I have placed the C++ and asm versions here:
    > > http://rabbithole.cc/ctr.cpp
    > > http://rabbithole.cc/ctr.asm
    > >
    > > My results show your watchdog routine consistently being >50% slower
    than
    > > the other versions. The 5th is the fastest. I don't understand why.
    >
    > I do!
    >
    > Your version of my code is wrong, as well as slow:
    >
    > UINT64 __forceinline InterlockedIncrement64_1(volatile UINT64 *pn)
    > {
    > _asm
    > {
    > mov ecx, [pn]
    > mov eax, 1
    > lock xadd [ecx], eax
    > mov edx, fs:[g_pvHigh]
    > cmp eax, fs:[g_pvLow]
    > adc edx, 0
    > mov fs:[g_pvLow], eax
    > mov fs:[g_pvHigh], edx
    > }
    > }
    >
    > Firstly, there is no need at all for the fs: segment overrides, the
    > shared 64-bit variable can reside anywhere in memory as long as all the
    > incrementing threads/processes can _read_ it!
    >
    > Secondly, you are _writing back_ the updated results on every iteration,
    > which is both horribly slow (think ping-pong of the cache line), and
    > simply wrong:

    Each thread has a unique fs segment pointing to the TEB structure which
    contains TLS. This shouldn't produce the ping-pong effect you describe
    except when threads are moved between processors--relatively infrequently. I
    wondered about the fs override causing decoder delays, but I doubt they
    would be that great. Certainly the run time should be less than the other
    routines which are both branchy and lengthy.

    > This code depends critically on having a single watchdog thread that is
    > responsible for all actual updates, and which does them by copying the
    > updated 64-bit count to a new memory location, then doing an atomic
    > write to the 32-bit pointer to the new location.
    >
    > Since the watchdog is read-only on every access until the wraparound
    > happens, it can work in parallel with the incrementers with low to zero
    > impact. On an update, it is the only writer to the 64-bit copies, so it
    > will never suffer locking delays caused by the incrementer threads.
    >
    > With every incrementer writing back, you have a race condition between
    > loading te previous 64-bit value, doing the compare (propagating any
    > carry), and writing back the updated result.
    > >
    > > The latest results are:
    > > Routine 1: 12698 seconds (Terje's)
    >
    > 2^36 is 16 times 2^32, taking my 30-38 seconds per epoch as a starting
    > point, I'd get about 600 seconds or 10 minutes for the same job!
    >
    > > Routine 2: 8428 seconds (base)
    > > Routine 3: 8121 seconds (Paul's)
    > > Routine 4: 8674 seconds (TLS)
    > > Routine 5: 8007 seconds (Paul's with early out spin)
    > >
    > > I have tried splitting the low & high parts into 2 different cache
    lines.
    > > The results are similar. Most routines are faster with them in the same
    > > cache line due to the locked updates on both halves.
    >
    > That's one of the problems: You do need these numbers to be in different
    > cache lines, but that is so that you can have exclusive access to the
    > 32-bit counter while doing read-only accesses to the full 64-bit copy.

    Yes, but it did not help the numbers. I tinkered with everything a bit more,
    and here's what I have:

    [Same cacheline]
    Total time used was 17148 seconds
    Total time used was 11607 seconds
    Total time used was 10546 seconds
    Total time used was 12946 seconds
    Total time used was 13683 seconds

    [Split cacheline]
    Total time used was 20108 seconds
    Total time used was 18300 seconds
    Total time used was 17437 seconds
    Total time used was 18290 seconds
    Total time used was 18237 seconds

    The third routine is the one using Paul Hsieh's optimization. I pulled out
    the segment reference on fs so the counters are truly not shared. I don't
    see how a watchdog thread would even work since it could experience a race.

    The split cacheline is split on a 4-byte boundary:
     ctr = (UINT64 *) VirtualAlloc(NULL, 128, MEM_COMMIT, PAGE_READWRITE);
     // run tests
     ctr = (UINT64 *)((BYTE *) pnCtr + 60);
     // run tests again

    Using VirtualAlloc guarantees 4K alignment.

    -Matt


  • Next message: hutch--: "Re: A 233 byte PE-COFF executable"

    Relevant Pages

    • Re: SP2 installed itself.
      ... I took the five new updates today and had no problem, ... I don't have a clue why the OP had a problem. ... The total time involved, even ... > Exactly what trojan/virus? ...
      (microsoft.public.windowsupdate)
    • Re: Creating a Time Sheet Entry Form in Access
      ... guess subform) show the activities individually and then total time for each ... Gina Whipp wrote: ...
      (microsoft.public.access.formscoding)