Re: 64-bit shared counter
From: Matt Taylor (para_at_tampabay.rr.com)
Date: 03/26/04
- Next message: Matt Taylor: "Re: x86 architecture questions"
- Previous message: Bob Masta: "Re: newbie about winAPI"
- In reply to: Terje Mathisen: "Re: 64-bit shared counter"
- Next in thread: Terje Mathisen: "Re: 64-bit shared counter"
- Reply: Terje Mathisen: "Re: 64-bit shared counter"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Fri, 26 Mar 2004 15:01:08 +0000 (UTC)
"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
news:c3pt5i$aa4$1@osl016lin.hda.hydro.com...
> Matt Taylor wrote:
>
> > "Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
> >>Assuming this as the worst case, it would take about 5000 seconds for
> >>the same 128 wraparounds that you are testing with now, compared to 70 K
> >>for your current code. Even if you have significantly slower cpus, your
> >>memory interface shouldn't be too dissimilar. What other stuff is going
> >>on here?
> >
> > I'm running 4 threads on this dual Athlon 1.4 GHz until the counter hits
> > 0x8000000000. The idea is to intentionally simulate a case with high
> > contention. My test bench was logging boundary cases (0, 1, 7FFFFFFE,
> > 7FFFFFFF, 80000000, 80000001, etc.) to make sure no two threads got the
> > same result. I removed that code, and now it's just calling the test
> > function in a tight loop to increment the counter. I am calling the test
> > function using a function pointer, but it should be well-predicted.
> >
> > _asm
> > {
> > mov fs:[high], 0
> >
> > loop_top:
> > push OFFSET ctr
> > call DWORD PTR [CurInterlockedIncrement64]
> >
> > cmp edx, 0x80
> > jb loop_top
> > }
> >
> > The high variable is my TLS slot to store the high half. The ctr variable
> > is my counter. The counter functions were taken directly from this thread.
>
> OK, in that case my code is definitely significantly faster, even if we
> allow a pure 2X speedup from 2.8 vs 1.4 GHz (definitely not the case
> here, with RAM speed being the limiter), it seems to run 7 times faster
> than the various versions where the core code handles the carry
> propagation.
<snip>
I let it run for a couple of days and then stopped it. There was a huge
disparity between run times after I inlined the functions, so it seems the
function call overhead was more significant than I had first thought. With
that in mind, I cut the run length to 2^36 locked increments and added your
watchdog routine.
I have placed the C++ and asm versions here:
http://rabbithole.cc/ctr.cpp
http://rabbithole.cc/ctr.asm
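Those files are the real test code. Purely to illustrate the shape of the
test (several threads incrementing one shared counter until it reaches a
limit), here is a minimal sketch in portable C++. It is not taken from ctr.cpp:
it assumes std::atomic in place of the hand-written routines from this thread,
so no carry handling is exercised, and the names ctr and run_test are
hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Assumption: std::atomic stands in for the routines under test.
std::atomic<std::uint64_t> ctr{0};

// Spawns `threads` workers. Each one increments the shared counter until the
// value it gets back reaches `limit` (limit must be at least 1).
// Returns the total number of increments performed by all workers.
std::uint64_t run_test(unsigned threads, std::uint64_t limit) {
    ctr.store(0);
    std::vector<std::uint64_t> counts(threads, 0);
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < threads; ++i) {
        pool.emplace_back([&counts, i, limit] {
            std::uint64_t n = 0;
            std::uint64_t v;
            do {
                // fetch_add returns the old value; +1 gives the new one.
                v = ctr.fetch_add(1, std::memory_order_relaxed) + 1;
                ++n;
            } while (v < limit);
            counts[i] = n;
        });
    }
    for (auto& t : pool) t.join();

    std::uint64_t total = 0;
    for (auto n : counts) total += n;
    // If no two threads ever received the same value, the number of
    // increments equals the final counter value.
    assert(total == ctr.load());
    return total;
}
```

Every worker stops on the first value at or above the limit, so with N
threads the counter ends at limit + N - 1. A small limit such as 1 << 16 is
enough to check the logic before committing to a long run.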
My results show your watchdog routine consistently being roughly 50% slower
than the other versions (between about 46% and 59%, going by the figures
below). The 5th is the fastest. I don't understand why.
The latest results are:
Routine 1: 12698 seconds (Terje's)
Routine 2: 8428 seconds (base)
Routine 3: 8121 seconds (Paul's)
Routine 4: 8674 seconds (TLS)
Routine 5: 8007 seconds (Paul's with early out spin)
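To put those timings in relative terms, the short sketch below computes how
much slower routine 1 is than each of the others. The helper names are
hypothetical; the numbers are copied from the list above. The results fall
between about 46% and 59%.

```cpp
#include <cstdio>

// Percent by which run time `slow` exceeds run time `fast`.
constexpr double slowdown_percent(double slow, double fast) {
    return (slow / fast - 1.0) * 100.0;
}

// Timings in seconds, copied from the results above.
constexpr double kRoutine1 = 12698;
constexpr double kOthers[] = {8428, 8121, 8674, 8007};  // routines 2 to 5

// Prints the slowdown of routine 1 relative to each other routine.
void print_slowdowns() {
    for (int i = 0; i < 4; ++i) {
        std::printf("Routine 1 vs routine %d: %.1f%% slower\n",
                    i + 2, slowdown_percent(kRoutine1, kOthers[i]));
    }
}
```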
I have tried splitting the low & high parts into 2 different cache lines.
The results are similar. Most routines are faster with them in the same
cache line due to the locked updates on both halves.
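The two layouts can be expressed in C++ along these lines. This is a sketch
only, not the declarations from ctr.cpp: the struct names are hypothetical,
and the 64-byte line size is an assumption about the test machine.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLine = 64;  // assumed cache-line size in bytes

// Both halves of the counter share one cache line.
struct alignas(kLine) SameLine {
    std::uint32_t low;
    std::uint32_t high;
};

// Each half starts its own cache line.
struct SplitLines {
    alignas(kLine) std::uint32_t low;
    alignas(kLine) std::uint32_t high;
};

static_assert(sizeof(SameLine) == kLine,
              "both halves fit in a single line");
static_assert(offsetof(SplitLines, high) - offsetof(SplitLines, low) == kLine,
              "halves sit exactly one line apart");
```

With the split layout a locked update of one half does not need ownership of
the line holding the other half, but a routine that locks both halves then
has to acquire two lines instead of one.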
The only explanation I can think of for the watchdog routine is that the fs
references are slowing it down.
-Matt