Re: Lockfree_mpmc and scalability ...



Hello,

Even the following code inside push() method:

if getlength >= fsize
then
begin
result:=false;
exit;
end;

If you hav e noticed getlength() method is sharing
variables between the cores and making the 4 threads
test much slower than the single thread test on
Intel Core 2 Quad Q6600, and i have tested it on my computer,
it makes it much slower cause the cache to cache tranfer is costly
on Intel Core 2 Quad Q6600, but on new architechtures that have
and L3 cache and hypertransport, it gives the same throughput
on single thread and four threads but it doesn't scale with four threads.
.
So the following parts are sharing variables between the cores and making
the 4 threads test much slower than the single thread test.on Intel Core 2
Quad Q6600:

[1] getlength()

[1] newTemp:=LockedIncLong(temp); ...

[2] and CAS(tail,lasttail,newtemp) also...


It's why i have told you that the CAS and sharing variables
are expensive.



Thank you.

Amine Moulay Ramdane.

"aminer" <aminer@xxxxxxxxxxxx> wrote in message
news:jq392a$4mt$1@xxxxxxxxxxxxxxxx
I wrote:
In the two thread scenario, you have to do a load
from the local L2 cache in [1] and [2] and this loads makes
the S part of the Amadahl equation much bigger than
the P part, it's why the two threads version doens't scale
either.

So in general i think it's not possible to make lockfree
fifo queues to scale when the lockfree code is sharing variables
between the cores, cause sharing variables is so expensive..


I mean the CAS and the sharing of the variables make the S part of
lockfree_mpmc much bigger than the S part and from the Amadahl equestion
this makes lockfree_mpmc not scalable.

So in general i think it's not possible to make lockfree
fifo queues to scale when the lockfree code is sharing variables
between the cores and you are using CASes, cause sharing variables
and using CAS are so expensive..


Thank you.

Amine Moulay Ramdane.

"aminer" <aminer@xxxxxxxxxxxx> wrote in message
news:jq37jl$qsl$1@xxxxxxxxxxxxxxxx

I have corrected some typos , please read again...

Hello,


I have received the benchmarks from some persons
that have an L3 cache, and i have noticed that lockfree_mpmc
doesn't scale either on with an L3 cache.
Do you know why this lock free fifo doesn't scale, cause
look at the following code on the push() side:

--

function TLockfree_MPMC.push(tm : tNodeQueue):boolean;
var lasttail,newtemp:longword;
i,j:integer;
begin

if getlength >= fsize
then
begin
result:=false;
exit;
end;
result:=true;
newTemp:=LockedIncLong(temp);

lastTail:=newTemp-1;
setObject(lastTail,tm);

repeat

if CAS(tail,lasttail,newtemp)
then
begin
exit;
end;
asm pause end;

until false;


end;

---


You have two thinks:

[1] newTemp:=LockedIncLong(temp);

[2] CAS(tail,lasttail,newtemp)

In the 4 threads scenario , as you can see in [1] temp has to be
loaded from the L3 cache on computers that have an L3 cache ,
but on my Intel Core 2 Quad Q6600 that doesn't have an
L3 cache(just an L2 cache for every two cores) i think it has to
be loaded from memory, so that will make the four thread test
with an L3 cache a little bit slower than the single thread version
that loads the values from the L1 cache and much slower on a computer
without an L3 cache. That's the same for [2] , tail has to be loaded the
same way.

It's why i am getting a retrograde throughput with four threads
on my Intel Core 2 Quad Q6600 and almost the same thoughput
as the single thread on a computer with an L3 cache.

In the two thread scenario, you have to do a load
from the local L2 cache in [1] and [2] and this loads makes
the S part of the Amadahl equation much bigger than
the P part, it's why the two threads version doens't scale
either.

So in general i think it's not possible to make lockfree
fifo queues to scale when the lockfree code is sharing variables
between the cores, cause sharing variables is so expensive..


Thank you.

Amine Moulay Ramdane.


"aminer" <aminer@xxxxxxxxxxxx> wrote in message
news:jq0kr8$p30$1@xxxxxxxxxxxxxxxx
Hello all,


I have finally found why lockfree_mpmc doesn't scale...

you can get the the source code of lockfree_mpmc from:

http://pages.videotron.com/aminer/

So please follow with me..

If you take a look at lockfree_mpmc object pascal
source code you will read this on the push side:


---

function TLockfree_MPMC.push(tm : tNodeQueue):boolean;
var lasttail,newtemp:longword;
i,j:integer;
begin

if getlength >= fsize
then
begin
result:=false;
exit;
end;

result:=true;

newTemp:=LockedIncLong(temp);
lastTail:=newTemp-1;

setObject(lastTail,tm);

repeat
if CAS(tail,lasttail,newtemp)
then
begin
exit;
end;
asm pause end;
until false;
end;

---

When i have tested the push() side with 4 threads i have noticed that
lockfree_mpmc
doesn't scale at all., in fact i have got a retrograde throughput, that
means that
i got less throughput than on a single thread test.. and i have finally
found
why lockfree_mpmc doesn't scale. When you are using a lockfree_mpmc
on a single thread test the CAS does read and update the variables on
the
level 1 cache, and it's fast, but when you are using 4 threads it does
get
too slow cause we are reading and updating from the L2 and from the
memory.

I have thried to play with the affinity mask and i have found that when
i am
using two threads on my tests and reading and updating from the same
level 2 cache
it does scale a little bit more and i have got more throughput with two
threads
on different cores and on the same level 2 cache than the single
threadtest.


I have also modified lockfree_mpmc to not touch the CAS and
the cache when tail and lasttail are not equal by using the following
code inside
the repeat until loop:

if tail <> lasttail
then
begin
continue;
end;

and it does give better performance with this method

here is the final code of the push() side of lockfree_mpmc..

i think i will modify the pop() side like that...


---
function TLockfree_MPMC.push(tm : tNodeQueue):boolean;
var lasttail,newtemp:longword;
i,j:integer;
begin

if getlength >= fsize
then
begin
result:=false;
exit;
end;

result:=true;

newTemp:=LockedIncLong(temp);
lastTail:=newTemp-1;

setObject(lastTail,tm);

repeat

if tail <> lasttail
then
begin
continue;
end;

if CAS(tail,lasttail,newtemp)
then
begin
exit;
end;
asm pause end;
until false;
end;
---

But as i have said before lockfree_mpmc doesn't scale when we are
using different cores and WE ARE NOT sharing the same cache,
that means that on my Intel Core 2 Quad Q6600 it does scale only
when we are using 2 threads on different cores that shares the same
level2 cache.



Thank you.


Amine Moulay Ramdane.









.



Relevant Pages

  • Re: Lockfree_mpmc and scalability ...
    ... fifo queues to scale when the lockfree code is sharing variables ... between the cores, cause sharing variables is so expensive.. ... doesn't scale either on with an L3 cache. ...
    (alt.comp.lang.borland-delphi)
  • Lockfree_mpmc and scalability ...
    ... you can get the the source code of lockfree_mpmc from: ... why lockfree_mpmc doesn't scale. ... level 1 cache, and it's fast, but when you are using 4 threads it does get ... on different cores and on the same level 2 cache than the single threadtest. ...
    (alt.comp.lang.borland-delphi)
  • Re: Lockfree_mpmc and scalability ...
    ... doesn't scale either on with an L3 cache. ... Do you know why this lock free fifo doesn't scale, ... that doesn't have an L2 cache(just an L2 cache for every two cores), ... L3 cache compared to the single thread version that loads the values ...
    (alt.comp.lang.borland-delphi)
  • Re: Lockfree_mpmc and scalability ...
    ... doesn't scale either on with an L3 cache. ... Do you know why this lock free fifo doesn't scale, ... L3 cache(just an L2 cache for every two cores) i think it has to ... as the single thread on a computer with an L3 cache. ...
    (alt.comp.lang.borland-delphi)
  • Re: ANOTHER PROCESSOR QUESTION
    ... "bottleneck" of any kind. ... It is if 2 or more cores are sharing the same cache. ...
    (alt.comp.hardware.pc-homebuilt)