Re: [Clax86list] Can WXP handle 4 GB of Physical Memory
- From: "James Van Buskirk" <spamtrap@xxxxxxxxxx>
- Date: Mon, 11 Feb 2008 14:26:44 -0700
"Jerome H. Fine" <spamtrap@xxxxxxxxxx> wrote in message
news:47B045FB.8090205@xxxxxxxxxxxxx
Since I have often found that so-called "accepted concepts" can be
incorrect, I thought that it would be a reasonable challenge to
determine if huge chunks of memory are able to speed up a sieve
program.
By the way, in looking at the time to copy various sizes of chunks
of memory, it seems obvious that L1 and L2 cache sizes play an
important part. However, there does not seem to be anywhere near
even an order of magnitude, let alone 2 orders of magnitude, for
using L1 cache. And for L2 cache, the time factor seems to be ONLY
about 2 to 1, nowhere near an order of magnitude. While speeding
up any program by a factor of 2 should never be ignored, I just
don't understand where 2 orders of magnitude are possible unless
a mod 30 compression of the work space is included as part of the
speed improvement.
Here, let's try the stream benchmark http://www.cs.virginia.edu/stream/
C:\Program Files\Microsoft Visual Studio 8\James\Benches\stream>stream
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
----------------------------------------------
STREAM Version $Revision: 5.6 $
----------------------------------------------
Array size = 30000000
Offset = 0
The total memory requirement is 686 MB
You are running each test 10 times
--
The *best* time for each test is used
*EXCLUDING* the first and last iterations
----------------------------------------------
----------------------------------------------
Printing one line per active thread....
----------------------------------------------------
Your clock granularity appears to be less than one microsecond
Your clock granularity/precision appears to be 1 microseconds
----------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 5266.3163 0.0914 0.0911 0.0916
Scale: 5258.7959 0.0915 0.0913 0.0918
Add: 5465.1743 0.1320 0.1317 0.1322
Triad: 5462.6986 0.1320 0.1318 0.1322
----------------------------------------------------
Solution Validates!
----------------------------------------------------
So I am getting a puny 5.2 GB/s bandwidth from memory with this
benchmark. From L1 cache, I could have achieved a 16 byte load
and store per clock cycle per core, so that would have been
16*2*2*2.67e9 = 171 GB/s. That's a factor of over 30 right there.
Stream accesses memory in a favorable way allowing the hardware
to take shortcuts. Naive sieving accesses memory in an unfavorable
way, so it's slower. L1 cache doesn't have this same problem,
although there is, I admit, the possibility for in-cache accesses
to collide with one another in a processor-dependent fashion.
The following paragraph assumes that a mod 30 compression of the
work space is being used and an L2 cache of at least 1 MB and
probably 8 Mb.
Note that I am NOT saying that taking advantage of chunks of memory
the size of the L1 cache and then the size of the L2 cache should not
be done for the "smaller primes (up to around 10**8)". However, when
sieving up around 10**18, the program obviously needs to use primes
up to 10**9. Since there are about TEN times as many primes between
10**8 and 10**9 as there are up to 10**8, when L2 cache is no longer
large enough to effectively process the primes larger than about 10**8,
having a huge chunk of memory (perhaps about 1 GB) should be effective
in processing the last 45 million primes (out of 50,847,534) when the
sieve checks for primes up to 10**18.
If you use 4 bytes to hold the value of each prime, that's 203 MB, if
8 bytes, 407 MB. Still not half a GB. That's not going to fit in L1
cache, but the trick in handling this vast quantity of data is to keep
it organized in such a way that accesses are minimized. This requires
you to be constantly rewriting the data set so that the set of primes
required for the next cache block will be directly available without
the necessity to search through primes that are going to miss anyhow.
In addition, another option might be to attempt to use multi-core CPUs
working on the same area of the sieve at the same time. I am not sure
if WXP has the ability to co-ordinate such an algorithm, but that will
also be an interesting challenge.
As I see the potential, all of L1, L2 and actual memory may play a vital
role in the algorithm. The real problem may be to produce self-modifying
code (not the code itself, but the parameters which are used to determine
when to switch from L1 sizes to L2 sizes and finally to the huge chunks
of memory) which can adjust to the various sizes of L1 and L2 cache that
are available. I would appreciate a comment and any actual information
from anyone who has attempted sieving up around 10**18.
Well, you seem to be doing too much planning and not enough prototyping.
In my experience there are many features which must be implemented to
make sieveing fast, and it just takes the time and effort to code them
up and debug them. Going to 64-bit Windows would save you time if you
are worried about running into limitations of 32-bit address spaces.
If you think you are going to need that much memory (I claim you aren't)
it makes absolutely no sense to subject yourself to the irritation of
running into system limits like that. Well, unless you want to do a
distributed computing effort, but in that case it's going to take you
so long to get anything useful coded that there should be a significant
number of potential clients running 64-bit Windows by the time you need
them.
--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end
.
- References:
- Re: [Clax86list] Can WXP handle 4 GB of Physical Memory
- From: Charles Crayne
- Re: [Clax86list] Can WXP handle 4 GB of Physical Memory
- From: James Van Buskirk
- Re: [Clax86list] Can WXP handle 4 GB of Physical Memory
- Prev by Date: Re: Display numbers in IEEE format
- Next by Date: GAS-Intel on Windows (MinGW)
- Previous by thread: Re: [Clax86list] Can WXP handle 4 GB of Physical Memory
- Next by thread: GLOBAL_OFFSET_TABLE
- Index(es):
Relevant Pages
|
|