Re: count2.asm
- From: "Beth" <BethStone21@xxxxxxxxxxxxxxxxxxxxxxx>
- Date: Tue, 02 Aug 2005 22:19:48 GMT
Randy wrote:
> The advantage of mmapped I/O is that you don't have to worry about
file
> buffer boundaries and the like. The disadvantage is that there are
some
> file length limitations that don't exist with standard file I/O (e.g.,
> files can grow beyond 2GB or some other similar [probably smaller]
> limit using mmapped I/O).
Have you ever read the first chapter of Micheal Abrash's "Graphics
Programming Black Book"?
He makes a myriad points with his simple examples and their _actual_
timings...the examples are mostly to show the "algorithm is King"
principle...
But also shows what is, to many, an "unintuitive" result...
He writes a program in C using "read"...the unbuffered I/O routine in
the standard library...
He then writes a new program using "fread" - which is a buffered I/O -
and this program vastly, vastly improves on the "read" version because
of the buffering...
This is all standard stuff...but then Abrash delibrately throws a
"curveball" - an initially "unintuitive" result - in order to "shake up
that thinking" a little...
He then writes yet another program but _RETURNS_ to using "read" -
unbuffered I/O once more - but simply gets the program to _DO ITS OWN
BUFFERING_...thus, the exact buffering strategy can be "fine tuned" to
the problem at hand...
For example, consider NOT having the OS do the "buffering" but, instead,
you _KNOW_ - from the nature of the program - that it's going to operate
100% "sequentially"...this knowledge allows you to realise that you can
"read ahead"...basic "CPU bound" / "I/O bound" distinction...
So, _DON'T_ let the OS do it...on the contrary, do the "buffering"
yourself...and realise that should the OS provide (as Windows does)
"overlapped I/O" facilities then you can run the processing and the read
of the next "buffer" simultaneously...
Thus, create two buffers...read in data - two "pages" - to both
buffers...process the first buffer...then initiate a "read" on that
first buffer for the third "page" of the file...process the second
buffer...initiate a "read" on the second buffer...process the first
buffer...
And, basically, while the CPU is engaged in the actual processing of the
buffer data, the disk is running _CONCURRENTLY_, picking up the next
bufferful of data...indeed, CPUs are so fast and disks so slow that
"double-buffering" may not be enough...bigger buffers or
"triple-buffering" or something to suit the massive disparity between
the two...
If you leave the OS to do this, then it won't read until you make a
"hit" on an unloaded page...the OS cannot know for sure - as it must use
algorithms as equally appropriate for "sequential" or "random access"
processing - what "page" comes next, so it has to do it
"on-the-fly"...also, goodness knows what "scheme" is really going on
inside the OS regards this kind of thing...there's LRU on the buffers,
no doubt...but does it constantly "allocate / deallocate / allocate /
deallocate" buffers? Going via the "memory manager" all the time?
Messing around with "tables"? All quite unnecessary in this _SPECIFIC_
case, of course...but the OS must operate with complete "genericism" for
whatever weird and wonderful thing is asked of it next...
Find the API (if one exists) that allows asynchronous file reads to be
initiated that doesn't hold up the CPU...then handle the buffering
yourself with two buffers, where you load data asynchronously into one
while processing the other...if you could get the "timing" exactly
right, then there's the prospect of a program that does NOT slow down
even a nanosecond from the disk access...other than the initial "loading
up" of the buffers at the start...it does all the loading asynchronously
and concurrently with the processing of the data, so that - at least,
this is the "aim" of this technique - that the data to be processed next
is already waiting in the other buffer by the time the first buffer has
been processed...
Think, if you will, "production line"...you actually have _TWO_
"workers" in your "factory" here: The CPU and the disk...set up a
"conveyor belt" and then do it "production line" style...while the disk
loads, the CPU processes...while the CPU processes, the disk
loads...overlapping operations which should not require the CPU to
"wait" for any "loading"...
And, indeed, to do this, _DON'T_ use the OS's "mechanisms" at all, if
you can help it...get as "raw" access as possible and then "do it
yourself"...what the OS has for a "general-purpose algorithm" might, in
fact, defeat what you're trying to do...unfortunately, no-one seems to
have heard or comprehends "toolmaker's view" in most modern OSes -
written in "ivory tower" HLLs, following theories that assume "ideals"
and what-not - so you might not even be presented with the choice of,
you know, doing it properly...
Oh, one another idea...no idea if it'd actually work out faster or be
useful...BUT, you know, "analyse the problem domain"...if you're
counting "vowels" and "non-letters" and you know the file size, then
you've implicitly counted the "consonants"...
That is, "vowels + consonants + non-letters = file size" must be true
(assuming no "overlap" in what you're considering "vowel", "consonant"
and "non-letter", of course :)...other "categories", of course, may also
"add up to the whole file size" too, not just the "vowels / consonants /
non-letters" thingy ("whitespace + non-whitespace = file size" too)...
I don't know if this observation could be put to good use in
practice...but, possibly, do a "seek" and "tell" at the end of the file
to find out the "file size" (if the OS doesn't have a direct
"GetFileSize" API, of course :)...then you only need count certain
categories and can "work out" the "negative category" by subtraction at
the end...
So, you can concentrate on simply counting a few categories - "vowels",
"non-letters", etc. - and then just work out the rest later...also, this
saves incrementing lots of variables as you proceed...slightly less RAM,
"skipping" slightly faster than incrementing one or the other variable
for two "categories" that, in fact, "add up to the whole", anyway...just
deal with the one...you can "work out" the other one at the end with
some "simple arithmetic"...
Possibly, the algorithm itself could take advantage of that to speed
things up further: If you're only concerned with "vowels", for example,
there's only 5 of them in English and grabbing a DWORD at a time, then
one of five vowels in one of four positions...just 20 possibilities to
look for...some "look-up table" that contains a set of "masks" to apply
to a value that tries to pick out one of the "vowel" characters a DWORD
at a time? Maybe, though, I'm just thinking this up as I go along here,
so I'd need more thought as to whether this really could be made to work
faster...but there's a possibility that if we stop to think the
_problem_ over a bit more then there are things in the problem itself
which are "always true" that can be taken advantage of...
Anyway, the "DIY buffering" concept should, Hopefully, entirely negate
any influence "disk access" might be having on the program...then it'll
be down 100% to the "algorithm" that's actually doing all this
"counting"...
Beth :)
.
- Follow-Ups:
- Re: count2.asm
- From: randyhyde@xxxxxxxxxxxxx
- Re: count2.asm
- References:
- Re: count2.asm
- From: Frank Kotler
- Re: count2.asm
- From: ¬a\\/b
- Re: count2.asm
- From: randyhyde@xxxxxxxxxxxxx
- Re: count2.asm
- Prev by Date: Re: Assembly Jobs do Exist
- Next by Date: Re: count2.asm
- Previous by thread: Re: count2.asm
- Next by thread: Re: count2.asm
- Index(es):
Relevant Pages
|