Re: "index" efficiency... any help or ideas?

From: Beth (BethStone21_at_hotmail.NOSPICEDHAM.com)
Date: 12/18/03


Date: Thu, 18 Dec 2003 16:18:49 -0000

Bx. C wrote:
> Beth wrote:
> > And pray that no-one types in "antidisestablishmentarianism" or
you'll
> > lose half your hard drive!!! ;)
> >
> > Not that this is the worst situation, either...if storing "place
> > names", then the Welsh town called
> > "LLANFAIRPWLLGWYNGYLLGOGERYCHWYRNDROBWLLLLANTYSILIOGOGOGOCH" will
> > require the addition of a number of extra hard drives, merely to
cope
> > with the weight of "padding" required...there's also an
Austrailian
> > aboriginal town which is just two or three letters shy of the
Welsh
> > name above (but I can't, unfortunately, find it on the 'net like
the
> > other one to copy it here...which is a shame because it's highly
> > poetic, like many aboriginal place names...you know, a really long
> > "woggawoggadoogadoobee" kind of name ;)...
>
> well.. CURRENTLY (counting a trailing space)... my longest word is
12
> characters... but that's looking at the first few hundred or so
words...
> right now i'm writing up a perl script that'll convert the words to
> all-caps... remove duplicates... sort... and then convert to
tree-format and
> spit out NASM-compatible files for me... at a later date, i'll work
on
> "tracking" command usage, for pseudo-real-time (please don't take
apart that
> word!!! hehe) performance optimization...

Leaving "pseudo-real-time" to one side, did I hear you say "commands"?
Aha! It was worth asking the questions...and now I see why you were
doing it the way you were doing it...

Because if these are "commands", then they are being
"entered"...either typed or read out of a "batch file"...therefore, I
can see why you were working on a character-by-character basis...as we
have room to "parallelise" here and that could be the biggest winner
of them all (especially if combined with other algorithms)...

Okay, this has to be thought through in more detail but just to begin
with a "rough outline", if users are _typing_ these commands then, in
fact, the slowest thing here is their typing fingers...but, once
"ENTER" is struck, we want these commands to happen as
"pseudo-real-time" fast as possible? Well, the majority of the time -
due to slow typing fingers - is going to be all that "idle time"
sitting between keystrokes...and we don't necessarily have to _wait_
for the whole command to be typed before starting our searching
work...which is where "parallelism" comes into it...

As the user hits the "A" key, we can search for all commands starting
with "A"...as they hit "A" again, we cut down the search again...as
the hit "A" the third time, there's now only 2 possible commands that
it could be...they hit "2" and then if they hit "P", we _already know_
that we've got a "AAA2PP" command and they haven't even finished
typing the command completely yet...if it's a "B" instead then it's
that "AAA2B" command instead...

Note that, basically, by using some "parallelism", the search is now
_completely irrelevent_ in time terms...because the program is
actually _waiting for the user_ to finish and hit ENTER, it's already
worked out and found the command ages back..."come on, user, hit
ENTER!", the machine thinks as it waits what seems like an eon in its
super-fast terms...often, when it's clear that it could only be one
command because no other command has those same letters at the start,
it'll have worked out the command after typing only a few
letters...there's scope here, in fact, to _exploit_ this in order to
provide "command completion" to speed things up as well...that is,
once enough of the command is typed that it could only be one
particular command, those final letters appear in grey after what they
actually have typed and, if it's the command they want, then they hit
TAB and it completes the command and starts running it straight
away...

Even better, if this is a case of waiting for user input, then even
the fastest human typer leaves tons of "idle time" between each
keystroke...therefore, you could actually have a pretty slow search
method and yet you'd _still_ find yourself with what behaves as
"instanteous searching"...the fact that these are _typed_ commands has
made a world of difference to the solution...as I was saying, you
should always ask questions to diagnose the problem before jumping in
with all these "you 'must' use this" solutions...

Because whatever the rest suggest as a search method, then my solution
which is specific to the idea of "typed commands" will beat them
all...it doesn't matter how clever they are (and I'm sure those ideas
_will_ certainly be clever)...the point with my solution is that I'm
using "parallelism" and doing all of the search work completely
_before_ their methods have even started because they are waiting for
the complete word before starting...

Well, that's the basic idea...now comes the tricky part...how do we
format this in the best way we can? The simplest idea is, actually,
what I think you were originally attempting...a "hierarchy" of
tables...the table at the top is the "first letter" table...for each
letter that it could be, there's a pointer to another table...that is,
put simply, the first entry in the table is for "A", the second for
"B", the third for "C", etc....therefore, when the user hits that
first letter, we can immediately use what they typed to index straight
into this table...each entry in the array is simply a pointer to yet
another table...if there are no commands starting with, say, "K" then
that entry is simply "NULL" to say "there is no further table for this
letter"...then you simply repeat this with each letter...at the "leaf"
nodes, the entry can directly be the address to CALL for the
command...

The "search" process simply be self-explanatory...when a user hits a
letter, you use that to directly grab the entry in the table that
relates to that letter in the "current table"...there's an address
there, which becomes your new "current table"...if they hit
"backspace", we back up one table (so, to accomodate this, we have a
"parent table" variable and when "backspace" is hit, we back up to the
"parent table" address...the first letter has a "parent table" of NULL
to signal that there is no parent table, just for completeness :)...if
any entry turns out to be "NULL", then, basically, the user is typing
nonsense here because there's no such commands...but, just leave it
alone and let them keep typing, monitoring if it should become "valid"
again...because it's quite possible that they simply slipped and typed
one too many "A" letters and the very next thing is going to be
"backspace", which corrects the problem and gets us back to valid
commands again (just keep track of how many characters and return to
that "parent table" when they have returned back to the "valid range"
of commands once more :)...I'd highly recommend you _permit mistakes_
and simply wait for them to be corrected and don't falsh up any "This
is invalid!" messages boxes that need to be dismissed...as said, it
could simply be a small typo and nothing would frustrate a user faster
than a program that freaks out of the smallest typing mistake and
flashes warnings and sirens everywhere, as if it were warning of a
jail break or something...remember, as we're doing the "searching"
while they are typing, it's _THEM_ who are the slowest element in the
equation and the machine is mostly sitting around waiting...we can
afford to twiddle our thumbs and wait for the user to get their act
together ;)...

When they've typed _enough_ of the command that there's only one
possible thing they could be trying to type, we're at a "leaf node"
and have the command's procedure address from the table in our
possession already...just sit around - monitoring for any "backspaces"
(well, just in case they type one command and then suddenly change
their mind, deleting it all to type something else ;) - until ENTER is
hit, in order to actually make that final CALL...you can also add some
"auto-completion" thing in order to "hurry them along" too...once
you've got your "unique" command, you can "suggest" that command to
the user and pressing the TAB key can be equivalent to having typed
the rest of the command and hit ENTER all in one go...so, once someone
gets used to all the commands, it's possible for them to type a few
characters and whack TAB...a "feature" for "advanced" users of the
system once they get used to what minimum amount of characters need to
be typed to differentiate one command from another...note that this
also means that the commands could conceivably be made longer and more
"friendly" to read because the user is often NOT going to be typing
the whole thing out anyway...you can support this feature for even
faster entry too, by choosing command names that are somewhat "even"
across the letters so that a _minimum_ need be typed to work out what
command they want...

Now, the greatest "issue" we have is all these tables...they _can_
potentially get very large indeed...but this greatly depends on how
many commands, their distribution, their exact "phrasing", etc....the
problem is that it's "exponential" for each character typed...the "top
table" requires 36 * 4 bytes (each "alphanumeric" has a DWORD address
in the table)...144 bytes...now, that's not bad at all...the problem
is that for each of these entries, there's another 144 bytes
_each_...and so on and so forth for every single character...this
_could_ get very large, very quickly...

BUT, that's why I suggested using "NULL" when there is no table for
that character...and, once we have a "leaf node", there is no need to
bother looking at any other characters, that entry can directly have
the procedure address (oh, if you're wondering, we can differentiate
between "table" and "leaf" nodes by allocating all our tables in a
certain part of memory - say, at the lower addresses - and then we can
just check if the address we have is in "table space"...if it is, it's
a pointer to another table...if it isn't, then it's the address of a
procedure...doing it this way, we don't need any extra information to
tell the difference, like some "is this a leaf node?" boolean in the
table...stick all the tables at the lower end and all the procedures
above this and we can tell this difference with a simple "is address <
end of tables?" comparison...a little trick there in order to save
space and actually speed the whole thing up because the address
doubles up as the "boolean" that tells us whether the address is
another table or is the actual procedure address :)...

And though the more characters there are means there's more tables, at
the same time, the amount of commands "that start with this specific
set of characters" greatly reduces (that is, it could be _any command_
before they start typing...once they hit "A", it can only be a command
starting with "A"...once they hit "A" again, there's only _2_ commands
that start with "AA"...blah-blah-blah :)...so, the further down the
"tree" we get, the more tables will simply be filled with "NULL"
values for most entries (and if it's a "NULL" entry then there _is_ no
table hanging off of that entry)...

The memory requirements for this? Well, many of the other suggestions
given so far are bound to require less memory...but, all depending on
your commands, it ain't necessarily prohibitive...note, if you _did_
have a command for every single combination of 12 characters then,
nope, you don't stand a chance of having the memory for that...it
would be 36^12 * 4 bytes and that size can't even be represented by 32
bit addresses...as I say, there's the _potential_ here for this to
explode into something quite, quite silly...BUT, in practice, you may
have a lot of commands but you ain't got 36^12 commands, that's for
sure...so, in practice, you won't really be getting anywhere near this
theoretical "capacity"...but I can't say that this is the most memory
efficient method...it's taking the trade-off _firmly_ on the side of
"speed", _regardless_ of memory...BUT, at the same time, it
_immediately_ "cuts off" (a "NULL" entry says "no more tables in this
direction!" :) when there's no more choices...so, really, you just
have the "waste" of a whole 36 entries, even if there's only _one_
entry in that table (but, if there's only one entry in that table,
it's _guaranteed_ to be a "leaf node" and there's no more tables after
it :)...so, really, it evens out pretty nicely, in the end...

Right, if the commands, on the other hand, are coming from a batch
file then our "advantage" is slightly lost...you can choose to simply
use a different method - one of those suggested by the others here -
if the input is coming from a batch file...but, let's consider this
algorithm idea with batch file input, just to see how it would fair in
that situation, anyway...because it wouldn't be too bad there,
either...and a couple of things are automatically unnecessary...for
example, the "batch file" doesn't ever need to care about "backspace",
"parent table" and can _immediately_ issue the command once it's got a
"result" from the search, skipping immediately - not even bothering to
look at the final characters - to the next command...this is because
it's "pre-typed" and, therefore, there won't be any tapping backspace
to correct errors (at least, this should happen while editing the
batch file, NOT while we're actually reading input out of the file
:)...you grab the input a byte at a time (although, you can buffer it
so you read off a big chunk of the disk to avoid actually starting and
stopping the drive motor for every byte...which would really be
nasty...best yet, if you can do so, load the whole thing (or as much
as possible) and read all the input out of the memory buffer
:)...then, for each byte, you follow the same procedure as above (but
the "backspace" and "parent table" stuff can be ignored completely and
there's no need for ENTER keys, really, that the commands can be
immediately executed once we have the procedure address from a "leaf
node")...for simplicity, the exact same code can be used and you can
just "redirect" input...the "backspace" checks and so forth will never
get used with the "batch file" but it means the same code can be
used..._IF_ you want to optimise things further, though, then consider
taking some of the other suggestions mentioned by others on board as
an "alternative" when the input is a file, rather than the user (that
is, presuming you even have some "batch file" capability...which
you've actually not mentioned whether you will or not...if you don't,
well, ignore this part completely ;)...

This shouldn't only be "fast" (not that the algorithm itself
necessarily runs faster than some of these others mentioned...but by
the fact that this algorithm is designed specifically so that it
doesn't have to _wait_ for the user to be finished typing before it
starts the search...thereby, starting the work immediately so that
it's doing all the searching at the same time as the user is
typing...the "parallelism" there means that the two activities happen
at the same time - rather than one after the other - and that's where
all our "speed" is really coming from...the actual process of
searching may, indeed, be quite a bit slower than the other
suggestions but it doesn't particularly matter because it all happens
_simultaneously_ with the user actually typing :) but I can even throw
in an "auto-completion" feature with it too because that's only a
minor "extension" of the idea...

In a sense, you could say that my algorithm here will "win", not by
being the fastest algorithm in speed terms but because I've exploited
the fact that they are "typed commands" in order to simply "cheat"...I
start running _before_ the starter's gun has fired and will have run
the entire race before the other runners have even started...yes,
that's NOT particularly "fair" but, in this case, the rules actually
don't prohibit me from "cheating" like this...it's a perfectly "legal"
thing in this particular race so, if it can be done, it should be done
;)

You could think of it as a variation on Aesop's "the Tortoise and the
Hare" fable...the tortoise wins, not because he's fastest but because
the hare - who is, undoubtedly, faster - isn't moving (he's decided to
go to sleep half-way through the race being run, being somewhat
"arrogant" that he's clearly the "superior" runner that he's "bound to
win" ;)...the moral of the fable is "slow and steady wins the
race"...something I'm beginning to demonstrate on more than one
level...all the "hares" rushed in with solutions but "Beth the
tortoise" remembered that "slow and steady wins the race" so, I didn't
panic, I took the time to ask a few questions...I took the time to
write a long post rather than rush out some quick "concise"
thing...and my solution here is a "tortoise" in comparison to all the
super-quick rushing algorithms but, basically, I simply "cheat"...by
definition of how my algorithm should work, I've completed the race
before the others have even began...to the user, it's effectively
"instanteous" because, very often, it's a case of the program waiting
for the user to "catch up" rather than the other way around...I've
warned repeatedly agains the practice of "rushing" because you could
end up running straight passed what you really should have been
doing..."more haste, less speed" / "slow and steady wins the
race"...unless someone else also "cheats" - copying my idea of
starting to run before the starter's gun has gone off - then my "slow
and steady" has won this race, for sure...and Aesop's wisdom yet again
proves itself to be timeless ;)...

Just to make sure I don't get ahead of myself (almost did and broke
the "tortoise" rule, rushing to give you example code ;), I'd better
ask the obvious first: DOS or Windows or Linux? MASM or TASM or NASM
or HLA? I already started some DOS / MASM / .COM file code but then
realised that you said "a really large dictionary" and, well, that
sounds more like a 32-bit kind of deal to have the memory for all of
this (_plus_, as Frank noted, all the room for the commands' code as
well, which is going to require a lot too :)...knowing this stuff
would, of course, allow some example code that's more relevent to what
you're doing...lest be accused by someone of "favouritism" in some
way, you decide the assembler...and I won't make any recommendations
about assemblers with good data declaration facilities (hint, hint ;),
which'll help you put together the tables more conveniently (note,
_any_ of them can obviously do it, so despite the minor rib-digging
jokes, you really _can_ pick anyone you like and it's all the
same...in fact, it's probably best to write some other program that
automatically puts the tables together automatically or you'll never
manage your massive list easily ;)...

Beth :)



Relevant Pages

  • Re: Delete key problems
    ... In Office button | Word Options, Advanced category, check the "Typing ... If you are using Word 2003, there is a similar command; ... I have to hit the delete key. ... all of a sudden it was working fine with the backspace key before. ...
    (microsoft.public.word.docmanagement)
  • Re: How do you program in Python?
    ... demonstrate the Scintilla plugin that can also be used in Python programs through things like the StructuredTextControl in wxPython), with the auto-save-on-loss-of-focus feature enabled, and a command prompt open in another window. ... I edit in the Scite window, hit Alt-Tab to change focus to the cmd console, press the Cursor Up key to retrieve the previous command (which is generally the name of my script, or a command like "python ... So, any time I need to test the changes, I hit four keys and I'm done. ...
    (comp.lang.python)
  • Question about ssh
    ... This command works fine except for one little thing. ... When I hit enter, I get a message that says ... I can tell from running "ps" commands on the host that all of the ... processes which had been started up by this ssh invocation are now gone. ...
    (comp.unix.shell)
  • Re: Computer
    ... Navigate to the folder holding the files you want to work on. ... Type your file selection ... option) and hit. ... options I already mentioned that are possible from the command line. ...
    (rec.photo.digital)
  • Re: Setx,y etc.
    ... sety on the command line. ... Start is to manually jog the table all the way forward and then hit ... UT Enter, then choose Fixture Offsets. ...
    (alt.machines.cnc)