Re: Hashing
From: Randall Hyde (randyhyde_at_earthlink.net)
Date: 03/24/05
- Next message: Randall Hyde: "Re: AoA is now available in Chinese"
- Previous message: Sevag Krikorian: "Re: International Peace March (19th / 20th March 2005)"
- Maybe in reply to: Chewy509: "Hashing"
- Next in thread: Randall Hyde: "Re: Hashing"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Thu, 24 Mar 2005 05:03:23 GMT
"Octavio" <933191278@terra.es> wrote in message
news:a3b1e869.0503230913.62b2dbe9@posting.google.com...
> "Randall Hyde" <randyhyde@earthlink.net> wrote in message
news:<s5O%d.876$gI5.144@newsread1.news.pas.earthlink.net>...
> > Octavio's function, for example, is cheap compared to
> > many, but I count 12 instructions in the main loop that
> > does the hash computation (that is, 12 instructions for
> > *each* character
> wrong ,is 12 instructions for 8 characters.
Oops, my mistake, sorry.
Of course, one problem with this approach is that you have to make
sure your string buffer is a multiple of eight characters long. Probably
not an issue for you in your assembler, but in HLA v2.0 it could be
a problem as I'm reading data from a memory mapped file and going
off the end of the file could be disasterous.
FWIW, the HLA v2.0 hashing function is quite simple:
USIDloop:
// Hash function is:
//
// hash := (hash rol 3) ^ SpreadTable[chr];
//
// Note that "hash" is a 16-bit value.
rol( 3, bx );
add( 1, esi ); // Bump up to next char.
xor( al, bl );
// Quick check for EOF:
cmp( esi, EOF );
jae USIDdone;
mov( [esi], al ); // Get the next char.
mov( SpreadTable[eax], al ); // Spread bits around.
test( al, al ); // Is it a valid ID char?
jnz USIDloop; // Repeat, if valid.
Note that this loop is actually doing quite a bit more than computing
the hash value, it is:
1. Computing the hash function, which is handled by the instructions:
mov( SpreadTable[eax], al );
rol( 3, bx );
xor( al, bl );
2. Converting lower-case and upper-case characters to the same
value (in order to do a case-insensitive reserved word lookup).
This function is handled by:
mov( SpreadTable[eax], al );
3. Fetching successive characters from the input file for lexical
analysis and checking them to see if they are part of a reserved
word or identifier, or if they are something else. This is handled
by the instructions:
add( 1, esi );
cmp( esi, EOF );
jae USIDdone;
mov( [esi], al );
mov( SpreadTable[eax], al );
test( al, al );
jnz USIDloop;
So how many instructions do I execute per character for the hash function?
Well, there are only two that aren't specifically used for anything else:
rol( 3, bx); and xor( al, bl ); All of the other instructions would still
need to
be there for the remainder of the lexical analysis needed by the HLA
scanner.
The important thing to note here, and the thing that makes a big difference,
is
that I merged the hash function computation into the same loop that scans
for identifiers/reserved words and does the upper/lower case conversion
in order to amortize the cost of loop control instructions better.
Strictly speaking, btw, there is *one* additional instruction required for
the hash value computation, after the "jnz USIDloop" instruction above,
I do the following for reserved words:
xor( bh, bl );
This is because HLA uses a eight-bit hash value computed from the
reserved word/identifier when searching through the reserved words
list. The actual reserved word hash table is organized as a two-dimensional
array declared roughly as follows:
rsHashTable: dword[13,256];
where the first (leftmost) dimension is indexed by the lexeme's length
and the second dimension is indexed by the eight-bit hash value.
The complete hash table is *very* sparse, mainly because of the length
component, but by indexing off the length as well as the hash value
I was able to do something very important -- I was able to guarantee
that when doing an actual string comparison I always know exactly
how many characters I've got to compare. This lets me write
straight-line code/immediate data string comparison code for each of
the reserved words (the "code" rather than "data table" comparison
you were complaining about in an earlier message). The four bits
of hash value used for encoding the length of the string does not
help spread the entries around well, compared to a better hashing function,
but it does reduce the cost of doing a string comparison by a *tremendous*
amount when dealing with collisions. Given that at least one string
comparison is *always* necessary, this isn't bad.
HLA has somewhere between 700 and 800 reserved words (IIRC).
The simple hash function described above maps almost 500 of those
reserved words to individual slots (i.e., only one string comparison is
needed to reject or accept the match). About 100 of the slots have
a single collision (two reserved words map to the same entry), meaning
it *could* take as many as two string comparisons to accept or reject
the match. It drops down to something like 20 slots that have three
collisions and four slots that have four collisions. Here is a typical
example of a case where there are three reserved words hashing to
the same slot (and how the collisions are resolved):
// Three identifiers of length four, all mapping to the same hash
// value (240, in this particular case):
Len4_Hash240_0:
cmp( eax, rwstr( "cset", 0, 4 ));
jne Len4_Hash240_1;
mov( tkn_cset, eax );
mov( type_tc, ebx );
jmp rwDone;
Len4_Hash240_1:
cmp( eax, rwstr( "int8", 0, 4 ));
jne Len4_Hash240_2;
mov( tkn_int8, eax );
mov( type_tc, ebx );
jmp rwDone;
Len4_Hash240_2:
cmp( eax, rwstr( "fseg", 0, 4 ));
jne IsAnID;
mov( tkn_fseg, eax );
mov( regseg_tc, ebx );
jmp rwDone;
Note that the "string comparison" consists of *two* instructions.
A single compare and a single conditional jump. While the
collision resolution search is done by a linear search, the bottom
line is that we're talking about, worst case, six machine instructions
to accept or reject one of these reserved words. Granted, three
of those instructions are conditional jumps, which are a bit expensive,
but the bottom line is that this is *much* faster than doing a
traditional string comparison. This efficiency is made possible
by hashing off the length of the identifier (so that I always know
exactly how many characters I've got to compare in one of these
sequences).
BTW, the code above is machine-generated, it is not maintained
by hand. I've simply got a text file containing all the reserved
words that a utility program reads and generates the above
code sequences for me (I guess Rene got a little bent out
of shape about this file :-) ).
The bottom line is that simply writing a fast hashing function
isn't good enough. The way to get a really fast lexer is by
streamlining the whole process of scanning a reserved word,
computing its hash value, and searching for the reserved word
in the data base.
I'm not about to claim that HLA's approach is the fastest. I'm
pretty sure it can be sped up a bit if someone really puts their
mind to work on it, but it does do a pretty good job.
BTW, should the scanned lexeme turn out *not* to be a
reserved word, then HLA uses the 16-bit hash value computed
by the lexer as an index into one of the hash tables for the
symbol table. As it turns out, HLA uses a "symbol tree"
rather than a "symbol table" (to use Beth's terminology, which
is quite appropriate here) because HLA is a block structured
language with scoped variables. Each scope (node in the
symbol tree) has its own little hash table. For global objects
and namespace objects, the hash table is indexed by a
16-bit value (i.e., 65,536 entries). For all other nodes,
the hash table is indexed by an 8-bit value (256 entries).
The fact that there are multiple tables helps spread the
symbols around, and further reduces the effects of collisions
(or, if you prefer, it automatically extends the number of
entries in the overall hash table each time you add a local
context to your source file).
Based on suggestions in this thread, I changed my hash function
a little bit, rotating by three bits instead of one. I also modified
the SpreadTable lookup table to "randomize" the bit patterns
for the legal ID symbols (if you think about it, identifiers only
consume about five bits: 26 alphabetic, 10 decimal, and a
few symbols like "_"; this small number of bits doesn't help
distribute items across the eight-bit character space very well).
That, combined with removing one other instruction from the
hash function I originally started with increased the speed of
HLA v2.0 by roughly 33% in a special benchmark I've got that
does several hundred thousand lookups. Not to bad for
a "two instruction" hashing function (well, three if you count
the table lookup).
In the symbol tree, btw, my collision resolution function
is a binary search rather than a linear search. Thus far
I've been getting away with a very simple binary search
that inserts symbols into the binary tree as it encounters
collisions. This means that if you enter all your symbols
in sorted order (forward or reverse), the whole thing
degenerates to a linear search. In general, however, you
can expect the inputs to be sufficiently random that this
shouldn't be a problem. I would point out, however, that
my "special benchmark" I mentioned above inserts all
the symbols in sorted order, so it's basically a "worst
case" scenerio; typical entries in that source file look
like this:
const
aaaa := 0;
aback := aaaa + 1 - 0;
aback_ := 0;
aback_aaaa := aback+aaaa-0;
abaft := aback + 1 - aaaa;
abaft_ := 0;
abaft_aback := abaft+aback-aaaa;
abandon := abaft + 1 - aback;
abandon_ := 0;
abandon_abaft := abandon+abaft-aback;
abandoned := abandon + 1 - abaft;
abandoned_ := 0;
abandoned_abandon := abandoned+abandon-abaft;
.
.
.
(this is the infamous "dictionary file" that crashes RosAsm within
about 250 statements that Rene claims assemblers shouldn't
have to handle :-); HLA v2.0, of course, handles it just fine.)
The full file is more than 115,000 lines of code like the above.
HLA v2.0 processes this (on a 3GHz PIV) at around 320,000
lines/second. This program, because of its size, is one of the
*slowest* compile times (on a lines/sec basis) I've found for
HLA v2.0. I've never bothered trying to profile the code and
find out where the time is being spent because, quite frankly,
processing 115,000 lines of code in under 1/2 second is
close enough to instantaneous for me.
In any case, I certainly second your point that having a ***-kicking
hash function the reduces the number of collisions to almost zero
is useless if the computation of the hash function is expensive.
Far better to use a very cheap hashing function that generates
a few more collisions. You can pay for a few extra string
comparisons (to resolve collisions) very rapidly if you're using
an expensive hash function (i.e., more than two or three instructions).
I once measured the difference in performance between two versions
of my program: one that used cmpsb (slow) to do a string comparison
(against symbols in the symbol tree) and one that use a discrete
sequence of instruction to do the string comparison (fast). On a file
like the one described above, there was about a 0.5% difference in
performance between the assemblies of the same file using the
two different algorithms (and this was on an input file heavily
skewed to favor the faster comparison algorithm). I immediately
dropped the discrete comparison and went with the cmpsb
version, because it's a whole lot easier to understand and maintain.
Bottom line, keep the hashing function cheap. It *does* make
a difference in the performance of the assembler. If you get a few
more collisions, that's not so bad -- my experiences indicate that
you don't lose a whole lot with the string comparisons needed to
resolve those collisions (even when doing a linear search, as
the test example I used winds up degenerating to a linear search).
Cheers,
Randy Hyde
- Next message: Randall Hyde: "Re: AoA is now available in Chinese"
- Previous message: Sevag Krikorian: "Re: International Peace March (19th / 20th March 2005)"
- Maybe in reply to: Chewy509: "Hashing"
- Next in thread: Randall Hyde: "Re: Hashing"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]