Re: Hashing for big list of last names
- From: Logan Shaw <lshaw-usenet@xxxxxxxxxxxxx>
- Date: Thu, 14 Sep 2006 06:44:21 GMT
sprash wrote:
To put it simply, the module I am writing essentially reads the list of
last names of Patients from a CSV file which is a very large list.
Next, based on the inputs I get from another application, I need to
manipulate this list by searching and inserting new last names and also
delete certain last names from my list.
There is existing legacy code that is implementing this using chaining
( function is based on the first character of the last name). This
causes a lot of collision for common characters such as A, D etc and
very few for X, Q etc. Basically the distribution is understandably
uneven.
I am wondering if there are any better ideas out there.
Almost anything is better than hashing only on the first letter of
the name!
One fairly simple thing to do is to do a sort of arithmetic coding
thing on the first 3 letters. For example, for the name Shaw, you'd
convert every digit into a number from 0 (for "A") to 25 (for "Z"):
S -> 18
h -> 7
a -> 0
Now, treat (18,7,0) as a base-26 number:
hashcode == 18 * 26^2 + 7 * 26^1 + 0 * 26^0
== ((18 * 26) + 7) * 26 + 0
Then create 26^3, or 17576 hash buckets. Two names will collide only
if they start with the same first 3 letters.
A simple implementation (that cheats a bit by not actually adjusting
it so that "A" is 0, since it doesn't matter if it's out of phase as
long as each letter has its own value):
int compute_hashcode (const char *lastname)
{
int hashcode;
if (lastname[0] == 0) { return 0; }
h = lastname[0] % 26;
if (lastname[1] == 0) { return h; }
h = h * 26 + lastname[1] % 26;
if (lastname[2] == 0) { return h; }
h = h * 26 + lastname[2] % 26;
return h;
}
Actually, doing the "% 26" is probably totally unnecessary, but it
does ensure your return value will always be in the range (0..17575).
Of course, there are a zillion general-purpose string hashing algorithms
out there as well. They might perform better than this one, who knows.
- Logan
.
- References:
- Hashing for big list of last names
- From: sprash25
- Re: Hashing for big list of last names
- From: Ben Pfaff
- Re: Hashing for big list of last names
- From: sprash
- Hashing for big list of last names
- Prev by Date: Re: A survey for the school project
- Next by Date: Re: A survey for the school project
- Previous by thread: Re: Hashing for big list of last names
- Next by thread: Re: Hashing for big list of last names
- Index(es):
Relevant Pages
|