Re: Hashing for big list of last names



sprash wrote:
To put it simply, the module I am writing essentially reads the list of
last names of Patients from a CSV file which is a very large list.
Next, based on the inputs I get from another application, I need to
manipulate this list by searching and inserting new last names and also
delete certain last names from my list.

There is existing legacy code that is implementing this using chaining
( function is based on the first character of the last name). This
causes a lot of collision for common characters such as A, D etc and
very few for X, Q etc. Basically the distribution is understandably
uneven.

I am wondering if there are any better ideas out there.

Almost anything is better than hashing only on the first letter of
the name!

One fairly simple thing to do is to do a sort of arithmetic coding
thing on the first 3 letters. For example, for the name Shaw, you'd
convert every digit into a number from 0 (for "A") to 25 (for "Z"):

S -> 18
h -> 7
a -> 0

Now, treat (18,7,0) as a base-26 number:

hashcode == 18 * 26^2 + 7 * 26^1 + 0 * 26^0
== ((18 * 26) + 7) * 26 + 0

Then create 26^3, or 17576 hash buckets. Two names will collide only
if they start with the same first 3 letters.

A simple implementation (that cheats a bit by not actually adjusting
it so that "A" is 0, since it doesn't matter if it's out of phase as
long as each letter has its own value):

int compute_hashcode (const char *lastname)
{
int hashcode;

if (lastname[0] == 0) { return 0; }
h = lastname[0] % 26;

if (lastname[1] == 0) { return h; }
h = h * 26 + lastname[1] % 26;

if (lastname[2] == 0) { return h; }
h = h * 26 + lastname[2] % 26;

return h;
}

Actually, doing the "% 26" is probably totally unnecessary, but it
does ensure your return value will always be in the range (0..17575).

Of course, there are a zillion general-purpose string hashing algorithms
out there as well. They might perform better than this one, who knows.

- Logan
.



Relevant Pages

  • Re: Too Many Data Fields in Mail Merge
    ... All of our letters use a CSV file as its data source. ... One or two letters do prompt for the delimiter, ... Our data source comes out of a Letter Generation tool in PeopleSoft ...
    (microsoft.public.word.mailmerge.fields)
  • Re: Too Many Data Fields in Mail Merge
    ... All of our letters use a CSV file as its data source. ... Multiple users may merge the CSV files to letters, but all of the letters and CSV files exist on a common shared drive. ... Word always has to use its "text converter" connection method to open the data source if it has more than 255 columns, and that's what pops up the box asking for the two delimiter characters. ...
    (microsoft.public.word.mailmerge.fields)
  • Re: DO D1%I=1,10
    ... the one about consisting of letters, ... digits, and underscores, with the first character being a letter. ... variable, copying it into the structure as needed, perhaps as the first ... line of the DO loop (and maybe after the loop if you also need to catch ...
    (comp.lang.fortran)
  • Re: /bin/ed syntax
    ... It's more a case AFAIK that /bin/ed uses the first character after the ... for some dim dark historic reason (like loop counters are often ... Fortran reserved those letters for denoting ...
    (alt.os.linux)
  • Re: Word 2003 Autotext Entries
    ... I have several autotext entries that consist of credentials for ... All the letters are shown on ... would appear as a jumble within the first character. ... I am running Office 2003 on XP Pro. ...
    (microsoft.public.word.docmanagement)