Re: Algorithm help for unique string searching/counting within an array.



Paul Van Delst wrote:

I want to be able to search an array of strings (with many repeated elements) so I can count the unique elements. I've attached a working program at the bottom of the post so you can see what I've got so far.

If you expect duplicates a hash table is probably easier and faster.

You might find a hash routine with a built in hash function, otherwise
my favorite is the CRC32 polynomial, which is easy to implement using
a 256 element table and some shift and exclusive or functions. After computing the CRC32 value for each string, take that value modulo the hash table size. Easiest is a linked list at each hash position, but
there are other possibilities. Maybe you can find an already written hashing routine.

As others have said, there are languages which make this very easy,
such as perl and awk. Many C libraries have a hash routine, as does
Java. From reading a file with a list to writing the result is
about two lines of AWK, 20 of C, or 30 of Java. (For the latter two, most of the work is reading the input file and writing the result.)

Sorting should be O(N log N), even for an external sort. Hashing is usually considered O(N), though that depends on N not being too large.

I have done hash table counting for millions of unique entries (out of billions). You really want the hash table to fit in memory.
If it can't, then an external sort might be better.

-- glen

.



Relevant Pages

  • Re: Why is dictionary.keys() a list and not a set?
    ... >> If the hash changes, ... > equal objects hash same. ... hold unique elements, where "unique" depends on the use at hand. ...
    (comp.lang.python)
  • Re: code with style
    ... anno4000@lublin.zrz.tu-berlin.de (Anno Siegel) wrote: ... >We have seen a few solutions using a hash to count the unique elements ... >in each array. ...
    (comp.lang.perl.misc)
  • Re: Why is dictionary.keys() a list and not a set?
    ... Mike Meyer wrote: ... hold unique elements, where "unique" depends on the use at hand. ... which has same all the time and hash same as all other equal objects. ... Regards, ...
    (comp.lang.python)