Re: efficient comparison



bob@xxxxxxxxxxxxxx said:

Let's say you have a set of names called set A. You also have a
different
set of names called set B. What is the most efficient way of
comparing the sets?

By comparison, I mean finding the names that need to be added to set B
and
deleted from set B to form set A.

Here is an O(n log n) method. It may not be the most efficient, but it's
a useful benchmark, since anything *less* efficient than it is
necessarily not the *most* efficient! :-)

The O(n log n) complexity comes from the two sort steps.

(Query: the algorithm is actually O(a log a + b log b + a + b) - am I
right in thinking that this reduces to O(n log n) where n = a + b?)

This method assumes that each set is unique - i.e. contains no
duplicates. Adjusting for duplicates is not difficult, but I wanted to
keep the explanation as lucid and brief as possible.

Sort A.
Sort B.
Let Ac = current record in A (start with the first).
Let Bc = current record in B (start with the first).
Let C be the (initially empty) list of records in A but not in B.
Let D be the (initially empty) list of records in B but not in A.

If we try to read a record from A into Ac and it fails, we'll call Ac
null. Same with B into Bc - failure->null.

While Ac is not null and Bc is not null
If Ac is less than Bc
Ac does not exist in B - add Ac to C
Read record from A into Ac
Else if Bc is less than Ac
Bc does not exist in A - add Bc to D
Read record from B into Bc
Else
Records match
Read record from A into Ac
Read record from B into Bc
Endif
Endwhile
While Ac is not null (might not happen)
Ac does not exist in B - add Ac to C
Read record from A into Ac
Endwhile
While Bc is not null (might not happen)
Bc does not exist in A - add Bc to D
Read record from B into Bc
Endwhile

C now contains all records found in A but not B.
D now contains all records found in B but not A.

--
Richard Heathfield <http://www.cpax.org.uk>
Email: -www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999
.



Relevant Pages

  • Re: Error Testing
    ... "I've been Googling for websites explaining this sort of stats, ... using Fisher's Exact Test. ... data of this sort (binomial, comparing two fractions, etc.) has a long ...
    (sci.stat.edu)
  • Re: Error Testing
    ... "I've been Googling for websites explaining this sort of stats, ... using Fisher's Exact Test. ... data of this sort (binomial, comparing two fractions, etc.) has a long ...
    (sci.stat.edu)
  • Re: $a and $b "reserved" for sort()
    ... a bc beta alpha gamma ... which is not the correct sort order. ... } qw(a bc alpha beta gamma); ... comparing a and World ...
    (comp.lang.perl.misc)
  • Re: CallByAddress
    ... interface e.g for comparing an item with another item of the same class ... and returning an negative int, ... But this implies that the sorted class has to implement this interface. ... That implements a simple class to sort a collection. ...
    (microsoft.public.vb.winapi)
  • merging directories
    ... do you know any scripts around in order to merge two directories? ... Thus a first step would be to identify duplicates, e.g. by an ls -l, ... comparing the contents (e.g. by comparing an md5 ... Duplicate file names with different content should be handled somehow: ...
    (comp.unix.shell)