Re: How to check for duplicates?

From: aniesen (axeln_at_interform.cc)
Date: 11/16/03


Date: Sat, 15 Nov 2003 17:52:29 -0800

Zoran:

I don't want to spoil your optimism, but one problem remains: some of
the fields, address in particular, may be spelled differently even though
they are one and the same. For example, '240 S. Brentwood St.' is the same as
'240 South Brentwood' or ' 240 S. Brentwood Street', and so on. These are all
valid postal addresses. So I guess your success depends heavily on how
uniformly your data is formatted.
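
To make the point concrete, here is a rough Python sketch of the key scheme
described in the quoted message below (concatenate, lowercase, keep only a-z
and 0-9, MD5). The function name, the sample name and the zip code are made up
for illustration; this is not the original Delphi/NexusDB code.

```python
import hashlib
import re


def dedup_key(name: str, street: str, zip_code: str) -> str:
    """Concatenate the fields, lowercase, keep only a-z and 0-9,
    then return the MD5 hex digest (32 characters)."""
    raw = (name + street + zip_code).lower()
    cleaned = re.sub(r"[^a-z0-9]", "", raw)
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()


# Hypothetical record; only the spelling of the street differs.
variants = [
    "240 S. Brentwood St.",
    "240 South Brentwood",
    " 240 S. Brentwood Street",
]
keys = [dedup_key("John Doe", street, "63105") for street in variants]

# Each spelling yields its own key, so the hash lookup sees three
# different records even though the postal address is the same.
distinct_keys = len(set(keys))
```

Stripping punctuation and case removes some variation ('240 S. Brentwood St.'
and '240 s brentwood st' get the same key), but abbreviations such as
S./South or St./Street would have to be canonicalized before hashing.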

-- 
Axel Niesen                                  axeln@interform.cc
interFORM Consulting Corp.         http://www.interform.cc
(866) 503-6005
Don't always say what you know,
but always know what you say!
                                                  Matthias Claudius
                                                  1740-1815
<Zoran> wrote in message news:3fb63c1f@newsgroups.borland.com...
> Hi Axel;
>
> Unfortunately NexusDB does not support stored procedures yet. I think I've
> solved the problem.
>
> This is what I did (if you are interested): on the existing table I
> concatenated the strings for name, street and zip. Then I lowercased the
> result and took out all characters except a-z and 0-9. Then I made a hash
> string (32 bytes long) using an MD5 hash (which Ignacio recommended). I
> created an additional key on that field. On a table with 3 million records
> I am not sure whether I have some duplicates or not. I don't know much
> about hash algorithms, but the hashes look like unique strings to me. When
> inserting, I create the same key out of the input fields and check it
> against the table. If the key exists, I go into a loop and check the input
> fields (name, address, zip) against the same fields in the existing table.
>
> No big deal, but it looks like it works for me.
>
> If the hash string is guaranteed to be unique, then this looping makes no
> sense. I have to learn more about hash procedures. Do you know some web
> site where I can find some information about hashing?
>
> Thanks for your time.
>
> Zoran.
>
> "aniesen" <axeln@interform.cc> wrote in message
> news:3fb56841$1@newsgroups.borland.com...
> > Zoran:
> >
> > I don't know anything about NexusDB, but if it supports stored
> > procedures you can speed up the verification process immensely. Just use
> > a nested loop for comparison and leave the nested loop if there is no
> > match, returning FALSE. Return TRUE if it finishes the loop.
> >
> > -- 
> > Axel Niesen                                  axeln@interform.cc
> > interFORM Consulting Corp.         http://www.interform.cc
> > (866) 503-6005
> >
> > Don't always say what you know,
> > but always know what you say!
> >                                                   Matthias Claudius
> >                                                   1740-1815
>
>
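
On the question of whether the loop after the key lookup is needed: an MD5
hash is not guaranteed to be unique, and concatenating the fields before
cleaning can also give two different records the same key. The sketch below
keeps the field comparison for that reason. It is only an illustration in
Python, with a dictionary standing in for the table and its extra key; all
names in it are hypothetical.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and keep only a-z and 0-9."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def dedup_key(fields):
    """MD5 hex digest of the concatenated, normalized fields."""
    joined = "".join(normalize(f) for f in fields)
    return hashlib.md5(joined.encode("ascii")).hexdigest()


# Stand-in for the table: hash key -> list of normalized records.
index = {}


def insert_if_new(name, street, zip_code):
    """Return True if the record was inserted, False if it is a duplicate."""
    record = (normalize(name), normalize(street), normalize(zip_code))
    candidates = index.setdefault(dedup_key(record), [])
    # Same key is not proof of the same record, so compare field by field.
    if record in candidates:
        return False
    candidates.append(record)
    return True
```

For example, ("Ann", "1 Elm", "12345") and ("Ann1", "Elm", "12345") clean to
the same concatenated string and therefore the same key, yet the field
comparison still treats them as two separate records.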

