standardising spellings

From: Dermot Paikkos (dermot_at_sciencephoto.co.uk)
Date: 02/24/05

    To: beginners@perl.org
    Date: Thu, 24 Feb 2005 18:01:50 -0000
    
    

    Hi,

    I have a list of about 650 names (a small sample is below) that I
    need to import into a database. When you look at the list there are
    some obvious duplicates that are spelt slightly differently. I can
    rationalize some of the data with some simple substitutions but some
    of the data looks almost impossible to parse programmatically.
    Here is what I have done so far - it's not much:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $file = "myfile.csv";
    # three-argument open with a lexical filehandle
    open(my $fh, '<', $file) or die "Can't open $file: $!\n";
    while (<$fh>) {
            chomp;
            s/&/and/g;       # change every & to and
            s/"//g;          # remove any quotes
            s/\s+$//;        # remove any trailing white space
            s{\s*/\s*}{/}g;  # remove any space around slashes
            s/,$//;          # remove any trailing comma
            print "$_\n";
    }
    close($fh);

    Are there any other techniques that I can use to help standardise the
    list? I know I am going to have to look at the list manually and sort
    it but I thought there might be some way to give myself a head start.

    If I could, I would like to generate a CSV file so that the first
    field contains the first appearance of a name and, if there are any
    near hits, these appear in the second and third fields, e.g.:
    "Alan and Sandy Carey, Alan & Sandy Carey\n"
    "Alan Carey, Alan D Carey\n"
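
    The grouping described above could be sketched roughly as follows.
    This is only an illustration under stated assumptions: name_key and
    group_names are hypothetical helper names, and the key is a crude
    one (lower case, "&" treated as "and", punctuation dropped), so it
    only catches variants that differ in case, punctuation or "&".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reduce a name to a crude comparison key.
sub name_key {
    my ($name) = @_;
    my $key = lc $name;
    $key =~ s/&/ and /g;      # treat "&" and "and" alike
    $key =~ s/[".,]//g;       # drop quotes, full stops and commas
    $key =~ s/\s+/ /g;        # collapse runs of white space
    $key =~ s/^ | $//g;       # trim both ends
    return $key;
}

# Group names by key, keeping the order of first appearance.
# Returns a list of array refs, one per group.
sub group_names {
    my @names = @_;
    my (%group, @order);
    for my $name (@names) {
        my $key = name_key($name);
        push @order, $key unless exists $group{$key};
        push @{ $group{$key} }, $name;
    }
    return map { $group{$_} } @order;
}

my @sample = (
    'Alan and Sandy Carey', 'Alan & Sandy Carey',
    'Kees van den Berg',    'Kees Van Den Berg',
    'Jack Fields',
);

# One CSV line per group: first appearance first, variants after it.
print join(',', map { qq("$_") } @$_), "\n" for group_names(@sample);
```

    Each field is quoted separately here so that names containing
    commas (such as "Leonard Lessin, FBPA") stay in one field.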

    I know it's a tall order but does anyone have any ideas?
    Thanx.
    Dp.

    FYI: Bachmann and Bachman are different people but I suspect William
    D. is also Bill Bachman.
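
    Spellings that differ by a letter or two (Bachman/Bachmann,
    Water/Waters) could be flagged for manual review with an edit
    distance check. The sketch below uses a plain Levenshtein distance
    written in pure Perl; edit_distance and near_pairs are hypothetical
    names, and the threshold of 2 is an arbitrary assumption. Because
    such pairs can be different people, it only reports candidates and
    does not merge them. It will not link nicknames such as
    Bill/William.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min);

# Classic dynamic-programming edit distance (Levenshtein).
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length $t);
    for my $i (1 .. length $s) {
        my @cur = ($i);
        for my $j (1 .. length $t) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,          # deletion
                $cur[$j - 1] + 1,       # insertion
                $prev[$j - 1] + $cost,  # substitution
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Return [name1, name2, distance] for every pair of names that is
# within $max edits of each other, compared case-insensitively.
sub near_pairs {
    my ($max, @names) = @_;
    my @pairs;
    for my $i (0 .. $#names - 1) {
        for my $j ($i + 1 .. $#names) {
            my $d = edit_distance(lc $names[$i], lc $names[$j]);
            push @pairs, [ $names[$i], $names[$j], $d ] if $d <= $max;
        }
    }
    return @pairs;
}

my @sample = ('Bill Bachman', 'Bill Bachmann', 'J. Water and A. Salic',
              'J. Waters and A. Salic', 'Jack Fields', 'Jack Rosen');
printf "%s <=> %s (distance %d)\n", @$_ for near_pairs(2, @sample);
```

    Comparing every pair is quadratic, which should be acceptable for
    a list of about 650 names.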

    === Sample data ==========
    Alan and Sandy Carey
    Alan & Sandy Carey
    Alan Carey
    Alan D. Carey
    Leonard Lee Rue III
    Leonard Lessin
    "Leonard Lessin, FBPA "
    Bill Bachman
    Bill Bachmann
    William D. Bachman
    Fred McConnaughey
    Frederica Georgia
    Frederick Ayer III
    Frederick R. McConnaughey
    Greg Dimijian
    Gregory G. Dimijian
    "Gregory G. Dimijian, M.D. "
    Herve Donnezan
    Howard Uible
    Hubertus Kanus
    Inger McCabe Elliott
    Irene Vandermolen
    J. Gerard Smith
    J. L. G. Grande
    J. Water and A. Salic
    J. Waters and A. Salic
    Jack Fields
    Jack Rosen
    Daniel Bernstein
    Dan Bernstein
    David Schleser
    Kees van den Berg
    Kees Van Den Berg

