Reading poorly structured data
From: Alan Mead (amead_at_comcast.net)
Date: 12/08/04
- Next message: Scott Bryce: "Re: RegExp Help"
- Previous message: Jorge: "horizontal join of array elements"
- Next in thread: A. Sinan Unur: "Re: Reading poorly structured data"
- Reply: A. Sinan Unur: "Re: Reading poorly structured data"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Tue, 07 Dec 2004 20:40:14 -0600
I have five files of contact info (one for each year of a conference).
All five have slightly different fairly unstructured formats. One looks
like this:
Bush, George, President, 1 White House Way, Washington,
DC 00000; gbush@whitehouse.gov
Kerry, John, 1 Main, Detroit, MI 00000; jkerry@yahoo.com
Williams, Robin, 2 Main, Burbank, CA 00000
Newman, Paul, President and Principal Spokesperson,
Paul Newmans's Own Brand Foods, 123 Main Street,
Olympia Fields, WY 00000; paul@newmans.org
Blair, Tony, 1 Downing Street, London, UK 0000000
... etc..
So the fields are comma-separated, except for the email, which follows a
semicolon and may be absent, and a record may be split over two or three
lines. In a later file, dozens of records appear on the same line.
I'd like to output
lname=Bush
fname=George
address=President, 1 White House Way, Washington, DC 00000
email=gbush@whitehouse.gov
Any ideas how to parse this using Perl? So far I can parse about 60% of
the records with the hack below. It gets tripped up when a record
contains many commas (some people have five lines of address with
embedded commas): in those cases it parses the first half of the
record fairly well and then tries to parse the second half as a
new record.
-Alan
my $i=0;
while($i<=$count) {
    $i++;
    my($lname,$fname,$address,$email)=('','','','');
    my $line = $lines{$i};
    if ($line =~ /[,;]$/) {                # clearly more on next line
        $lines{$i+1} = "$line $lines{$i+1}";
        next;
    }
    if ( (scalar split/,/,$line) > 4) {    # a proper name and address will
                                           # have at least 5 parts
        if ($line =~ /@/) {
            my @bits = split(/;/,$line);   # email is last element when split
                                           # on semicolons, so save it
            $email = pop(@bits);
            $line = join(';',@bits);       # put line back together (just
                                           # in case there's more than one
                                           # semicolon in the record)
        }
        my @bits = split(/,/,$line);       # now split on commas
        $lname = shift @bits;              # lname is first bit
        $fname = shift @bits;              # followed by fname
        $address = join(',',@bits);        # the rest is the address
    } else {
        $lines{$i+1} = "$line $lines{$i+1}";
        next;
    }
    ...
}
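One alternative to merging lines by counting commas, offered only as a minimal sketch: ignore the line breaks entirely and treat the postal code as the end-of-record marker. This assumes every record ends in a run of five or more digits, optionally followed by "; email", which holds for the sample above but may not hold for all five files (a five-digit house number would break it, for example). The name parse_contacts and the hash keys are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the whole file as one string and returns a
# list of hash refs with lname, fname, address and email.
# ASSUMPTION: a record ends with a postal code of 5+ digits, optionally
# followed by "; email". Nothing else in a record has 5+ digits in a row.
sub parse_contacts {
    my ($text) = @_;

    # Undo the line wrapping so that records split over several lines and
    # several records sharing one line look the same.
    $text =~ s/\s+/ /g;

    my @records;
    # \G resumes where the previous match stopped, so records are consumed
    # one after another; the loop stops at the first text that does not fit.
    while ($text =~ /\G\s*(.+?\b\d{5,}\b)(?:\s*;\s*(\S+\@\S+))?/g) {
        my ($body, $email) = ($1, $2);

        # The first two comma-separated fields are the name; everything
        # else, however many commas it contains, is the address.
        my ($lname, $fname, @rest) = split /\s*,\s*/, $body;

        push @records, {
            lname   => $lname,
            fname   => $fname,
            address => join(', ', @rest),
            email   => (defined($email) ? $email : ''),
        };
    }
    return @records;
}

# Usage with a few of the sample records from above.
my $sample = <<'END';
Bush, George, President, 1 White House Way, Washington,
DC 00000; gbush@whitehouse.gov
Kerry, John, 1 Main, Detroit, MI 00000; jkerry@yahoo.com
Williams, Robin, 2 Main, Burbank, CA 00000
END

for my $contact (parse_contacts($sample)) {
    print "$_=$contact->{$_}\n" for qw(lname fname address email);
    print "\n";
}
```

Because the record boundary comes from the postal code rather than from the number of commas, long addresses with many embedded commas should no longer be cut in half. Records that do not fit the assumed pattern stop the loop, so in practice it would be worth comparing the number of parsed records against a manual count.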