Problems matching or parsing with delimiters in text

From: Kevin Zembower (KZEMBOWE_at_jhuccp.org)
Date: 03/28/05


Date: Mon, 28 Mar 2005 11:13:05 -0500
To: beginners@perl.org

I'm trying to read in text lines from a file that look like this:
"B-B01","Eng","Binder for Complete Set of Population Reports",13,0
"C-CD01","Eng","The Condoms CD-ROM",12,1
"F-J41a","Fre",,13,1
"F-J41a","SPA",,13,1
"M-FC01","Eng","Africa Flip Charts- Planning Your Family (E,F, Swahili)(12""x9"")",7,1
"M-FC01","Fre","Africa Flip Charts- Planning Your Family (E,F, Swahili)(12""x9"")",7,1

The first two lines are typical of most of the file. The third and fourth have an empty third field, and the last two show embedded commas and escaped (doubled) double quotes in the third field. The file is the output of another program, but I can filter it and make substitutions if that makes anything easier.

I'm trying to parse it with these statements:
while (<>) { # While there are more records in the inventory export file called on the command line
   ++$ln; #increment the line number count
   my ($partno, $language, $title, $cost, $available) = m["(.*)","(.*)","?(.*?)"?,(.*),(.*)$];
   print "PN=$partno, L=$language, T=$title, C=$cost, A=$available\n" if $debug;
   next if $debug;
   createlangversion($partno, $language, $title, $cost, $available);
} #while there are more lines in the import data file

The output looks like this:
kevinz@www:~/public_html/orderDB/obsolete$ ./loadInventory.pl ../tmp/t
PN=B-B01, L=Eng, T=Binder for Complete Set of Population Reports, C=13, A=0
PN=C-CD01, L=Eng, T=The Condoms CD-ROM, C=12, A=1
PN=F-J41a, L=Fre, T=, C=13, A=1
PN=F-J41a, L=SPA, T=, C=13, A=1
PN=M-FC01, L=Eng, T=Africa Flip Charts- Planning Your Family (E, C=F, Swahili)(12""x9"")",7, A=1
PN=M-FC01, L=Fre, T=Africa Flip Charts- Planning Your Family (E, C=F, Swahili)(12""x9"")",7, A=1
kevinz@www:~/public_html/orderDB/obsolete$

Note that the first four lines parsed correctly, but the last two did not: the title was cut off at its first embedded comma, and $cost picked up the rest of the title along with the real cost.
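
The mis-split can be reproduced in isolation. The sketch below applies the same pattern to one of the problem records; the array name @f is only for illustration. The likely cause is that the third group is non-greedy and its closing quote is optional, so the group is allowed to stop at the first comma it meets, even when that comma sits inside the quoted title.

```perl
use strict;
use warnings;

# One of the problem records from the sample data above.
my $line = '"M-FC01","Eng","Africa Flip Charts- Planning Your Family (E,F, Swahili)(12""x9"")",7,1';

# Same pattern as in the loop above. (.*?) is non-greedy and the quote
# after it is optional ("?), so the third group ends at the first comma.
my @f = $line =~ m["(.*)","(.*)","?(.*?)"?,(.*),(.*)$];

print "title = $f[2]\n";   # stops at "(E", just before the first embedded comma
print "cost  = $f[3]\n";   # holds the rest of the title plus the real cost
```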

Can anyone help me write a match that would parse all of these lines correctly? Extra bonus points for explaining it thoroughly, so I don't have to ask this question here again. If it's easier to just filter or substitute in the original input file, what should I do?
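
One possible shape for such a match, offered as a minimal sketch and not a definitive answer: instead of one pattern for the whole line, match one field at a time, where a field is either a quoted string (in which a doubled quote stands for a literal quote) or a bare run of characters with no commas or quotes. The sub name parse_record is hypothetical, and the sketch assumes every record sits on a single line. A CSV module such as Text::CSV from CPAN is the more usual tool for this format.

```perl
use strict;
use warnings;

# Hypothetical helper: split one CSV record into its fields.
# Assumes no field contains an embedded newline.
sub parse_record {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;                 # drop the line ending
    my @fields;
    while ($line =~ m{
            \G                            # resume where the last field ended
            (?: "((?:[^"]|"")*)"          # quoted field; "" is an escaped quote
              | ([^,"]*)                  # or a bare field (may be empty)
            )
            (,|\z)                        # a field ends at a comma or end of line
        }gx) {
        # Copy the captures before running another regex on them.
        my ($quoted, $bare, $sep) = ($1, $2, $3);
        if (defined $quoted) {
            $quoted =~ s/""/"/g;          # turn doubled quotes back into one
            push @fields, $quoted;
        }
        else {
            push @fields, $bare;
        }
        last if $sep eq '';               # reached the end of the line
    }
    # A malformed line simply stops the loop early and yields fewer fields.
    return @fields;
}

my @samples = (
    '"B-B01","Eng","Binder for Complete Set of Population Reports",13,0',
    '"F-J41a","Fre",,13,1',
    '"M-FC01","Eng","Africa Flip Charts- Planning Your Family (E,F, Swahili)(12""x9"")",7,1',
);

for my $record (@samples) {
    my ($partno, $language, $title, $cost, $available) = parse_record($record);
    print "PN=$partno, L=$language, T=$title, C=$cost, A=$available\n";
}
```

Because the quoted alternative only lets a quote through when it is doubled, a comma inside the title can never end the field; only the comma after the closing quote can.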

Thank you all in advance for your help and suggestions.

-Kevin Zembower


