Re: remove duplicate lines

On Friday, 27 May 2005, 13:56, Jack Daniels (Butch) wrote:
> Wow, I'm really confused. I'm trying to remove duplicate lines from a
> marc21 text file. I have spent countless hours searching for scripts etc.
>
> What I find frustrating while trying to learn Perl, is that most solutions
> assume you know what to do. For example, someone gives the code to find
> and replace, and that's it. In other words, if the complete script was
> there, I think I could learn much faster. I have no idea of how to put the
> code into a script.
>
> I did manage to find a few perl one liners but it removed the blank lines
> between the records, which must be retained in order to convert the file
> back to actual marc format before downloading into the database.
>
> It also removed non sequential lines if they were the same in another
> record. They must also be kept as they are an important part of the file.
>
> Any help would be more than appreciated. Below is part of a very large
> file. Approx. 100,000 records need to be processed. For now, I just want to
> remove adjacent duplicate fields.
>
> =LDR 01548cam 2200397La 45{92}0
> =001 ocm42328427\
> =003 OCoLC
> =005 20010526091201.0
> =006 m\\\\\\\\u\\\\\\\\
> =007 cr\cn-
> =008 831108s1984\\\\inua\\\\sb\\\\001\0\eng\d
> =010 \\$z 83048636
> =035 \\1234 (sirsi)
> =035 \\1234 (sirsi)
> =040 \\$aN{dollar}T$cN{dollar}T$dOCL
> =020 \\$a0585000905 (electronic bk.)
> =020 \\$z0253366062
> =020 \\$z0253203252
> =050 14$aNX180.F4$bL38 1984eb
> =082 04$a700/.88042$219
> =049 \\$aM7@A
> =100 1\$aLauter, Estella,$d1940-
> =245 10$aWomen as mythmakers$h[computer file] :$bpoetry and visual art by
> twentieth-century women /$cEstella Lauter. =260 \\$aBloomington :$bIndiana
> University Press,$cc1984.
> =300 \\$axvii, 267 p. :$bill. ;$c24 cm.
> =504 \\$aBibliography: p. 247-260.
> =500 \\$aIncludes index.
> =533 \\$aElectronic reproduction.$bBoulder, Colo.
> :$cNetLibrary,$d1999.$nAvailable via the World Wide Web.$nAvailable in
> multiple electronic file formats.$nAccess may be limited to NetLibrary
> affiliated libraries. =SUBJ \0$aFeminism and the arts.
> =SUBJ \0$aWomen artists.
> =SUBJ \0$aWomen poets.
> =SUBJ \0$aArt and mythology.
> =SUBJ \0$aArts, Modern$y20th century.
> =655 \7$aElectronic books.$2local
> =710 2\$aNetLibrary, Inc.
> =776 1\$cOriginal$w(DLC) 83048636$w(OCoLC)10162146
> =856 4\$3Bibliographic record
> display$uhttp://www.netlibrary.com/urlapi.asp?action=summary&v=1&bookid=652
>$zAn electronic book accessible through the World Wide Web; click for
> information =994 \\$a92$bM7@
>
> =LDR 01470cam 2200349La 45{92}0
> =001 ocm42328450\
> =003 OCoLC
> =005 20010526091202.0
> =006 m\\\\\\\\u\\\\\\\\
> =007 cr\cn-
> =008 980609s1998\\\\couab\\\sbf\\\001\0\eng\d
> =010 \\$z 98026266
> =035 \\1234 (sirsi)
> =035 \\1234 (sirsi)
> =040 \\$aN{dollar}T$cN{dollar}T$dOCL
> =020 \\$a0585001413 (electronic bk.)
> =020 \\$z1555662307
> =050 14$aQB581$b.L66 1998eb
> =082 04$a523.3$221
> =049 \\$aM7@A
> =100 1\$aLong, Kim.
> =245 14$aThe moon book$h[computer file] :$bfascinating facts about the
> magnificent, mysterious moon /$cKim Long ; science advisor, Larry Sessions.
> =250 \\$aRev. and expanded.
> =260 \\$aBoulder, Colo. :$bJohnson Books,$cc1998.
> =300 \\$a149 p. :$bill., maps ;$c22 cm.
> =500 \\$aIncludes 1 errata ***.
> =504 \\$aIncludes bibliographical references (p. 132-133) and index.
> =533 \\$aElectronic reproduction.$bBoulder, Colo.
> :$cNetLibrary,$d1999.$nAvailable via the World Wide Web.$nAvailable in
> multiple electronic file formats.$nAccess may be limited to NetLibrary
> affiliated libraries. =651 \0$aMoon$vHandbooks, manuals, etc.
> =655 \7$aElectronic books.$2local
> =710 2\$aNetLibrary, Inc.
> =776 1\$cOriginal$w(DLC) 98026266$w(OCoLC)39299241
> =856 4\$3Bibliographic record
> display$uhttp://www.netlibrary.com/urlapi.asp?action=summary&v=1&bookid=140
>$zAn electronic book accessible through the World Wide Web; click for
> information =994 \\$a92$bM7@
> =994 \\$a92$bM7@

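OK, the script below does this: whenever a line is identical to the line
directly before it, only the first copy is printed to stdout. A single blank
line between records and identical lines that are not adjacent are left
untouched.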

To call the script with an input file and write the result to an output file:
$ ./test10.pl inputfile > outputfile

The script (save it as test10.pl or any other name, and make it executable
with "chmod +x test10.pl"):

[BEGIN] # not part of the script
#!/usr/bin/perl
use strict;
use warnings;

my $last = ''; # line before the current line (empty before the first line)

while (<ARGV>) { # read the input file line by line until end of file
    # don't print the line if the previous line is the same
    print unless $_ eq $last;
    # remember the current line for
    # comparison with the next line
    $last = $_;
}
[END] # not part of the script


I have tested the script on your sample, and it removes these lines (input
line number in the first column):

10: =035 \\1234 (sirsi)
45: =035 \\1234 (sirsi)
66: =994 \\$a92$bM7@
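
If you want to convince yourself that the blank line between records and
the non-adjacent duplicates survive, here is a small self-contained sketch
of the same comparison wrapped in a sub. The name dedupe_adjacent and the
sample lines are made up for illustration; they are not part of the script
above or of your file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the script above): returns the given lines
# with every run of identical adjacent lines reduced to a single line.
sub dedupe_adjacent {
    my @lines = @_;
    my @kept;
    my $last;    # previous line; undefined before the first line
    for my $line (@lines) {
        # keep the line unless it equals the line directly before it
        push @kept, $line unless defined($last) && $line eq $last;
        $last = $line;
    }
    return @kept;
}

# Made-up sample: two short records separated by one blank line
my @sample = (
    '=001 rec1',
    '=035 dup',
    '=035 dup',
    '',
    '=001 rec2',
    '=035 dup',
    '=035 dup',
);

print "$_\n" for dedupe_adjacent(@sample);
```

This prints "=001 rec1", "=035 dup", an empty line, "=001 rec2" and
"=035 dup": the second record keeps its own "=035 dup" because only
directly adjacent lines are compared. One thing to be aware of: two blank
lines in a row would also be reduced to one, since they are adjacent and
identical.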


HTH, and just ask if you have further questions.

joe