Re: Difficult text file to parse.
- From: Jim Gibson <jgibson@xxxxxxxxxxxxxxxxx>
- Date: Sun, 11 Sep 2005 13:59:56 -0700
In article <1126469324.440200.171070@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>,
<"richardkreidl@xxxxxxxxxxxxxxxxxxxxxx"> wrote:
> Basically, I have a large input file which is delimited by the pipe '|'
> symbol . Records in the file can have the same data in field 1 and
> field 3.
> Example the first six records are the same except for field 2.
>
> What I need is to match on field 1 for a possible of 4 matches and no
> more than that.
> Then take the names from field 2 and create a new record like the first
> one in my Output file below.
>
> If the match on field 1 is less than 4 records like the second set of
> records are which there are only two, look at the output file below to
> see how it would be displayed. I want to show the delimiters even if
> there is no data to show.
>
> I hope I explained everything correctly. I think a hash would be the
> best way to approach this problem. I'm not good on using hashes.
>
>
> My sample Input file: Input.txt
>
[sample input and output files with long fields snipped]
What have you tried so far? You are expected to have made some effort
at solving your problem before asking for help.
Here is something that might get you started:
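#!/usr/local/bin/perl
#
use warnings;
use strict;

# Group records by field 1: collect the names (field 2), keep the comment (field 3).
my %data;
while (<DATA>) {
    chomp;
    my ($x, $y, $z) = split(/\s*\|\s*/);
    $data{$x}{count}++;
    push( @{$data{$x}{names}}, $y );
    $data{$x}{comment} = $z;
}

# Print at most four names per key; missing names come out as empty fields.
foreach my $x ( sort keys %data ) {
    # The slice yields undef when a key has fewer than 4 names.
    no warnings 'uninitialized';
    print "$x | ", join(" | ", @{$data{$x}{names}}[0..3]),
        " | $data{$x}{comment}\n";
}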
__DATA__
tagA | TOM JONES | Comment-1
tagA | RICH STEVENS | Comment-1
tagA | SUE LONG | Comment-1
tagA | TIM MAYS | Comment-1
tagA | BOB SMITH | Comment-1
tagA | STEVE WILLS | Comment-1
tagB | ALEXIS KING | Comment-2
tagB | MIKE JONES | Comment-2
tagC | DON RAINS | Comment-3
tagD | SCOTT FRANKS | Comment-4
tagD | CRAIG GRAVES | Comment-4
tagD | DB2UDB | Comment-4
__END__
... which produces the output:
tagA | TOM JONES | RICH STEVENS | SUE LONG | TIM MAYS | Comment-1
tagB | ALEXIS KING | MIKE JONES |  |  | Comment-2
tagC | DON RAINS |  |  |  | Comment-3
tagD | SCOTT FRANKS | CRAIG GRAVES | DB2UDB |  | Comment-4
This is similar to what you want.
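The program above relies on switching warnings off so that the undefined
slice elements print as empty strings. If you would rather keep warnings
on, here is a minimal sketch that pads the name list explicitly. The
helpers pad_names and format_record are names I made up for illustration;
they are not part of the program above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return exactly $width names, dropping any extras
# and padding missing ones with empty strings.
sub pad_names {
    my ($names, $width) = @_;
    my @out = @{$names}[ 0 .. $width - 1 ];      # slice may contain undef
    return map { defined $_ ? $_ : '' } @out;    # replace undef with ''
}

# Hypothetical helper: build one output line from a key, its names,
# and its comment, always showing four name fields.
sub format_record {
    my ($key, $names, $comment) = @_;
    return join(' | ', $key, pad_names($names, 4), $comment);
}

print format_record('tagB', [ 'ALEXIS KING', 'MIKE JONES' ], 'Comment-2'), "\n";
```

The delimiters are still printed for the empty fields, and no warnings
need to be suppressed.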
Notes:
1. I have reduced your data fields to a shorter version for posting.
You might think about that in the future. Having lines wrap makes it
harder for people to cut-and-paste your posts, which makes it less
likely you will get help.
2. Output is sorted by the first field, not kept in input order. If input
order is important to you, you will have to modify the program. You can,
for example, add a sequence number to each aggregated record the first
time its key is seen (e.g., $data{$x}{sequence} = $sequence++;) and sort
on the sequence number before printing. You could also print out the data
(and clear the elements) in the while loop whenever you have 4 matching
lines (e.g., $data{$x}{count} == 4).
3. Your use of spaces surrounding fields is a little inconsistent. I
have added spaces in empty fields. If this is not correct for your
application, you will have to modify it. You could split on just the
'|' character and join with '|' as well.
4. You have not said what to do if more than 4 lines match in the first
field. This program discards those records, as was done in your sample
output.
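To illustrate note 2, here is a rough sketch of the sequence-number idea.
It assumes the same three-field layout as the sample data; the subroutine
names aggregate and report are my own invention, and the input is passed
in as a list of lines rather than read from a file, just to keep the
example self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Aggregate pipe-delimited lines by field 1, remembering first-seen order.
sub aggregate {
    my @lines = @_;
    my %data;
    my $sequence = 0;
    for my $line (@lines) {
        chomp $line;
        my ($key, $name, $comment) = split /\s*\|\s*/, $line;
        # Assign the sequence number only the first time a key appears.
        $data{$key}{sequence} = $sequence++ unless exists $data{$key};
        push @{ $data{$key}{names} }, $name;
        $data{$key}{comment} = $comment;
    }
    return \%data;
}

# Return one output line per key, in input order, with four name fields.
sub report {
    my ($data) = @_;
    my @keys = sort { $data->{$a}{sequence} <=> $data->{$b}{sequence} }
               keys %$data;
    my @out;
    for my $key (@keys) {
        my @names = @{ $data->{$key}{names} }[ 0 .. 3 ];   # extras are dropped
        @names = map { defined $_ ? $_ : '' } @names;      # pad short groups
        push @out, join(' | ', $key, @names, $data->{$key}{comment});
    }
    return @out;
}

my @input = (
    "tagD | SCOTT FRANKS | Comment-4\n",
    "tagA | TOM JONES | Comment-1\n",
    "tagD | CRAIG GRAVES | Comment-4\n",
);
print "$_\n" for report(aggregate(@input));
```

With this input the tagD record is printed before the tagA record,
because tagD was seen first. As in the original program, any names past
the fourth for a given key are discarded.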
Good luck!