Re: Difficult text file to parse.



In article <1126469324.440200.171070@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>,
<"richardkreidl@xxxxxxxxxxxxxxxxxxxxxx"> wrote:

> Basically, I have a large input file which is delimited by the pipe '|'
> symbol . Records in the file can have the same data in field 1 and
> field 3.
> Example the first six records are the same except for field 2.
>
> What I need is to match on field 1 for a possible of 4 matches and no
> more than that.
> Then take the names from field 2 and create a new record like the first
> one in my Output file below.
>
> If the match on field 1 is less than 4 records like the second set of
> records are which there are only two, look at the output file below to
> see how it would be displayed. I want to show the delimiters even if
> there is no data to show.
>
> I hope I explained everything correctly. I think a hash would be the
> best way to approach this problem. I'm not good on using hashes.
>
>
> My sample Input file: Input.txt
>

[sample input and output files with long fields snipped]

What have you tried so far? You are expected to have made some effort
at solving your problem before asking for help.

Here is something that might get you started:

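#!/usr/local/bin/perl
#
use warnings;
use strict;

my %data;
while (<DATA>) {
    chomp;
    # Split on '|' and discard any whitespace around it.
    my ($x, $y, $z) = split(/\s*\|\s*/);
    $data{$x}{count}++;
    push( @{$data{$x}{names}}, $y );
    $data{$x}{comment} = $z;
}
foreach my $x ( sort keys %data ) {
    # Slots past the end of the names array are undef; silence the
    # resulting "uninitialized value" warnings in join.
    no warnings 'uninitialized';
    print "$x | ", join(" | ", @{$data{$x}{names}}[0..3]),
        " | $data{$x}{comment}\n";
}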

__DATA__
tagA | TOM JONES | Comment-1
tagA | RICH STEVENS | Comment-1
tagA | SUE LONG | Comment-1
tagA | TIM MAYS | Comment-1
tagA | BOB SMITH | Comment-1
tagA | STEVE WILLS | Comment-1
tagB | ALEXIS KING | Comment-2
tagB | MIKE JONES | Comment-2
tagC | DON RAINS | Comment-3
tagD | SCOTT FRANKS | Comment-4
tagD | CRAIG GRAVES | Comment-4
tagD | DB2UDB | Comment-4
__END__


... which produces the output:


tagA | TOM JONES | RICH STEVENS | SUE LONG | TIM MAYS | Comment-1
tagB | ALEXIS KING | MIKE JONES |  |  | Comment-2
tagC | DON RAINS |  |  |  | Comment-3
tagD | SCOTT FRANKS | CRAIG GRAVES | DB2UDB |  | Comment-4

This is similar to what you want.

Notes:

1. I have reduced your data fields to a shorter version for posting.
You might think about that in the future. Having lines wrap makes it
harder for people to cut-and-paste your posts, which makes it less
likely you will get help.

2. Output is sorted by the first field, which is not necessarily the
same order as the input. If input order is important to you, you will
have to modify the program. You can, for example, add a sequence number
to each aggregated record the first time its key is seen (e.g.,
$data{$x}{sequence} = $sequence++ unless exists $data{$x};) and sort on
the sequence number before printing. You could also print out the data
(and clear the elements) in the while loop whenever you have 4 matching
lines (e.g., $data{$x}{count} == 4), then print any incomplete groups
after the loop ends.

3. Your use of spaces surrounding fields is a little inconsistent. I
have added spaces in empty fields. If this is not correct for your
application, you will have to modify it. You could split on just the
'|' character and join with '|' as well.

4. You have not said what to do if more than 4 lines match in the first
field. This program discards those records, as was done in your sample
output.
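As a rough sketch of the sequence-number idea from note 2, something
like the following should keep the groups in first-seen input order.
The sub name aggregate() and the inline @sample list are my own
inventions for illustration, not anything from your files, and I am
assuming that empty name fields should come out as empty strings:

```perl
#!/usr/local/bin/perl
use warnings;
use strict;

# Group pipe-delimited lines by field 1, remembering the order in
# which each key was first seen. (aggregate is a hypothetical name.)
sub aggregate {
    my @lines = @_;
    my %data;
    my $sequence = 0;
    foreach my $line (@lines) {
        chomp $line;
        my ($x, $y, $z) = split /\s*\|\s*/, $line;
        # Assign the sequence number only the first time a key appears.
        $data{$x}{sequence} = $sequence++ unless exists $data{$x};
        push @{ $data{$x}{names} }, $y;
        $data{$x}{comment} = $z;
    }
    my @out;
    foreach my $x ( sort { $data{$a}{sequence} <=> $data{$b}{sequence} }
                    keys %data ) {
        # Always emit four name fields: missing ones become empty
        # strings, and names beyond the fourth are dropped.
        my @names = map { defined $_ ? $_ : '' }
                    @{ $data{$x}{names} }[0..3];
        push @out, join(' | ', $x, @names, $data{$x}{comment});
    }
    return @out;
}

# Made-up sample with interleaved keys, to show the ordering.
my @sample = (
    'tagD | SCOTT FRANKS | Comment-4',
    'tagB | ALEXIS KING | Comment-2',
    'tagD | CRAIG GRAVES | Comment-4',
    'tagB | MIKE JONES | Comment-2',
);
print "$_\n" for aggregate(@sample);
```

With that sample, tagD is printed before tagB because it appeared
first in the input. Mapping the undefined slots to empty strings also
removes the need for the "no warnings" line.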

Good luck!