techniques for handling large text files
From: Danl001 (danl001_at_porkfriedrice.net)
Date: 12/29/03
- Next message: Danl001: "techniques for handling large text files"
- Previous message: Shawn McKinley: "RE: the ref() function: what does it mean when ..."
- Next in thread: Tom Kinzer: "RE: techniques for handling large text files"
- Reply: Tom Kinzer: "RE: techniques for handling large text files"
- Reply: Andy Unick: "Re: techniques for handling large text files"
- Reply: James Edward Gray II: "Re: techniques for handling large text files"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Mon, 29 Dec 2003 01:05:57 -0500
To: beginners@perl.org
Hi,
If this question would be better posted to another Perl list, please let
me know.
I have a very large text file (~2 GB) in the following format:
header line
header line
header line
marker 1
header line
header line
header line
marker 2
line type 1
line type 1
line type 1
...
line type 1
line type 2
line type 2
line type 2
...
line type 2
end of file marker line
My objective is to write all "line type 1" lines to file1.txt and all
"line type 2" lines to file2.txt. The "header line" lines and the marker
lines should not appear in either file1.txt or file2.txt. Note that
there is no marker line between where line type 1 ends and where line
type 2 starts, but the type can be determined by examining a field in the line.
So I have a script to do this. Essentially, it visits each line in the
file and decides which output file to write it to. The problem is that it
takes a long time to run (roughly 45 minutes on a dual P4 with 512 MB of
RAM). I'd like to cut this running time down as much as possible. What
I'm looking for is either suggestions on a better way to do this in Perl,
or techniques I could use to speed up my current script. I have pasted the
relevant parts of the script below. I noticed I could shave a bit off
the runtime by reading the original file in a buffered manner instead of
line by line. My output to file1.txt and file2.txt at this point is done
with prints to their respective file handles.
Any suggestions that will speed this up in any way will be greatly
appreciated! Thanks,
Dan
--- script ---
# open original file
open(INPUT, $filename) or die "error: $filename cannot be opened\n";
my $BUFFER_SIZE = 4096;
my $buffer = "";
my $sz_buffer = 0;
# open output file for line type 1
my $out1 = "$file1.txt";
open(OUT1, ">$out1") or die "error: $out1 cannot be opened for writing\n";
# open output file for line type 2
my $out2 = "$file2.txt";
open(OUT2, ">$out2") or die "error: $out2 cannot be opened for writing\n";
# counter for the markers we see
my $marker_count = 0;
my $regex_split_space='\s+';
my $regex_split_newline='\n';
my $regex_marker='^marker';
my $regex_eof='^end file';
while (my $rv = read(INPUT, $buffer, $BUFFER_SIZE)) {
    if ($rv >= $BUFFER_SIZE) {
        $buffer .= <INPUT>;
    }
    #print "rv: $rv\n";
    my @lines = split(/$regex_split_newline/o, $buffer);
    # process each line in zone file
    foreach my $line (@lines) {
        #print "line: $line\n";
        if ( $marker_count != 2 ) {
            # if we haven't seen 2 marker lines, we
            # are still in the header section of the file
            if ( $line =~ m/$regex_marker/o ) {
                $marker_count++;
            }
        } elsif ( $line =~ m/$regex_eof/o ) {
            # end of the input file. close
            # our two output files and get out
            close(OUT1);
            close(OUT2);
            exit 0;
        } else {
            # a line we care about
            # split the line on a space character
            my @fields = split(/$regex_split_space/o, $line);
            # check the second field in this line
            if ( $fields[1] eq "1" ) {
                print OUT1 "$line\n";
            } elsif ( $fields[1] eq "2" ) {
                print OUT2 "$line\n";
            } else {
                print "@fields\n";
                die "saw something other than a 1 or 2 line\n";
            }
        }
    }
    $buffer = "";
}
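--- alternative sketch ---
For comparison, here is a minimal sketch of the same job done with plain line-by-line reads, which Perl already buffers internally. It is not the original script and makes no measured speed claim. It skips the header in its own loop so the marker test is not repeated for every data line, and it captures only the second field with one regex instead of splitting each line into an array. The sub name split_by_type, its three path arguments, and the returned counts are hypothetical. The sketch also assumes that data lines start with a non-space character and that the '^marker' and '^end file' patterns from the script above match the real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): copy type 1 lines to
# $out1_path and type 2 lines to $out2_path, skipping the header section.
# Returns the number of lines written to each file.
sub split_by_type {
    my ($in_path, $out1_path, $out2_path) = @_;

    open(my $in,   '<', $in_path)   or die "error: $in_path cannot be opened: $!\n";
    open(my $out1, '>', $out1_path) or die "error: $out1_path cannot be opened for writing: $!\n";
    open(my $out2, '>', $out2_path) or die "error: $out2_path cannot be opened for writing: $!\n";

    # Skip everything up to and including the second marker line.
    my $marker_count = 0;
    while ($marker_count < 2) {
        my $line = <$in>;
        die "unexpected end of file in header section\n" unless defined $line;
        $marker_count++ if $line =~ /^marker/;
    }

    # Map the value of the second field to the matching output handle.
    my %out_for = (1 => $out1, 2 => $out2);
    my %count   = (1 => 0,     2 => 0);

    while (my $line = <$in>) {
        # Pattern copied from the original script; stop at the end marker.
        last if $line =~ /^end file/;

        # Capture only the second whitespace-separated field
        # (assumes the line does not begin with whitespace).
        my ($type) = $line =~ /^\S+\s+(\S+)/;
        my $fh = defined $type ? $out_for{$type} : undef;
        die "saw something other than a 1 or 2 line: $line" unless $fh;

        # $line still carries its newline, so print it unchanged.
        print {$fh} $line;
        $count{$type}++;
    }

    close($in);
    close($out1) or die "error: closing $out1_path failed: $!\n";
    close($out2) or die "error: closing $out2_path failed: $!\n";
    return ($count{1}, $count{2});
}

# Run only when exactly three paths are given on the command line.
split_by_type(@ARGV) if @ARGV == 3;
```

Saved under a hypothetical name such as split_by_type.pl, it could be run as "perl split_by_type.pl input.txt file1.txt file2.txt". Whether it is actually faster than the buffered read() version would need to be timed on the real 2 GB file.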