techniques for handling large text files

From: Danl001 (danl001_at_porkfriedrice.net)
Date: 12/29/03


Date: Mon, 29 Dec 2003 01:06:39 -0500
To: beginners@perl.org

Hi,

If this question would be better posted to another perl list, please let
me know.

I have a very large text file (~2 GB) and it's in the following format:

header line
header line
header line
marker 1
header line
header line
header line
marker 2
line type 1
line type 1
line type 1
...
line type 1
line type 2
line type 2
line type 2
...
line type 2
end of file marker line

My objective is to put all "line type 1" lines to file1.txt and all
"line type 2" lines to file2.txt. The "header line" and any of the
marker lines will not appear in either file1.txt or file2.txt. Note
there is no marker line between where line type 1 ends and where line
type 2 starts, but that can be determined by examining a field in the line.
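A minimal sketch of that field check, assuming (as in the script further down) that the type is the second whitespace-separated field and holds "1" or "2". The helper name line_type and the sample lines are made up for illustration only:

```perl
use strict;
use warnings;

# Hypothetical helper: return the second whitespace-separated field
# of a line, or undef if the line has fewer than two fields.
sub line_type {
    my ($line) = @_;
    # A limit of 3 stops splitting after the second field, so the
    # remainder of a long line is left as one unsplit chunk.
    my (undef, $type) = split /\s+/, $line, 3;
    return $type;
}

# Illustrative calls on made-up data lines
my $first  = line_type("foo 1 bar baz");
my $second = line_type("foo 2 bar baz");
```

Limiting the split is only one possible way to avoid breaking up the whole line when a single field is needed.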

So I have a script to do this. Essentially, it visits each line in the
file and decides which output file to write it to. The problem is it
takes a long time to run (roughly 45 min on a dual P4 with 512 MB RAM). I'd like
to cut this running time down as much as possible. What I'm looking for is
either suggestions on a better way to do this in Perl, or suggestions or
techniques I could use to speed up my current script. I have pasted the
relevant parts of the script below. I noticed I could shave a bit off
the runtime by reading the original file in a buffered manner instead of
line by line. My outputs to file1.txt and file2.txt at this point take
place with prints to their respective file handles.
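For comparison, the plain line-by-line approach could look roughly like the sketch below. It is not the script I am actually running; the sub name split_lines is hypothetical, and the '^marker' and '^end file' patterns and the check on the second field are assumptions carried over from the script that follows.

```perl
use strict;
use warnings;

# Hypothetical helper: read $in one line at a time and copy each data
# line to $out1 or $out2 depending on its second field.
sub split_lines {
    my ($in, $out1, $out2) = @_;
    my $markers = 0;
    while (my $line = <$in>) {
        if ($markers < 2) {
            # still in the header section: only count marker lines
            $markers++ if $line =~ /^marker/;
            next;
        }
        # stop at the end-of-file marker line
        last if $line =~ /^end file/;
        # only the second field is needed, so limit the split
        my (undef, $type) = split /\s+/, $line, 3;
        if (defined $type && $type eq '1') {
            print {$out1} $line;
        } elsif (defined $type && $type eq '2') {
            print {$out2} $line;
        } else {
            die "saw something other than a 1 or 2 line: $line";
        }
    }
    return;
}
```

It takes already opened filehandles, so it can be fed handles opened on the real input and output files or on in-memory strings for a quick check.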

Any suggestions that will speed this up in any way will be greatly
appreciated! Thanks,

Dan

--- script ---

# open original file
open(INPUT, $filename) or die "error: $filename cannot be opened\n";
my $BUFFER_SIZE = 4096;
my $buffer = "";
my $sz_buffer = 0;

# open output file for line type 1
my $out1 = "file1.txt";
open(OUT1, ">$out1") or die "error: $out1 cannot be opened for writing\n";

# open output file for line type 2
my $out2 = "file2.txt";
open(OUT2, ">$out2") or die "error: $out2 cannot be opened for writing\n";

# counter for the markers we see
my $marker_count = 0;

my $regex_split_space='\s+';
my $regex_split_newline='\n';
my $regex_marker='^marker';
my $regex_eof='^end file';

while (my $rv = read(INPUT, $buffer, $BUFFER_SIZE)) {

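     if ($rv >= $BUFFER_SIZE) {
         # a full buffer probably ends mid-line, so read up to the
         # next newline; <INPUT> returns undef at end of file
         my $rest = <INPUT>;
         $buffer .= $rest if defined $rest;
     }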

     #print "rv: $rv\n";
     my @lines = split(/$regex_split_newline/o, $buffer);
     # process each line in zone file
     foreach my $line (@lines) {

         #print "line: $line\n";
         if ( $marker_count != 2 ) {

             # if we haven't seen 2 marker lines, we
             # are still in the header section of the file
             if ( $line =~ m/$regex_marker/o ) {
                 $marker_count++;
             }

         } elsif ( $line =~ m/$regex_eof/o ) {
             # end of the input file. close
             # our two output files and get out
             close(OUT1);
             close(OUT2);
             exit 0;

         } else {
             # a line we care about

             # split the line on whitespace
             my @fields = split(/$regex_split_space/o, $line);

             # check the second field in this line
             if ( $fields[1] eq "1" ) {

                 print OUT1 "$line\n";

             } elsif ( $fields[1] eq "2" ) {

                 print OUT2 "$line\n";

             } else {

                 print "@fields\n";
                 die "saw something other than a 1 or 2 line\n";

             }
         }
     }

     $buffer = "";
}


