Shrink large file according to REG_EXP



Hello,
I have a problem to solve and would appreciate some help.
My input is a large text file (up to 5 GB) which I need to filter
according to some regular expressions (REG_EXP), writing the filtered
(hopefully smaller) output to another file.
The filtering is applied row by row: each row is split into several
pieces according to some rules, some of the pieces are then checked
against some regexps, and if a match is found, the whole line is
written to the output.
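
To make the row-by-row filtering concrete, here is a minimal sketch of the loop described above. The semicolon separator, the field indexes, the patterns, and the names `%rules`, `split_line`, `line_matches` and `filter_handle` are all made up for illustration; the real splitting rules and regexps come from the syntax and rule files. The patterns are compiled once with `qr//` so they are not rebuilt for every line.

```perl
use strict;
use warnings;

# Hypothetical rules: field index => precompiled pattern.
# In the real program these would be loaded from the rule files.
my %rules = (
    1 => qr/^ERROR$/,
    3 => qr/timeout/i,
);

# Hypothetical splitting rule: fields are separated by a semicolon.
sub split_line {
    my ($line) = @_;
    return split /;/, $line, -1;
}

# Returns true if at least one checked field matches its pattern.
sub line_matches {
    my ($line) = @_;
    my @fields = split_line($line);
    for my $idx (keys %rules) {
        next unless defined $fields[$idx];
        return 1 if $fields[$idx] =~ $rules{$idx};
    }
    return 0;
}

# Copies every matching line from $in to $out, unchanged.
sub filter_handle {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        chomp(my $copy = $line);    # match without the trailing newline
        print {$out} $line if line_matches($copy);
    }
    return;
}
```

It would be called with two open filehandles, for example `filter_handle($in, $out)` after opening the input for reading and the output for writing.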

The problem is that this solution is slow.
At the moment I read the whole file line by line and apply the
regexps to each line, and this is very slow.
I've noticed that just reading and writing the file, without any
processing, takes very little time, so most of the time is being lost
in the regexps.

OK, the whole program is more complicated than that: the files may
have different syntaxes, and I have syntax files which tell me how to
split each line into its fields. The files with the rules (the
regexps) used for filtering are loaded separately.
Anyway, my idea was to use the forks.pm module (see CPAN) to
split the file into chunks and let each thread work on one chunk of
the file. Can somebody tell me how to do this? Or is there a better way?
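
A rough sketch of the chunking idea follows, under some loud assumptions: it uses Perl's built-in `fork` rather than the forks.pm module, it assumes a Unix-like system, and the names `chunk_ranges`, `filter_range`, `filter_parallel` and the `$keep` callback are invented for illustration. Each chunk boundary is moved forward to the next line start so that no line is cut in half, each child writes its own part file, and the parent concatenates the parts in order. Whether this is actually faster depends on how much of the time really goes into the regexps.

```perl
use strict;
use warnings;
use POSIX ();

# Compute [start, end) byte ranges that begin and end on line boundaries.
sub chunk_ranges {
    my ($path, $n_chunks) = @_;
    my $size = (-s $path) || 0;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my @starts = (0);
    for my $i (1 .. $n_chunks - 1) {
        seek $fh, int($size * $i / $n_chunks), 0 or die "seek failed: $!";
        my $partial = <$fh>;        # skip to the start of the next line
        my $start = tell $fh;
        push @starts, $start if $start > $starts[-1] && $start < $size;
    }
    close $fh;
    return map { [ $starts[$_], $_ < $#starts ? $starts[$_ + 1] : $size ] }
           0 .. $#starts;
}

# Filter the lines that start inside [start, end) into a part file.
sub filter_range {
    my ($path, $start, $end, $part, $keep) = @_;
    open my $in,  '<:raw', $path or die "Cannot open $path: $!";
    open my $out, '>:raw', $part or die "Cannot open $part: $!";
    seek $in, $start, 0 or die "seek failed: $!";
    while (tell($in) < $end) {
        my $line = <$in>;
        last unless defined $line;
        print {$out} $line if $keep->($line);
    }
    close $out or die "Cannot close $part: $!";
    return;
}

# Fork one child per chunk, then join the part files in order.
sub filter_parallel {
    my ($path, $out_path, $n_workers, $keep) = @_;
    my @ranges = chunk_ranges($path, $n_workers);
    my (@parts, @pids);
    for my $i (0 .. $#ranges) {
        push @parts, "$out_path.part$i";
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {            # child process
            my $ok = eval { filter_range($path, @{ $ranges[$i] }, $parts[$i], $keep); 1 };
            warn $@ unless $ok;
            POSIX::_exit($ok ? 0 : 1);
        }
        push @pids, $pid;
    }
    for my $pid (@pids) {
        waitpid $pid, 0;
        die "worker $pid failed" if $?;
    }
    open my $out, '>:raw', $out_path or die "Cannot open $out_path: $!";
    for my $part (@parts) {
        open my $in, '<:raw', $part or die "Cannot open $part: $!";
        while (my $line = <$in>) { print {$out} $line }
        close $in;
        unlink $part;
    }
    close $out or die "Cannot close $out_path: $!";
    return;
}
```

A call might look like `filter_parallel('big.txt', 'small.txt', 4, sub { $_[0] =~ /some pattern/ })`, where the file names and the pattern are placeholders.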

Any help is really appreciated.

Best regards,
Davide