Re: Shrink large file according to REG_EXP



thellper <thellper@xxxxxxxxx> wrote:
Hello,
I've a problem to solve, and I need some help, please.
I have as input a large text file (up to 5GB) which I need to filter
according to some REG_EXP and then I need to write the filtered
(hopefully smaller) output to another file.
The filtering applies row-by-row: a row is split into various pieces
according to some rules, then some of the pieces are checked against
some REG_EXP, and if a match is found, the whole line is written to
the output.

The problem is that this solution is slow.
I'm now reading the whole file line by line, and then I'm applying the
reg_exp... but it is very slow.
I've noticed that the time to read and write the file without doing
anything is very small, so I'm losing a lot of time in my
reg_exps...

Figure out which regex is slow, why it is slow, and then make it faster.

If you did the first step and posted the culprit with some sample input, we
might be able to help with the latter two.
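
One way to do that first step is to time each regex on its own against the
same sample of lines. The following is only a rough sketch: the rule names
and patterns in %rules and the sample lines are made-up placeholders, not
your real rules, and time_rules() is a helper invented for this example.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Placeholder rules: substitute the regexes loaded from your rule files.
my %rules = (
    anchored => qr/^ERROR\b/,
    floating => qr/timeout after \d+ ms/,
    greedy   => qr/.*foo.*bar.*baz/,
);

# Run every regex alone over the same lines.
# Returns { name => { seconds => ..., hits => ... } }.
sub time_rules {
    my ($rules, $lines) = @_;
    my %result;
    for my $name (sort keys %$rules) {
        my $re    = $rules->{$name};
        my $hits  = 0;
        my $start = time;
        for my $line (@$lines) {
            $hits++ if $line =~ $re;
        }
        $result{$name} = { seconds => time - $start, hits => $hits };
    }
    return \%result;
}

# Placeholder sample: in practice, use a few thousand real lines.
my @sample = ("ERROR disk full\n", "timeout after 30 ms\n", "foo bar baz\n") x 1000;
my $report = time_rules(\%rules, \@sample);

# Slowest regex first.
for my $name (sort { $report->{$b}{seconds} <=> $report->{$a}{seconds} } keys %$report) {
    printf "%-10s %.4f s  %d hits\n", $name, @{ $report->{$name} }{qw(seconds hits)};
}
```

Whichever rule ends up at the top of that list is the one worth posting,
together with a few of the lines it is slow on.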

Ok, the whole program is more complicated: the files may have
different syntax, and I have syntax files which tell me how to split
each line into its fields. Then I separately load files with the rules
(the reg_exps) used to filter them...
Anyway, my idea was to try to use the FORKS.pm module (see CPAN) to
split the file into chunks and let each thread work on a chunk of the
file: can somebody tell me how to do this? Or a better way?

I'd try to make the single-threaded one faster first, and turn to
parallelization only as a last resort. Also, if I were parallelizing
this, I probably wouldn't use forks.pm to do it. Once started, your
threads (or processes) really don't need to communicate with each other
(as long as you make independent output files to be combined later), so
a simpler solution would do, like Parallel::ForkManager or just calling
fork yourself. Or just start the jobs as separate processes in the first
place.

If the order of the lines in the output file isn't important, I'd give
each job a different integer token (from 0 to $num_job - 1) and then have
each job process only those lines where
$token == $. % $num_job
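
Put together with plain fork, that could look roughly like the sketch
below. The helper names (is_my_line, run_job, run_all), the output naming
scheme "$out_prefix.$token" and the qr/ERROR/ filter are all made up for
illustration; your real per-line splitting and rule checks would replace
the single regex match.

```perl
use strict;
use warnings;
use POSIX qw(_exit);

# True if the job holding $token is responsible for line number $line_no.
sub is_my_line {
    my ($token, $line_no, $num_job) = @_;
    return $token == $line_no % $num_job;
}

# Read all of $in, but filter only this job's lines into "$out_prefix.$token".
sub run_job {
    my ($token, $num_job, $in, $out_prefix, $re) = @_;
    my $out = "$out_prefix.$token";
    open my $ifh, '<', $in  or die "Cannot read $in: $!";
    open my $ofh, '>', $out or die "Cannot write $out: $!";
    while (my $line = <$ifh>) {
        next unless is_my_line($token, $., $num_job);
        print {$ofh} $line if $line =~ $re;    # placeholder for the real rules
    }
    close $ofh or die "Cannot close $out: $!";
    close $ifh;
}

# Fork one child per token, then wait for all of them.
sub run_all {
    my ($num_job, $in, $out_prefix, $re) = @_;
    my @pids;
    for my $token (0 .. $num_job - 1) {
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            run_job($token, $num_job, $in, $out_prefix, $re);
            _exit(0);    # output is already closed; skip END blocks in the child
        }
        push @pids, $pid;
    }
    for my $pid (@pids) {
        waitpid $pid, 0;
        die "job $pid failed with status $?" if $? != 0;
    }
}

# Hypothetical invocation: perl filter.pl big.txt 4
run_all($ARGV[1] // 4, $ARGV[0], "$ARGV[0].out", qr/ERROR/) if @ARGV;
```

Every job still reads the whole input, which should be acceptable here
since you said that plain reading is fast. The per-job output files can be
concatenated afterwards.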

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.