Re: Shrink large file according to REG_EXP
- From: xhoster@xxxxxxxxx
- Date: 16 Jan 2008 17:54:13 GMT
thellper <thellper@xxxxxxxxx> wrote:
Hello,
I've a problem to solve, and I need some help, please.
I have as input a large text file (up to 5GB) which I need to filter
according to some REG_EXP, and then I need to write the filtered
(hopefully smaller) output to another file.
The filtering is applied row by row: a row is split into various pieces
according to some rules, then some of the pieces are checked against
some REG_EXP, and if a match is found, the whole line is written to
the output.
The problem is that this solution is slow.
I'm now reading the whole file line by line and then applying the
reg_exp... but it is very slow.
I've noticed that the time to read and write the file without doing
anything is very small, so I'm losing a lot of time in my
reg_exps.
Figure out which regex is slow, why it is slow, and then make it faster.
If you did the first step and posted the culprit with some sample input, we
might be able to help with the latter two.
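One way to do that first step is to time each pattern separately against a sample of the real input. The following is only a minimal sketch: the rule names, the patterns, and the sample lines are made up for illustration, and time_regexes is a hypothetical helper, not part of the original program.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical rule set: placeholders, not the original poster's rules.
my %rules = (
    anchored   => qr/^ERROR\b/,
    unanchored => qr/.*foo.*bar/,
);

# Return a hash ref mapping each rule name to the wall-clock seconds
# spent matching that pattern against every sample line, $repeat times.
sub time_regexes {
    my ($rules, $lines, $repeat) = @_;
    $repeat ||= 1;
    my %elapsed;
    for my $name (sort keys %$rules) {
        my $re    = $rules->{$name};
        my $start = time;
        for my $pass (1 .. $repeat) {
            for my $line (@$lines) {
                my $hit = $line =~ $re;
            }
        }
        $elapsed{$name} = time - $start;
    }
    return \%elapsed;
}

# Made-up sample; in practice use a few thousand lines of the real file.
my @sample = ("ERROR disk full\n", "INFO foo then bar\n", "DEBUG nothing here\n");
my $t = time_regexes(\%rules, \@sample, 1000);

# Slowest pattern first.
printf "%-12s %.4fs\n", $_, $t->{$_}
    for sort { $t->{$b} <=> $t->{$a} } keys %$t;
```

Whichever pattern ends up at the top of that list is the one worth posting, together with a few of the lines it is slow on.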
OK, the whole program is more complicated: the files may have
different syntaxes, and I have syntax files which tell me how to split
each line into its fields. Then I separately load files with the rules
(the reg_exps) used to filter them.
Anyway, my idea was to try to use the forks.pm module (see CPAN) to
split the file into chunks and let each thread work on one chunk of the
file. Can somebody tell me how to do this? Or suggest a better way?
I'd try to make the single-threaded one faster first, and resort to
parallelization only as a last resort. Also, if I were parallelizing
this, I probably wouldn't use forks.pm to do it. Once started, your
threads (or processes) really don't need to communicate with each other
(as long as you make independent output files to be combined later), so
a simpler solution would do, like Parallel::ForkManager or just calling
fork yourself. Or just start the jobs as separate processes in the first
place.
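As a rough illustration of "just calling fork yourself", here is a minimal sketch that uses only core Perl. The helper name run_jobs, the part.N file names, and the temporary directory are assumptions made for the example; each child writes its own output file and never talks to the others.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Run each code ref in its own child process and wait for all of them.
# Each job is called with its index. Returns the number of children
# that exited with a non-zero status.
sub run_jobs {
    my @jobs = @_;
    my %index_of;
    for my $i (0 .. $#jobs) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child: do the work, then exit instead of returning.
            my $ok = eval { $jobs[$i]->($i); 1 };
            warn $@ unless $ok;
            exit($ok ? 0 : 1);
        }
        $index_of{$pid} = $i;
    }
    my $failed = 0;
    for my $pid (keys %index_of) {
        waitpid($pid, 0);
        $failed++ if $? != 0;
    }
    return $failed;
}

# Demo: three jobs, each writing an independent output file.
my $dir = tempdir(CLEANUP => 1);
my $job = sub {
    my ($i) = @_;
    open my $out, '>', "$dir/part.$i" or die "open: $!";
    print {$out} "output of job $i\n";
    close $out or die "close: $!";
};
my $failed = run_jobs(($job) x 3);
print "failed jobs: $failed\n";
```

The part.N files can be concatenated afterwards. Parallel::ForkManager wraps the same fork/wait bookkeeping and adds a cap on the number of simultaneous children.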
If the order of the lines in the output files isn't important, I'd give
each job a different integer token (from 0 to $num_job - 1) and then have
each job process only those lines where
$token == $. % $num_job
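A sketch of what one such job might look like, with filter_share and the $keep callback as hypothetical stand-ins for the real splitting and matching code. It uses in-memory filehandles only so that the example runs on its own; the real jobs would open the input file and their own output file.

```perl
use strict;
use warnings;

# Copy to $out_fh those lines of $in_fh that belong to this job
# (by line number modulo $num_job) and for which $keep returns true.
# Returns the number of lines written.
sub filter_share {
    my ($in_fh, $out_fh, $token, $num_job, $keep) = @_;
    my $written = 0;
    while (my $line = <$in_fh>) {
        # $. is the 1-based line number of the last filehandle read.
        next unless $token == $. % $num_job;
        next unless $keep->($line);
        print {$out_fh} $line;
        $written++;
    }
    return $written;
}

# Demo with two jobs over six made-up rows.
my $data = join '', map "row $_\n", 1 .. 6;
for my $token (0 .. 1) {
    open my $in, '<', \$data or die "open: $!";
    my $out = '';
    open my $out_fh, '>', \$out or die "open: $!";
    filter_share($in, $out_fh, $token, 2, sub { $_[0] =~ /row/ });
    close $out_fh;
    print "job $token:\n$out";
}
```

Every job still reads the whole file, so this only pays off when the matching, not the I/O, dominates the run time, which is what the original post reports.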
Xho