Re: optimize log parsing




xhos...@xxxxxxxxx wrote:
> "it_says_BALLS_on_your forehead" <simon.chao@xxxxxxx> wrote:
> > Hey Xho, I tried this:
> > ----
> > #!/apps/webstats/bin/perl
> >
> > use File::Copy;
> > use Parallel::ForkManager;
> >
> > my $pm = Parallel::ForkManager->new(5);
> >
> > $pm->run_on_start(
> > sub { my ($pid,$ident)=@_;
> > print "** $ident started, pid: $pid\n";
> > }
> > );
> >
> > my @data = 1 .. shift;
> > for (@data) {
> > my $pid = $pm->start and next;
> > print "$pid: $_\n";
> > $pm->finish;
> > }
> >
> > $pm->wait_all_children;
> > ------------
> > and got this:
> > #####
> > [smro180 123] ~/simon/1-perl > tryFork.pl 10
> > ** started, pid: 16208
> > 0: 1
> > ** started, pid: 16209
> > 0: 2
> > ** started, pid: 16210
> ...
> >
> > ...I read this:
> > start [ $process_identifier ]
> > This method does the fork. It returns the pid of the child process for
> > the parent, and 0 for the child process. If the $processes parameter
> > for the constructor is 0 then, assuming you're in the child process,
> > $pm->start simply returns 0.
> >
> > An optional $process_identifier can be provided to this method... It is
> > used by the "run_on_finish" callback (see CALLBACKS) for identifying
> > the finished process.
> >
> > and this:
> > run_on_start $code
> > You can define a subroutine which is called when a child is started. It
> > is called after the successful startup of a child in the parent process.
> >
> > The parameters of the $code are the following:
> >
> > - pid of the process which has been started
> > - identification of the process (if provided in the "start" method)
> >
> > ...but I don't understand why in my: print "$pid: $_\n";
> > line, i'm getting 0 as the pid. I know the documentation said i should
> > get 0 for the child process and the child pid for the parent, but
> > aren't i calling start on the parent?
>
> You are calling "start" *in* the parent, but it is returning in both the
> parent and child process. Inside, "start" does a fork, so when "start"
> ends there are two processes. The parent process gets the child's pid,
> which means the "and next" is activated. The child gets zero, so the "and
> next" is not activated. This means everything between the start and the
> finish statements is done in one of the children, not in the parent.
>
> The example I posted was just copied and modified from perldoc, and for
> some reason they do capture the pid. In practise I almost never capture
> it:
>
> $pm->start and next;
>
> If the child needs its own pid, it gets it from $$. Why do I need
> the parent to know the child's pid? Usually I don't, because the module
> itself takes care of all the waiting and stuff for me.
>
> I rarely use anything except new, start, finish, and wait_all_children,
> except to goof around with. Once your needs get more complicated than
> those simple methods, I find that things get hairy real quick.
>
> BTW, I'm curious about the bottleneck in your code. If your code is
> CPU-bound, then parallelization to 20 processes won't help much unless you
> have 20 CPUs. If it is disk-drive bound, then parallelization won't help
> unless your files are on different disks (and probably on different
> controllers.)
>
> Xho
>

ahh, that makes sense, thanks!
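
for anyone reading this thread later, here is a minimal sketch of the behaviour Xho describes, using Perl's built-in fork rather than Parallel::ForkManager. it assumes, as the quoted docs say, that start returns the same way fork does (the child's pid in the parent, 0 in the child). the sub name fork_demo and the pipe plumbing are made up for illustration and are not part of the module.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: fork once and report what each side saw as
# fork's return value. The child sends its view back through a pipe.
sub fork_demo {
    pipe(my $reader, my $writer) or die "pipe failed: $!";

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: fork returned 0 here; the child's real pid is in $$.
        close $reader;
        print {$writer} "$pid $$\n";
        close $writer;
        POSIX::_exit(0);    # leave without running the parent's cleanup code
    }

    # Parent: fork returned the child's pid.
    close $writer;
    my $line = <$reader>;
    close $reader;
    waitpid($pid, 0);

    my ($child_saw, $child_real_pid) = split ' ', $line;
    return {
        parent_saw     => $pid,
        child_saw      => $child_saw,
        child_real_pid => $child_real_pid,
    };
}

my $r = fork_demo();
print "parent saw $r->{parent_saw}, child saw $r->{child_saw}, ",
      "child's \$\$ was $r->{child_real_pid}\n";
```

the child always sees 0 as the return value, which is why my print "$pid: $_\n" line showed 0: that line only ever runs in a child.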

to answer your question, i'm working on a box with 16 CPUs. the number
20 is from code that i inherited from a predecessor. there used to be
10 processes, and he changed it to 20, and it went faster, so 20 it
stayed. should i change it to 16?

also, what's the difference between using Parallel::ForkManager to do
20 tasks, and looping through system('script.pl &') 20 times? i mean, i
see an advantage in that with ForkManager, when one process dies,
another takes its place so you don't need to pre-ordain which process
does which work. but let's assume that each process has exactly the
same amount of work and processes that work with the same speed. would
ForkManager be faster? Is there ever a case where multiple system()
calls are the answer?
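
to make the "another takes its place" part concrete, here is a rough sketch of that refill behaviour written with only core fork and wait. this is my own illustration of the idea, not how Parallel::ForkManager is actually implemented, and run_limited, its arguments, and the sleep used as stand-in work are all hypothetical.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: run $work->($item) in a child process for each
# item, never with more than $max children alive at once.
# Returns (number of children reaped, peak number running).
sub run_limited {
    my ($max, $work, @items) = @_;
    my ($running, $peak, $reaped) = (0, 0, 0);

    for my $item (@items) {
        # At the limit: block until some child exits, freeing a slot.
        if ($running >= $max) {
            my $done = wait();
            if ($done > 0) { $running--; $reaped++; }
        }

        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child: do the work, then leave immediately.
            # _exit skips buffer flushing, so $work should not rely on
            # buffered output in this sketch.
            $work->($item);
            POSIX::_exit(0);
        }

        # Parent: count the new child and move on to the next item.
        $running++;
        $peak = $running if $running > $peak;
    }

    # Reap whatever is still running.
    while ($running > 0) {
        my $done = wait();
        last if $done < 0;
        $running--;
        $reaped++;
    }
    return ($reaped, $peak);
}

# Stand-in work: sleep for 50 ms per item.
my ($reaped, $peak) = run_limited(3, sub { select(undef, undef, undef, 0.05) }, 1 .. 10);
print "reaped $reaped children, at most $peak at a time\n";
```

with a loop of system('script.pl &') calls, all 20 scripts start at once and each one has to be told up front which share of the work is its own; the sketch above hands out the next item whenever a slot frees up.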


