Re: Walking a tree and extracting info... Problems



jim.goodman@xxxxxxxxx wrote:
I am new to the perl thing and i am trying to extract some date from
some web pages and am having problems.... can someone please tell me
what i am doing wrong... i think i have become a charter member of the
"idiots 'r' us" club... :o)!

Nope; Perl is IMO harder to learn than some other languages. You're not helping yourself enough though. I'll get to your problem in a moment, but first some things you should do (a) to help you find your problems before posting here, and (b) to get better and quicker help here.

1. Always code "use strict;" and "use warnings;". Had you done so you
might have picked up the logic problem in your code, but it will
certainly ensure that you pick up many others.
2. Code not only a test program (well done for doing that) but also
some suitable data. I had to make some in order to do the testing.
3. Learn to use the Perl debugger (perl -d yourprog.pl) and to use the
breakpoint ("b") and examine ("x") commands. Doing that I found your
problem in one pass through the program.
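
To illustrate point 1, here is a small sketch of my own (not taken from the original script; the variable names are made up). It feeds a snippet containing a misspelt variable to eval so that the complaint from "use strict" can be captured and printed rather than stopping the program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical snippet with a misspelt variable name ($cuont for $count).
my $snippet = 'use strict; my $count = 0; $cuont++; $count;';

# eval STRING compiles the snippet at run time, so the compile-time error
# raised by "use strict" ends up in $@ instead of killing this program.
my $result       = eval $snippet;
my $strict_error = $@;

print "strict refused to compile it:\n$strict_error" if $strict_error;
```

Without "use strict" the misspelt name would silently become a new global variable and the program would carry on with the wrong value.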

this is my script... pretty simple so far, i am just trying to get one
piece of info working to start. i can traverse the directory and print
the filenames, but it only seems to get the data and do the pattern
matching from the first file in the directory....

What you mean is that once it has found a file with a match it then finds that match in all subsequent files even if they themselves don't have it. I recommend you try to be very precise about your problem. Actually, showing your incorrect output is very precise and saves extra thought on your part!

#!/usr/bin/perl
$dir="/Users/test/";

If you code "use strict" you'll need to put "my $dir", and the same elsewhere in the file.

opendir(DIRECTORY, $dir) || die("Cannot open directory");
@thefiles= readdir(DIRECTORY);

This is OK as far as it goes but assumes you have enough memory to hold the whole directory listing. Better practice is to read the directory one entry at a time, as you've (partly) done with the lines of the file.
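
A minimal sketch of reading the directory one entry at a time, assuming a lexical directory handle; visit_entries() and its callback argument are names I made up for the illustration.

```perl
use strict;
use warnings;

# Call $callback once per directory entry without building a list first.
# visit_entries() and $callback are illustrative names, not from the post.
sub visit_entries {
    my ($dir, $callback) = @_;
    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
    # defined() matters: an entry named "0" is false but still a valid name.
    while (defined(my $entry = readdir $dh)) {
        $callback->($entry);
    }
    closedir $dh;
    return;
}

# Usage: print every entry of the current directory.
visit_entries('.', sub { print "$_[0]\n" });
```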

closedir(DIRECTORY);

foreach $file (@thefiles) {
unless ( ($file eq ".") || ($file eq "..") || ($file eq ".DS_Store") ) {

A regex could do this (untested)

unless ( $file =~ /^\.{1,2}$|^\.DS_Store$/ ) {

.... but if you could just reject all "dot" files it would be even easier

unless ( $file =~ /^\./ )
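
The same test can also be applied to the whole list in one go with grep. This is only a sketch; skip_dot_files() and the sample names are invented for the example.

```perl
use strict;
use warnings;

# Keep only the names that do not start with a dot.
# skip_dot_files() is an illustrative helper name.
sub skip_dot_files {
    my @names = @_;
    return grep { !/^\./ } @names;
}

my @wanted = skip_dot_files('.', '..', '.DS_Store', 'page1.html', 'notes.txt');
print "@wanted\n";    # prints "page1.html notes.txt"
```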

open FILE, "$dir/$file" or die "Can't open $file : $!";

Well done for checking the file open result. Lots of beginners don't.

while( <FILE> ) {
s/\t//; # ignore tabs by erasing them
next if /^(\s)*$/; # skip blank lines
chomp; # remove trailing newline characters
push @lines, $_; # push the data line onto the array

Again, you're assuming that you always have enough memory for the whole file.

Your problem is here. Because you didn't code "use strict" you aren't forcing yourself to take control of the scope of your variables. Perl has allocated "@lines" once for the whole program, so when you process the next file in the directory you push its lines onto the end of the same array. Once one file has matched, the lines that matched are still there, and the match for the HTML title then fires for every later file. If you'd coded "my @lines" just before the "while (<FILE>)" line then you'd have got a new, empty "@lines" for each file and your program would have worked as you wanted it to.
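
The effect can be reproduced without any files at all. In this sketch (mine, with made-up data standing in for two files) the first loop shares one array across "files" and the second declares a fresh one each time round.

```perl
use strict;
use warnings;

# Two made-up "files", each given as a list of lines; only the first
# contains the markup we are looking for.
my @files = ( [ '<B> First page</B>' ], [ 'nothing to see' ] );

# One array for the whole program: lines pile up from file to file.
my @shared;
my @shared_hits;
for my $file (@files) {
    push @shared, @$file;
    push @shared_hits, ( "@shared" =~ /<B>/ ? 'match' : 'no match' );
}

# Array declared inside the loop: a fresh, empty one for every file.
my @fresh_hits;
for my $file (@files) {
    my @lines;
    push @lines, @$file;
    push @fresh_hits, ( "@lines" =~ /<B>/ ? 'match' : 'no match' );
}

print 'shared: ', join( ', ', @shared_hits ), "\n";   # shared: match, match
print 'fresh:  ', join( ', ', @fresh_hits ),  "\n";   # fresh:  match, no match
```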

}
close FILE;
$string = "@lines";

This is ugly, and as written it earns a slap on the wrist from Perl once you code "use strict; use warnings". It does give you what you want, though; it's up to you whether you want to write with good style.

$n++;

When "strict" forces you to code "my $n" then you'll have to put it outside the directory-read loop.

print "$n:$file:";
$string =~ /<span class=searchtitle><B> (.*?)<\/B><\/span><BR>/is;
print "$1\n"; # print html page title

Always check that the match succeeded before you use the extracted text. When I fixed your program so it only examined the text of the current file I got a warning from the print statement every time the match failed.
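
One way to make that check explicit is to test the result of the match itself and only then touch $1. A sketch, using the pattern from the post; extract_title() and the sample strings are my own invention.

```perl
use strict;
use warnings;

# Return the captured title, or undef when the pattern does not match.
# extract_title() is an illustrative name; the pattern is the one from the post.
sub extract_title {
    my ($html) = @_;
    if ( $html =~ /<span class=searchtitle><B> (.*?)<\/B><\/span><BR>/is ) {
        return $1;
    }
    return;    # no match: the caller gets undef in scalar context
}

for my $html ( '<span class=searchtitle><B> My Page</B></span><BR>',
               '<p>no title</p>' ) {
    my $title = extract_title($html);
    print defined $title ? "$title\n" : "(no title found)\n";
}
```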

Here's a minimally-fixed version of your program which "works", in the sense that it finds the HTML titles. It still needs quite a lot of cleaning up and more Perlish idiom.

#!/usr/bin/perl
# Jim Goodman's problem April 9

use strict; use warnings; # I added this

#$dir="/Users/test/";
my $dir="F:/scratch"; # My directory instead of his

opendir(DIRECTORY, $dir) || die("Cannot open directory");
my @thefiles= readdir(DIRECTORY);
closedir(DIRECTORY);

my $n;
foreach my $file (@thefiles) {
    unless ( ($file eq ".") || ($file eq "..") || ($file eq ".DS_Store") ) {
        open FILE, "$dir/$file" or die "Can't open $file : $!";
        my @lines = ();
        while( <FILE> ) {
            s/\t//; # ignore tabs by erasing them
            next if /^(\s)*$/; # skip blank lines
            chomp; # remove trailing newline characters
            push @lines, $_; # push the data line onto the array
        }
        close FILE;
        my $string = "@lines";
        $n++;
        print "$n:$file:";
        $string =~ /<span class=searchtitle><B> (.*?)<\/B><\/span><BR>/is;
        print "$1\n" if $1; # print html page title
    }
}

But I think I'd feel inclined to use "grep" to find the files that had the relevant string in them, and pipe the output into a much smaller Perl program to find the HTML titles and print them out. You'd lose the incrementing count of the files, though.
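
A rough sketch of what that smaller program might look like. The file name titles.pl and the sub first_title() are hypothetical, and I'm assuming the whole title markup sits on a single line so that each file can be read line by line. It takes the file names as arguments rather than on standard input, so the pipe goes through xargs.

```perl
#!/usr/bin/perl
# titles.pl (hypothetical name): print the title found in each file
# named on the command line.
use strict;
use warnings;

# Return the first title in the file, or undef if there is none.
# Assumes the title markup is not split across lines.
sub first_title {
    my ($path) = @_;
    open( my $fh, '<', $path ) or die "Can't open $path: $!";
    my $title;
    while ( my $line = <$fh> ) {
        if ( $line =~ /<span class=searchtitle><B> (.*?)<\/B><\/span><BR>/i ) {
            $title = $1;
            last;    # stop reading as soon as the title is found
        }
    }
    close $fh;
    return $title;
}

for my $path (@ARGV) {
    my $title = first_title($path);
    print "$path:$title\n" if defined $title;
}
```

It could then be run as something like

grep -l 'class=searchtitle' /Users/test/* | xargs perl titles.pl

which works as long as the file names contain no whitespace.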

--

Henry Law <>< Manchester, England