subroutine in LWP - in order to get 700 forum threads



Hello dear Perl addicts,

To admit it: I am a Perl novice and do not have much experience with Perl. But I am willing to learn, and I want to learn Perl. For now I have to solve some tasks for college: I have to do some investigations on a board where I have no access to the database.

First of all, I have to explain something. I have to grab some data out of a phpBB board in order to do some field research. I need the data from a forum that is run by a user community, so that I can analyze the discussions. To give an example, let us take this forum here. How can I grab all the data out of this forum, store it locally, and afterwards put it into a local database of a phpBB forum? Is this possible? [URL]=http://www.nukeforums.com/forums/viewforum.php?f=17[/URL]

Nothing harmful, nothing bad, nothing serious or dangerous. But the issue is that I have to get the data.
I need the data in an almost full and complete format. So I need all the data, like

username
forum
thread
topic
text of the posting, and so on.

How can I do that?


[URL]=http://www.nukeforums.com/forums/viewforum.php?f=3[/URL]
[URL]=http://www.nukeforums.com/forums/viewforum.php?f=17[/URL]



[code]

#!/usr/bin/perl
use strict;
use warnings;

use LWP::RobotUA;
use HTTP::Request;
use HTML::LinkExtor;
use HTML::TokeParser;
use URI::URL;

use Data::Dumper; # for show and troubleshooting

my $url = "http://www.nukeforums.com/forums/viewforum.php?f=17";
# LWP::RobotUA requires an agent name and a contact address;
# both values here are placeholders, put in your own
my $ua = LWP::RobotUA->new('my-robot/0.1', 'me@example.com');
my $lp = HTML::LinkExtor->new(\&wanted_links);

my @links;
get_threads($url);

foreach my $page (@links) { # this loops over each link collected from the index
    my $r = $ua->get($page);
    if ($r->is_success) {
        my $stream = HTML::TokeParser->new(\$r->content) or die "Parse error in $page: $!";
        # just printing what was collected
        print Dumper get_thread($stream);
        # would instead have database insert statement at this point
    } else {
        warn $r->status_line;
    }
}

# Walk through one thread page and collect the title and every posting
sub get_thread {
    my $p = shift;
    my ($title, $name, @thread);
    while (my $tag = $p->get_tag('a','span')) {
        if (exists $tag->[1]{'class'}) {
            if ($tag->[0] eq 'span') {
                if ($tag->[1]{'class'} eq 'name') {
                    $name = $p->get_trimmed_text('/span');
                } elsif ($tag->[1]{'class'} eq 'postbody') {
                    my $post = $p->get_trimmed_text('/span');
                    push @thread, {'name'=>$name, 'post'=>$post};
                }
            } else {
                if ($tag->[1]{'class'} eq 'maintitle') {
                    $title = $p->get_trimmed_text('/a');
                }
            }
        }
    }
    return {'title'=>$title, 'thread'=>\@thread};
}

# Fetch one index page and collect the thread links found on it
sub get_threads {
    my $page = shift;
    # request the page that was passed in (not the global $url)
    my $r = $ua->request(HTTP::Request->new(GET => $page), sub {$lp->parse($_[0])});
    # Expand URLs to absolute ones
    my $base = $r->base;
    @links = map { url($_, $base)->abs } @links;
    return \@links;
}

# Callback for HTML::LinkExtor: keep only links to threads
sub wanted_links {
    my($tag, %attr) = @_;
    return unless exists $attr{'href'};
    return if $attr{'href'} !~ /^viewtopic\.php\?t=/;
    push @links, $attr{'href'}; # only the href, not every attribute value
}

[/code]



If we have the necessary modules installed and run it from the command line, we see output such as the following:



[code]

$VAR1 = {
    'thread' => [
        {
            'post' => 'Hello, I\'m pretty new to PHPNuke. I\'ve got my site up and running great! I\'m now starting to make modifications, add modules etc. I\'m using the most recent RavenPHP76. I want to display the 5 most recent forum posts at the top of the forum page. I\'m not sure if this functionality is built in, if so, how to activate. Or if there is a module or block made to do this. I looked at Raven\'s Collapsing Forum block but wasn\'t crazy about the format, and I don\'t want it to be collapsable. Thanks! mopho',
            'name' => 'mopho'
        },
        {
            'post' => 'hi there',
            'name' => 'sail'
        },
        {
            'post' => 'thanks for asking this; :not very sure if i got you right; Do you want to have a feed of the last forumthreads? guess the easiest way is to go to raven and ask how he did it. hth sail.',
            'name' => 'sail'
        },
        {
            'post' => 'Thanks. i found what I was looking for. It wasn\'t so easy to find! It\'s called glance_mod. mopho',
            'name' => 'mopho'
        },
        {
            'post' => 'hi there thx',
            'name' => 'sail'
        },
        {
            'post' => 'it sound interesting - i will have also a look i google after it - and try to find out more regards sailor',
            'name' => 'sail'
        }
    ],
    'title' => 'Recent Forum Posts Module'
};

[/code]
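
As a rough idea for the database step: the structure above could be flattened into one row per posting before any insert. This is only a sketch. thread_to_rows is a name made up here, and the DBI call mentioned in the comment assumes a prepared INSERT statement with four placeholders that is not shown.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn the structure returned by get_thread() into flat rows,
# one row per posting: [ title, position in thread, username, text ].
# thread_to_rows() is a made-up helper name, not part of any module.
sub thread_to_rows {
    my ($thread) = @_;
    my $title = defined $thread->{title} ? $thread->{title} : '';
    my @rows;
    my $pos = 0;
    for my $post (@{ $thread->{thread} || [] }) {
        $pos++;
        push @rows, [ $title, $pos, $post->{name}, $post->{post} ];
    }
    return \@rows;
}

# Small sample in the same shape as the Dumper output above
my $sample = {
    title  => 'Recent Forum Posts Module',
    thread => [
        { name => 'mopho', post => 'Thanks. i found what I was looking for.' },
        { name => 'sail',  post => 'hi there thx' },
    ],
};

for my $row (@{ thread_to_rows($sample) }) {
    print join(' | ', @$row), "\n";
    # With DBI this could become something like:
    #   $sth->execute(@$row);
    # where $sth is a prepared INSERT with four placeholders.
}
```

Keeping the position of each posting in its thread makes it possible to rebuild the order of the discussion later.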



To be honest, I think the problem is that the script just looped over the first index page,
here [URL]=http://www.nukeforums.com/forums/viewforum.php?f=17[/URL]
But I need it to loop over all of the more than 50 index pages. Therefore I need a routine for that.


This part must go into a subroutine, so that the code is looped:

[code]

#!/usr/bin/perl
use strict;
use warnings;

use LWP::RobotUA;
use HTML::LinkExtor;
use HTML::TokeParser;
use URI::URL;

use Data::Dumper; # for show and troubleshooting

my $url = "http://www.nukeforums.com/forums/viewforum.php?f=17";
# LWP::RobotUA requires an agent name and a contact address;
# both values here are placeholders, put in your own
my $ua = LWP::RobotUA->new('my-robot/0.1', 'me@example.com');
my $lp = HTML::LinkExtor->new(\&wanted_links);

my @links;
get_threads($url);

foreach my $page (@links) { # this loops over each link collected from the index
    my $r = $ua->get($page);
    if ($r->is_success) {
        my $stream = HTML::TokeParser->new(\$r->content) or die "Parse error in $page: $!";
        # just printing what was collected
        print Dumper get_thread($stream);
        # would instead have database insert statement at this point
    } else {
        warn $r->status_line;
    }
}


[/code]


This must go into a subroutine, doesn't it?


It has to go into a subroutine so that the script can loop over all the pages in the forum [URL]=http://www.nukeforums.com/forums/viewforum.php?f=17[/URL]. The version above does not set up a loop to grab each of the index pages, though someone may consider that trivial. The demonstration is very impressive and makes me think that Perl is very, very powerful. I will try to harvest this category of the forum (note that only these two categories are of interest to me, nothing more): [URL]=http://www.nukeforums.com/forums/viewforum.php?f=3[/URL]
[URL]=http://www.nukeforums.com/forums/viewforum.php?f=17[/URL]

Question: am I able to get the results for the forum categories mentioned above, and can I get the forum threads that are stored in those two forums?
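
Here is a minimal sketch of how I imagine the index pages could be enumerated. It rests on assumptions: that the board pages its topic lists with a start offset (as in viewforum.php?f=17&start=50), that an index page holds 50 topics, and that the number of pages is known. All of this should be checked against the "next page" links on the board itself. index_pages is a name made up here, and the sketch only prints the URLs. The real script would hand each one to get_threads, provided get_threads fetches the URL it is given.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the list of index page URLs for one forum category.
# Assumption: the board pages its index with a "start" offset,
# e.g. viewforum.php?f=17&start=50 for the second page.
# index_pages() is a made-up helper name.
sub index_pages {
    my ($base, $forum_id, $per_page, $num_pages) = @_;
    my @urls;
    for my $n (0 .. $num_pages - 1) {
        my $start = $n * $per_page;
        push @urls, "$base/viewforum.php?f=$forum_id&start=$start";
    }
    return @urls;
}

my $base = 'http://www.nukeforums.com/forums';

# Both categories of interest; 3 pages each, just for the demo
for my $forum_id (3, 17) {
    for my $index_url (index_pages($base, $forum_id, 50, 3)) {
        print "$index_url\n";
        # here the real script would call get_threads($index_url)
        # and then loop over @links as before
    }
}
```

The forum id is a parameter, so the same routine covers both categories.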


I look forward to hearing from you

fllobee



