subroutine in LWP - in order to get 700 forum threads
- From: Floobee@xxxxxx
- Date: Sat, 26 Aug 2006 01:54:32 +0200
Hello dear Perl addicts,
I have to admit that I am a Perl novice without much experience in Perl, but I am willing to learn. For now I have to solve some tasks for college: I have to investigate a board where I have no access to the database.
First, some background. I have to grab data out of a phpBB forum in order to do some field research. The forum is run by a user community, and I need the data to analyze the discussions. As an example, take the forum below. How can I grab all the data out of this forum, store it locally, and afterwards put it into the database of a local phpBB forum? Is this possible? http://www.nukeforums.com/forums/viewforum.php?f=17
Nothing harmful, nothing bad, nothing serious or dangerous. The only issue is that I have to get the data somehow.
I need the data in an almost full and complete format, so I need fields such as:
username
forum
thread
topic
text of the posting, and so on.
How can I do that? These are the two forums I am interested in:
http://www.nukeforums.com/forums/viewforum.php?f=3
http://www.nukeforums.com/forums/viewforum.php?f=17
[code]
#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;
use HTML::LinkExtor;
use HTML::TokeParser;
use URI::URL;
use Data::Dumper; # for show and troubleshooting

my $url = "http://www.nukeforums.com/forums/viewforum.php?f=17";
# LWP::RobotUA requires an agent name and a contact address;
# both values here are placeholders, replace them with your own
my $ua = LWP::RobotUA->new('forum-reader/0.1', 'me@example.com');
my $lp = HTML::LinkExtor->new(\&wanted_links);
my @links;

get_threads($url);
foreach my $page (@links) { # this loops over each link collected from the index
    my $r = $ua->get($page);
    if ($r->is_success) {
        my $stream = HTML::TokeParser->new(\$r->content) or die "Parse error in $page: $!";
        # just printing what was collected
        print Dumper get_thread($stream);
        # would instead have database insert statement at this point
    } else {
        warn $r->status_line;
    }
}

# Walk one thread page and collect its title plus every (name, post) pair
sub get_thread {
    my $p = shift;
    my ($title, $name, @thread);
    while (my $tag = $p->get_tag('a', 'span')) {
        next unless exists $tag->[1]{'class'};
        if ($tag->[0] eq 'span') {
            if ($tag->[1]{'class'} eq 'name') {
                $name = $p->get_trimmed_text('/span');
            } elsif ($tag->[1]{'class'} eq 'postbody') {
                my $post = $p->get_trimmed_text('/span');
                push @thread, {'name' => $name, 'post' => $post};
            }
        } elsif ($tag->[1]{'class'} eq 'maintitle') {
            $title = $p->get_trimmed_text('/a');
        }
    }
    return {'title' => $title, 'thread' => \@thread};
}

# Fetch one index page and fill @links with absolute thread URLs
sub get_threads {
    my $page = shift;
    # request the page passed in (the original used the global $url here)
    my $r = $ua->request(HTTP::Request->new(GET => $page), sub { $lp->parse($_[0]) });
    # Expand URLs to absolute ones
    my $base = $r->base;
    @links = map { url($_, $base)->abs } @links;
    return \@links;
}

# Callback for HTML::LinkExtor: keep only links to thread pages
sub wanted_links {
    my ($tag, %attr) = @_;
    return unless exists $attr{'href'};
    return if $attr{'href'} !~ /^viewtopic\.php\?t=/;
    push @links, $attr{'href'};
}
[/code]
If the necessary modules are installed and the script is run from the command line, it prints output such as the following:
[code]
$VAR1 = {
    'thread' => [
        {
            'post' => 'Hello, I\'m pretty new to PHPNuke. I\'ve got my site up and running great! I\'m now starting to make modifications, add modules etc. I\'m using the most recent RavenPHP76. I want to display the 5 most recent forum posts at the top of the forum page. I\'m not sure if this functionality is built in, if so, how to activate. Or if there is a module or block made to do this. I looked at Raven\'s Collapsing Forum block but wasn\'t crazy about the format, and I don\'t want it to be collapsable. Thanks! mopho',
            'name' => 'mopho'
        },
        {
            'post' => 'hi there',
            'name' => 'sail'
        },
        {
            'post' => 'thanks for asking this; :not very sure if i got you right; Do you want to have a feed of the last forumthreads? guess the easiest way is to go to raven and ask how he did it. hth sail.',
            'name' => 'sail'
        },
        {
            'post' => 'Thanks. i found what I was looking for. It wasn\'t so easy to find! It\'s called glance_mod. mopho',
            'name' => 'mopho'
        },
        {
            'post' => 'hi there thx',
            'name' => 'sail'
        },
        {
            'post' => 'it sound interesting - i will have also a look i google after it - and try to find out more regards sailor',
            'name' => 'sail'
        }
    ],
    'title' => 'Recent Forum Posts Module'
};
[/code]
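The comment in the script says a database insert would go where the Dumper call is. As a minimal sketch of the step before that insert, the structure returned by get_thread could be flattened into one row per posting. The helper name thread_rows and the column order (title, position, name, post) are assumptions made for illustration; they are not part of the script or of the phpBB schema.

```perl
use strict;
use warnings;

# Flatten one get_thread() result into rows of [title, position, name, post].
# thread_rows is a hypothetical helper; the column order is an assumption.
sub thread_rows {
    my ($thread) = @_;
    my $pos = 0;
    return map { [ $thread->{title}, ++$pos, $_->{name}, $_->{post} ] }
           @{ $thread->{thread} };
}

# Sample data taken from the output shown above (two short posts only)
my $sample = {
    title  => 'Recent Forum Posts Module',
    thread => [
        { name => 'sail', post => 'hi there' },
        { name => 'sail', post => 'hi there thx' },
    ],
};

my @rows = thread_rows($sample);
print join("\t", @$_), "\n" for @rows;
```

Each row could then be handed to whatever insert statement the local database needs, so the parsing code stays independent of the storage code.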
To be honest, I think the problem is that the script only looped over the first index page, here: http://www.nukeforums.com/forums/viewforum.php?f=17
But I need it to loop over all of the more than 50 index pages. So I need a routine here: this part of the code has to go into a subroutine, so that it can be run in a loop.
[code]
#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;
use HTML::LinkExtor;
use HTML::TokeParser;
use URI::URL;
use Data::Dumper; # for show and troubleshooting

# get_thread, get_threads and wanted_links are the subs from the full script

my $url = "http://www.nukeforums.com/forums/viewforum.php?f=17";
# LWP::RobotUA requires an agent name and a contact address;
# both values here are placeholders, replace them with your own
my $ua = LWP::RobotUA->new('forum-reader/0.1', 'me@example.com');
my $lp = HTML::LinkExtor->new(\&wanted_links);
my @links;

get_threads($url);
foreach my $page (@links) { # this loops over each link collected from the index
    my $r = $ua->get($page);
    if ($r->is_success) {
        my $stream = HTML::TokeParser->new(\$r->content) or die "Parse error in $page: $!";
        # just printing what was collected
        print Dumper get_thread($stream);
        # would instead have database insert statement at this point
    } else {
        warn $r->status_line;
    }
}
[/code]
This has to become a subroutine, doesn't it?
The script needs a subroutine so that it can loop over all the index pages of the forum http://www.nukeforums.com/forums/viewforum.php?f=17. The version above does not set up a loop to grab each of the index pages, though some may consider that trivial. The demonstration is very impressive and makes me think that Perl is very, very powerful. I will try to harvest these two categories of the forum (only these two are of interest to me, nothing more):
http://www.nukeforums.com/forums/viewforum.php?f=3
http://www.nukeforums.com/forums/viewforum.php?f=17
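One way the index pages could be enumerated, as a minimal sketch: phpBB 2 boards normally page a forum index through a start offset in the query string, so the list of index-page URLs can be built up front and each one passed to get_threads in turn. The helper name index_page_urls, the "&start=N" paging scheme, and the value of 50 topics per page are assumptions here; check the "next page" links of the real board before relying on them.

```perl
use strict;
use warnings;

# Build the URL of every index page of one forum.
# ASSUMPTION: paging works via "&start=N", where N is a multiple of the
# number of topics per index page (taken to be 50 in the call below).
sub index_page_urls {
    my ($forum_url, $pages, $per_page) = @_;
    return map { $_ == 0 ? $forum_url : "$forum_url&start=" . $_ * $per_page }
           0 .. $pages - 1;
}

# Three pages only, to keep the demonstration short
my @index_pages = index_page_urls(
    "http://www.nukeforums.com/forums/viewforum.php?f=17", 3, 50);
print "$_\n" for @index_pages;
```

In the full script the loop would then read something like "get_threads($_) for @index_pages;", and the same helper could be called a second time with the f=3 URL to cover the other category.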
Question: can I get the results for the forum categories mentioned above, that is, all the forum threads stored in those two forums?
I look forward to hearing from you.
fllobee
- Follow-Ups:
- Re: subroutine in LWP - in order to get 700 forum threads
- From: Robin Norwood
- Re: subroutine in LWP - in order to get 700 forum threads
- From: Randal L. Schwartz