Re: HTML Parser
From: Jürgen Exner (jurgenex_at_hotmail.com)
Date: 12/17/04
- In reply to: Asoup: "Re: HTML Parser"
Date: Fri, 17 Dec 2004 07:08:04 GMT
Asoup wrote:
> Jürgen Exner wrote:
>> Asoup wrote:
[...]
>> The right move. Had you read the FAQ or any of the previous threads
>> about this very subject you would have found that solution much
>> earlier.
> Actually, yes, I think that people here would rather argue than
> actually help. So I did read some documentation on cpan. And I am
> going to study the HTML::Element module closely.
I've no idea what HTML::Element does, but I wonder why you persistently
resist looking at HTML::Parser as suggested by several people.
>>
>>> However, I want specific part of the
>>> text to be displayed...
>>
>> Ok, what have you tried (hint, hint: show us your code!), what did
>> you expect it to do, and what behaviour did you observe?
>>
>> One way to do that is included as an example with HTML::Parser.
>> Unfortunately the examples are not part of the standard
>> installation, so you will have to download and manually unpack the
>> HTML::Parser module from CPAN.
>
> Here is what I have right now:
>
> #!/usr/bin/perl
>
> use lib '/perl/lib';
>
> use LWP::Simple;
> use HTML::TreeBuilder;
I haven't used HTML::TreeBuilder, so I can't comment on that.
[code snipped]
> # It just removes the tags, but now I don't know how to sort and
> *grab* the text I need and remove the rest...
Well, I suppose that after you have removed the tags there is nothing left
to help you identify the desired parts. So grab the right text _before_
removing the tags, i.e. while you still have the syntax tree or whatever
HTML::TreeBuilder returns.
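As a rough illustration of "grab the text while you still have the tree", here is a minimal sketch. It assumes a reasonably recent HTML::TreeBuilder (for new_from_content); the helper name text_of_first, the sample markup, and the choice of the h1 element are made up for the example, since the thread never says which part of the page is wanted.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not part of any module): return the text of the
# first element with the given tag name, or undef if there is none.
sub text_of_first {
    my ($html, $tag) = @_;

    # Build the syntax tree from a string of HTML.
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Find the wanted element while the tags are still there.
    my $node = $tree->look_down(_tag => $tag);
    my $text = defined $node ? $node->as_text : undef;

    # Free the tree explicitly, as the module's documentation recommends.
    $tree->delete;
    return $text;
}

# Made-up sample document, standing in for whatever LWP::Simple fetched.
my $html = '<html><head><title>Demo page</title></head>'
         . '<body><h1>Heading</h1><p>Body text</p></body></html>';

print text_of_first($html, 'h1'), "\n";
```

The point is only the order of operations: locate the element first, then call as_text on that one node instead of on the whole document.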
And once again: the documentation for HTML::Parser already contains an
example for how to extract the body of a <title> element.
<quote>
The next example prints out the text that is inside the <title> element of
an HTML document. Here we start by setting up a start handler. When it sees
the title start tag it enables a text handler that prints any text found and
an end handler that will terminate parsing as soon as the title end tag is
seen:
[...]
More examples are found in the eg/ directory of the HTML-Parser
distribution: the program hrefsub shows how you can edit all links found in
a document; the program htextsub shows how to edit the text only; the
program hstrip shows how you can strip out certain tags/elements and/or
attributes; and the program htext shows how to obtain the plain text, but not
any script/style content.
</quote>
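The handler setup described in that quote can be sketched roughly as follows. This is not a verbatim copy of the documentation's example: the wrapper function extract_title and the sample markup are made up here, and the title text is collected into a variable instead of being printed directly.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical wrapper (name invented for this sketch): return the text
# found inside the <title> element of an HTML string.
sub extract_title {
    my ($html) = @_;
    my $title = '';

    my $p = HTML::Parser->new(api_version => 3);

    # Start handler: does nothing until the title start tag is seen.
    $p->handler(start => sub {
        my ($tagname, $self) = @_;
        return unless $tagname eq 'title';

        # From now on, collect any (entity-decoded) text ...
        $self->handler(text => sub { $title .= shift }, 'dtext');

        # ... and stop parsing as soon as the title end tag shows up.
        $self->handler(end => sub {
            my ($tag, $parser) = @_;
            $parser->eof if $tag eq 'title';
        }, 'tagname,self');
    }, 'tagname,self');

    $p->parse($html);
    $p->eof;    # signal end of input in case the parse was not aborted
    return $title;
}

# Made-up sample document.
my $html = '<html><head><title>Demo page</title></head>'
         . '<body><p>Body text</p></body></html>';

print extract_title($html), "\n";
```

Adapting it to another element should mostly be a matter of changing the tag name tested in the start and end handlers.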
It can't be that difficult to adapt those examples to whatever you need to
extract. BTW: did you notice that you forgot to tell us _which_ part of the
HTML file you want to extract?
jue