Re: HTML Parser

From: Jürgen Exner (jurgenex_at_hotmail.com)
Date: 12/17/04


Date: Fri, 17 Dec 2004 07:08:04 GMT

Asoup wrote:
> Jürgen Exner wrote:
>> Asoup wrote:
[...]
>> The right move. Had you read the FAQ or any of the previous threads
>> about this very subject you would have found that solution much
>> earlier.
> Actually, yes, I think that people here would rather argue than
> actually help. So I did read some documentation on CPAN. And I am
> going to study the HTML::Element module closely.

I've no idea what HTML::Element does, but I wonder why you persistently
resist looking at HTML::Parser as suggested by several people.

>>
>>> However, I want specific part of the
>>> text to be displayed...
>>
>> Ok, what have you tried (hint, hint: show us your code!), what did
>> you expect it to do, and what behaviour did you observe?
>>
>> One way to do that is included as an example with HTML::Parser.
>> Unfortunately the examples are not part of the standard
>> installation, so you will have to download and manually unpack the
>> HTML::Parser module from CPAN.
>
> Here is what I have right now:
>
> #!/usr/bin/perl
>
> use lib '/perl/lib';
>
> use LWP::Simple;
> use HTML::TreeBuilder;

I haven't used HTML::TreeBuilder, so I can't comment on that.

[code snipped]
> # It just removes the tags, but now I don't know how to sort and
> *grab* the text I need and remove the rest...

Well, I suppose once you have removed the tags there is nothing left to help
you identify the desired parts. So grab the right text _before_ removing the
tags, i.e. while you still have the syntax tree or whatever
HTML::TreeBuilder returns.
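
A rough sketch of that idea, assuming the look_down and as_text methods that
HTML::TreeBuilder inherits from HTML::Element. The sub name
extract_text_by_id and the id 'content' are made up for illustration, since
the wanted part of the page hasn't been specified:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the text of the first element whose id
# attribute equals $id, or undef if there is no such element.
sub extract_text_by_id {
    my ($html, $id) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;

    # Select the element while the tree (and thus the markup) still exists.
    my $node = $tree->look_down(id => $id);   # scalar context: first match
    my $text = defined $node ? $node->as_text : undef;

    $tree->delete;   # free the tree explicitly
    return $text;
}

# Made-up sample document; only the "content" div is wanted.
my $html = '<html><body><div id="nav">Menu</div>'
         . '<div id="content"><p>Wanted</p></div></body></html>';
print extract_text_by_id($html, 'content'), "\n";
```

The point is only the order of operations: select the element first, strip
the tags (as_text) second.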

And once again: the documentation for HTML::Parser already contains an
example of how to extract the body of a <title> element.
<quote>
The next example prints out the text that is inside the <title> element of
an HTML document. Here we start by setting up a start handler. When it sees
the title start tag it enables a text handler that prints any text found and
an end handler that will terminate parsing as soon as the title end tag is
seen:
[...]
More examples are found in the eg/ directory of the HTML-Parser
distribution: the program hrefsub shows how you can edit all links found in
a document; the program htextsub shows how to edit the text only; the
program hstrip shows how you can strip out certain tags/elements and/or
attributes; and the program htext shows how to obtain the plain text, but not
any script/style content.
</quote>
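
For reference, a minimal sketch of the handler pattern that the quoted
paragraph describes. This is an adaptation, not the verbatim example from
the documentation: it collects the title into a string instead of printing
it, and the sub name extract_title is made up.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the text inside the first <title> element.
sub extract_title {
    my ($html) = @_;
    my $title = '';

    my $p = HTML::Parser->new(api_version => 3);

    # Start handler: does nothing until the <title> start tag shows up.
    $p->handler(start => sub {
        my ($tagname, $self) = @_;
        return unless $tagname eq 'title';

        # Now inside <title>: enable a text handler that collects the
        # (entity-decoded) text ...
        $self->handler(text => sub { $title .= shift }, 'dtext');

        # ... and an end handler that terminates parsing at </title>.
        $self->handler(end => sub {
            my ($tag, $parser) = @_;
            $parser->eof if $tag eq 'title';
        }, 'tagname,self');
    }, 'tagname,self');

    $p->parse($html);
    $p->eof;
    return $title;
}

# Made-up sample document.
print extract_title('<html><head><title>Example page</title></head>'
                  . '<body><p>ignored</p></body></html>'), "\n";
```

Swapping 'title' for another tag name is the obvious first step towards
extracting a different element.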

It can't be that difficult to adapt those examples for whatever you need to
extract. BTW: did you notice that you forgot to tell us _which_ part of the
HTML file you want to extract?

jue


