Re: how to capture multiple lines?

From: Tassilo v. Parseval (tassilo.parseval_at_rwth-aachen.de)
Date: 03/29/04


Date: 29 Mar 2004 13:02:51 GMT

Thus spoke Geoff Cox:

> On Mon, 29 Mar 2004 13:53:51 +0200, Gunnar Hjalmarsson
><noreply@gunnar.cc> wrote:
>
>
>>That does not set it to default. This does:
>>
>> $/ = "\n";
>
> The best I can get is as follows
>
> sub para {
>
> local ($/ = "\0a\0d");
>
> my ($linepara) = @_;
> $linepara =~ /<p>(.*?)<\/p>/s;
> # print ("\$1 = $1 \n");
> print OUT ("<tr><td colspan=2>" . $1 . "<\/td><\/tr> \n");
> $/ = "";
> }
>
> Now, this does get the
><p> jahjsdkaljk al
> asdjk aksdj klad
> kajsd akl </p>
>
> text but it also gets some lines which I do not want and do not get if
> I do not use $/ - so am a bit lost. Tempted to put the whole code up
> but that would be asking too much!
>
> I would like to use the slurp approach but not sure how to do it so
> that as I parse through an html file and find the first line of the
> first <p> etc block of text - how do I get that text and put it into a
> file and then when find the second <p> block put it in the right
> place...I do not want to put all the <p> etc text together..they appear
> at different places in the html file....
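
A note on $/ first, since that is probably where the stray lines come
from. $/ only affects how lines are read from a filehandle; it has no
effect on the regex in para(). Also, "\0a\0d" is not CR/LF (it is a NUL,
an 'a', a NUL and a 'd'), and the final $/ = "" switches every later read
to paragraph mode. For the slurp approach you asked about, here is a
minimal sketch. It assumes plain lowercase <p> tags without attributes,
and the sample HTML and variable names are made up for illustration:

```perl
use strict;
use warnings;

# Stand-in for the HTML file: an in-memory filehandle keeps the sketch
# self-contained. With a real file, open it by name instead.
my $html = "<h1>Title</h1>\n<p> first line\n second line </p>\n<hr>\n<p>another block</p>\n";
open my $in, '<', \$html or die "Cannot open input: $!";

my $content;
{
    # local restores the previous value of $/ when the block exits,
    # so later line-by-line reads are unaffected.
    local $/;               # undef means "read to end of file"
    $content = <$in>;
}
close $in;

# /g in list context returns every capture; /s lets . match newlines.
my @paras = $content =~ m{<p>(.*?)</p>}gs;

# One table row per <p> block, each kept separate.
print "<tr><td colspan=2>$_</td></tr>\n" for @paras;
```

Each <p> block ends up in its own element of @paras, so the blocks can be
written wherever they belong. A regex like this breaks as soon as the
tags carry attributes or are nested, which is why I would rather use a
parser.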

If I understand you right, you want to grab everything that appears in
<p> tags? Here's an example using HTML::Parser:

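    #! /usr/bin/perl -w

    package MyParser;

    use strict;
    use base qw/HTML::Parser/;

    # True while the parser is between <p> and </p>.
    our $in_para;

    # Called for every start tag; tag names arrive lowercased.
    sub start {
        my (undef, $tagname) = @_;
        $in_para = 1 if $tagname eq 'p';
    }

    # Called for every end tag.
    sub end {
        my (undef, $tagname) = @_;
        $in_para = 0 if $tagname eq 'p';
    }

    # Called for plain text between tags.
    sub text {
        my (undef, $text) = @_;
        print $text if $in_para;
    }

    package main;

    my $p = MyParser->new;
    # parse_file() returns undef if the file cannot be opened.
    $p->parse_file("file.html") or die "Cannot read file.html: $!\n";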

It's dead simple: you create a subclass of HTML::Parser (MyParser) that
overrides the start(), end() and text() methods. The start() method
sets the global variable $in_para to a true value when it encounters
a <p> start tag; end() sets it back to false when </p> is encountered.
The text() method is triggered for ordinary text and prints it only
when $in_para is true.

This solution is very robust, and since the basic skeleton is only a few
lines, it is easy to extend. You most probably want to change the
text() method so that it prints to a file instead of STDOUT. If you want
to grab everything between <p> and </p> (including other tags), you must
extend start() and end() a bit to print their last argument (which is
the original text of the tag as it appeared in the HTML file). Something
like:

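    sub start {
        my (undef, $tagname, undef, undef, $origtext) = @_;
        # Print before updating the flag so the <p> tag itself is skipped.
        print $origtext if $in_para;
        $in_para = 1 if $tagname eq 'p';
    }

    sub end {
        my (undef, $tagname, $origtext) = @_;
        # Clear the flag first so the </p> tag itself is skipped.
        $in_para = 0 if $tagname eq 'p';
        print $origtext if $in_para;
    }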
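
Since you do not want all the <p> text lumped together, here is a
minimal sketch of one way to keep the blocks apart. ParaCollector is a
hypothetical name, the parser state lives in the object hash rather than
in a package variable, the sample HTML is made up, and STDOUT stands in
for your OUT filehandle:

```perl
#!/usr/bin/perl
use strict;
use warnings;

package ParaCollector;
use base qw/HTML::Parser/;

# Hypothetical subclass: stores each paragraph as its own array element.
sub new {
    my ($class) = @_;
    my $self = $class->SUPER::new();   # no arguments: start/end/text methods are used
    $self->{in_para} = 0;
    $self->{paras}   = [];
    return $self;
}

sub start {
    my ($self, $tagname) = @_;
    return unless $tagname eq 'p';
    $self->{in_para} = 1;
    push @{ $self->{paras} }, '';      # open a new, empty paragraph slot
}

sub end {
    my ($self, $tagname) = @_;
    $self->{in_para} = 0 if $tagname eq 'p';
}

sub text {
    my ($self, $text) = @_;
    # Text may arrive in several chunks, so append rather than assign.
    $self->{paras}[-1] .= $text if $self->{in_para};
}

package main;

my $html = "<h1>Title</h1>\n<p>first\nblock</p>\n<div>skip me</div>\n<p>second block</p>\n";

my $p = ParaCollector->new;
$p->parse($html);
$p->eof;                               # signal end of input

# Write one table row per paragraph.
for my $para (@{ $p->{paras} }) {
    print "<tr><td colspan=2>$para</td></tr>\n";
}
```

With a real file you would call parse_file() as in the first example and
print to OUT instead. Because every paragraph sits in its own array
element, you can place each one wherever it belongs in the output.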

Tassilo

-- 
$_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus})!JAPH!qq(rehtona{tsuJbus#;
$_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexiixesixeseg;y~\n~~dddd;eval

