HTML::Tree help

From: Ing. Branislav Gerzo (konfera_at_2ge.us)
Date: 11/30/04


Date: Tue, 30 Nov 2004 16:22:42 +0100
To: beginners@perl.org

Hi all,

I have to parse several thousand HTML files, so I'd like to use an
HTML parser rather than my own regexps. The HTML I am parsing is quite
complex, so I need your help. First of all, is HTML::Tree a good and
fast module?

I ask because I am not sure whether I should search for elements by criteria with
if( my $h = $tree->look_down('_tag', 'sometag') ) { }
Isn't that slow?

When I dumped the tree with Data::Dumper, a 300 KB HTML file produced 13 MB
of dump output...
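
For inspecting a parsed tree, a lighter option than Data::Dumper might be
the dump method that HTML::Element provides, which prints one line per
node. The large Data::Dumper output is likely because every node is a hash
that also carries its attributes and a reference back to its parent. This
is only a minimal sketch: outline_of is a hypothetical helper name, and
the one-line HTML at the end is made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return HTML::Element's own indented outline of a
# document as a string, instead of a Data::Dumper dump of the object graph.
sub outline_of {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($content);

    my $buf = '';
    open my $fh, '>', \$buf or die "cannot open in-memory handle: $!";
    $tree->dump($fh);    # one line per node; goes to STDOUT if no handle is given
    close $fh;

    $tree->delete;       # release the tree explicitly before parsing the next file
    return $buf;
}

# Illustrative input only, not the real files.
print outline_of('<table><tr><td><span class="tl">TEST:</span></td></tr></table>');
```

Calling delete after each file should keep memory use flat when looping
over thousands of documents.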

OK, now to the problem. The HTML looks like this:

<table width="600%" border="3" align="center" cellspacing="2" cellpadding="2" bgcolor='#eeffff'>
 <tr>
   <td align="left" valign="top" width="20%"> <span class="tl">TEST:&nbsp;</span></td>
   <td align="left" width="80%"><table width="100%" border="0">
   <tr>
    <td width="67%"> <span class='ra'> Vysoká </span> <span class='ra'> 9 </span><br> <span class='ra'> Bratislava </span> <span class='ra'> 810 00 </span><br></td>
    <td width="33%" valign='top'>&nbsp; <span class='ra'>something</span></td>
  </tr>
  </table><table width="100%" border="0">
   <tr>
   <td width="67%"> <span class='ro'> Nám. SNP </span> <span class='ro'> 15 </span><br> <span class='ro'> Bratislava </span> <span class='ro'> 810 00 </span><br></td>
   <td width="33%" valign='top'>&nbsp; <span class='ro'>something</span></td>
  </tr>
  </table><table width="100%" border="0">
   <tr>
   <td width="67%"> <span class='ro'> Bratislava </span><br></td>
   <td width="33%" valign='top'>&nbsp; <span class='ro'>something</span></td>
  </tr>
  </table></td>
</tr>
</table>

(I hope it displays correctly; if not, see http://www.2ge.us/perl/html.txt ).

Nearly all of the HTML is full of tables of this kind. How do I extract
the values from them? I have to check whether there is a class="tl" span
whose text matches /TEST:/i and, if so, get all the values up to the
end of the whole table. Would someone be so kind as to give me some help?
Hint: in the table there is always one class='ra' and, optionally, zero
or more class='ro'.
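
The lookup described above could be sketched roughly like this with
HTML::TreeBuilder. It assumes that every inner table holds spans of a
single class ('ra' or 'ro'), as in the sample. extract_test_values is a
hypothetical name, and the embedded HTML is a cut-down copy of the sample
with most attributes removed.

```perl
use strict;
use warnings;
use utf8;
use Scalar::Util qw(refaddr);
use HTML::TreeBuilder;

# Cut-down copy of the sample markup: one 'ra' table and one 'ro' table.
my $html = <<'HTML';
<table border="3">
 <tr>
   <td> <span class="tl">TEST:&nbsp;</span></td>
   <td><table>
   <tr>
    <td> <span class='ra'> Vysoká </span> <span class='ra'> 9 </span><br> <span class='ra'> Bratislava </span> <span class='ra'> 810 00 </span><br></td>
    <td>&nbsp; <span class='ra'>something</span></td>
  </tr>
  </table><table>
   <tr>
   <td> <span class='ro'> Bratislava </span><br></td>
   <td>&nbsp; <span class='ro'>something</span></td>
  </tr>
  </table></td>
 </tr>
</table>
HTML

# Hypothetical helper: returns one record per inner table of every outer
# table whose class="tl" label matches /TEST:/i.
sub extract_test_values {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($content);
    my @records;

    # All label spans: tag, class and text are checked in a single look_down.
    my @labels = $tree->look_down(
        _tag  => 'span',
        class => 'tl',
        sub { $_[0]->as_text =~ /TEST:/i },
    );

    for my $label (@labels) {
        # Nearest enclosing table of the label is the "whole table".
        my $outer = $label->look_up(_tag => 'table') or next;

        for my $inner ($outer->look_down(_tag => 'table')) {
            next if refaddr($inner) == refaddr($outer);   # look_down also returns $outer itself
            my @spans = $inner->look_down(_tag => 'span', class => qr/^r[ao]$/);
            next unless @spans;
            push @records, {
                class  => $spans[0]->attr('class'),
                values => [ map { $_->as_trimmed_text } @spans ],
            };
        }
    }

    $tree->delete;    # records hold plain strings, so the tree can go
    return @records;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $rec (extract_test_values($html)) {
    print "$rec->{class}: ", join(' | ', @{ $rec->{values} }), "\n";
}
```

Restricting the second look_down to the outer table rather than the whole
tree keeps each search small, which may matter on large files.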

Thanks for any help!

--
 --. ,--  ,-     ICQ: 7552083      \|||/    `//EB: www.2ge.us
,--' |  - |--    IRC: [2ge]        (. .)    ,\\SN: 2ge!2ge_us
`====+==+=+===~  ~=============-o00-(_)-00o-================~
John Tesh might drive (John says ride) a Celica.
 

