Re: Regular expression to find <tr> tags in 2nd level HTML tables

From: Shannon Jacobs (shanen_at_my-deja.com)
Date: 01/09/04


Date: 8 Jan 2004 21:24:29 -0800

Brian Genisio <BrianGenisio@yahoo.com> wrote in message news:<3FFC118E.7040800@yahoo.com>...
> Shannon Jacobs wrote:
>
<snip>
> Take a look at the TidyLib. It is a C library that will parse HTML for
> you, in DOM-Like nodes, which you can traverse like a tree. It was
> originally developed via the W3C, but it is available via SourceForge
<snip>
> Using a RegExp will break as soon as the HTML format changes, but a
> smart tree traversal will likely be more robust.
>
> If you go the TidyLib method, you can manipulate the data quickly, and
> easily develop your palm database via C routines.

From your description, this doesn't really sound like an approach I
want to take. It's not a matter of simple access, but of pruning and
manipulation. If I really wanted to follow this approach, the most
bankable-for-use-in-the-real-office approach would be the Excel macro
programming approach I mentioned. However, anytime anyone mentions
Microsoft or Visual <anything> I feel like I want to hold up a silver
cross and scream "Return to Hades, you evil demons!"

However, due to your hint and another source, I thought to explore the
DOM tree to get a better understanding of the problem. Mozilla has a
DOM explorer that was quite good for this, and I can clarify the
problem now. Here is a reduction of the situation:

<table>
  <tr>
  <tr>
    <table>
      <tr>
      <tr>
      ....
      <tr>
  <tr>
  <tr>
  <tr>
  <tr>
  <tr>
  ...
  <tr>

In the outermost table, there is some useful data worth saving in the
first <tr> row. In the 2nd level table, there is some useful data,
mostly numbers, in each of those <tr> rows. Returning to the outer
table, the 7th <tr> row also contains some information that would be
worth saving. That's the legend I mentioned in the earlier post, but
which I still feel would be too difficult to parse in a robust way.

The rest of it is basically dross, and my current regexes toss it away
quite nicely. The main problem is that the line breaks associated with
those second level <tr> tags are useful and significant, and I want to
keep them.
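
To make "second level" concrete, here is a rough JavaScript sketch
that walks the <table> and <tr> tags with one regex and a depth
counter. The function name and the sample markup are made up for
illustration, and it assumes every <table> has a matching </table> and
that no attribute value contains a ">" character.

```javascript
// Hypothetical helper: returns the character offset of every <tr> that
// opens while exactly two <table> elements are open.
function findSecondLevelRows(html) {
  const tagPattern = /<(\/?)(table|tr)\b[^>]*>/gi;
  const offsets = [];
  let depth = 0; // number of currently open <table> elements
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    const isClosing = match[1] === "/";
    const name = match[2].toLowerCase();
    if (name === "table") {
      depth += isClosing ? -1 : 1;
    } else if (!isClosing && depth === 2) {
      offsets.push(match.index);
    }
  }
  return offsets;
}

// Made-up sample shaped like the reduction above.
const depthSample = [
  "<table>",
  "<tr><td>header</td></tr>",
  "<tr><td><table>",
  "<tr><td>1</td></tr>",
  "<tr><td>2</td></tr>",
  "</table></td></tr>",
  "<tr><td>legend</td></tr>",
  "</table>",
].join("\n");

const secondLevelOffsets = findSecondLevelRows(depthSample);
console.log(secondLevelOffsets.length); // 2
```

The counter is what a single pattern cannot express by itself: the
same <tr> text means different things depending on how many tables
are open around it.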

There seem to be two regex-based approaches that are possible. One is
to use one regex to mark them in a way that prevents them from being
tossed, and then restore them at the end after the other line
breaks have been removed, basically with the reverse regex. I'm
already doing that with some other information that needs to be
preserved.
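
A minimal JavaScript sketch of that mark-and-restore idea follows. The
placeholder string and the function name are hypothetical, and for
brevity it protects the break before any <tr>, so it assumes the text
has already been narrowed down to the second-level table.

```javascript
// Hypothetical placeholder; assumed never to occur in the real page.
const ROW_MARK = "@@ROW@@";

function markThenRestore(text) {
  return text
    .replace(/\r?\n(?=<tr\b)/gi, ROW_MARK) // 1. protect breaks before <tr>
    .replace(/\r?\n/g, "")                 // 2. toss every other line break
    .split(ROW_MARK).join("\n");           // 3. reverse step: restore them
}

const markSample = "<table>\n<tr>\n<td>1</td>\n<tr>\n<td>2</td>\n</table>";
console.log(markThenRestore(markSample));
// <table>
// <tr><td>1</td>
// <tr><td>2</td></table>
```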

The other approach would be to just save the immediately preceding
line breaks while tossing all the others. I think I favor this
approach because it strikes me as most elegant and in keeping with the
spirit of the great regex of the heading of 137 degrees. ;-) A related
approach to this one would be to toss all the line breaks at the
beginning, and then insert the correct ones before throwing the other
dross.
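
The related variant (toss every line break first, then put one back in
front of each second-level <tr>) might look like this in JavaScript.
It is only a sketch: the function name is invented, and it again
assumes balanced <table> tags and no ">" inside attribute values.

```javascript
// Hypothetical helper: removes all line breaks, then re-inserts one
// before each <tr> that opens while two <table> elements are open.
function keepSecondLevelRowBreaks(html) {
  let depth = 0; // number of currently open <table> elements
  return html
    .replace(/\r?\n/g, "")
    .replace(/<(\/?)(table|tr)\b[^>]*>/gi, (tag, slash, name) => {
      if (name.toLowerCase() === "table") {
        depth += slash ? -1 : 1;
        return tag;
      }
      return !slash && depth === 2 ? "\n" + tag : tag;
    });
}

const breakSample =
  "<table>\n<tr><td>a</td></tr>\n<tr><td><table>\n" +
  "<tr><td>1</td></tr>\n<tr><td>2</td></tr>\n" +
  "</table></td></tr>\n<tr><td>z</td></tr>\n</table>";

console.log(keepSecondLevelRowBreaks(breakSample).split("\n").length); // 3
```

The replacement callback is doing the counting here, so this is a
regex plus a little state rather than one pure pattern.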

I actually found a rather similar recent thread in the comp.lang.perl
newsgroup, so I've cross-posted to that newsgroup, too. That involved
using

s/<[^>]*>//g;

to remove all of the HTML tags, but I need to be more selective.
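
For comparison, here is one way to be more selective, written in
JavaScript rather than Perl. A negative lookahead spares the <table>
and <tr> tags and strips everything else. Which tags to spare is my
assumption for the example, the function name is invented, and like
the original pattern it will misfire on an attribute value that
contains ">".

```javascript
// Hypothetical helper: strips every tag except <table>, </table>,
// <tr> and </tr>.
function stripAllButTableAndRowTags(html) {
  return html.replace(/<(?!\/?(?:table|tr)\b)[^>]*>/gi, "");
}

console.log(
  stripAllButTableAndRowTags("<table><tr><td><b>1</b></td></tr></table>")
);
// <table><tr>1</tr></table>
```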

I also wanted to include a response to the other reply, snide though
it was.

His first snide question was "Why?", in response to my preference for
a regex-based solution. I've already mostly answered that question,
but I'll add that I think regex-based solutions can be quite elegant,
and apparently I sometimes like having my head bent through the regex
dimension.

He then recommended using an HTML parsing module and suggested asking
in a JavaScript newsgroup. In the original post I had already
explained why I wanted this direct approach, and I had already asked
in the JavaScript newsgroup with the original cross-post. I suspect
him of being a wannabe Perler, since real Perl people tend to be very
observant of all details. The regex experts even more so. However, I
just wanted to note that his attitude is one of the main reasons I
quit working in Perl. IMNSHO, it's rather too common among Perl users,
and I'd hate to wind up like that.


