parsing HTML

From: Drew (drew_at_drew.com)
Date: 02/28/05


Date: Mon, 28 Feb 2005 11:02:02 -0500


Hi All:

I'm working on a mini HTML parser. Basically, what I need to do is
take an HTML file and parse through it. I want to pick out all of the
text that is between the table data tags <td> and </td> and all of the
text between the list item tags <li> and </li>.

Since it's possible that a line of HTML could have no spaces at all,
like the one below:

<tr><td>SomeFixture</td></tr>

I'm thinking that I'm going to need to read the HTML file one line at
a time, then look for < and its closing >. If the text between the
two is td or li, then start capturing text at the location of > + 1
and keep going until I hit another < with /td (or /li) after it.
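
A minimal sketch of the scan described above, not a definitive
implementation. It makes a few assumptions that are not part of the
plan as stated: the whole file is already in one String (a cell can
span several lines, so scanning line by line would miss those), tag
names are matched case-insensitively, and a tag name ends at the first
whitespace so that <td class="x"> still counts. The class and method
names (TdLiExtractor, extract, tagName, indexOfIgnoreCase) are made up
for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TdLiExtractor {

    /** Collects the raw text between <td>...</td> and <li>...</li> pairs. */
    static List<String> extract(String html) {
        List<String> results = new ArrayList<>();
        int pos = 0;
        while (true) {
            int open = html.indexOf('<', pos);        // next '<'
            if (open < 0) break;
            int close = html.indexOf('>', open + 1);  // its closing '>'
            if (close < 0) break;

            String name = tagName(html.substring(open + 1, close));
            if (name.equals("td") || name.equals("li")) {
                // Capture starts at '>' + 1 and runs to the matching close tag.
                int end = indexOfIgnoreCase(html, "</" + name + ">", close + 1);
                if (end >= 0) {
                    results.add(html.substring(close + 1, end).trim());
                    pos = end;   // resume scanning at the closing tag
                    continue;
                }
            }
            pos = close + 1;     // not a tag we care about, or never closed
        }
        return results;
    }

    /** Lower-cased tag name: everything up to the first whitespace. */
    static String tagName(String inside) {
        String s = inside.trim();
        int i = 0;
        while (i < s.length() && !Character.isWhitespace(s.charAt(i))) {
            i++;
        }
        return s.substring(0, i).toLowerCase(Locale.ROOT);
    }

    /** Case-insensitive indexOf, so </TD> matches as well as </td>. */
    static int indexOfIgnoreCase(String s, String target, int from) {
        for (int i = Math.max(from, 0); i + target.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, target, 0, target.length())) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Prints [SomeFixture]
        System.out.println(extract("<tr><td>SomeFixture</td></tr>"));
    }
}
```

Known gaps in this sketch: any markup nested inside a cell (for
example <b>...</b>) is returned as part of the captured text, a table
nested inside a <td> ends the capture at the first </td>, and tags
inside HTML comments or containing a > within an attribute value are
not handled.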

Does this sound reasonable? Or am I coming up with too difficult a
solution? Does Java have any built-in HTML parsing methods that make
this easier?

Or even if there's an existing Java program that I could modify for
this, that's great too.

Any help is appreciated!

Drew


