parsing HTML

From: Drew (drew_at_drew.com)
Date: 02/28/05


Date: Mon, 28 Feb 2005 11:02:02 -0500


Hi All:

I'm working on a mini HTML parser. Basically, what I need to do is
take an HTML file and parse through it. I want to pick out all of the
text that is between table data tags <td> and </td> and all of the
text between list item tags <li> and </li>.

Since it's possible that a line of HTML could have no spaces at all,
like the one below:

<tr><td>SomeFixture</td></tr>

I'm thinking that I'm going to need to read the HTML file one line at
a time, then look for a < and its closing >. If the text between the
two is td or li, then start capturing text at the location of > + 1
and keep capturing until I hit another < with /td (or /li) after it.
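
To make the idea concrete, here is a rough sketch of that scan in Java.
It is only an illustration under some assumptions of mine: the class
and method names (TagTextScanner, extract) are made up, the input is
handed over as one String (a single line, or the whole file read into
memory), tag names are matched case-insensitively, and attribute
values are assumed not to contain a > character. Markup nested inside
a cell is returned as-is, and nested tables are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: collects the raw text between <td>...</td> and <li>...</li>.
class TagTextScanner {

    static List<String> extract(String html) {
        List<String> found = new ArrayList<>();
        int pos = 0;
        while (true) {
            int open = html.indexOf('<', pos);        // next tag start
            if (open < 0) break;
            int close = html.indexOf('>', open + 1);  // its closing bracket
            if (close < 0) break;
            String name = tagName(html.substring(open + 1, close));
            pos = close + 1;                          // location of > + 1
            if (name.equals("td") || name.equals("li")) {
                int end = indexOfIgnoreCase(html, "</" + name, pos);
                if (end >= 0) {
                    found.add(html.substring(pos, end));
                    pos = end;                        // resume at the closing tag
                }
                // If no closing tag exists, nothing is captured and scanning continues.
            }
        }
        return found;
    }

    // Leading letters/digits of the tag body, lowercased; "" for closing tags like "/td".
    private static String tagName(String inside) {
        int i = 0;
        while (i < inside.length() && Character.isLetterOrDigit(inside.charAt(i))) i++;
        return inside.substring(0, i).toLowerCase(Locale.ROOT);
    }

    // Case-insensitive indexOf, so that </TD> and </td> are both found.
    private static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = from; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(extract("<tr><td>SomeFixture</td></tr>")); // prints [SomeFixture]
    }
}
```

One thing the sketch makes obvious: if I really do read one line at a
time, a <td> whose closing </td> is on a later line gets missed, so
reading the whole file into one String first is probably the simpler
route.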

Does this sound reasonable? Or am I coming up with too difficult a
solution? Does Java have any built-in HTML parsing methods that make
this easier?

Or even if there's an existing Java program that I could modify for
this, that's great too.

Any help is appreciated!

Drew


