HTML Parser Help Please

From: ZOCOR (someone_at_somewhere.com)
Date: 09/30/04


Date: Thu, 30 Sep 2004 09:41:13 GMT

Hi

I am using the HTMLEditorKit.Parser class to parse an HTML file. However, I have
found this Swing HTML parser extremely difficult to use.

I am trying to parse an HTML file and extract specific information from it
into a table. Consider this snippet of my HTML and the table I would like to
generate from it:

HTML source:

<HTML>
<TITLE></TITLE>
<BODY>
<PRE>
    Identifer: ABCDEFG
</PRE>
    data: 123456
<PRE>
</PRE>
</BODY>
</HTML>

TABLE:

ABCDEFG 123456

Here is the code I have so far:

import javax.swing.text.*;
import javax.swing.text.html.*;
import java.io.*;

public class HTMLParser extends HTMLEditorKit
{
    public HTMLEditorKit.Parser getParser()
    {
        return super.getParser();
    }

    public static void main (String[] args)
    {
        try
        {
            Reader r = new FileReader("html_file.html");
            HTMLEditorKit.Parser parse = new HTMLParser().getParser();
            HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback()
            {
                public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
                {
                    if (t==HTML.Tag.PRE)
                    {
                            //print whats between the pre tag
                    }
                }
                public void handleText(char[] data, int pos)
                {
                    //print whats between the pre tags
                }
            };

            parse.parse(r, cb, true);
        }
        catch (IOException e)
        {
            System.out.println(e);
        }
    }
}

I would appreciate it very much if someone could help me solve this problem.
I tried the Sun tutorial, but its examples aren't clear enough for me.

Thanks

ZOCOR
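
One possible way to get the table row described in the post, offered as a rough sketch rather than a definitive answer. It uses the stock javax.swing.text.html.parser.ParserDelegator instead of subclassing HTMLEditorKit. The class name LabelledTextExtractor and the rule that every interesting line looks like "label: value" are assumptions made for illustration. They are not part of the original post or of Swing.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects "label: value" pairs from every text run
// the Swing parser reports, whether or not the run sits inside <PRE>.
final class LabelledTextExtractor extends HTMLEditorKit.ParserCallback {
    private final Map<String, String> fields = new LinkedHashMap<>();

    @Override
    public void handleText(char[] data, int pos) {
        // A PRE block may arrive as one chunk with embedded newlines,
        // so split it into lines before looking for the colon.
        for (String line : new String(data).split("\\R")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                String label = line.substring(0, colon).trim();
                String value = line.substring(colon + 1).trim();
                fields.put(label, value);
            }
        }
    }

    // Parses the HTML and returns the labels and values in document order.
    static Map<String, String> extract(Reader html) {
        LabelledTextExtractor callback = new LabelledTextExtractor();
        try {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(html, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.fields;
    }

    public static void main(String[] args) {
        // Same markup as in the post; "Identifer" is spelled as in the sample.
        String html = "<HTML>\n<TITLE></TITLE>\n<BODY>\n<PRE>\n"
                + "    Identifer: ABCDEFG\n</PRE>\n    data: 123456\n"
                + "<PRE>\n</PRE>\n</BODY>\n</HTML>\n";
        Map<String, String> row = extract(new StringReader(html));
        System.out.println(row.get("Identifer") + " " + row.get("data"));
    }
}
```

To read from a file as in the post, pass new FileReader("html_file.html") to extract(). If only text inside PRE blocks should count, one option is to also override handleStartTag and handleEndTag, keep a boolean that is set while HTML.Tag.PRE is open, and check it in handleText. Note that in the sample markup the "data" line sits outside PRE, so that filter would drop it.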


