Re: HTML Processing in Java




"Honza" <jan.zeman@xxxxxxxxx> wrote in message
news:1133255497.231778.229120@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> Hello,
>
> I would like to process HTML pages in Java. The very first task would
> be to ignore unnecessary information like comments (everything in <!--
> -->) or images.
> What would be the best starting point?
> I have found JTidy and HTML Parser on SourceForge, but neither of them
> seems able to ignore tags - or did I miss it?
>
> Thank you for any clue
> Honza

I haven't used the parsers you mention, but if you find any SAX-based
parser, you'll just receive a stream of "events", one for each "thing"
discovered in the HTML document, and you can simply ignore the "comment"
events (and likewise the events for any tags you don't want, such as img).
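
Here is a minimal sketch of that idea, not a definitive implementation. It assumes the page is well-formed XHTML with no DOCTYPE, so the SAX parser bundled with the JDK can read it; for real-world tag soup you would need a SAX front end that tolerates broken HTML (TagSoup is one candidate). The class name CommentStripper and its strip method are made up for illustration. Note that a plain ContentHandler is never told about comments at all (SAX reports them only to a LexicalHandler), so ignoring them takes no code; images are dropped by skipping the img element events.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of JTidy or HTML Parser.
final class CommentStripper {

    // Re-escape characters that the parser has already decoded.
    private static String esc(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Rebuilds the markup from SAX events, dropping comments and <img> elements.
    // Assumes well-formed XHTML input without a DOCTYPE.
    static String strip(String xhtml) {
        final StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (qName.equals("img")) {
                    return; // ignore the image's start event
                }
                out.append('<').append(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    out.append(' ').append(atts.getQName(i))
                       .append("=\"").append(esc(atts.getValue(i))).append('"');
                }
                out.append('>');
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("img")) {
                    return; // ignore the image's end event
                }
                out.append("</").append(qName).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(esc(new String(ch, start, length)));
            }
            // No comment callback is overridden: DefaultHandler never
            // receives comment events, so comments simply disappear.
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "<html><body><!-- hidden --><p>Hello "
                    + "<img src=\"a.png\"/>world</p></body></html>";
        System.out.println(strip(page));
    }
}
```

The same handler shape should carry over to any parser that speaks the SAX interfaces; only the line that creates the parser would change.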

- Oliver

