Re: Combining Regular Expressions

From: hiwa (HGA03630_at_nifty.ne.jp)
Date: 01/01/04


Date: 31 Dec 2003 16:23:55 -0800


"Andrew Dixon - Depictions.net" <andrew.dixon@NOREPLY.depictions.net> wrote in message news:<V_DIb.2908$nB3.27114715@news-text.cableinet.net>...
> Hi Everyone.
>
> I have been working on some code that strips the HTML code out of an HTML
> page leaving just the text on the page. At the moment this is what I have:
>
> // Strip all tags
> replacePattern = "<(.|\n)+?>";
> pageHTML = pageHTML.replaceAll(replacePattern,"");
>
> //Remove any HTML specific characters (e.g. &quot; or &amp;)
> replacePattern = "&(.|\n)+?;";
> pageHTML = pageHTML.replaceAll(replacePattern,"");
>
> // Remove whitespace
> replacePattern = "\\s{2,}";
> pageHTML = pageHTML.replaceAll(replacePattern," ");
>
> Is there a way I can combine all four patterns into one expression so I can
> make the code more efficient? I've not really worked with RegEx so any
> advice would be most welcome. Can I do something like:
>
> replacePattern = "[<(.|\n)+?>][&(.|\n)+?;][\\s{2,}]";
> pageHTML = pageHTML.replaceAll(replacePattern,"");
>
> Thanks.

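Java regular expressions can be combined with the '|' (alternation)
operator, e.g. "<(.|\n)+?>|&(.|\n)+?;". That only works for patterns
that share the same replacement string: your whitespace pattern
replaces with " " rather than "", so it still needs its own
replaceAll() call. Square brackets, as in your last example, define a
character class (a set of single characters), not a group of
sub-patterns. But if your objective is retrieving text from HTML
documents, you can use the HTMLEditorKit.ParserCallback#handleText()
method instead of regular expressions.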
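
A rough sketch of both approaches, in case it helps. The class and
method names (HtmlTextSketch, stripWithRegex, stripWithParser) are made
up for illustration, and driving the callback through ParserDelegator
is my assumption about how you would wire it up. Treat it as a starting
point, not a finished HTML-to-text converter.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; names are illustrative only.
class HtmlTextSketch {

    // Approach 1: join the tag and entity patterns with '|'.
    // Both are replaced by "", so they can share one replaceAll() call.
    // The whitespace pattern has a different replacement (" "), so it
    // stays a separate call.
    // Caveat: a bare '&' in the text would be matched up to the next ';'.
    static String stripWithRegex(String html) {
        String noMarkup = html.replaceAll("<(.|\n)+?>|&(.|\n)+?;", "");
        return noMarkup.replaceAll("\\s{2,}", " ");
    }

    // Approach 2: let the Swing HTML parser find the text and collect
    // whatever it passes to handleText().
    static String stripWithParser(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate consecutive text chunks with a single space.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        try {
            // Third argument: ignore any charset declared in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected when reading from a String.
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Hello &amp; <b>world</b></p>";
        System.out.println(stripWithRegex(html));
        System.out.println(stripWithParser(html));
    }
}
```

The regex version simply deletes entities such as &amp;, as in your
original code. The parser version works on parsed text instead of
pattern matching, so the exact spacing of its output may differ.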


