Re: Combining Regular Expressions
From: hiwa (HGA03630_at_nifty.ne.jp)
Date: 01/01/04
- Next message: Tony Morris: "Re: Combining Regular Expressions"
- Previous message: Tony Morris: "Re: Multiposting"
- Maybe in reply to: Chris: "Re: Combining Regular Expressions"
- Next in thread: Tony Morris: "Re: Combining Regular Expressions"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: 31 Dec 2003 16:23:55 -0800
"Andrew Dixon - Depictions.net" <andrew.dixon@NOREPLY.depictions.net> wrote in message news:<V_DIb.2908$nB3.27114715@news-text.cableinet.net>...
> Hi Everyone.
>
> I have been working on some code that strips the HTML code out of an HTML
> page leaving just the text on the page. At the moment this is what I have:
>
> // Strip all tags
> replacePattern = "<(.|\n)+?>";
> pageHTML = pageHTML.replaceAll(replacePattern,"");
>
> //Remove any HTML specific characters (e.g. &quot; or &amp;)
> replacePattern = "&(.|\n)+?;";
> pageHTML = pageHTML.replaceAll(replacePattern,"");
>
> // Remove whitespace
> replacePattern = "\\s{2,}";
> pageHTML = pageHTML.replaceAll(replacePattern," ");
>
> Is there a way I can combine all four patterns into one expression so I can
> make the code more efficient? I've not really worked with RegEx so any
> advice would be most welcome. Can I do something like:
>
> replacePattern = "[<(.|\n)+?>][&(.|\n)+?;][\\s{2,}]";
> pageHTML = pageHTML.replaceAll(replacePattern,"");
>
> Thanks.
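Java regular expressions can be combined with the '|' operator. Note
that replaceAll() takes a single replacement string, so the whitespace
pattern, which is replaced by " " rather than "", still needs its own
call. But if your objective is retrieving text from HTML documents, you
can use the HTMLEditorKit.ParserCallback#handleText() method instead of
regular expressions.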
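
A rough sketch of both approaches follows. The class and method names
(HtmlTextSketch, stripWithRegex, stripWithParser) are made up for
illustration, and the patterns are the ones from the quoted post, left
unchanged. The bracketed attempt in the question would not work, because
square brackets form a character class rather than a group. Joining the
text chunks with a single space in the parser version is an assumption
of mine, not something the parser requires.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the names are illustrative only.
final class HtmlTextSketch {

    // Approach 1: alternate the tag and entity patterns with '|'.
    // Both are replaced by "", so they can share one replaceAll() call.
    // The whitespace pattern uses a different replacement (" "),
    // so it stays in a separate call.
    static String stripWithRegex(String html) {
        String noMarkup = html.replaceAll("<(.|\n)+?>|&(.|\n)+?;", "");
        return noMarkup.replaceAll("\\s{2,}", " ");
    }

    // Approach 2: let the Swing HTML parser find the text and collect it
    // in handleText(). Chunks are joined with one space (an assumption).
    static String stripWithParser(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        try {
            // Third argument true: ignore any charset declared in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

For example, HtmlTextSketch.stripWithRegex("<p>Hello &amp; <b>world</b></p>")
returns "Hello world". The regex version simply deletes entities, while
the parser version hands handleText() the decoded text, so the two do
not always give the same result.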