Combining Regular Expressions
From: Andrew Dixon - Depictions.net (andrew.dixon_at_NOREPLY.depictions.net)
Date: 12/31/03
Date: Wed, 31 Dec 2003 17:49:41 GMT
Hi Everyone.
I have been working on some code that strips the HTML code out of an HTML
page leaving just the text on the page. At the moment this is what I have:
// Strip all tags
replacePattern = "<(.|\n)+?>";
pageHTML = pageHTML.replaceAll(replacePattern,"");
// Remove any HTML-specific character entities (e.g. &quot; or &amp;)
replacePattern = "&(.|\n)+?;";
pageHTML = pageHTML.replaceAll(replacePattern,"");
// Remove whitespace
replacePattern = "\\s{2,}";
pageHTML = pageHTML.replaceAll(replacePattern," ");
Is there a way I can combine all three patterns into one expression so I can
make the code more efficient? I've not really worked with RegEx so any
advice would be most welcome. Can I do something like:
replacePattern = "[<(.|\n)+?>][&(.|\n)+?;][\\s{2,}]";
pageHTML = pageHTML.replaceAll(replacePattern,"");
Thanks.
-- Best Regards >>> Andrew Dixon
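
For reference, here is a minimal sketch of one way the patterns above might be combined. It assumes Java's java.util.regex package. Square brackets define a character class (a set of single characters), so the bracketed attempt above would not group the three expressions; alternation with "|" is the usual way to join them. Because the tag and entity patterns share the same empty replacement they can be merged, while the whitespace pattern uses a different replacement (a single space) and is kept as a second pass here. The class name StripHtml, the method name strip, and the sample input are hypothetical, and the single-pass result may differ from the sequential version in edge cases.

```java
import java.util.regex.Pattern;

class StripHtml {
    // Tags and entities are both replaced with "", so they can be joined with "|".
    // Compiling once avoids recompiling the expression on every call.
    private static final Pattern TAGS_AND_ENTITIES =
            Pattern.compile("<(.|\n)+?>|&(.|\n)+?;");

    // Runs of whitespace are replaced with a single space, so this stays separate.
    private static final Pattern EXTRA_WHITESPACE = Pattern.compile("\\s{2,}");

    static String strip(String pageHTML) {
        String text = TAGS_AND_ENTITIES.matcher(pageHTML).replaceAll("");
        return EXTRA_WHITESPACE.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Hypothetical sample input, for illustration only
        System.out.println(strip("<p>Fish &amp; chips</p>\n\n<b>today</b>"));
    }
}
```

Precompiling the two Pattern objects is likely where most of the efficiency gain would come from, since String.replaceAll compiles its pattern on each call.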