Combining Regular Expressions

From: Andrew Dixon - Depictions.net (andrew.dixon_at_NOREPLY.depictions.net)
Date: 12/31/03


Date: Wed, 31 Dec 2003 17:49:41 GMT

Hi Everyone.

I have been working on some code that strips the HTML code out of an HTML
page leaving just the text on the page. At the moment this is what I have:

  // Strip all tags
  replacePattern = "<(.|\n)+?>";
  pageHTML = pageHTML.replaceAll(replacePattern,"");

  // Remove HTML character entities (e.g. &quot; or &amp;)
  replacePattern = "&(.|\n)+?;";
  pageHTML = pageHTML.replaceAll(replacePattern,"");

  // Collapse runs of two or more whitespace characters into a single space
  replacePattern = "\\s{2,}";
  pageHTML = pageHTML.replaceAll(replacePattern," ");
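
For reference, the three steps above can be wrapped into a small
self-contained class. This is only a sketch: the class and method names
(HtmlTextExtractor, extractText) are made up for illustration, and the
patterns are the same ones as above. The one change is that each pattern is
compiled once up front, since String.replaceAll compiles its pattern again
on every call.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
final class HtmlTextExtractor {
    // Compiled once and reused across calls.
    private static final Pattern TAG = Pattern.compile("<(.|\n)+?>");
    private static final Pattern ENTITY = Pattern.compile("&(.|\n)+?;");
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s{2,}");

    static String extractText(String pageHTML) {
        // Step 1: strip all tags.
        String text = TAG.matcher(pageHTML).replaceAll("");
        // Step 2: remove character entities such as &quot; or &amp;.
        text = ENTITY.matcher(text).replaceAll("");
        // Step 3: collapse runs of whitespace into a single space.
        return WHITESPACE_RUN.matcher(text).replaceAll(" ");
    }
}
```

Calling HtmlTextExtractor.extractText(pageHTML) should then give the same
result as the three replaceAll calls.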

Is there a way I can combine all three patterns into one expression to
make the code more efficient? I've not really worked with regular
expressions, so any advice would be most welcome. Can I do something like:

  replacePattern = "[<(.|\n)+?>][&(.|\n)+?;][\\s{2,}]";
  pageHTML = pageHTML.replaceAll(replacePattern,"");
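
As far as I can tell, the attempt above would not work as written, because
square brackets define a character class: each bracketed group matches one
single character from the set listed inside it, so the whole expression
matches exactly three characters. Patterns are normally combined with the
alternation operator "|" instead. A possible sketch is below; the class and
method names (CombinedStripper, strip) are made up for illustration. Note
that only the tag and entity patterns are merged, because they share the
replacement "", while the whitespace step replaces with " " and so stays a
separate call. The merged pattern may also behave differently from the
sequential version in edge cases where a tag and an entity overlap.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
final class CombinedStripper {
    // Alternation: match either a tag or an entity; both are replaced by "".
    private static final Pattern TAG_OR_ENTITY =
            Pattern.compile("<(.|\n)+?>|&(.|\n)+?;");
    // Kept separate because its replacement is " " rather than "".
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s{2,}");

    static String strip(String pageHTML) {
        String text = TAG_OR_ENTITY.matcher(pageHTML).replaceAll("");
        return WHITESPACE_RUN.matcher(text).replaceAll(" ");
    }
}
```

This takes two passes over the text instead of three, for example
pageHTML = CombinedStripper.strip(pageHTML);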

Thanks.

-- 
Best Regards
>>> Andrew Dixon