Re: [PHP] Extract printable text from web page using preg_match




On 27-Feb-07, at 1:44 PM, Richard Lynch wrote:

On Tue, February 27, 2007 11:47 am, M5 wrote:
I am trying to write a regex function to extract the readable
(visible, screen-rendered) portion of any web page. Specifically, I
only want the text between the <body> tags, excluding any <script> or
<style> tags within the document, also excluding comments. Has anyone
here seen such a regex? Is it possible to do in one expression?

I think http://php.net/striptags may be your best bet...

No, it's not a very good solution. strip_tags() removes only the tags themselves, so the text inside <head>, <style> and <script> (in the body or out) is left behind. Comments are also included.

I know it's possible to use non-regex strpos/substr to extract everything within <body>, but as another poster correctly said, this assumes a consistently formed HTML document (which cannot be relied on).
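
For reference, the strpos/substr approach mentioned above might look roughly like the sketch below. The function name body_by_offsets() is hypothetical, not something from this thread. It assumes one well-formed <body ...> opening tag, a matching </body>, and no ">" inside the tag's attribute values, which is exactly the consistency assumption that real pages tend to break.

```php
<?php
// Hypothetical helper: returns the markup between <body ...> and </body>,
// or null when either tag cannot be found.
function body_by_offsets($html)
{
    // Case-insensitive search for the start of the opening tag.
    $open = stripos($html, '<body');
    if ($open === false) {
        return null;
    }

    // The opening tag ends at the next '>' (assumes no '>' in attributes).
    $start = strpos($html, '>', $open);
    if ($start === false) {
        return null;
    }

    // First closing tag after the opening tag.
    $end = stripos($html, '</body', $start);
    if ($end === false) {
        return null;
    }

    // Everything strictly between the two tags.
    return substr($html, $start + 1, $end - $start - 1);
}
```

The result still contains <script>, <style> and comment markup, so it only covers the "between the <body> tags" part of the requirement.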

I realize now that such a regex would be rather sophisticated, but I thought surely it must exist, since text-scraping the readable content of a web page cannot be rare.
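
One way to sidestep the single-expression problem is to run several small expressions in sequence and finish with strip_tags(). The following is a minimal sketch under stated assumptions, not a tested scraper: visible_text() is a hypothetical name, the input is assumed to be UTF-8, and the patterns assume that <script>, <style> and comments are properly closed.

```php
<?php
// Hypothetical helper: best-effort extraction of the readable text of a page.
function visible_text($html)
{
    // 1. Drop <script> and <style> elements together with their contents,
    //    then HTML comments. Each match is replaced by a single space.
    $html = preg_replace(
        array(
            '~<script\b[^>]*>.*?</script\s*>~is',
            '~<style\b[^>]*>.*?</style\s*>~is',
            '~<!--.*?-->~s',
        ),
        ' ',
        $html
    );

    // 2. Keep only what is inside <body>...</body>, when a body exists.
    //    Without one, fall back to the whole (already cleaned) document.
    if (preg_match('~<body\b[^>]*>(.*?)</body\s*>~is', $html, $m)) {
        $html = $m[1];
    }

    // 3. Remove the remaining tags, decode entities, collapse whitespace.
    $text = strip_tags($html);
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    return trim(preg_replace('~\s+~', ' ', $text));
}

echo visible_text('<body><script>var a = 1;</script><p>Hello &amp; <b>world</b></p></body>');
```

Two caveats with this sketch: strip_tags() does not insert a space where a tag was, so text from adjacent block elements such as "<p>a</p><p>b</p>" runs together, and on very large pages the lazy ".*?" patterns can hit PCRE's backtrack limit, in which case preg_replace() returns null.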



--
Some people have a "gift" link here.
Know what I want?
I want you to buy a CD from some starving artist.
http://cdbaby.com/browse/from/lynch
Yeah, I get a buck. So?

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
