Re: Extract printable text from web page using preg_match
- From: gmane@xxxxxxxxxxxxxx (Colin Guthrie)
- Date: Wed, 28 Feb 2007 08:48:44 +0000
M5 wrote:
No, it's not a very good solution. strip_tags() will leave everything
within <head>, <style> and <script> (in the body or out). Comments are
also included.
I know it's possible to use non-regex strpos()/substr() to extract everything
within <body>, but as another poster correctly said, this assumes a
consistent HTML document (which cannot be relied on).
I realize now that such a regex would be rather sophisticated, but I
thought surely it must exist, since text-scraping the readable content
of a web page cannot be rare.
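For the regex route described above, a rough sketch might look like the following. The function name html_to_text() is made up for illustration and is not from the thread. It assumes reasonably well-formed markup (matching closing tags for <head>, <style> and <script>) and will misbehave on badly broken pages, which is exactly the weakness already pointed out.

```php
<?php
// Hypothetical helper (not from the thread): remove non-visible blocks first,
// then let strip_tags() deal with the remaining markup.
function html_to_text(string $html): string
{
    // Drop <head>, <style> and <script> elements together with their contents.
    // Modifiers: i = case-insensitive, s = "." also matches newlines.
    // preg_replace() returns null on error, so fall back to the input.
    $html = preg_replace('#<(head|style|script)\b[^>]*>.*?</\1\s*>#is', ' ', $html) ?? $html;

    // Drop HTML comments explicitly.
    $html = preg_replace('/<!--.*?-->/s', ' ', $html) ?? $html;

    // Strip the remaining tags and decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text) ?? $text);
}
```

On very large pages the lazy ".*?" patterns can hit PCRE's backtrack limit, so treat this as a starting point rather than a robust scraper.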
I've said it before, but a low-tech solution is to run the program "lynx" with the
-dump argument and capture the output back in PHP. I'm assuming you are
on Linux or OS X, as I've not heard of anyone using lynx on Windows.
There are loads of command-line options to control how lynx renders
the output, so you have fine-grained control here.
Col
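A minimal sketch of the lynx approach, assuming a Unix-like system with lynx installed and on the PATH, and shell_exec() not disabled in php.ini. The function name lynx_dump() and its $lynx parameter are hypothetical. Besides -dump, the only extra option used is -nolist, which omits the list of links lynx normally appends to a dump.

```php
<?php
// Hypothetical helper (not from the thread): render a page with lynx and
// return the text it prints, or null if nothing usable came back.
function lynx_dump(string $url, string $lynx = 'lynx'): ?string
{
    // Only accept http(s) URLs, so the argument cannot be mistaken for an option.
    if (!preg_match('#^https?://#i', $url)) {
        return null;
    }

    // escapeshellarg() stops a hostile URL from injecting shell commands.
    // "2>/dev/null" discards lynx's error output (assumes a Unix shell).
    $cmd = escapeshellcmd($lynx) . ' -dump -nolist ' . escapeshellarg($url) . ' 2>/dev/null';

    $output = shell_exec($cmd);

    return (is_string($output) && $output !== '') ? $output : null;
}

// Usage (needs lynx and network access):
// $text = lynx_dump('http://www.example.com/');
```

The binary is a parameter only so that a full path such as /usr/bin/lynx can be supplied when lynx is not on the web server's PATH.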