Re: LWP get but only get the first 25% then exit/stop_retrieveing
- From: "Purl Gurl" <purlgurl@xxxxxxxxxxxx>
- Date: Sun, 13 Nov 2005 09:53:07 GMT
Purl Gurl wrote:
> Purl Gurl wrote:
> > Al wrote:
(snipped)
> > > I want some way(s) to be able to get only the first portion of a web page.
> > > use LWP::Simple;
> Here is a code example with which you may play.
This is another code example which will limit the length of returns
using HTTP / LWP modules (LWP only for this case example).
However, you need to know a bit about how servers operate to understand
the difficulties involved, and why I previously suggested writing a socket to
control a read by bytes received.
Servers send out "chunks" of data. The length of those chunks depends on
the specific server in use. Using these modules, HTTP and LWP, you must
read, at least, the first chunk, entirely, regardless of length.
This code example is rather rough and does not offer a lot of fancy features,
but it will show how you can limit your returns to the first chunk
coming out of a server, though not an Apache server, as previously discussed.
Use of exit will break the connection, for your script, upon reading the first
chunk to arrive. However, you do not truly close the connection; you are
simply "abandoning" it, which is rude in a technical sense.
Without an exit, the read process will continue until the final chunk arrives.
Rather than print to STDOUT, you could open a file for writing before
you connect, to reduce the time the connection stays open, then write and exit.
Other methods could be employed.
In summary, HTTP / LWP is NOT well suited for byte limit reads. You can
limit the read (pseudo) to the first arriving chunk. That chunk can be written
to a file. If you do not break the connection with exit, reading will continue.
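
As one alternative to exit, the callback can call die instead, so the rest of
the script keeps running. LWP documents that a die inside a content callback
aborts the request and records the message in the X-Died response header.
What follows is only a minimal sketch: the names make_limited_callback and
fetch_prefix are hypothetical, and the limit is whatever you pass in.

```perl
use strict;
use warnings;

# Hypothetical helper: build a content callback that collects data into
# $$buffer_ref and dies once at least $limit bytes have arrived.
sub make_limited_callback {
    my ($limit, $buffer_ref) = @_;
    return sub {
        my ($chunk, $response) = @_;
        $$buffer_ref .= $chunk;
        die "byte limit reached\n" if length($$buffer_ref) >= $limit;
        return;
    };
}

# Hypothetical wrapper: fetch $url and return at most the first $limit bytes.
# LWP catches the die, stops reading, and sets the X-Died header.
sub fetch_prefix {
    my ($url, $limit) = @_;
    require LWP::UserAgent;    # loaded here so the helper above stands alone
    my $ua     = LWP::UserAgent->new;
    my $buffer = '';
    my $res    = $ua->get($url,
        ':content_cb' => make_limited_callback($limit, \$buffer));
    warn "Stopped early: ", $res->header('X-Died'), "\n"
        if $res->header('X-Died');
    return substr($buffer, 0, $limit);
}
```

A call such as print fetch_prefix($url, 1024); would then print no more than
1024 bytes. You still receive whole chunks before the die takes effect, so the
race described in this post remains.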
Personally, I would add code based upon content length which will allow
a full read if content length is within an acceptable size. If not, as with a
very large file, I would switch to a limit of the first chunk. Either way,
this method is not all that efficient because background reading will
continue while you print / write, up until actual exit; you are engaging
in a race between read speed and how fast you can exit.
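
The decision described above could be sketched roughly like this. The name
choose_byte_limit and all of its parameters are hypothetical; pick a threshold,
divisor and fallback that suit your own case. The fallback covers a missing
Content-Length header.

```perl
use strict;
use warnings;

# Hypothetical helper: decide how many bytes to accept.
# Returns undef to mean "read the whole document", otherwise a byte limit.
sub choose_byte_limit {
    my ($content_length, $max_full, $divisor, $fallback) = @_;

    # No usable Content-Length header: fall back to a fixed limit.
    return $fallback
        unless defined $content_length && $content_length =~ /^\d+$/;

    # Small enough: read everything.
    return undef if $content_length <= $max_full;

    # Large document: keep a fraction, as with int($content_length / 4).
    return int($content_length / $divisor);
}
```

For example, choose_byte_limit($content_length, 50_000, 4, 8192) reads small
pages whole, limits large pages to a quarter, and uses 8192 bytes when the
server reports no length.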
Research and read about socket programs. A socket is best for this.
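
To give an idea of the socket approach, here is a minimal sketch using the core
IO::Socket::INET module. The names read_up_to and fetch_first_bytes are
hypothetical, and it assumes a plain HTTP server on port 80. The bytes returned
include the status line and headers, which you would need to strip yourself.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Read at most $limit bytes from any filehandle, then stop.
sub read_up_to {
    my ($fh, $limit) = @_;
    my $data = '';
    while (length($data) < $limit) {
        my $got = read($fh, my $buf, $limit - length($data));
        last unless $got;    # 0 means end of data, undef means read error
        $data .= $buf;
    }
    return $data;
}

# Hypothetical wrapper: plain HTTP/1.0 GET over a raw socket.
sub fetch_first_bytes {
    my ($host, $path, $limit) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => 80,
        Proto    => 'tcp',
        Timeout  => 10,
    ) or die "Cannot connect to $host: $@";
    print {$sock} "GET $path HTTP/1.0\r\nHost: $host\r\n\r\n";
    my $data = read_up_to($sock, $limit);
    close $sock;    # abandon the rest of the response
    return $data;
}
```

Here you decide exactly how many bytes to read, rather than waiting for
whatever chunk LWP hands to your callback.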
Keep in mind, breaking an HTTP transaction early is considered rude.
$percent_limit = int ($content_length / 4);
You can divide by 2, by 3, by 4, whatever "seems" to work best for
you. Actual chunk size returned depends on the server and system.
Don't forget, Apache does NOT always return a Content-Length header
for HTML documents (dynamically generated pages, for example); code for
failure handling.
Purl Gurl
#!perl
# use HTTP::Headers; # (optional, use for other specific functions)
use strict;
use warnings;
use HTTP::Request;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
my $url = 'http://www.sec.gov/edgar/searchedgar/companysearch.html';

# Ask for the headers only, to learn the size of the document.
my $response       = $ua->request(HTTP::Request->new('HEAD', $url));
my $content_length = $response->header('Content-Length');
die "No Content-Length header returned; cannot compute a limit\n"
    unless $content_length;
print "Content Length: $content_length\n";

my $percent_limit = int($content_length / 4);
print "Percent Byte Limit: $percent_limit\n";

# The callback runs once for each chunk as it arrives.
# Every chunk is printed, so nothing read before the limit is lost.
my $bytes_in = 0;
$ua->request(HTTP::Request->new('GET', $url),
    sub
    {
        my ($chunk, $res) = @_;
        $bytes_in += length($chunk);
        print $chunk;
        if ($bytes_in > $percent_limit)
        { print "\n\n LIMIT REACHED EXITING"; exit; }
    });
****
An alternative sub which will print the percentage read, sort of in real time,
but too fast for the human eye to read on small pages:
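# Replaces the sub in the request call above; $bytes_in and $total_bytes
# are variables shared with the rest of the script.
sub
{
    my ($chunk, $res) = @_;
    $bytes_in += length($chunk);
    unless (defined ($total_bytes))
    { $total_bytes = $res->content_length || 0; }
    if ($total_bytes)
    { printf "%d%% : ", 100 * $bytes_in / $total_bytes; }
});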