Re: LWP get but only get the first 25% then exit/stop_retrieveing



Purl Gurl wrote:
> Purl Gurl wrote:
> > Al wrote:

(snipped)

> > > I want some way(s) to be able to get only the first portion of a web page.
> > > use LWP::Simple;

> Here is a code example with which you may play.

This is another code example which will limit the length of returns
using HTTP / LWP modules (LWP only for this case example).

However, you need to know a bit about how servers operate to understand
the difficulties involved, and why I previously suggested writing a socket to
control a read by bytes received.

Servers send out "chunks" of data. The length of those chunks depends on
the specific server in use. Using these modules, HTTP and LWP, you must
read, at least, the first chunk, entirely, regardless of length.

This code example is rather rough and does not afford a lot of fancy features
but will serve to exemplify how you can limit your returns to the first chunk
coming out of a server, but not an Apache server, as previously discussed.

Use of exit will break the connection, for your script, upon arrival of the
first chunk. However, you cannot truly break the connection; you are
simply "abandoning" it, which is rude in a technical sense.
Without an exit, the read process will continue until the final chunk arrives.
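
As a variation on exit, the LWP::UserAgent documentation says a content
callback may abort the request by calling die; the script then carries on
instead of ending. Here is a rough sketch of that idea. The name
make_limited_collector and the 2048 byte figure are my own inventions for
illustration, not part of LWP.

```perl
use strict;
use warnings;

# Hypothetical helper: build a content callback that keeps at most
# $limit bytes, then aborts the transfer by calling die.
sub make_limited_collector {
    my ($limit) = @_;
    my $buffer = '';

    my $callback = sub {
        my ($chunk, $response, $protocol) = @_;

        # Keep only as much of this chunk as still fits under the limit.
        my $room = $limit - length($buffer);
        $buffer .= substr($chunk, 0, $room);

        # LWP catches this die and stops reading the response body.
        die "byte limit reached\n" if length($buffer) >= $limit;
        return;
    };

    return ($callback, \$buffer);
}

# Usage with LWP (not run here; $url is whatever page you are fetching):
#   use LWP::UserAgent;
#   my ($cb, $buf) = make_limited_collector(2048);
#   my $res = LWP::UserAgent->new->get($url, ':content_cb' => $cb);
#   print $$buf;
```

The server still sees an abandoned connection, so the rudeness remains. When a
plain byte cap is all you need, LWP::UserAgent also has a max_size setting
worth reading about.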

Rather than print to STDOUT, you could pre-open a file for write, before
you call for a connection to reduce connect time, then write and exit.
Other methods could be employed.

In summary, HTTP / LWP is NOT well suited for byte limit reads. You can
limit the read (pseudo) to the first arriving chunk. That chunk can be written
to a file. If you do not break the connection with exit, reading will continue.

Personally, I would add code based upon content length which will allow
a full read if content length is within an acceptable size. If not, if a very
large file, then I would switch to a limit of the first chunk. Either way,
this method is not all that efficient because background reading will
continue while you print / write, up until actual exit; you are engaging
in a race between read speed and how fast you can exit.
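
The decision described above could be sketched as a small helper. This is
only an illustration: plan_read and the 50,000 byte threshold are hypothetical
names and values of my choosing, nothing from LWP itself.

```perl
use strict;
use warnings;

# Hypothetical helper: decide between a full read and a first-chunk read.
# $content_length is the raw Content-Length header value (may be undef).
# $max_full_read is the largest size, in bytes, you will fetch whole.
sub plan_read {
    my ($content_length, $max_full_read) = @_;

    # No usable header: size unknown, so take the cautious route.
    return 'first_chunk'
        unless defined $content_length && $content_length =~ /^\d+$/;

    return $content_length <= $max_full_read ? 'full' : 'first_chunk';
}

print plan_read(12_000,  50_000), "\n";    # small page
print plan_read(900_000, 50_000), "\n";    # large page
print plan_read(undef,   50_000), "\n";    # header missing
```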

Research and read about socket programs. A socket is best for this.
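
To give a feel for the socket approach, here is a minimal sketch using the
core IO::Socket::INET module. The sub names read_limited and fetch_first_bytes
are made up for this example, and it assumes a plain HTTP/1.0 request on port
80 with no redirects or https. The bytes returned include the response
headers, so allow for them in your limit.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Read at most $limit bytes from any filehandle, then stop.
sub read_limited {
    my ($fh, $limit) = @_;
    my $data = '';

    while (length($data) < $limit) {
        my $want = $limit - length($data);
        $want = 4096 if $want > 4096;

        my $got = read($fh, my $buf, $want);
        last unless $got;    # 0 at end of stream, undef on error
        $data .= $buf;
    }
    return $data;
}

# Hypothetical helper: send a bare GET and keep only the first $limit bytes.
sub fetch_first_bytes {
    my ($host, $path, $limit) = @_;

    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => 80,
        Proto    => 'tcp',
        Timeout  => 10,
    ) or die "connect to $host failed: $@";

    print {$sock} "GET $path HTTP/1.0\r\nHost: $host\r\n\r\n";
    my $data = read_limited($sock, $limit);
    close $sock;    # stop reading; the rest of the page is never fetched
    return $data;
}

# Usage (not run here):
#   print fetch_first_bytes('www.sec.gov',
#       '/edgar/searchedgar/companysearch.html', 2048);
```

Keeping read_limited separate from the connection code means it can be tried
against an ordinary file or in-memory handle before going near a server.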

Keep in mind, breaking an httpd transaction early is considered rude.

$percent_limit = int ($content_length / 4);

You can divide by 2, by 3, by 4, whatever "seems" to work best for
you. Actual chunk size returned depends on the server and system.

Don't forget, Apache does NOT always return a Content-Length header
for html documents (dynamically generated pages, for instance, often
go out without one); code for failure handling.
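
A sketch of the failure handling, assuming you want a fixed fallback limit
when the header is absent. The name byte_limit and the 4096 byte fallback used
in the calls are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: a byte limit that is a fraction of the content
# length, with a fixed fallback when the header is absent or unusable.
sub byte_limit {
    my ($content_length, $divisor, $fallback) = @_;

    return $fallback
        unless defined $content_length
            && $content_length =~ /^\d+$/
            && $content_length > 0;

    return int($content_length / $divisor);
}

print byte_limit(18_000, 4, 4096), "\n";    # header present: 18000 / 4
print byte_limit(undef,  4, 4096), "\n";    # header missing: fallback
```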

Purl Gurl

#!perl

use strict;
use warnings;

# use HTTP::Headers; # (optional, use for other specific functions)
use HTTP::Request;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
my $url = 'http://www.sec.gov/edgar/searchedgar/companysearch.html';

# HEAD request: fetch the headers only, to learn the content length.
my $response = $ua->request(HTTP::Request->new('HEAD', $url));

# Not every server sends Content-Length; stop here if it is missing.
my $content_length = $response->header('Content-Length');
die "No Content-Length header returned for $url\n" unless $content_length;
print "Content Length: $content_length\n";

my $percent_limit = int($content_length / 4);
print "Percent Byte Limit: $percent_limit\n";

my $bytes_in = 0;

# The callback runs once for each chunk as it arrives.
$ua->request(HTTP::Request->new('GET', $url),
    sub
    {
        my ($chunk, $res) = @_;
        $bytes_in += length($chunk);

        # Print every chunk, so nothing read before the limit is lost.
        print $chunk;

        if ($bytes_in > $percent_limit)
        { print "\n\n LIMIT REACHED EXITING\n"; exit; }
    });


****

An alternative sub which will print percentage read, sorta real time,
but is too fast for the human eye to read on small pages:

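# Drop-in replacement for the callback passed to $ua->request above.
# Under "use strict", also declare $bytes_in and $total_bytes with "my"
# before the request call.
sub
{
    my ($chunk, $res) = @_;
    $bytes_in += length($chunk);
    unless (defined ($total_bytes))
    { $total_bytes = $res->content_length || 0; }

    if ($total_bytes)
    { printf "%d%% : ", 100 * $bytes_in / $total_bytes; }
});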