Re: urllib2.urlopen(url) pulling something other than HTML



"dogatemycomputer@xxxxxxxxx" <dogatemycomputer@xxxxxxxxx> writes:
[...]
----------------------------------------------------------
f = formatter.AbstractFormatter(formatter.DumbWriter(StringIO()))
parser = htmllib.HTMLParser(f)
parser.feed(html)
parser.close()
return parser.anchorlist
----------------------------------------------------------

I get the idea that we're allocating some memory that looks like a
file so formatter.DumbWriter can manipulate it.

Don't worry too much about memory. The "StringIO()" probably only
really allocates the memory needed for the "bookkeeping" that StringIO
does for its own internal purposes, not the memory needed to actually
store the HTML. Later, when you use the object, Python will
dynamically (== at run time) allocate the necessary memory for the
HTML, when the .write() method is called on the StringIO instance.
Python handles the memory allocation for you -- though of course the
code you write affects how much memory gets used.

Note:

- The StringIO is where the *output* goes. (With DumbWriter, that
output is reflowed plain text rather than HTML.)

- The formatter.DumbWriter likely doesn't do anything with the
StringIO() at the time it's passed (it hasn't even seen your HTML
yet, so how could it?). Instead, it just squirrels away the
StringIO() for later use.
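
A minimal sketch of the two points above, written for Python 3 (where
StringIO lives in the io module). HypotheticalWriter is a made-up
stand-in, not the real formatter.DumbWriter; it only illustrates the
"store now, write later" pattern:

```python
import io


class HypotheticalWriter:
    """Made-up stand-in for a writer class; not the real formatter.DumbWriter."""

    def __init__(self, file):
        # Nothing is written here; the file object is only remembered.
        self.file = file

    def send_text(self, text):
        # The buffer grows only now, when data is actually written.
        self.file.write(text)


buffer = io.StringIO()
writer = HypotheticalWriter(buffer)
empty_at_start = buffer.getvalue()   # nothing stored after construction

writer.send_text("some formatted text")
contents = buffer.getvalue()         # the text written above
```

Constructing the writer leaves the buffer empty; it only fills up once
something calls a method that writes to it.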

The results are
passed to formatter.AbstractFormatter which does something else to the
HTML code.

Again, nothing much happens right away on the "f = ..." line. The
formatter.AbstractFormatter just keeps the writer so it can use it
to write out formatted text later on.


The results are then passed to "f" which is then passed to

The results are not "passed" to f. Instead, the results are given a
name, "f". You can give a single object as many names as you like.
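
As a small illustration of names versus objects (a Python 3 sketch; the
names buffer and g are arbitrary and not from your code):

```python
import io

buffer = io.StringIO()
f = buffer          # "f" is a second name for the same object
g = f               # and "g" is a third

g.write("written via g")

same_object = f is g is buffer   # all three names refer to one object
seen_via_f = f.getvalue()        # the write made via "g" is visible via "f"
```

No copying happens on assignment, so whatever is done through one name
is visible through the others.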


htmllib.HTMLParser so it can parse the html for links. I guess I

htmllib.HTMLParser wants the formatter so it can format output
(e.g. you might want to write out the same page with some of the links
removed). It doesn't need the formatter to parse the HTML.
HTMLParser itself is responsible for the parsing -- as the name
implies.


don't understand with any great detail as to why this is happening.
I know someone is going to say that I should RTFM so here is the gist
of the documentation:

formatter.DumbWriter = "This class is suitable for reflowing a
sequence of paragraphs."
formatter.AbstractFormatter = "The standard formatter. This
implementation has demonstrated wide applicability to many writers,
and may be used directly in most circumstances. It has been used to
implement a full-featured World Wide Web browser." <-- huh?

The web browser in question was called "Grail". Grail has been
resting for some time now. By today's standards, "full-featured" is a
bit of a stretch.

But I wouldn't worry too much about what they're trying to say there
yet (it has to do with the way the formatter.AbstractFormatter class
is structured, not what it actually does "out of the box").


So.. What are DumbWriter and AbstractFormatter doing with this HTML and
why does it need to be done before parser.feed() gets a hold of it?

The "heavy lifting" only starts happening when you call
parser.feed(). Before that, you're just setting the stage.


The last question is.. I can't find any documentation to explain
where the "anchorlist" attribute came from? Here is the only
reference to this attribute that I can find anywhere in the Python
documentation.

----------------------
anchor_bgn( href, name, type)
This method is called at the start of an anchor region. The
arguments correspond to the attributes of the <A> tag with the same
names. The default implementation maintains a list of hyperlinks
(defined by the HREF attribute for <A> tags) within the document. The
list of hyperlinks is available as the data attribute anchorlist.
----------------------

That is indeed the (only) documentation for .anchorlist. What more
were you expecting to see?


So .. How does an average developer figure out that parser returns a
list of hyperlinks in an attribute called anchorlist? Is this

They keep the Library Reference under their pillow :-)

And strictly it doesn't *return* a list of links. And that's
certainly not HTMLParser's main function in life. It merely makes
such a list available as a convenience. In fact, many people instead
use module sgmllib, which provides no such convenience, but otherwise
does the same parsing work as module htmllib.
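
To show how little magic is involved, here is a rough sketch of the same
convenience built by hand. It assumes Python 3, where htmllib and
sgmllib no longer exist and html.parser.HTMLParser is the closest
standard-library equivalent. LinkCollector is a hypothetical class of
my own, not part of any library:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper that mimics htmllib's anchorlist convenience."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []    # plain data attribute, filled in during feed()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser lowercases
        # tag names and attribute names before calling this method.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.anchorlist.append(value)


page = ('<p><a href="http://example.com/a">A</a> '
        '<A HREF="/b">B</A> <a name="x">no link</a></p>')

parser = LinkCollector()
links_before = list(parser.anchorlist)   # nothing has been parsed yet
parser.feed(page)                        # the real work happens here
parser.close()
links_after = parser.anchorlist
```

The list is an ordinary attribute that a handler method appends to
while feed() runs; nothing is "returned" by the parser.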


something that you just "figure out" or is there some book I should be
reading that documents all of the attributes for a particular
method? It just seems a bit obscure and certainly not something I
would have figured out on my own. Does this make me a poor developer
who should find another hobby? I just need to know if there is
something wrong with me or if this is a reasonable question to ask.

But you *did* figure it out. How else is it that you come to be
explaining it to us?

Keep in mind that *nobody* knows all of the standard library. I've
been writing Python code full time for years, and I often bump into
whole standard library modules whose existence I'd forgotten about, or
was never really aware of in the first place. The more you know about
what it can do, the more convenience you'll get out of it, is all.


The last question I have is about debugging. The spider is capable
of parsing links until it reaches:

"html = get_page(http://www.google.com/jobs/fortune)" which returns
the contents of a pdf document, assigns the pdf contents to html which
is later passed to parser.feed(html) which crashes.
[...]
How would an experienced Python developer check the contents of "html"
to make sure it's not something other than a blob of HTML code? I
should note an obvious catch-22: how do I check the HTML in such
a way that the check itself doesn't possibly crash the app? I thought
about:

try:
    parser.feed(html)
except parser.HTMLParseError:
    parser.close()


.... but I'm not sure if that is right or not. The app still crashes
so obviously I'm doing something wrong.

That kind of idea is often the best way. In this case, though, you
probably want to do an up-front check by looking at the HTTP
Content-Type header (Google for it), something like this:

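import urllib2

response = urllib2.urlopen(url)
# The header may carry parameters ("text/html; charset=UTF-8"), so compare
# only the media type, and check it before reading the body.
if response.info().gettype() == "text/html":
    html = response.read()
    parse(html)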
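
For what it's worth, the same check can be sketched for Python 3, where
urllib2 became urllib.request and the response headers are an
email.message.Message subclass. The function names looks_like_html and
make_headers are hypothetical, and the sample headers are built by hand
so the example runs without a network connection:

```python
from email.message import Message


def looks_like_html(headers):
    """Return True if the headers declare an HTML body.

    get_content_type() drops parameters such as "; charset=UTF-8" and
    lowercases the result, so a plain comparison is enough.
    """
    return headers.get_content_type() == "text/html"


def make_headers(content_type):
    # Hypothetical helper, used only to build sample headers for the demo.
    headers = Message()
    headers["Content-Type"] = content_type
    return headers


html_ok = looks_like_html(make_headers("text/html; charset=UTF-8"))
pdf_ok = looks_like_html(make_headers("application/pdf"))
```

With a real response from urllib.request.urlopen(url), the equivalent
call would be looks_like_html(response.headers), made before
response.read(). Servers can mislabel content, so this is a cheap first
filter rather than a guarantee.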


John