HTMLParser question

From: Rajarshi Guha (rajarshi_at_presidency.com)
Date: 08/19/04


Date: Thu, 19 Aug 2004 11:27:24 -0400

Hi,
  I have some HTML that essentially consists of a series of <div>'s,
each <div> having one of two classes (tnt-question or tnt-answer).
I'm using HTMLParser to handle the tags as follows:

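import HTMLParser

class MyHTMLParser(HTMLParser.HTMLParser):

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if len(attrs) == 1:
            cls,whichcls = attrs[0]
            if whichcls == 'tnt-question':
                print self.get_starttag_text(), self.getpos()
    def handle_endtag(self, tag):
        pass
    def handle_data(self, data):
        # called by the parser for the text between any pair of tags
        print data

if __name__ == '__main__':

    # read() keeps the file as-is; joining readlines() with
    # string.join would insert an extra space between lines
    htmldata = open('tt.html','r').read()
    parser = MyHTMLParser()
    parser.feed( htmldata )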

However what I would like is that when the parser reaches some HTML like
this:

        <div class="tnt-question">
            How do I add a user to a MySQL system?
        </div>

I should get back the data between the open and close tags. However the
above code prints the text contained between all tags, not just the <div>
tags with the class='tnt-question'.

Is there a way to call handle_data() only when a specific tag is being
handled? Placing a call to handle_data() in handle_starttag() seems to be
the way, but I'm not sure how to actually do it. What data should I pass
to the call?

Any pointers would be appreciated.
Thanks,
Rajarshi
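
One possible approach, sketched here under some assumptions: rather than
calling handle_data() from handle_starttag(), let handle_starttag() set a
flag that handle_data() checks before keeping any text. The sketch uses
Python 3's html.parser module instead of the HTMLParser module from the
post (the handler method names are the same). The names QuestionParser,
extract_questions, depth, chunks and questions are hypothetical, and the
class attribute is assumed to be exactly "tnt-question".

```python
from html.parser import HTMLParser


class QuestionParser(HTMLParser):
    """Collect the text inside <div class="tnt-question"> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # > 0 while inside a tnt-question <div>
        self.chunks = []      # text pieces of the question being read
        self.questions = []   # finished question texts

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        if self.depth:
            # A nested <div> inside a question: count it so that its
            # closing tag is not mistaken for the end of the question.
            self.depth += 1
        elif dict(attrs).get('class') == 'tnt-question':
            self.depth = 1
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # The question <div> just closed: store its text.
                self.questions.append(''.join(self.chunks).strip())

    def handle_data(self, data):
        # The parser calls this for every run of text; keep only the
        # runs that arrive while a question <div> is open.
        if self.depth:
            self.chunks.append(data)


def extract_questions(html):
    """Return the stripped text of each tnt-question <div> in html."""
    parser = QuestionParser()
    parser.feed(html)
    parser.close()
    return parser.questions
```

Calling extract_questions() on the contents of tt.html would then return
a list with one string per question, and text from tnt-answer divs or
from outside any div would be skipped.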


