Re: Processing XML that's embedded in HTML



On Jan 22, 11:32 am, Paul Boddie <p...@xxxxxxxxxxxxx> wrote:

The rest of the document is html, javascript div tags, etc. I need the
information only from the row where the Relationship tag = Owner and
the Priority tag = 1. The rest I can ignore. When I tried parsing it
with minidom, I get an ExpatError: mismatched tag: line 1, column 357
so I think the HTML is probably malformed.

Or that it isn't well-formed XML, at least.

I probably should have posted that I got the error on the first line
of the file, which is why I think it's the HTML. But I wouldn't be
surprised if it was the XML that's behaving badly.
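
For what it's worth, here is a minimal sketch of how ordinary HTML can trigger that kind of error. The fragment is made up for illustration, since the real page isn't shown; it only assumes the page contains an unclosed tag such as <br>, which HTML allows but an XML parser like minidom rejects. It is written for current Python 3 rather than the Python 2.5 used elsewhere in this thread.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical fragment: <br> is legal HTML but is never closed,
# so the text is not well-formed XML.
html_fragment = "<html><body><p>Customers<br></p></body></html>"

try:
    minidom.parseString(html_fragment)
    error_message = None
except ExpatError as exc:
    # expat stops at </p>, because the innermost open element is still <br>.
    error_message = str(exc)

print(error_message)
```

If the message again starts with "mismatched tag", the surrounding HTML alone is enough to explain the failure, whether or not the embedded XML is well-formed.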


I looked at BeautifulSoup, but it seems to separate its HTML
processing from its XML processing. Can someone give me some pointers?

With libxml2dom [1] I'd do something like this:

import libxml2dom
d = libxml2dom.parse(filename, html=1)
# or: d = libxml2dom.parseURI(uri, html=1)
rows = d.xpath("//XML/BoundData/Row")
# or: rows = d.xpath('//XML[@id="grdRegistrationInquiryCustomers"]/BoundData/Row')

Even though the document is interpreted as HTML, you should get a DOM
containing the elements as libxml2 interprets them.

I am currently using Python 2.5 on Windows XP. I will be using
Internet Explorer 6 since the document will not display correctly in
Firefox.

That shouldn't be much of a surprise, it must be said: it isn't XHTML,
where you might be able to extend the document via XML, so the whole
document has to be "proper" HTML.

Paul

[1] http://www.python.org/pypi/libxml2dom


I must have tried this module quite a while ago since I already have
it installed. I see you're the author of the module, so you can
probably tell me what's what. When I do the above, I get an empty list
either way. See my code below:

import libxml2dom
d = libxml2dom.parse(filename, html=1)
rows = d.xpath('//XML[@id="grdRegistrationInquiryCustomers"]/BoundData/Row')
# rows = d.xpath("//XML/BoundData/Row")
print rows

I'm not sure what is wrong here... but I got lxml to create a tree by
doing the following:

<code>
from lxml import etree
from StringIO import StringIO

parser = etree.HTMLParser()
tree = etree.parse(filename, parser)
xml_string = etree.tostring(tree)
context = etree.iterparse(StringIO(xml_string))
</code>

However, when I iterate over the contents of "context", I can't figure
out how to nab the row's contents:

for action, elem in context:
    if action == 'end' and elem.tag == 'relationship':
        # do something...but what!?
        # this if statement probably isn't even right
        pass
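
One way to pull out the matching row, sketched without lxml or libxml2dom: the standard library's html.parser (Python 3) is a tolerant, event-driven HTML parser, so it can walk the whole page and collect each Row's child tags into a dict. Everything about the sample markup (the Name field, the values, the nesting inside a div) is hypothetical, as are the names RowCollector and owner_rows; only the XML/BoundData/Row path and the Relationship and Priority tags come from the posts above. Note that html.parser reports tag names in lower case, so the comparisons use "row", "relationship" and "priority".

```python
from html.parser import HTMLParser

# Hypothetical sample: the real page's markup is not shown in the thread.
SAMPLE = """<html><body><div>
<XML id="grdRegistrationInquiryCustomers"><BoundData>
<Row><Name>Alice</Name><Relationship>Owner</Relationship><Priority>1</Priority></Row>
<Row><Name>Bob</Name><Relationship>Driver</Relationship><Priority>2</Priority></Row>
</BoundData></XML>
</div></body></html>"""


class RowCollector(HTMLParser):
    """Collect each <row> as a dict mapping child tag name to its text."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # dict for the row being read, or None
        self._field = None  # name of the child tag being read, or None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased, whatever the page used.
        if tag == "row":
            self._row = {}
        elif self._row is not None:
            self._field = tag
            self._row[tag] = ""

    def handle_data(self, data):
        # Keep text only while inside a child tag of a row.
        if self._row is not None and self._field is not None:
            self._row[self._field] += data

    def handle_endtag(self, tag):
        if tag == "row" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._field = None
        elif tag == self._field:
            self._field = None


def owner_rows(html_text):
    """Return the rows whose Relationship is Owner and Priority is 1."""
    collector = RowCollector()
    collector.feed(html_text)
    collector.close()
    return [row for row in collector.rows
            if row.get("relationship", "").strip() == "Owner"
            and row.get("priority", "").strip() == "1"]


matches = owner_rows(SAMPLE)
print(matches)
```

If the upper-case XPath expressions above keep returning an empty list, it may be worth trying lower-case element names there too, on the assumption that the HTML parser in use normalises tag names the same way.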


Thanks for the quick response, though! Any other ideas?

Mike