Re: HTMLParser.HTMLParseError: EOF in middle of construct




Sérgio Monteiro Basto wrote:
Stefan Behnel wrote:

Sérgio Monteiro Basto wrote:
but there is one single error that blocks this.
Finally I found it. It is:
<td colspan="2"align="center"
If I put:
<td colspan="2" align="center"

import re
p = re.compile('"align')
content = p.sub('" align', content)

I can parse the HTML.
I don't know if it is a bug in HTMLParser.
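
The substitution above only repairs the one case where the next attribute happens to be "align". A slightly more general clean-up might look like the sketch below. It is only an illustration, not something posted in the thread: the helper name add_missing_attr_spaces is made up, and because it runs a regular expression over raw HTML it can also touch text outside tags that merely looks like an attribute.

```python
import re

# Hypothetical helper, not from the thread: find a quoted attribute value
# (="..." or ='...') that is immediately followed by a letter, i.e. by the
# start of the next attribute name with no space in between.
_MISSING_SPACE = re.compile(
    r'''(=\s*"[^"<>]*"|=\s*'[^'<>]*')(?=[A-Za-z])'''
)

def add_missing_attr_spaces(html):
    """Insert the missing space after such a quoted value."""
    return _MISSING_SPACE.sub(r'\1 ', html)

fixed = add_missing_attr_spaces('<td colspan="2"align="center">')
```

Markup that already has the space is left unchanged, so the clean-up can be applied to every page before parsing.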

Sure, and next time your key doesn't open your neighbour's house, please
report to the building company to have them fix the door.


The question here is whether
<td colspan="2"align="center"
is valid HTML or not.
I think it is valid; if so, it's a bug in HTMLParser.

According to the HTML 4.01 specification this is *not valid* HTML.

"""
Elements may have associated properties, called attributes, which may
have values (by default, or set by authors or scripts). Attribute/value
pairs appear before the final ">" of an element's start tag. Any number
of (legal) attribute value pairs, separated by spaces, may appear in an
element's start tag.
"""

If not, we still have a very bad error message (EOF in middle of
construct!?)

HTMLParser can deal with some errors, e.g. missing end tags,
but it can't handle many other problems.

I have to use HTMLParser because I want to use only the Python 2.4
standard library; I have to install the scripts on many machines.
And I have to parse many different sites. I just want to extract the
links, so a clean-up before parsing solves my problem very quickly.
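
For illustration, the "clean up, then parse" approach for pulling out links could be sketched roughly as follows. The names LinkCollector and extract_links are invented for this example and are not from the thread. The import is written so that it works both with the Python 2 module name (HTMLParser) and with the Python 3 one (html.parser). Note that recent Python 3 versions of the parser are more tolerant of bad markup than the 2.4 one discussed here.

```python
import re

try:
    from html.parser import HTMLParser   # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser    # Python 2 module name

class LinkCollector(HTMLParser):
    """Hypothetical subclass that records the href of every <a> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag is lower-cased
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def extract_links(content):
    # Clean-up step from the thread: add the space missing before "align"
    content = re.sub('"align', '" align', content)
    parser = LinkCollector()
    parser.feed(content)
    parser.close()
    return parser.links

page = ('<table><tr><td colspan="2"align="center">'
        '<a href="http://example.com/">home</a></td></tr></table>')
links = extract_links(page)
```

Any further clean-up rules that turn out to be necessary for other sites would go at the top of extract_links, before the call to feed().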

In Python 2.4 you have to use some third-party module. There is no
other option for _invalid_ HTML. IMHO BeautifulSoup is the best among
them.

--
HTH,
Rob
