Re: HTMLParser.HTMLParseError: EOF in middle of construct
- From: Rob Wolfe <rw@xxxxxxxxx>
- Date: Wed, 20 Jun 2007 00:07:39 -0700
Sérgio Monteiro Basto wrote:
Stefan Behnel wrote:
Sérgio Monteiro Basto wrote:
but there is one single error that blocks this.
Finally I found it; it is:
<td colspan="2"align="center"
If I put:
<td colspan="2" align="center"
p = re.compile('"align')
content = p.sub('" align', content)
I can parse the HTML.
I don't know if it is a bug in HTMLParser.
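
A more general version of this clean-up can be sketched as follows. It assumes double-quoted attribute values only, and the helper name `fix_attr_spacing` is made up for illustration. It is a rough pre-parse patch, not a real HTML repair tool, and it does not distinguish markup from ordinary text.

```python
import re

# A double-quoted attribute value (="...") immediately followed by a letter,
# i.e. the start of the next attribute name with no space in between.
_MISSING_SPACE = re.compile(r'(=\s*"[^"]*")(?=[A-Za-z])')

def fix_attr_spacing(content):
    """Insert the missing space after a quoted attribute value."""
    return _MISSING_SPACE.sub(r'\1 ', content)
```

Anchoring the match on `="..."` rather than on a bare quote keeps the opening quote of the next value (as in `="center"`) from being touched.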
Sure, and next time your key doesn't open your neighbour's house, please
report to the building company to have them fix the door.
The question here is whether
<td colspan="2"align="center"
is valid HTML or not.
I think it is valid; if so, it's a bug in HTMLParser.
According to the HTML 4.01 specification this is *not valid* HTML.
"""
Elements may have associated properties, called attributes, which may
have values (by default, or set by authors or scripts). Attribute/value
pairs appear before the final ">" of an element's start tag. Any number
of (legal) attribute value pairs, separated by spaces, may appear in an
element's start tag.
"""
If not, we still have a very bad error message (EOF in middle of
construct!?)
HTMLParser can deal with some errors, e.g. missing end tags,
but it can't handle many other problems.
I have to use HTMLParser because I want to use only the Python 2.4 standard
library; I have to install the scripts on many machines.
I also have to parse many different sites and just want to extract the links,
so a clean-up before parsing solves my problem very quickly.
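
The clean-up-then-parse approach described here might look roughly like the sketch below. The names `LinkCollector` and `extract_links` are hypothetical, the regular expression only repairs a missing space after a double-quoted attribute value, and the import is written to work under either the Python 2 module name `HTMLParser` or the Python 3 name `html.parser`. Other kinds of invalid markup may still make the parser fail.

```python
import re

try:
    from HTMLParser import HTMLParser    # Python 2 module name
except ImportError:
    from html.parser import HTMLParser   # Python 3 module name

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def extract_links(content):
    # Clean-up step: put a space between a quoted attribute value and
    # an attribute name that follows it directly.
    content = re.sub(r'(=\s*"[^"]*")(?=[A-Za-z])', r'\1 ', content)
    parser = LinkCollector()
    parser.feed(content)
    parser.close()
    return parser.links
```

Calling `extract_links(page_source)` on a downloaded page returns the collected `href` values as a list, in document order.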
In Python 2.4 you have to use some third-party module. There is no
other option for _invalid_ HTML. IMHO BeautifulSoup is the best among
them.
--
HTH,
Rob