Re: Writing HTML parser wasn't as hard as I thought it'd be
- From: Kent M Pitman <pitman@xxxxxxxxxxx>
- Date: 30 Apr 2007 12:55:39 -0400
gisle@xxxxxxxxxxxxxx (Gisle Sælensminde) writes:
> Kent M Pitman <pitman@xxxxxxxxxxx> writes:
> > Modularizing the task into something that corrects bad HTML to good
> > and something that displays good HTML is probably the way to go.
> > Parsers for bad HTML don't have to know about HTML "meaning", just its
> > structure.
>
> This is in fact what at least one web browser does internally. In an earlier
> job I did web browser development, and that web browser first parsed the HTML
> into a DOM tree, and before the tree was sent to the rendering engine it went
> through a so-called "DOM-fixer". The DOM-fixer was basically a set of rules to
> rewrite bad HTML, so that the rendering engine did not have to deal with it.
> These rules were constantly rewritten in order to be able to display all the
> pages the other guys could display. This code was required for the browser
> to be able to show what people expected a browser to be able to display.
> I would guess that most web browsers do something similar.
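
A rough sketch of how a rule-driven fixer pass of this kind could sit between
the parser and the renderer. This is only an illustration, not the code of any
actual browser: the Node type, the fix_dom function, and both rules are
hypothetical names made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Union


@dataclass
class Node:
    """A toy DOM node: a tag name plus child nodes or text strings."""
    tag: str
    children: List[Union["Node", str]] = field(default_factory=list)


Rule = Callable[[Node], Node]


def unwrap_nested_anchor(node: Node) -> Node:
    """Hypothetical rule: an <a> directly inside an <a> is replaced by its children."""
    if node.tag != "a":
        return node
    flat: List[Union[Node, str]] = []
    for child in node.children:
        if isinstance(child, Node) and child.tag == "a":
            flat.extend(child.children)
        else:
            flat.append(child)
    return Node(node.tag, flat)


def drop_empty_inline(node: Node) -> Node:
    """Hypothetical rule: remove childless <b> and <i> elements."""
    kept = [
        c for c in node.children
        if not (isinstance(c, Node) and c.tag in {"b", "i"} and not c.children)
    ]
    return Node(node.tag, kept)


def fix_dom(node: Node, rules: List[Rule]) -> Node:
    """Fix the children first, then apply every rule to this node in order."""
    fixed_children = [
        fix_dom(c, rules) if isinstance(c, Node) else c for c in node.children
    ]
    result = Node(node.tag, fixed_children)
    for rule in rules:
        result = rule(result)
    return result


# The rule list is the part that would keep growing as new bad pages turn up.
RULES: List[Rule] = [unwrap_nested_anchor, drop_empty_inline]
```

The renderer would then only ever see the output of fix_dom(tree, RULES), so
adding a workaround for a newly discovered kind of bad page means adding a
rule, not touching the rendering code.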
Thanks much for the data point.
Do you recall how they handled the special case of <b><a>...</b>...</a>?
Did they have special knowledge of <a>...</a> as special in its spanning
abilities, or was there a general rule? It was the only one I could think
of where you'd need to know something special about the semantics to resolve.
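
For concreteness, here is a minimal sketch of what a "general rule" for that
case could look like, working on structure alone: when a close tag arrives
for an element that is not innermost, close everything inside it, close it,
and reopen what was closed. The token format and the rebalance function are
assumptions made for this example, not any browser's actual algorithm.

```python
from typing import List, Tuple

# Assumed token shape: ("open", "b"), ("close", "b"), or ("text", "hello").
Token = Tuple[str, str]


def rebalance(tokens: List[Token]) -> List[Token]:
    """Return a properly nested token stream using no tag-specific knowledge."""
    out: List[Token] = []
    stack: List[str] = []  # currently open tags, innermost last
    for kind, value in tokens:
        if kind == "open":
            stack.append(value)
            out.append((kind, value))
        elif kind == "close":
            if value not in stack:
                continue  # stray close tag with no matching open: drop it
            reopen: List[str] = []
            # Close every element opened inside the one being closed.
            while stack[-1] != value:
                inner = stack.pop()
                out.append(("close", inner))
                reopen.append(inner)
            stack.pop()
            out.append(("close", value))
            # Reopen the interrupted elements in their original order.
            for inner in reversed(reopen):
                stack.append(inner)
                out.append(("open", inner))
        else:
            out.append((kind, value))
    # Close whatever is still open at end of input.
    while stack:
        out.append(("close", stack.pop()))
    return out
```

Fed <b><a>x</b>y</a>, this produces <b><a>x</a></b><a>y</a>. The catch is
visible right there: the single anchor has become two, and in a real document
the reopened <a> would need a copy of the original attributes. That is exactly
where knowledge of what <a> means starts to matter.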
One might argue that old-style <p> should have required some special knowledge,
too, but the answer seems to have been resolved in favor of not really fixing
the interpretation people meant (that is, not trying to find the other end of
the <p> but rather just treating some parts as "not in any <p>" and others as
"in ones they didn't expect to be in").