Re: Writing HTML parser wasn't as hard as I thought it'd be



Robert Uhl <eadmund42@xxxxxxxxxxxxxxx> writes:

mbstevens <NOXwebmasterx@xxxxxxxxxxxxxxx> writes:

You need to set up a switch 0-9 for how much crap code the parser will
accept.

Eh, 1-12 is better--then you have more sensible fractional settings
(1/6, 1/4, 1/3, 1/2, 2/3, 3/4, 5/6 vs 1/5, 2/5, 1/2, 3/5, 4/5). For a
simple bogosity switch, why not just accept-bogosity-p?

In the HTML parsers I've written, the problem isn't even writing a parser
for these things, it's the enormous and unwanted burden of resolving the
errors in a way that makes people happy.

The problem with HTML as a spec is indeed that it presupposes no
typos. And yet the design mistake set early on was to tolerate them.
If only Mosaic (the original browser) had just said
"Bad syntax in page. Can't display it."
I think it was Mosaic that started the trend of correcting errors, but
maybe it came later. But my point is that the early competition in
browsers was NOT in correctness, but in tolerance. And what resulted
was a "semantics" which is simply not described by the spec. And that
means "implementing it" does not mean "implementing the spec".

In later revs of the standards, I seem to recall that they finally
figured out that, for example, table layout was underspecified. I
recall implementing table layout for a web browser I wrote in Lisp
once [*], and the issue was that the constraint relaxation in
tables had choice points that if you didn't do right, people would
complain you'd done it wrong because it didn't display stuff like
Netscape and/or IE did. So finally they've started to explain how
those issues are to be resolved. That's both good and bad, though,
because what they did was entrench the Rightness of particular
commercial endeavors and not the Rightness of correct thought. This
isn't grumbling about winners and losers under capitalism; it's an
observation that this favors, indirectly, a particular
implementation strategy over others that might be faster, smaller,
more extensible, etc. Because those strategies might be perfectly
suitable for correct HTML, but if they can't handle junk, they are not
favored in spite of their other features.

So what makes HTML hard is that HTML does not _mean_ HTML. In the
end, HTML means Internet Explorer or Mozilla or something... and when
people say your browser works, they mean "it works like those". They
don't mean "it works like the spec". One is not rewarded for
succeeding, one is rewarded for deliberately not succeeding, and for
creating a system that encourages others to do likewise. Depending,
of course, on what you accept as the goal.

Modularizing the task into something that corrects bad HTML to good
and something that displays good HTML is probably the way to go.
Parsers for bad HTML don't have to know about HTML "meaning", just its
structure. Since I don't know of a public spec for what corrections
are required of browsers to not get yelled at, I don't know for sure,
but my sense is that the repair operations are not allowed to depend
on the high level semantics. I think they're mostly about how to
repair missing ">"'s or how to treat missing end tags or how to treat
errors in element attributes or how to treat misbalanced elements like
end tags in the wrong order. It's possible that the special treatment
of anchors, allowing them to span from the middle of one tree to the
middle of another, is a violation of what I said about not knowing the
high level semantics. But those are the kinds of things one needs
such a preprocessor to do.
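
A minimal sketch of such a preprocessor, in Python purely for
illustration (the post names no language, API, or repair rules, so the
names TAG, VOID, and repair are all hypothetical). It assumes one
plausible policy: a missing end tag is supplied, end tags in the wrong
order are re-emitted in nesting order, and a stray end tag is dropped.
It looks only at tag names, never at what an element means, and it does
not attempt harder repairs such as a missing ">". Real browsers resolve
these cases differently.

```python
import re

# Matches a start or end tag; group 1 is "/" for end tags, group 2 the name.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^<>]*>")
VOID = {"br", "hr", "img"}  # assumed subset of elements with no end tag


def repair(html):
    """Return html with end tags balanced, using structure only."""
    out = []    # pieces of the repaired document
    stack = []  # names of elements currently open
    pos = 0
    for m in TAG.finditer(html):
        out.append(html[pos:m.start()])  # text between tags passes through
        pos = m.end()
        closing = m.group(1) == "/"
        name = m.group(2).lower()
        if not closing:
            out.append(m.group(0))
            if name not in VOID:
                stack.append(name)
        elif name in stack:
            # Close everything opened after `name`, innermost first.
            while True:
                top = stack.pop()
                out.append("</%s>" % top)
                if top == name:
                    break
        # Otherwise: a stray end tag with no matching start tag is dropped.
    out.append(html[pos:])
    while stack:  # supply end tags that never appeared
        out.append("</%s>" % stack.pop())
    return "".join(out)


print(repair("<b><i>x</b></i>"))
print(repair("<ul><li>a"))
```

A display stage written against the output of repair() can then assume
balanced input and carry no error handling of its own.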

So I would think it's qualitative/discrete control one wants, not
numerical/fractional/percentage based. You either do or do not want
such fixups. You might want to enable/disable specific ones. But
saying a percentage is weird. It suggests a homogeneity to the
problem that there is not, and it suggests a canonical ordering to the
fixes that says the ones at one end are "must do" and the others are
"maybe" and that throttling up or down will hit them in the right
order. I don't see it. What I see in such systems is a laziness in
the design of the controls, or else a theory of the users' laziness so
cynical that I find myself wondering why they are allowed any control
at all.
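
To make the contrast concrete, here is a small Python sketch of
discrete, individually named controls. The fixup names, the Fixups
class, and the STRICT and LENIENT presets are hypothetical, invented
for illustration and not taken from any real parser. Each repair is
its own switch, so no ordering among them is implied, which is the
thing a 0-9 dial cannot express.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class Fixups:
    """One independent switch per repair; the names are illustrative only."""
    close_missing_end_tags: bool = True
    reorder_misnested_end_tags: bool = True
    drop_stray_end_tags: bool = True
    quote_bare_attributes: bool = True


LENIENT = Fixups()                            # every repair enabled
STRICT = Fixups(False, False, False, False)   # reject anything malformed


def enabled(fixups):
    """List the names of the repairs that are switched on."""
    return [f.name for f in fields(fixups) if getattr(fixups, f.name)]


# Enable exactly one repair on top of the strict preset.
custom = replace(STRICT, drop_stray_end_tags=True)
print(enabled(custom))
```

A single yes/no switch in the spirit of accept-bogosity-p corresponds
to choosing between the two presets, and anything finer is named
explicitly.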

I don't see how the strangeness of HTML handling can be improved by a
numeric slider for bogosity. Control should be designed for those who
have the presence of mind to be thoughtful, not brought on a platter
for those who don't know what they're doing to fiddle mindlessly.

It reminds me of what I'm always saying about why I don't like
Windows XP in its default configuration. The real controls are
papered over, showing you only a cartoon caricature. The analogy I
always use is that of a microwave that used to have settings for heat
and time replaced by a newer model that has only two buttons: popcorn
and steak. Yes, the controls are simpler. No, they are not more
useful.

- - - -

[*] The web browser I refer to is unrelated to the web server I wrote
for my own consulting company. The browser was something I wrote
at Harlequin, before it folded. Like many interesting things Harlequin
did internally, it was never productized. I don't know what ended up
happening to that code. It wasn't product-worthy, but it was functional
at a prototype level, capable of displaying early web pages with
fonting, tables, GIFs, etc. There was no scripting back then and the
HTML spec was smaller. (But a lot of the growth in the HTML spec has
been to clarify decisions everyone had to make anyway and to give users
access to them... the basic concepts didn't really change.)

Incidentally, I claim I was the only one (or perhaps one of a very
few) who really did "accidentally" author a web browser, a feat
remarked on in a Dilbert cartoon, with Ratbert commenting on
that. The browser was originally written as an HTML->PostScript
rendering tool, and when I got done I said "Hey, I bet if I changed
the back-end to use CLIM instead of PostScript, it would be a
browser." It turned out the only thing I had to add besides modular
display changes was support for anchors and there it was, without
even giving thought to wanting to build one in advance. A true
accident. I'd never have gotten the time allocated to build it
otherwise.

But it wasn't a tasked project, and so it didn't have anyone saying
it had to become finished. What it became was more of a vehicle for
debugging myriad CLIM bugs...

Ah well. Such is the fate of the best laid (non)plans of mice and men,
I guess... with due apologies to Ratbert and Dilbert for the ill-placed
metaphor.