Re: ANC: tmlrss.tcl - process RSS newsfeeds for tclhttpd



David Gravereaux wrote:
It does extra effort to
make sure it generates *legal* HTML such as ... fixing improper encoding errors.

Just for fun, I thought I'd explain this part, because I think it's such an
interesting problem. Many newsfeeds are themselves collections of other feeds
that come from all kinds of sources. Thus any errors become additive.

When one gets the feed, which is XML, over HTTP, the encoding is sometimes
declared in transit (the charset in the MIME Content-Type header) and sometimes
left to the XML parser (TDOM in my case). When TDOM's Expat parser reads the XML
declaration, it makes a large mistake by performing the translation named there.
As a performance enhancement, TDOM bypasses the Tcl_Obj interface and goes right
to the internal representation, so the declaration has to be removed. Expat
assumes utf-8 when there is no declaration, which in this case is correct. So
that leaves me to do [encoding convertto ...] manually and remove the
declaration before passing the data to TDOM. That is just fine by me, as Tcl is
very capable at encoding conversion.
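
For readers who don't speak Tcl, here is a rough Python sketch of that
"decode it yourself, then drop the declaration" step. It is not the code in
tmlrss.tcl; the function name prepare_feed and the http_charset parameter are
made up for illustration, and letting the HTTP charset win over the declared
one is an assumption.

```python
import re

# A leading XML declaration, plus any whitespace that follows it.
_DECL = re.compile(rb'^\s*<\?xml\b[^>]*\?>\s*')
# The encoding pseudo-attribute inside that declaration.
_ENC = re.compile(rb'encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')


def prepare_feed(raw, http_charset=None):
    """Decode raw feed bytes and return the text without its XML declaration.

    Hypothetical helper: http_charset stands for the charset taken from the
    Content-Type header, if the server sent one.
    """
    declared = None
    decl = _DECL.match(raw)
    if decl:
        enc = _ENC.search(decl.group(0))
        if enc:
            declared = enc.group(1).decode("ascii")
        # Drop the declaration so the parser does not convert a second time.
        raw = raw[decl.end():]
    # Assumption: the HTTP charset wins, then the declared one, then utf-8.
    charset = http_charset or declared or "utf-8"
    return raw.decode(charset, errors="replace")
```

The returned string carries no declaration, so a parser that assumes utf-8
when none is present has nothing left to mistranslate.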

Well, that was the first issue. The second was the big lies about the content in
the XML files. Ignoring that most early RSS formats can't describe what format
their <content> elements are really in, I found this bugger of a problem:

One of the common things I found was either entities or actual characters in the
range of &#130; through &#159; when the XML file itself claimed to be in
iso-8859-1 (or whatever it was after decoding). Characters in that range are not
defined for iso-8859-1. The problem is discussed at
http://www.cs.tut.fi/~jkorpela/www/windows-chars.html

So I assume those chars were meant to be in cp1252 and move them to their
correct Unicode representation. A great example is &#151;, which is supposed to
be \u2014 (em dash). I verified this with TDOM's html parser, which spits back
&mdash; when I ask for the document back as html. And life is good :)
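
Again as a Python sketch rather than the actual Tcl: one way to do that remap
is to treat any code point in the 128..159 range as the cp1252 byte of the same
value. The names fix_cp1252 and _cp1252_char are hypothetical, only decimal
references are handled (hex ones like &#x97; are left alone), and values that
cp1252 leaves unassigned are passed through untouched.

```python
import re

_NUM_ENTITY = re.compile(r'&#(\d+);')   # decimal character references
_C1_CHARS = re.compile('[\x80-\x9f]')   # literal characters in the same range


def _cp1252_char(code):
    """Return the character cp1252 assigns to byte `code`, else chr(code)."""
    if 0x80 <= code <= 0x9F:
        try:
            return bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            pass  # unassigned in cp1252 (0x81, for instance); leave it alone
    return chr(code)


def fix_cp1252(text):
    """Move characters and references in 128..159 to their cp1252 meaning."""
    # Literal characters, e.g. '\x97' becomes '\u2014'.
    text = _C1_CHARS.sub(lambda m: _cp1252_char(ord(m.group(0))), text)

    # Decimal references, e.g. '&#151;' becomes '\u2014'.
    def replace_entity(match):
        code = int(match.group(1))
        if 0x80 <= code <= 0x9F:
            ch = _cp1252_char(code)
            if ord(ch) != code:
                return ch
        return match.group(0)  # outside the range, or unassigned: keep as is

    return _NUM_ENTITY.sub(replace_entity, text)
```

So fix_cp1252('a&#151;b') gives 'a\u2014b', while a reference such as &#233;
is left exactly as it was.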

Well, that's my story.. What a mess the world is in.

--
Why waste time learning, when ignorance is instantaneous?
-- Calvin
