Re: tDOM doesn't support encoding='ASCII'?



Neil Madden wrote:
I think the point as far as tdom is concerned is that if it came through
a Tcl channel (and if not, where else did it come from?) then Tcl will
have already converted it to UTF-8 on the way in (unless you
specifically asked for binary encoding), so any XML encoding declaration
is most probably wrong at this stage. So tDOM is absolutely doing the
right thing here in requiring you to remove any erroneous xml encoding
declaration. The file may have started off as ASCII (or some other
encoding), but when tdom sees it it is almost certainly UTF-8.

So, if you know what encoding your files are in then [fconfigure
-encoding] the channel when you read the file and then strip the xml
declaration (or remove the encoding part anyway). If you don't know the
encoding then use tDOM::xmlReadFile which will do the right thing in
terms of figuring out the correct encoding to use (following the XML specs).
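For the "known encoding" case described above, a minimal sketch might look like the following. The proc names stripXmlEncoding and readXmlKnownEncoding are hypothetical helpers made up for illustration, not part of tDOM, and the regular expression assumes a well-formed XML declaration in which version comes first and encoding second.

```tcl
# Hypothetical helper (not part of tDOM): remove the encoding
# pseudo-attribute from an XML declaration at the start of the data.
# Assumes the declaration has the form <?xml version="..." encoding="..." ...?>
proc stripXmlEncoding {xml} {
    regsub {^(<\?xml\s+version\s*=\s*("[^"]*"|'[^']*'))\s+encoding\s*=\s*("[^"]*"|'[^']*')} \
        $xml {\1} xml
    return $xml
}

# Hypothetical helper: read a file whose encoding is known in advance.
# The channel converts the bytes on the way in, so the declared
# encoding no longer describes the string Tcl hands to the parser.
proc readXmlKnownEncoding {path encoding} {
    set fd [open $path r]
    fconfigure $fd -encoding $encoding
    set xml [read $fd]
    close $fd
    return [stripXmlEncoding $xml]
}

# Usage with tDOM loaded (file name and encoding are placeholders):
#   package require tdom
#   set doc [dom parse [readXmlKnownEncoding data.xml iso8859-1]]
# If the encoding is not known, let tDOM work it out instead:
#   set doc [dom parse [tDOM::xmlReadFile data.xml]]
```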

Neil sums it up pretty well. That is much of the rationale for why I
implemented it the way it is. And I don't think that's "bizarre" at
all.

There are even more details, fine points and considerations (even
down-to-earth ones, like the historical reasons for how the interface
evolved). But going into them would probably confuse the topic even more.

It has worked the way it works now for years (at least around 5 years),
and that's the result of a lot of musing and tinkering.

The fact seems to be that this topic comes up on and off. One problem
is that some things add confusion which are not really tDOM's
business. Examples of this:

Tcl channels know nothing about BOMs. They do not interpret them;
they just hand them through.
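As an illustration of the BOM point, here is a small sketch of dropping a leading byte order mark from data that has already been read through a channel. The proc name stripBom is a hypothetical helper, and the sketch assumes the channel decoded the BOM into the single character U+FEFF at the start of the string.

```tcl
# Hypothetical helper: drop a leading U+FEFF that the channel handed
# through unchanged. Data without a BOM is returned as it is.
proc stripBom {data} {
    if {[string index $data 0] eq "\uFEFF"} {
        return [string range $data 1 end]
    }
    return $data
}
```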

In some areas (no offence, folks, but AOLserver people seem to be
notorious here) there still seem to be pre-8.1 binary extensions in
use. This means that the parser sees some Tcl_Obj string reps
which are in fact not in UTF-8.

And there are others. Not to mention that the topic rears its head
again if you want to write an XML serialization with an XML
declaration that includes encoding info.

But in the end, tDOM is a tool for _programmers_. A Tcl programmer
must have a basic understanding of how Tcl handles i18n (or he will
run into problems in the long run). If a programmer has to handle some
data format (and XML is nothing else), he must have a basic
understanding of that format. In the case of XML, one essential point
is how XML handles i18n. Up to now, I haven't been able to come up with
a rmmadwim solution for the problem we are discussing here.

All that said, there's always room for improvement, and I'm open to
suggestions. But don't expect to hit the nail on the head after 30
seconds of thinking.

rolf
