Re: utf-8/unicode encoding confusion



On Jun 8, 7:34 pm, vitic <vit...@xxxxxxxxx> wrote:
I have a little custom web server that needs to support incoming/
outgoing UTF-8 data.

According to documentation, TCL is UTF-8 internally.
So, in order to receive the UTF-8 data, I need to do [fconfigure
$insock -encoding utf-8] ?

Yes.

Not so, TCL is already utf-8, no need to convert, even if I do it,
nothing happens, the data is wrong.

That's because the default encoding on the socket *isn't* UTF-8,
so the incoming data gets converted from something else. Just because
Tcl keeps the characters as UTF-8 internally doesn't mean that they
won't get converted in the I/O subsystem.
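A minimal sketch of that point, assuming Tcl 8.6 or later (for [file tempfile]) and using a temporary file as a stand-in for the socket. The helper name readAs is made up for the illustration. The same five bytes come back as four characters or as five, depending only on the channel's -encoding:

```tcl
# Put the UTF-8 bytes for "café" (63 61 66 C3 A9) into a temp file, untranslated.
set f [file tempfile path]
fconfigure $f -translation binary
puts -nonewline $f [binary format H* 636166c3a9]
close $f

# Hypothetical helper: read the whole file through a channel set to $enc.
proc readAs {path enc} {
    set f [open $path r]
    fconfigure $f -encoding $enc -translation lf
    set data [read $f]
    close $f
    return $data
}

set good [readAs $path utf-8]       ;# decoded as UTF-8
set bad  [readAs $path iso8859-1]   ;# every byte taken as one character
file delete $path

puts [string length $good]   ;# 4
puts [string length $bad]    ;# 5
```

On a real socket the only line that matters is the [fconfigure ... -encoding utf-8]; the rest is scaffolding for the demonstration.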

How about [fconfigure -encoding binary] ?
Same thing, nothing happens, the data is wrong.

OK, -encoding binary is more or less the same thing as
-encoding iso8859-1. So we've ascertained that your data aren't
ISO8859-1. No surprise.

What if I do [ encoding convertfrom utf-8 $mydata ] ?
What do you know, it works. Why? I have no idea. Convert from utf-8
into utf-8 sounds weird.

OK, here's what happened. There was UTF-8 on the socket. You
read it in -encoding binary (more-or-less the same thing as
-encoding iso8859-1), and got a stream of bytes (that Tcl converted
to utf-8 as if they were iso8859-1 already). Now you took that
stream of bytes and converted it from utf-8, yielding the original
data. You'd have got the same result with '-encoding utf-8' on
the socket to begin with.
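The sequence can be reproduced without a socket at all. This is only a sketch; the variable names are illustrative:

```tcl
# The UTF-8 bytes for the single character U+00E9, as they would sit on the wire.
set wire [binary format H* c3a9]

# Read with -encoding binary, each byte shows up as one "character".
puts [string length $wire]                    ;# 2

# Decoding those bytes as UTF-8 recovers the one original character.
set text [encoding convertfrom utf-8 $wire]
puts [string length $text]                    ;# 1
puts [expr {$text eq "\u00e9"}]               ;# 1
```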

I have to do the same trick [ encoding convertto utf-8 $mydata ] for
outgoing data.
That's the only thing that works. [ fconfigure $outsock -encoding
utf-8 ] does not work. Maybe that setting is for input only? I don't
know.

What are you writing to the socket? If it's data that shows up
correctly encoded in Tcl, then [fconfigure $outsock -encoding utf-8]
is the correct thing. If it's stuff that you've already encoded
(via [encoding convertto], or from reading utf-8 data with -encoding
binary), then it's already encoded for a binary socket, and will
of course come out wrong.
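Here is a sketch of the three combinations, again with a temporary file standing in for the socket (assumes Tcl 8.6 or later; bytesWritten is a made-up helper, not part of Tcl). It returns the bytes that actually left the channel, as hex:

```tcl
# Hypothetical helper: write $data through a channel configured with $opts
# and report what landed in the file.
proc bytesWritten {data opts} {
    set f [file tempfile path]
    fconfigure $f {*}$opts
    puts -nonewline $f $data
    close $f
    set f [open $path rb]
    set raw [read $f]
    close $f
    file delete $path
    binary scan $raw H* hex
    return $hex
}

set text "\u00e9"                           ;# a Tcl string, one character
set encoded [encoding convertto utf-8 $text]  ;# already-encoded bytes

puts [bytesWritten $text    {-encoding utf-8}]      ;# c3a9      right
puts [bytesWritten $encoded {-translation binary}]  ;# c3a9      right
puts [bytesWritten $encoded {-encoding utf-8}]      ;# c383c2a9  encoded twice
```

The first two lines are the two legitimate ways of getting UTF-8 onto the wire. The third is the mix of both that produces garbage.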

Plus, converting the encodings inline while sending data out is a bad
idea anyway because it changes the length/size of the data.

It changes the size in bytes. It doesn't change the size in characters
in the correct encoding. If your application needs to know the size
in bytes, you need to give it the correct size in bytes, of course.
If there are no NUL characters in the stream, [string bytelength] will
also get the size.
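If the byte count is needed (for a Content-Length header, say; that use is my assumption about your server), one way to get it that doesn't depend on the internal representation is to encode explicitly and measure the result. A small sketch:

```tcl
set body "na\u00efve caf\u00e9"

set chars [string length $body]                             ;# characters
set bytes [string length [encoding convertto utf-8 $body]]  ;# bytes once encoded as UTF-8

puts "$chars characters, $bytes bytes"   ;# 10 characters, 12 bytes
```

The two accented characters take two bytes each in UTF-8, which accounts for the difference.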

Additional complications arise if the data is converted into the same
encoding two or more times which creates garbage data. That shouldn't
happen. It doesn't happen to ASCII data.

ASCII data go into and out of UTF-8 unchanged. Only characters
beyond 0x7f change representation in the conversion.
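A quick way to see this, sketched with a made-up helper named utf8len that counts the bytes produced by [encoding convertto utf-8]:

```tcl
# Hypothetical helper: number of bytes $s occupies after one UTF-8 encoding pass.
proc utf8len {s} {
    return [string length [encoding convertto utf-8 $s]]
}

set ascii    "abc"
set accented "\u00e9\u00e8\u00ea"   ;# three characters above 0x7f

# ASCII survives any number of passes unchanged.
puts [utf8len $ascii]                                  ;# 3
puts [utf8len [encoding convertto utf-8 $ascii]]       ;# 3

# Each extra pass re-encodes the bytes of the previous one and the data grows.
puts [utf8len $accented]                               ;# 6
puts [utf8len [encoding convertto utf-8 $accented]]    ;# 12
```

That is why a double-conversion bug goes unnoticed until the first non-ASCII character arrives.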

I can't test/find out if the data is unicode/utf-8 already. I can't
test/find out what encoding the data is in. I don't even know if that
is possible at all.

You *did* find out. You read it with -encoding binary, and then
did [encoding convertfrom utf-8]. That worked. It's UTF-8.
If you [fconfigure $socket -encoding utf-8], it'll read correctly.
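If you want to automate that experiment, here is a rough heuristic, not a real detector. The helper looksLikeUtf8 is made up for this sketch. It decodes the bytes as UTF-8, re-encodes them, and checks that the original bytes come back; the [catch] is there on the assumption that some Tcl versions raise an error on malformed input instead of passing it through:

```tcl
# Hypothetical helper: 1 if $bytes survives a UTF-8 decode/encode round trip, else 0.
proc looksLikeUtf8 {bytes} {
    if {[catch {encoding convertfrom utf-8 $bytes} text]} {
        return 0   ;# a strict decoder rejected the input
    }
    return [expr {[encoding convertto utf-8 $text] eq $bytes}]
}

set utf8   [encoding convertto utf-8     "caf\u00e9 au lait"]
set latin1 [encoding convertto iso8859-1 "caf\u00e9 au lait"]

puts [looksLikeUtf8 $utf8]     ;# 1
puts [looksLikeUtf8 $latin1]   ;# 0
```

A result of 1 only says the bytes are consistent with UTF-8. Pure ASCII passes for almost any encoding, so this can't tell you what the sender actually meant.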

Does anybody have experience/knowledge about this? Is there anything that
can be done to test what encoding the data is in? What's the safest/
best way to use different encodings within Tcl?

The best way is usually to [fconfigure] your channels to the
correct encoding, and do nothing else with the strings. Then
everything Just Works.

Also, I think Tcl needs a [ encoding_convert -from cp1251 -to utf-8
$mydata ] command.

I think you're misunderstanding encodings.

We say that "Tcl uses UTF-8 internally" primarily for C programmers,
who need to know when they're writing Tcl extensions. At the
level of Tcl scripts, a more accurate statement would be:

Tcl uses some encoding internally, and a Tcl script should not
need to know what the internal encoding is, because it's no
business of scripts. But we promise that any Unicode (well, UCS-2)
character can be represented in it. When you're doing I/O, you
should do it using a channel that's [fconfigure]d to use the
encoding that the data are in. For files, we'll give you a
default of the system encoding, which is almost always the
right one for text files to interoperate with other apps
on your machine.

If you're dealing with binary data, or with data that have mixed
encodings, or with data where you need to count octets (as opposed
to characters), you can [fconfigure] the channel to -encoding binary.
In that case, if there are character strings embedded within the
binary data, you can extract them and use [encoding convertfrom]
to convert from the external encoding to the one Tcl uses. Similarly,
if you need to construct character strings for embedding within
binary data, you can use [encoding convertto] to convert them
from Tcl's internal encoding (which is no business of yours) to
bytes that can go on a binary channel.
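As a sketch of that binary case, assume a record format invented for this example: a 2-byte big-endian length followed by that many bytes of UTF-8 text. The procs packString and unpackString are hypothetical names:

```tcl
# Build a record: encode the text, then prefix it with its length in bytes.
proc packString {text} {
    set payload [encoding convertto utf-8 $text]
    return [binary format Sa* [string length $payload] $payload]
}

# Parse a record: read the length, pull out that many bytes, decode them.
proc unpackString {record} {
    binary scan $record Su len
    binary scan $record @2a$len payload
    return [encoding convertfrom utf-8 $payload]
}

set record [packString "caf\u00e9"]
puts [string length $record]                       ;# 7 (2 length bytes + 5 payload bytes)
puts [expr {[unpackString $record] eq "caf\u00e9"}]  ;# 1
```

The record itself would travel over a channel configured with -translation binary; the only places an encoding is named are the two conversion calls.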

And that *really* is all that you should be doing with encodings.
Anything else is either a bug or wasted effort.

And don't *ever* change [encoding system]. That's never right.
Don't even think of it. (If you do need to change [encoding system],
there's something wrong outside your application. Fix it.)

--
73 de ke9tv/2, Kevin