Re: Unicode support

From: Victor Wagner (vitus_at_45.free.net)
Date: 09/13/04


Date: Mon, 13 Sep 2004 04:48:39 +0000 (UTC)

Georgios Petasis <petasis@iit.demokritos.gr> wrote:

> You can also do the same trick in Tcl with encoding convertfrom, but
> I don't think this is a "better" approach, not for tcl nor

I do it now, and I don't think it is a good approach either.
I have to wrap every text constant that appears in my app
in a procedure call. It looks like

[rus "some text in Russian"]

where

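proc rus {string} {
        # Re-encode with the system encoding to recover the raw bytes of the
        # script file, i.e. undo the decoding Tcl applied when sourcing it.
        set bytes [encoding convertto [encoding system] $string]
        # Decode those bytes with the encoding the script was really saved in
        # (myScriptEncoding is a placeholder for that encoding's name).
        return [encoding convertfrom myScriptEncoding $bytes]
}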

I.e. I convert the string back to the form it had in the script file,
reversing what Tcl did upon loading, and then convert it correctly into
utf-8.

Really, the matter is more complicated, because if encoding system happens
to be utf-8, the [encoding convertto] step should be omitted. So, actually,
I have a conditional definition of the procedure.

The second problem is that if I want, for some reason, to change encoding
system after loading the script, the procedure breaks.

So I have to embed the actual value of [encoding system] into the procedure
body at the time of definition, which assures me at least that the encoding
used to translate the string is the same as the one used when the script was
loaded.
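
A minimal sketch of that "embed at definition time" idea, assuming the script
was saved in koi8-r. The helper name makeDecoder and the @LOAD@/@SCRIPT@
placeholders are made up for illustration; only [encoding], [string map] and
[proc] are real Tcl. It does not handle the utf-8 special case mentioned
above.

```tcl
# makeDecoder is a hypothetical helper, not part of Tcl.
# It defines a procedure $name whose body contains the two encoding names
# as literals, so a later [encoding system newValue] cannot affect it.
proc makeDecoder {name loadEncoding scriptEncoding} {
    if {$loadEncoding eq $scriptEncoding} {
        # Tcl already decoded the script correctly; nothing to undo.
        proc $name {text} { return $text }
        return
    }
    # [list ...] quotes each name so it is substituted as a single word.
    set body [string map [list @LOAD@ [list $loadEncoding] \
                               @SCRIPT@ [list $scriptEncoding]] {
        set bytes [encoding convertto @LOAD@ $text]
        return [encoding convertfrom @SCRIPT@ $bytes]
    }]
    proc $name {text} $body
}

# Capture the system encoding in effect while the script is being loaded.
# koi8-r is an assumption about how the script file was saved.
makeDecoder rus [encoding system] koi8-r
```

After this, [rus "some text in Russian"] keeps using the encoding that was
current when the script was sourced, whatever [encoding system] is set to
later.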

> for perl, as you are changing a global property of the interpreter/file.
> What happens if you load a pre-5.8 package from inside this script?

It would be interpreted in the system (locale-defined) encoding, because no
explicit encoding is set.

> Let alone the case where the encoding is not found (imagine starkits
> or the perl equivalent). The whole application crashes for no reason
> actually...

Why "no reason"? It is a pretty good reason to crash, provided it can
produce an understandable message like "Your installation is corrupt, very
important file encoding.something is not found", because a missing
encoding means that the program could not produce any message readable
by the user.

Note that the end user of an application typically doesn't know English.
There are roughly as many Cyrillic-reading people as there are English
speakers, and many more Chinese and Indian people.

I can imagine that Polish or Czech people would be able to read messages
where all non-ASCII letters were replaced by question marks, but I
doubt they would like it. In Polish, I think, approximately
one-third of the letters carry some diacritic mark.

> The best approach is to ensure that your script gets loaded correctly
> *whatever* encoding the interpreter is using.
> So, you never use characters not in ASCII. If you must, always use the
> \u???? notation, or if you want to make the code more readable, use message
> catalogs.

It is impossible. Maybe I would be able to remember the Unicode codes
for all Cyrillic letters, since there are only about forty of them
(including Ukrainian), but how do you expect Chinese people to remember
the codes for thousands of characters?

> However, for existing applications (written years ago where the original
> coders are not around to check things) the safest way is to use the \u????
> notation. It's amazingly easy to write a small script that translates all
> characters not in ASCII to the proper \u???? notation (based on the encoding
> the application was originally written for) and filter the whole
> application through it...

But if I have to run my code through some translator, which obfuscates
it completely, why not use C++ or some other compiled language?
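
For reference, the kind of filter being discussed might look roughly like
the sketch below. The name toUnicodeEscapes is made up for illustration, and
the \U form for code points above U+FFFF assumes a Tcl version whose parser
accepts it (8.6 or later). It also shows the objection: every Cyrillic
string comes out as a row of hex codes.

```tcl
# toUnicodeEscapes is a hypothetical name, not an existing Tcl command.
# It returns $text with every non-ASCII character replaced by a backslash
# escape, leaving ASCII characters untouched.
proc toUnicodeEscapes {text} {
    set out ""
    foreach ch [split $text ""] {
        # %c yields the character's code point as an integer.
        scan $ch %c code
        if {$code < 128} {
            append out $ch
        } elseif {$code <= 0xFFFF} {
            append out [format {\u%04x} $code]
        } else {
            # Assumption: the target Tcl understands \Uxxxxxxxx.
            append out [format {\U%08x} $code]
        }
    }
    return $out
}
```

To filter a real file, it would have to be read through a channel
configured with the encoding the file was actually written in, for example
with fconfigure -encoding, before being passed to the procedure.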

-- 
Sex dumps core
(Sex is a Simple editor for X11)
	-- Seen on debian bugtracking

