Re: Supporting full Unicode

From: Ludovic Brenta (ludovic.brenta_at_insalien.org)
Date: 05/12/04


Date: 12 May 2004 10:57:25 GMT


Bjorn Persson wrote:
> David Starner wrote:
>> they should have defined Wide_Character to be UTF-16 like Java did.
>
> Keeping in mind that in UTF-16 some characters take two bytes and
> others take four, how do you propose to define that type?

It is true that variable-width encodings such as UTF-16 or UTF-8 are
more difficult to handle than fixed-width encodings like UCS-2 or
UCS-4. Basically, if you want to do advanced processing of character
data, you may find it easier to first transcode it to UCS-4
(i.e. Wide_Wide_Character, 32 bits wide).
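
To make the width difference concrete, here is a rough sketch in Python (not Ada; the sample characters and the helper names encoded_widths and to_code_points are my own, purely for illustration). It shows how many bytes one character occupies in each encoding, and what "transcode to UCS-4 first" amounts to: turning the byte stream into a plain list of code points that can be indexed directly.

```python
def encoded_widths(ch):
    """Return the byte length of one character in (UTF-8, UTF-16, UCS-4)."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),  # "-be" variant: no byte-order mark added
        len(ch.encode("utf-32-be")),  # UTF-32 stands in for UCS-4 here
    )


def to_code_points(utf8_bytes):
    """Decode UTF-8 bytes into a list of integer code points (fixed width)."""
    return [ord(ch) for ch in utf8_bytes.decode("utf-8")]


# Hypothetical sample: A, e-acute, euro sign, and a character outside the BMP.
for ch in "A\u00e9\u20ac\U0001D11E":
    print(f"U+{ord(ch):04X}", encoded_widths(ch))
```

Once the data is a list of code points, the n-th character is simply element n; with UTF-8 or UTF-16 you would have to scan from the start to find it.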

But UTF-8 is gaining momentum. Originally intended as an external
encoding only, it is now in use as an internal encoding, too. I
suppose that it turned out that processing UTF-8 directly is not that
difficult after all. This is especially true if all you want to do is
localisation of software using gettext; in this case, you can use
UTF-8 as both your internal and external encoding without any trouble.
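
As a rough illustration of why direct processing is manageable, here is a small Python sketch (the function names and sample strings are hypothetical, and it assumes the input is already valid UTF-8). Continuation bytes in UTF-8 always have the bit pattern 10xxxxxx, so characters can be counted without decoding, and a plain byte-level search for a valid UTF-8 substring cannot match in the middle of a character.

```python
def count_code_points(data: bytes) -> int:
    """Count characters by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def find_code_point_index(haystack: bytes, needle: bytes) -> int:
    """Byte-level search; return the match position in characters, or -1."""
    pos = haystack.find(needle)
    if pos < 0:
        return -1
    # Convert the byte offset into a character offset.
    return count_code_points(haystack[:pos])


# Hypothetical sample data, kept as UTF-8 bytes throughout.
sample = "na\u00efve caf\u00e9 \u20ac5".encode("utf-8")
print(len(sample), count_code_points(sample))
print(find_code_point_index(sample, "caf\u00e9".encode("utf-8")))
```

Neither function ever builds a UCS-4 copy of the string; both work on the UTF-8 bytes as they are.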

The Perl regular expression engine, for example, supports UTF-8
strings directly. I don't know if it transcodes to UCS-4 internally.

-- 
Ludovic Brenta.

