UTF-8 in strings - a bug?

From: Björn Persson (spam-away_at_nowhere.nil)
Date: 05/06/04


Date: Wed, 05 May 2004 22:12:03 GMT

The reference manual says:

3.5.2(2): The predefined type Character is a character type whose values
correspond to the 256 code positions of Row 00 (also known as Latin-1)
of the ISO 10646 Basic Multilingual Plane (BMP).

3.6.3(4): type String is array(Positive range <>) of Character;

It seems clear to me: Strings are Latin-1 (except for programs compiled
in nonstandard modes). But when I set my Fedora system to use UTF-8, the
strings I get from Ada.Command_Line.Argument contain UTF-8. This means
that some of the elements in the string aren't characters, only byte
values that are parts of multi-byte characters. And of course 'Length
returns the number of bytes, not the number of characters. This looks
like a violation of the standard. Should I consider this a bug in the
library? Or in the compiler (GNAT (GCC) 3.3.2 and 3.4.0)?
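
To make the difference between bytes and characters concrete, here is a
minimal sketch. The procedure name Count_UTF8 and the helper UTF8_Length
are made up for illustration and are not part of any library. The helper
assumes that the argument really is well-formed UTF-8 and simply skips
continuation bytes (16#80# .. 16#BF#); it does no validation.

```ada
with Ada.Text_IO;
with Ada.Command_Line;

procedure Count_UTF8 is

   --  Hypothetical helper, not a library routine: counts the UTF-8
   --  encoded characters in S by skipping continuation bytes
   --  (bit pattern 10xxxxxx, i.e. 16#80# .. 16#BF#).
   --  Assumption: S holds well-formed UTF-8. Nothing is validated.
   function UTF8_Length (S : String) return Natural is
      Count : Natural := 0;
   begin
      for I in S'Range loop
         if Character'Pos (S (I)) not in 16#80# .. 16#BF# then
            Count := Count + 1;
         end if;
      end loop;
      return Count;
   end UTF8_Length;

begin
   --  For each command-line argument, print the element count that
   --  'Length reports next to the count of UTF-8 encoded characters.
   for N in 1 .. Ada.Command_Line.Argument_Count loop
      declare
         Arg : constant String := Ada.Command_Line.Argument (N);
      begin
         Ada.Text_IO.Put_Line
           ("bytes:" & Natural'Image (Arg'Length)
            & "  characters:" & Natural'Image (UTF8_Length (Arg)));
      end;
   end loop;
end Count_UTF8;
```

If the argument "Björn" arrives as UTF-8, the "ö" takes two bytes, so
'Length gives 6 while the helper counts 5. If the same argument arrives
as Latin-1, both numbers are 5.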

-- 
Björn Persson
jor ers @sv ge.
b n_p son eri nu

