UTF-8 in strings - a bug?
From: Björn Persson (spam-away_at_nowhere.nil)
Date: 05/06/04
- Next message: Samuel Hawk: "19$ get paid to take online surveys"
- Previous message: Ludovic Brenta: "Re: Named Pipes"
- Next in thread: Robert I. Eachus: "Re: UTF-8 in strings - a bug?"
- Reply: Robert I. Eachus: "Re: UTF-8 in strings - a bug?"
- Reply: David Starner: "Re: UTF-8 in strings - a bug?"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Wed, 05 May 2004 22:12:03 GMT
The reference manual says:

3.5.2(2): The predefined type Character is a character type whose values
correspond to the 256 code positions of Row 00 (also known as Latin-1)
of the ISO 10646 Basic Multilingual Plane (BMP).

3.6.3(4): type String is array(Positive range <>) of Character;

This seems clear to me: a String holds Latin-1 characters (except in
programs compiled in nonstandard modes). But when I set my Fedora system
to use UTF-8, the strings I get from Ada.Command_Line.Argument contain
UTF-8. This means that some of the elements in the string aren't
characters at all, only byte values that are parts of multi-byte
characters. And of course 'Length returns the number of bytes, not the
number of characters. This looks like a violation of the standard.
Should I consider this a bug in the library? Or in the compiler
(GNAT (GCC) 3.3.2 and 3.4.0)?
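
To make the difference between bytes and characters concrete, here is a
minimal sketch that prints both counts for each command-line argument.
UTF8_Length is a hypothetical helper written for this illustration, not
something from the standard library, and it assumes the argument really
is well-formed UTF-8: it simply skips continuation bytes (16#80# ..
16#BF#) and counts everything else.

```ada
with Ada.Text_IO;
with Ada.Command_Line;

procedure Count_Chars is

   --  Hypothetical helper: counts UTF-8 code points in S by counting
   --  every byte that is not a continuation byte (bit pattern 10xxxxxx).
   --  Assumes S holds well-formed UTF-8; no validation is done.
   function UTF8_Length (S : String) return Natural is
      Count : Natural := 0;
      Code  : Natural;
   begin
      for I in S'Range loop
         Code := Character'Pos (S (I));
         if Code < 16#80# or else Code > 16#BF# then
            Count := Count + 1;
         end if;
      end loop;
      return Count;
   end UTF8_Length;

begin
   for N in 1 .. Ada.Command_Line.Argument_Count loop
      declare
         Arg : constant String := Ada.Command_Line.Argument (N);
      begin
         --  'Length counts array elements, which here are bytes.
         Ada.Text_IO.Put_Line
           ("bytes:" & Natural'Image (Arg'Length)
            & "  code points:" & Natural'Image (UTF8_Length (Arg)));
      end;
   end loop;
end Count_Chars;
```

In a UTF-8 locale, an argument such as "Björn" should be reported as 6
bytes but 5 code points, because "ö" is encoded as two bytes.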
-- Björn Persson jor ers @sv ge. b n_p son eri nu