Re: Defacto standard string library
- From: user923005 <dcorbit@xxxxxxxxx>
- Date: Fri, 2 Jan 2009 12:23:09 -0800 (PST)
On Jan 1, 1:30 pm, Keith Thompson <ks...@xxxxxxx> wrote:
Phil Carmody <thefatphil_demun...@xxxxxxxxxxx> writes:
"christian.bau" <christian....@xxxxxxxxxxxxxxxxxx> writes:
On Dec 29, 4:59 pm, Jon Harrop <j...@xxxxxxxxxxxxxxxxx> wrote:
I'd like to manipulate strings from C code that may contain null characters
and may need unicode support. Is there a defacto standard string library
for C (particularly under Linux) that satisfies these needs?
You could use a method that is common in Java implementations: Use
Unicode, encoded in UTF8 format, except that a zero byte is
represented as a two byte sequence 0xc0 0x80 (which is against UTF8
rules, so this is not UTF8). Then append a zero byte at the end to
make it a C string. Standard C string functions will be fine with this
format.
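For anyone who wants to see the encoding step described above, here is a minimal sketch. It assumes the input is already UTF-8 held in a buffer with an explicit length (so it may contain zero bytes); the function name mutf8_from_bytes is made up for illustration and is not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy `len` bytes of UTF-8 from `src` into a newly
 * allocated C string, writing every embedded zero byte as the two-byte
 * sequence 0xC0 0x80. Returns NULL if allocation fails; the caller frees. */
char *mutf8_from_bytes(const char *src, size_t len)
{
    assert(src != NULL || len == 0);

    /* Each zero byte grows by one byte in the output. */
    size_t zeros = 0;
    for (size_t i = 0; i < len; i++)
        if (src[i] == '\0')
            zeros++;

    char *dst = malloc(len + zeros + 1);   /* +1 for the terminator */
    if (dst == NULL)
        return NULL;

    size_t j = 0;
    for (size_t i = 0; i < len; i++) {
        if (src[i] == '\0') {
            dst[j++] = (char)0xC0;
            dst[j++] = (char)0x80;
        } else {
            dst[j++] = src[i];
        }
    }
    dst[j] = '\0';                         /* the only real zero byte */
    return dst;
}
```

With this, mutf8_from_bytes("a\0b", 3) yields the four bytes 'a', 0xC0, 0x80, 'b' followed by the terminator, so strlen() on the result returns 4. Decoding back to real zero bytes would need the reverse substitution, which is not shown here.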
Some background, in case not everyone knows this. Unicode is a
character set that includes characters from multiple languages; as a
result, it cannot be encoded using a single byte per character, unless
a byte is at least (I think) 21 bits. UTF8 is an encoding for Unicode
using 8-bit bytes; some Unicode characters are encoded in a single
byte, others require multiple bytes. (It's likely I've gotten some
details of the terminology wrong.)
Concatenating looks like it will fail horribly.
Why? As far as I can see, it would work just fine. Do you have an
example where it would fail?
Escape character sequences.
The length
of the (semantically) empty string is no longer zero.
Why? The empty string would still be represented as a sequence of a
single byte with the value 0 (the terminating null character);
strlen() would return 0 for this string.
The
standard C str* functions seem to not have much more use
than the standard C mem* functions, to be honest.
Again, why? You'd just have to remember that strlen() gives you the
length *in bytes*, not the length in Unicode characters. This is just
what you want if you want to know, for example, how much memory to
allocate for a given string. (You could have other, somewhat more
expensive functions for things like the logical length in characters.)
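A minimal sketch of such a "logical length" function, assuming the input is valid UTF-8 (or the 0xC0 0x80 variant discussed above). The name utf8_count_code_points is made up for illustration. It counts code points, which is not the same thing as display width or user-perceived characters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count Unicode code points in a UTF-8 C string by
 * counting every byte that is NOT a continuation byte (10xxxxxx).
 * Assumes the input is well-formed; no validation is attempted. */
size_t utf8_count_code_points(const char *s)
{
    assert(s != NULL);

    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++)
        if ((*p & 0xC0) != 0x80)   /* lead byte or plain ASCII */
            n++;
    return n;
}
```

For the six-byte string "na\xC3\xAFve" (naive with a diaeresis on the i), strlen() reports 6 while this function reports 5. It costs a full pass over the string, same as strlen(), but with an extra test per byte.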
Sometimes you want to know things like "What is the display length?".
You may also want to create a table to store the answer in a
database. The database notion of length will be in Unicode
characters.
I can't imagine any standard 8-bit character function working properly
even on UTF-8 encoded Unicode.
You won't get the length in characters, you won't be able to reliably
locate a character or substring, concatenation will be unreliable,
etc.
They don't even have the right interface (e.g. you can't search for
the Chinese character for boat {8 mouths in a box} with strchr).
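To make the interface point concrete: strchr() takes a single int that is converted to char, so a multi-byte character cannot be passed to it. One workaround, sketched below, is to search for the character's encoded byte sequence with strstr(). The helper name utf8_find is made up, and the bytes 0xE8 0x88 0xB9 are the UTF-8 encoding of U+8239, which I am assuming is the boat character meant here. The sketch assumes both strings are valid UTF-8 in the same normalization form; it does nothing about normalization, case folding or collation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the BYTE offset of the first occurrence of
 * the UTF-8 encoded sequence `needle` in `haystack`, or -1 if absent.
 * In valid UTF-8 a lead byte never looks like a continuation byte, so a
 * match on a complete encoded character cannot begin mid-character. */
long utf8_find(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);

    const char *hit = strstr(haystack, needle);
    return hit != NULL ? (long)(hit - haystack) : -1L;
}
```

For example, utf8_find("x\xE8\x88\xB9y", "\xE8\x88\xB9") returns 1. Note the result is a byte offset, not a character index, which is exactly the kind of mismatch being complained about.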
If you want to do Unicode, I think it is a terrible mistake to try to
manipulate it as ordinary C strings {a single UTF-8 encoded character
can comprise up to 4 bytes, or up to 6 under the original UTF-8 design}.
The free open source IBM library (ICU) will do it without any difficulty and
is fully debugged. Considering the tens of thousands of lines of code
involved, I guess that simplified efforts to reproduce the correct
behavior will end in tragedy.
It is true that there are some things that could be done using the
standard C string library. But if you have gone to the bother of
using Unicode character sets, then the right behavior will be that
provided by the IBM library and the C functions will only answer some
of the questions some of the time.
If the needs are super-simple (e.g. the interface is used only for
transport), then maybe all that is needed are the mem*() functions.
But if we ever want to *look* at the data in any useful way, then the
C library does not measure up.
I think it would be good to incorporate ICU into the standard C
library for the next standard. There is no alternative with
equivalent functionality that I know of.