Re: Need help on string manipulation




liljencrantz@xxxxxxxxx wrote:

WaterWalk wrote:

Hello, I'm currently learning string manipulation. I'm curious about
the preferred way to manipulate strings in C, especially when the
strings contain non-ASCII characters. For example, if substrings need
to be replaced, or one character needs to be changed, what should I do?
Is it better to convert strings to UTF-32 (UCS-4) before manipulation?

But on Windows, wchar_t is 16 bits, which isn't enough for characters
whose code points don't fit in 16 bits.

On Linux, I hear wchar_t is 32 bits. Maybe on Linux, strings can
simply be converted to wchar_t and then handled without worrying? I'm
not sure.

Characters represented by wchar_t must use one wchar_t per character,
unlike characters using char, which may use a multibyte encoding. The
actual size and encoding of wchar_t are implementation-defined; for
example, DragonFly BSD uses different encodings of wchar_t depending on
the encoding of char strings. If Windows uses a 16-bit wchar_t, you will
be unable to use some newer Unicode characters. If this is a problem for
you, then avoid wchar_t. You will not have this problem under Linux,
since glibc uses UCS-4, which is a 31-bit encoding.
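Since the width of wchar_t varies between platforms, it can be checked at run time rather than assumed. The following is a minimal sketch; the function names wchar_bits and fits_in_wchar are made up for illustration, and the check only compares ranges. It does not prove that wchar_t values are Unicode code points on a given platform.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Number of bits in one wchar_t on this implementation. */
size_t wchar_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}

/* Returns 1 if the code point cp fits in a single wchar_t, else 0.
 * cp must be a Unicode code point (at most U+10FFFF). */
int fits_in_wchar(unsigned long cp)
{
    assert(cp <= 0x10FFFFUL);
    return cp <= (unsigned long)WCHAR_MAX;
}
```

With a 16-bit wchar_t, fits_in_wchar(0x1F600UL) would be expected to return 0, while with a 32-bit wchar_t it should return 1.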

Yes, this is my problem. If any Unicode character could be encoded in a
single wchar_t, life would be much easier. *BUT* on Windows, I can't
simply use wchar_t, which is only 16 bits, to represent all Unicode
characters. I hear that MS Word uses two wchar_t values to hold those
"extended characters". Then, if one character in a string needs to be
changed, the handy array index operation can't be used. What's more, the
whole string may need to change. This is really annoying. Any ideas?
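The two-unit representation mentioned here is what UTF-16 calls a surrogate pair. As a rough sketch of why one character can occupy two 16-bit units, the hypothetical helper utf16_encode below (not taken from any library) encodes a single code point as UTF-16:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16.
 * Returns the number of 16-bit units written to out (1 or 2),
 * or 0 if cp is not a valid Unicode scalar value. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       /* out of range or a lone surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* fits in a single unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

For example, U+1D11E (MUSICAL SYMBOL G CLEF) encodes as the two units 0xD834 0xDD1E, so replacing it with a one-unit character changes the length of the string.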

Things like being able to use [] to access a character at a specific
index, being able to use ints to iterate over a string, and being able
to examine a specific character without worrying about whether it's a
multibyte character make life _much_ easier.
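To illustrate the point about indexing, here is a small sketch that replaces characters in a wide string with plain [] access and an integer index. The function name wcs_replace_char is hypothetical, and the sketch assumes every character in the string fits in one wchar_t.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Replace every occurrence of `from` with `to` in the wide string s.
 * Returns the number of replacements made. Because each character
 * occupies exactly one wchar_t, the string length never changes. */
size_t wcs_replace_char(wchar_t *s, wchar_t from, wchar_t to)
{
    size_t count = 0;
    assert(s != NULL);
    for (size_t i = 0; s[i] != L'\0'; i++) {
        if (s[i] == from) {
            s[i] = to;
            count++;
        }
    }
    return count;
}
```

Doing the same on a multibyte char string would require decoding each character first, and a replacement of a different byte length would force the rest of the string to be shifted.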


What is a "good" way to handle all this mess? Are there any good
examples? I'll be very thankful for your help.

I have written a non-trivial program called fish (a command-line
shell for Unix, kind of like bash or zsh) that uses wide character
strings internally. You can download it from
http://roo.no-ip.org/fish/.

For some reason, I can't reach this site, which is a pity.
