Re: Need help on string manipulation



On 2006-03-27, WaterWalk <toolmaster@xxxxxxx> wrote:
Hello, I'm currently learning string manipulation. I'm curious about
the favored way to manipulate strings in C, especially when
strings contain non-ASCII characters. For example, if substrings need
to be replaced, or one character needs to be changed, what should I do? Is it
better to convert strings to UTF-32 (UCS-4) before manipulation?

But on Windows, wchar_t is 16 bits, which isn't enough for characters
that can't be encoded in a single 16-bit unit.

On Linux, I hear wchar_t is 32 bits. Maybe on Linux, strings can
simply be converted to wchar_t and then handled without worrying? I'm
not sure.

What is a "good" way to handle all this mess? Are there any good
examples? I'll be very thankful for your help.


I was looking up some similar stuff recently; here is an OK start:

http://www.chemie.fu-berlin.de/chemnet/use/info/libc/libc_18.html

This concentrates more on "big chars", as I call them: characters
encoded in more than the standard char.
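
For what it's worth, here is a minimal sketch of the convert-then-edit
approach using the standard multibyte functions. The names to_wide and
replace_wchar are made up for illustration. It assumes the input is
valid in the current locale's encoding, and it relies on mbstowcs
accepting a NULL destination to report the length, which POSIX
specifies but plain ISO C does not promise.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a multibyte string (in the current
   locale's encoding) to a newly allocated wide string.
   Returns NULL on invalid input or allocation failure.
   The caller must free() the result. */
static wchar_t *to_wide(const char *mb)
{
    /* Assumption: a NULL destination makes mbstowcs only count the
       wide characters needed (POSIX behaviour). */
    size_t n = mbstowcs(NULL, mb, 0);
    if (n == (size_t)-1)
        return NULL;                    /* invalid multibyte sequence */

    wchar_t *w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;

    if (mbstowcs(w, mb, n + 1) == (size_t)-1) {
        free(w);
        return NULL;
    }
    return w;
}

/* Hypothetical helper: replace every occurrence of one wide character
   with another, in place. Returns the number of replacements made. */
static size_t replace_wchar(wchar_t *s, wchar_t from, wchar_t to)
{
    size_t count = 0;
    for (; *s != L'\0'; ++s) {
        if (*s == from) {
            *s = to;
            ++count;
        }
    }
    return count;
}
```

In a real program you would call setlocale(LC_ALL, "") first so that
mbstowcs uses the user's encoding, and convert back with wcstombs when
you are done. Where wchar_t is 16 bits, a character outside the Basic
Multilingual Plane takes two wchar_t units, so the one-element-per-
character replacement above only holds for BMP characters there.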

At the end of the day, do you really need DBCS or MBCS support?

Good luck. Hope the link helps.

--
Debuggers : you know it makes sense.
http://heather.cs.ucdavis.edu/~matloff/UnixAndC/CLanguage/Debug.html#tth_sEc