Re: bytes, chars, and strings, oh my!



David N. Welton wrote:

[...]
Ok - then I could also use this to transform the bytes into a String by
then doing new String(bytes, "some encoding, possibly the system one")
for regular text files, right?
That might or might not raise new problems, because the system default encoding varies from system to system.
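
To illustrate the point, here is a minimal sketch (assuming Java 7 or later for StandardCharsets; the class and method names are made up for this example). The same two bytes decode to different strings depending on which encoding is passed, which is why relying on the system default gives different results on different machines.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    // Hypothetical helper: decode bytes with an explicitly named charset
    // instead of relying on the platform default.
    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        // These two bytes are the UTF-8 encoding of U+00E9 (e with acute).
        byte[] bytes = { (byte) 0xC3, (byte) 0xA9 };

        String asUtf8 = decode(bytes, StandardCharsets.UTF_8);
        String asLatin1 = decode(bytes, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 length:      " + asUtf8.length());
        System.out.println("ISO-8859-1 length: " + asLatin1.length());

        // What new String(bytes) would silently use on this machine.
        System.out.println("Default charset:   " + Charset.defaultCharset());
    }
}
```

Decoded as UTF-8 the bytes yield one char, decoded as ISO-8859-1 they yield two, so naming the encoding explicitly at least makes the behaviour the same everywhere.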

I had somewhat similar conceptual problems when I tried to interpret PostScript files from Java. (PostScript is a language that doesn't distinguish between byte and char, because it was invented back in the 1980s.)
My solution there was to choose the "ISO-8859-1" (aka ISO-Latin-1) encoding. "ISO-8859-1" is essentially a no-encoding. Its byte->char conversion simply adds a zero high-byte. Its char->byte conversion drops the zero high-byte and treats all chars beyond '\u00FF' as illegal (i.e. converts them to byte 63, which is '?').
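
A small sketch of that round trip, assuming Java 7 or later (StandardCharsets.ISO_8859_1 stands in for the "ISO-8859-1" name; the class and method names are only for illustration). It pushes all 256 byte values through a String and back, then shows what happens to a char beyond '\u00FF'.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    // byte -> char: each byte becomes the char with the same value (0..255).
    static String bytesToString(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // char -> byte: chars up to '\u00FF' keep their low byte;
    // anything beyond is replaced by '?' (byte 63).
    static byte[] stringToBytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Every possible byte value, 0x00 through 0xFF.
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }

        String s = bytesToString(all);
        byte[] back = stringToBytes(s);
        System.out.println("lossless round trip: " + Arrays.equals(all, back));

        // The euro sign U+20AC has no ISO-8859-1 mapping.
        byte[] euro = stringToBytes("\u20AC");
        System.out.println("euro sign becomes byte: " + euro[0]);
    }
}
```

The round trip is lossless only in the bytes -> String -> bytes direction; going String -> bytes -> String loses every char above '\u00FF'.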


--
"Thomas:Fritsch$ops:de".replace(':','.').replace('$','@')
