Re: Transmitting strings via tcp from a windows c++ client to a Java server



Roedy Green wrote:

I have rewritten the essay and written an experiment explorer program
to back up much of what I say.

see http://mindprod.com/jgloss/utf.html

Thanks for making the changes.

I haven't actually checked the code -- it seems safe to assume it does
what you say it does -- but with that proviso it seems pretty much OK.
I still think you could usefully make it clearer that your example
en/decoding code is not actually useful (because it is incomplete). I
know you /do/ say that, but it's buried away and (IMO) gives the
impression that it "doesn't really matter".

However, there is still one major error. It's near the bottom under
"Exploring Java's UTF Support". First off, it still isn't plain that two
of the four options you mention (1 and 3) have /nothing at all/ to
do with UTF-8. The so-called "modified UTF-8" format is not compatible
(upwards or downwards) with UTF-8. So I don't think you should mix
references to the two together, and certainly not intermingle them as
if they were all of comparable relevance. Specifically, the page
states (slightly further up, under "DataOutputStream.writeUTF()") that
the length is "followed by a standard UTF-8 byte encoding of the
String"; that is simply not true. You note already that Quasi-UTF-8
encodes 0x0 differently from UTF-8 (as the two bytes 0xC0 0x80 rather
than a single 0x00), which all by itself is enough to make writeUTF()
useless for interoperability with standards-compliant encodings.
However, there is also a major difference in how it encodes characters
off the BMP: it encodes each half of the UTF-16 surrogate pair
separately, as three bytes apiece. E.g. the Unicode character:
U+10302
will encode in UTF-8 as (taken from the Unicode Standard 4.0.1, table
3.3):
0xF0 0x90 0x8C 0x82
whereas under Sun's scheme it encodes as:
0xED 0xA0 0x80 0xED 0xBC 0x82
(I'm using unsigned bytes here).
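
For anyone who wants to check those byte sequences themselves, here is a minimal sketch that compares the two encodings side by side. The class and helper names (Utf8Comparison, standardUtf8, modifiedUtf8, hex) are my own invention for illustration and are not taken from the essay; the sketch also assumes a JDK recent enough to have java.nio.charset.StandardCharsets.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: compares standard UTF-8 with the bytes
// produced by DataOutputStream.writeUTF().
class Utf8Comparison {

    /** Standard UTF-8 bytes for the string. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Bytes written by writeUTF(), with the 2-byte length prefix removed. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream and a short string.
            throw new UncheckedIOException(e);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Formats bytes as unsigned hex, e.g. "0xF0 0x90". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+10302 lies outside the BMP, so Java stores it as a surrogate pair.
        String offBmp = new String(Character.toChars(0x10302));
        String nul = String.valueOf((char) 0);

        System.out.println(hex(standardUtf8(offBmp))); // 0xF0 0x90 0x8C 0x82
        System.out.println(hex(modifiedUtf8(offBmp))); // 0xED 0xA0 0x80 0xED 0xBC 0x82
        System.out.println(hex(standardUtf8(nul)));    // 0x00
        System.out.println(hex(modifiedUtf8(nul)));    // 0xC0 0x80
    }
}
```

The length prefix is stripped only so that the payload bytes can be compared directly; a real peer reading writeUTF() output would of course have to consume it.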

BTW, you also express some opinions on the (non-)value of the >16-bit
Unicode characters. I have no problem with your expressing your
opinions on your own webpages. I just wanted to add that I don't agree
with them.

-- chris