Re: Transmitting strings via tcp from a windows c++ client to a Java server



qqq111 wrote:

....

But first a request. /Please/ follow Usenet etiquette and say who you are
replying to and quote selectively from the post as you reply. Normally I just
ignore people who don't follow "The Rules"; I'm making an exception in this
case on a whim ;-)


4. Each of our msgs is indeed preceded by a length field
(as fixed-size text field). Length is measured in Java
characters and dup by 2 to obtain size in bytes

That algorithm will not give you the size in bytes of a UTF-8 encoded string,
because UTF-8 uses between one and four bytes per Unicode character. There is
no way to compute the length of the UTF-8 encoding of a Unicode sequence that
does not involve scanning every character. The easiest thing, of
course, is just to let the platform do the encoding and then transmit the
length of the resulting byte array. If you want to calculate the length
yourself, then it's a bit messy -- the main problem is that in Java or Windows
the input data is encoded as UTF-16 so you have to undo that encoding and then
re-encode the result as UTF-8. Not especially difficult, but more work than
you might expect if you are used to relying on strlen() and the like.
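To make the two options concrete, here is a minimal Java sketch. The class and
method names are made up for illustration, and the manual version assumes the
string is well-formed UTF-16 (no unpaired surrogates); treat it as a starting
point rather than finished code.

```java
import java.nio.charset.StandardCharsets;

class Utf8Length {
    // Option 1: let the platform do the encoding and measure the byte array.
    static int encodedLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Option 2: undo the UTF-16 encoding one code point at a time and add up
    // how many bytes each code point needs in UTF-8.
    // Assumes well-formed UTF-16 input (no unpaired surrogates).
    static int computedLength(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0x80) {
                bytes += 1;          // ASCII
            } else if (cp < 0x800) {
                bytes += 2;
            } else if (cp < 0x10000) {
                bytes += 3;          // rest of the Basic Multilingual Plane
            } else {
                bytes += 4;          // supplementary planes (surrogate pair in UTF-16)
            }
            i += Character.charCount(cp); // 1 or 2 Java chars consumed
        }
        return bytes;
    }
}
```

Either way the answer generally differs from "number of Java chars times 2",
which is the point.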

Doubling the char count would work for UTF-16. But if you decide to stick with
UTF-8 (which sounds better to me) then I suggest you prototype your receiving
code (for both platforms) before you set the protocol in stone.
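A prototype for the Java side might look something like the sketch below. I
don't know your actual header layout, so the 8-digit zero-padded ASCII decimal
length (counting bytes of UTF-8 payload) is purely my assumption, as are the
class and method names.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class FramedUtf8 {
    // Hypothetical header: 8 ASCII decimal digits giving the payload size in BYTES.
    static final int HEADER_WIDTH = 8;

    static void writeMessage(OutputStream out, String text) throws IOException {
        byte[] payload = text.getBytes(StandardCharsets.UTF_8);
        // Locale.ROOT keeps the digits plain ASCII whatever the default locale is.
        String header = String.format(Locale.ROOT, "%0" + HEADER_WIDTH + "d", payload.length);
        out.write(header.getBytes(StandardCharsets.US_ASCII));
        out.write(payload);
        out.flush();
    }

    static String readMessage(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        byte[] header = new byte[HEADER_WIDTH];
        data.readFully(header); // blocks until all header bytes have arrived
        int length = Integer.parseInt(new String(header, StandardCharsets.US_ASCII));
        if (length < 0) {
            throw new IOException("negative length field: " + length);
        }
        byte[] payload = new byte[length];
        data.readFully(payload); // a single read() may return fewer bytes than asked for
        return new String(payload, StandardCharsets.UTF_8);
    }
}
```

The thing to notice is that the receiver needs the count in bytes before it
can decode anything, which is why a length measured in chars is awkward for
UTF-8.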

Whatever you do, make very sure that your documentation (formal or informal) of
the protocol is /very/ clear about the meaning of the size field. Remember
that the word "character" is ambiguous -- it could mean Java char-s, C++
wchar_t-s, or (most confusingly) Unicode characters. An inexperienced programmer
could even assume it meant "byte".


5. The BOM issue is, frankly, news to me. If I limit myself to
UTF-8 strings only, and stick to standard Win/Java api at
both client & server end, do I need to worry about BOM ?

I doubt it. The important thing is to have made a conscious (and documented)
decision. I would probably decide that a BOM must not be used, unless there's
something in your project's requirements that I don't know about.
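If you do go with "no BOM", enforcing it on the Java side is cheap. Java's
UTF-8 encoder does not write a BOM, and its decoder does not strip one (it
comes through as a leading U+FEFF), so a check on the raw bytes is enough.
The sketch below is only an illustration of that decision; the names are
invented, and rejecting the message outright (rather than silently skipping
the BOM) is my assumption about what you would want.

```java
import java.nio.charset.StandardCharsets;

class BomCheck {
    // The UTF-8 encoding of U+FEFF is the three bytes EF BB BF.
    static boolean startsWithUtf8Bom(byte[] payload) {
        return payload.length >= 3
                && (payload[0] & 0xFF) == 0xEF
                && (payload[1] & 0xFF) == 0xBB
                && (payload[2] & 0xFF) == 0xBF;
    }

    // Decode a payload under a "BOM must not be used" rule (assumed policy).
    static String decodeRejectingBom(byte[] payload) {
        if (startsWithUtf8Bom(payload)) {
            throw new IllegalArgumentException("payload starts with a UTF-8 BOM");
        }
        return new String(payload, StandardCharsets.UTF_8);
    }
}
```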

-- chris


