Re: length of char in bits differs on Win/Linux and Mac



Bart Rider wrote:
> Now i observed the following. The character 'ä' stored in the
> char variable c and used to access the counting array:
> countingArray[c]++
> caused no problems on windows/linux computers, but on macs,
> where the value 8240 (0x2030) was assigned with this char.
>
> It seems to me, that char on mac computers is 16bit wide.
> Is this true?

You were just lucky on Windows with your algorithm, and you used the wrong encoding for reading on the Mac.

You were lucky on Windows. Java uses Unicode for all characters, and a Java char is 16 bits wide on every platform, not just on the Mac. Current Unicode standards support characters with code points beyond 2^16 (Unicode is not a 16-bit standard), although handling characters beyond 2^16 in Java is troublesome. Whatever Java version you use, your 256-entry array could have failed at any time. You were lucky because your input didn't contain any character beyond the Latin-1 range. If it had, your code would have blown up on Windows too.
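To make the point about char width and code points beyond 2^16 concrete, here is a minimal sketch. The class and method names (CharWidthDemo, codeUnits, codePoints) are made up for illustration and are not from the original code.

```java
// Hypothetical demo class, not part of the original poster's program.
class CharWidthDemo {
    // Number of 16-bit char values (UTF-16 code units) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Width of a Java char in bits; the same on Windows, Linux and Mac.
        System.out.println("char bits: " + Character.SIZE);

        // a-umlaut (U+00E4) fits into a single char.
        String umlaut = "\u00E4";
        System.out.println(codeUnits(umlaut) + " char(s), " + codePoints(umlaut) + " code point(s)");

        // U+1D11E lies beyond 2^16 and needs two chars (a surrogate pair).
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(codeUnits(clef) + " char(s), " + codePoints(clef) + " code point(s)");
    }
}
```

A single char used as an array index can therefore be anything from 0 to 65535, which is why a 256-entry array is not safe.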

Regarding the Mac result: You used the wrong encoding. When you read text data into Java, Java needs to know which encoding the data is in, so it can translate the data to Java's internal Unicode. The encoding you used (implicitly or explicitly) translated some of the input to the Unicode code point 0x2030. Since 0x2030 is the Unicode code point for the per mille sign, and not for a-umlaut, the conversion was wrong.
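A minimal sketch of what was probably going on, under an assumption the original post does not confirm: the file stores 'ä' as the single byte 0xE4 (as in ISO-8859-1), and the Mac's default encoding was MacRoman, where byte 0xE4 is the per mille sign. The class and method names are made up for illustration, and the "x-MacRoman" charset is not guaranteed to be present in every JVM, so the sketch checks for it first.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original poster's program.
class DecodeDemo {
    // Decodes the bytes through a Reader with an explicitly named charset,
    // instead of relying on the platform default.
    static String readAll(byte[] data, Charset cs) {
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), cs)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = { (byte) 0xE4 }; // assumed on-disk form of 'ä'

        String latin1 = readAll(data, StandardCharsets.ISO_8859_1);
        System.out.printf("ISO-8859-1: U+%04X%n", (int) latin1.charAt(0));

        // MacRoman is an optional charset, so check before using it.
        if (Charset.isSupported("x-MacRoman")) {
            String mac = readAll(data, Charset.forName("x-MacRoman"));
            System.out.printf("x-MacRoman: U+%04X%n", (int) mac.charAt(0));
        }
    }
}
```

The same idea applies to a real file: pass the charset the file was actually written in to the InputStreamReader (or to whatever reader you use), rather than leaving it to the platform default.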

You need to fix the encoding which you use for reading the data. All your casting and bit-masking is nonsense; it will not fix the encoding problem.

In general, even if you had fixed the encoding problem, your original algorithm was faulty. It fails for every code point beyond 255, which leaves roughly 96000 possible characters that your algorithm doesn't cover. Your original algorithm handles only about 1/377th of all valid input values.

You only partly fixed that by counting 'other' characters. Only partly, because ...

> I solved the problem by using a try-catch-block and counting
> 'other' characters through it. :)

... using exceptions to handle valid input data is bad. A simple check whether a code point is greater than 255 would be the right thing to do here.
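A minimal sketch of such a comparison, assuming the counting array has 256 entries as described. The class and field names (CharCounter, counts, other) are made up for illustration and are not the original code.

```java
// Hypothetical counter, not the original poster's class.
class CharCounter {
    // One slot per code point in the range 0..255 (Latin-1).
    final int[] counts = new int[256];
    // Everything above 255 is tallied here instead of raising an exception.
    int other;

    void count(String text) {
        // Walk the string by code point, so that a character beyond 2^16
        // (stored as two chars) is counted once.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp > 255) {
                other++;
            } else {
                counts[cp]++;
            }
            i += Character.charCount(cp);
        }
    }

    public static void main(String[] args) {
        CharCounter counter = new CharCounter();
        counter.count("a\u00E4\u2030");
        System.out.println("a: " + counter.counts['a']);
        System.out.println("a-umlaut: " + counter.counts[0x00E4]);
        System.out.println("other: " + counter.other);
    }
}
```

No exception is thrown for valid input; the per mille sign simply lands in the 'other' tally.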

/Thomas
--
The comp.lang.java.gui FAQ:
ftp://ftp.cs.uu.nl/pub/NEWS.ANSWERS/computer-lang/java/gui/faq
http://www.uni-giessen.de/faq/archiv/computer-lang.java.gui.faq/