Re: Slightly tricky string problem



On May 28, 4:30 am, Dirk Bruere at NeoPax <dirk.bru...@xxxxxxxxx>
wrote:
... which I'm having trouble getting my head around.

I have a String, which is single character eg "a"
I need to convert it to a String which is the decimal representation of
the UTF8 ascii code ie "97"

What did you do to try to solve your problem?

As Mayeul pointed out, "UTF8 ascii code" [sic] doesn't mean anything.

ASCII is a code defining 128 characters, each usually stored
in 8 bits with the most significant bit set to 0. But in any
case "ASCII the characters" should not be confused with "ASCII the
encoding".

Same for Unicode.

Unicode defines many more entities (called code points).
The first 128 Unicode code points are the 128 ASCII characters.

UTF-8 is an encoding that has been created so that any byte
with the most significant bit set to 0 is an ASCII entity.

So a UTF-8 encoded file containing only ASCII characters is
byte-for-byte the same as an ASCII encoded file.
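
A quick way to check that claim from Java is to encode the same String
both ways and compare the bytes. This is only a sketch: the class and
method names (AsciiUtf8Demo, sameBytes) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiUtf8Demo {
    // Returns true when the UTF-8 and US-ASCII encodings of s are identical.
    static boolean sameBytes(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        // getBytes replaces characters US-ASCII cannot represent with '?'
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        return Arrays.equals(utf8, ascii);
    }

    public static void main(String[] args) {
        System.out.println(sameBytes("abc"));       // ASCII only: true
        System.out.println(sameBytes("caf\u00E9")); // contains U+00E9: false
    }
}
```

The second call differs because U+00E9 takes two bytes in UTF-8 and has
no ASCII representation at all.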

But in your case, if you have a String [sic] you shouldn't
care at all about encoding details: whether it's UTF-8 or little
faeries wearing boots drawing your characters using magical powder
has no importance.

Things quickly get messy in Java because when Java was created
Unicode didn't define code points outside the BMP, so a char is
only 16 bits wide. So we end up with a backward compatible
charAt(..) method that is broken beyond repair for this purpose:
it gives back a UTF-16 code unit, which is definitely NOT the
character at 'x' when you have a String that contains characters
outside the BMP (there you get half of a surrogate pair).
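
To make the difference concrete, here is a small sketch comparing what
the String API reports for a BMP character and for U+1040B written as a
surrogate pair. CharAtDemo and describe are hypothetical names, and
describe assumes a non-empty String.

```java
class CharAtDemo {
    // Summarizes how the String API sees s: UTF-16 code units versus code points.
    static String describe(String s) {
        return "length=" + s.length()
             + " codePoints=" + s.codePointCount(0, s.length())
             + " charAt(0)=" + (int) s.charAt(0)
             + " codePointAt(0)=" + s.codePointAt(0);
    }

    public static void main(String[] args) {
        System.out.println(describe("a"));
        // prints: length=1 codePoints=1 charAt(0)=97 codePointAt(0)=97
        System.out.println(describe("\uD801\uDC0B"));
        // prints: length=2 codePoints=1 charAt(0)=55297 codePointAt(0)=66571
    }
}
```

55297 is 0xD801, the high surrogate, which is not a character on its
own.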

Not all hope is lost, though: since Java 5 we have the
codePointAt(..) method, which works correctly for code points
outside the BMP, as shown in the example below:

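// Assumes JUnit 4; the class name is just a placeholder.
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class CodePointTest {
    @Test public void tests() {
        // assertEquals takes (expected, actual)
        assertEquals( "0", Integer.toString("\u0000".codePointAt(0)) );
        // Java offers no easy way to source code encode, say, U+1040B
        // (dec 66571), so it is written as a surrogate pair
        assertEquals( "66571", Integer.toString("\uD801\uDC0B".codePointAt(0)) );
        assertEquals( "97", Integer.toString("a".codePointAt(0)) );
    }
}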

If you're curious as to how to do what Integer.toString(..) does
you can look at the source code for the Integer class.
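
If you just want the idea without reading the JDK source, decimal
conversion can be sketched with repeated division by 10. This is NOT
how the Integer class actually does it, only a simplified illustration;
DecimalDemo and toDecimal are made-up names, and negative values are
rejected because code points are never negative.

```java
class DecimalDemo {
    // Hypothetical helper: builds the decimal digits of a non-negative int.
    static String toDecimal(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("non-negative values only");
        }
        if (value == 0) {
            return "0";
        }
        StringBuilder sb = new StringBuilder();
        while (value > 0) {
            // value % 10 is the lowest digit; '0' + digit is its character
            sb.append((char) ('0' + value % 10));
            value /= 10;
        }
        // Digits were collected lowest first, so reverse them.
        return sb.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(toDecimal("a".codePointAt(0))); // prints: 97
    }
}
```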

Note that Integer.toString(int) works as expected on
entities outside the BMP:

Integer.toString("\uD801\uDC0B".codePointAt(0))

gives back the expected "66571" string.
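
Putting it together for the original question, a helper could look
something like the sketch below. CodePointDecimal and decimalOf are
hypothetical names, and rejecting anything that is not exactly one code
point is my assumption about what "single character" should mean.

```java
class CodePointDecimal {
    // Returns the decimal code point of a String holding exactly one code point.
    static String decimalOf(String s) {
        // codePointCount treats a surrogate pair as one code point
        if (s.codePointCount(0, s.length()) != 1) {
            throw new IllegalArgumentException("expected exactly one code point");
        }
        return Integer.toString(s.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println(decimalOf("a")); // prints: 97
        // Character.toChars avoids writing the surrogate pair by hand
        String deseret = new String(Character.toChars(0x1040B));
        System.out.println(decimalOf(deseret)); // prints: 66571
    }
}
```

Character.toChars(int) is also a way around the source code escape
problem, since it builds the surrogate pair from the code point for you.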

By now you can expect the "JLS-nazi bot" (that shall recognize
itself) to nitpick on grammatical mistakes and claim loudly
that Java is perfect and that the fact that we have both a
(broken) charAt(..) method and codePointAt(..) is not a
problem at all.

But as usual the "JLS-nazi bot"'s deranged ramblings shall be
sent to /dev/null without any consideration.


