Re: Slightly tricky string problem
- From: Dirk Bruere at NeoPax <dirk.bruere@xxxxxxxxx>
- Date: Thu, 28 May 2009 18:34:55 +0100
charlesbos73 wrote:
On May 28, 4:30 am, Dirk Bruere at NeoPax <dirk.bru...@xxxxxxxxx>
wrote:
... which I'm having trouble getting my head around.
I have a String, which is single character eg "a"
I need to convert it to a String which is the decimal representation of
the UTF8 ascii code ie "97"
What did you do to try to solve your problem?
As Mayeul pointed out, "UTF8 ascii code" [sic] doesn't mean anything.
ASCII is a code defining 128 entities, each of which is usually
stored in 8 bits with the most significant bit set to 0. But in any
case "ASCII the characters" should not be confused with "ASCII the
encoding".
Same for Unicode.
Unicode defines many more entities (called code points).
The first 128 Unicode code points are the 128 ASCII entities.
UTF-8 is an encoding that was designed so that any byte
with the most significant bit set to 0 is an ASCII entity.
So a UTF-8 encoded file containing only ASCII characters is
byte-for-byte identical to an ASCII encoded file.
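A minimal sketch of that compatibility claim, assuming Java 7 or later for StandardCharsets (the class and method names are made up for illustration): it encodes a String both ways and compares the bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of the original thread.
class AsciiUtf8Compat {
    // True when the UTF-8 and US-ASCII encodings of s are byte-for-byte identical.
    static boolean sameBytes(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        return Arrays.equals(utf8, ascii);
    }

    public static void main(String[] args) {
        // Pure ASCII input: both encodings yield the single byte 97.
        System.out.println(sameBytes("a"));
        // U+00E9 is outside ASCII: UTF-8 needs two bytes, US-ASCII substitutes '?'.
        System.out.println(sameBytes("\u00e9"));
        System.out.println(Arrays.toString("a".getBytes(StandardCharsets.UTF_8)));
    }
}
```

For "a" the two byte arrays match; for anything outside the first 128 code points they do not.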
But in your case, if you have a String [sic] you shouldn't
care at all about encoding details: UTF-8 or little faeries
wearing boots drawing you characters using magical powder has
no importance.
It matters when I have a protocol that interfaces with a machine that only accepts ASCII-encoded strings. So UTF-8 would be a good starting point.
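For an ASCII-only protocol like that, one possible approach is to encode strictly and refuse anything that is not ASCII rather than let it be silently replaced. This is only a sketch: the protocol itself is not defined in this thread, the class and method names are hypothetical, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original thread.
class StrictAscii {
    // Encodes s as US-ASCII bytes; throws IllegalArgumentException
    // if s contains any character that US-ASCII cannot represent.
    static byte[] encode(String s) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            ByteBuffer buf = encoder.encode(CharBuffer.wrap(s));
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("not pure ASCII: " + s, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("a")[0]); // the single byte for 'a'
    }
}
```

By contrast, String.getBytes with a charset quietly substitutes a replacement byte for unmappable characters, which may or may not be what the receiving machine should see.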
Things get quickly messy in Java because when Java was created
Unicode didn't define codepoints outside the BMP. So we end
up with a backward compatible charAt(..) method that is broken
beyond repair because it definitely does NOT give back the
character at 'x' when you have a String that contains characters
outside the BMP.
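To make the charAt(..) complaint concrete, here is a small sketch (class and constant names are made up) using U+1040B, which a Java String stores as two UTF-16 code units:

```java
// Hypothetical demo class, not part of the original thread.
class SurrogateDemo {
    // U+1040B written as a surrogate pair.
    static final String OUTSIDE_BMP = "\uD801\uDC0B";

    public static void main(String[] args) {
        // length() counts UTF-16 code units, so one code point reports 2.
        System.out.println(OUTSIDE_BMP.length());
        // codePointCount counts actual code points.
        System.out.println(OUTSIDE_BMP.codePointCount(0, OUTSIDE_BMP.length()));
        // charAt(0) hands back only the high surrogate, not the character.
        System.out.println(Character.isHighSurrogate(OUTSIDE_BMP.charAt(0)));
        System.out.println(Integer.toHexString(OUTSIDE_BMP.charAt(0)));
        // codePointAt(0) combines the pair into the full code point.
        System.out.println(Integer.toHexString(OUTSIDE_BMP.codePointAt(0)));
    }
}
```

In other words charAt(..) indexes code units, which only coincide with characters inside the BMP.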
That said, all hope is not lost, for we now have the codePointAt(..)
method, which works correctly for code points outside the BMP, as
shown in the example below:
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class CodePointTest {  // class name is arbitrary
    @Test public void tests() {
        assertEquals( "0", Integer.toString("\u0000".codePointAt(0)) );
        // Java offers no easy way to write, say, U+1040B (dec 66571)
        // in source code, hence the surrogate pair below.
        assertEquals( "66571",  // 0x1040B (hex) 66571 (dec)
                      Integer.toString("\uD801\uDC0B".codePointAt(0)) );
        assertEquals( "97", Integer.toString("a".codePointAt(0)) );
    }
}
If you're curious as to how to do what Integer.toString(..) does
you can look at the source code for the Integer class.
Note that Integer.toString(int) works as expected on
entities outside the BMP:
Integer.toString("\uD801\uDC0B".codePointAt(0))
gives back the expected "66571" string.
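The same two calls extend to a whole String. The sketch below (names are hypothetical; the diamond operator assumes Java 7 or later) walks a String one code point at a time and collects the decimal representations, stepping by Character.charCount(..) so surrogate pairs are consumed as a unit:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original thread.
class CodePointDecimals {
    // Returns the decimal string of every code point in s, in order.
    static List<String> toDecimalStrings(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(Integer.toString(cp));
            // Advance by 1 for BMP characters, 2 for surrogate pairs.
            i += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(toDecimalStrings("a\uD801\uDC0B"));
    }
}
```

For the single-character case in the original question this reduces to Integer.toString("a".codePointAt(0)).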
By now you can expect the "JLS-nazi bot" (that shall recognize
itself) to nitpick on grammatical mistakes and claim loud
that Java is perfect and that the fact that we have both a
(broken) charAt(..) method and codePointAt(..) is not a
problem at all.
But as usual the "JLS-nazi bot"'s deranged ramblings shall be
sent to /dev/null without any consideration.
Thanks.
Right now my problem is the lack of a full definition of the protocol, so I'll have to return to this later.
--
Dirk
http://www.transcendence.me.uk/ - Transcendence UK
http://www.theconsensus.org/ - A UK political party
http://www.onetribe.me.uk/wordpress/?cat=5 - Our podcasts on weird stuff