Re: Character Encoding

From: John C. Bollinger (jobollin_at_indiana.edu)
Date: 02/21/05


Date: Mon, 21 Feb 2005 11:30:34 -0500

Fred wrote:

> I've been using java.net.URLEncoder to encode text coming from a form
> on a web page before I store it in my database, and java.net.URLDecoder
> to decode the text when I read it from the database so I can display it
> to the user. I'm using UTF-8 character encoding.
>
> I recently had a problem where a user copied and pasted text from the
> Attachmate terminal emulator into a textarea and submitted the form.
> The text was stored successfully, but when it came time to decode it,
> the URLDecoder class started throwing errors. I'm guessing that some
> characters that were UTF-8 incompatible came along for the ride,
> because I've had similar problems with Attachmate in the past.

There are no characters incompatible with UTF-8 -- it is a
general-purpose charset covering all of Unicode. Moreover, if you
successfully _encode_ the characters with UTF-8 (in the process of
URL-encoding them) then there is absolutely no reason that you should
not be able to reverse the process. (You do, however, need to specify
UTF-8 at both encoding and decoding time.)
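
For illustration, here is a minimal sketch of a round trip that names UTF-8 on both sides. The class and method names are made up for this example and have nothing to do with Fred's code; only java.net.URLEncoder and java.net.URLDecoder are real.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper class, for illustration only.
class UrlRoundTrip {

    // URL-encode text, explicitly naming UTF-8 as the charset.
    static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Reverse the process, naming the same charset.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Non-ASCII characters plus characters that are special in URLs.
        String original = "caf\u00e9 \u20ac 100% & more";
        String encoded = encode(original);
        System.out.println(encoded);
        System.out.println(original.equals(decode(encoded)));
    }
}
```

If the two calls name different charsets, or one of them falls back to the platform default, the round trip can quietly produce different text even though nothing throws.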

If you post a small, self-contained, compilable example that exhibits
the problem, preferably with test data, then we can probably point you
to where the problem lies. You would also get much better advice if you
showed the actual stack traces for the exceptions thrown. The problem
is not that the classes you are trying to use are broken; it is that you
are not using them according to specs.
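
As an aside, one documented way to make URLDecoder throw is to hand it a string that is not valid URL-encoded text, such as one containing a stray '%'. The sketch below only demonstrates that behavior; it is not a diagnosis of your problem, and the class and method names are invented for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper class, for illustration only.
class DecodeCheck {

    // Returns true if the input decodes cleanly, false if URLDecoder
    // rejects it with an IllegalArgumentException.
    static boolean decodes(String input) {
        try {
            URLDecoder.decode(input, "UTF-8");
            return true;
        } catch (IllegalArgumentException e) {
            // Thrown for malformed or incomplete %xx escapes.
            return false;
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decodes("a%20b"));        // well-formed escape
        System.out.println(decodes("100%"));         // incomplete trailing escape
        System.out.println(decodes("bad%zzescape")); // non-hex digits after '%'
    }
}
```

Text that was never URL-encoded, or that was altered or truncated after encoding, can end up in this state.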

Do note, by the way, that you have _two_ encoding/decoding pairs to
worry about here, and so far you have only discussed one. You also need
to worry about the encoding and decoding involved in sending the
form from the client to your application. Since you say you've had
trouble with Attachmate before, I tend to suspect that your
application's character handling is not as robust as you think it is.
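
To make the second pair concrete, here is a rough sketch of what happens when the bytes a client sends are interpreted with a different charset than the one used to produce them. It simulates the mismatch with plain byte arrays rather than a real HTTP request; the class and method names are hypothetical, and the charsets are chosen only as an example.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical helper class, for illustration only.
class FormCharsetMismatch {

    // Encode text to bytes with one charset (the "client" side), then
    // decode those bytes with another (the "server" side).
    static String sendAndReceive(String text, String clientCharset, String serverCharset) {
        try {
            byte[] wire = text.getBytes(clientCharset);
            return new String(wire, serverCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String typed = "na\u00efve";

        // Both sides agree: the text survives intact.
        System.out.println(typed.equals(sendAndReceive(typed, "UTF-8", "UTF-8")));

        // Client sends UTF-8, server assumes ISO-8859-1: the two bytes of
        // the i-with-diaeresis come back as two separate characters.
        String garbled = sendAndReceive(typed, "UTF-8", "ISO-8859-1");
        System.out.println(garbled.length()); // one character longer than typed
    }
}
```

Nothing throws in the mismatched case; the application simply receives different characters than the user entered, and any later encoding step faithfully preserves the damage.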

-- 
John Bollinger
jobollin@indiana.edu

