Re: Encoding conversion problem



Andrea wrote:
Hi everyone,
sorry for my previous double-post (a mistake).

Is it possible to ask the database driver to do the conversions for
you? Perhaps internally it is Unicode or some other encoding that can
deal with Euros.
I've checked the properties of the JDBC driver I use
(http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/ad/rjvdsprp.htm)
but there's nothing concerning encoding conversions.

We have the clue that C++ programs seem to store euros and get them back out.
Yes, we have C and COBOL programs that can store and write non-IBM850
chars without problems too.
As pointed out by Sabine in her post the reason may be that C programs
work with the pure sequences of bytes, without performing any encoding
conversion.

I do not really understand why a Euro sign would work with 8859-1 since
that does not contain that character as far as I am aware.

SORRY SORRY SORRY SORRY SORRY
I tried to insert (through JDBC) the EURO character in a DB2
configured with
...
Database territory = C
Database code page = 819
Database code set = ISO8859-1
...
and I can neither write nor read the EURO character correctly in
Java :-(
A COBOL program, by contrast, works correctly.

Then I tried the same thing on a SQL-Server 2000 instance with
collation compatibility_51_409_30003 (corresponding to a 1252 codepage,
i.e. Latin 1) and I can store and read the EURO character via
Java and JDBC.

That doesn't work in Java with Oracle 10g configured with
...
NLS_LANGUAGE = AMERICAN
NLS_TERRITORY = AMERICA
NLS_CHARACTERSET = US7ASCII
NLS_LENGTH_SEMANTICS = BYTE
...
Store and read through COBOL is OK, and in Java I can even write and read
accented vowels... even though those characters are outside US7ASCII...

You could do an experiment. Try feeding your database all possible
unicode chars in a set of 1-char records, and see which ones come back
unmangled. This is a kludge, but you could preconvert your Euro to
one of those invariant unused chars.
The EURO character is just an example and part of the problem; I can't
use this kind of kludge.
The specific problem is much more complex: a password is encrypted and
stored in the DB by a C program, but the encrypted chars fall outside
the IBM850 range and in Java I'm unable to read and decrypt the
string... this works if the database is ISO-8859-1 (that's why I
thought I was able to write another 'weird' char, the euro char, to an
ISO-8859-1 DB, sorry...). I also have the more general problem of data
entry: I don't know which characters users will insert, so I can't
substitute chars.
I've found a workaround for my encryption problem but I'm just trying to
understand the reason for the problem.
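
The workaround itself is not described in the post. As a hedged illustration only, one common way to keep encrypted bytes safe from character-set conversion is to store them as Base64 text, which uses only ASCII characters. The class and method names below are hypothetical, and java.util.Base64 requires Java 8 or later; this is a sketch of the general idea, not necessarily what was done here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper: all names here are illustrative, not from the thread.
class BinaryAsTextDemo {
    /** Turns arbitrary bytes (e.g. an encrypted password) into ASCII-only text. */
    public static String toColumnText(byte[] cipherBytes) {
        return Base64.getEncoder().encodeToString(cipherBytes);
    }

    /** Recovers the original bytes from the ASCII-only text. */
    public static byte[] fromColumnText(String columnText) {
        return Base64.getDecoder().decode(columnText);
    }

    public static void main(String[] args) {
        // Sample bytes standing in for encrypted data, including values above 0x7F.
        byte[] cipherBytes = {(byte) 0x80, (byte) 0xFF, 0x00, (byte) 0xA4};
        String text = toColumnText(cipherBytes);

        // Simulate a conversion through a 7-bit charset: Base64 text is unaffected.
        String afterConversion =
                new String(text.getBytes(StandardCharsets.US_ASCII), StandardCharsets.US_ASCII);

        System.out.println(Arrays.equals(cipherBytes, fromColumnText(afterConversion)));
    }
}
```

The trade-off is that the stored value grows by about a third and both the writer and the reader have to agree on the encoding step.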

Now it's clear to me that with a CHAR field Java performs an encoding
conversion using the encodings of the JVM and of the DBMS: if some
characters fall outside the destination encoding then they are lost
(i.e. converted into something completely different).
The only 'mysterious' thing for me now is the behavior on Oracle (JDBC
can read&write accented vowels even if they are outside ascii7)... any
idea? Maybe the Oracle driver is smarter than the DB2 Universal
Driver...
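
The loss described above can be reproduced without any database. The following is a minimal sketch, assuming only the JDK's standard charsets; the class and method names are made up for illustration, and it does not show what any particular JDBC driver does internally. It encodes a string into a target charset and shows that a character with no mapping is replaced by '?' (0x3F).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: names are illustrative, not from the thread.
class LossyConversionDemo {
    /** Encodes text in the given charset and returns the bytes as uppercase hex. */
    public static String encodeHex(String text, Charset charset) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    /** Returns true if every character of text has a mapping in the charset. */
    public static boolean isRepresentable(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String euro = "\u20AC";   // EURO SIGN
        String eAcute = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE

        System.out.println(encodeHex(euro, StandardCharsets.UTF_8));        // E282AC
        System.out.println(encodeHex(euro, StandardCharsets.ISO_8859_1));   // 3F, i.e. '?'
        System.out.println(encodeHex(eAcute, StandardCharsets.ISO_8859_1)); // E9
        System.out.println(encodeHex(eAcute, StandardCharsets.US_ASCII));   // 3F, i.e. '?'
        System.out.println(isRepresentable(euro, StandardCharsets.ISO_8859_1)); // false
    }
}
```

Checking with isRepresentable before writing makes the loss visible instead of silent, since String.getBytes substitutes the replacement byte without raising an error.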

Thanks everyone,
Andrea


Hello Andrea,

Even if you set a database encoding to ASCII it is very unlikely that the DB will strip non-ASCII characters. Actually, most databases treat every byte-sized (i.e. 8-bit) encoding almost identically internally. They may sometimes have different default collations but that is about it.
The codepage attribute is mostly important for programs interfacing with the DB. As most of those (especially older ones) are encoding-unaware, bytes also pass in and out unharmed. In the end all 8-bit encodings are equal until actually interpreted to represent characters, aren't they?
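
To make the "bytes are just bytes until interpreted" point concrete, here is a small sketch, assuming only the JDK's standard charsets and using hypothetical class and method names. It decodes every possible byte value into a Java String and encodes it back: the bytes survive when the charset maps all 256 values, and are damaged when it does not. A byte-oriented C or COBOL program never performs this step, which is why it is unaffected.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: names are illustrative, not from the thread.
class ByteTransparencyDemo {
    /** Returns all 256 possible byte values, 0x00 through 0xFF. */
    public static byte[] allByteValues() {
        byte[] bytes = new byte[256];
        for (int i = 0; i < 256; i++) {
            bytes[i] = (byte) i;
        }
        return bytes;
    }

    /** Decodes bytes to a String and encodes them back; true if nothing changed. */
    public static boolean survivesRoundTrip(byte[] bytes, Charset charset) {
        byte[] back = new String(bytes, charset).getBytes(charset);
        return Arrays.equals(bytes, back);
    }

    public static void main(String[] args) {
        byte[] raw = allByteValues();
        // ISO-8859-1 maps each byte to exactly one character, so nothing is lost.
        System.out.println(survivesRoundTrip(raw, StandardCharsets.ISO_8859_1)); // true
        // US-ASCII has no characters for 0x80..0xFF, so those bytes are replaced.
        System.out.println(survivesRoundTrip(raw, StandardCharsets.US_ASCII));   // false
    }
}
```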

I have seen applications running on cp-1252 platforms using 8859-1 encoded databases for years without anyone noticing. Same for cp-1257 on a cp-1252 database. Nobody really cares as long as the same data that was put in comes out again.

This is not unlike SMTP, which is supposed to be 7-bit only, but since the transport passes 8-bit characters freely, people are used to sending non-ASCII characters in plain-text emails although this is not supported. This all works great until someone from Lithuania sends me an email (I am in the Netherlands).

Regards,

Silvio