Re: Encoding conversion problem



Andrea wrote:

...
If I save characters outside the range supported by IBM-850 (e.g. the
euro sign) and then read them back, I get garbage...

Yes, the Euro symbol is not part of that encoding, so your database
can't contain it.
I've found a strange thing: C and COBOL applications can write and read
(using embedded SQL) characters outside the accepted range without
problems... So the database can contain those characters without
losing any information, but I can't understand how...

Yes, in theory you can store any value (0 - 255 in case of one byte
strings) in a string, but how that is interpreted (i.e. encoding) is
where it gets hairy. Also, multibyte characters would break the
interpretation.
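
To make the "interpretation" point concrete, here is a minimal sketch in
Java. It assumes the runtime ships the ISO-8859-15 charset (most JDKs do);
the class and method names are made up for illustration. The same stored
byte comes out as a different character depending on which encoding the
reader assumes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class SameByteTwoEncodings {
    // Decode one raw byte under the given charset.
    static String decode(byte raw, Charset charset) {
        return new String(new byte[] { raw }, charset);
    }

    public static void main(String[] args) {
        byte raw = (byte) 0xA4;
        Charset latin1 = StandardCharsets.ISO_8859_1;
        // Assumption: ISO-8859-15 is available in this runtime.
        Charset latin9 = Charset.forName("ISO-8859-15");

        // Same byte, two readings: U+00A4 (generic currency sign)
        // under ISO-8859-1, U+20AC (euro sign) under ISO-8859-15.
        System.out.println(decode(raw, latin1));
        System.out.println(decode(raw, latin9));
    }
}
```

The database stores the byte either way; what goes wrong is the reader and
the writer disagreeing about what that byte means.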

If you need it, you would have to change the database's
encoding (ISO-8859-15 includes the Euro symbol).
Otherwise, you have to take care not to write unsupported
characters into string/character fields.

One solution could be to parse all strings and replace the symbol with
the shorthand "EUR", but it might not be acceptable to your client.
Actually the EURO character is just an example, I have more complex
strings to handle (and I can't change the encoding of the database).
If my problem has no solution at all then I'd like to understand why
other languages don't have this problem...

Ah, there are always hacks around limitations. But they usually aren't
pretty. The problem is to funnel a string with these "unsupported"
characters through the JDBC driver (both ways).

You might get around it by using typeless fields (you can put any byte
sequence there), like BLOBs maybe...
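
A rough sketch of the BLOB idea, under loud assumptions: the table NOTES
with columns ID and BODY is hypothetical, UTF-8 is my choice of byte
format, and whether setBytes/getBytes work directly on a BLOB column
depends on your driver. The application does the encoding itself, so the
driver only ever sees bytes.

```java
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

final class BinaryTextColumn {
    // The application, not the driver, decides how text becomes bytes.
    static byte[] toBytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    static String fromBytes(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Hypothetical schema: NOTES(ID INTEGER, BODY BLOB).
    static void save(Connection con, int id, String text) throws SQLException {
        String sql = "INSERT INTO NOTES (ID, BODY) VALUES (?, ?)";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, id);
            ps.setBytes(2, toBytes(text)); // no character conversion here
            ps.executeUpdate();
        }
    }

    static String load(Connection con, int id) throws SQLException {
        String sql = "SELECT BODY FROM NOTES WHERE ID = ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    return null; // no such row
                }
                byte[] raw = rs.getBytes(1);
                return raw == null ? null : fromBytes(raw);
            }
        }
    }
}
```

The catch is that the C and COBOL applications would have to agree on the
same byte format, and you lose text searching and sorting on that column.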

Or you write a parser that substitutes the impossible characters with
acceptable replacements. Of course, this is most likely not feasible.
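
For what it's worth, such a substitution pass could look roughly like the
sketch below. The replacement table and the "?" fallback are invented for
illustration, and the example uses ISO-8859-1 as the target only because
every Java runtime is guaranteed to have it; for the real database you
would pass something like Charset.forName("IBM850"), assuming your runtime
provides that charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

final class UnsupportedCharFilter {
    // Hypothetical replacement table: code point -> acceptable substitute.
    private static final Map<Integer, String> REPLACEMENTS =
            Map.of(0x20AC, "EUR");

    // Keep every character the target charset can encode; swap the rest
    // for a substitute from the table, or "?" if none is listed.
    static String sanitize(String input, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (encoder.canEncode(ch)) {
                out.append(ch);
            } else {
                out.append(REPLACEMENTS.getOrDefault(cp, "?"));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(sanitize("10 \u20AC", StandardCharsets.ISO_8859_1));
    }
}
```

Note that this is one-way: once the euro sign has become "EUR", nothing in
the database says whether the user originally typed the symbol or the
letters.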

But the customer has to be aware that a database with encoding X can
only hold strings encoded in X. If they need UTF-8 for example now, they
will eventually have to change their database. And it would be better to
migrate to a suitable encoding now than to hack around it and, in a few
years, have to do it all over again (and then some) when they finally do
want to change the database encoding.

As for other languages not having the problem: in C, you can treat a
string just like an array of bytes and use those for whatever you like;
the compiler won't complain. Even interpreting them as memory addresses
is possible, adding and subtracting, etc.
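
Java is different because a String is a sequence of Unicode characters,
not bytes, so somewhere a conversion to the target encoding has to happen.
The sketch below only shows what the standard library does in that
conversion; that your JDBC driver does something comparable is my
assumption, not something I have checked. ISO-8859-1 stands in for the
database encoding, and the class name is made up.

```java
import java.nio.charset.StandardCharsets;

final class LossyRoundTrip {
    // Encode to a single-byte charset and decode again, as would happen
    // when a string is converted on the way in and on the way out.
    static String roundTrip(String text) {
        byte[] encoded = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(encoded, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The euro sign has no ISO-8859-1 byte, so getBytes substitutes
        // the charset's replacement byte ('?') and the original is gone.
        System.out.println(roundTrip("10 \u20AC")); // prints "10 ?"
    }
}
```

A C program never performs that step: it hands the database whatever bytes
it has and gets the same bytes back, meaningful or not.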

Thanks,
Andrea

--
Sabine Dinis Blochberger

Op3racional
www.op3racional.eu