Re: DBD::ODBC and character sets



On 30.09.2009 15:39, Martin Evans wrote:
Does your setup pass the DBD::ODBC tests?
No, it does not:

t/40UnicodeRoundTrip.t
At least this test should pass without warnings and errors. If it doesn't, the following Unicode tests do not make sense at all.


You are entering a world of pain.

Right. Unicode is too young in computer terms ... ;-)
And the various encodings and Unicode versions don't make things easier.

use encoding xxx

This is used in Perl to say your script is encoded in xxx. Just because
you have and accept UTF-8 encoded data does not mean you need to "use
encoding" but if your script is encoded in xxx you need "use encoding
xxx". For instance, the example Hendrik gave you includes unicode
characters but does not need encoding. As a result, I cannot see how
adding "use encoding 'utf-8'" should make any difference to data
returned from sql server through DBD::ODBC.
It can make a difference in two cases: if you add "use encoding 'utf-8';" to a script that is really encoded as ISO-8859-1, or if you omit it from a script that is encoded as UTF-8 *and* contains non-ASCII string literals. In both cases, you end up with strings whose encoding and UTF-8 flag do not match.

Example 1:

#!/usr/bin/perl -w
use strict;
use encoding "utf-8"; # but file is encoded as iso-8859-1
("ÄÖÜ" eq "\x{00C4}\x{00D6}\x{00DC}") or die "encoding mismatch";
# ^-- literal German umlauts, upper case, encoded as ISO-8859-1
print "ok\n";

Output:

Malformed UTF-8 character (unexpected non-continuation byte 0xd6, immediately after start byte 0xc4) at test.pl line 4.
Malformed UTF-8 character (unexpected non-continuation byte 0xdc, immediately after start byte 0xd6) at test.pl line 4.
Malformed UTF-8 character (1 byte, need 2, after start byte 0xdc) at test.pl line 4.
encoding mismatch at test.pl line 4.

Example 2:

#!/usr/bin/perl -w
use strict;
# no 'use encoding "utf-8";' here, but the file is encoded as UTF-8
("ÄÖÜ" eq "\x{00C4}\x{00D6}\x{00DC}") or die "encoding mismatch";
# ^-- literal German umlauts, upper case, encoded as UTF-8
print "ok\n";

Output:

encoding mismatch at test.pl line 4.


Note that Example 2 does not give you any warnings, as ISO-8859-1 does not have any invalid byte sequences. Perl sees the left-hand side of eq as a string literal containing six(!) characters encoded as ISO-8859-1 (those six bytes that encode ÄÖÜ in UTF-8), and that literal has its UTF-8 flag turned off. The right-hand side is a string literal containing three UTF-8 characters, internally stored as the same six bytes, but with the UTF-8 flag turned on. A string of six characters cannot be the same as a string of three characters, so the eq expression is false.
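
The six-versus-three mismatch can be reproduced without depending on how the script file itself is encoded. The following is only a minimal sketch, not code from the thread: it writes the bytes as \x escapes and uses decode() from the core Encode module to build the three-character string explicitly. The variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# The six bytes that encode the three umlauts in UTF-8, written as
# escapes so that the encoding of this source file does not matter.
my $bytes = "\xC3\x84\xC3\x96\xC3\x9C";

# Decoding turns the six bytes into three characters.
my $chars = decode('UTF-8', $bytes);

printf "bytes: length=%d, UTF-8 flag %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';
printf "chars: length=%d, UTF-8 flag %s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';

# eq compares characters, so six characters never equal three.
print "equal: ", ($bytes eq $chars ? 'yes' : 'no'), "\n";
```

The undecoded string is what Perl ends up with in Example 2 when the pragma is missing; the decoded string is what the comparison expects.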

In Example 1, Perl sees three(!) bytes(!) in the string literal on the left-hand side of eq that do not represent a valid UTF-8 string, hence the three warnings. Still, the string has a length of three characters and has its UTF-8 flag set. The right-hand side is the same as in Example 2, but the binary junk is not equal to "ÄÖÜ", so again, the eq expression is false.
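
The same three bytes can be checked outside of the pragma. This is a rough sketch, assuming only the core Encode module: a strict UTF-8 decode of the ISO-8859-1 bytes is rejected, while decoding them as what they really are gives the expected three characters. The variable names are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# The three umlauts as ISO-8859-1 bytes, written as escapes.
my $latin1 = "\xC4\xD6\xDC";

# A strict decode dies on malformed input; LEAVE_SRC keeps $latin1 intact.
my $strict_ok = eval {
    Encode::decode('UTF-8', $latin1,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
print "valid UTF-8: ", ($strict_ok ? 'yes' : 'no'), "\n";

# Decoding with the encoding the bytes really have works fine.
my $chars = Encode::decode('ISO-8859-1', $latin1);
print "matches: ", ($chars eq "\x{00C4}\x{00D6}\x{00DC}" ? 'yes' : 'no'), "\n";
```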

t/40UnicodeRoundTrip.t is intentionally written using \x{0000} sequences instead of non-ASCII literals to prevent this special problem. And it has four paranoia tests (utf8::is_utf8(...) in the BEGIN block) to absolutely make sure the test data has the UTF-8 flag set or cleared as expected.
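
A check in that spirit could look like the sketch below. It is not copied from the test file; it only shows how utf8::is_utf8() together with utf8::upgrade() and utf8::downgrade() lets you pin down the flag of a test string, and that the flag alone does not change what eq sees.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Test data built from escapes only, so the file encoding is irrelevant.
my $plain = "\xC4\xD6\xDC";

# Same characters, but forced into the internal UTF-8 representation.
my $upgraded = $plain;
utf8::upgrade($upgraded);

# And back again to the one-byte-per-character representation.
my $downgraded = $upgraded;
utf8::downgrade($downgraded);

# Paranoia checks: make sure the flags are what the test expects.
die "flag unexpectedly set"     if     utf8::is_utf8($plain);
die "flag unexpectedly cleared" unless utf8::is_utf8($upgraded);
die "flag unexpectedly set"     if     utf8::is_utf8($downgraded);

# The flag is an internal detail; the characters are still the same.
die "strings differ" unless $plain eq $upgraded && $plain eq $downgraded;
print "ok\n";
```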

t/UChelp.pm has a dumpstr() function that dumps the Unicode string in pure ASCII using \x00 or \x{0000} sequences, including length and UTF-8 flag. It prevents the unwanted side effect of a UTF-8-capable terminal that displays bytes written by Perl as Unicode characters, even if they were meant to be non-Unicode.
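
To illustrate the idea, here is a hypothetical helper along the same lines. It is not the dumpstr() from t/UChelp.pm; the name dump_ascii() and the output format are assumptions made for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, NOT the real dumpstr(): returns length, UTF-8 flag
# and content of a string using nothing but printable ASCII.
sub dump_ascii {
    my ($s) = @_;
    return 'undef' unless defined $s;
    my $flag = utf8::is_utf8($s) ? 'on' : 'off';
    my $body = join '', map {
        my $cp = ord;
        # Printable ASCII (except backslash and double quote) stays as-is.
        ($cp >= 0x20 && $cp < 0x7f && $_ ne '\\' && $_ ne '"') ? $_
            : $cp > 0xff ? sprintf('\\x{%04x}', $cp)  # wide character
            :              sprintf('\\x%02x', $cp);   # byte-sized value
    } split //, $s;
    return sprintf 'len=%d utf8=%s "%s"', length($s), $flag, $body;
}

print dump_ascii("\xC3\x84"), "\n";   # two byte-sized characters
print dump_ascii("\x{20ac}"), "\n";   # one wide character (euro sign)
```

Because the output contains only ASCII, it looks the same on any terminal, whatever encoding the terminal assumes.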


Alexander


--
Alexander Foken
mailto:alexander@xxxxxxxx http://www.foken.de/alexander/
