Re: Perl opting for double-byte chars?

From: J. Romano (jl_post_at_hotmail.com)
Date: 09/12/04


Date: 12 Sep 2004 10:11:56 -0700

Bëelphazoar <http://joecosby.com/code/mail.pl> wrote in message news:<9i57k0hfs4ov5orh4cji217f55icn6lnrq@4ax.com>...
>
> I am working on a problem, I have text in a database which
> includes the word "más". The "á" is ASCII value 225/E1 .

Dear Joe,

   It will help a lot if you give us the output of "perl -v". I'm
sure Unicode has something to do with your problem, but Unicode
support has been changing (updating) in recent versions of Perl.
Without knowing the version of Perl you're using and the platform
you're using it on, we can only guess as to what the problem is.

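  By the way, are you SURE that "á" is the extended ASCII value 225?
That is its value (0xE1) in ISO-8859-1 (Latin-1) and Windows-1252, but
one source I have lists it as 160, which matches the old DOS code
pages (437 and 850). "Extended ASCII" means different things in
different code pages, so it's worth checking which one your database
uses.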
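
   To see the code-page difference for yourself, here is a minimal
sketch using the Encode module (Perl 5.8 or later). The choice of
"cp437" as the DOS code page is my assumption about where the value
160 comes from; I don't know what your database actually uses.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "á" is Unicode code point U+00E1.
my $a_acute = "\x{E1}";

# Encode the same character in two single-byte code pages.
my $latin1 = encode("iso-8859-1", $a_acute);
my $cp437  = encode("cp437", $a_acute);    # assumed DOS code page

printf "iso-8859-1: %d\n", ord $latin1;    # 225
printf "cp437:      %d\n", ord $cp437;     # 160
```

If your data really is 225, you are almost certainly dealing with
Latin-1 or Windows-1252 rather than a DOS code page.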

> ASCII only defines the low 7 bits, which are the same
> character representations in most English-based code
> pages.
>
> In addition to ASCII there is unicode, which is 16-bit,
> and which, somewhere in my application, is apparently
> being used when the "á" is used because it is greater
> than 127.

   Unicode is not 16-bit; that's a common myth. It CAN be encoded in
16-bit units (UTF-16), but it can also be encoded using a different
method called UTF-8 (which is what Perl uses internally for strings it
treats as Unicode). UTF-8 is a variable-length encoding: a character
takes one to four bytes (the original design allowed up to six, but
RFC 3629 caps it at four). In your case, the character whose value is
greater than 127 is being encoded in two bytes, whereas the other
characters (< 128) are being encoded in one byte.
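
   If you want to watch the variable-length encoding at work, here is
a rough sketch that assumes the Encode module is available (Perl 5.8
or later). The four sample code points are just ones I picked for
illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative code points: "A", "á", the euro sign, and an emoji.
my @code_points = (0x41, 0xE1, 0x20AC, 0x1F600);

# Record how many bytes each character needs once encoded as UTF-8.
my %utf8_length;
for my $cp (@code_points) {
    $utf8_length{$cp} = length encode("UTF-8", chr $cp);
}

printf "U+%04X -> %d byte(s)\n", $_, $utf8_length{$_} for @code_points;
```

The byte count grows with the code point: one byte below 128, two for
"á", and so on.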

   Understand? If you don't, here's a great link to an FAQ I used to
understand more about how Unicode is encoded:

   http://www.cl.cam.ac.uk/~mgk25/unicode.html

You may also want to check the following perldocs (which, depending on
your version of Perl, you may or may not have all of):

   perldoc Encode
   perldoc perluniintro
   perldoc Unicode::String

> The code pulls the text out of the database and
> assigns it to a variable, but when I print the
> variable it is now "más", the "á" has been
> replaced by C3A1 .

   This certainly looks to me like UTF-8 Unicode encoding, but let's
check just to make sure:

According to the FAQ (whose link I mentioned above), the original
UTF-8 design encodes a Unicode character value using one to six bytes
(current UTF-8, per RFC 3629, uses only the first four forms):

1: 0xxxxxxx
2: 110xxxxx 10xxxxxx
3: 1110xxxx 10xxxxxx 10xxxxxx
4: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

where "x" is a bit that stands for the Unicode value.

0xC3A1 is two bytes long. Its bit representation is:

   11000011 10100001

So when you apply the 2-byte bit pattern to it:

   110xxxxx 10xxxxxx

the "x"s represent the bits: 00011 100001

Put them together and you get 11100001 which is the binary
representation of 225. Therefore, we now know that character number
225, when encoded into UTF-8 encoding, results in the two bytes 0xc3
and 0xa1, which is exactly what you're seeing.
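
   The same arithmetic can be done with bit masks. This is only a
sketch for the two-byte case worked through above; the variable names
are mine.

```perl
use strict;
use warnings;

# The two bytes seen in the output.
my ($lead, $cont) = (0xC3, 0xA1);

# Check the bit patterns: 110xxxxx followed by 10xxxxxx.
die "not a two-byte UTF-8 sequence\n"
    unless ($lead & 0xE0) == 0xC0 && ($cont & 0xC0) == 0x80;

# Keep the 5 payload bits of the lead byte and the 6 payload bits
# of the continuation byte, then join them.
my $code_point = (($lead & 0x1F) << 6) | ($cont & 0x3F);

printf "%08b %08b -> %08b = %d (0x%02X)\n",
    $lead, $cont, $code_point, $code_point, $code_point;
# prints: 11000011 10100001 -> 11100001 = 225 (0xE1)
```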
   
> I am PRETTY sure that this is not happening
> within the code I am working on, if I am following
> the code flow correctly it looks like it does
> nothing but pull the text from the database and
> pass it back.

   SOMEWHERE in the code the characters greater than 127 are being
converted from extended-ASCII to UTF-8 encoding, but it's hard to say
exactly where unless I have access to the code. Therefore, I'll leave
it up to you to figure out where it's happening.
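
   One thing that may help you track it down is to dump the string at
each stage. The helper below is hypothetical (describe_string is a
name I made up), and it assumes Perl 5.8 or later, where
utf8::is_utf8 is available. Keep in mind that the UTF8 flag is an
internal detail, so treat it as a hint only.

```perl
use strict;
use warnings;

# Hypothetical helper: report how Perl currently holds a string and
# list the ordinal of each element in hex.
sub describe_string {
    my ($s) = @_;
    my $flag = utf8::is_utf8($s) ? "UTF8 flag on" : "UTF8 flag off";
    my $ords = join " ", map { sprintf("%02X", ord($_)) } split //, $s;
    return "$flag: $ords";
}

print describe_string("m\xE1s"), "\n";       # UTF8 flag off: 6D E1 73
print describe_string("m\xC3\xA1s"), "\n";   # UTF8 flag off: 6D C3 A1 73
```

If you see "C3 A1" as two separate elements right after the database
call, the data was already UTF-8 before your code touched it.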

   But even if you do find where this is happening, you will still
have to deal with the problem of converting the two-byte UTF-8
representation (of characters greater than 127) to their one-byte
extended-ASCII equivalent. ¿Comprende?

   I'm not sure how to do this, but here are three things you can try.
Whether or not each one works may depend on the version of Perl you
are using, so letting me know your "perl -v" output may help me out.

----------------------------------------
# Method 1: Decode the UTF-8 bytes into Perl
# characters, then encode them as Latin-1:
use Encode;
my $string = pullTextFromDb();
# Interpret the bytes in $string as UTF-8:
$string = decode_utf8($string);
# Convert the characters to one-byte ISO-8859-1:
$string = encode("iso-8859-1", $string);
----------------------------------------
# Method 2: Tell Perl that $string is UTF-8
# encoded and that you want its
# latin1 equivalent (needs the Unicode::String
# module from CPAN):
use Unicode::String qw(utf8 latin1);
my $string = pullTextFromDb();
$string = utf8($string)->latin1();
----------------------------------------
# Method 3: Tell Perl to pack each character's
# Unicode value into just one byte
# of a larger string. This only helps if Perl
# already treats $string as characters (not as
# raw UTF-8 bytes), and every value must be
# below 256:
my $string = pullTextFromDb();
$string = pack "C*", map ord, split //, $string;
----------------------------------------
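
   To try the Method 1 idea without touching your database, here is a
small sketch. The literal string stands in for whatever
pullTextFromDb() returns; I am assuming it comes back as raw UTF-8
bytes, which is what your C3A1 suggests.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for the database result: "más" as raw UTF-8 bytes.
my $from_db = "m\xC3\xA1s";

my $chars  = decode("UTF-8", $from_db);       # 3 characters
my $latin1 = encode("iso-8859-1", $chars);    # 3 bytes

printf "%d bytes in, %d bytes out: %s\n",
    length $from_db, length $latin1, unpack("H*", $latin1);
# prints: 4 bytes in, 3 bytes out: 6de173
```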

   Try all these and see if any of them work. Again, what works and
what doesn't work might very well depend on the version of Perl that
you're using. Also, even if one of them does work, some other part of
your code might be converting it back to UTF-8 encoding, undo-ing the
conversion you just made.
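
   One place such a re-conversion can hide is an output layer on a
filehandle. The sketch below is only an illustration and assumes Perl
5.8 or later; I have no idea whether your code actually sets such a
layer. It writes to an in-memory file so you can inspect the bytes.

```perl
use strict;
use warnings;

my $latin1 = "m\xE1s";    # "más" as three Latin-1 bytes

# Write to an in-memory file through a UTF-8 encoding layer.
my $buffer = "";
open my $out, ">:encoding(UTF-8)", \$buffer or die "open failed: $!";
print {$out} $latin1;
close $out;

print unpack("H*", $buffer), "\n";    # 6dc3a173
```

The 0xE1 byte comes out as C3 A1 again, so it is worth looking for
binmode calls or ":utf8" layers on whatever handle prints your text.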

   But it's still worth a shot to try them out. Hopefully one of the
above three methods will work for you, and your problem will be "no
más."

   I hope this helps, Joe.

   -- Jean-Luc


