Re: unicode conversion
- From: Donald King <dlking@xxxxxxxx>
- Date: Thu, 23 Mar 2006 11:56:54 -0600
corff@xxxxxxxxxxxxxxxxxx wrote:
[...]
Paradoxically, starting my script with the flag -CS, like in the shebang line
#!/usr/bin/perl -CS
breaks utf8 output of Chinese characters to an otherwise perfectly utf8-transparent console; see my XML::Simple and utf8 woe posting of
last week and try it yourself. So the opposite of what the perlrun manpage promises happens.
Works fine for me with this example script:
#!/usr/bin/perl -CS
use strict;
use warnings;
use Text::Unidecode;
my $str = "\x{624B}\x{8868}";
print "$str\n";
print unidecode($str), "\n";
# As-is, prints:
# 手表
# Shou Biao
# Without the -CS, prints the following:
# Wide character in print at unitest.pl line 7.
# 手表
# Shou Biao
As I explained in the other thread, what's probably happening is that, without -CS, your data is being read in by Perl as octets, then printed out as octets; however, under -CS your data is still read as octets (since it's not one of the STDFOO handles that's affected by -CS) yet printed to a UTF8-aware filehandle (which assumes that your octets are actually ISO-8859-1).
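A minimal sketch of that double-encoding effect, using the core Encode module. The octet string below is an assumption standing in for whatever your script actually reads; it is not taken from your code.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The UTF-8 octets for U+624B U+8868, as they would arrive from an
# undecoded input handle (6 octets, no character semantics yet).
my $octets = "\xE6\x89\x8B\xE8\xA1\xA8";

# Roughly what a UTF-8-aware output handle does with an undecoded string:
# each octet is taken as an ISO-8859-1 character and encoded again.
my $double = encode('UTF-8', $octets);

# Decoding the input first yields two real characters ...
my $chars = decode('UTF-8', $octets);

# ... which encode back to the original six octets on output.
my $good = encode('UTF-8', $chars);

printf "double-encoded: %d octets, decoded first: %d octets\n",
    length($double), length($good);
```

Every one of the six octets is above 0x7F, so each turns into two octets when encoded a second time.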
I find the off-and-on auto-predictive and authoritarian style in which Perl seems to treat utf8 data not really transparent and a source of awful headaches.
Most of that's for backwards compatibility with pre-Unicode versions of
Perl. In Perl 5.6, you used the "use utf8" and "use bytes" pragmata to
treat *all* strings as chars or octets in a given block. Having each
string remember is a blessing compared to that.
In addition, Perl's utf8 support occasionally slows down things significantly; my latest experience is with bulk quantities of utf8 data (latin, CJK material, _tons_ of characters with accents and diacritics in one soup).
When I try to segment such a string with approx. 400kB of data into an array using split(), and my regex contains a single utf8 character,
then the whole thing gets terribly slow when being done in utf8 mode. Actually, split()-ing becomes so slow that I can't use the script for production purposes any more. If, on the contrary, I treat my 400kB-long string as a series of octets, ignoring character semantics,
and let my regex in split() search for two adjacent octets of a
given type, then the whole thing is lightning fast, as usual, and as
expected.
What version of Perl are you using? I'm using Perl 5.8.8 on Debian
testing, and I don't see the slowdown you're having. I wrote a simple
benchmark that generates a string of over 1 million Unicode characters
(from the U+2400 block, so they're 3 octets each) and does various
string ops on it, such as m//g, s///g, and split. Using utf8::encode()
to create the equivalent UTF-8 byte string, I compared char-vs-byte
performance and it was within a few percent (with character-oriented ops
just a hair slower than byte-oriented).
chronos@isis:~/temp$ ./unicode-benchmark.pl -c
Using characters
Creation: 1.687 seconds
length = 1250000
␙␋␃␎␣␂␝␐␐␣␗␜␏␓␣␒␀␚␔␣...
Match One: 0.000 seconds
Match All: 0.131 seconds
Split: 0.264 seconds
s///g: 0.054 seconds
chronos@isis:~/temp$ ./unicode-benchmark.pl -b
Using bytes
Creation: 1.675 seconds
length = 3750000
E2 90 99 E2 90 8B E2 90 83 E2 90 8E E2 90 A3 E2 90 82 E2 90...
Match One: 0.000 seconds
Match All: 0.121 seconds
Split: 0.246 seconds
s///g: 0.042 seconds
The benchmark program is available at
<http://chronos-tachyon.net/~chronos/unicode-benchmark.pl>.
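For a quick comparison without downloading anything, here is a much smaller sketch of the same idea. It is not the benchmark script linked above: the string size, the choice of U+2423 as separator, and the timed() helper are all assumptions made for illustration, and it only times split().

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical size; the run shown above used 1,250,000 characters.
my $n = 100_000;

# Character string: random code points from U+2400..U+2423 (3 octets each in UTF-8).
my $chars = join '', map { chr(0x2400 + int rand 0x24) } 1 .. $n;

# Equivalent UTF-8 byte string: copy, then encode the copy in place.
my $bytes = $chars;
utf8::encode($bytes);

# Separator U+2423 as a character, and as its three UTF-8 octets.
my $sep_chars = "\x{2423}";
my $sep_bytes = "\xE2\x90\xA3";

# Hypothetical helper: run a code block, return (elapsed seconds, its result).
sub timed {
    my ($code) = @_;
    my $t0     = time;
    my $result = $code->();
    return (time - $t0, $result);
}

# Limit -1 keeps trailing empty fields, so both counts are comparable.
my ($t_c, $n_c) = timed(sub { my @f = split /\Q$sep_chars\E/, $chars, -1; scalar @f });
my ($t_b, $n_b) = timed(sub { my @f = split /\Q$sep_bytes\E/, $bytes, -1; scalar @f });

printf "chars: length %d, %d fields, %.3f s\n", length($chars), $n_c, $t_c;
printf "bytes: length %d, %d fields, %.3f s\n", length($bytes), $n_b, $t_b;
```

Both splits should report the same number of fields, since the three-octet separator can only match at a character boundary in valid UTF-8. Timings will vary with Perl version and hardware.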
So I think, either Perl's control features of which data are utf8 and
which are not, need a significant overhaul, or Perl's utf8 processing capabilities need streamlining.
If you're not using Perl 5.8, the biggest selling point of the whole 5.8 series is that UTF-8 support has been overhauled and streamlined compared to 5.6. Also, there have been many bugfixes and Unicode optimizations since the early 5.8's, so if you're not using it already, you might try your problem code on the newest 5.8 release.
One of the main points of potential conflict is certainly the way in which regex automata are built, and notably how to define atoms. For me, it would be fine if a complex Perl script could do all its data processing, I/O transfers etc. in pure octet semantics unless instructed otherwise.
That's basically how things worked in 5.6, except that instead of giving
you an option, all regexps had octet semantics, period. Octets by default drove people nuts, hence 5.8.
The "use bytes" pragma almost but not quite does what you ask for; unfortunately, it doesn't affect regexps. Other than calling utf8::encode() and utf8::decode() liberally by hand, I don't think Perl is currently capable of what you ask.
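A rough sketch of the by-hand approach: encode to octets, split with an octet-only regexp, then decode the pieces again. The sample text and the U+3001 separator are made up for illustration.

```perl
use strict;
use warnings;

# Character string with a wide separator, U+3001 (hypothetical sample data).
my $text = "\x{624B}\x{3001}\x{8868}\x{3001}abc";

# Drop to octets: utf8::encode() converts the string in place to its UTF-8 bytes.
my $octets = $text;
utf8::encode($octets);

# Split with octet semantics on the three UTF-8 bytes of U+3001 (E3 80 81).
my @fields = split /\xE3\x80\x81/, $octets;

# Return to character semantics: utf8::decode() works in place on each field.
utf8::decode($_) for @fields;

printf "%d fields, first is U+%04X\n", scalar @fields, ord $fields[0];
```

The utf8::encode() and utf8::decode() functions are built in, so no "use utf8" is needed to call them. Both modify their argument in place, which is why the sketch works on a copy of the original string.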
[...]
I frequently encountered the problem that Perl without any instruction treated my utf8 data correctly on a, e.g. Linux box, including console and file output, but goofed in WinXP unless additional binmode() instructions were given; to make things worse, utf8-clean stuff developed on XP failed miserably on Linux.
Strange. I never had much trouble with "binmode(HANDLE, ':utf8');" on either OS, which is the official way of doing any UTF-8 I/O in modern Perl. (However, that was Cygwin Perl, not ActivePerl. I don't think I've ever tried Unicode under ActivePerl, so YMMV.)
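For reference, a small sketch of that binmode() call. It writes to an in-memory handle only so that the resulting octets can be inspected; in a real script the handle would be STDOUT or a file.

```perl
use strict;
use warnings;

# In-memory handle used purely for inspection (an assumption of this sketch).
my $buffer = '';
open(my $out, '>', \$buffer) or die "cannot open in-memory handle: $!";

# Mark the handle as UTF-8: characters are encoded on output, no warning.
binmode($out, ':utf8');

print {$out} "\x{624B}\x{8868}";
close($out);

# Hex dump of what actually reached the handle.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $buffer;
print "$hex\n";
```

The dump should show six octets, three for each of the two characters.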
Note that there are a lot of situations (esp. under Unix) where a buggy Perl program can still end up spitting out valid UTF-8. Unless the I/O handles involved have been marked as :utf8, 8-bit octet strings are output literally, Unicode strings are output as UTF-8, and all input is treated as octets. That can result in some very strange and corrupt output when the program mixes octets with Unicode, especially since "octet" is a synonym for "ISO-8859-1" as far as Perl is concerned. So, depending on the program and circumstances, valid UTF-8 output doesn't by itself mean the program is working correctly.
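The following sketch illustrates those rules on a handle that has not been marked as :utf8. The raw_output() helper and the sample strings are hypothetical, and the in-memory handle stands in for a real file or terminal.

```perl
use strict;
use warnings;
no warnings 'utf8';    # silence "Wide character in print" for this demo

# Hypothetical helper: print a string to an unmarked in-memory handle
# and return the octets that came out, as a hex dump.
sub raw_output {
    my ($string) = @_;
    my $buffer = '';
    open(my $out, '>', \$buffer) or die "cannot open in-memory handle: $!";
    print {$out} $string;
    close($out);
    return join ' ', map { sprintf '%02X', ord } split //, $buffer;
}

my $octet   = "\xE9";        # a single 8-bit octet
my $unicode = "\x{624B}";    # a single wide character

my $alone_octet   = raw_output($octet);               # octet passes through as-is
my $alone_unicode = raw_output($unicode);             # wide character comes out as UTF-8
my $mixed         = raw_output($octet . $unicode);    # octet is upgraded as ISO-8859-1

print "octet alone:   $alone_octet\n";
print "unicode alone: $alone_unicode\n";
print "mixed:         $mixed\n";
```

In the mixed case the lone octet E9 is expected to come out as the two octets C3 A9, i.e. it has silently been treated as the ISO-8859-1 character U+00E9. The result is valid UTF-8, but it is not what went in.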
Hope I've helped more than confused.
--
Donald King, a.k.a. Chronos Tachyon
http://chronos-tachyon.net/