F<utf8.pm> is evil (was: XML::LibXML UTF-8 toString() -vs- nodeValue())
- From: Eric Pozharski <whynot@xxxxxxxxxxxxxx>
- Date: Wed, 15 Apr 2009 02:45:51 +0300
On 2009-04-12, Peter J. Holzer <hjp-usenet2@xxxxxx> wrote:
On 2009-04-12 14:14, Eric Pozharski <whynot@xxxxxxxxxxxxxx> wrote:
*SKIP*
On 2009-04-11, Peter J. Holzer <hjp-usenet2@xxxxxx> wrote:
On 2009-04-11 11:59, Eric Pozharski <whynot@xxxxxxxxxxxxxx> wrote:
I've thought a lot. I should admit, whenever I see C<use utf8;>
instead of C<use encoding 'utf8';> I go nuts. Look at what we've
got here:
*SKIP*
Let's compare 4 programs, which are all essentially the same:
*SKIP*
The differences are in the encoding of the source file (UTF-8 vs.
ISO-8859-7) and the line marked "use XXX ###" above.
1) encoded in UTF-8, contains "use utf8;"
prints:
char[14]: 0x39a 0x3b1 0x3bb 0x3b7 0x3bc 0x3ad 0x3c1 0x3b1 0x20 0x3ba
0x3cc 0x3c3 0x3bc 0x3b5
*SKIP*
3) encoded in ISO-8859-7, contains "use encoding 'ISO-8859-7';"
prints:
char[14]: 0x39a 0x3b1 0x3bb 0x3b7 0x3bc 0x3ad 0x3c1 0x3b1 0x20 0x3ba
0x3cc 0x3c3 0x3bc 0x3b5
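
A minimal sketch of why variants 1) and 3) print the same thing. It uses the core Encode module instead of either pragma, which is an assumption made for illustration, since the four programs themselves are snipped. The same text stored as UTF-8 bytes or as ISO-8859-7 bytes decodes to the same code points as long as the decoder is told the right encoding. The helper name dump_chars and the variable names are made up.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The 14 code points from the output quoted above.
my @expected = (0x39a, 0x3b1, 0x3bb, 0x3b7, 0x3bc, 0x3ad, 0x3c1, 0x3b1,
                0x20, 0x3ba, 0x3cc, 0x3c3, 0x3bc, 0x3b5);
my $text = join '', map { chr($_) } @expected;

# Simulate the two source-file encodings as plain byte strings.
my $utf8_bytes  = encode('UTF-8', $text);
my $greek_bytes = encode('ISO-8859-7', $text);

# Decoding each with its own encoding yields identical character strings.
my $from_utf8  = decode('UTF-8', $utf8_bytes);
my $from_greek = decode('ISO-8859-7', $greek_bytes);

# Hypothetical helper: format a string the way the quoted output does.
sub dump_chars {
    my ($s) = @_;
    return sprintf 'char[%d]: %s', length($s),
        join ' ', map { sprintf '0x%x', ord($_) } split //, $s;
}

print dump_chars($from_utf8), "\n";
print dump_chars($from_greek), "\n";
print "same string\n" if $from_utf8 eq $from_greek;
```

Only the byte strings differ (27 bytes as UTF-8, 14 as ISO-8859-7); once decoded, the two character strings compare equal.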
And with C<use encoding 'utf8';> you get the same character string,
and lots of other useful stuff. (I just can't see why anyone would want
scalars implicitly upgraded to characters and yet still have to set up
wide IO themselves.) But my point isn't that F<encoding.pm> outperforms
F<utf8.pm>. I'm scared. I consider F<utf8.pm> a kind of Pandora's box.
Read this, if you can:
проц запросить {
    мое ($имяфайла) = @_;
    если (существует $ЗАГАЛ{$имяфайла}) {
        вернуть 1 если $ЗАГАЛ{$имяфайла};
        прекратить "Сбой компиляции в запросить";
    }
    мое ($настоящийфайл,$результат);
    ИТЕР: {
        длякаждого $префикс (@ЗАГАЛ) {
            $настоящийфайл = "$префикс/$имяфайла";
            если (-ф $настоящийфайл) {
                $ЗАГАЛ{$имяфайла} = $настоящийфайл;
                $результат = делать $настоящийфайл;
                последний ИТЕР;
            }
        }
        прекратить "$имяфайла не найдено в \@ЗАГАЛ";
    }
    если ($@) {
        $ЗАГАЛ{$имяфайла} = неопред;
        прекратить $@;
    } другое (!$результат) {
        удалить $ЗАГАЛ{$имяфайла};
        прекратить "не ИСТИНА возвращена из $имяфайла";
    } иначе {
        вернуть $результат;
    }
}
I admit, it's impossible to write this with F<utf8.pm> alone
(F<overload.pm> comes to mind, though I can't comment on that since I
haven't used it). I looked for simple yet rich code, and found this piece
the most telling. I bet you've seen it before; you use it constantly.
Yet can you name it?
Someone might say "Who the heck would need that stupidity?" Idiots. It
still surprises me how many idiots are around. They would scream:
"Look! What cool stuff! I don't have to learn anything!"
You can say: "Eric, what strange stuff are you smoking? That's
impossible." I think you're wrong. I've come to the conclusion
(overoptimistic?) that the idiots around you are the same as those
around me. So they would scream. (BTW, I don't smoke, I pipe "Prima
optima light".)
[ Lots of irrelevant stuff below, can easily be skipped ]
*SKIP*
My understanding is based on this -- C<perldoc perlunicode>
"use encoding" needed to upgrade non-Latin-1 byte strings
By default, there is a fundamental asymmetry in Perl's Unicode
model: implicit upgrading from byte strings to Unicode strings
assumes that they were encoded in ISO 8859-1 (Latin-1), but
Unicode strings are downgraded with UTF-8 encoding.
This paragraph is confusing. I have a vague idea what the author wanted
to say but even then it's not quite correct. I doubt somebody can
understand this paragraph unless they already exactly understood the
problems before.
This happens because the first 256 codepoints in Unicode happens
to agree with Latin-1.
If encoding is unknown, it's treated as latin1, even if it's not.
This has nothing to do with "use utf8" and "use encoding". The
"implicit upgrading" which is mentioned here happens (for example) when
you concatenate a byte string to a character string. But then the result
*is* a character string, not a byte string.
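
The implicit upgrade described here can be sketched in a few lines. This is an illustration under stated assumptions, not code from the thread: the variable names are invented, and utf8::is_utf8 is used only to peek at the internal flag.

```perl
use strict;
use warnings;

my $bytes = "\xFC";       # a single byte; stored without the UTF8 flag
my $chars = "\x{3b1}";    # GREEK SMALL LETTER ALPHA; needs the UTF8 flag

# Concatenation upgrades the byte string, reading byte 0xFC as U+00FC.
my $joined = $bytes . $chars;

printf "bytes:  flag=%d length=%d\n",
    (utf8::is_utf8($bytes) ? 1 : 0), length($bytes);
printf "joined: flag=%d length=%d first=U+%04X\n",
    (utf8::is_utf8($joined) ? 1 : 0), length($joined), ord($joined);
```

The result is a two-character string whose first character is U+00FC, which is the Latin-1 reading of the original byte. If the byte had really been ISO-8859-7, that reading would be wrong, and nothing would warn about it.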
BTW, F<encoding.pm> says exactly what you've said. What F<utf8.pm>
mangles. Thanks, now I feel much better.
Byte strings are *not* implicitly assumed to be ISO-8859-1, as you can
easily check by matching against a character class:
% perl -le '$_ = "\x{FC}"; print /\w/ ? "yes" : "no"'
no
% perl -le '$_ = "\x{FC}"; utf8::upgrade($_); print /\w/ ? "yes" : "no"'
yes
For those unaware of what happened:
perl -MDevel::Peek -wle '
$_ = "\x{FC}";
Dump $_;
utf8::upgrade($_);
Dump $_;
print' | recode latin1..utf8
SV = PV(0x92556d0) at 0x9280470
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x926ca60 "\374"\0
  CUR = 1
  LEN = 4
SV = PV(0x92556d0) at 0x9280470
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x9269068 "\303\274"\0 [UTF8 "\x{fc}"]
  CUR = 2
  LEN = 3
ü
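
The Dump output shows only the internal buffer changing (CUR goes from 1 to 2 and the UTF8 flag appears). A small sketch, with invented variable names, of what stays the same from the program's point of view:

```perl
use strict;
use warnings;

my $native   = "\xFC";
my $upgraded = $native;
utf8::upgrade($upgraded);    # same string value, different internal storage

my $same_value  = $native eq $upgraded ? 1 : 0;
my $char_length = length($upgraded);
# The bytes pragma is used here only to peek at the internal buffer size.
my $byte_length = do { use bytes; length($upgraded) };

print "equal as strings: $same_value\n";
print "length in characters: $char_length\n";
print "length of internal buffer: $byte_length\n";
```

The two scalars compare equal and both have length 1; only the buffer grows to two bytes, matching CUR = 2 in the second dump.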
*SKIP*
If there were C<use utf8> or C<use encoding 'utf8'>,
then the compiler would complain about a malformed UTF-8 character if
the source file was actually in ISO-8859-7.
But it didn't.
It does for me. If I change "use encoding 'ISO-8859-7'" to "use utf8"
in my ISO-8859-7 encoded file, I get a lot of warnings.
Yes, it does. Since I typed my examples on the command line, I went with
hex escapes, and those don't warn. If B<perl> finds *bytes* with the high
bit set that don't form valid UTF-8 sequences while it is in any UTF-8
source encoding mode, then it really complains (and complains a lot).
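
The difference can be sketched without a specially encoded source file. A hex escape such as \x{3b1} already names a code point, so there is nothing to decode, while raw ISO-8859-7 bytes have to get past a UTF-8 decoder. The example uses Encode's strict decoder as a stand-in for what the parser does under C<use utf8>; that substitution is an assumption for illustration, not the parser's actual code path.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A hex escape is a code point, not bytes: no decoding, no warning.
my $escaped = "\x{39a}\x{3b1}\x{3bb}\x{3b7}";

# The same text as raw ISO-8859-7 bytes (every byte has the high bit set).
my $greek_bytes = encode('ISO-8859-7', $escaped);

# Try to read those bytes as UTF-8, dying on malformed input.
my $copy    = $greek_bytes;   # decode may modify its argument when CHECK is set
my $decoded = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
my $error   = $@;

print defined $decoded ? "decoded fine\n" : "malformed UTF-8\n";
```

The bytes here are ca e1 eb e7; 0xCA announces a two-byte sequence, but 0xE1 is not a continuation byte, so the strict decoder gives up.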
*SKIP*
Do not use this pragma for anything else than telling Perl that your
script is written in UTF-8. The utility functions described below
are directly usable without "use utf8;".
I believe I already said that once or twice in this thread.
My understanding of "script" is a program text outside of any quotes in
it.
Bull***. A script is the complete program text, including any string
constants, numeric constants, comments, the __DATA__ stream, if any.
Why would a string constant in a script not be part of it?
Yes, I agree that "script" in general means this. What I described was
my understanding of what was said (or meant) here by this word.
*SKIP*
You mean:
{3415:30} [0:0]$ perl -Mencoding=ucs2 -wle 'print "\x{4404}\x{4b04}\x{3204}\x{3004}"'
Can't locate object method "cat_decode" via package "Encode::Unicode" at
-e line 1.
That doesn't fix the endianness, and it behaves completely differently.
"perl -Mencoding=ucs2" can't work, as I already explained to sln.
This fixes endianness?
{56061:37} [0:0]$ perl -Mencoding=ucs2 -wle 'print "\x{0444}\x{044b}\x{0432}\x{0430}"'
Can't locate object method "cat_decode" via package "Encode::Unicode" at
-e line 1.
However, since I don't understand why it "can't work", I won't complain
why it can't "locate object method".
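
The two one-liners above differ only in byte order: each code point in the first is the byte-swapped form of the corresponding one in the second. A sketch of that relationship, using Encode functions directly rather than the encoding pragma (so it says nothing about the cat_decode failure); variable names are invented:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "\x{0444}\x{044b}\x{0432}\x{0430}";

# The same four characters serialized in both byte orders.
my $be_hex = unpack 'H*', encode('UCS-2BE', $text);
my $le_hex = unpack 'H*', encode('UCS-2LE', $text);

# Reading little-endian bytes as big-endian swaps each pair of bytes.
my $misread     = decode('UCS-2BE', encode('UCS-2LE', $text));
my $misread_hex = join ' ',
    map { sprintf '0x%04x', ord($_) } split //, $misread;

print "UCS-2BE: $be_hex\n";
print "UCS-2LE: $le_hex\n";
print "LE read as BE: $misread_hex\n";
```

The misread string comes out as 0x4404 0x4b04 0x3204 0x3004, the same code points as in the first one-liner.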
[ Lots of irrelevant stuff above, can easily be skipped ]
However, in spite of confessing to being scared of F<utf8.pm> features, I
promise to rant any time I find C<use utf8;> instead of
C<use encoding 'utf8';>
--
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom