Re: XML::LibXML UTF-8 toString() -vs- nodeValue()
- From: "Peter J. Holzer" <hjp-usenet2@xxxxxx>
- Date: Sun, 12 Apr 2009 23:14:39 +0200
On 2009-04-12 14:14, Eric Pozharski <whynot@xxxxxxxxxxxxxx> wrote:
Before anything else, I beg your and everyone else's pardon. For some
weird reason, I'd called "tokens" "literals". Now I feel much better.
On 2009-04-11, Peter J. Holzer <hjp-usenet2@xxxxxx> wrote:
On 2009-04-11 11:59, Eric Pozharski <whynot@xxxxxxxxxxxxxx> wrote:*SKIP*
On 2009-04-10, Peter J. Holzer <hjp-usenet2@xxxxxx> wrote:
No. Almost all encodings today are supersets of US-ASCII.
Consider these two programs:
*SKIP*
$ perl -Mutf8 -wle 'print "фыва"; print "\x{C0}\x{B0}"'
Wide character in print at -e line 1.
фыва
�
{2775:24} [0:0]$ perl -Mencoding=latin1 -wle 'print "фыва"; print "\x{C0}\x{B0}"'
фыва
�
use encoding also sets the binmode for STDOUT and STDERR, so you won't
No, it doesn't (s/STDERR/STDIN/)
Yes, that was a typo. Sorry.
{5665:37} [0:0]$ perl -Mencoding=utf8 -wle 'print STDERR "фыва"'
Wide character in print at -e line 1.
фыва
get a warning here. Again, I was talking only about compile time
effects, not run time, so I didn't mention that (you can read the manual
yourself).
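The run-time side can be reproduced without "use encoding" at all. Here is a minimal sketch, assuming an in-memory filehandle as a stand-in for STDERR (the helper name warnings_for is made up for illustration): printing characters above 0xFF to a handle without an encoding layer should raise the warning, and adding a layer should silence it.

```perl
use strict;
use warnings;

# Hypothetical helper: print $text to an in-memory handle opened with
# the given mode and return whatever warnings were raised on the way.
sub warnings_for {
    my ($mode, $text) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my $buffer = '';
    open my $fh, $mode, \$buffer or die "open failed: $!";
    print {$fh} $text;
    close $fh;
    return @warnings;
}

my $text = "\x{444}\x{44b}\x{432}\x{430}";    # "фыва" as a character string

my @no_layer   = warnings_for('>', $text);
my @with_layer = warnings_for('>:encoding(UTF-8)', $text);

print "no layer:   ", scalar(@no_layer),   " warning(s)\n";
print "with layer: ", scalar(@with_layer), " warning(s)\n";
```

The same idea applies to the real STDERR: binmode(STDERR, ':encoding(UTF-8)') puts the layer in place explicitly.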
I fail to see any compile time effects -- either in those two above or
this one below
Well, you aren't looking for any compile time effects, so you won't see
any :-).
Let's compare 4 programs, which are all essentially the same:
#!/usr/bin/perl
use XXX ###
use warnings;
use strict;
my $greeting = "Καλημέρα κόσμε";
dumpstr($greeting);
sub dumpstr {
    my ($s) = @_;
    print utf8::is_utf8($s) ? "char" : "byte";
    print "[", length($s), "]";
    print ":";
    for (split //, $s) {
        printf " %#02x", ord($_);
    }
    print "\n";
}
__END__
The differences are in the encoding of the source file (UTF-8 vs.
ISO-8859-7) and the line marked "use XXX ###" above.
1) encoded in UTF-8, contains "use utf8;"
prints:
char[14]: 0x39a 0x3b1 0x3bb 0x3b7 0x3bc 0x3ad 0x3c1 0x3b1 0x20 0x3ba
0x3cc 0x3c3 0x3bc 0x3b5
2) encoded in UTF-8, no "use utf8;"
prints:
byte[27]: 0xce 0x9a 0xce 0xb1 0xce 0xbb 0xce 0xb7 0xce 0xbc 0xce 0xad
0xcf 0x81 0xce 0xb1 0x20 0xce 0xba 0xcf 0x8c 0xcf 0x83 0xce 0xbc 0xce
0xb5
3) encoded in ISO-8859-7, contains "use encoding 'ISO-8859-7';"
prints:
char[14]: 0x39a 0x3b1 0x3bb 0x3b7 0x3bc 0x3ad 0x3c1 0x3b1 0x20 0x3ba
0x3cc 0x3c3 0x3bc 0x3b5
4) encoded in ISO-8859-7, no "use encoding 'ISO-8859-7';"
prints:
byte[14]: 0xca 0xe1 0xeb 0xe7 0xec 0xdd 0xf1 0xe1 0x20 0xea 0xfc 0xf3
0xec 0xe5
As you can see, in the two cases where "use utf8" resp. "use encoding"
was used, the string constant was converted to a character string: The
so-called utf8 flag is on, the first character ("Κ") is U+039A ("GREEK
CAPITAL LETTER KAPPA"). In the other two cases the string is left as an
uninterpreted byte string: (0xCE 0x9A) is the UTF-8 encoding of a Kappa,
(0xCA) is the ISO-8859-7 encoding of a Kappa.
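To make the relationship between the two byte sequences and the one character concrete, here is a small sketch using the Encode module at run time (this is not what the pragmas do internally, just an illustration; the variable names are mine). Both byte strings should decode to the same one-character string:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $from_utf8  = decode('UTF-8',      "\xCE\x9A");   # Kappa as UTF-8 bytes
my $from_greek = decode('ISO-8859-7', "\xCA");       # Kappa as an ISO-8859-7 byte

# Report flag, length and code point in the same spirit as dumpstr().
for my $s ($from_utf8, $from_greek) {
    printf "%s[%d]: U+%04X\n",
        utf8::is_utf8($s) ? "char" : "byte", length($s), ord($s);
}
```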
You can verify that the compiler really converts the string constant
(and doesn't insert a call to encode which is evaluated at run-time)
with -MO=Concise.
But you can't do something like that:
#!/usr/bin/perl
use Greeting "Καλημέρα κόσμε";
use encoding "iso-8859-7";
use warnings;
use strict;
hello();
__END__
because now the use encoding comes too late: The compiler would have to
go back to the start to parse "Καλημέρα κόσμε" correctly.
You've messed everything up. Since compiler wasn't told about encoding
of C<use Greeting>'s argument, it's treated as latin1,
Wrong: It is treated as an unspecified superset of US-ASCII.
My understanding is based on this -- C<perldoc perlunicode>
"use encoding" needed to upgrade non-Latin-1 byte strings
By default, there is a fundamental asymmetry in Perl's Unicode
model: implicit upgrading from byte strings to Unicode strings
assumes that they were encoded in ISO 8859-1 (Latin-1), but
Unicode strings are downgraded with UTF-8 encoding.
This paragraph is confusing. I have a vague idea what the author wanted
to say but even then it's not quite correct. I doubt somebody can
understand this paragraph unless they already exactly understood the
problems before.
This happens because the first 256 codepoints in Unicode happens
to agree with Latin-1.
If encoding is unknown, it's treated as latin1, even if it's not.
This has nothing to do with "use utf8" and "use encoding". The
"implicit upgrading" which is mentioned here happens (for example) when
you concatenate a byte string to a character string. But then the result
*is* a character string, not a byte string.
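A minimal sketch of that implicit upgrade (variable names are mine): a byte string is concatenated with a character string, and the byte 0xFC ends up as code point U+00FC in a character string.

```perl
use strict;
use warnings;

my $bytes = "\xFC";       # byte string, utf8 flag off
my $chars = "\x{39A}";    # character string, utf8 flag on

# Concatenation upgrades the byte string as if it were ISO-8859-1.
my $joined = $bytes . $chars;

printf "bytes:  flag=%d\n", utf8::is_utf8($bytes) ? 1 : 0;
printf "joined: flag=%d length=%d first=U+%04X\n",
    utf8::is_utf8($joined) ? 1 : 0, length($joined), ord($joined);
```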
Byte strings are *not* implicitly assumed to be ISO-8859-1, as you can
easily check by matching against a character class:
% perl -le '$_ = "\x{FC}"; print /\w/ ? "yes" : "no"'
no
% perl -le '$_ = "\x{FC}"; utf8::upgrade($_); print /\w/ ? "yes" : "no"'
yes
So, in a byte string the code point 0xFC does not count as a word
character, but in a character string it does. If byte strings were
assumed to be ISO-8859-1, then 0xFC would be a word character, so
obviously it isn't. Instead, byte strings are assumed to be some
superset of US-ASCII:
% perl -le '$_ = "\x{6C}"; print /\w/ ? "yes" : "no"'
yes
0x6C is a letter ("l") in ASCII, but 0xFC isn't (ASCII defines only
0x00-0x7F).
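The odd part is that the two strings compare equal and still behave differently. A small sketch, assuming no locale or other pragma changes the regex rules (variable names are mine):

```perl
use strict;
use warnings;

my $native   = "\xFC";       # byte string, utf8 flag off
my $upgraded = $native;
utf8::upgrade($upgraded);    # same code point, utf8 flag on

print "equal:    ", ($native eq $upgraded ? "yes" : "no"), "\n";
print "native:   ", ($native   =~ /\w/ ? "yes" : "no"), "\n";
print "upgraded: ", ($upgraded =~ /\w/ ? "yes" : "no"), "\n";
```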
(I hear that somebody's working to change this to reduce the differences
in behaviour between byte and character strings)
In case there would be C<use utf8> or C<use encoding 'utf8'>,
then the compiler would complain about a malformed UTF-8 character if
the source file was actually in ISO-8859-7.
But it didn't.
It does for me. If I change "use encoding 'ISO-8859-7'" to "use utf8"
in my ISO-8859-7 encoded file, I get a lot of warnings.
You want to say C<"\x{C0}\x{B0}"> is a well-formed UTF-8?
Sort of: It decodes cleanly to U+0030. But the canonical (shortest)
encoding of U+0030 is "\x{30}", and UTF-8 generating programs MUST
always produce the canonical encoding. UTF-8 consuming programs should
complain if they encounter a non-canonical encoding. Perl behaves a bit
weirdly here: It doesn't complain when it reads the string, but it does
complain on some operations on it, e.g. ord(). I consider that a bug.
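For the curious, a sketch of both halves of that statement (variable names are mine): the naive two-byte arithmetic does yield U+0030, while a strict decoder, here Encode's 'UTF-8' with FB_CROAK, should refuse the sequence.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $overlong = "\xC0\xB0";
my ($lead, $cont) = unpack 'C2', $overlong;

# Naive two-byte decode: 110xxxxx 10yyyyyy -> xxxxxyyyyyy
my $code_point = (($lead & 0x1F) << 6) | ($cont & 0x3F);
printf "naive decode: U+%04X\n", $code_point;

# A strict decoder is expected to die on the non-canonical form.
my $accepted = eval {
    decode('UTF-8', $overlong, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
print "strict decode: ", ($accepted ? "accepted" : "rejected"), "\n";
```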
You missed one important thing -- I dislike this feature,
which feature?
Have you ever seen a program text where tokens are mix of ASCII and
non-ASCII characters? I've seen.
I usually stick to using English names for my subs and variables. But if
I was using German names I might as well use umlauts. Mathematical
symbols might also be handy. I would have a problem if my colleague used
Chinese, though ;-).
(I already wanted to use € in a variable name (it contained a monetary
amount in Euro), but € isn't a word character. OTOH, $ isn't either, so
I guess that's fair)
That's what C<use utf8> is fscking for.
What is it for?
Quoting C<perldoc utf8>
Do not use this pragma for anything else than telling Perl that your
script is written in UTF-8. The utility functions described below
are directly usable without "use utf8;".
I believe I already said that once or twice in this thread.
My understanding of "script" is a program text outside of any quotes in
it.
Bull***. A script is the complete program text, including any string
constants, numeric constants, comments, the __DATA__ stream, if any.
Why would a string constant in a script not be part of it?
But,.. here be dragons...
{3335:27} [0:0]$ echo 'фыва' | xxd
0000000: d184 d18b d0b2 d0b0 0a .........
{3356:28} [0:0]$ echo 'фыва' | recode utf8..ucs-2-internal |xxd
0000000: 4404 4b04 3204 3004 0a00 D.K.2.0...
{3414:29} [0:1]$ perl -wle 'print "\x{4404}\x{4b04}\x{3204}\x{3004}"'
You've mixed up the endianness. 'ф' is U+0444, not U+4404.
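A quick way to see the byte order, sketched with Encode and unpack (variable names are mine): the bytes 44 04 in the xxd output above are the little-endian form of U+0444, and reading them big-endian gives the unrelated U+4404.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $ef = "\x{444}";    # CYRILLIC SMALL LETTER EF

my $le = encode('UTF-16LE', $ef);
my $be = encode('UTF-16BE', $ef);

printf "LE bytes: %s\n", join ' ', map { sprintf '%02x', $_ } unpack 'C*', $le;
printf "BE bytes: %s\n", join ' ', map { sprintf '%02x', $_ } unpack 'C*', $be;

# The same two bytes interpreted with each byte order.
printf "bytes 44 04 read as LE: U+%04X\n", unpack 'v', "\x44\x04";
printf "bytes 44 04 read as BE: U+%04X\n", unpack 'n', "\x44\x04";
```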
Yes, my fault. And why did you skip the next line? It behaves the same
way with endianness fixed.
You mean:
{3415:30} [0:0]$ perl -Mencoding=ucs2 -wle 'print "\x{4404}\x{4b04}\x{3204}\x{3004}"'
Can't locate object method "cat_decode" via package "Encode::Unicode" at
-e line 1.
That doesn't fix the endianness, and it behaves completely differently.
"perl -Mencoding=ucs2" can't work, as I already explained to sln.
hp