Wide character notation, was Re: How to NOT use utf8.
From: Alan J. Flavell (flavell_at_ph.gla.ac.uk)
Date: 02/26/05
- In reply to: pkaluski: "Re: How to NOT use utf8."
- Next in thread: Brian McCauley: "Re: Wide character notation, was Re: How to NOT use utf8."
- Reply: Brian McCauley: "Re: Wide character notation, was Re: How to NOT use utf8."
Date: Sat, 26 Feb 2005 14:02:56 +0000
On Fri, 25 Feb 2005, pkaluski wrote:
> (Carp/Heavy.pm)
> 59 # The following handling of "control chars" is direct from
> 60 # the original code - I think it is broken on Unicode though.
> 61 # Suggestions?
> 62 $arg =~ s/([[:cntrl:]]|[[:^ascii:]])/sprintf("\\x{%x}",ord($1))/eg;
>
> So the author suggests that there may be a problems for unicode,
The following is basically addressed to anyone here who has a working
familiarity with the Unicode support in Perl and may be able to
comment on what appears to be a bit of confusion (either in my head or
in the Perl documentation).
If I refer to the documentation that comes with ActivePerl 5.8.6 (yes,
I installed an updated version to see if anything significant had
changed), then under "perlunicode", after it says:

| Unicode characters can also be added to a string by using the \x{...}
| notation. The Unicode code for the desired character, in hexadecimal,
| should be placed in the braces. For instance, a smiley face is
| \x{263A}.

we still see this statement, as in earlier versions:

| This encoding scheme only works for characters with a code
| of 0x100 or above.

That last sentence seems to imply that we cannot write notations
such as \x{9} or \x{41} - nor even \x{0009} etc. - or at least that
if we try, the results may not be what we expected.
However, if I take a look at perldata, then I find (in a somewhat
tangential context) this:

| A literal of the form v1.20.300.4000 is parsed as a string composed
| of characters with the specified ordinals. This form, known as
| v-strings, provides an alternative, more readable way to construct
| strings, rather than use the somewhat less readable interpolation
| form "\x{1}\x{14}\x{12c}\x{fa0}". This is useful for representing
| Unicode strings [...]

which seems to me to directly contradict what it says in perlunicode.
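
A minimal sketch of how the perldata claim could be checked, assuming a Perl that still accepts v-string literals. The helper name ordinals() is made up for illustration. Note that three of the four code points in the quoted example are below 0x100.

```perl
use strict;
use warnings;

# The interpolation form quoted from perldata; three of its four
# code points are below 0x100.
my $escaped = "\x{1}\x{14}\x{12c}\x{fa0}";

# The v-string literal that perldata describes as equivalent.
my $vstring = v1.20.300.4000;

# Hypothetical helper: list each character's ordinal in hex.
sub ordinals {
    my ($string) = @_;
    return join ' ', map { sprintf('%x', ord($_)) } split //, $string;
}

print ordinals($escaped), "\n";
print $escaped eq $vstring ? "same string\n" : "different strings\n";
```

If the low code points were mishandled by the \x{...} notation, the two strings would not compare equal.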
Looking now at "perluniintro", it says:
| perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'
|
| produces a fairly useless mixture of native bytes and UTF-8, as well
| as a warning:
|
| Wide character in print at ...
Let's try to understand what this is getting at...?
If we code "\x{0100}\x{DF}\n" (that's "A macron, sharp s"), then Perl
*knows* that we need Unicode, and will store a Unicode string
(which, as we know, is stored internally in its utf8 format).

If we code \x{DF} alone, then it looks to me as if Perl thinks that
iso-8859-1 will suffice, so it stores the character (which is still
"sharp s") in iso-8859-1 format as a single byte. Yes?

However, surely if anything is done with this character (string) which
calls for Unicode format, Perl will upgrade it to Unicode format,
won't it?
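
One way to probe this, as a sketch only: utf8::is_utf8 reports the internal storage flag and is available without "use utf8". The helper name storage() is made up here, and the internal representation is an implementation detail that ordinary code should not depend on.

```perl
use strict;
use warnings;

# Hypothetical helper: report how Perl is storing a string internally.
sub storage { utf8::is_utf8($_[0]) ? 'utf8' : 'bytes' }

my $narrow = "\x{DF}";           # sharp s alone
my $wide   = "\x{0100}\x{DF}";   # A macron, sharp s

print 'narrow: ', storage($narrow), "\n";
print 'wide:   ', storage($wide),   "\n";

# Joining the narrow string to a wide character upgrades the result,
# and the sharp s keeps its ordinal.
my $joined = $narrow . "\x{0100}";
print 'joined: ', storage($joined), "\n";
printf "first char of joined: %x\n", ord($joined);

# The same character compares equal whichever way it is stored.
print substr($wide, 1, 1) eq $narrow ? "equal\n" : "not equal\n";
```

On the Perls I would expect this to be run on, the narrow string reports "bytes", the other two report "utf8", and the comparison prints "equal", which fits the upgrade-on-demand picture described above.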
So I don't really understand that:
"produces a fairly useless mixture of native bytes and UTF-8"
which is quoted above. What's *wrong* (as I see it) with what's
quoted above is that there is an attempt to output a "wide" Unicode
character (A macron, \x{100}) without the proper arrangements having
been made.
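
To see what "mixture" could mean in terms of actual octets, here is a sketch that prints the perluniintro example to an in-memory filehandle with no encoding layer and dumps the result. The in-memory handle and the helper name print_without_layer() are my own choices for illustration; the documentation's example prints to STDOUT.

```perl
use strict;
use warnings;

# Hypothetical helper: print the perluniintro example to a handle with
# no encoding layer, collecting warnings instead of displaying them.
sub print_without_layer {
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    my $buffer = '';
    open my $fh, '>', \$buffer or die "cannot open in-memory handle: $!";
    print {$fh} "\x{DF}\n", "\x{0100}\x{DF}\n";
    close $fh;
    return ($buffer, @warnings);
}

my ($bytes, @warnings) = print_without_layer();

# Dump what was written, octet by octet, in hex.
print join(' ', map { sprintf('%02x', ord($_)) } split //, $bytes), "\n";
print @warnings ? "warned: $warnings[0]" : "no warning\n";
```

The first argument goes out as the single octet df, while the second goes out as UTF-8 (c4 80 c3 9f), so the same sharp s appears in two different encodings in one output stream.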
But if I have executed

    binmode STDOUT, ":utf8";

and then execute e.g.

    print "\x{9}\x{41}\x{a3}\x{df}\n", "\x{9}\x{41}\x{a3}\x{df}\x{100}\n";

then it will print two lines of properly-encoded utf-8 representing
tab, A, pound sterling, sharp s, newline, followed by
tab, A, pound sterling, sharp s, A macron, newline
just as I had intended. Is there some reason why this *shouldn't*
work, or is the statement:

| This encoding scheme only works for characters with a code
| of 0x100 or above.

misleading, confusing, or what?
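
The same print can be checked octet by octet with a small sketch. It assumes an in-memory filehandle opened with the :utf8 layer in place of STDOUT, so that the output can be inspected; the helper name print_with_utf8_layer() is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: print both lines through a :utf8 layer into a
# buffer and return the octets that were written.
sub print_with_utf8_layer {
    my $buffer = '';
    open my $fh, '>:utf8', \$buffer or die "cannot open in-memory handle: $!";
    print {$fh} "\x{9}\x{41}\x{a3}\x{df}\n", "\x{9}\x{41}\x{a3}\x{df}\x{100}\n";
    close $fh;
    return $buffer;
}

my $octets = print_with_utf8_layer();

# Dump the output in hex: every character above 0x7f should appear
# as a two-octet UTF-8 sequence on both lines.
print join(' ', map { sprintf('%02x', ord($_)) } split //, $octets), "\n";
```

Both lines come out with pound sterling as c2 a3 and sharp s as c3 9f, whether or not the line also contains a character above 0xff.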
Incidentally, the above samples will need to be run under Windows
using the -C option (or equivalent), since Windows needs to be told
that it's to expect utf8 as output.
OK, now let's come back to this piece in the Carp/Heavy.pm source:
> (Carp/Heavy.pm)
> 59 # The following handling of "control chars" is direct from
> 60 # the original code - I think it is broken on Unicode though.
> 61 # Suggestions?
> 62 $arg =~ s/([[:cntrl:]]|[[:^ascii:]])/sprintf("\\x{%x}",ord($1))/eg;
Evidently the author is intending to *display* characters which are
either control characters, or non-ASCII, in a \x{...} notation, by
analogy with Perl's "wide character" notation for source code.
As far as I can see (and test), *this code works* when supplied with
character strings as data. For the character strings I mentioned
above, it's displaying

    \x{9}A\x{a3}\x{df}\x{100}

just as was, I think, intended.
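
For anyone wanting to repeat the test, a sketch that wraps the quoted substitution in a sub. The name escape_for_display() is made up; only the s/// line comes from Carp/Heavy.pm.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution quoted from Carp/Heavy.pm:
# control and non-ASCII characters are shown in \x{...} notation.
sub escape_for_display {
    my ($arg) = @_;
    $arg =~ s/([[:cntrl:]]|[[:^ascii:]])/sprintf("\\x{%x}",ord($1))/eg;
    return $arg;
}

# A character string containing a wide character.
print escape_for_display("\x{9}\x{41}\x{a3}\x{df}\x{100}"), "\n";

# The same A macron supplied as two undecoded UTF-8 octets.
print escape_for_display("\xc4\x80"), "\n";
```

The second call shows one case where the display might surprise: if the argument holds UTF-8 octets that were never decoded into characters, each octet is escaped separately (\x{c4}\x{80}) and not shown as the single character \x{100}.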
Now, as we see, there seems to be some uncertainty in the Perl
documentation as to whether this notation is properly usable *in Perl
source code*. But what's happening here is no more than a diagnostic
technique, so I'm not too sure what it is that the author is worrying
about.
Where I /can/ now confirm that something nasty is going on, however,
is with Perl's -d option. I sometimes get invalid utf8 sequences
reported, and sometimes Perl crashes. But I think that's a topic for
a different sub-thread. Whatever is going wrong, I don't think it's
this commented statement, as such.
Advice, please?