Wide character notation, was Re: How to NOT use utf8.

From: Alan J. Flavell (flavell_at_ph.gla.ac.uk)
Date: 02/26/05


Date: Sat, 26 Feb 2005 14:02:56 +0000

On Fri, 25 Feb 2005, pkaluski wrote:

> (Carp/Heavy.pm)
> # The following handling of "control chars" is direct from
> # the original code - I think it is broken on Unicode though.
> # Suggestions?
> $arg =~ s/([[:cntrl:]]|[[:^ascii:]])/sprintf("\\x{%x}",ord($1))/eg;
>
> So the author suggests that there may be problems for Unicode,

The following is addressed to anyone here who has a working
familiarity with the Unicode support in Perl and may be able to
comment on what appears to be a bit of confusion (either in my head or
in the Perl documentation).

If I refer to the documentation that comes with ActivePerl 5.8.6 (yes,
I installed an updated version to see if anything significant had
changed), then under "perlunicode", after it says:

 Unicode characters can also be added to a string by using the \x{...}
 notation. The Unicode code for the desired character, in hexadecimal,
 should be placed in the braces. For instance, a smiley face is
 \x{263A}.

- we still see this statement, as in earlier versions:

 This encoding scheme only works for characters with a code
 of 0x100 or above.

That last sentence seems to imply that we cannot write notations
such as \x{9} or \x{41} - nor even \x{0009} etc. - or at least that
if we try, the results may not be what we expected.

However, if I take a look at perldata, then I find (in a somewhat
tangential context) this:

 A literal of the form v1.20.300.4000 is parsed as a string composed
 of characters with the specified ordinals. This form, known as
 v-strings, provides an alternative, more readable way to construct
 strings, rather than use the somewhat less readable interpolation
 form "\x{1}\x{14}\x{12c}\x{fa0}". This is useful for representing
 Unicode strings [...]

which seems to me to directly contradict what it says in perlunicode.
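A quick way to check the two documents against each other is to build the same strings both ways and compare them. This is only a sketch: the variable names are mine, and it shows what a given Perl does rather than what the documentation intends.

```perl
use strict;
use warnings;

# Code points below 0x100 written in \x{...} notation.
my $low = "\x{9}\x{41}";

# The same two characters written the conventional way.
my $plain = "\t" . "A";

# perldata's own example: interpolation form versus v-string.
my $interp  = "\x{1}\x{14}\x{12c}\x{fa0}";
my $vstring = v1.20.300.4000;

printf "low eq plain:      %s\n", ($low eq $plain      ? "yes" : "no");
printf "interp eq vstring: %s\n", ($interp eq $vstring ? "yes" : "no");

# List the ordinals of the interpolated string in hex.
my @ordinals = map { sprintf("%x", ord($_)) } split //, $interp;
print "ordinals: @ordinals\n";
```

If both comparisons say "yes", then \x{...} is at least accepted for code points below 0x100, which is what perldata implies.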

Looking now at "perluniintro", it says:

| perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'
|
| produces a fairly useless mixture of native bytes and UTF-8, as well
| as a warning:
|
| Wide character in print at ...

Let's try to understand what this is getting at.

If we code "\x{0100}\x{DF}\n" (that's "A macron, sharp s"), then Perl
*knows* that we need Unicode, and will store a Unicode string
(which, as we know, is stored internally in its utf8 format).

If we code \x{DF} alone, then it looks to me as if Perl thinks that
iso-8859-1 will suffice, so it stores the character (which is still
"sharp s") in iso-8859-1 format as a single byte. Yes?

However, surely if anything is done with this character (or string)
which calls for Unicode format, Perl will upgrade it to Unicode
format, won't it?
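The storage question can be probed with utf8::is_utf8, which reports the internal flag on a scalar. The sketch below assumes a Perl of 5.8.1 or later (where that function exists). The flag is an implementation detail, so treat the output as an observation rather than a guarantee.

```perl
use strict;
use warnings;

my $narrow = "\x{DF}";          # sharp s alone
my $wide   = "\x{0100}\x{DF}";  # A macron, sharp s

# utf8::is_utf8 reports whether the scalar is held in Perl's
# internal utf8 form; it says nothing about the characters themselves.
printf "narrow flagged: %d\n", (utf8::is_utf8($narrow) ? 1 : 0);
printf "wide flagged:   %d\n", (utf8::is_utf8($wide)   ? 1 : 0);

# Joining the two forces a common representation.
my $joined = $narrow . $wide;
printf "joined flagged: %d\n", (utf8::is_utf8($joined) ? 1 : 0);
printf "joined length:  %d\n", length($joined);

# The sharp s compares equal whichever way it is stored.
printf "same sharp s:   %s\n",
    (substr($wide, 1, 1) eq $narrow ? "yes" : "no");
```

The point of the last comparison is that the two representations are meant to be indistinguishable at the character level, whatever the flag says.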

So I don't really understand that:

  "produces a fairly useless mixture of native bytes and UTF-8"

which is quoted above. What's *wrong* (as I see it) with what's
quoted above is that there is an attempt to output a "wide" Unicode
character (A macron, \x{100}) without the proper arrangements having
been made.
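One way to see what perluniintro may mean by a "mixture" is to capture the bytes that print actually emits on a handle with no encoding layer. The helper below is hypothetical (emitted_bytes is my own name) and prints to an in-memory handle, so nothing depends on the terminal. The "Wide character" warning is silenced only to keep the demonstration quiet.

```perl
use strict;
use warnings;

# Return, as hex, the bytes that print() emits for the given strings.
# $layer is '' for a plain handle or ':utf8' for an encoding layer.
sub emitted_bytes {
    my ($layer, @strings) = @_;
    my $buf = '';
    open my $fh, ">$layer", \$buf
        or die "cannot open in-memory handle: $!";
    {
        no warnings 'utf8';    # suppress "Wide character in print"
        print {$fh} @strings;
    }
    close $fh;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $buf;
}

# The perluniintro example, on a handle with no layer.
print "no layer: ", emitted_bytes('', "\x{DF}\n", "\x{0100}\x{DF}\n"), "\n";

# The same strings with a :utf8 layer in place.
print ":utf8:    ", emitted_bytes(':utf8', "\x{DF}\n", "\x{0100}\x{DF}\n"), "\n";
```

Without a layer, the first sharp s should come out as the single byte df and the second as the pair c3 9f, which would be the "mixture" in question. With the layer, both come out as c3 9f.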

But if I have executed

  binmode STDOUT, ":utf8";

and then execute e.g.

  print "\x{9}\x{41}\x{a3}\x{df}\n", "\x{9}\x{41}\x{a3}\x{df}\x{100}\n";

then it will print two lines of properly encoded utf-8 representing
tab, A, pound sterling, sharp s, newline, followed by
tab, A, pound sterling, sharp s, A macron, newline

just as I had intended. Is there some reason why this *shouldn't*
work, or is the statement:

 This encoding scheme only works for characters with a code
 of 0x100 or above.

misleading, confusing, or what?
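For anyone who wants to check this without relying on what a terminal displays, here is a sketch that writes the same two lines through a :utf8 layer to an in-memory handle and then decodes the result strictly. The use of Encode for the check is my own choice; decode croaks if the bytes are not valid UTF-8.

```perl
use strict;
use warnings;
use Encode ();

my @lines = (
    "\x{9}\x{41}\x{a3}\x{df}\n",
    "\x{9}\x{41}\x{a3}\x{df}\x{100}\n",
);

# Write through a :utf8 layer into a scalar instead of STDOUT.
my $out = '';
open my $fh, '>', \$out or die "cannot open in-memory handle: $!";
binmode $fh, ':utf8';
print {$fh} @lines;
close $fh;

# Strict decode: dies on malformed input, leaves $out untouched.
my $decoded = Encode::decode(
    'UTF-8', $out, Encode::FB_CROAK() | Encode::LEAVE_SRC()
);

print "bytes written: ", length($out), "\n";
print "round trip ok\n" if $decoded eq join('', @lines);
```

If the round trip succeeds, both lines were encoded consistently, including the characters below 0x100.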

Incidentally, the above samples will need to be run under Windows
using the -C option (or equivalent), since Windows needs to be told
that it's to expect utf8 as output.

OK, now let's come back to this piece in the Carp/Heavy.pm source:

> (Carp/Heavy.pm)
> # The following handling of "control chars" is direct from
> # the original code - I think it is broken on Unicode though.
> # Suggestions?
> $arg =~ s/([[:cntrl:]]|[[:^ascii:]])/sprintf("\\x{%x}",ord($1))/eg;

Evidently the author is intending to *display* characters which are
either control characters, or non-ASCII, in a \x{...} notation, by
analogy with Perl's "wide character" notation for source code.

As far as I can see (and test), *this code works* when supplied with
character strings as data. For the character strings I mentioned
above, it's displaying

  \x{9}A\x{a3}\x{df}\x{100}

just as was, I think, intended.
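The test can be reproduced with the substitution wrapped in a small sub. The wrapper name escape_arg is mine, not Carp's; the regex is copied from the quoted source.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution quoted from Carp/Heavy.pm.
sub escape_arg {
    my ($arg) = @_;
    $arg =~ s/([[:cntrl:]]|[[:^ascii:]])/sprintf("\\x{%x}",ord($1))/eg;
    return $arg;
}

# One string that needs no wide characters, and one that does.
my $narrow_shown = escape_arg("\x{9}\x{41}\x{a3}\x{df}");
my $wide_shown   = escape_arg("\x{9}\x{41}\x{a3}\x{df}\x{100}");

print "$narrow_shown\n";
print "$wide_shown\n";
```

The output is pure ASCII, so it can be printed to any handle without a "Wide character" warning.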

Now, as we see, there seems to be some uncertainty in the Perl
documentation as to whether this notation is properly usable *in Perl
source code*. But what's happening here is no more than a diagnostic
technique, so I'm not too sure what it is that the author is worrying
about.

Where I /can/ now confirm that something nasty is going on, however,
is with Perl's -d option. I sometimes get invalid utf8 sequences
reported, and sometimes Perl crashes. But I think that's a topic for
a different sub-thread. Whatever is going wrong, I don't think it's
this commented statement, as such.

Advice, please?


