Re: strange effect with [:lower:] in perl

From: Ben Morrow (usenet_at_morrow.me.uk)
Date: 10/31/03


Date: Thu, 30 Oct 2003 23:47:47 +0000 (UTC)


"Alan J. Flavell" <flavell@ph.gla.ac.uk> wrote:
> On Wed, 29 Oct 2003, Ben Morrow wrote:
>
> This is odd. If I execute this code which we discussed before:
<snip>
> Could I summarise that by saying (applies to both versions):
>
> * if the locale does not include utf-8, then "use locale" switches on
> the reporting of lower-case accented letters.
>
> This is what you already explained as being a compatibility feature
> in the absence of "use locale", right?

Yup.

> * but if the locale _does_ imply utf-8, then it seems something
> different happens. In this test, "use locale" doesn't report accented
> lower-case letters, in either Perl version.
>
> As we saw in the earlier discussion: if the string has been forcibly
> upgraded to Perl's unicode format, then those accented letters were
> reported, irrespective of "use locale", which is fine by me.
>
> But it seems that if the string has not been upgraded to unicode
> format, then even with "use locale" in effect, the accented letters
> are not reported - this bit seems, at least, unintuitive (even a
> mistake?).
>
> Are my observations correct? Any insights?

Well, what you say certainly holds on my machine as well... I think
the answer to this is in perlunicode:

| BUGS
| Interaction with Locales
|
| Use of locales with Unicode data may lead to odd results.
| [...] Use of locales with Unicode is discouraged.

and yes, it probably is a bug. Certainly, a UTF8 locale is treated
qualitatively differently from any other.

What seems to be happening is that in 5.6 'use locale' with a UTF8
locale is treated identically to 'use utf8', and in 5.8 it is ignored
(at least as far as character sets/encodings are concerned); perl then
treats all non-upgraded data as though locale support wasn't present,
and assumes it's encoded in iso8859-1 when it needs to be upgraded.
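
A minimal sketch of the 5.8-style behaviour just described, assuming
neither 'use locale' nor the later 'unicode_strings' feature is in
effect. The variable names are illustrative, and byte 0xE9 is only an
example of a top-bit-set character:

```perl
use strict;
use warnings;

# A single top-bit-set byte; it is e-acute only if read as ISO-8859-1.
my $native = "\xe9";

# Copy it and force the copy into Perl's internal UTF-8 representation.
my $upgraded = $native;
utf8::upgrade($upgraded);

# The upgrade maps byte 0xE9 to code point U+00E9: ISO-8859-1 is assumed.
my $codepoint = ord $upgraded;

# The same POSIX class gives different answers for the two copies,
# even though the strings still compare equal with 'eq'.
my $native_is_lower   = $native   =~ /[[:lower:]]/ ? 1 : 0;
my $upgraded_is_lower = $upgraded =~ /[[:lower:]]/ ? 1 : 0;

printf "U+%04X native=%d upgraded=%d\n",
    $codepoint, $native_is_lower, $upgraded_is_lower;
```

The only difference between the two strings is the internal
representation, which is why the result looks unintuitive.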

This is arguably incorrect :), but I guess it's a reasonable
compromise. It would be nice to have an 'all data has the utf8 flag on,
all the time, except under 'use bytes'' pragma; or is this what the
new -C flag (or having a UTF8 locale in 5.8.0) does, in effect?

The Right Answer, I guess, is this:

Under 'no locale':

  * Upgraded data is in utf8. [[:lower:]] et al match exactly the same
    as \p{Ll}: i.e., by the definitions given in the Unicode database.

  * All non-upgraded data is considered to be ASCII[2]. Strings
    containing top-bit-set bytes are binary, and cannot be
    upgraded... or maybe all the top-bit-set chars are upgraded to
    their corresponding Unicode codepoints, with or without a
    warning.

    I don't like the current 'let's just randomly assume iso8859-1'
    approach. I would like to say that top-bit-set chars should all be
    upgraded to U+FFFD, but I feel this might cause problems... :)

  * Since non-upgraded data is ASCII, [[:lower:]] == [a-z] [3]. Matching
    against \p{Ll} causes the data to be upgraded (if you're using
    Unicode-y operators, you can't object to Perl upgrading), and
    matched against the Unicode database.
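
For comparison, here is a small sketch of how a reasonably modern perl
treats non-upgraded data under 'no locale', which is close to the third
point: \p{Ll} always uses Unicode rules, while [[:lower:]] on a
non-upgraded string only sees ASCII. It assumes the 'unicode_strings'
feature is off; the sample strings and hash layout are illustrative only:

```perl
use strict;
use warnings;

# Non-upgraded strings: one plain ASCII letter, one top-bit-set byte.
my %sample = (plain => "e", high => "\xe9");

my %result;
for my $name (sort keys %sample) {
    my $s = $sample{$name};
    $result{$name} = {
        posix   => ($s =~ /[[:lower:]]/ ? 1 : 0),  # native 8-bit rules
        unicode => ($s =~ /\p{Ll}/      ? 1 : 0),  # Unicode rules, always
        range   => ($s =~ /[a-z]/       ? 1 : 0),  # explicit ASCII range
    };
    printf "%-5s posix=%d unicode=%d range=%d\n",
        $name, @{ $result{$name} }{qw(posix unicode range)};
}
```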

Under 'use locale':

  * Upgraded data is utf8. Non-upgraded data (when treated as text) is
    considered to be encoded as the charset[1] portion of the locale,
    and is upgraded to utf8 on that basis when necessary.

  * [[:lower:]] != \p{Ll}. [[:lower:]] matches (character set implied
    by locale) intersect (\p{Ll}), on both non- and upgraded data.

  * Opened filehandles have an appropriate :encoding() layer
    automatically pushed.
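
The filehandle point can be approximated by hand today. This is only a
sketch: locale_layer() is a hypothetical helper, not part of any
module, and it assumes a platform where I18N::Langinfo is available.
The 'open' pragma's ':locale' option does something similar.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);
use Encode qw(find_encoding);

# Hypothetical helper: turn the locale's codeset into a PerlIO layer string.
sub locale_layer {
    setlocale(LC_CTYPE, "");             # adopt the environment's locale
    my $codeset = langinfo(CODESET);     # e.g. "UTF-8" or "ISO-8859-1"
    my $enc = (defined $codeset && length $codeset)
        ? find_encoding($codeset)
        : undef;
    # Fall back to raw bytes if Encode does not recognise the codeset.
    return $enc ? ':encoding(' . $enc->name . ')' : ':raw';
}

my $layer = locale_layer();
print "would push: $layer\n";

# Apply the layer explicitly to an in-memory filehandle; the proposal
# would have 'use locale' do this implicitly for every open.
my $bytes = "abc\n";
open my $fh, "<$layer", \$bytes or die "open failed: $!";
my $line = <$fh>;
close $fh;
```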

Under 'use bytes' (which overrides 'use locale'):

  * All data is considered to be binary, and the use of any text-y
    regex components such as [[:lower:]] or \p is an error. [a-z] is
    interpreted as [\x61-\x7a] (or the equivalent EBCDIC).

  * Opened filehandles have :raw automatically pushed.
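
For contrast, a short sketch of what 'use bytes' does as it stands: it
exposes the bytes of the internal representation, and it does not make
text-y classes an error. Variable names are illustrative only:

```perl
use strict;
use warnings;

my $s = "\xe9";
utf8::upgrade($s);            # one character, two bytes internally

my $chars = length $s;        # character count

my $octets;
{
    use bytes;                # lexically scoped to this block
    $octets = length $s;      # byte count of the internal encoding
}

# Under the existing pragma a POSIX class still compiles and matches.
my $class_ok = do { use bytes; "a" =~ /[[:lower:]]/ ? 1 : 0 };

print "chars=$chars octets=$octets class_ok=$class_ok\n";
```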

locale should have two functions, locale::to_local and
locale::from_local, which work identically to Encode::(en|de)code with
the appropriate encoding supplied.
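
A rough sketch of what such a pair might look like, written as plain
subs rather than inside the locale pragma. Everything here is
hypothetical: the names, the optional explicit charset argument, and
the assumption that to_local encodes while from_local decodes.

```perl
use strict;
use warnings;
use Encode ();
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: the encoding named by the current locale.
sub local_charset {
    setlocale(LC_CTYPE, "");
    return langinfo(CODESET);
}

# Perl characters -> bytes in the local encoding (cf. Encode::encode).
sub to_local {
    my ($string, $charset) = @_;
    $charset = local_charset() unless defined $charset;
    return Encode::encode($charset, $string);
}

# Bytes in the local encoding -> Perl characters (cf. Encode::decode).
sub from_local {
    my ($bytes, $charset) = @_;
    $charset = local_charset() unless defined $charset;
    return Encode::decode($charset, $bytes);
}

# The charset is passed explicitly here so the result does not depend
# on the environment the script happens to run in.
my $latin1_bytes = to_local("\x{e9}", "iso-8859-1");
my $from_utf8    = from_local("\xc3\xa9", "UTF-8");
printf "%d byte(s), U+%04X\n", length $latin1_bytes, ord $from_utf8;
```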

Hmm, wonder what p5p's opinion on all that would be? "Go away, it's
working now, the right time to have said this was some time ago" would
certainly be fair enough... :)

Ben

[1] ...in the MIME sense, i.e. an encoding. I am aware of the
    difference, it's just tiresome to be Correct all the time :).

[2] or EBCDIC, as appropriate, throughout.

[3] or rather, [abcd...xyz], to account for EBCDIC.

-- 
"The Earth is degenerating these days. Bribery and corruption abound.
Children no longer mind their parents, every man wants to write a book,
and it is evident that the end of the world is fast approaching."
     -Assyrian stone tablet, c.2800 BC                         ben@morrow.me.uk

