Re: regular expressions and the LOCALE flag



Baz Walter wrote:
On 03/08/10 19:40, MRAB wrote:
Baz Walter wrote:
the python docs say that re.LOCALE makes certain character classes
"dependent on the current locale".

re.LOCALE just passes the character to the underlying C library. It
really only works on bytestrings which have 1 byte per character.

the re docs don't specify 8-bit encodings: they just refer to the 'current locale'.

And, BTW, none of your examples pass a UTF-8 bytestring to re.findall:
all those string literals starting with the 'u' prefix are Unicode
strings!

not sure what you mean by this: if the string was encoded as utf8, '\w' still wouldn't match any of the non-ascii characters.

Strings with the 'u' prefix are Unicode strings, not bytestrings. They
don't have an encoding. A UTF-8 string is a bytestring in which the
bytes represent Unicode codepoints encoded as UTF-8.
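
A minimal sketch of the difference, assuming Python 3 syntax (where the 'u' prefix is still accepted and bytestring literals take a 'b' prefix). The sample word is made up for illustration; it is not one of the examples from earlier in the thread.

```python
import re

# A Unicode string: a sequence of code points, with no encoding attached.
text = u'caf\xe9'                 # 4 code points; the last one is U+00E9

# A UTF-8 bytestring: the same code points encoded as bytes.
data = text.encode('utf-8')       # b'caf\xc3\xa9', 5 bytes

# Without any flag, a bytes pattern treats \w as ASCII-only, so the two
# bytes that encode U+00E9 are not matched.
ascii_match = re.findall(br'\w+', data)      # [b'caf']

# Decoding first gives a Unicode string, which \w handles as a whole.
unicode_match = re.findall(r'\w+', data.decode('utf-8'))   # ['café']
```

Decoding to a Unicode string before matching sidesteps the question of what the bytes mean.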

Locale encodings are more trouble than they're worth. Unicode is better.
:-)

yes, i'm really just trying to decide whether i should offer 'locale' as an option in my program. given the unintuitive way re.LOCALE works, i'm not sure that i should.

are you saying that it only really makes sense for *bytestrings* to be used with re.LOCALE?

if so, the re docs certainly don't make that clear.

The re module can match against 3 types of string:

1. ASCII (default in Python 2): bytestring with characters in the ASCII
range (1 byte per character). However, it doesn't complain if it sees
bytes/characters outside the ASCII range.

2. LOCALE: bytestring with characters in the current locale (but only 1
byte per character). Characters are categorised according to the
underlying C library; for example, 'a' is a letter if isalpha('a')
returns true.

3. UNICODE (default in Python 3): Unicode string.
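
The three cases above can be sketched roughly as follows, assuming Python 3.6 or later, where re.LOCALE is only accepted with bytes patterns. The sample text and the choice of Latin-1 as the 1-byte-per-character encoding are assumptions for illustration. The LOCALE result is deliberately not shown, since it depends on whatever locale the process is running in.

```python
import re

text = 'caf\xe9 na\xefve'          # Unicode string
data = text.encode('latin-1')      # bytestring, 1 byte per character

# 1. ASCII: the default for bytes patterns. Bytes outside the ASCII range
#    are accepted in the input but are not treated as word characters.
ascii_words = re.findall(rb'\w+', data)       # [b'caf', b'na', b've']

# 2. LOCALE: each byte is classified by the C library for the current
#    locale, so the result varies from system to system.
locale_words = re.findall(rb'\w+', data, re.LOCALE)

# 3. UNICODE: the default for str patterns in Python 3.
unicode_words = re.findall(r'\w+', text)      # ['café', 'naïve']
```

If the locale's encoding uses more than 1 byte per character, case 2 still classifies one byte at a time, which is why it is of limited use there.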