Re: Unicode regex and Hindi language



Huh? I thought it was settled. Read Terry Reedy's latest message. Read
the bug report it points to (http://bugs.python.org/issue1693050),
especially the contribution from MvL. To paraphrase a remark by the
timbot, Martin reads Unicode tech reports so that we don't have to.
However, if you are a doubter or have insomnia, read http://unicode.org/reports/tr18/

To be fair to Python (and SRE), SRE predates TR#18 (IIRC) - at least
annex C was added somewhere between revision 6 and 9, i.e. in early
2004. Python's current definition of \w is a straightforward extension
of the historical \w definition (of Perl, I believe), which,
unfortunately, fails to recognize some of the Unicode subtleties.
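
To illustrate the kind of subtlety involved, here is a small sketch (written for Python 3, which is an assumption; the code in this thread is Python 2) that prints the general category of each code point in the Devanagari word for "Hindi". The sample word and the variable names are chosen only for illustration. The vowel signs and the virama are in the Mark categories (Mc, Mn), not the Letter categories, so a \w definition built only from letters and digits cannot span the whole word.

```python
import unicodedata

# The Devanagari word for "Hindi", written with escapes so that the
# individual code points are explicit.
word = '\u0939\u093f\u0928\u094d\u0926\u0940'

# Print code point, general category, and character name for each element.
for ch in word:
    print('U+%04X' % ord(ch), unicodedata.category(ch), unicodedata.name(ch))

# Collect the categories; half of them are Mark categories, not Letter ones.
categories = [unicodedata.category(ch) for ch in word]
```
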

In any case, the desired definition is very well available in Python
today - one just has to define a character class that contains all
characters that one thinks \w should contain, e.g. with the code below.
While the regular expression source becomes very large, the compiled
form will be fairly compact, and efficient in lookup.

Regards,
Martin

# UTR#18 says \w is
# \p{alpha}\p{gc=Mark}\p{digit}\p{gc=Connector_Punctuation}
#
# In turn, \p{alpha} is \p{Alphabetic}, which, in turn
# is Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic
# Other_Alphabetic can be ignored: it is a fixed list of
# characters from Mn and Mc, which are included, anyway
#
# \p{digit} is \p{gc=Decimal_Number}, i.e. Nd
# \p{gc=Mark} is all Mark category, i.e. Mc, Me, Mn
# \p{gc=Connector_Punctuation} is Pc
def make_w():
    import unicodedata, sys
    w_chars = []
    # sys.maxunicode is itself a valid code point, hence the +1
    for i in range(sys.maxunicode + 1):
        c = unichr(i)
        if unicodedata.category(c) in \
           ('Lu','Ll','Lt','Lm','Lo','Nl','Nd',
            'Mc','Me','Mn','Pc'):
            w_chars.append(c)
    # build one big character class containing every such character
    return u'['+u''.join(w_chars)+u']'

import re
re.compile(make_w())
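
For readers on Python 3, the same idea might be sketched as follows. This is an adaptation under stated assumptions, not the code posted above: `chr` replaces `unichr`, the `u''` prefixes are dropped, and the names `W_CATEGORIES`, `word_re`, and the sample text are introduced only for illustration. Building and compiling the class walks every code point, so it can take a moment.

```python
import re
import sys
import unicodedata

# General categories that UTR#18 assigns to \w, as listed in the comments above.
W_CATEGORIES = frozenset(
    ['Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl', 'Nd', 'Mc', 'Me', 'Mn', 'Pc'])

def make_w():
    """Return a regex character class containing every UTR#18 word character."""
    chars = [chr(i) for i in range(sys.maxunicode + 1)
             if unicodedata.category(chr(i)) in W_CATEGORIES]
    # None of these categories contains ']', '\\', '^', '-' or '[',
    # so the characters can be joined without escaping.
    return '[' + ''.join(chars) + ']'

# One or more word characters in a row.
word_re = re.compile(make_w() + '+')

# Two Devanagari words ("Hindi" and "bhasha") separated by a space.
text = '\u0939\u093f\u0928\u094d\u0926\u0940 \u092d\u093e\u0937\u093e'
words = word_re.findall(text)
print(words)
```

With this class, each Devanagari word is found as a single match, because the vowel signs and the virama (categories Mc and Mn) are included alongside the letters.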