python3 raw strings and \u escapes



In python2, "\u" escapes are processed in raw unicode
strings. That is, ur'\u3000' is a string of length 1
consisting of the IDEOGRAPHIC SPACE unicode character.

In python3, "\u" escapes are not processed in raw strings.
r'\u3000' is a string of length 6 consisting of a backslash,
'u', '3' and three '0' characters.
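
A minimal sketch of that difference, runnable under python3 (the variable names are only for illustration):

```python
import unicodedata

raw = r'\u3000'      # raw literal: backslash, 'u', '3', '0', '0', '0'
cooked = '\u3000'    # ordinary literal: the escape is processed by the lexer

raw_len = len(raw)
cooked_len = len(cooked)
cooked_name = unicodedata.name(cooked)

print(raw_len, cooked_len, cooked_name)
```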

This breaks a lot of my code because in python 2
re.split (ur'[\u3000]', u'A\u3000A') ==> [u'A', u'A']
but in python 3 (the result of running 2to3),
re.split (r'[\u3000]', 'A\u3000A' ) ==> ['A\u3000A']

I can remove the "r" prefix from the regex string but then
if I have other regex backslash symbols in it, I have to
double all the other backslashes -- the very thing that
the r-prefix was invented to avoid.
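
To make the cost concrete, here is a sketch with a made-up pattern (the \d+ alternative and the sample string are assumptions, not taken from my real code). It also shows a variation I have not settled on: adjacent string literals are concatenated by the compiler, so in principle only the piece that needs the \u escape has to give up its r-prefix.

```python
import re

# Hypothetical pattern: split on runs of space / ideographic space, or on digits.

# No r-prefix at all: every regex backslash has to be doubled.
doubled = '[ \u3000]+|\\d+'

# Adjacent literals: the first piece is an ordinary string (so \u3000 is
# processed), the second stays raw (so \d needs no doubling).
mixed = '[ \u3000]+' r'|\d+'

sample = 'A\u3000B C1D'
parts_doubled = re.split(doubled, sample)
parts_mixed = re.split(mixed, sample)
print(parts_doubled, parts_mixed)
```

Both spellings produce the same pattern string, but splitting one regex across several literals is arguably not much easier to read than doubling the backslashes.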

Or I can leave the "r" prefix and replace something like
r'[ \u3000]' with r'[  ]'. But that is confusing because
one can't distinguish between the space character and
the ideographic space character. It is also a problem if a
reader of the code doesn't have a font that can display
the character.

Was there a reason for dropping the lexical processing of
\u escapes in raw strings in python3 (other than to add another
annoyance to a long list of python3 annoyances)?

And is there no choice for me but to choose between the two
poor choices I mention above to deal with this problem?
