Best ways of managing text encodings in source/regexes?



Hi

I've read around quite a bit about Unicode and Python's support for
it, and I'm still unclear about how it all fits together in certain
scenarios. Can anyone help clarify?

* When I say "# -*- coding: utf-8 -*-" and confirm my IDE is saving
the source file as UTF-8, do I still need to prefix all the strings
constructed in the source with u as in myStr = u"blah", even when
those strings contain only ASCII or ISO-8859-1 chars? (It would be a
bother for me to do this for the complete source I'm working on, where
I rarely need chars outside the ISO-8859-1 range.)

* Will Python figure it out if I use different encodings in different
modules -- say a main source file which is "# -*- coding: utf-8 -*-"
and an imported module which doesn't say this (for which Python will
presumably use a default encoding)? This seems inevitable given that
standard library modules such as re don't declare an encoding,
presumably because they contain no non-ASCII chars (I don't see any in
the source of re).

* If I want to use a Unicode char in a regex -- say an en-dash, U+2013
-- in an ASCII- or ISO-8859-1-encoded source file, can I say

myASCIIRegex = re.compile('[A-Z]')
myUniRegex = re.compile(u'\u2013') # en-dash

then read the source file into a unicode string with codecs.open(),
then expect re to match against the unicode string using either of
those regexes if the string contains the relevant chars? Or do I need
to make all my regex patterns unicode strings, with u""?
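
A minimal sketch of the scenario in the last point, so it can be tried directly. The sample text and temporary file are hypothetical, and io.open is used here as an assumption (codecs.open with the same encoding argument should behave similarly). It keeps both patterns exactly as written above and matches them against text decoded from a UTF-8 file:

```python
# -*- coding: utf-8 -*-
import io
import os
import re
import tempfile

# Patterns from the question: one plain literal, one u"" literal.
my_ascii_regex = re.compile('[A-Z]')
my_uni_regex = re.compile(u'\u2013')  # en-dash

# Hypothetical sample text, written out as UTF-8 so it can be read back.
sample = u'Pages 10\u201320 of Volume B'
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(sample)
    # Decoding happens here: text is a unicode string, not raw bytes.
    with io.open(path, 'r', encoding='utf-8') as f:
        text = f.read()
finally:
    os.remove(path)

# Try both patterns against the decoded text.
ascii_hits = my_ascii_regex.findall(text)
dash_hit = my_uni_regex.search(text)
print(ascii_hits)
print(dash_hit is not None)
```

Because the en-dash is written as the escape \u2013, the source file itself stays pure ASCII, so the coding declaration does not affect this pattern. On Python 3 every string literal is already Unicode and the u prefix changes nothing; on Python 2 the plain '[A-Z]' pattern is a byte string, and as far as I can tell it only mixes safely with unicode text because the pattern is pure ASCII.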

I've been trying to understand this for a while so any clarification
would be a great help.

Tim