Re: inserting Unicode character in dictionary - Python



On Fri, 17 Oct 2008 11:32:36 -0600, Joe Strout wrote:

On Oct 17, 2008, at 11:24 AM, Marc 'BlackJack' Rintsch wrote:

kw = 'генских'

What do you mean by "does not work"? And you are aware that the above
snippet doesn't involve any unicode characters!? You have a byte
string there -- type `str`, not `unicode`.

Just checking my understanding here -- are the following all true:

1. If you had prefixed that literal with a "u", then you'd have Unicode.

Yes.

2. Exactly what Unicode you get would be dependent on Python properly
interpreting the bytes in the source file -- which you can make it do by
adding something like "-*- coding: utf-8 -*-" in a comment at the top of
the file.

Yes, assuming the encoding on that comment matches the actual encoding of
the file.
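
As a rough illustration of why the declaration has to match the file, the sketch below hands the same source bytes to the interpreter under two different coding comments. It assumes Python 3, where `exec` accepts a byte string and honours the coding comment; the helper `run_with_declaration` and the variable names are made up for this example.

```python
# The two bytes 0xC3 0xA9 are the UTF-8 encoding of U+00E9 (e with acute).
literal = b"s = u'\xc3\xa9'\n"

def run_with_declaration(encoding_name):
    """Execute the literal under the given coding comment and return s."""
    source = b"# -*- coding: " + encoding_name + b" -*-\n" + literal
    namespace = {}
    exec(source, namespace)  # bytes source: the coding comment is honoured
    return namespace["s"]

right = run_with_declaration(b"utf-8")    # declaration matches the bytes
wrong = run_with_declaration(b"latin-1")  # declaration does not match

print(len(right))  # 1 character
print(len(wrong))  # 2 characters: each byte was read as its own character
```

With the mismatched declaration nothing fails loudly; the literal just turns into different text.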

3. Without the "u" prefix, you'll have some 8-bit string, whose
interpretation is... er... here's where I get a bit fuzzy.

No interpretation at all, just the bunch of bytes that happen to be in
the source file.

What if your source file is set to utf-8? Do you then have a proper
UTF-8 string, but the problem is that none of the standard Python
library methods know how to properly interpret UTF-8?

Well, the decode method knows how to decode those bytes into a `unicode`
object if you call it with 'utf-8' as the argument.
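
A minimal sketch of that decode step, assuming the bytes really are UTF-8. The byte string is spelled out with escapes and a `b` prefix so the snippet does not depend on the encoding of the file it is pasted into and runs on Python 2.6+ as well as Python 3; the names `raw`, `text` and `lookup` are only illustrative.

```python
# UTF-8 encoded bytes of the seven-letter Cyrillic word from the original post.
raw = b'\xd0\xb3\xd0\xb5\xd0\xbd\xd1\x81\xd0\xba\xd0\xb8\xd1\x85'

# Decoding turns the bytes into text (`unicode` in Python 2, `str` in Python 3).
text = raw.decode('utf-8')

print(len(raw))   # 14 bytes
print(len(text))  # 7 characters

# The decoded text can be used as a dictionary key like any other string.
lookup = {text: 1}
print(lookup[raw.decode('utf-8')])  # 1
```

The two lengths differ because each Cyrillic letter takes two bytes in UTF-8.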

4. In Python 3.0, this silliness goes away, because all strings are
Unicode by default.

Yes and no. The problem just shifts because at some point you get into
similar troubles, just in the other direction. Data enters the program
as bytes and must leave it as bytes again, so you have to deal with
encodings at those points.
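
One way to picture those boundaries is the sketch below, written for Python 3. The `io.BytesIO` objects merely stand in for a file or socket, and UTF-8 on both sides is an assumption made for the example, not something the program could guess by itself.

```python
import io

# Stand-in for a file or socket: the outside world delivers bytes.
incoming = io.BytesIO(u'\u0433\u0435\u043d'.encode('utf-8'))

data = incoming.read()        # bytes arrive
text = data.decode('utf-8')   # decode once, on the way in

result = text.upper()         # inside the program, work on text only

outgoing = io.BytesIO()       # stand-in for the output file or socket
outgoing.write(result.encode('utf-8'))  # encode once, on the way out

print(type(data).__name__)    # bytes
print(type(result).__name__)  # str
```

The text-only middle part is the easy bit; the choice of encoding at both ends still has to be made explicitly.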

Ciao,
Marc 'BlackJack' Rintsch