byte count unicode string



Martin v. Löwis:

>willie wrote:
>
>> Thank you for your patience and for educating me.
>> (Though I still have a long way to go before enlightenment)
>> I thought Python might have a small weakness in
>> lacking an efficient way to get the number of bytes
>> in a "UTF-8 encoded Python string object" (proper?),
>> but I've been disabused of that notion.
>
>Well, to get to the enlightenment, you have to understand
>that Unicode and UTF-8 are *not* synonyms.
>
>A Python Unicode string is an abstract sequence of
>characters. It does have an in-memory representation,
>but that is irrelevant and depends on what microprocessor
>you use. A byte string is a sequence of quantities with
>8 bits each (called bytes).
>
>For each of them, the notion of "length" exists: For
>a Unicode string, it's the number of characters; for
>a byte string, the number of bytes.
>
>UTF-8 is a character encoding; it is only meaningful
>to say that byte strings have an encoding (where
>"UTF-8", "cp1252", "iso-2022-jp" are really very
>similar). For a character encoding, "what is the
>number of bytes?" is a meaningful question. For
>a Unicode string, this question is not meaningful:
>you have to specify the encoding first.
>
>Now, there is no len(unicode_string, encoding) function:
>len takes a single argument. To specify both the string
>and the encoding, you have to write
>len(unicode_string.encode(encoding)). This, as a
>side effect, actually computes the encoding.
>
>While it would be possible to answer the question
>"how many bytes does Unicode string S occupy in encoding E?"
>without actually encoding the string, doing so would
>require codecs to implement their algorithm twice:
>once to count the number of bytes, and once to
>actually perform the encoding. Since this operation
>is not that frequent, it was decided not to burden
>codec authors with implementing the algorithm twice
>(actually, doing so was never even considered).
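
To make the len(unicode_string.encode(encoding)) idiom quoted above concrete, here is a minimal sketch. The helper name encoded_length and the sample string are my own assumptions for illustration, not anything from the thread or the standard library. It only shows that the character count is fixed while the byte count depends on the encoding chosen.

```python
# Hypothetical helper (not part of Python or of this thread): returns the
# number of bytes a Unicode string occupies in a given encoding. It really
# does perform the encoding, exactly as described in the quoted text.
def encoded_length(unicode_string, encoding):
    return len(unicode_string.encode(encoding))

# Sample string chosen for illustration; \u00f6 is "o with diaeresis".
ustr = u"L\u00f6wis"

char_count = len(ustr)                           # number of characters
utf8_bytes = encoded_length(ustr, "utf-8")       # non-ASCII char takes 2 bytes
cp1252_bytes = encoded_length(ustr, "cp1252")    # every char takes 1 byte
utf16_bytes = encoded_length(ustr, "utf-16-le")  # 2 bytes per char, no BOM

print(char_count, utf8_bytes, cp1252_bytes, utf16_bytes)
```

The same string gives a different answer for each encoding, which is why asking for "the number of bytes" only makes sense once an encoding has been named.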


Thanks for the thorough explanation. One last question
about terminology, then I'll go away :)
What is the proper way to describe "ustr" below?

>>> ustr = buf.decode('UTF-8')
>>> type(ustr)
<type 'unicode'>


Is it a "unicode object that contains a UTF-8 encoded
string object?"
