Re: byte count unicode string



willie wrote:
>> Marc 'BlackJack' Rintsch:
>>
>> >In <mailman.313.1158732191.10491.python-l...@xxxxxxxxxx>, willie wrote:
>> >> # What's the correct way to get the
>> >> # byte count of a unicode (UTF-8) string?
>> >> # I couldn't find a builtin method
>> >> # and the following is memory inefficient.

>> >> ustr = "example\xC2\x9D".decode('UTF-8')

>> >> num_chars = len(ustr) # 8

>> >> buf = ustr.encode('UTF-8')

>> >> num_bytes = len(buf) # 9

>> >That is the correct way.

>> # Apologies if I'm being dense, but it seems
>> # unusual that I'd have to make a copy of a
>> # unicode string, converting it into a byte
>> # string, before I can determine the size (in bytes)
>> # of the unicode string. Can someone provide the rationale
>> # for that or correct my misunderstanding?

>You initially asked "What's the correct way to get the byte count of a
>unicode (UTF-8) string".
>
>It appears you meant "How can I find how many bytes there are in the
>UTF-8 representation of a Unicode string without manifesting the UTF-8
>representation?".
>
>The answer is, "You can't", and the rationale would have to be that
>nobody thought of a use case for counting the length of the UTF-8 form
>but not creating the UTF-8 form. What is your use case?
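No builtin does this, but the count can be worked out by hand from the code points, since UTF-8 uses 1 to 4 bytes depending on the code point's range. What follows is only a sketch: `utf8_len` is a made-up name, and it assumes Python 3, where iterating over a string yields whole code points (a narrow Python 2 build would miscount characters outside the BMP). A pure-Python loop like this is likely slower than simply encoding, so it trades speed for memory.

```python
def utf8_len(text):
    """Return the number of bytes text would occupy when encoded as UTF-8.

    Hypothetical helper: counts bytes per code point instead of
    building the encoded byte string.
    """
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:         # U+0000..U+007F: 1 byte
            total += 1
        elif cp < 0x800:      # U+0080..U+07FF: 2 bytes
            total += 2
        elif cp < 0x10000:    # U+0800..U+FFFF: 3 bytes
            total += 3
        else:                 # U+10000..U+10FFFF: 4 bytes
            total += 4
    return total


sample = u"example\x9d"               # the string from the original post
print(len(sample))                    # 8 characters
print(utf8_len(sample))               # 9 bytes, no encoded copy made
print(len(sample.encode("utf-8")))    # 9 bytes, via the encoded copy
```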

# Sorry for the confusion. My use case is a web app that
# only deals with UTF-8 strings. I want to prevent silent
# truncation of the data, so I want to validate the number
# of bytes that make up the unicode string before sending
# it to the database to be written.

# For instance, say I have a name column that is varchar(50).
# The 50 is in bytes not characters. So I can't use the length of
# the unicode string to check if it's over the maximum allowed bytes.
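The check described above could look roughly like this. It is a sketch only: `fits_in_column` and `MAX_NAME_BYTES` are invented names, and it assumes the column limit really is measured in UTF-8 bytes, which depends on the database and its configuration.

```python
MAX_NAME_BYTES = 50  # assumption: varchar(50) is measured in bytes


def fits_in_column(text, max_bytes=MAX_NAME_BYTES):
    """Return True if text, encoded as UTF-8, fits within max_bytes."""
    return len(text.encode("utf-8")) <= max_bytes


print(fits_in_column(u"a" * 50))      # 50 characters, 50 bytes
print(fits_in_column(u"\xe9" * 26))   # 26 characters, 52 bytes
```

The second call shows why the character count is not enough: 26 characters is well under 50, yet the encoded form is too long.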

What is the database API expecting to get as an arg: a Python unicode
object, or a Python str (8-bit, presumably encoded in utf-8)?


name = post.input('name') # utf-8 string

You are confusing the hell out of yourself. You say that your web app
deals only with UTF-8 strings. Where do you get "the unicode string"
from??? If name is a utf-8 string, as your comment says, then len(name)
is all you need!!!

*PLEASE* print type(name), repr(name) so that we can see what type it
is!!
If it says the type is str, then it's an 8-bit string, (presumably)
encoded in utf-8.
If it says the type is unicode, then please explain "web app that only
deals with UTF-8 strings" ...
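To make the distinction concrete, here is a small sketch of what that inspection tells you. `describe` is a made-up helper, and the type names it reports depend on the Python version (`str`/`unicode` in Python 2, `bytes`/`str` in Python 3).

```python
def describe(value):
    """Return (type name, repr, UTF-8 byte count) for a byte or text string."""
    if isinstance(value, bytes):
        # Already an encoded 8-bit string: len() counts bytes directly.
        return type(value).__name__, repr(value), len(value)
    # Text (unicode) object: the byte count is only known after encoding.
    return type(value).__name__, repr(value), len(value.encode("utf-8"))


encoded = b"example\xc2\x9d"   # 8-bit string holding UTF-8 data
decoded = u"example\x9d"       # the same text as a unicode object

print(describe(encoded))
print(describe(decoded))
```

Both calls report 9 bytes, but only the second one has to build a temporary encoded copy to find that out.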


# preferable
if bytes(name) > 50:
    send_http_headers()
    display_page_begin()
    display_error_msg('the name is too long')
    display_form(name)
    display_page_end()

# If I have a form with many input elements,
# I have to convert each to a byte string
# before I can see how many bytes make up the
# unicode string. That's very memory inefficient
# with large text fields - having to duplicate each
# one to get its size in bytes:

They'd be garbage collected unless you worked very hard to hang on to
them. How large is "large"?
