Re: Most pythonic way to truncate unicode?



Andrew Fong <FongAndrew <at> gmail.com> writes:


> I need to ...
> 1) Truncate long unicode (UTF-8) strings based on their length in
> BYTES.
> 2) I don't want to accidentally chop any unicode characters in half.
> If the byte truncate length would normally cut a unicode character in
> 2, then I just want to drop the whole character, not leave an orphaned
> byte.
> I'm using Python 2.6, so I have access to things like bytearray.

Using bytearray saves you from using ord()
but runs the risk of accidental mutation.

> Are there any built-in ways to do something like this already? Or do I
> just have to iterate over the unicode string?

Converting each character to utf8 and checking the
total number of bytes so far?
Ooooh, sloooowwwwww!
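
For reference, the per-character approach described above might look like the following minimal sketch. The name utf8_truncate_by_chars is made up for illustration, and the code assumes that iterating over the string yields whole code points (true on Python 3 and on wide Python 2 builds; a narrow Python 2 build splits characters outside the BMP into surrogate halves, which this sketch does not handle).

```python
def utf8_truncate_by_chars(text, maxlen):
    """Encode one character at a time and stop before exceeding maxlen bytes.

    Hypothetical helper for illustration; 'text' is a unicode string and
    the result is a UTF-8 byte string of at most maxlen bytes.
    """
    pieces = []
    total = 0
    for ch in text:
        encoded = ch.encode("utf-8")
        if total + len(encoded) > maxlen:
            # This character would not fit entirely, so drop it and stop.
            break
        pieces.append(encoded)
        total += len(encoded)
    return b"".join(pieces)
```

It does one encode() call per character, which is where the slowness comes from on long strings.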

The whole concept of "truncating unicode"
(you mean "truncating utf8") seems
rather unpythonic to me.
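
On the question of a built-in route: one option, which is not proposed anywhere in this thread and is offered only as a sketch, is to slice the encoded bytes at the limit and round-trip the slice through the codec with the "ignore" error handler, so that an incomplete sequence left at the end is dropped. The name utf8_truncate_ignore is hypothetical.

```python
def utf8_truncate_ignore(u8s, maxlen):
    """Truncate UTF-8 bytes to at most maxlen bytes without splitting a character.

    Hypothetical helper. Caveat: the "ignore" handler silently drops
    malformed bytes anywhere in the slice, not only a cut-off character
    at the end, so bad input is hidden rather than reported.
    """
    return u8s[:maxlen].decode("utf-8", "ignore").encode("utf-8")
```

Whether that silent dropping is acceptable depends on how much you trust the input.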

Another alternative is to iterate backwards
over the utf8 string looking for a
character-starting byte. It leads to a candidate
for Unpythonic Code of the Year:

def utf8trunc(u8s, maxlen):
    # u8s is a UTF-8 encoded str (Python 2); returns at most maxlen bytes
    assert maxlen >= 1
    alen = len(u8s)
    if alen <= maxlen:
        return u8s
    # scan backwards from the last byte that would be kept
    pos = maxlen - 1
    while pos >= 0:
        val = ord(u8s[pos])
        if val & 0xC0 != 0x80:
            # found an initial byte
            break
        pos -= 1
    else:
        # no initial byte found
        raise ValueError("malformed UTF-8 [1]")
    if maxlen - pos > 4:
        raise ValueError("malformed UTF-8 [2]")
    if val & 0x80:
        # lead byte 110xxxxx -> 2 bytes, 1110xxxx -> 3, 11110xxx -> 4
        charlen = (2, 2, 3, 4)[(val >> 4) & 3]
    else:
        charlen = 1
    nextpos = pos + charlen
    assert nextpos >= maxlen
    if nextpos == maxlen:
        # the last character fits exactly; keep it
        return u8s[:nextpos]
    # the last character would be cut; drop it entirely
    return u8s[:pos]

if __name__ == "__main__":
    tests = [u"", u"\u0000", u"\u007f", u"\u0080",
             u"\u07ff", u"\u0800", u"\uffff"]
    for testx in tests:
        test = u"abcde" + testx + u"pqrst"
        u8 = test.encode('utf8')
        print repr(test), repr(u8), len(u8)
        for mlen in range(4, 8 + len(testx.encode('utf8'))):
            u8t = utf8trunc(u8, mlen)
            print " ", mlen, len(u8t), repr(u8t)

Tested to the extent shown. Doesn't pretend to check for all cases
of UTF-8 malformation, just easy ones :-)
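
As a follow-up to the bytearray remark earlier, here is a hedged sketch of the same backward scan done through a bytearray, so that indexing yields integers and ord() is not needed. It is meant to run on Python 2.6+ and Python 3. The name utf8trunc_bytearray is made up, and, as a design choice of this sketch rather than of the code above, the asserts are replaced by ValueError.

```python
def utf8trunc_bytearray(u8s, maxlen):
    """Truncate UTF-8 bytes to at most maxlen bytes on a character boundary.

    Hypothetical variant of the backward scan; raises ValueError on the
    easy-to-detect kinds of malformed input, and does not check the rest.
    """
    if maxlen < 1:
        raise ValueError("maxlen must be >= 1")
    if len(u8s) <= maxlen:
        return u8s
    buf = bytearray(u8s)  # private copy; indexing gives ints, no ord()
    pos = maxlen - 1
    # Walk back over continuation bytes (bit pattern 10xxxxxx).
    while pos >= 0 and buf[pos] & 0xC0 == 0x80:
        pos -= 1
    if pos < 0 or maxlen - pos > 4:
        raise ValueError("malformed UTF-8")
    val = buf[pos]
    if val & 0x80:
        # Lead byte 110xxxxx -> 2 bytes, 1110xxxx -> 3, 11110xxx -> 4.
        charlen = (2, 2, 3, 4)[(val >> 4) & 3]
    else:
        charlen = 1
    nextpos = pos + charlen
    if nextpos < maxlen:
        # More continuation bytes than the lead byte allows.
        raise ValueError("malformed UTF-8")
    if nextpos == maxlen:
        return u8s[:nextpos]  # last character fits exactly
    return u8s[:pos]          # last character would be cut; drop it
```

Because the bytearray is a local copy and the result is sliced from the original argument, the caller's data is never mutated.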

Cheers,
John
