Re: converting to and from octal escaped UTF-8



MonkeeSage wrote:
On Dec 3, 1:31 am, MonkeeSage <MonkeeS...@xxxxxxxxx> wrote:
On Dec 2, 11:46 pm, Michael Spencer <m...@xxxxxxxxxxxxxxxxx> wrote:



Michael Goerz wrote:
Hi,
I am writing unicode strings into a special text file that requires
non-ascii characters to be written as octal-escaped UTF-8 codes.
For example, the letter "Í" (latin capital I with acute, code point 205)
would come out as "\303\215".
I will also have to read back from the file later on and convert the
escaped characters back into a unicode string.
Does anyone have any suggestions on how to go from "Í" to "\303\215" and
vice versa?
Perhaps something along the lines of:
>>> def encode(source):
...     return "".join("\%o" % ord(c) for c in source.encode('utf8'))
...
>>> def decode(encoded):
...     bytes = "".join(chr(int(c, 8)) for c in encoded.split('\\')[1:])
...     return bytes.decode('utf8')
...
>>> encode(u"Í")
'\\303\\215'
>>> print decode(_)
Í
HTH
Michael
Nice one. :) If I might suggest a slight variation to handle cases
where the "encoded" string contains plain text as well as octal
escapes...

def decode(encoded):
    for octc in (c for c in re.findall(r'\\(\d{3})', encoded)):
        encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return encoded.decode('utf8')

This way it can handle both "\\141\\144\\146\\303\\215\\141\\144\\146"
as well as "adf\\303\\215adf".

Regards,
Jordan

err...

import re

def decode(encoded):
    for octc in re.findall(r'\\(\d{3})', encoded):
        encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return encoded.decode('utf8')
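
A side note on the loop above: each replace() call rescans the whole string, so text produced by an earlier replacement can be picked up by a later one. For example, \134101\101 comes out as AA rather than \101A. Below is a minimal sketch of a single-pass alternative, assuming Python 3 (where str is Unicode and bytes is a separate type). The names OCTAL_ESCAPE and decode_octal are hypothetical and not from this thread, and restricting matches to \000 through \377 is my own choice so that every escape fits in one byte.

```python
import re

# A backslash followed by three octal digits in the range 000-377 (one byte).
OCTAL_ESCAPE = re.compile(rb'\\([0-3][0-7]{2})')

def decode_octal(encoded):
    """Decode \\ooo escapes (UTF-8 bytes written in octal) found in a str."""
    # Work on bytes so escaped and unescaped parts can be mixed freely.
    raw = encoded.encode('utf-8')
    # re.sub visits each match once and never rescans its own output.
    raw = OCTAL_ESCAPE.sub(lambda m: bytes([int(m.group(1), 8)]), raw)
    return raw.decode('utf-8')

# Plain text mixed with escapes, as in the "adf\\303\\215adf" case above.
result = decode_octal(r'adf\303\215adf')
```

Input that is not valid UTF-8 after unescaping raises UnicodeDecodeError, the same way the versions above would.
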
Great suggestions from both of you! I came up with my "final" solution
based on them. It encodes only non-ascii and non-printable characters,
and uses unicode strings for both input and output. Low ascii values now
also encode into a 3-digit octal sequence, so that decode can catch them
properly.

Thanks a lot,
Michael

____________

import re

def encode(source):
    encoded = ""
    for character in source:
        # Escape control characters, DEL (127) and everything non-ascii.
        if (ord(character) < 32) or (ord(character) > 126):
            for byte in character.encode('utf8'):
                encoded += ("\\%03o" % ord(byte))
        else:
            encoded += character
    return encoded.decode('utf-8')

def decode(encoded):
    decoded = encoded.encode('utf-8')
    # Replace every 3-digit octal escape with the byte it stands for.
    for octc in re.findall(r'\\(\d{3})', decoded):
        decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return decoded.decode('utf8')


orig = u"blaÍblub" + chr(10)
enc = encode(orig)
dec = decode(enc)
print orig
print enc
print dec
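
For anyone reading this on Python 3, where iterating over bytes already yields integers and str is Unicode, the encoding half might look roughly like the sketch below. The name encode_octal is hypothetical, and the printable range 32 to 126 is my reading of "non-ascii and non-printables" above, not something stated in the thread.

```python
def encode_octal(text):
    """Escape control and non-ASCII characters as \\ooo UTF-8 bytes."""
    pieces = []
    for character in text:
        if 32 <= ord(character) <= 126:
            # Printable ASCII is written through unchanged.
            pieces.append(character)
        else:
            # Each UTF-8 byte becomes a backslash plus three octal digits.
            pieces.extend('\\%03o' % byte for byte in character.encode('utf-8'))
    return ''.join(pieces)

# Same sample as above: U+00CD becomes \303\215, the newline becomes \012.
enc = encode_octal('bla\u00cdblub\n')
```

As in the version above, a literal backslash in the input is written through unescaped, so text that already contains something like \101 will not survive a round trip unchanged.
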
