converting to and from octal escaped UTF-8

Michael Goerz · Dec 3, 2007

Hi,

I am writing unicode strings into a special text file that requires
non-ascii characters to be written as octal-escaped UTF-8 codes.

For example, the letter "Í" (latin capital I with acute, code point 205)
would come out as "\303\215".

I will also have to read back from the file later on and convert the
escaped characters back into a unicode string.

Does anyone have any suggestions on how to go from "Í" to "\303\215" and
vice versa?

I know I can get the code point, but there doesn't seem to be any
similar method for getting the octal-escaped version.

Thanks,
Michael
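
For readers following along, here is a small sketch of where "\303\215" comes from. It assumes Python 3 (the thread itself predates it), and the variable names are only illustrative: the character is encoded to UTF-8, and each resulting byte is written as a backslash followed by three octal digits.

```python
# Python 3 sketch: UTF-8 bytes of a character and their octal-escaped form.
char = "\u00cd"  # LATIN CAPITAL LETTER I WITH ACUTE, code point 205

# Encoding to UTF-8 turns the single code point into two bytes.
utf8_bytes = char.encode("utf-8")

# Iterating over a bytes object yields integers, which "%03o" formats
# as zero-padded, three-digit octal numbers.
octal_escaped = "".join("\\%03o" % b for b in utf8_bytes)

print(ord(char))         # the code point
print(list(utf8_bytes))  # the UTF-8 byte values
print(octal_escaped)     # the escaped form written to the file
```

The code point (205) and the UTF-8 bytes (195 and 141, i.e. octal 303 and 215) are different numbers, which is why the code point alone is not enough to build the escaped form.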

Michael Goerz · Dec 3, 2007

I've come up with the following solution. It's not very pretty, but it
works (no bugs, I hope). Can anyone think of a better way to do it?

Michael
_________

import binascii

def escape(s):
    hexstring = binascii.b2a_hex(s)
    result = ""
    while len(hexstring) > 0:
        (hexbyte, hexstring) = (hexstring[:2], hexstring[2:])
        octbyte = oct(int(hexbyte, 16)).zfill(3)
        result += "\\" + octbyte[-3:]
    return result

def unescape(s):
    result = ""
    while len(s) > 0:
        if s[0] == "\\":
            (octbyte, s) = (s[1:4], s[4:])
            try:
                result += chr(int(octbyte, 8))
            except ValueError:
                result += "\\"
                s = octbyte + s
        else:
            result += s[0]
            s = s[1:]
    return result

print escape("\303\215")
print unescape('adf\\303\\215adf')

MonkeeSage · Dec 3, 2007

Looks like escape() can be a bit simpler...

def escape(s):
    result = []
    for char in s:
        result.append("\%o" % ord(char))
    return ''.join(result)

Regards,
Jordan
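
A possible Python 3 rendering of the same idea, offered as a sketch rather than as what was posted: it assumes the input is already a bytes object (for example the result of `.encode("utf-8")`). In Python 3, iterating over bytes yields integers, so the `ord()` call is no longer needed.

```python
def escape(data: bytes) -> str:
    """Octal-escape every byte of `data` (Python 3 sketch of the idea above)."""
    # Each item of a bytes object is already an int in Python 3.
    return "".join("\\%o" % byte for byte in data)


print(escape("\u00cd".encode("utf-8")))
print(escape(b"adf"))
```

Like the version above, this escapes every byte, including plain ASCII letters, and does not pad short octal values to three digits.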

Michael Goerz · Dec 3, 2007

Very neat! Thanks a lot...
Michael

Michael Spencer · Dec 3, 2007

Perhaps something along the lines of:

>>> def encode(source):
...     return "".join("\%o" % ord(c) for c in source.encode('utf8'))
...
>>> def decode(encoded):
...     bytes = "".join(chr(int(c, 8)) for c in encoded.split('\\')[1:])
...     return bytes.decode('utf8')
...
>>> encode(u"Í")
'\\303\\215'
>>> print decode(_)
Í

HTH
Michael

MonkeeSage · Dec 3, 2007

Nice one.

If I might suggest a slight variation to handle cases
where the "encoded" string contains plain text as well as octal
escapes...

import re

def decode(encoded):
    for octc in (c for c in re.findall(r'\\(\d{3})', encoded)):
        encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return encoded.decode('utf8')

This way it can handle both "\\141\\144\\146\\303\\215\\141\\144\\146"
as well as "adf\\303\\215adf".

Regards,
Jordan

MonkeeSage · Dec 3, 2007

err...

def decode(encoded):
    for octc in re.findall(r'\\(\d{3})', encoded):
        encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return encoded.decode('utf8')

Michael Goerz · Dec 3, 2007

Great suggestions from both of you! I came up with my "final" solution
based on them. It encodes only non-ascii and non-printable characters,
and stays in unicode strings for both input and output. Low ascii values
now encode into a 3-digit octal sequence as well, so that decode can
catch them properly.

Thanks a lot,
Michael

____________

import re

def encode(source):
    encoded = ""
    for character in source:
        if (ord(character) < 32) or (ord(character) > 128):
            for byte in character.encode('utf8'):
                encoded += ("\%03o" % ord(byte))
        else:
            encoded += character
    return encoded.decode('utf-8')

def decode(encoded):
    decoded = encoded.encode('utf-8')
    for octc in re.findall(r'\\(\d{3})', decoded):
        decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return decoded.decode('utf8')

orig = u"blaÍblub" + chr(10)
enc = encode(orig)
dec = decode(enc)
print orig
print enc
print dec
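
The point about 3-digit sequences can be illustrated with a short sketch. It assumes Python 3 and uses the same `\\(\d{3})` pattern as decode() above; the variable names are made up for the illustration. An unpadded escape for a newline is not found by the pattern at all, and it becomes ambiguous if a digit happens to follow it.

```python
import re

# Same pattern that decode() in the thread uses to find escapes.
pattern = re.compile(r"\\(\d{3})")

newline = ord("\n")          # 10, octal 12
unpadded = "\\%o" % newline  # backslash followed by "12"
padded = "\\%03o" % newline  # backslash followed by "012"

# The unpadded form has only two digits, so the pattern misses it.
print(pattern.findall(unpadded))
# The padded form is found and can be converted back.
print(pattern.findall(padded))

# If a literal digit follows an unpadded escape, the two run together
# and are read as a single, different escape.
ambiguous = unpadded + "3"
print(pattern.findall(ambiguous), chr(int("123", 8)))
```

In the last case the newline followed by "3" would be read back as the single character "S".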

Piet van Oostrum · Dec 4, 2007

Michael Goerz said:
MG> if (ord(character) < 32) or (ord(character) > 128):

If you encode chars < 32, it seems more appropriate to also encode 127
(DEL).

Moreover, your code is quadratic in the size of the string because of the
repeated string concatenation, so if you use long strings it would be
better to collect the pieces and use join.
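
Both suggestions could be combined roughly as follows. This is a Python 3 sketch under my own assumptions, not code posted in the thread: the pieces are collected in a list and joined once at the end, and the test `code >= 127` covers DEL as well as code point 128, which the `> 128` comparison let through unescaped.

```python
def encode(source: str) -> str:
    """Octal-escape control and non-ASCII characters (Python 3 sketch)."""
    parts = []
    for character in source:
        code = ord(character)
        if code < 32 or code >= 127:
            # Escape each UTF-8 byte as a backslash plus three octal digits.
            parts.extend("\\%03o" % byte for byte in character.encode("utf-8"))
        else:
            # Printable ASCII passes through unchanged. A literal backslash
            # in the input is not escaped here, as in the thread's version.
            parts.append(character)
    # A single join avoids building a new string on every iteration.
    return "".join(parts)


print(encode("bla\u00cdblub\n"))
print(encode("\x7f"))
```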

MonkeeSage · Dec 4, 2007

An optimization...in decode() store matches as keys in a dict, so you
only do the string replacement once for each unique character...

def decode(encoded):
    decoded = encoded.encode('utf-8')
    matches = {}
    for octc in re.findall(r'\\(\d{3})', decoded):
        matches[octc] = None
    for octc in matches:
        decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return decoded.decode('utf8')

Untested...

Regards,
Jordan
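
Another way to avoid repeated replacements, sketched here for Python 3 as an alternative rather than as the approach above, is to let `re.sub` handle every escape in a single pass with a replacement function. The names `_OCTAL_ESCAPE` and `_to_byte` are illustrative, and the pattern is an assumption of this sketch: it is narrowed from `\d{3}` to `[0-3][0-7]{2}` so that only valid octal values up to 377 (255) are matched.

```python
import re

# Matches a backslash followed by three octal digits in the range 000-377.
_OCTAL_ESCAPE = re.compile(rb"\\([0-3][0-7]{2})")


def _to_byte(match):
    # Convert the three captured octal digits into a single byte.
    return bytes([int(match.group(1), 8)])


def decode(encoded: str) -> str:
    """Turn octal-escaped UTF-8 back into text (Python 3 sketch)."""
    raw = encoded.encode("utf-8")
    # One pass over the data: each escape is replaced where it is found.
    raw = _OCTAL_ESCAPE.sub(_to_byte, raw)
    return raw.decode("utf-8")


print(decode("adf\\303\\215adf"))
print(decode("\\141\\144\\146\\303\\215\\141\\144\\146"))
```

Because each escape is substituted in place, a byte produced by one replacement is never scanned again as part of a later escape.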
