How to convert between Japanese coding systems?

Dietrich Bollmann · Feb 19, 2009

Hi,

Are there any functions in Python to convert between different Japanese
coding systems?

I would like to convert between (at least) ISO-2022-JP, UTF-8, EUC-JP
and SJIS. I also need some function to encode / decode base64 encoded
strings.

I get the strings (which actually are emails) from a server on the
internet with:

import urllib
server = urllib.urlopen(serverURL, parameters)
email = server.read()

The coding systems are given in the response string:

Example:

email = '''[...]
Subject:
=?UTF-8?Q?romaji=E3=81=B2=E3=82=89=E3=81=8C=E3=81=AA=E3=82=AB=E3=82=BF?=
=?UTF-8?Q?=E3=82=AB=E3=83=8A=E6=BC=A2=E5=AD=97?=
[...]
Content-Type: text/plain; charset=EUC-JP
[...]
Content-Transfer-Encoding: base64
[...]

cm9tYWpppNKk6aSspMqlq6W/paulyrTBu/oNCg0K

'''

My idea is to first parse the 'email' string and extract the email
body together with the values of the 'Subject: ', 'Content-Type: ' and
'Content-Transfer-Encoding: ' headers, and then to use those values to
convert the subject and body to some other coding system.

Something along the lines of:

(subject, contentType, contentTransferEncoding, content) = parseEmail(email)

to = 'utf-8'
subjectUtf8 = decodeSubject(subject, to)

_from = contentType  # 'from' is a reserved word in Python
to = 'utf-8'
contentUtf8 = convertCodingSystem(decodeBase64(content), _from, to)

The only problem is that I could not find any standard functionality to
convert between different Japanese coding systems.

Thanks,

Dietrich Bollmann

Peter Otten · Feb 19, 2009

Dietrich said:
I get the strings (which actually are emails) from a server on the
internet with:

import urllib
server = urllib.urlopen(serverURL, parameters)
email = server.read()

The coding systems are given in the response string:

Example:

email = '''[...]
Subject:
=?UTF-8?Q?romaji=E3=81=B2=E3=82=89=E3=81=8C=E3=81=AA=E3=82=AB=E3=82=BF?=
=?UTF-8?Q?=E3=82=AB=E3=83=8A=E6=BC=A2=E5=AD=97?=
[...]
Content-Type: text/plain; charset=EUC-JP
[...]
Content-Transfer-Encoding: base64
[...]

cm9tYWpppNKk6aSspMqlq6W/paulyrTBu/oNCg0K

'''

Is that an email? Maybe you can get it in a format that is supported by the
email package in the standard library.

Dietrich said:
The only problem is that I could not find any standard functionality to
convert between different Japanese coding systems.

Then you didn't look hard enough:
'\x89\xef\x8e\xd0\x8aT\x97v'
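The interactive input that produced this output does not appear in the post. As a rough illustration of the general approach (decode the bytes to Unicode, then encode to the target codec), here is a minimal Python 3 sketch. It assumes the byte string above is Shift_JIS, and the helper name `transcode` is made up for this example; the rest of the thread uses Python 2.

```python
# Python 3 sketch; the thread itself uses Python 2.
# Assumption: the byte string shown in the post is Shift_JIS-encoded text.
SJIS_BYTES = b'\x89\xef\x8e\xd0\x8aT\x97v'

# Codec names as registered in Python's standard codecs module.
CODECS = ['iso2022_jp', 'utf_8', 'euc_jp', 'shift_jis']


def transcode(data, source, target):
    """Decode bytes from `source` to str, then encode the str as `target`."""
    return data.decode(source).encode(target)


if __name__ == '__main__':
    for codec in CODECS:
        print(codec, repr(transcode(SJIS_BYTES, 'shift_jis', codec)))
```

Every conversion goes through Unicode, so any pair of supported codecs can be combined without a dedicated converter per pair.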

See also http://www.amk.ca/python/howto/unicode

Peter

Justin Ezequiel · Feb 19, 2009

Dietrich said:
Are there any functions in Python to convert between different Japanese
coding systems?

I would like to convert between (at least) ISO-2022-JP, UTF-8, EUC-JP
and SJIS. I also need some function to encode / decode base64 encoded
strings.

Example:

email = '''[...]
Subject:
=?UTF-8?Q?romaji=E3=81=B2=E3=82=89=E3=81=8C=E3=81=AA=E3=82=AB=E3=82=BF?=
=?UTF-8?Q?=E3=82=AB=E3=83=8A=E6=BC=A2=E5=AD=97?=
[...]
Content-Type: text/plain; charset=EUC-JP
[...]
Content-Transfer-Encoding: base64
[...]

cm9tYWpppNKk6aSspMqlq6W/paulyrTBu/oNCg0K

'''

_from = contentType  # 'from' is a reserved word in Python
to = 'utf-8'
contentUtf8 = convertCodingSystem(decodeBase64(content), _from, to)

The only problem is that I could not find any standard functionality to
convert between different Japanese coding systems.

Thanks,

Dietrich Bollmann

import base64

ENCODINGS = ['ISO-2022-JP', 'UTF-8', 'EUC-JP', 'SJIS']

def decodeBase64(content):
    # base64.decodestring is the Python 2 name; it returns the raw byte string
    return base64.decodestring(content)

def convertCodingSystem(s, _from, _to):
    # decode the byte string to unicode, then encode to the target charset
    text = s.decode(_from)
    return text.encode(_to)

if __name__ == '__main__':
    content = 'cm9tYWpppNKk6aSspMqlq6W/paulyrTBu/oNCg0K'
    _from = 'EUC-JP'
    for _to in ENCODINGS:
        x = convertCodingSystem(decodeBase64(content), _from, _to)
        print _to, repr(x)
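For readers on Python 3, the same idea might look roughly like the sketch below. It is a port under assumptions, not Justin's code: `base64.b64decode` stands in for `base64.decodestring`, the function names are restyled, and the codec names are the ones registered in the standard codecs module.

```python
import base64

# Codec names as registered in Python's standard codecs module.
ENCODINGS = ['iso2022_jp', 'utf_8', 'euc_jp', 'shift_jis']


def decode_base64(content):
    """Return the raw bytes behind a base64 string (str or bytes input)."""
    return base64.b64decode(content)


def convert_coding_system(data, source, target):
    """Re-encode bytes from the `source` charset to the `target` charset."""
    return data.decode(source).encode(target)


if __name__ == '__main__':
    content = 'cm9tYWpppNKk6aSspMqlq6W/paulyrTBu/oNCg0K'
    raw = decode_base64(content)
    for target in ENCODINGS:
        print(target, repr(convert_coding_system(raw, 'euc_jp', target)))
```

In Python 3 the intermediate value is a `str` and the inputs and outputs are `bytes`, which makes the decode/encode direction harder to get wrong than with Python 2 byte strings.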

Justin Ezequiel · Feb 19, 2009

import email
from email.Header import decode_header
from unicodedata import name as un

MS = '''\
Subject: =?UTF-8?Q?romaji=E3=81=B2=E3=82=89=E3=81=8C=E3=81=AA=E3=82=AB=E3=82=BF?=
Date: Thu, 19 Feb 2009 09:34:56 -0000
MIME-Version: 1.0
Content-Type: text/plain; charset=EUC-JP
Content-Transfer-Encoding: base64

cm9tYWpppNKk6aSspMqlq6W/paulyrTBu/oNCg0K
'''

def get_header(msg, name):
    # decode_header returns (value, charset) pairs; this expects exactly one
    (value, charset), = decode_header(msg.get(name))
    if not charset: return value
    return value.decode(charset)

if __name__ == '__main__':
    msg = email.message_from_string(MS)
    s = get_header(msg, 'Subject')
    print repr(s)
    for c in s:
        try: print un(c)
        except ValueError: print repr(c)
    print

    # decode=True undoes the base64 transfer encoding
    e = msg.get_content_charset()
    b = msg.get_payload(decode=True).decode(e)
    print repr(b)
    for c in b:
        try: print un(c)
        except ValueError: print repr(c)
    print
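The original question had a Subject split across two encoded words, while the unpacking in `get_header` above expects `decode_header` to return exactly one chunk. A possible Python 3 variant that joins however many chunks come back is sketched below. The sample message `RAW` is an assumption assembled from the headers quoted in this thread (with the continuation line indented, as header folding requires), and `get_body` is a hypothetical helper name.

```python
import email
from email.header import decode_header, make_header

# Assumption: the message arrives as bytes, and the second Subject line is a
# folded continuation (it starts with a space).
RAW = (
    b"Subject: =?UTF-8?Q?romaji=E3=81=B2=E3=82=89=E3=81=8C=E3=81=AA=E3=82=AB=E3=82=BF?=\r\n"
    b" =?UTF-8?Q?=E3=82=AB=E3=83=8A=E6=BC=A2=E5=AD=97?=\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=EUC-JP\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"cm9tYWpppNKk6aSspMqlq6W/paulyrTBu/oNCg0K\r\n"
)


def get_header(msg, name):
    """Decode every RFC 2047 chunk of a header and join them into one str."""
    return str(make_header(decode_header(msg[name])))


def get_body(msg):
    """Undo the transfer encoding, then decode with the declared charset."""
    charset = msg.get_content_charset() or 'ascii'
    return msg.get_payload(decode=True).decode(charset)


if __name__ == '__main__':
    msg = email.message_from_bytes(RAW)
    print(get_header(msg, 'Subject'))
    print(repr(get_body(msg)))
```

Once both values are `str`, writing them out in another coding system is a single `.encode('utf-8')` (or `'iso2022_jp'`, `'shift_jis'`, and so on).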
