Encoding sniffer?

Andreas Jung

Does anyone know of a Python module that is able to sniff the encoding of
text? Please: I know that there is no reliable way to do this, but I need
something that works for most cases... so please, no discussion about
the sense of such a module and approach.

Andreas

garabik-news-2005-05

Andreas Jung said:

Does anyone know of a Python module that is able to sniff the encoding of
text? Please: I know that there is no reliable way to do this, but I need
something that works for most cases... so please, no discussion about
the sense of such a module and approach.

Depends on what exactly you need. One approach is pyenca; the other is:

def try_encoding(s, encodings):
    "try to guess the encoding of string s, testing encodings given in second parameter"
    for enc in encodings:
        try:
            unicode(s, enc)
            return enc
        except UnicodeDecodeError:
            pass
    return None

print try_encoding(text, ['ascii', 'utf-8', 'iso8859_1', 'cp1252', 'macroman'])


depending on what language and encodings you expect the text to be in,
the first or the second approach is better


--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!

Diez B. Roggisch

print try_encoding(text, ['ascii', 'utf-8', 'iso8859_1', 'cp1252', 'macroman'])

I've fallen into that trap before - it won't work after the iso8859_1.
The reason is that an eight-bit encoding has all 256 code points
assigned (usually; there are exceptions, but you have to be lucky to have
a string that contains a value not assigned in one of them - which is
highly unlikely).

AFAIK iso-8859-1 has all code points taken - so you won't get beyond it
in your example.
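In modern Python 3 terms (a sketch added here for illustration, not from the original thread), the trap is easy to demonstrate: latin-1 assigns a character to every byte value, so decoding can never fail, and any encoding listed after it never gets tried:

```python
# Sketch (Python 3): iso8859_1 / latin-1 maps every byte value
# 0x00-0xFF to a code point, so decoding arbitrary bytes with it
# can never raise UnicodeDecodeError.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso8859_1")
assert len(decoded) == 256  # every single byte decoded successfully
```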


Regards,

Diez

garabik-news-2005-05

Diez B. Roggisch said:
print try_encoding(text, ['ascii', 'utf-8', 'iso8859_1', 'cp1252', 'macroman'])

I've fallen into that trap before - it won't work after the iso8859_1.
The reason is that an eight-bit encoding has all 256 code points
assigned (usually; there are exceptions, but you have to be lucky to have
a string that contains a value not assigned in one of them - which is
highly unlikely).

AFAIK iso-8859-1 has all code points taken - so you won't get beyond it
in your example.

I pasted from the wrong file :)
See my previous posting (a few days ago) - what I did was to implement
an iso8859_1_ncc encoding (iso8859_1 without control codes), and
the line should have been

try_encoding(text, ['ascii', 'utf-8', 'iso8859_1_ncc', 'cp1252', 'macroman'])

where iso8859_1_ncc.py is the same as iso8859_1.py from the Python
distribution, with this one line different:

decoding_map = codecs.make_identity_dict(range(32, 128) + range(128 + 32, 256))
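The same idea can be sketched without patching the codec files. A minimal Python 3 illustration (the name decode_ncc is hypothetical, not the poster's module): latin-1 minus the control ranges, so that bytes falling there make the decode fail and let a later candidate encoding be tried.

```python
# Sketch (Python 3) of the iso8859_1_ncc idea: latin-1 minus the
# control ranges 0x00-0x1F and 0x80-0x9F. Bytes in those ranges
# raise UnicodeDecodeError instead of silently succeeding.
_ALLOWED = set(range(32, 128)) | set(range(128 + 32, 256))

def decode_ncc(data):
    """Decode bytes as latin-1, but reject control codes."""
    for i, b in enumerate(data):
        if b not in _ALLOWED:
            raise UnicodeDecodeError("iso8859_1_ncc", bytes(data), i, i + 1,
                                     "control code not allowed")
    return data.decode("iso8859_1")
```

A guesser can then call decode_ncc in place of a plain latin-1 decode, so cp1252 (which does assign printable characters in 0x80-0x9F) still gets a chance afterwards.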


--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
 
D

Diez B. Roggisch

garabik-news-2005-05 said:
print try_encoding(text, ['ascii', 'utf-8', 'iso8859_1', 'cp1252', 'macroman'])

I've fallen into that trap before - it won't work after the iso8859_1.
The reason is that an eight-bit encoding has all 256 code points
assigned (usually; there are exceptions, but you have to be lucky to have
a string that contains a value not assigned in one of them - which is
highly unlikely).

AFAIK iso-8859-1 has all code points taken - so you won't get beyond it
in your example.


I pasted from the wrong file :)
See my previous posting (a few days ago) - what I did was to implement
an iso8859_1_ncc encoding (iso8859_1 without control codes), and
the line should have been

try_encoding(text, ['ascii', 'utf-8', 'iso8859_1_ncc', 'cp1252', 'macroman'])

where iso8859_1_ncc.py is the same as iso8859_1.py from the Python
distribution, with this one line different:

decoding_map = codecs.make_identity_dict(range(32, 128) + range(128 + 32, 256))

Ok, I can see that. But still, there would be quite a few overlapping
code points.

I think what the OP (and many more people) wants is something that
tries to guess encodings based on probabilities for certain trigrams
containing an umlaut, for example.

There seems to be a tool called "konwert" out there that does such
things, and recode has some guessing stuff too, AFAIK - but I haven't
seen any special Python modules for it so far.
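A toy Python 3 sketch of that statistical idea (hypothetical code, far cruder than a real character-frequency or trigram model): decode under each candidate and score the result by how plausible its characters look, rather than accepting the first decode that merely succeeds.

```python
def guess_encoding(data, encodings):
    """Pick the candidate encoding whose decoded text 'looks' best."""
    def score(text):
        # crude plausibility score: letters and whitespace count for,
        # everything else (controls, stray symbols) counts against
        return sum(1 if (ch.isalpha() or ch.isspace()) else -1 for ch in text)

    best_enc, best_score = None, float("-inf")
    for enc in encodings:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue
        s = score(text)
        if s > best_score:
            best_enc, best_score = enc, s
    return best_enc
```

With this, UTF-8 German text wins over its latin-1 misreading even though both decodes succeed, which the first-success approach cannot do.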

Diez

Ralf Muschall

Diez said:
AFAIK iso-8859-1 has all codepoints taken - so you won't go beyond that
in your example.

IIRC, the code points 128-159 (i.e. control codes with the high bit set)
are unused.
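(A quick Python 3 check, added for illustration: even though the standard leaves 0x80-0x9F undefined as printable characters, Python's iso8859_1 codec still maps those bytes to the C1 control characters, which is why a plain latin-1 decode never fails on them.)

```python
# Byte 0x85 falls in the 128-159 range, yet latin-1 happily
# decodes it to the C1 control character U+0085.
ch = bytes([0x85]).decode("iso8859_1")
assert ch == "\x85"
```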

Ralf
 
