Encoding sniffer?

Andreas Jung

Does anyone know of a Python module that is able to sniff the encoding of
text? Please: I know that there is no reliable way to do this, but I need
something that works for most cases... so please, no discussion about
the sense of such a module and approach.

Andreas

garabik-news-2005-05

Andreas Jung said:

Does anyone know of a Python module that is able to sniff the encoding of
text? Please: I know that there is no reliable way to do this, but I need
something that works for most cases... so please, no discussion about
the sense of such a module and approach.

Depends on what exactly you need. One approach is pyenca; the other is:

def try_encoding(s, encodings):
    "try to guess the encoding of string s, testing encodings given in second parameter"
    for enc in encodings:
        try:
            unicode(s, enc)
            return enc
        except UnicodeDecodeError:
            pass
    return None

print try_encoding(text, ['ascii', 'utf-8', 'iso8859_1', 'cp1252', 'macroman'])


depending on what language and encodings you expect the text to be in,
the first or the second approach is better


--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!

Diez B. Roggisch

print try_encoding(text, ['ascii', 'utf-8', 'iso8859_1', 'cp1252', 'macroman'])

I've fallen into that trap before - it won't work after the iso8859_1.
The reason is that an eight-bit encoding has all 256 code points
assigned (usually; there are exceptions, but you have to be lucky to have
a string that contains a value not assigned in one of them - which is
highly unlikely).

AFAIK iso-8859-1 has all code points taken - so you won't get beyond it
in your example.
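In modern Python 3 terms (a sketch added here for illustration, not from the original thread), the trap is easy to demonstrate: latin-1 assigns a character to every byte value, so decoding can never fail, and any encoding listed after it never gets tried:

```python
# Sketch (Python 3): iso8859_1 / latin-1 maps every byte value
# 0x00-0xFF to a code point, so decoding arbitrary bytes with it
# can never raise UnicodeDecodeError.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso8859_1")
assert len(decoded) == 256  # every single byte decoded successfully
```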


Regards,

Diez

garabik-news-2005-05

Diez B. Roggisch said:
print try_encoding(text, ['ascii', 'utf-8', 'iso8859_1', 'cp1252', 'macroman'])

I've fallen into that trap before - it won't work after the iso8859_1.
The reason is that an eight-bit encoding has all 256 code points
assigned (usually; there are exceptions, but you have to be lucky to have
a string that contains a value not assigned in one of them - which is
highly unlikely).

AFAIK iso-8859-1 has all code points taken - so you won't get beyond it
in your example.

I pasted from the wrong file :)
See my previous posting (a few days ago) - what I did was to implement
an iso8859_1_ncc encoding (iso8859_1 without control codes), and
the line should have been

try_encoding(text, ['ascii', 'utf-8', 'iso8859_1_ncc', 'cp1252', 'macroman'])

where iso8859_1_ncc.py is the same as iso8859_1.py from the Python
distribution, with this one line different:

decoding_map = codecs.make_identity_dict(range(32, 128) + range(128 + 32, 256))
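The same idea can be sketched without patching the codec files. A minimal Python 3 illustration (the name decode_ncc is hypothetical, not the poster's module): latin-1 minus the control ranges, so that bytes falling there make the decode fail and let a later candidate encoding be tried.

```python
# Sketch (Python 3) of the iso8859_1_ncc idea: latin-1 minus the
# control ranges 0x00-0x1F and 0x80-0x9F. Bytes in those ranges
# raise UnicodeDecodeError instead of silently succeeding.
_ALLOWED = set(range(32, 128)) | set(range(128 + 32, 256))

def decode_ncc(data):
    """Decode bytes as latin-1, but reject control codes."""
    for i, b in enumerate(data):
        if b not in _ALLOWED:
            raise UnicodeDecodeError("iso8859_1_ncc", bytes(data), i, i + 1,
                                     "control code not allowed")
    return data.decode("iso8859_1")
```

A guesser can then call decode_ncc in place of a plain latin-1 decode, so cp1252 (which does assign printable characters in 0x80-0x9F) still gets a chance afterwards.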


--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
 
D

Diez B. Roggisch

garabik-news-2005-05 said:
print try_encoding(text, ['ascii', 'utf-8', 'iso8859_1', 'cp1252', 'macroman'])

I've fallen into that trap before - it won't work after the iso8859_1.
The reason is that an eight-bit encoding has all 256 code points
assigned (usually; there are exceptions, but you have to be lucky to have
a string that contains a value not assigned in one of them - which is
highly unlikely).

AFAIK iso-8859-1 has all code points taken - so you won't get beyond it
in your example.


I pasted from the wrong file :)
See my previous posting (a few days ago) - what I did was to implement
an iso8859_1_ncc encoding (iso8859_1 without control codes), and
the line should have been

try_encoding(text, ['ascii', 'utf-8', 'iso8859_1_ncc', 'cp1252', 'macroman'])

where iso8859_1_ncc.py is the same as iso8859_1.py from the Python
distribution, with this one line different:

decoding_map = codecs.make_identity_dict(range(32, 128) + range(128 + 32, 256))

Ok, I can see that. But still, there would be quite a few overlapping
code points.

I think what the OP (and many more people) wants is something that
tries to guess encodings based on probabilities for certain trigrams
containing an umlaut, for example.

There seems to be a tool called "konwert" out there that does such
things, and recode has some guessing stuff too, AFAIK - but I haven't
seen any special Python modules for it so far.
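A toy Python 3 sketch of that statistical idea (hypothetical code, far cruder than a real character-frequency or trigram model): decode under each candidate and score the result by how plausible its characters look, rather than accepting the first decode that merely succeeds.

```python
def guess_encoding(data, encodings):
    """Pick the candidate encoding whose decoded text 'looks' best."""
    def score(text):
        # crude plausibility score: letters and whitespace count for,
        # everything else (controls, stray symbols) counts against
        return sum(1 if (ch.isalpha() or ch.isspace()) else -1 for ch in text)

    best_enc, best_score = None, float("-inf")
    for enc in encodings:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue
        s = score(text)
        if s > best_score:
            best_enc, best_score = enc, s
    return best_enc
```

With this, UTF-8 German text wins over its latin-1 misreading even though both decodes succeed, which the first-success approach cannot do.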

Diez

Ralf Muschall

Diez said:
AFAIK iso-8859-1 has all codepoints taken - so you won't go beyond that
in your example.

IIRC, the code points 128-159 (i.e. control codes with the high bit set)
are unused.
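(A quick Python 3 check, added for illustration: even though the standard leaves 0x80-0x9F undefined as printable characters, Python's iso8859_1 codec still maps those bytes to the C1 control characters, which is why a plain latin-1 decode never fails on them.)

```python
# Byte 0x85 falls in the 128-159 range, yet latin-1 happily
# decodes it to the C1 control character U+0085.
ch = bytes([0x85]).decode("iso8859_1")
assert ch == "\x85"
```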

Ralf
 
