How to get the encoding of a value?

  • Thread starter Golawala, Moiz M (GE Infrastructure)

Golawala, Moiz M (GE Infrastructure)

Hi all,

I have some data that is encoded in some encoding. I want to find out the encoding of that piece of data. For example
s = u"somedata"
I want to do something like
ThisIsTheEncodingOfS = s.getencoding()

Is there a method that tells me that it is a unicode value if I provide it with a unicode string?


Thanks
Moiz Golawala
 
Diez B. Roggisch

I have some data that is encoded in some encoding. I want to find out the
encoding of that piece of data. For example s = u"somedata"
I want to do something like
ThisIsTheEncodingOfS = s.getencoding()

Is there a method that tells me that it is a unicode value if I provide it
with a unicode string?

You are confusing unicode with strings with a certain encoding.

Unicode is an abstract specification of a huge number of characters,
hopefully covering everything from the close-to-unknown glyphs of some ancient
Himalayan mountain tribe to the commonly used Latin alphabet. There are no
actual numeric values associated with those glyphs.

An encoding, on the other hand, maps certain sets of glyphs to actual numbers,
e.g. the subset of common European language glyphs commonly known as
iso-8859-1, and many more, including utf-8, an encoding that's capable of
encoding all glyphs specified in unicode, at the cost of possibly using
more than one byte per glyph.

Now if you have a unicode object u, you can _encode_ it in a certain
encoding like this:

u.encode("utf-8")

If, on the other hand, you have a string s of known encoding, you can decode
it to a unicode object like this:

s.decode("latin1")
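 
For anyone who wants to try the two calls above, here is a minimal sketch. It assumes Python 3, where the unicode object is `str` and the encoded string is `bytes`; the sample text and variable names are made up for illustration.

```python
# Python 3 sketch: `str` plays the role of the unicode object,
# `bytes` plays the role of the encoded string.
text = "Gr\u00fc\u00dfe"               # "Grüße", five characters

utf8_bytes = text.encode("utf-8")      # encode: unicode -> bytes
latin1_bytes = text.encode("latin1")   # same text, different byte sequence

print(utf8_bytes)     # b'Gr\xc3\xbc\xc3\x9fe'
print(latin1_bytes)   # b'Gr\xfc\xdfe'

# decode: bytes -> unicode, using the encoding the bytes were written in
print(utf8_bytes.decode("utf-8") == text)      # True
print(latin1_bytes.decode("latin1") == text)   # True
```

The same five characters come out as seven bytes in utf-8 and five in latin1, which is why the bytes alone do not tell you what the text was.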

That's the basics. Now to your actual question: your example makes no sense,
as you have a unicode object, which lacks any encoding whatsoever. And
unfortunately, if you have a string instead of a unicode object, you can
only guess what encoding it has. If you are lucky, that works, but no one
can guarantee that it works out, neither in Python nor in other
programming languages.

A common approach to guessing the encoding of said string is to try
something like this:

s = <some string with unknown encoding>
encodings = ['ascii', 'latin1', 'utf-8', ....]  # list of encodings you expect
for e in encodings:
    try:
        # a clean round trip means e is at least a possible encoding of s
        if s == s.decode(e).encode(e):
            break
    except UnicodeError:
        pass
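 
A runnable version of the loop above might look like the sketch below. It assumes Python 3 (`bytes` input), and the function name `first_round_trip_match` is made up for this example. It only reports the first candidate that survives a round trip, which is not proof that the guess is right.

```python
def first_round_trip_match(data, candidates):
    """Return the first encoding whose decode/encode round trip reproduces data."""
    for name in candidates:
        try:
            if data.decode(name).encode(name) == data:
                return name
        except UnicodeError:
            # The bytes are not valid in this encoding; try the next one.
            continue
    return None  # no candidate matched


print(first_round_trip_match(b"plain text", ["ascii", "latin1", "utf-8"]))  # ascii
print(first_round_trip_match(b"\xff\xfe", ["ascii", "utf-8"]))              # None
```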
 
Peter Otten

Diez said:
A common approach to guessing the encoding of said string is to try
something like this:

s = <some string with unknown encoding>
encodings = ['ascii', 'latin1', 'utf-8', ....]  # list of encodings you expect
for e in encodings:
    try:
        if s == s.decode(e).encode(e):
            break
    except UnicodeError:
        pass

However, you must be very careful with the order in which to test the
encodings. The example code will never detect "utf-8":
True

This equality holds for every encoding where one byte is one character and
uses the full range of 256 bytes/characters. You cannot discriminate
between such encodings using the above method:
False
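
The interpreter input that produced the two results above did not survive in this copy of the thread. The sketch below is not Peter's original session, only one way to reproduce the same two observations. It assumes Python 3, and the choice of `iso8859_15` and `cp437` as further one-byte encodings is mine.

```python
utf8_data = "Gr\u00fc\u00dfe".encode("utf-8")   # "Grüße" as UTF-8 bytes

# latin1 is tested before utf-8 and its round trip already succeeds,
# so the loop stops there and never reaches utf-8.
latin1_matches = utf8_data == utf8_data.decode("latin1").encode("latin1")
print(latin1_matches)   # True

# Every possible byte value round-trips under a full one-byte encoding...
all_bytes = bytes(range(256))
full_one_byte = ["latin1", "iso8859_15", "cp437"]
round_trips = [all_bytes == all_bytes.decode(e).encode(e) for e in full_one_byte]
print(round_trips)      # [True, True, True]

# ...although the decoded text differs, so a match proves nothing.
same_text = all_bytes.decode("latin1") == all_bytes.decode("cp437")
print(same_text)        # False
```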

A statistical approach seems more promising, e. g. some smart variant of
"looking for umlauts" in a text known to be German.
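
A very naive take on "looking for umlauts" could be sketched like this. It assumes Python 3; the helper names `umlaut_score` and `guess_german` and the candidate list are invented for the example, and a real detector would need far better statistics than a simple count.

```python
# Characters we expect to see in German text: ä ö ü Ä Ö Ü ß
GERMAN_LETTERS = set("\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\u00df")


def umlaut_score(data, encoding):
    """Count decoded characters that are German umlauts or sharp s; -1 if undecodable."""
    try:
        text = data.decode(encoding)
    except UnicodeError:
        return -1
    return sum(1 for ch in text if ch in GERMAN_LETTERS)


def guess_german(data, candidates):
    """Pick the candidate whose decoding contains the most German letters."""
    return max(candidates, key=lambda enc: umlaut_score(data, enc))


sample = "Gr\u00fc\u00dfe aus K\u00f6ln"     # "Grüße aus Köln"
candidates = ["utf-8", "latin1", "cp437"]
print(guess_german(sample.encode("utf-8"), candidates))    # utf-8
print(guess_german(sample.encode("latin1"), candidates))   # latin1
```

With text that contains no umlauts at all, every decodable candidate scores zero and the first one in the list wins, so the order of the candidates still matters.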

Peter
 
Alex Martelli

Diez B. Roggisch said:
A common approach to guessing the encoding of said string is to try
something like this:

s = <some string with unknown encoding>
encodings = ['ascii', 'latin1', 'utf-8', ....]  # list of encodings you expect
for e in encodings:
    try:
        if s == s.decode(e).encode(e):
            break
    except UnicodeError:
        pass

Yeah, but it doesn't work. iso-8859-x would break for any value of x;
can't tell this way if it was latin-1, or any of the others...
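
To make that concrete, here is a small sketch, assuming Python 3 and two bytes picked by me for illustration. The same bytes decode cleanly and round-trip under several iso-8859-x encodings, yet stand for different characters in each.

```python
raw = b"\xa4\xe6"

results = {}
for name in ("iso8859_1", "iso8859_2", "iso8859_15"):
    text = raw.decode(name)
    # Record the code points we got and whether the round trip succeeded.
    results[name] = ([hex(ord(c)) for c in text], text.encode(name) == raw)

for name, (code_points, round_trip_ok) in results.items():
    print(name, code_points, round_trip_ok)
```

All three round trips succeed, but 0xA4 is the currency sign in iso-8859-1 and the euro sign in iso-8859-15, and 0xE6 is "æ" in iso-8859-1 and "ć" in iso-8859-2.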


Alex
 
Diez B. Roggisch

Alex said:
Yeah, but it doesn't work. iso-8859-x would break for any value of x;
can't tell this way if it was latin-1, or any of the others...

You and Peter are right of course: the first try should be utf-8. And of course,
a one-byte-based encoding will always match. I know that there are tools
out there like recode that try to make an educated guess by taking the
context of non-ascii chars into account and the like.
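
A sketch of that ordering, assuming Python 3: try the strict encodings first and keep a one-byte encoding only as a last resort. The function name `guess_encoding` and the `fallback` parameter are invented for the example, and the fallback is a pure guess, since a one-byte encoding accepts anything.

```python
def guess_encoding(data, fallback="latin1"):
    """Try strict encodings first; fall back to a one-byte encoding that always matches."""
    for name in ("ascii", "utf-8"):
        try:
            data.decode(name)
            return name
        except UnicodeError:
            # Not valid in this encoding; try the next, less strict one.
            pass
    return fallback  # always decodes, so this says nothing about correctness


text = "Gr\u00fc\u00dfe"   # "Grüße"
print(guess_encoding(b"plain text"))            # ascii
print(guess_encoding(text.encode("utf-8")))     # utf-8
print(guess_encoding(text.encode("latin1")))    # latin1
```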
 
Piet van Oostrum

DBR> You are confusing unicode with strings with a certain encoding.

DBR> Unicode is an abstract specification of a huge number of characters,
DBR> hopefully covering even the close-to-unknown glyphs of some ancient
DBR> Himalayan mountain tribe to the commonly used Latin alphabet. There are no
DBR> actual numeric values associated with those glyphs.

You mix up characters and glyphs which makes it confusing.
There are no numeric values associated with glyphs in Unicode, but there
are numeric values associated with abstract characters.

(http://www.unicode.org/standard/WhatIsUnicode.html)
Unicode provides a unique number for every character, no matter what the
platform, no matter what the program, no matter what the language.

These numbers are called `code points'. (It says `unique' above, but later
they relax that).

But you are right regarding the encodings. The Unicode code points can be
encoded in different ways e.g. with the UTF-8 encoding.
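
As a small illustration of the difference between a code point and its encodings, here is a sketch assuming Python 3. The euro sign and the three encodings are just examples I picked.

```python
euro = "\u20ac"            # EURO SIGN
code_point = ord(euro)     # the number Unicode assigns to the character
print(hex(code_point))     # 0x20ac

# The same code point becomes different byte sequences in different encodings.
encoded = {name: euro.encode(name).hex()
           for name in ("utf-8", "utf-16-be", "iso8859_15")}
print(encoded)   # {'utf-8': 'e282ac', 'utf-16-be': '20ac', 'iso8859_15': 'a4'}
```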
 
Diez B. Roggisch

You mix up characters and glyphs which makes it confusing.
There are no numeric values associated with glyphs in Unicode, but there
are numeric values associated with abstract characters.
(http://www.unicode.org/standard/WhatIsUnicode.html)

Unicode provides a unique number for every character, no matter what the
platform, no matter what the program, no matter what the language.

These numbers are called `code points'. (It says `unique' above, but later
they relax that).

But you are right regarding the encodings. The Unicode code points can be
encoded in different ways e.g. with the UTF-8 encoding.

Just checked - yup, you're right: a character might in fact be composed of
several glyphs. So they are closely related (especially in your common
western language), but not the same.

Sheesh, that stuff is always a bit more complicated than one actually thinks
- I usually get the applicational part of it right, but the inner details
of unicode are still foggy...
 
