recycling internationalized garbage


aaronwmail-usenet

Hi folks,

Please help me with international string issues:
I put together an AJAX discography search engine

http://www.xfeedme.com/discs/discography.html

using data from the FreeDB music database

http://www.freedb.org/

Unfortunately FreeDB has a lot of junk in it, including
randomly mixed character encodings for international
strings. As an expediency I decided to just delete all
characters that weren't ascii, so I could get the thing
running. Now I look through the log files and notice that
a certain category of user immediately homes in on this
and finds it amusing to see how badly I've mangled
the strings :(. I presume they chuckle and make
disparaging remarks about "united states of ascii"
and then leave, never to return.

Question: what is a good strategy for taking an 8bit
string of unknown encoding and recovering the largest
amount of reasonable information from it (translated to
utf8 if needed)? The string might be in any of the
myriad encodings that predate unicode. Has anyone
done this in Python already? The output must be clean
utf8 suitable for arbitrary xml parsers.

Thanks, -- Aaron Watters

===

As someone once remarked to Schubert
"take me to your leider" (sorry about that).
-- Tom Lehrer
 

Fredrik Lundh

aaronwmail-usenet said:
Question: what is a good strategy for taking an 8bit
string of unknown encoding and recovering the largest
amount of reasonable information from it (translated to
utf8 if needed)? The string might be in any of the
myriad encodings that predate unicode. Has anyone
done this in Python already? The output must be clean
utf8 suitable for arbitrary xml parsers.

some alternatives:

braindead bruteforce:

try to do strict decoding as utf-8. if you succeed, you have a utf-8
string. if not, assume iso-8859-1 (see the sketch after these links).

slightly smarter bruteforce:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/163743

more advanced (but possibly not good enough for very short texts):

http://chardet.feedparser.org/
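
a minimal sketch of the brute-force fallback described above (Python 2,
as elsewhere in the thread; the function name is just an illustration):

def to_utf8(s):
    # strict utf-8 decode: succeeds only if s really is utf-8 (or plain ascii)
    try:
        return s.decode('utf-8').encode('utf-8')
    except UnicodeDecodeError:
        # otherwise assume iso-8859-1, which accepts any byte sequence
        return s.decode('iso-8859-1').encode('utf-8')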

</F>
 

garabik-news-2005-05

Fredrik Lundh said:
some alternatives:

braindead bruteforce:

try to do strict decoding as utf-8. if you succeed, you have an utf-8
string. if not, assume iso-8859-1.

That was a mistake I made once.
Do not use iso-8859-1 as the Python codec; instead, create your own codec
called e.g. iso8859-1-ncc like this (just a sketch):

import codecs

# identity mapping for printable bytes only; control bytes (0-31, 128-159)
# are left unmapped, so strict decoding fails on them
decoding_map = codecs.make_identity_dict(range(32, 128) + range(128 + 32, 256))
decoding_map.update({})   # room for any extra mappings
encoding_map = codecs.make_encoding_map(decoding_map)

and then use:

def try_encoding(s, encodings):
    "try to guess the encoding of string s, testing the encodings given in the second parameter"
    for enc in encodings:
        try:
            unicode(s, enc)   # strict decode; fails on bytes the codec rejects
            return enc
        except UnicodeDecodeError:
            pass
    return None


guessed_encoding = try_encoding(text, ['utf-8', 'iso8859-1-ncc', 'cp1252', 'macroman'])


it seems to work surprisingly well, if you know approximately the
language(s) the text is expected to be in (e.g. replace cp1252 with
cp1250, iso8859-1-ncc with iso8859-2-ncc for central european languages)
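
For anyone who wants to try the above, a minimal registration sketch for such
a codec in Python 2 (the class names and the _search helper are illustrative,
not from the original post):

import codecs

# printable bytes only; C0/C1 control bytes stay unmapped
decoding_map = codecs.make_identity_dict(range(32, 128) + range(128 + 32, 256))
encoding_map = codecs.make_encoding_map(decoding_map)

class Codec(codecs.Codec):
    def encode(self, input, errors='strict'):
        return codecs.charmap_encode(input, errors, encoding_map)
    def decode(self, input, errors='strict'):
        return codecs.charmap_decode(input, errors, decoding_map)

class StreamWriter(Codec, codecs.StreamWriter):
    pass

class StreamReader(Codec, codecs.StreamReader):
    pass

def _search(name):
    # codec lookup lowercases the name; accept the underscore form as well
    if name in ('iso8859-1-ncc', 'iso8859_1_ncc'):
        c = Codec()
        return (c.encode, c.decode, StreamReader, StreamWriter)
    return None

codecs.register(_search)

# after this, unicode(s, 'iso8859-1-ncc') raises UnicodeDecodeError whenever
# s contains control bytes, e.g. the 0x80-0x9f range that shows up when
# cp1252 text has been mislabelled as latin-1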

--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
 

Ross Ridge

aaronwmail-usenet said:
Question: what is a good strategy for taking an 8bit
string of unknown encoding and recovering the largest
amount of reasonable information from it (translated to
utf8 if needed)?

Copy the string unmodified to the WWW page and ensure your page doesn't
identify the encoding used. That way it becomes the browser's problem,
and if the user reading the page can understand the language the string
is written in, there's a very good chance the browser will display it
correctly. Unfortunately, that's how text like this is supposed to be
displayed.

aaronwmail-usenet said:
The output must be clean utf8 suitable for arbitrary xml parsers.

Oh, you're screwed then.

Ross Ridge
 

aaronwmail-usenet

Regarding cleaning of mixed string encodings in
the discography search engine

http://www.xfeedme.com/discs/discography.html

Following </F>'s suggestion I came up with this:

import codecs

utf8enc = codecs.getencoder("utf8")
utf8dec = codecs.getdecoder("utf8")
iso88591dec = codecs.getdecoder("iso-8859-1")

def checkEncoding(s):
    try:
        (uni, dummy) = utf8dec(s)               # keep it if it already is valid utf-8
    except UnicodeDecodeError:
        (uni, dummy) = iso88591dec(s, 'ignore') # otherwise fall back to latin-1
    (out, dummy) = utf8enc(uni)
    return out

This works nicely for Nordic stuff like
"björgvin halldórsson - gunnar Þórðarson",
but Russian seems to turn into garbage
and I have no idea about Chinese.

Unless someone has any other ideas I'm
giving up now.
-- Aaron Watters

===

In theory, theory is the same as practice.
In practice it's more complicated than that.
-- folklore
 

Ross Ridge

aaronwmail-usenet said:
try:
    (uni, dummy) = utf8dec(s)
except UnicodeDecodeError:
    (uni, dummy) = iso88591dec(s, 'ignore')

Is there really any point in even trying to decode with UTF-8? You
might as well just assume ISO 8859-1.

Ross Ridge
 

Martin v. Löwis

Ross said:
Is there really any point in even trying to decode with UTF-8? You
might as well just assume ISO 8859-1.

The point is that you can tell UTF-8 reliably. If the data decodes
as UTF-8, it *is* UTF-8, because no other encoding in the world
produces the same byte sequences (except for ASCII, which is
an UTF-8 subset).

So if it is not UTF-8, the guessing starts.
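
(To illustrate with a typical single-byte-encoded string; the example below
is an added illustration, not from the original post:)

s = "caf\xe9 au lait"         # latin-1 bytes for "café au lait"
try:
    s.decode("utf-8")         # the lone 0xe9 byte is not a valid utf-8 sequence
except UnicodeDecodeError:
    print "not utf-8, the guessing starts"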

Regards,
Martin
 

Serge Orlov

aaronwmail-usenet said:
Unless someone has any other ideas I'm
giving up now.

Fredrik also suggested http://chardet.feedparser.org/, which is a port of
Mozilla's character detection algorithm to pure Python. It works pretty
well for web pages; I haven't seen garbled Russian text in years of
using Mozilla/Firefox. You should definitely try it.
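
(A minimal sketch of how it might be plugged in, using the detect() call
documented on that page; the function name and the iso-8859-1 fallback are
assumptions, not part of chardet:)

import chardet

def guess_and_decode(s):
    # detect() returns something like {'encoding': 'EUC-JP', 'confidence': 0.98}
    guess = chardet.detect(s)
    enc = guess['encoding'] or 'iso-8859-1'   # fall back if detection gives up
    return s.decode(enc, 'replace').encode('utf-8')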

-- Serge
 

Ross Ridge

Martin said:
The point is that you can tell UTF-8 reliably. If the data decodes
as UTF-8, it *is* UTF-8, because no other encoding in the world
produces the same byte sequences (except for ASCII, which is
an UTF-8 subset).

It should be obvious that any 8-bit single-byte character set can
produce byte sequences that are valid in UTF-8. In fact I can't think
of any multi-byte encoding that can't produce a valid UTF-8 byte
sequence.

Ross Ridge
 

Fredrik Lundh

RFC 3629 says "fairly reliably" rather than "reliably", but they mean
the same thing...

or as the RFC puts it,

"the probability that a string of characters in any other encoding
appears as valid UTF-8 is low, diminishing with increasing string
length".

:::

Ross said:
It should be obvious that any 8-bit single-byte character set can
produce byte sequences that are valid in UTF-8.

it should be fairly obvious that you don't know much about UTF-8...

</F>
 

Ross Ridge

Ross said:
It should be obvious that any 8-bit single-byte character set can
produce byte sequences that are valid in UTF-8.

Fredrik said:
it should be fairly obvious that you don't know much about UTF-8...

Despite this malicious and false accusation, your post only confirms
what I wrote above is true and what Martin wrote was false. Even with
the desperate and absurd semantic game you tried to play, like falsely
equating "fairly reliably" with "reliably", in a database as large as
this a low probability of failure does not guarantee "if the data
decodes as UTF-8, it *is* UTF-8".

Ross Ridge
 

Fredrik Lundh

Ross said:
Despite this malicious and false accusation, your post only confirms
what I wrote above is true and what Martin wrote was false. Even with
the desperate and absurd semantic game you tried to play, like falsely
equating "fairly reliably" with "reliably", in a database as large as
this a low probability of failure does not guarantee "if the data
decodes as UTF-8, it *is* UTF-8".

are you a complete idiot, or do you only play one on the internet?

</F>
 

Martin v. Löwis

Ross said:
It should be obvious that any 8-bit single-byte character set can
produce byte sequences that are valid in UTF-8.

It is certainly possible to interpret UTF-8 data as if they were
in a specific single-byte encoding. However, the text you then
obtain is not meaningful in any language of the world.

So "valid" yes; "meaningful" no. Therefore, for all practical
purposes, 8-bit single-byte characters sets *will not* produce
byte sequences that are valid in UTF-8 (although they could -
it just won't happen).
Ross said:
In fact I can't think of any multi-byte encoding that can't produce
a valid UTF-8 byte sequence.

The same reasoning applies for them.

Regards,
Martin
 

Fredrik Lundh

Martin v. Löwis said:
It is certainly possible to interpret UTF-8 data as if they were
in a specific single-byte encoding. However, the text you then
obtain is not meaningful in any language of the world.

Except those languages that use words consisting of runs of accented
letters immediately followed by either undefined characters or odd
symbols, and never use accented characters in any other way.

(Given that the freedb spec says that it's okay to mix iso-8859-1 with
utf-8 on a record-by-record level, one might assume that they've decided
that the number of bands using such languages is very close to zero...)

</F>
 

Ross Ridge

Martin said:
So "valid" yes; "meaningful" no. Therefore, for all practical
purposes, 8-bit single-byte characters sets *will not* produce
byte sequences that are valid in UTF-8 (although they could -
it just won't happen).


The same reasoning applies for them.

While your reasoning may apply to European single-byte character
sets, it doesn't apply as well to Far East multi-byte encodings. Take
ISO-2022-JP (RFC 1468), for example, where any string is valid UTF-8 as
far as Python is concerned. About 1% of the EUC-JP encoded words and
phrases listed in EDICT, a Japanese-English dictionary, decode as valid
UTF-8 strings. I get similar results with CEDICT, a Chinese-English
dictionary: about 1% for the Big5 encoded version of the file and about
4.5% for the GB 2312 version.
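
(A quick way to see the ISO-2022-JP case, in the Python 2 used elsewhere in
the thread; the sample word is only an illustration:)

jp = u'\u65e5\u672c\u8a9e'.encode('iso-2022-jp')   # "nihongo", i.e. "Japanese"
# iso-2022-jp uses only 7-bit bytes plus ESC sequences, so a strict utf-8
# decode "succeeds" -- it just yields escape-code gibberish, not the original
decoded = jp.decode('utf-8')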

It would be nearly impossible to find all the strings in FreeDB that
decode as UTF-8 but aren't really encoded in UTF-8, but they do exist.
Examples I managed to find are the GB 2312 encoded TTITLE5 and
TTITLE13 records of disc id 020f5210.

Ross Ridge
 
