python encoding bug?

  • Thread starter garabik-news-2005-05
  • Start date
G

garabik-news-2005-05

I was playing with python encodings and noticed this:

garabik@lancre:~$ python2.4
Python 2.4 (#2, Dec 3 2004, 17:59:05)
[GCC 3.3.5 (Debian 1:3.3.5-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
U+009D is NOT a valid unicode character (it is not even a iso8859_1
valid character)

The same happens if I use 'latin-1' instead of 'iso8859_1'.

This caught me by surprise, since I was doing some heuristics guessing
string encodings, and 'iso8859_1' gave no errors even if the input
encoding was different.

Is this a known behaviour, or I discovered a terrible unknown bug in python encoding
implementation that should be immediately reported and fixed? :)


happy new year,

--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
 
V

Vincent Wehren

|
| I was playing with python encodings and noticed this:
|
| garabik@lancre:~$ python2.4
| Python 2.4 (#2, Dec 3 2004, 17:59:05)
| [GCC 3.3.5 (Debian 1:3.3.5-2)] on linux2
| Type "help", "copyright", "credits" or "license" for more information.
| >>> unicode('\x9d', 'iso8859_1')
| u'\x9d'
| >>>
|
| U+009D is NOT a valid unicode character (it is not even a iso8859_1
| valid character)

That statement is not entirely true. If you check the current
UnicodeData.txt (on http://www.unicode.org/Public/UNIDATA/) you'll find:

009D;<control>;Cc;0;BN;;;;;N;OPERATING SYSTEM COMMAND;;;;

Regards,

Vincent Wehren

|
| The same happens if I use 'latin-1' instead of 'iso8859_1'.
|
| This caught me by surprise, since I was doing some heuristics guessing
| string encodings, and 'iso8859_1' gave no errors even if the input
| encoding was different.
|
| Is this a known behaviour, or I discovered a terrible unknown bug in
python encoding
| implementation that should be immediately reported and fixed? :)
|
|
| happy new year,
|
| --
| -----------------------------------------------------------
|| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
|| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
| -----------------------------------------------------------
| Antivirus alert: file .signature infected by signature virus.
| Hi! I'm a signature virus! Copy me into your signature file to help me
spread!
 
B

Benjamin Niemann

I was playing with python encodings and noticed this:

garabik@lancre:~$ python2.4
Python 2.4 (#2, Dec 3 2004, 17:59:05)
[GCC 3.3.5 (Debian 1:3.3.5-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
U+009D is NOT a valid unicode character (it is not even a iso8859_1
valid character)

It *IS* a valid unicode and iso8859-1 character, so the behaviour of the
python decoder is correct. The range U+0080 - U+009F is used for various
control characters. There's rarely a valid use for these characters in
documents, so you can be pretty sure that a document using these characters
is windows-1252 - it is valid iso-8859-1, but for a heuristic guess it's
probably saver to assume windows-1252.

If you want an exception to be thrown, you'll need to implement your own
codec, something like 'iso8859_1_nocc' - mmm.. I could try this myself,
because I do such a test in one of my projects, too ;)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,967
Messages
2,570,148
Members
46,694
Latest member
LetaCadwal

Latest Threads

Top