UnicodeDecodeError issue

Ferrous Cranus · Sep 4, 2013

On Monday, 2 September 2013 9:28:36 PM UTC+3, user Dave Angel wrote:

'file' does nothing interesting with the filename, it just opens it and

examines the contents. For example,

file www/cgi-bin/files.py

will examine the Python source file, not run it.

So first in the interpreter, I ran

then at the bash prompt, I ran:

davea@think2:~$ file junk.txt

junk.txt: ISO-8859 text

That is one clever idea, Dave.

I take it that the charset of the file 'junk.txt' gets identified from the encoding of the characters read from within the file?

But wait a minute: what editor do you use to write these 3 lines?
I mean, I am a bit confused.

I, for example, run 'nano tets.py', which has within:

f = open("junk.txt", "wb")
f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
f.close()

Then when I save the file within nano, for example, by default in the utf-8 charset,

how would it be able to detect that the byte string within is supposed to be Greek ISO?

Dave Angel · Sep 4, 2013

On Monday, 2 September 2013 9:28:36 PM UTC+3, user Dave Angel wrote:

That is one clever idea, Dave.

I take it that the charset of the file 'junk.txt' gets identified from the encoding of the characters read from within the file?

'file' only guesses the most likely encoding for 'junk.txt'. But at
least it can know it's not utf-8, since that would give a decoding
error.

That's why, whenever 'file' makes its verdict, it's up to you to check
it by displaying the data after decoding it with that tentative
encoding.

But wait a minute: what editor do you use to write these 3 lines?
I mean, I am a bit confused.

As I said right above, "in the interpreter, I ran"...
And if that's not clear enough, you can see the >>> prompts that the
Python interpreter uses. By interpreter, I mean I ran Python with no
parameters. I did not run IDLE or any other IDE that might take it
upon itself to interfere.

I, for example, run 'nano tets.py', which has within:

f = open("junk.txt", "wb")
f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
f.close()

Then when I save the file within nano, for example, by default in the utf-8 charset,

That's the encoding for the file tets.py, and you'll notice that it's
actually ASCII. Notice that the string I copied from the error message
uses escape sequences for all non-ASCII bytes.

how would it be able to detect that the byte string within is supposed to be Greek ISO?

I wouldn't be running 'file' on the tets.py file, but on the junk.txt
file created when you run
python tets.py

So since the tets.py file was a sidetrack, I just ran those three lines
in the interpreter.

Ferrous Cranus · Sep 4, 2013

On 4/9/2013 2:26 PM, Dave Angel wrote:

'file' only guesses the most likely encoding for 'junk.txt'. But at
least it can know it's not utf-8, since that would give a decoding
error.

That's why, whenever 'file' makes its verdict, it's up to you to check
it by displaying the data after decoding it with that tentative
encoding.

As I said right above, "in the interpreter, I ran"...
And if that's not clear enough, you can see the >>> prompts that the
Python interpreter uses. By interpreter, I mean I ran Python with no
parameters. I did not run IDLE or any other IDE that might take it
upon itself to interfere.

That's the encoding for the file tets.py, and you'll notice that it's
actually ASCII. Notice that the string I copied from the error message
uses escape sequences for all non-ASCII bytes.

I wouldn't be running 'file' on the tets.py file, but on the junk.txt
file created when you run
python tets.py

So since the tets.py file was a sidetrack, I just ran those three lines
in the interpreter.

I'm still confused about this.

Say we save those 3 lines inside junk.txt and we save it by default as utf-8.

When we run 'file junk.txt',

what will file respond with?

The filename's charset?

Or

will it look at the byte string within to decide what encoding it uses?

Dave Angel · Sep 4, 2013

On 4/9/2013 2:26 PM, Dave Angel wrote:
I'm still confused about this.

Say we save those 3 lines inside junk.txt and we save it by default as utf-8.

When we run 'file junk.txt',

what will file respond with?

junk2.txt: ASCII text

The filename's charset?

Or

will it look at the byte string within to decide what encoding it uses?

'file' isn't magic. And again, it doesn't look at the filename, it
looks at the content. What heuristics it uses, I don't know, but it has
hundreds of them. (I wish you hadn't confused the issue by using the
same name junk.txt for an entirely different purpose.) When it looks at a
file like this one, it looks only at the bytes within it. In this
case, the instance of 'file' on my machine decides it's an ASCII file.

If I add a silly shebang line

#!/usr/tmp/pyttthon

it says
junk2.txt: a /usr/tmp/pyttthon script, ASCII text executable

It doesn't know it's python, it just trusts the shebang line. And it
identifies it as ASCII, not utf-8, since there are no non-ascii
characters in it. It certainly does not try to interpret the b'xxxx'
byte string by Python syntax rules.

wxjmfauth · Sep 4, 2013

On Wednesday, 4 September 2013 10:01:50 UTC+2, Antoon Pardon wrote:

Ferrous Cranus · Sep 4, 2013

On 4/9/2013 3:38 PM, Dave Angel wrote:

'file' isn't magic. And again, it doesn't look at the filename, it
looks at the content.

So, you are saying that it looks at the content of the file and not at
what encoding we used to save the file in?

But the contents have within:

f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')

so it should have said greek-iso and not ascii.

Dave Angel · Sep 5, 2013

On 4/9/2013 3:38 PM, Dave Angel wrote:
So, you are saying that it looks at the content of the file and not at
what encoding we used to save the file in?

That's right. There's no place where your text editor stores the
encoding it used, so 'file' has to guess, based only on the content.

But the contents have within:

f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')

so it should have said greek-iso and not ascii.

No, that line is totally ASCII. Only when it's EXECUTED by Python will
a non ASCII byte string object be created. Like I said, 'file' doesn't
know the first thing about Python syntax, nor should it.
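A small Python 3 sketch of the distinction, assuming a shortened version of the line from tets.py (the variable names are made up for this illustration). The source text is what 'file' sees; the bytes object exists only after Python evaluates the literal.

```python
import ast

# The source text of the line, as an editor would save it.
# A raw string keeps each backslash as a literal character.
source_line = r"f.write(b'\xb6\xe3\xed\xf9')"

# Saved as UTF-8, every byte is below 0x80, so the file is plain ASCII too.
saved = source_line.encode('utf-8')
print(all(byte < 0x80 for byte in saved))

# Only evaluating the literal by Python's rules produces non-ASCII bytes.
value = ast.literal_eval(r"b'\xb6\xe3\xed\xf9'")
print(value, any(byte >= 0x80 for byte in value))
```

The first check looks at the file the way 'file' does, byte by byte. The second needs a Python parser, which 'file' does not have.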

Steven D'Aprano · Sep 5, 2013

That's right. There's no place where your text editor stores the
encoding it used, so 'file' has to guess, based only on the content.

Correct. The thing that people often fail to understand is that there is
no *reliable* way to store the encoding used for a text file in the text
file itself. The encoding is *metadata*, not data: it is data about the
data, and consequently it has to be stored "out of band". It has to be
stored somewhere else, outside of the file.

In the case of text files, it is usually not stored anywhere at all. IBM
mainframes assume that text files are using EBCDIC; modern Linux systems
assume text files are UTF-8; old DOS applications assume text files are
ASCII. Some text editors will try to guess the encoding, using various
heuristics such as "if the file starts with \xFE\xFF it is UTF-16" but
none of them are foolproof:

http://blogs.msdn.com/b/oldnewthing/archive/2004/03/24/95235.aspx

sometimes with amusing consequences:

http://www.hoax-slayer.com/bush-hid-the-facts-notepad.html
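As a rough illustration of such a heuristic and why it is not foolproof, here is a minimal byte-order-mark sniffer in Python 3. The function name sniff_bom is invented for this sketch; the BOM constants come from the standard codecs module.

```python
import codecs

# Longer marks are checked first: BOM_UTF32_LE begins with the same
# two bytes as BOM_UTF16_LE.
_BOMS = (
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
)


def sniff_bom(raw):
    """Guess an encoding from a leading byte order mark, or return None."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None


# A genuine UTF-16 big-endian file with a BOM is recognised.
print(sniff_bom(codecs.BOM_UTF16_BE + 'hi'.encode('utf-16-be')))

# But Latin-1 text that happens to begin with the two characters
# U+00FE and U+00FF has the same leading bytes, so the guess is wrong.
latin1_text = '\xfe\xffabc'.encode('latin-1')
print(sniff_bom(latin1_text))

# Most files carry no BOM at all, so this heuristic says nothing.
print(sniff_bom(b'\xb6\xe3\xed\xf9'))
```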

But the above byte string is also valid ISO-8859-5 (Cyrillic):

'Жуэљѓєяќэяьсѓ\x0fѓєоьсєяђ\n'

ISO-8859-2 (Central European):

'śăíůóôďüíďěáó\x0fóôŢěáôďň\n'

and ISO-8859-4 (Baltic):

'ļãíųķôīüíīėáķ\x0fķôŪėáôīō\n'

Surely you don't expect the file utility to actually recognise that
'Άγνωστοόνομασ\x0fστήματος\n' makes a valid Greek phrase while the others
are not meaningful?

No, that line is totally ASCII. Only when it's EXECUTED by Python will
a non ASCII byte string object be created. Like I said, 'file' doesn't
know the first thing about Python syntax, nor should it.

Technically, it's not ASCII, since ASCII only knows about bytes \x00
through \x7F (decimal 0 through 127). That's why it isn't correct to
describe Python bytes strings as "ASCII strings". They're byte strings
that happen to be displayed as ASCII-plus-other-stuff.

Chris Angelico · Sep 5, 2013

Technically, it's not ASCII, since ASCII only knows about bytes \x00
through \x7F (decimal 0 through 127). That's why it isn't correct to
describe Python bytes strings as "ASCII strings". They're byte strings
that happen to be displayed as ASCII-plus-other-stuff.

The line of code is itself entirely ASCII. The sequence REVERSE
SOLIDUS, LATIN SMALL LETTER X, LATIN SMALL LETTER B, DIGIT SIX is four
Unicode characters that are in the ASCII set. That Python interprets
them as representing the byte value 182 doesn't change that; the line
of code *is* ASCII.

ChrisA
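For anyone who wants to check this in Python 3, a short sketch using the standard unicodedata module (the variable names are only for illustration):

```python
import unicodedata

escape = r'\xb6'   # four characters of source text, not one byte
names = [unicodedata.name(ch) for ch in escape]

# Each character of the escape sequence is in the ASCII range.
for ch, name in zip(escape, names):
    print(repr(ch), name, ord(ch) < 128)

# What Python builds from the same four characters inside a bytes literal.
evaluated = b'\xb6'
print(len(escape), len(evaluated), evaluated[0])
```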

Steven D'Aprano · Sep 5, 2013

The line of code is itself entirely ASCII.

.......^^^^^^^^^^^^^^^^^^^^^^

Ah, so it is. Sorry, I got confused about what was being spoken about.
Apologies to Dave for casting aspersions on his knowledge.
