UnicodeDecodeError issue

Ferrous Cranus

On Monday, 2 September 2013 9:28:36 PM UTC+3, Dave Angel wrote:
'file' does nothing interesting with the filename, it just opens it and

examines the contents. For example,



file www/cgi-bin/files.py



will examine the Python source file, not run it.



So first in the interpreter, I ran






then at the bash prompt, I ran:



davea@think2:~$ file junk.txt

junk.txt: ISO-8859 text


That is one Clever Idea Dave.

I take it that the charset of the file 'junk.txt' gets identified from the encoding of the characters that are read from within the file?

But wait a minute: What editor do you use to write these 3 lines?
I mean, I am a bit confused.

For example, I run 'nano tets.py', which contains:

f = open("junk.txt", "w")
f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
f.close()

Then, when I save the file within nano, for example in the default utf-8 charset,

how would it be able to detect that the bytestring within is supposed to be greek-iso?
 
Dave Angel

That is one Clever Idea Dave.

I take it that the charset of the file 'junk.txt' gets identified from the encoding of the characters that are read from within the file?

'file' only guesses the most likely encoding for 'junk.txt'. But at
least it can know it's not utf-8, since that would give a decoding
error.

That's why, whenever 'file' makes its verdict, it's up to you to check
it by displaying the data after decoding it with that tentative
encoding.
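
A minimal sketch of that check, assuming Python 3 and assuming the file is written in binary mode ("wb", not the "w" shown elsewhere in the thread); the temporary directory is only there to keep the example self-contained. It writes the bytes, confirms that they are not valid utf-8, and then displays them decoded with the tentative Greek encoding:

```python
import os
import tempfile

# The byte string from the thread.
data = (b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 '
        b'\xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')

# Write the raw bytes to junk.txt in a scratch directory (assumed location).
path = os.path.join(tempfile.mkdtemp(), "junk.txt")
with open(path, "wb") as f:
    f.write(data)

# Read the raw bytes back, without any decoding.
with open(path, "rb") as f:
    raw = f.read()

# utf-8 can be ruled out: decoding raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    is_utf8 = True
except UnicodeDecodeError:
    is_utf8 = False

# Check the tentative verdict by decoding and looking at the result.
text = raw.decode("iso8859_7")
print("valid utf-8:", is_utf8)
print(text)
```

If the decoded text reads as sensible Greek, the guess was probably right; a successful decode by itself proves very little.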

But wait a minute: What editor do you use to write these 3 lines?
I mean, I am a bit confused.

As I said right above, "in the interpreter, I ran"...
And if that's not clear enough, you can see the >>> prompts that the
Python interpreter uses. By interpreter, I mean I ran Python with no
parameters. I did not run IDLE or any other IDE that might take it
upon itself to interfere.

For example, I run 'nano tets.py', which contains:

f = open("junk.txt", "w")
f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')
f.close()

Then, when I save the file within nano, for example in the default utf-8 charset,

That's the encoding for the file tets.py, and you'll notice that it's
actually ASCII. Notice that the string I copied from the error message
uses escape sequences for all non-ASCII bytes.

how would it be able to detect that the bytestring within is supposed to be greek-iso?

I wouldn't be running 'file' on the tets.py file, but on the junk.txt
file created when you run
python tets.py

So since the tets.py file was a sidetrack, I just ran those three lines
in the interpreter.
 
Ferrous Cranus

I'm still confused about this.

Say we save those 3 lines inside junk.txt and we save it by default as utf-8.

When we run 'file junk.txt', what will file respond with?

The filename's charset?

Or will it look at the bytestring within to decide what encoding it uses?
 
Dave Angel

I'm still confused about this.

Say we save those 3 lines inside junk.txt and we save it by default as utf-8.

When we run 'file junk.txt', what will file respond with?

junk2.txt: ASCII text

The filename's charset?

Or will it look at the bytestring within to decide what encoding it uses?

'file' isn't magic. And again, it doesn't look at the filename, it
looks at the content. What heuristics it uses, I don't know, but it has
hundreds of them. (I wish you hadn't confused the issue by using the
same name junk.txt for an entirely different purpose.) When it looks at a
file like this one, it looks only at the bytes within it. In this
case, the instance of 'file' on my machine decides it's an ASCII file.

If I add a silly shebang line

#!/usr/tmp/pyttthon

it says
junk2.txt: a /usr/tmp/pyttthon script, ASCII text executable

It doesn't know it's python, it just trusts the shebang line. And it
identifies it as ASCII, not utf-8, since there are no non-ascii
characters in it. It certainly does not try to interpret the b'xxxx'
byte string by Python syntax rules.
 
Ferrous Cranus

On 4/9/2013 3:38 PM, Dave Angel wrote:
'file' isn't magic. And again, it doesn't look at the filename, it
looks at the content.

So, you are saying that it looks at the content of the file and not at
what encoding we used to save the file?

But the contents have within:

f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1
\xf3\xf\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')

so it should have said greek-iso and not ascii.
 
Dave Angel

So, you are saying that it looks at the content of the file and not at
what encoding we used to save the file?

That's right. There's no place where your text editor stores the
encoding it used, so 'file' has to guess, based only on the content.

But the contents have within:

f.write(b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1
\xf3\xf\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')

so it should have said greek-iso and not ascii.

No, that line is totally ASCII. Only when it's EXECUTED by Python will
a non ASCII byte string object be created. Like I said, 'file' doesn't
know the first thing about Python syntax, nor should it.
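
The distinction can be sketched in Python itself. This is an illustration only, assuming Python 3; `literal_source` is a hypothetical name that holds the bytes literal as source text (a raw string, so the backslashes stay literal characters), and `ast.literal_eval` stands in for "executing" it:

```python
import ast

# The bytes literal as it sits in the source file: plain characters
# such as backslash, 'x', 'b', '6'. Nothing here is outside ASCII.
literal_source = (r"b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 "
                  r"\xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n'")

# What a tool like 'file' would see on disk: the encoded source text.
source_bytes = literal_source.encode("ascii")
source_is_ascii = all(b < 128 for b in source_bytes)

# What Python creates only when the literal is evaluated.
value = ast.literal_eval(literal_source)
value_is_ascii = all(b < 128 for b in value)

print("source text is ASCII:", source_is_ascii)
print("evaluated bytes are ASCII:", value_is_ascii)
print("first four source characters:", literal_source[2:6])
print("first evaluated byte:", value[0])
```

The four source characters `\xb6` turn into the single byte 182 only after evaluation, which is a step 'file' never performs.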
 
Steven D'Aprano

That's right. There's no place where your text editor stores the
encoding it used, so 'file' has to guess, based only on the content.

Correct. The thing that people often fail to understand is that there is
no *reliable* way to store the encoding used for a text file in the text
file itself. The encoding is *metadata*, not data: it is data about the
data, and consequently it has to be stored "out of band". It has to be
stored somewhere else, outside of the file.

In the case of text files, it is usually not stored anywhere at all. IBM
mainframes assume that text files are using EBCDIC; modern Linux systems
assume text files are UTF-8; old DOS applications assume text files are
ASCII. Some text editors will try to guess the encoding, using various
heuristics such as "if the file starts with \xFE\xFF it is UTF-16" but
none of them are foolproof:

http://blogs.msdn.com/b/oldnewthing/archive/2004/03/24/95235.aspx

sometimes with amusing consequences:

http://www.hoax-slayer.com/bush-hid-the-facts-notepad.html
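
A rough sketch of such a byte-order-mark heuristic, and of one way it can be fooled. The function name `guess_by_bom` is hypothetical, and real editors use many more rules than this:

```python
import codecs

def guess_by_bom(raw):
    """Return an encoding name if raw starts with a known BOM, else None."""
    # UTF-32 is checked before UTF-16 because the UTF-32-LE BOM
    # begins with the same two bytes as the UTF-16-LE BOM.
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in boms:
        if raw.startswith(bom):
            return name
    return None

# Genuine UTF-16 data: Python's "utf-16" codec writes a BOM first.
print(guess_by_bom("hello".encode("utf-16")))

# Latin-1 text that happens to start with the characters 'þÿ',
# which are the bytes \xFE\xFF. The heuristic misreads it as UTF-16.
print(guess_by_bom("þÿ is not a BOM here".encode("latin-1")))

# No BOM at all: the heuristic has nothing to say.
print(guess_by_bom(b"plain ASCII text"))
```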



But the above byte string is also valid ISO-8859-5 (Cyrillic):

'Жуэљѓєяќэяьсѓ\x0fѓєоьсєяђ\n'

ISO-8859-2 (Central European):

'śăíůóôďüíďěáó\x0fóôŢěáôďň\n'

and ISO-8859-4 (Baltic):

'ļãíųķôīüíīėáķ\x0fķôŪėáôīņ\n'


Surely you don't expect the file utility to actually recognise that
'Άγνωστοόνομασ\x0fστήματος\n' makes a valid Greek phrase while the others
are not meaningful?
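
The same point as a small Python 3 sketch. It uses the byte string as first posted in the thread, so the results differ slightly from the strings quoted above, which contain a stray \x0f. Every one of these single-byte codecs decodes the data without complaint, so nothing in the bytes marks one answer as the right one:

```python
# The byte string as first posted in the thread.
data = (b'\xb6\xe3\xed\xf9\xf3\xf4\xef\xfc\xed\xef\xec\xe1 '
        b'\xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2\n')

# Greek, Cyrillic, Central European and Baltic ISO-8859 variants.
candidates = ["iso8859_7", "iso8859_5", "iso8859_2", "iso8859_4"]

# Each decode succeeds; only a human reader (or a language model of
# some kind) can say which result is meaningful text.
decoded = {name: data.decode(name) for name in candidates}

for name, text in decoded.items():
    print(name, repr(text))
```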


No, that line is totally ASCII. Only when it's EXECUTED by Python will
a non ASCII byte string object be created. Like I said, 'file' doesn't
know the first thing about Python syntax, nor should it.

Technically, it's not ASCII, since ASCII only knows about bytes \x00
through \x7F (decimal 0 through 127). That's why it isn't correct to
describe Python bytes strings as "ASCII strings". They're byte strings
that happen to be displayed as ASCII-plus-other-stuff.
 
Chris Angelico

Technically, it's not ASCII, since ASCII only knows about bytes \x00
through \x7F (decimal 0 through 127). That's why it isn't correct to
describe Python bytes strings as "ASCII strings". They're byte strings
that happen to be displayed as ASCII-plus-other-stuff.

The line of code is itself entirely ASCII. The sequence REVERSE
SOLIDUS, LATIN SMALL LETTER X, LATIN SMALL LETTER B, DIGIT SIX is four
Unicode characters that are in the ASCII set. That Python interprets
them as representing the byte value 182 doesn't change that; the line
of code *is* ASCII.

ChrisA
 
Steven D'Aprano

The line of code is itself entirely ASCII.
.......^^^^^^^^^^^^^^^^^^^^^^


Ah, so it is. Sorry, I got confused about what was being spoken about.
Apologies to Dave for casting aspersions on his knowledge :)
 
