UnicodeDecodeError issue

Ferrous Cranus

On 1/9/2013 5:08 PM, Ferrous Cranus wrote:
On 1/9/2013 11:35 AM, Steven D'Aprano wrote:
On Saturday, 31 August 2013 9:41:27 AM UTC+3, Ferrous Cranus wrote:
Suddenly my website superhost.gr, running my main python script, presents
me with this error:

Code:

UnicodeDecodeError('utf-8', b'\xb6\xe3\xed\xf9\xf3\xf4\xef \xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2', 0, 1, 'invalid start byte')

Does anyone know what this means?

--

Webhost <http://superhost.gr>

Good morning Steven,

Yes, I'm aware that I need to define variables before I try to make use of
them. I have studied all of your examples and then reviewed my code, and I
can *assure* you that the statement that tries to set the 'host'
variable is very early at the top of the script (after the imports, of
course), and the cur.execute comes after.

The problem here is not what you say, that I try to drink a coffee
before actually making one, but rather that I cannot drink the
coffee although I know *I have tried* to make one first.


I will upload the code to pastebin to prove what I am saying.

http://pastebin.com/J97guApQ


You are mistaken. In line 20-25, you have this:

try:
    gi = pygeoip.GeoIP('/usr/local/share/GeoIPCity.dat')
    city = gi.time_zone_by_addr( os.environ['REMOTE_ADDR'] ) or gi.time_zone_by_addr( os.environ['HTTP_CF_CONNECTING_IP'] )
    host = socket.gethostbyaddr( os.environ['REMOTE_ADDR'] )[0] or socket.gethostbyaddr( os.environ['HTTP_CF_CONNECTING_IP'] )[0] or "Proxy Detected"
except Exception as e:
    print( repr(e), file=open( '/tmp/err.out', 'w' ) )


An error occurs inside that block, *before* host gets set. Who knows what
the error is? You have access to the err.out file, but apparently you
aren't reading it to find out.

Then, 110 lines later, at line 135, you try to access the value of "host"
that never got set.

Your job is to read the error in /tmp/err.out, see what is failing, and
fix it.
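
The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not the poster's script: `lookup_host` is a hypothetical stand-in for whichever call fails inside the try block, and an in-memory buffer stands in for /tmp/err.out.

```python
import io

def lookup_host():
    # Hypothetical stand-in for the failing call in the real script.
    raise OSError("lookup failed")

log = io.StringIO()  # stands in for the error file

try:
    host = lookup_host()        # raises, so 'host' is never bound
except Exception as e:
    print(repr(e), file=log)    # the error is recorded and execution continues

try:
    print(host)                 # much later: the name was never assigned
except NameError as e:
    later_error = e

# One possible fix: bind a fallback value before the risky call.
host2 = "Proxy Detected"
try:
    host2 = lookup_host()
except Exception as e:
    print(repr(e), file=log)
```

With the fallback bound first, later code can always read the variable, and the log still records why the lookup failed.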

But I am, Steven! That is why I read it immediately after my script runs
in the browser.

I have even included a sys.exit(0) after the try/except block.

Here it is:


errout = open( '/tmp/err.out', 'w' ) # opens and truncates the error output file
try:
    gi = pygeoip.GeoIP('/usr/local/share/GeoIPCity.dat')
    city = gi.time_zone_by_addr( os.environ['REMOTE_ADDR'] ) or gi.time_zone_by_addr( os.environ['HTTP_CF_CONNECTING_IP'] )
    host = socket.gethostbyaddr( os.environ['REMOTE_ADDR'] )[0] or socket.gethostbyaddr( os.environ['HTTP_CF_CONNECTING_IP'] )[0] or "Proxy Detected"
except Exception as e:
    print( "Xyzzy exception-", repr( sys.exc_info() ), file=errout )
    errout.flush()

sys.exit(0)

and the output of error file is:


(e-mail address removed) [~]# cat /tmp/err.out
UnicodeDecodeError('utf-8', b'\xb6\xe3\xed\xf9\xf3\xf4\xef
\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2', 0, 1,
'invalid start byte')


But I noticed that err.out and /usr/local/apache/logs/error_log produced
different output.

In any case, I checked both:


(e-mail address removed) [~]# chmod 777 /tmp/err2.out

Output of error_log:
(e-mail address removed) [~]# [Sun Sep 01 14:23:46 2013] [error] [client 173.245.49.120] Premature end of script headers: metrites.py
[Sun Sep 01 14:23:46 2013] [error] [client 173.245.49.120] File does not exist: /home/nikos/public_html/500.shtml



I have also changed the error output filename, and the new file
turns out empty:

(e-mail address removed) [~]# cat /tmp/err2.out
 
Dave Angel

On 1/9/2013 10:08, Ferrous Cranus wrote:

Here is it:


errout = open( '/tmp/err.out', 'w' ) # opens and truncates the error output file
try:
    gi = pygeoip.GeoIP('/usr/local/share/GeoIPCity.dat')
    city = gi.time_zone_by_addr( os.environ['REMOTE_ADDR'] ) or gi.time_zone_by_addr( os.environ['HTTP_CF_CONNECTING_IP'] )
    host = socket.gethostbyaddr( os.environ['REMOTE_ADDR'] )[0] or socket.gethostbyaddr( os.environ['HTTP_CF_CONNECTING_IP'] )[0] or "Proxy Detected"
except Exception as e:
    print( "Xyzzy exception-", repr( sys.exc_info() ), file=errout )
    errout.flush()

sys.exit(0)

and the output of error file is:


(e-mail address removed) [~]# cat /tmp/err.out
UnicodeDecodeError('utf-8', b'\xb6\xe3\xed\xf9\xf3\xf4\xef
\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2', 0, 1,
'invalid start byte')

Nope. The label "Xyzzy exception" is not in that file, so that's not
the file you created in this run. Further, if that line existed before,
it would have been wiped out by the open with mode "w".

I suggest you add yet another write to that file, immediately after
opening it:

errout = open( '/tmp/err.out', 'w' ) # opens and truncates the error output file
print("starting run", file=errout)
errout.flush()

Until you can reliably examine the same file that was logging your
errors, you're just spinning your wheels. You might even want to write
the time to the file, so that you can tell whether it was now, or 2 days
ago that the run was made.
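
A sketch of what such a start marker with a timestamp could look like. The temporary path, the placeholder `ValueError`, and the use of `traceback.print_exc` are assumptions made for illustration; the script in the thread writes to /tmp/err.out and logs `repr(sys.exc_info())`.

```python
import datetime
import os
import tempfile
import traceback

# Assumption: a temporary path keeps the sketch runnable anywhere.
log_path = os.path.join(tempfile.gettempdir(), "err_sketch.out")

with open(log_path, "w") as errout:      # mode "w" truncates any earlier run
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    print("starting run", stamp, file=errout)
    errout.flush()
    try:
        raise ValueError("example failure")   # placeholder for the real work
    except Exception:
        print("Xyzzy exception-", file=errout)
        traceback.print_exc(file=errout)      # full traceback with line numbers
        errout.flush()

with open(log_path) as f:
    contents = f.read()
```

If the file on disk lacks both the start marker and a fresh timestamp, it was not written by the current run.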
 
Dave Angel

On 1/9/2013 1:35 PM, Dave Angel wrote:
This is my first crack at it (untested):

errout = open("/tmp/err.out", "w") #opens and truncates the error output file
try:
    gi = pygeoip.GeoIP('/usr/local/share/GeoIPCity.dat')
    city = gi.time_zone_by_addr( os.environ['REMOTE_ADDR'] ) or gi.time_zone_by_addr( os.environ['HTTP_CF_CONNECTING_IP'] )
    host = socket.gethostbyaddr( os.environ['REMOTE_ADDR'] )[0] or socket.gethostbyaddr( os.environ['HTTP_CF_CONNECTING_IP'] )[0] or "Proxy Detected"
except Exception as e:
    print( "Xyzzy exception-", repr(sys.exc_info()), file=errout)
    errout.flush()


Note that I haven't had to use exc_info() in my own code, so I'm sure it
could be formatted prettier. But right now, you need to stop throwing
away useful information.

First of all, thank you for your detailed information, Dave.
I have tried everything you said, including the example you provided
above, but I'm afraid that even with your approach, which should have given
more specific error information, the output of the err file remains the same.


(e-mail address removed) [~]# cat /tmp/err.out
UnicodeDecodeError('utf-8', b'\xb6\xe3\xed\xf9\xf3\xf4\xef
\xfc\xed\xef\xec\xe1 \xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2', 0, 1,
'invalid start byte')

See my other response. The above file did NOT result from running the
code above. It is missing the "Xyzzy" label.
 
Ferrous Cranus

On 1/9/2013 6:36 PM, Dave Angel wrote:
I suggest you add yet another write to that file, immediately after
opening it:

errout = open( '/tmp/err.out', 'w' ) # opens and truncates the error output file
print("starting run", file=errout)
errout.flush()

Until you can reliably examine the same file that was logging your
errors, you're just spinning your wheels. You might even want to write
the time to the file, so that you can tell whether it was now, or 2 days
ago that the run was made.


I tried it and it printed nothing.
But suddenly the webpage started to run, and I get no invalid byte entries
and no weird messages; files.py is working as expected.
What on earth?

Now I have this error:

# =================================================================================================================
# DATABASE INSERTS - do not increment the counter if a Cookie is set to the visitors browser already
# =================================================================================================================
if( not vip and re.search(
r'(msn|gator|amazon|yandex|reverse|cloudflare|who|fetch|barracuda|spider|google|crawl|pingdom)',
host ) is None ):

print( "i'm in and data is: ", host )
try:
#find the needed counter for the page URL
if os.path.exists( path + page ) or os.path.exists( cgi_path + page ):
cur.execute('''SELECT ID FROM counters WHERE url = %s''', page )
data = cur.fetchone() #URL is unique, so should only be one

if not data:
#first time for page; primary key is automatic, hit is defaulted
cur.execute('''INSERT INTO counters (url) VALUES (%s)''', page )
cID = cur.lastrowid #get the primary key value of the new record
else:
#found the page, save primary key and use it to issue hit UPDATE
cID = data[0]
cur.execute('''UPDATE counters SET hits = hits + 1 WHERE ID = %s''',
cID )

#find the visitor record for the (saved) cID and current host
cur.execute('''SELECT * FROM visitors WHERE counterID = %s and host =
%s''', (cID, host) )
data = cur.fetchone() #cID&host are unique

if not data:
#first time for this host on this page, create new record
cur.execute('''INSERT INTO visitors (counterID, host, city, useros,
browser, lastvisit) VALUES (%s, %s, %s, %s, %s, %s)''', (cID, host,
city, useros, browser, date) )
else:
#found the page, save its primary key for later use
vID = data[0]
#UPDATE record using retrieved vID
cur.execute('''UPDATE visitors SET city = %s, useros = %s, browser =
%s, hits = hits + 1, lastvisit = %s
WHERE counterID = %s and host = %s''', (city, useros, browser,
date, vID, host) )

con.commit() #if we made it here, the transaction is complete

except pymysql.ProgrammingError as e:
print( repr(e) )
con.rollback() #something failed, rollback the entire transaction
sys.exit(0)


I get no counter increment when visitors visit my webpage.
What on earth is going on?

How did the previous error with the invalid byte something get solved?
 
Ferrous Cranus

On 1/9/2013 7:10 PM, Ferrous Cranus wrote:
I get no counter increment when visitors visit my webpage.
What on earth is going on?

How did the previous error with the invalid byte something get solved?

I still wonder how the invalid byte message disappeared.
 
Dave Angel

On 1/9/2013 18:23, Ferrous Cranus wrote:

I still wonder how the invalid byte message disappeared.

Too bad you never bothered to narrow it down to its source. It could
be anywhere on those three lines. If I had to guess, I'd figure it was
one of those environment variables. The Linux environment variables are
strings of bytes, and the os.environ is a dict of strings. Apparently
it converts them using utf-8, and if you've somehow set them using some
other encoding, you could be getting that error.

Have you tried to decode those bytes in various encodings other than
utf-8 ?
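
One way to narrow an error like this down to its source is to run each step under its own handler and record a full traceback. The sketch below is illustrative only: `run_step` and `broken_lookup` are hypothetical helpers, and which step fails here is chosen arbitrarily, since the thread never establishes which of the three lines raised the error.

```python
import traceback

def run_step(label, func, log):
    """Run one step; on failure, record which step failed and return None."""
    try:
        return func()
    except Exception:
        log.append((label, traceback.format_exc()))
        return None

def broken_lookup():
    # Hypothetical stand-in for a call that yields non-UTF-8 bytes.
    return b'\xb6\xe3\xed'.decode('utf-8')

log = []
city = run_step("time zone lookup", lambda: "Europe/Athens", log)
host = run_step("reverse DNS lookup", broken_lookup, log)
failed_steps = [label for label, _ in log]
```

Each log entry pairs a step label with its traceback, so the failing statement is identified without guessing.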
 
Ferrous Cranus

On 2/9/2013 2:14 AM, Dave Angel wrote:
On 1/9/2013 18:23, Ferrous Cranus wrote:



Too bad you never bothered to narrow it down to its source.


If only I had known how, up until yesterday, while the errors were still appearing.

It could
be anywhere on those three lines. If I had to guess, I'd figure it was
one of those environment variables. The Linux environment variables are
strings of bytes, and the os.environ is a dict of strings. Apparently
it converts them using utf-8, and if you've somehow set them using some
other encoding, you could be getting that error.

Have you tried to decode those bytes in various encodings other than
utf-8 ?


No, because I wasn't aware of which string/variable they pertained to.
 
Dave Angel

No, because I wasn't aware of which string/variable they pertained to.

http://pypi.python.org/pypi/chardet

is a package which tries to 'guess' an encoding for a string of bytes.
I happen to have the 2.7 version installed, but not the 3.x version, so
the following is in 2.7. Same thing should work in 3.3....
¶γνωστοόνομα συστήματος


I don't have a clue what it might be; it's not English, and I don't
know whatever language it may be in.

Does that string make any sense to you? You may want to try it on your
own machine, since the email may obscure the encoding. Or you might
want to do the decode using whatever the default encoding is for that
server.

The Linux 'file' utility thinks this string is in ISO-8859, so you might
want to try a decode('ISO-8859-1') as well. (and maybe ISO-8859-2, -3,
-4, and -5)
 
Ferrous Cranus

On 2/9/2013 2:38 PM, Dave Angel wrote:
http://pypi.python.org/pypi/chardet

is a package which tries to 'guess' an encoding for a string of bytes.
I happen to have the 2.7 version installed, but not the 3.x version, so
the following is in 2.7. Same thing should work in 3.3....

¶γνωστοόνομα συστήματος


I don't have a clue what it might be; it's not English, and I don't
know whatever language it may be in.

Does that string make any sense to you?

Yes it does, it means "Unknown Hostname"
The Linux 'file' utility thinks this string is in ISO-8859, so you might
want to try a decode('ISO-8859-1') as well. (and maybe ISO-8859-2, -3,
-4, and -5)

How did you test it? The utility afaik analyzes a file's encodings not
string encodings.

(e-mail address removed) [~]# file www/cgi-bin/files.py
www/cgi-bin/files.py: a /usr/bin/python script text executable
 
MRAB

http://pypi.python.org/pypi/chardet

is a package which tries to 'guess' an encoding for a string of bytes.
I happen to have the 2.7 version installed, but not the 3.x version, so
the following is in 2.7. Same thing should work in 3.3....

¶γνωστοόνομα συστήματος

I don't have a clue what it might be; it's not English, and I don't
know whatever language it may be in.

You don't recognise Greek?

Does that string make any sense to you? You may want to try it on your
own machine, since the email may obscure the encoding. Or you might
want to do the decode using whatever the default encoding is for that
server.

The Linux 'file' utility thinks this string is in ISO-8859, so you might
want to try a decode('ISO-8859-1') as well. (and maybe ISO-8859-2, -3,
-4, and -5)

It's ISO-8859-7 (Greek).
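
The suggestion to try other encodings can be checked directly against the bytes from the error message. This is a small sketch using only the standard library codecs; the list of candidate encodings is an arbitrary choice for illustration.

```python
# The byte string reported in the UnicodeDecodeError.
raw = (b'\xb6\xe3\xed\xf9\xf3\xf4\xef \xfc\xed\xef\xec\xe1 '
       b'\xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2')

results = {}
for encoding in ('utf-8', 'iso-8859-1', 'iso-8859-7'):
    try:
        results[encoding] = raw.decode(encoding)
    except UnicodeDecodeError as e:
        results[encoding] = e    # keep the exception for inspection
```

Here `results['utf-8']` holds the same 'invalid start byte' error at position 0 that the script logged, `results['iso-8859-1']` decodes without error but yields accented Latin characters, and `results['iso-8859-7']` yields Greek text. A decode that succeeds is not proof of the right encoding, since ISO-8859-1 accepts any byte sequence.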
 
Dave Angel

Does that string make any sense to you?

Yes it does, it means "Unknown Hostname"
The Linux 'file' utility thinks this string is in ISO-8859, so you might
want to try a decode('ISO-8859-1') as well. (and maybe ISO-8859-2, -3,
-4, and -5)

How did you test it? The utility afaik analyzes a file's encodings not
string encodings.

Starting with the byte string in the error message:

(e-mail address removed) [~]# file www/cgi-bin/files.py
www/cgi-bin/files.py: a /usr/bin/python script text executable

No point in doing that, as the string in question doesn't exist there.
 
Dave Angel

On 02/09/2013 12:38, Dave Angel wrote:

You don't recognise Greek?

I recognize most of those as Greek characters, but as I said, I don't
know Greek. And because I can't recognize words, I can't assume it
might not be some other language that uses the same glyphs.
 
MRAB

I recognize most of those as Greek characters, but as I said, I don't
know Greek. And because I can't recognize words, I can't assume it
might not be some other language that uses the same glyphs.

I don't know Greek either, and I don't think there's any other language
that uses the Greek alphabet.
 
Ferrous Cranus

On 2/9/2013 3:21 PM, Dave Angel wrote:
Starting with the byte string in the error message:


Indeed, but yet again, file checks the encoding of the filename that
consists of these lines above, not of the actual strings.
 
Dave Angel

On 2/9/2013 3:21 PM, Dave Angel wrote:


Indeed, but yet again, file checks the encoding of the filename that
consists of these lines above, not of the actual strings.

'file' does nothing interesting with the filename, it just opens it and
examines the contents. For example,

file www/cgi-bin/files.py

will examine the Python source file, not run it.

So first in the interpreter, I ran

then at the bash prompt, I ran:

davea@think2:~$ file junk.txt
junk.txt: ISO-8859 text
davea@think2:~$
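
The interpreter step is missing from the archived message. As a guess at the general approach, and not a reconstruction of what was actually run, the bytes from the error message could be written to a file in binary mode so that `file` has contents to examine. The temporary directory is an assumption; the name junk.txt comes from the shell session above.

```python
import os
import tempfile

# The byte string reported in the UnicodeDecodeError.
raw = (b'\xb6\xe3\xed\xf9\xf3\xf4\xef \xfc\xed\xef\xec\xe1 '
       b'\xf3\xf5\xf3\xf4\xde\xec\xe1\xf4\xef\xf2')

# Assumption: junk.txt is created in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "junk.txt")
with open(path, "wb") as f:    # binary mode: the bytes are written unchanged
    f.write(raw)

with open(path, "rb") as f:
    on_disk = f.read()
```

Running `file` on the resulting path then inspects exactly these bytes, independent of any Python source file.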
 
Chris Angelico

I don't know Greek either, and I don't think there's any other language
that uses the Greek alphabet.

Assuming you don't count mathematics as a language.

ChrisA
 
Steven D'Aprano

Assuming you don't count mathematics as a language.


There are a few languages which use the Greek alphabet, with variations.
Coptic is the main one, although Greek and Coptic letters have their own
Unicode symbols, in order to support works which need to distinguish them.

Armenian and, of course, Cyrillic, are derived from the Greek alphabet;
actually so is the Latin alphabet.

Other languages that used, or use, the Greek alphabet include quite a few
ancient languages, including Gaulish and Bactrian. Old Nubian in the
Middle Ages used the Greek alphabet plus a few additional letters. A
number of Slavic languages used the Greek alphabet, although now they use
Cyrillic. Some Albanian dialects still use the Greek alphabet, as do a
couple of Turkic languages from the Balkans. See the Wikipedia entry on
the Greek alphabet for more.
 
wxjmfauth

On Monday, 2 September 2013 16:44:34 UTC+2, MRAB wrote:
I don't know Greek either, and I don't think there's any other language
that uses the Greek alphabet.

--------

The Latin alphabet uses Greek lettering.

The Cyrillic alphabet uses Greek lettering.

Greek: One should not confuse modern Greek
with ancient Greek, polytonic Greek full
of diacritics.

Plenty of European languages (~15) based on the Latin
alphabet use some ancient Greek diacritics.

Now unicode.

Everything is working very smoothly with the endorsed coding
schemes of Unicode.org.

Expectedly it fails (behaves badly) with Python and its
Flexible String Representation, mainly because it relies on
the latin-1 (iso-8859-1) set.

To take the problem the other way, one can take these
linguistic aspects to illustrate the wrong design of
the FSR.

jmf
 
Antoon Pardon

On 03-09-13 17:23, (e-mail address removed) wrote:
--------

The Latin alphabet uses Greek lettering.

The Cyrillic alphabet uses Greek lettering.

Greek: One should not confuse modern Greek
with ancient Greek, polytonic Greek full
of diacritics.

Plenty of European languages (~15) based on the Latin
alphabet use some ancient Greek diacritics.

Now unicode.

Everything is working very smoothly with the endorsed coding
schemes of Unicode.org.

Expectedly it fails (behaves badly) with Python and its
Flexible String Representation, mainly because it relies on
the latin-1 (iso-8859-1) set.

You really seem obsessed. There is no reason at all to think that this
problem is related to the FSR. You are only bringing this up because
you are looking for opportunities to complain about the FSR.

To take the problem the other way, one can take these
linguistic aspects to illustrate the wrong design of
the FSR.

No you can't, you are just assuming so because you feel it would
confirm your bias against the FSR.
 
