UnicodeEncodeError when not running script from IDE

Magnus Pettersson

I am using Eclipse to write my Python scripts, and when I run them from inside Eclipse they work fine without errors.

But in almost every script that handles some form of special characters, like Swedish åäö or Chinese characters, I get Unicode errors when running the script externally with python.exe or pythonw.exe (the scripts run completely fine from within Eclipse; standard PyDev projects, Python 2.7). I have usually launched the script GUI from within Eclipse because of this error, but now I want to get to the bottom of this so I don't have to open Eclipse every time I want to run a script!

Here is the error I get now when running the script with python.exe:
UnicodeEncodeError: 'charmap' codec can't encode character u'\u898b' in position 32: character maps to <undefined>

What can I do to fix this?
 
Andrew Berg

Since you didn't say what code actually does this, I'll turn to my
crystal ball. It says you are trying to print characters to a terminal
that doesn't support them. If that is the case, you could try changing
the code page (but only 3.3 supports cp65001, so that probably won't
help) or use replacement characters when printing.
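
The replacement-character idea can be sketched roughly as follows. This is only an illustration: safe_text is a made-up helper name, and cp437 is assumed here as a stand-in for a Windows console code page; the code page on a given machine may differ.

```python
import sys


def safe_text(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'.

    safe_text is a hypothetical helper, not part of any library.
    """
    # Fall back to ASCII when stdout has no usable encoding.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # Encode with the "replace" error handler, then decode back to text.
    return text.encode(encoding, "replace").decode(encoding)


print(safe_text(u"kanji: \u898b", "cp437"))  # prints: kanji: ?
```

The output loses the character, but the script keeps running instead of stopping with a UnicodeEncodeError.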
 
Steven D'Aprano

Magnus said:
I am using Eclipse to write my Python scripts, and when I run them from
inside Eclipse they work fine without errors.

But in almost every script that handles some form of special characters,
like Swedish åäö or Chinese characters

A comment: they are not "special" characters. They're merely not American.

I get Unicode errors when
running the script externally with python.exe or pythonw.exe (the
scripts run completely fine from within Eclipse; standard PyDev projects,
Python 2.7). I have usually launched the script GUI from within Eclipse
because of this error, but now I want to get to the bottom of this so I don't
have to open Eclipse every time I want to run a script!

Here is the error I get now when running the script with python.exe:
UnicodeEncodeError: 'charmap' codec can't encode character u'\u898b' in
position 32: character maps to <undefined>

Please show the *complete* traceback, including the line of code that causes
the exception.

What can I do to fix this?

My guess is that you are trying to print a character which your terminal
cannot display. My terminal is set to use UTF-8, and so it can display it
fine:

py> c = u'\u898b'
py> print(c)
見


(or at least it would display fine if the font used had a glyph for that
character). Why there are still terminals in the world that don't default
to UTF-8 is beyond me.

If I manually change the terminal's encoding to Western European ISO 8859-1,
I get some mojibake:

py> print(c)
è¦


I can't replicate the exception you give, so I assume it is specific to
Windows.
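
A rough way to see both behaviours without changing any terminal settings is to do the encoding by hand. The sketch below assumes cp437 as an example of a Windows console code page (the thread never says which one was in use); it avoids printing so that it runs the same way on any terminal.

```python
c = u"\u898b"

# UTF-8 can represent the character; it becomes three bytes.
utf8_bytes = c.encode("utf-8")

# Reading those same bytes as ISO 8859-1 produces mojibake.
mojibake = utf8_bytes.decode("latin-1")

# cp437 (an assumed example code page) has no mapping for the character,
# so encoding raises the 'charmap' error from the original post.
try:
    c.encode("cp437")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)
```

Here error_message ends with "character maps to <undefined>", which is the same complaint python.exe made.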
 
Magnus Pettersson

Ahh, so it's the actual printing that makes it error out outside of Eclipse, because it's printing to a different terminal. It's the default DOS terminal in Windows that runs when I run the script with python.exe, and I guess it's the same when I run with pythonw.exe, just that the terminal window is not opened, only the PyQt GUI in this case.

I will try to fix it now that I know what it is :)

I never thought about the terminal. Last time I had the same problem I was just playing around for hours with unicode encode and decode and all that not-so-fun stuff :)

Andrew Berg: Thanks, your crystal ball seems to be right :p
 
Magnus Pettersson

I have now tried taking away the printing to the terminal and keeping just the writing to a .txt file on disk (which is the script's purpose):

with open(filepath,"a") as f:
    for card in cardlist:
        f.write(card+"\n")

The file it writes to exists and I'm just appending to it. When I run the script through Eclipse, all is fine. When I run it in the terminal I get this error instead:

  File "K:\dev\python\webscraping\kanji_anki.py", line 69, in savefile
    f.write(card+"\n")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u898b' in position 32: ordinal not in range(128)
 
Peter Otten

Are you sure you are writing the same data? That would mean that pydev
changes the default encoding -- which is evil.

A portable approach would be to use codecs.open() or io.open() instead of
the built-in:

import io
with io.open(filepath, "a") as f:
    ...

io.open() uses UTF-8 by default, but you can specify other encodings with
io.open(filepath, mode, encoding=whatever).
 
Magnus Pettersson

Interesting. PyDev must be doing something behind the scenes, because when I changed open() to io.open() I now get an error inside Eclipse:

    f.write(card+"\n")
  File "C:\python27\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character u'\u53c8' in position 32: character maps to <undefined>

.....

with io.open(filepath, "a", encoding="UTF-8") as f:

Then it works in Eclipse. But I seem to be having an encoding problem all over the place, where things work in Eclipse but don't work outside of Eclipse PyDev.

Here is the flow of my data. I'm terrible at using unicode/encode/decode, so I could use some help here:

kanji_anki_gui.py:

def on_addButton_clicked(self):
    #code
    # self.kanji.text() comes from a kanji letter written into a pyqt4 QLineEdit
    kanji = unicode(self.kanji.text())
    card = kanji_anki.scrapeKanji(kanji,tags)
    #more code

kanji_anki.py:

def scrapeKanji(kanji, tags="", onlymeaning=False):
    baseurl = unicode("http://www.romajidesu.com/kanji/")
    url = unicode(baseurl+kanji)
    #test to write out url to disk, works outside of eclipse now
    savefile()

    #getting webpage works fine in eclipse, prints "Oh no..." in terminal
    try:
        page = urllib2.urlopen(url)
    except:
        print "OH no website dont work"
        return None

    #Code that does some scraping and returns a string containing kanji letters
    return card

def savefile(cardlist,filepath="D:/iknow_kanji.txt"):
    with io.open(filepath, "a") as f:
        for card in cardlist:
            f.write(card+"\n")
    return True
 
Peter Otten

Magnus said:
io.open() uses UTF-8 by default, but you can specify other encodings with
io.open(filepath, mode, encoding=whatever).

Interesting. PyDev must be doing something behind the scenes, because when
I changed open() to io.open() I now get an error inside Eclipse:

    f.write(card+"\n")
  File "C:\python27\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character u'\u53c8' in
position 32: character maps to <undefined>

....

with io.open(filepath, "a", encoding="UTF-8") as f:

Then it works in Eclipse. But I seem to be having an encoding problem all
over the place, where things work in Eclipse but don't work outside of
Eclipse PyDev.

No, I was wrong about the default; it is actually
locale.getpreferredencoding(). Sorry for the confusion.
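
To make the difference concrete, here is a small sketch that shows the locale's preferred encoding and then sidesteps it by passing encoding= explicitly. The temporary file and the sample card text are made up for the example; they are not the paths or data from the original script.

```python
import io
import locale
import os
import tempfile

# The encoding that depends on the machine's locale settings.
print(locale.getpreferredencoding(False))

card = u"\u53c8 again"  # sample text, not the real card data
path = os.path.join(tempfile.mkdtemp(), "cards.txt")  # throwaway file

# Naming the encoding removes the dependence on the locale.
with io.open(path, "a", encoding="utf-8") as f:
    f.write(card + u"\n")

# Read the raw bytes back to see what actually landed on disk.
with io.open(path, "rb") as f:
    raw = f.read()
```

With encoding="utf-8" the bytes on disk are the same on every machine, whatever the first print shows.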
 
Dave Angel

I think you are using Python 2.x, not Python 3, so you'd better be
explicit about which encoding you want for each file.

Magnus said:
Interesting. PyDev must be doing something behind the scenes, because when I changed open() to io.open() I now get an error inside Eclipse:

What encoding is this file? Since you're appending to it, you really
need to match the pre-existing encoding, or the next program to deal
with it is in big trouble. So using the io.open() without the encoding=
keyword is probably a mistake.
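
As a sketch of why the encodings have to match, the snippet below imitates a file that already holds UTF-8 data and then receives an append in cp1252 (cp1252 is chosen only because it shows up in the traceback; the sample text is made up).

```python
text = u"\xe5\xe4\xf6\n"  # the three Swedish letters from the original post

existing = text.encode("utf-8")    # what is already in the file
appended = text.encode("cp1252")   # what a mismatched append would add
mixed = existing + appended

# The next program to read the file as UTF-8 trips over the appended part.
try:
    mixed.decode("utf-8")
    readable = True
except UnicodeDecodeError:
    readable = False
```

Here readable ends up False, whereas appending with the file's own encoding would decode cleanly.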
 
Terry Reedy

Magnus said:
Ahh, so it's the actual printing that makes it error out outside of
Eclipse, because it's printing to a different terminal. It's
the default DOS terminal in Windows that runs when I run the script
with python.exe, and I guess it's the same when I run with pythonw.exe,
just that the terminal window is not opened, only the PyQt GUI in
this case.

Writing

txt = <expression involving coding>
print(txt)

rather than

print(<expression involving coding>)

makes it easier to tell whether a UnicodeError comes from evaluating the
expression or from the print operation.
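
The same two-step idea can be sketched as a small helper that reports which step failed. locate_unicode_error is a hypothetical name, and the second step only simulates what print has to do for a terminal with the given encoding.

```python
def locate_unicode_error(raw_bytes, source_encoding, output_encoding):
    """Return which step fails: "expression", "output", or None.

    Hypothetical helper for illustration only.
    """
    try:
        # Step 1: evaluate the expression (here, decoding some input bytes).
        txt = raw_bytes.decode(source_encoding)
    except UnicodeDecodeError:
        return "expression"
    try:
        # Step 2: what printing must do for a terminal using output_encoding.
        txt.encode(output_encoding)
    except UnicodeEncodeError:
        return "output"
    return None


print(locate_unicode_error(b"\xe8\xa6\x8b", "utf-8", "cp1252"))  # prints: output
print(locate_unicode_error(b"\xe8\xa6\x8b", "ascii", "utf-8"))   # prints: expression
```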

Using 3.3 instead of 2.7 will make using unicode somewhat easier.
 
Magnus Pettersson

Dave Angel said:
What encoding is this file? Since you're appending to it, you really
need to match the pre-existing encoding, or the next program to deal
with it is in big trouble. So using the io.open() without the encoding=
keyword is probably a mistake.

The .txt file is in UTF-8.

I have got it to work now in the terminal, but I don't understand what I'm doing, or why I didn't need all the unicode strings and encode mumbo jumbo in Eclipse.

#Here kanji = u"私"
baseurl = u"http://www.romajidesu.com/kanji/"
url = baseurl+kanji
savefile() #this test works now. uses: io.open(filepath, "a",encoding="UTF-8") as f:
# This made the fetching of the website work. Why did I have to write url.encode("UTF-8") when url already is unicode? I feel I don't have a good understanding of this.
page = urllib2.urlopen(url.encode("UTF-8"))


.....
 
Dave Angel

Magnus said:
#Here kanji = u"私"
baseurl = u"http://www.romajidesu.com/kanji/"
url = baseurl+kanji
savefile() #this test works now. uses: io.open(filepath, "a",encoding="UTF-8") as f:
# This made the fetching of the website work.


You don't show the code that actually does the io.open(), nor the
url.encode, so I'm not going to guess what you're actually doing.

Why did I have to write url.encode("UTF-8") when url already is unicode? I feel I don't have a good understanding of this.
page = urllib2.urlopen(url.encode("UTF-8"))

utf-8 is NOT unicode; they are entirely different. Unicode is
conceptually 32 bits per character, and is an internal representation.
There is room for a million or so characters (code points). Nearly always when you're
talking to an external device, you need bytes. Since you can't cram 32
bits into 8, you have to encode it. Examples of devices would be any
file, or the console. Notice that sometimes you can use unicode
directly for certain functions. For example, the Windows file name is
composed of Unicode characters, so Windows has function calls that
accept Unicode directly. But back to 8 bits:

One encoding is called ASCII, which is simply the bottommost 7 bits.
But of course it gets an error if there are any characters above 127.

Other encodings try to pick an 8 bit subset of the million possible
characters. Again, if you happen to have a character that's not in that
subset, you'll get an error.

There are also other encodings which are hard to describe, but
fortunately pretty rare these days.

Then there's utf-8, which uses a variable length bunch of bytes for
each character. It's designed to use the ASCII encoding for characters
which are below 128, but uses two or more bytes for all the other
characters. So it works out well when most characters happen to be ASCII.

Once encoded, a stream of bytes can only be successfully interpreted if
you use the same decoding when processing them.
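
A small sketch of these points, using a made-up three-character string (an ASCII letter, a Swedish letter, and the kanji from the traceback):

```python
text = u"a\xe5\u898b"

# UTF-8 is variable length: one byte for ASCII, more for everything else.
bytes_per_char = [len(ch.encode("utf-8")) for ch in text]

# For ASCII characters, the UTF-8 bytes are identical to the ASCII bytes.
ascii_compatible = u"a".encode("utf-8") == u"a".encode("ascii")

# ASCII itself fails as soon as a character is above 127.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# An 8-bit subset encoding such as Latin-1 fails on characters outside its subset.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```

bytes_per_char comes out as [1, 2, 3], and both ascii_ok and latin1_ok are False because of the kanji (ASCII would already fail on the Swedish letter).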
 
MRAB

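When you open the file, tell it what encoding to use. For example (this is
the Python 3 form; on Python 2.7 the built-in open() has no encoding
parameter, so use io.open() there):

with open(filepath, "a", encoding="utf-8") as f:
    for card in cardlist:
        f.write(card + "\n")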
 
Magnus Pettersson

Dave Angel said:
You don't show the code that actually does the io.open(), nor the
url.encode, so I'm not going to guess what you're actually doing.

Hmm, I'm not sure what you mean, but I wrote all the code needed in a previous post, so maybe you missed that one :)
In short I basically just have:
import io
with io.open(myfile,"a",encoding="UTF-8") as f:
    f.write(my_ustring_with_kanji)

The url.encode() is my unicode string variable named "url" using the built-in string method .encode(), which was the thing I wondered why I needed to use, which you explained very well, thank you!

Just one more question since all this is still a little fuzzy in my head.

When do I need to use .decode() in my code? Is it when I read lines from, for example, a UTF-8 file? And why didn't I have to use .encode() on my unicode string when running from within Eclipse PyDev? Someone wrote that it has a default codec setting, so maybe that handles it for me there (which is kind of dangerous, since my programs won't work outside of Eclipse: I didn't do any encoding or use unicode strings before in my script and it still worked).

--Magnus
 
Steven D'Aprano

Magnus Pettersson wrote:

# This made the fetching of the website work. Why did i have to write
# url.encode("UTF-8") when url already is unicode? I feel i dont have a
# good understanding of this.
page = urllib2.urlopen(url.encode("UTF-8"))


Start here:

"The Absolute Minimum Every Software Developer Absolutely, Positively Must
Know About Unicode and Character Sets (No Excuses!)"

http://www.joelonsoftware.com/articles/Unicode.html


Basically, Unicode is an in-memory data format. Python knows about Unicode
characters (to be technical: code points), but files on disk do not.
Neither do network protocols, or terminals, or other simple devices. They
only understand bytes.

So when you have Unicode text, and you want to write it to a file on disk,
or print it, or send it over the network to another machine, it has to be
*encoded* into bytes, and then *decoded* back into Unicode when you read it
from the file again. Sometimes the system will "helpfully" do that encoding
and decoding automatically for you, which is fine when it works but when it
doesn't it can be perplexing.

There are many, many, many different *encoding schemes*. ASCII is one. UTF-8
is another. And then there are about a bazillion legacy encodings which, if
you are lucky, you will never need to care about. Only some encodings can
deal with the entire range of Unicode characters, most can only deal with a
(typically small) subset of possible characters. E.g. ASCII only knows
about 128 characters out of the million-plus that Unicode deals with.
Latin-1 can handle close to 256 different characters. If you have a say in
the matter, always use UTF-8, since it can handle the full set of Unicode
characters in the most efficient manner.
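
The encode-on-the-way-out, decode-on-the-way-in rule can be sketched with an in-memory byte buffer standing in for a file or a network connection. The buffer and the sample text are assumptions made for the example.

```python
import io

text = u"\u898b\u308b"  # sample Unicode text

# Going out: the "device" only accepts bytes, so encode first.
buffer = io.BytesIO()
buffer.write(text.encode("utf-8"))

# Coming back in: what we read is bytes again, so decode with the same encoding.
buffer.seek(0)
data = buffer.read()
restored = data.decode("utf-8")
```

restored equals text again, which only holds because the same encoding was used in both directions.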
 
Magnus Pettersson

Thanks a lot Steven, you gave me a good AHA experience! :)

Now I understand why I had to use encoding when calling urllib2! So basically Eclipse PyDev does this in the background for me, and its console supports UTF-8, so that's why I never had to think about it before (and why some scripts tend to fail with Unicode errors when run outside of the Eclipse IDE).

cheers
Magnus
 
