UnicodeEncodeError in compile

pyscripter

Using python 3.2 in Windows 7 I am getting the following:
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character

Can anybody explain why the compile statement tries to convert the unicode filename using mbcs? I know that sys.getfilesystemencoding() returns 'mbcs' in Windows, but I thought that this is not used when unicode file names are provided.
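
For readers following along, here is a minimal sketch of the kind of call being discussed. The source string and the file name are hypothetical stand-ins, not the snippet that produced the error above. Whether the call raises depends on the platform and Python version, so the sketch handles both outcomes.

```python
# Hypothetical stand-ins: neither the source text nor the file name
# comes from the original report.
source = "x = 1 + 1"
filename = "\u5de5.py"  # a file name containing one CJK character

try:
    # compile(source, filename, mode); the filename is only used for
    # tracebacks and similar messages, it is never opened.
    code = compile(source, filename, "exec")
except UnicodeEncodeError as exc:
    # The failure described in this thread (Python 3.2, Windows, 'mbcs').
    code = None
    namespace = {}
    print("compile() rejected the filename:", exc)
else:
    namespace = {}
    exec(code, namespace)
    # ascii() avoids a second encoding problem when printing to a console.
    print(namespace["x"], ascii(code.co_filename))
```

If the interpreter keeps the filename as a str, the code object's co_filename should be the name that was passed in.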
 
Terry Reedy

Using python 3.2 in Windows 7 I am getting the following:

UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character

Can anybody explain why the compile statement tries to convert the unicode filename using mbcs? I know that sys.getfilesystemencoding returns 'mbcs' in Windows, but I thought that this is not used when unicode file names are provided.

I get the same error running 3.2.2 under IDLE but not when pasting into
Command Prompt. However, Command Prompt may be cheating by replacing the
Chinese chars with '??' upon pasting, so that Python never gets them --
whereas they appear just fine in IDLE.
 
jmfauth

1) If I copy/paste these CJK chars from Google Groups into two of my
interactive interpreters (no "dos/cmd console"), I have no problem.

2) It seems the mbcs codec has some difficulties with these chars.
Traceback (most recent call last):
File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character
b'\x00\x00]\xe5'

3) As for the use of mbcs in file I/O: that is a question for the core devs.

My conclusion.
The bottleneck is on the mbcs side.

jmf
 
88888 Dihedral

On Tuesday, January 10, 2012 at 4:08:40 PM UTC+8, Terry Reedy wrote:
I get the same error running 3.2.2 under IDLE but not when pasting into
Command Prompt. However, Command Prompt may be cheating by replacing the
Chinese chars with '??' upon pasting, so that Python never gets them --
whereas they appear just fine in IDLE.

Thank you for the trick.
Use some wildcard pattern to get name.py compiled to a .pyc file in some
directory with utf-8 encoded chars.
 
jmfauth

On Tuesday, January 10, 2012 at 4:08:40 PM UTC+8, Terry Reedy wrote:

Tested with *my* Windows GUI interactive interpreters.

It seems to me there is a problem with the mbcs codec.
Traceback (most recent call last):
File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character
'\u5de5'.encode('utf-8')
b'\xe5\xb7\xa5'
'\u5de5'.encode('utf-32-be')
b'\x00\x00]\xe5'
sys.version
'3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)]'
'\u5de5'.encode('mbcs', 'replace')
b'?'

----------
u'\u5de5'.encode('mbcs', 'replace')
'?'
repr(u'\u5de5'.encode('utf-8'))
"'\\xe5\\xb7\\xa5'"
repr(u'\u5de5'.encode('utf-32-be'))
"'\\x00\\x00]\\xe5'"
sys.version
'2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)]'


jmf
 
jmfauth

Addendum, Python console ("dos box")

D:\>c:\python32\python.exe
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character

D:\>c:\python27\python.exe
Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.

D:\>

jmf
 
Terry Reedy

D:\>c:\python32\python.exe
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character

D:\>c:\python27\python.exe
Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
'?'

mbcs encodes according to the current code page. Only the Chinese
code page(s) can encode the Chinese char. So the Unicode error is correct,
and 2.7 has a bug in that it is doing "errors='replace'" when it
is supposedly doing "errors='strict'". The Py3 fix was done in
http://bugs.python.org/issue850997
2.7 was intentionally left alone because of back-compatibility
considerations. (None of this addresses the OP's question.)
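
The difference between the two error modes can be seen without a Windows machine. The sketch below is only an approximation: it uses the cp1252 codec as a stand-in for what mbcs does on a Western European system (an assumption; the mbcs codec itself exists only on Windows), and it uses the same character as the sessions above.

```python
char = "\u5de5"  # the CJK character used in the sessions above

# errors='strict' is the default: a character the code page cannot
# represent raises UnicodeEncodeError (the Python 3 behaviour above).
try:
    char.encode("cp1252")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# errors='replace' silently substitutes '?', which is what the
# Python 2.7 session shows for mbcs.
lossy = char.encode("cp1252", errors="replace")

# A Unicode encoding has no such limitation.
utf8 = char.encode("utf-8")

print(strict_failed, lossy, utf8)
```

The replaced form is lossy: b'?' cannot be decoded back to the original character, which is why silently replacing is a poor fit for file names.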
 
Terry Reedy

Is this a filename that could be an actual, valid filename on your system?

Good question. I believe this holdover from 2.x should be deleted.
I argued that in http://bugs.python.org/issue10114
(which was about a different problem) and now, directly, in
http://bugs.python.org/issue13758

If you (or anyone) can make a better argument for the requested change,
or for also changing compile on *nix, than I did, please do so.
 
jmfauth

On 1/10/2012 8:43 AM, jmfauth wrote:

...

mbcs encodes according to the current codepage. Only the chinese
codepage(s) can encode the chinese char. So the unicode error is correct
and 2.7 has a bug in that it is doing "errors='replace'" when it
supposedly is doing "errors='strict'". The Py3 fix was done in http://bugs.python.org/issue850997
2.7 was intentionally left alone because of back-compatibility
considerations. (None of this addresses the OP's question.)

--

win7, cp1252

Ok. I was not aware of this.
Traceback (most recent call last):
File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character
Traceback (most recent call last):
File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character

jmf
 
jmfauth

D:\>c:\python32\python.exe
Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
'\u5de5'.encode('utf-8')
b'\xe5\xb7\xa5'
'\u5de5'.encode('mbcs')
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character

D:\>c:\python27\python.exe
Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
u'\u5de5'.encode('utf-8')
'\xe5\xb7\xa5'
u'\u5de5'.encode('mbcs')
'?'

mbcs encodes according to the current codepage. Only the chinese
codepage(s) can encode the chinese char. So the unicode error is correct
and 2.7 has a bug in that it is doing "errors='replace'" when it
supposedly is doing "errors='strict'". The Py3 fix was done in http://bugs.python.org/issue850997
2.7 was intentionally left alone because of back-compatibility
considerations. (None of this addresses the OP's question.)

--

Ok. I was not aware of this.
PS: My previous post got lost.
 
pyscripter

On 1/10/2012 3:08 AM, Terry Reedy wrote:
Is this a filename that could be an actual, valid filename on your system?

Yes it is. open works on that file.
Good question. I believe this holdover from 2.x should be deleted.
I argued that in http://bugs.python.org/issue10114
(which was about a different problem) and now, directly, in
http://bugs.python.org/issue13758
Maybe the example from this question can be added to issue 13758 as proof that compile fails on valid file names.

But I think the real issue is why on modern Windows systems the file system encoding is mbcs. Shouldn't it be utf-16?
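
To see which codec a given interpreter uses for file names, something like the following sketch can be run. The file name is a hypothetical example, and the value printed depends on the platform and Python version, so no particular output is assumed here.

```python
import os
import sys

name = "\u5de5.py"  # hypothetical file name with one CJK character

# The codec Python uses when it has to turn a str file name into bytes.
fs_encoding = sys.getfilesystemencoding()

try:
    # os.fsencode() applies that codec; os.fsdecode() reverses it.
    round_trip = os.fsdecode(os.fsencode(name))
except UnicodeEncodeError:
    # The codec cannot represent the character, as with 'mbcs' on a
    # system whose code page has no CJK characters.
    round_trip = None

print(fs_encoding, round_trip == name)
```

A False in the second column would mean the file system encoding cannot carry this name as bytes, which is the situation the question describes.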
 
Dave Angel

<SNIP>
Maybe the example of this question can be added to the issue 13758 as a proof that compile fails on valid file names.

But I think the real issue is why on modern Windows systems the file system encoding is mbcs. Shouldn't it be utf-16?

Depends what you mean by modern. The following isn't true for Windows
95, 98, or ME. But they weren't modern when they were first released.

NT systems (which include Win2k, XP, Vista, and Win7) have used Unicode
for the file system for at least the last 15 years. They also
supply an "ASCII" interface. If Python is using the latter, then it
won't be able to access all possible files.

Now, it may be the fault of the C library that CPython uses. I haven't
looked at any of the code for CPython.

This is all from memory, as I haven't actively used Windows for some
time now. But I think the DLL name is kernel32.dll, and the entry
points have names like CreateFileW() for the unicode open, and
CreateFileA() for the "ASCII" open.
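
To make the difference between the two kinds of entry point concrete, here is a rough, portable sketch. It does not call kernel32.dll at all; it only checks whether the same name could be handed over as UTF-16 text or as bytes in a given code page. The file name, the helper encodable(), and the choice of cp1252 and cp936 as example code pages are all assumptions made for illustration.

```python
name = "\u5de5.py"  # hypothetical file name with one CJK character


def encodable(text, codec):
    """Return True if text can be encoded strictly with the given codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


# A wide ("W") entry point takes UTF-16 text, which can carry any name.
wide_ok = encodable(name, "utf-16-le")

# An "A" entry point takes bytes in the active code page, so the result
# depends on which code page the system happens to use.
western_ok = encodable(name, "cp1252")  # Western European code page
chinese_ok = encodable(name, "cp936")   # Simplified Chinese code page

print(wide_ok, western_ok, chinese_ok)
```

Under these assumptions only the Western code page fails, which matches the earlier remark that only the Chinese code pages can encode the character.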
 
pyscripter

Indeed, on Windows NT the file system encoding should not be mbcs, since it creates UnicodeEncodeErrors on perfectly valid file names.
 