Stuart McGraw
I just spent a $*#@!*&^&% hour registering at ^$#@#%^
SourceForge and trying to submit a Python bug report
but it still won't let me. I give up. Maybe someone who
cares will see this post, or maybe it will save time for
someone else who runs into this problem...
================================================
Environment:
- Microsoft Windows 2000 Pro
- Python 2.3.4
- urllib (version shipped with Python-2.3.4)
Problem:
urllib returns corrupted data when reading an EUC-JP encoded
web page from a Python script that is run from an MS Windows .BAT
file, but not when the same script is run from the command line.
Note: To reproduce this problem, it helps to have East Asian font
support installed on the test system. In Windows 2000:
  Control Panel,
  Regional Options, General tab,
  check mark in Japanese in the "Language settings..." area.
Python also needs either the cjkcodecs (http://cjkpython.berlios.de/)
or Tamito KAJIYAMA's japanese codecs
(http://www.asahi-net.or.jp/~rd6t-kjym/python/)
installed.
To reproduce the problem...
1. Create a python file, test.py:
test.py:
----------------
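import sys, urllib, cjkcodecs

# Fetch the URL given as the first command-line argument.
f = urllib.urlopen(sys.argv[1])
for ln in f:
    # Decode each EUC-JP line to unicode, then write it out as UTF-8.
    ln = ln.decode("cjkcodecs.euc-jp")
    print ln.encode("utf-8"),   # trailing comma: ln already ends with a newline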
----------------
2. Create a batch file that will run test.py:
test.bat:
----------------
python test.py http://etext.lib.virginia.edu/cgi-local/breen/wwwjdic?1W%BF%A9%A4%D9%A4%EB_v1
----------------
3. In a cmd.exe window run the following two commands:
python test.py http://etext.lib.virginia.edu/cgi-local/breen/wwwjdic?1W%BF%A9%A4%D9%A4%EB_v1 >out1.txt
test.bat >out2.txt
4. out1.txt and out2.txt should be identical. But they are not.
The URL used returns an EUC-JP encoded page with some Japanese
characters in it. test.py reads the page line by line, decodes
each line to unicode, re-encodes it as UTF-8, and prints it; the
redirection writes that output to a file. Thus the output file
should be a UTF-8 version of the EUC-JP web page.
The first command runs test.py directly. The second command runs
the identical command from a Windows batch file, so one should
expect out1.txt and out2.txt to be identical.
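For readers on a current Python, the decode and re-encode step described
above can be sketched without touching the network. This is a Python 3
sketch, not the script from the report: it assumes the built-in euc_jp
codec in place of the cjkcodecs package, and the function name
reencode_euc_jp_to_utf8 and the sample bytes are made up for illustration.

```python
import io


def reencode_euc_jp_to_utf8(lines):
    """Yield each EUC-JP encoded bytes line re-encoded as UTF-8 bytes."""
    for raw in lines:
        # bytes -> str using EUC-JP, then str -> bytes using UTF-8
        yield raw.decode("euc_jp").encode("utf-8")


# Hypothetical stand-in for the HTTP response: one line holding the
# six EUC-JP bytes quoted in the report.
sample = io.BytesIO(b"<td>\xbf\xa9\xa4\xd9\xa4\xeb</td>\n")
out = b"".join(reencode_euc_jp_to_utf8(sample))
print(out.decode("utf-8"), end="")
```

If the input bytes arrive intact, the output is simply the same text in
UTF-8, which is what out1.txt is expected to contain.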
out1.txt (created by running test.py from the command line) is
correct (verify by opening out1.txt in Notepad and selecting a
Japanese-capable font, e.g. Lucida Sans Unicode). The string in
the first cell of the HTML table is the three Japanese characters
for the word "taberu".
But in out2.txt (created by running test.py from a Windows .bat
file), instead of Japanese characters there, we see the ASCII text
string "A9D9EB". (The EUC-JP bytes of the actual Japanese characters
that should be there are \xBF\xA9\xA4\xD9\xA4\xEB, so the printed
hex digits seem to come from alternate bytes of the EUC-JP string.)
In other lines with Japanese characters a similar effect is seen:
the first two Japanese characters are replaced with a string of
hex digits. Strangely, the remaining Japanese characters on the line
are not corrupted.
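The "alternate bytes" observation can be checked mechanically. A small
Python 3 sketch follows; it assumes urllib.parse and the built-in euc_jp
codec, neither of which is what the Python 2.3.4 script used, and the
variable names are illustrative only.

```python
from urllib.parse import unquote_to_bytes

# The percent-encoded part of the query string from the test URL.
query = "%BF%A9%A4%D9%A4%EB"

raw = unquote_to_bytes(query)      # the six EUC-JP bytes
text = raw.decode("euc_jp")        # the three-character word "taberu"

# Take the second byte of each two-byte character and show it as hex.
every_other = raw[1::2]
hex_digits = every_other.hex().upper()

print(repr(raw))
print(hex_digits)                  # A9D9EB
```

The hex string seen in out2.txt matches the text form of every second
byte of the expected EUC-JP sequence.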
Running with a debugger shows that the corruption is in the text
received from urllib; it is not a result of the euc-jp decoding,
UTF-8 encoding, or writing to the output file.
So it looks like some bad mojo between urllib and the Windows
batch environment.
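One possible explanation, which the report above does not confirm: inside
a batch file cmd.exe treats %NAME% as an environment variable reference
and replaces an undefined one with nothing, while at the interactive
prompt an undefined %NAME% is left untouched. Under that reading the URL
would already be altered before Python or urllib ever sees it. The sketch
below is a deliberately simplified Python 3 model of that expansion. It
ignores %1-style parameters and other cmd.exe syntax, it assumes no
variables named BF or A4 are defined, and expand_batch_percents is a
made-up name.

```python
import re


def expand_batch_percents(line):
    """Simplified model of %NAME% expansion inside a .bat file.

    Assumption: every variable is undefined, so %NAME% becomes "".
    A doubled "%%" is treated as one literal percent sign.
    """
    def repl(match):
        name = match.group(1)
        return "%" if name == "" else ""
    return re.sub(r"%([^%]*)%", repl, line)


url = ("http://etext.lib.virginia.edu/cgi-local/breen/"
       "wwwjdic?1W%BF%A9%A4%D9%A4%EB_v1")

# As written in test.bat: %BF% and %A4% vanish, leaving A9, D9, EB.
print(expand_batch_percents(url))

# With every percent sign doubled, the original URL survives.
print(expand_batch_percents(url.replace("%", "%%")))
```

If this is what happens, doubling each percent sign in test.bat
(%%BF%%A9%%A4...) should make the two outputs match, and printing
repr(sys.argv[1]) at the top of test.py would show which URL the script
actually received.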