Chuck Bearden
I'm having a tough time understanding how to manage Unicode when loading
data into an MS SQL server. I'm still pretty new to Unicode, but I
think I have a grasp of the basic concepts. I'm running ActivePython
2.3.2 Build 230 on Windows XP. I have the Egenix mx.ODBC package
version 2.0.1 (thanks, Marc-Andre).
I have a script that is loading the contents of selected HTML files into
a database, along with information identifying the file. Here is a
sample script:
-------------------------begin snippet-------------------------
import sys
import mx.ODBC.Windows
#-- initialize the db connection
dbname = 'theDb'
uname = 'theUser'
password = 'thePassword'
dsn = "DSN=%s;UID=%s;PWD=%s" % (dbname, uname, password)
con = mx.ODBC.Windows.DriverConnect(dsn)
#-- handle UTF-8 encoded Unicode; this worked when loading XML files
con.encoding = 'utf-8'
con.stringformat = mx.ODBC.Windows.UNICODE_STRINGFORMAT
cur = con.cursor()
#-- get the contents of our file (crudely: filename is 2nd arg)
html_f = open(sys.argv[1], 'r')
htmldata = html_f.read()
html_f.close()
#-- make statement string and insert values tuple, and execute
stmnt = """
INSERT INTO pmLinkHTML
(PMID, Ord, HTML, HTMLlen)
VALUES
(?, ?, ?, ?)
"""
val_t = (549, 0, htmldata, len(htmldata))
cur.execute(stmnt, val_t)
cur.close()
con.close()
--------------------------end snippet--------------------------
For my pains I am rewarded with:
Traceback (most recent call last):
  File "./unitest.py", line 27, in ?
    cur.execute(stmnt, val_t)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbe in position 45662: unexpected code byte
Byte 45662 of the HTML file is indeed "\xBE". I don't think that should
be a problem.
What am I doing wrong? I have spent a fair bit of time googling the
newsgroup in various ways, and consulting Python in a Nutshell and the online
standard library docs at python.org. It may be something quite
obvious to a better-informed coder, but I am prepared to learn.
Many thanks in advance.
Chuck Bearden
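
A note on the error above: a lone byte 0xBE is never valid UTF-8, because it has the bit pattern of a continuation byte and cannot start a character. One plausible reading, and it is only an assumption, is that the HTML file is in a single-byte encoding such as Latin-1 (ISO-8859-1), where 0xBE is the character "¾". The sketch below shows that case. It is written for a current Python 3 rather than the 2.3 used in the post, and the sample bytes, the helper name decode_html and the Latin-1 fallback are all hypothetical choices, not anything taken from mx.ODBC.

```python
# Hypothetical sample standing in for the HTML file contents; it contains
# the same offending byte (0xBE) reported in the traceback.
raw = b"3\xbe cups"


def decode_html(data, fallback="latin-1"):
    """Decode bytes as UTF-8, falling back to a single-byte encoding.

    The Latin-1 fallback is an assumption about the file, not a fact
    taken from the original post.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # 0xBE cannot start a UTF-8 sequence, so we end up here.
        return data.decode(fallback)


text = decode_html(raw)            # a real Unicode string
utf8_bytes = text.encode("utf-8")  # the same text re-encoded as valid UTF-8
```

If the assumption holds, decoding the file with its real encoding first, and only then handing a Unicode string (or its UTF-8 encoding) to the database layer, should avoid asking the 'utf8' codec to decode bytes that were never UTF-8. The file's actual encoding still has to be confirmed, for example from its meta charset declaration.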