Removing Unicode from Python?

Paradox · Oct 30, 2003

In general I love Python for text manipulation but at our company we
have the need to manipulate large text values stored in either a SQL
Server database or text files. This data is stored in a "text" field
type and is definitely not unicode though it is often very strange
text since it is either OCR or some kinda electronic file extraction.
Unfortunately when it is retrieved into a string type in python it is
invariably a unicode type string. The best I can do is try to encode
it to 'latin-1', but that will often throw an error, and if I use the
ignore parameter it will whack my data with a bunch of "?". I am
just not understanding why python is thinking stuff is unicode and why
it is failing on conversion. There is no way that a byte can not be
between 0 and 255 right? This problem can be so haunting that I will
start to wish I had coded the solution in VB where at least a string
is a string is a string. Is there a way to modify Python so that all
strings will always be single byte strings since we have no need for
Unicode support? Any solutions or suggestions to my biggest Python
annoyance would be greatly appreciated.

Thanks Joey

Martin v. Löwis · Oct 30, 2003

Paradox said:
In general I love Python for text manipulation but at our company we
have the need to manipulate large text values stored in either a SQL
Server database or text files. This data is stored in a "text" field
type and is definitely not unicode though it is often very strange
text since it is either OCR or some kinda electronic file extraction.
Unfortunately when it is retrieved into a string type in python it is
invariably a unicode type string. The best I can do is try to encode
it to 'latin-1', but that will often throw an error, and if I use the
ignore parameter it will whack my data with a bunch of "?".

Can you give an example of such a string? Reporting its repr() would help.
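One way to produce the kind of report asked for here is to list every character that falls outside the Latin-1 range together with its code point. The sketch below is written for Python 3 (in the Python 2 of this thread the same idea applies to unicode objects). The function name non_latin1_chars and the sample string are hypothetical, not taken from the original data.

```python
def non_latin1_chars(text):
    """Return (index, character, code point) for every character above U+00FF."""
    return [
        (i, ch, "U+%04X" % ord(ch))
        for i, ch in enumerate(text)
        if ord(ch) > 0xFF
    ]


# Hypothetical sample: OCR-style text containing a typographic apostrophe (U+2019).
# The degree sign (U+00B0) is inside Latin-1 and is therefore not reported.
sample = "It\u2019s 50\u00b0 outside"

print(repr(sample))
print(non_latin1_chars(sample))
```

Characters reported this way are the ones that 'latin-1' cannot represent, which is usually where the encoding error comes from.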

If you want to encode arbitrary Unicode strings into byte strings, you
can use "utf-8" as the encoding.

Regards,
Martin
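To illustrate the suggestion above, here is a minimal sketch (Python 3 syntax, hypothetical sample text) comparing 'latin-1' with its different error handlers against 'utf-8'. Note that it is the "replace" handler that substitutes "?", while "ignore" silently drops the character.

```python
# Hypothetical sample: a typographic apostrophe (U+2019) has no Latin-1 equivalent.
text = "It\u2019s"

try:
    text.encode("latin-1")          # default error handling is "strict"
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

replaced = text.encode("latin-1", "replace")  # unencodable characters become b"?"
ignored = text.encode("latin-1", "ignore")    # unencodable characters are dropped
utf8 = text.encode("utf-8")                   # every code point can be encoded

# Decoding the UTF-8 bytes gives back exactly the original text.
round_trip = utf8.decode("utf-8")

print(strict_failed, replaced, ignored, utf8, round_trip == text)
```

The UTF-8 result is longer than one byte per character for anything outside ASCII, but nothing is lost.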

Neil Hodgson · Oct 30, 2003

Paradox:

I am
just not understanding why python is thinking stuff is unicode and why
it is failing on conversion. There is no way that a byte can not be
between 0 and 255 right? This problem can be so haunting that I will
start to wish I had coded the solution in VB where at least a string
is a string is a string.

In VB a string is a BSTR is a Unicode string.
http://msdn.microsoft.com/library/d...98/html/vbconpassingstringstodllprocedure.asp

Neil

George Kinney · Oct 30, 2003

There is no way that a byte can not be
between 0 and 255 right? This problem can be so haunting that I will
start to wish I had coded the solution in VB where at least a string
is a string is a string. Is there a way to modify Python so that all
strings will always be single byte strings since we have no need for
Unicode support? Any solutions or suggestions to my biggest Python
annoyance would be greatly appreciated.

All MS products use unicode strings. All the time. It's integral to
the OS and all its libraries.

VB and other MS offspring allow you to ignore that fact, but they
don't make it go away.

Python is just doing what it should do: handle unicode strings as unicode
strings.

Tim Roberts · Nov 1, 2003

Brian Quinlan said:
This statement is obviously false.

Not really. The core Windows 2000 and XP operating systems are exclusively
Unicode. When you call one of the ASCII (strictly, "ANSI") APIs, it converts
every string to Unicode, calls the Unicode API which does the real work,
converts any output parameters back to ASCII, and returns them to you.
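The wrapper pattern described above can be sketched in a few lines of Python 3. This is only an illustration of the decode, delegate, encode sequence: the function names message_w and message_a and the choice of cp1252 as the code page are hypothetical and do not correspond to any real Windows API.

```python
def message_w(text):
    """Hypothetical 'wide' routine: does the real work on Unicode text."""
    return text.upper()


def message_a(data, codepage="cp1252"):
    """Hypothetical 8-bit wrapper around message_w."""
    text = data.decode(codepage)      # 8-bit input -> Unicode
    result = message_w(text)          # the Unicode routine does the work
    return result.encode(codepage)    # Unicode output -> 8-bit bytes


# b"caf\xe9" is "café" in cp1252.
print(message_a(b"caf\xe9"))
```

Every call through the 8-bit wrapper pays for two conversions that a direct call to the Unicode routine avoids.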

As you might imagine, all of those conversions cost time. Thus,
Microsoft's application products work natively in Unicode and use the
Unicode APIs when they are available.

But the SQL Server "text" type is not a Unicode type.

And that means, among other things, that it cannot handle international
character sets reasonably. There is no agreement as to what the character
0xBF is, whereas there IS standards-based agreement on the meaning of the
Unicode code point U+00BF.
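The ambiguity of the byte 0xBF is easy to see by decoding it under a few single-byte code pages. The three code pages below are an arbitrary selection for illustration (Python 3 syntax).

```python
raw = b"\xbf"

# The same byte decoded under three different single-byte code pages.
decoded = {name: raw.decode(name) for name in ("latin-1", "cp437", "cp1251")}

for name, ch in decoded.items():
    print(name, repr(ch), "U+%04X" % ord(ch))
```

Each code page maps the byte to a different character, whereas U+00BF always means the inverted question mark.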

jack · Nov 4, 2003

i had a similar problem with SQL Server. my solution was to create a
sitecustomize.py file containing:

# Python 2 only: sys.setdefaultencoding() was removed in Python 3
import sys
sys.setdefaultencoding("utf-8")

this works for me and lets me ignore unicode for everything: implicit
conversions between unicode and byte strings then use UTF-8 instead of
ASCII. i was unable to find any other solution that i could understand.
(i'm not a programmer and have only just started with python).

jack
sidelined in order to prevent discrimination on the gender front
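Since sys.setdefaultencoding() exists only in Python 2, an alternative that also works in Python 3 is to name the encoding explicitly wherever text crosses into bytes, for example when writing the text files mentioned in the original question. The following is a minimal sketch under that assumption. The sample text is hypothetical and a temporary file stands in for the real data file.

```python
import io
import os
import tempfile

text = "It\u2019s 50\u00b0"   # hypothetical sample with a non-Latin-1 character

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Encode explicitly on the way out...
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    # ...and decode explicitly on the way back in.
    with io.open(path, "r", encoding="utf-8") as handle:
        restored = handle.read()
finally:
    os.remove(path)

print(restored == text)
```

Keeping the encoding visible at each boundary avoids depending on a process-wide default that other code may not expect.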
