umlauts

Arian Kuschki · Oct 17, 2009

Hi all

this has been bugging me for a long time and I do not seem to be able to
understand what to do. I always have problems when dealing with input text
that contains umlauts. Consider the following:

In [1]: import urllib

In [2]: f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")

In [3]: xml = f.read()

In [4]: f.close()

In [5]: print xml
------> print(xml)

<forecast_information><cit

y data="Munich, BY"/><postal_code data="Muenchen"/><latitude_e6
data=""/><longitude_e6 data=""/><forecast_date
data="2009-10-17"/><current_date_time data="2009-10
-17 14:20:00 +0000"/><unit_system
data="SI"/></forecast_information><current_conditions><condition data="Meistens
bewï¿½kt"/><temp_f data="43"/><temp_c data="6"/><h
umidity data="Feuchtigkeit: 87ï¿½%"/><icon
data="/ig/images/weather/mostly_cloudy.gif"/><wind_condition data="Wind: W mit
Windgeschwindigkeiten von 13 km/h"/></curr
ent_conditions><forecast_conditions><day_of_week data="Sa."/><low
data="1"/><high data="7"/><icon
data="/ig/images/weather/chance_of_rain.gif"/><condition data="V
ereinzelt Regen"/></forecast_conditions><forecast_conditions><day_of_week
data="So."/><low data="-1"/><high data="8"/><icon
data="/ig/images/weather/chance_of_sno
w.gif"/><condition data="Vereinzelt
Schnee"/></forecast_conditions><forecast_conditions><day_of_week
data="Mo."/><low data="-4"/><high data="8"/><icon data="/ig/i
mages/weather/mostly_sunny.gif"/><condition data="Teils
sonnig"/></forecast_conditions><forecast_conditions><day_of_week
data="Di."/><low data="0"/><high data="8"
/><icon data="/ig/images/weather/sunny.gif"/><condition
data="Klar"/></forecast_conditions></weather></xml_api_reply>

As you can see, the umlauts in the XML are not displayed properly. When I want
to process this text (for example with xml.sax), I get error messages because
the parser can't read it.

I've tried to read up on this and there is a lot of information on the web, but
nothing seems to work for me, for example setting the source encoding to UTF-8
like this: # -*- coding: utf-8 -*- or using the decode() string method.

I always have this kind of problem when input contains umlauts, not just in
this case. My locale (on Ubuntu) is en_GB.UTF-8.

Cheers
Arian

Diez B. Roggisch · Oct 17, 2009

Arian said:
I've tried to read up on this and there is a lot of information on the web, but
nothing seems to work for me. For example setting the coding to UTF like this:
# -*- coding: utf-8 -*- or using the decode() string method.

The encoding of the Python source file has nothing to do with this. It's
only relevant for unicode literals (in Python 2.x, that's u"...").

Arian said:
I always have this kind of problem when input contains umlauts, not just in
this case. My locale (on Ubuntu) is en_GB.UTF-8.

If we assume the data on the website is correct (it appears to be when I
open it in FF), then your problem is most probably your display/terminal.

What does this show you in your interactive interpreter?

>>> print "\xc3\xb6"
ö

For me, it's o-umlaut, ö. This is because the above bytes are the
sequence for ö in utf-8.

If this shows something else, you need to adjust your terminal settings.

Diez
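
The terminal check above can be reproduced without relying on the terminal at all. This is a minimal sketch in Python 3 syntax (an assumption; the thread itself uses Python 2), and the names UTF8_O_UMLAUT and describe are made up for illustration. It decodes the same two bytes under two different assumptions, which is roughly what a terminal with the wrong encoding setting does.

```python
# The two bytes that encode "ö" (U+00F6) in UTF-8.
UTF8_O_UMLAUT = b"\xc3\xb6"


def describe(raw):
    """Decode the same bytes under two different encoding assumptions."""
    return {
        "utf-8": raw.decode("utf-8"),    # correct interpretation
        "cp1252": raw.decode("cp1252"),  # what a misconfigured display shows
    }


result = describe(UTF8_O_UMLAUT)
print(result)

# The reverse mistake: a Latin-1 "ö" is the single byte 0xF6, which is not
# valid UTF-8, so a UTF-8 display can only show a replacement character.
garbled = b"bew\xf6lkt".decode("utf-8", errors="replace")
print(garbled)
```

If the first entry prints as a proper o-umlaut and the second as two odd characters, the terminal is interpreting output as UTF-8, which matches the en_GB.UTF-8 locale mentioned in the question.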

StarWing · Oct 17, 2009

try this?

# vim: set fencoding=utf-8:
import urllib
import xml.sax as sax, xml.sax.handler as handler

# fetch the raw response (a byte string in Python 2)
f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
xml = f.read()
# decode the bytes to a unicode object, assuming cp1252
xml = xml.decode("cp1252")
f.close()

class my_handler(handler.ContentHandler):
    def startElement(self, name, attrs):
        print "begin:", name, attrs

    def endElement(self, name):
        print "end:", name

sax.parseString(xml, my_handler())

Diez B. Roggisch · Oct 17, 2009

This is wrong. XML is a *byte*-based format, which explicitly states
its encoding. So decoding a byte-string to a unicode-object and then
passing it to a parser stops working the moment you have data where

- the characters are outside your default system encoding (usually ascii)
- the system encoding and the declared encoding differ

Besides, I don't see what the whole SAX stuff is supposed to do
that the direct print and the decode() don't do - smells like
cargo-cult to me.

Diez
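
To illustrate the point that decoding is the parser's job when the document declares its encoding, here is a small sketch. It uses Python 3 and xml.etree.ElementTree rather than the thread's Python 2 and xml.sax (both substitutions are assumptions made so the example is short and runnable), and the two sample documents are hypothetical, not the real Google response.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the same element serialized in two encodings,
# each one stating its encoding in the XML declaration.
LATIN1_DOC = (
    '<?xml version="1.0" encoding="iso-8859-1"?>'
    '<condition data="bew\u00f6lkt"/>'
).encode("iso-8859-1")
UTF8_DOC = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<condition data="bew\u00f6lkt"/>'
).encode("utf-8")


def condition_text(raw_bytes):
    """Parse raw XML bytes and return the root element's 'data' attribute."""
    return ET.fromstring(raw_bytes).get("data")


latin1_result = condition_text(LATIN1_DOC)
utf8_result = condition_text(UTF8_DOC)

# The byte strings differ, but the parser reads the declaration and
# produces the same text from both.
print(LATIN1_DOC != UTF8_DOC, latin1_result == utf8_result)
```

No decode() call appears in the calling code: the bytes go in as received and the parser picks the codec from the declaration.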

StarWing · Oct 17, 2009

Yes, XML is a *byte*-based format, and so are utf-8 and the code pages
(cp936, cp1252, etc.), so XML usually declares its encoding in the header.
But that didn't work here.

In Python 2.6, sys.getdefaultencoding() returns 'ascii', I can't use
sys.setdefaultencoding(), and f.read() returns a str. So it must be
undecoded, byte-based data (i.e. raw XML data), and using the right code
page to decode it is safe (notice the webpage is google.de).

In Python 3.1, read() returns a bytes object, so we *must* decode it,
or else we can't pass it to a parser.

Arian Kuschki · Oct 17, 2009

Whoa, that was quick! Thanks for all the answers, I'll try to recapitulate

Diez said:
What does this show you in your interactive interpreter?

>>> print "\xc3\xb6"
ö

For me, it's o-umlaut, ö. This is because the above bytes are the
sequence for ö in utf-8.

If this shows something else, you need to adjust your terminal settings.

For me it also prints the correct o-umlaut (ö), so that was not the problem.

All of the below result in xml that shows all umlauts correctly when printed:

xml.decode("cp1252")
xml.decode("cp1252").encode("utf-8")
xml.decode("iso-8859-1")
xml.decode("iso-8859-1").encode("utf-8")

But when I then want to parse the XML, it only works if I do both the
decode and the encode. If I only decode, I get the following error:
SAXParseException: <unknown>:1:1: not well-formed (invalid token)

Do I understand correctly that since the encoding was not specified in the
XML response, it should have been utf-8 by default? And that if it had indeed
been utf-8, I would not have had the encoding problem in the first place?

Anyway, thanks everybody, this has helped me a lot.

Arian

Diez B. Roggisch · Oct 17, 2009

StarWing said:
In Python 3.1, read() returns a bytes object, so we *must* decode it,
or else we can't pass it to a parser.

You didn't get my point. An XML parser only *takes* a byte-string.
Decoding is its business. So your last sentence above is wrong.

Regardless of the Python version, if you feed the parser a
unicode-object, Python will first encode that to a byte-string, possibly
giving a UnicodeError (maybe this automatic conversion is gone in Py3K,
but then you get a type-error instead).

So to make the above work (if one wants to parse the XML), the proper
thing to do would be

xml = xml.decode("cp1252").encode("utf-8")

and then feed that to the parser. Of course the really good thing would be
to fix the webpage, but that's beyond our capabilities I fear...

Diez
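
The decode-then-encode step above can be sketched end to end with xml.sax. This is Python 3 syntax (an assumption; the thread uses Python 2), it works on a hypothetical in-memory sample instead of fetching the URL, and it assumes, as the thread does, that the undeclared bytes are cp1252. The names ConditionCollector, parse_undeclared_cp1252 and SAMPLE are made up for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical stand-in for the response body: cp1252 bytes with no
# encoding given in the XML declaration.
SAMPLE = (
    b'<?xml version="1.0"?><weather><current_conditions>'
    b'<condition data="Meistens bew\xf6lkt"/>'
    b'</current_conditions></weather>'
)


class ConditionCollector(ContentHandler):
    """Collect the 'data' attribute of every <condition> element."""

    def __init__(self):
        super().__init__()
        self.conditions = []

    def startElement(self, name, attrs):
        if name == "condition":
            self.conditions.append(attrs.get("data"))


def parse_undeclared_cp1252(raw):
    # Transcode to UTF-8, the encoding a parser assumes when the
    # declaration names none, then hand the parser bytes again.
    utf8_bytes = raw.decode("cp1252").encode("utf-8")
    collector = ConditionCollector()
    xml.sax.parseString(utf8_bytes, collector)
    return collector.conditions


conditions = parse_undeclared_cp1252(SAMPLE)
print(conditions)
```

The parser still receives a byte string; the transcoding only makes the bytes agree with what the missing declaration implies.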

Diez B. Roggisch · Oct 18, 2009

Arian said:
Do I understand right that since the encoding was not specified in the xml
response, it should have been utf-8 by default? And that if it had indeed been utf-8 I
would not have had the encoding problem in the first place?

Yes. XML without explicit encoding is implicitly UTF-8, and the page is
borked using cp* or latin* without saying so.

Diez
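
The UTF-8 default can be checked directly. A minimal sketch, assuming Python 3 and xml.etree.ElementTree (neither is used in the thread) and a hypothetical one-element document: the same text is serialized once as UTF-8 and once as Latin-1, both without an encoding in the declaration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with no encoding in its XML declaration.
DOC = '<?xml version="1.0"?><condition data="bew\u00f6lkt"/>'


def try_parse(raw_bytes):
    """Return the 'data' attribute, or a short description of the failure."""
    try:
        return ET.fromstring(raw_bytes).get("data")
    except ET.ParseError as exc:
        return "ParseError: %s" % exc


as_utf8 = try_parse(DOC.encode("utf-8"))
as_latin1 = try_parse(DOC.encode("iso-8859-1"))

print(as_utf8)
print(as_latin1)
```

Only the UTF-8 bytes parse; the Latin-1 bytes are rejected because the lone 0xF6 byte is not valid UTF-8, which is consistent with the "invalid token" error reported earlier in the thread.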

Diez B. Roggisch · Oct 18, 2009

Diez said:
Yes. XML without explicit encoding is implicitly UTF-8, and the page is
borked using cp* or latin* without saying so.

Ok, after reading some other posts in this thread, this assumption seems
not to hold. The HTTP protocol allows other encodings to be given
implicitly, which I think is an atrocity.

Diez
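
One place an encoding can be stated outside the document is the charset parameter of the HTTP Content-Type header. The thread does not show which headers the server actually sent, so the header value below is hypothetical, as are the helper names charset_from_content_type and to_utf8. The sketch uses Python 3 (an assumption) and the standard library's email.message.Message to parse the header.

```python
from email.message import Message
import xml.etree.ElementTree as ET


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


def to_utf8(raw_bytes, content_type):
    """Transcode a body from the transport-level charset to UTF-8."""
    charset = charset_from_content_type(content_type)
    return raw_bytes.decode(charset).encode("utf-8")


# Hypothetical response: Latin-1 body, charset given only in the header.
HEADER = "text/xml; charset=ISO-8859-1"
BODY = '<?xml version="1.0"?><condition data="bew\u00f6lkt"/>'.encode("iso-8859-1")

charset = charset_from_content_type(HEADER)
data = ET.fromstring(to_utf8(BODY, HEADER)).get("data")
print(charset, data)
```

With a real Python 3 urllib.request response, the headers object offers the same get_content_charset() method, so the header would not need to be parsed by hand.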
