XML and UnicodeError

Pinke Panke · Oct 4, 2004

Dear people,

I wrote a Python script to create HTML files. The structure was stored in
a nested array. To make it easier to maintain, I now describe the
structure in XML, using the minidom parser and a small function to
convert the XML structure into the array structure. So far so good.

Then the mess started. The XML document is described as utf-8, stored as
utf-8. iso-8859-1 makes no difference in this case.

When a character above 127, e.g. an umlaut, occurs, my string operations
raise errors. An example:

headline = structure[0]
pagetext = structure[1]
foo = headline + "bar" + pagetext

My script contains many such operations. The simple example is
easily solved by appending .encode('iso-8859-1') to the structure
statements. Not so nice, but OK. I hope there is a simpler
solution.

But there are also string replacements via regexes. An example to
illustrate:

pat = re.compile('<putithere>')
foo = 'def'
bar = 'abc<putithere>ghi'
htmlcode = pat.sub(foo,bar)

Appending .encode(...) to foo and bar does not fix the UnicodeError.

Is there a solution, something I forgot or could do better? Is
there any logic behind it? ;-)

TIA.
Martin

Paul Boddie · Oct 5, 2004

Pinke Panke said:
Dear people,

I wrote a Python script to create HTML files. The structure was stored in
a nested array. To make it easier to maintain, I now describe the
structure in XML, using the minidom parser and a small function to
convert the XML structure into the array structure. So far so good.

Note that any access to textual data in your DOM (XML) document will
yield Unicode values, not strings - this is relevant below.

Then the mess started. The XML document is described as utf-8, stored as
utf-8. iso-8859-1 makes no difference in this case.

After you've parsed the XML document, none of the encodings are
relevant - until you serialise the document, everything should be
Unicode (although I'm sure I've seen some XML libraries use plain
strings to represent values which consist only of ASCII characters).

When a character above 127, e.g. an umlaut, occurs, my string operations
raise errors. An example:

headline = structure[0]
pagetext = structure[1]
foo = headline + "bar" + pagetext

Are you not adding strings to Unicode values here? I can imagine that
at some point you've decided to change headline or pagetext to
something other than that extracted from the DOM document. However, if
you've used plain Python strings with non-ASCII characters, Python has
no way of knowing how to combine such strings with Unicode values,
since the encoding used in your strings is never made explicit.
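
To make the failure concrete, here is a small sketch. It assumes Python 3, where bytes plays the role of the plain string and str the role of the Unicode value; the sample values and the helper name implicit_ascii_join are made up. Python 2 quietly tried to decode the plain string as ASCII when combining the two, and that hidden step is written out explicitly here.

```python
# Assumption: Python 3, where bytes stands in for the plain string
# and str for the Unicode value. Sample data is made up.
headline = "M\u00fcller"                      # Unicode value, as minidom returns it
plain = "Stra\u00dfe".encode("iso-8859-1")    # plain byte string with a non-ASCII byte


def implicit_ascii_join(text, raw):
    """Mimic the Python 2 coercion: decode the byte string as ASCII, then join."""
    return text + raw.decode("ascii")


try:
    implicit_ascii_join(headline, plain)
    failed = False
except UnicodeDecodeError:
    # The byte 0xDF is not ASCII, and nothing says it is iso-8859-1.
    failed = True

# Naming the encoding explicitly removes the ambiguity.
joined = headline + " " + plain.decode("iso-8859-1")
```

The join only works once the encoding of the plain string is stated, which is the point made above.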

My script contains many such operations. The simple example is
easily solved by appending .encode('iso-8859-1') to the structure
statements. Not so nice, but OK. I hope there is a simpler
solution.

The solution is to use Unicode throughout.

But there are also string replacements via regexes. An example to
illustrate:

pat = re.compile('<putithere>')
foo = 'def'
bar = 'abc<putithere>ghi'
htmlcode = pat.sub(foo,bar)

Appending .encode(...) to foo and bar does not fix the UnicodeError.

Is there a solution, something I forgot or could do better? Is
there any logic behind it? ;-)

Yes, but it's complicated, so my advice is to...

1. Let minidom provide you with Unicode values.
2. Convert any other text to Unicode as soon as possible.
3. Manipulate only Unicode values - don't mix them up with
plain strings.
4. Serialise to your chosen encoding only when preparing
output.
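
The four steps above can be sketched roughly as follows. This is only an illustration under assumptions: it is written for Python 3 (str is the Unicode value, bytes the plain string), and the XML content, the element names and the helper build_page are hypothetical, not taken from the original script.

```python
# Assumption: Python 3 (str = Unicode value, bytes = plain string).
# The XML content and the helper name build_page are hypothetical.
from xml.dom import minidom

XML_SOURCE = (
    '<?xml version="1.0" encoding="utf-8"?>'
    "<page><headline>\u00dcbersicht</headline><text>Gr\u00fc\u00dfe</text></page>"
).encode("utf-8")


def build_page(xml_bytes, separator_bytes):
    # 1. minidom hands back Unicode values for all text nodes.
    doc = minidom.parseString(xml_bytes)
    headline = doc.getElementsByTagName("headline")[0].firstChild.data
    pagetext = doc.getElementsByTagName("text")[0].firstChild.data
    # 2. Convert any other text to Unicode as early as possible.
    separator = separator_bytes.decode("iso-8859-1")
    # 3. Manipulate only Unicode values.
    html = "<h1>" + headline + "</h1>" + separator + "<p>" + pagetext + "</p>"
    # 4. Encode once, when preparing output.
    return html.encode("utf-8")


output = build_page(XML_SOURCE, b" \xb7 ")
```

The only places where an encoding is named are the two boundaries: where foreign bytes come in, and where the finished page goes out.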

Paul

Pinke Panke · Oct 5, 2004

Hello Paul,

thank you for your answer.

The solution is to use Unicode throughout.

I thought so, but it did not seem easy enough to me.

1. Let minidom provide you with Unicode values.

Yes, I assume this is the default behaviour of the minidom parser.

2. Convert any other text to Unicode as soon as possible.

Ok, i.e.

headline = structure[0] # is unicode
pagetext = structure[1] # is unicode
fill = "bar".encode('utf-8') # lets make it unicode
foo = headline + fill + pagetext # foo is unicode, too

?

3. Manipulate only Unicode values - don't mix them up with
plain strings.

It makes sense, but I need some string concatenations. E.g. I set
default values in the python script and try to concatenate them with
XML values.

But now, I would think the safest way is to transfer all plain strings
in the python script into a second XML file and use them, because
after reading in they would be in Unicode. Right?

Or would saving the Python script as utf-8 make the difference?

4. Serialise to your chosen encoding only when preparing
output.

Every string concatenation in my script is preparing output.

I am looking forward to your answer.

Martin

Just · Oct 5, 2004

2. Convert any other text to Unicode as soon as possible.

Ok, i.e.

headline = structure[0] # is unicode
pagetext = structure[1] # is unicode
fill = "bar".encode('utf-8') # lets make it unicode

That's not making it unicode; you mean

fill = unicode("bar", "utf-8")

(Or "bar".decode("utf-8"), which does the same; I prefer using the
unicode builtin.)
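
For readers who mix up the two directions, a minimal sketch. It assumes Python 3, where b"bar" is the plain string and the unicode() builtin is spelled str(); the variable names are only illustrative.

```python
# Assumption: Python 3, where b"bar" is the plain string and "bar" the Unicode value.
plain = b"bar"

# decode: bytes -> Unicode (what unicode("bar", "utf-8") did in Python 2)
fill = plain.decode("utf-8")

# encode: Unicode -> bytes (the opposite direction)
back = fill.encode("utf-8")

# str(bytes, encoding) is the Python 3 spelling of the unicode() builtin form.
same = str(plain, "utf-8")
```

So .encode() never produces a Unicode value; it always leads away from one.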

foo = headline + fill + pagetext # foo is unicode, too

?

It makes sense, but I need some string concatenations. E.g. I set
default values in the python script and try to concatenate them with
XML values.

But now, I would think the safest way is to transfer all plain strings
in the python script into a second XML file and use them, because
after reading in they would be in Unicode. Right?

Yes, but there's no need to. Are you perhaps using string literals
containing non-ascii chars, yet don't use the 'u' prefix? u"\xff" as
opposed to "\xff".

Or would saving the Python script as utf-8 make the difference?

Depends...

Every string concatenation in my script is preparing output.

Do _all_ manipulations using unicode, and convert to utf-8 as late as
possible, i.e. when you're passing the result to code that expects
non-unicode data. That's basically what he was saying.

Just

Pinke Panke · Oct 5, 2004

Hello Just

Are you perhaps using string literals containing non-ascii chars,

Yes.

yet don't use the 'u' prefix? u"\xff" as opposed to "\xff".

No.

E.g. I convert umlauts to HTML entities or change symbols to ASCII
strings for file names. Instead of using the \x notation I typed the
character itself. In my script no character is above chr(255). An
example:

def foo(name):
    name = re.sub(r'®', '_registered_', name)
    # ... and many more substitutions

I think instead of r'' I should use u''?

It is possible to compile a RE object with the U flag:
matchreg = re.compile(u'®', re.U)
name = matchreg.sub('_registered_',name)

But maybe not necessary. In my tests, using any u-switches and
u-flags makes no difference. The only crucial things were
1. using unicode()
2. using a coding declaration as described in [1]
3. storing the Python script as utf-8

For me using unicode() is ok.
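
A sketch of the substitution with everything kept in Unicode. It assumes Python 3, where string literals are Unicode by default; to_filename is a hypothetical helper and only the registered-sign rule comes from this thread.

```python
import re

# Assumption: Python 3, where string literals are Unicode by default.
# to_filename is a hypothetical helper; only the (R) rule comes from the thread.
REGISTERED = re.compile("\u00ae")   # the registered sign, written as an escape


def to_filename(name):
    """Replace symbols with ASCII words; pattern, replacement and input are all Unicode."""
    return REGISTERED.sub("_registered_", name)


result = to_filename("Acme\u00ae Widget")

# Mixing a bytes pattern with a Unicode string is rejected outright in Python 3.
try:
    re.sub(b"\xae", b"_registered_", "Acme\u00ae Widget")
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

The re.U flag changes how classes such as \w and \b match. A literal pattern like this one matches the same way with or without it, which fits the observation that the flag made no difference.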

[1] http://python.org/peps/pep-0263.html

Martin

Paul Boddie · Oct 5, 2004

Just said:
That's not making it unicode; you mean

fill = unicode("bar", "utf-8")

(Or "bar".decode("utf-8"), which does the same; I prefer using the
unicode builtin.)

So do I - it can be confusing to think of performing a decoding
operation on a string which yields a Unicode object as a result.

Yes, but there's no need to. Are you perhaps using string literals
containing non-ascii chars, yet don't use the 'u' prefix? u"\xff" as
opposed to "\xff".

Having non-ASCII characters appear in string literals in the source
code can be somewhat risky, but there's always the encoding
declaration added in Python 2.3 to control the situation. Just
remember to convert any plain strings to Unicode in the program code.

Depends...

....on that encoding declaration amongst other things.

Do _all_ manipulations using unicode, and convert to utf-8 as late as
possible, i.e. when you're passing the result to code that expects
non-unicode data. That's basically what he was saying.

Yes, if you introduce a plain string anywhere, convert it to Unicode
as soon as you can, especially since such data has a habit of
getting into those Unicode-related operations even though you didn't
think that it would. Once the different substitutions and other
processing are done, you may want to convert the Unicode values back to
plain strings, although I often find that this doesn't need doing
before a program's final output is prepared (and in the case of XML
serialisation, you want to leave this to the serialiser anyway, since
it will also write out which encoding it used when serialising).
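
A short sketch of leaving the encoding to the serialiser. It assumes Python 3 and the standard-library minidom; the element name and text are made up.

```python
# Assumption: Python 3 and the standard-library minidom; element names are made up.
from xml.dom import minidom

doc = minidom.Document()
page = doc.createElement("page")
page.appendChild(doc.createTextNode("Gr\u00fc\u00dfe"))   # Unicode text node
doc.appendChild(page)

# The serialiser encodes the text and records the encoding in the declaration.
latin1_bytes = doc.toxml(encoding="iso-8859-1")
utf8_bytes = doc.toxml(encoding="utf-8")
```

Both results are byte strings whose XML declaration names the encoding that was actually used, so the declaration and the bytes cannot drift apart.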

Paul

Diez B. Roggisch · Oct 5, 2004

Just for the record: Don't confuse unicode with utf-8. The former is a
specification of more or less all characters used on this planet, the
latter an actual encoding of these that maps common ascii characters to
their well-known values and has escapes defined to encode all others, like
umlauts.

So

u'some text'

is not UTF-8 - it's a unicode object. If you do this:

u'some text'.encode('utf-8')

it becomes a binary string which is encoded using utf-8. Specifying the
coding of the python file using the

# -*- encoding: iso-8859-1 -*-

syntax means that the <some-text> found in

u'<some-text>'

is interpreted using the latin1 codec - so u'<some-text>' is shorthand
for

'<some-text>'.decode('iso-8859-1')
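
The difference between the characters and their encodings can be seen in a few lines. The sketch assumes Python 3, where "..." is a Unicode string and b"..." a byte string; the sample word is arbitrary.

```python
# Assumption: Python 3, where "..." is a Unicode string and b"..." a byte string.
text = "\u00fcber"                      # four characters, independent of any encoding

as_utf8 = text.encode("utf-8")          # b'\xc3\xbcber'  (5 bytes)
as_latin1 = text.encode("iso-8859-1")   # b'\xfcber'      (4 bytes)

# Decoding needs the codec that was used for encoding.
round_trip = as_latin1.decode("iso-8859-1")
wrong = as_utf8.decode("iso-8859-1")    # no error, but two wrong characters at the start
```

The same Unicode text yields different bytes per codec, and decoding with the wrong codec fails silently rather than with an exception.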

Regards,

diez
