ElementTree, XML and Unicode -- C0 Controls

Sébastien Boisgérault · Dec 11, 2006

Hi all,

The Unicode code points in the U+0000-U+001F range --
except newline, tab, carriage return -- are not legal
XML 1.0 characters.
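
For readers who want to test a string against that rule, here is a minimal sketch. The helper name `find_illegal_c0` is made up for illustration, and it only covers the C0 range discussed here, not the rest of the XML 1.0 character rules.

```python
import re

# C0 controls other than tab (0x09), newline (0x0A) and carriage return (0x0D).
_ILLEGAL_C0 = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def find_illegal_c0(text):
    """Return (index, character) pairs for C0 controls that XML 1.0 forbids."""
    return [(m.start(), m.group()) for m in _ILLEGAL_C0.finditer(text)]

print(find_illegal_c0("tab\tand newline\n are fine"))  # []
print(find_illegal_c0("bell\x07 is not"))              # [(4, '\x07')]
```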

Attempts to serialize such strings with ElementTree
and then parse the result back will fail:
[...]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 12
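
The snippet that produced this traceback is elided above, so the following is only a sketch of a comparable round trip, not the original code. It assumes Python 3's `xml.etree.ElementTree`, where the parser error is raised as `ET.ParseError` rather than a bare `ExpatError`; the element name `a` and its text are invented for the example.

```python
import xml.etree.ElementTree as ET

elem = ET.Element("a")
elem.text = "bad\x01char"  # U+0001 is not a legal XML 1.0 character

# Serialization succeeds: the control character is written out unchanged.
data = ET.tostring(elem)

# Parsing the result back is where the error surfaces.
try:
    ET.fromstring(data)
    reparsed = True
except ET.ParseError as exc:
    reparsed = False
    print("parse failed:", exc)
```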

Good! But I was expecting a failure *earlier*, in
the "tostring" function -- I basically assumed that
ElementTree would refuse to generate an XML
fragment that is not well-formed.

Could anyone comment on the rationale behind
the current behavior? Is it a performance issue,
the search for invalid Unicode code points being
too expensive?

Cheers,

SB

Fredrik Lundh · Dec 11, 2006

Sébastien Boisgérault said:
Could anyone comment on the rationale behind
the current behavior? Is it a performance issue,
the search for invalid Unicode code points being
too expensive?

the default serializer doesn't do any validation or well-formedness checks at all; it assumes
that you know what you're doing.

</F>
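
Since the default serializer leaves checking to the caller, one way to get the earlier failure is to wrap `tostring` yourself. The sketch below is not part of ElementTree: `checked_tostring` is a hypothetical helper, written against Python 3, that parses its own output back and raises `ValueError` if that fails.

```python
import xml.etree.ElementTree as ET

def checked_tostring(elem):
    """Serialize elem, then parse the result back to confirm it is well-formed."""
    data = ET.tostring(elem)
    try:
        ET.fromstring(data)
    except ET.ParseError as exc:
        raise ValueError("serialized output is not well-formed XML: %s" % exc) from exc
    return data

ok = ET.Element("a")
ok.text = "fine"
print(checked_tostring(ok))  # b'<a>fine</a>'

bad = ET.Element("a")
bad.text = "bad\x01char"
try:
    checked_tostring(bad)
except ValueError as exc:
    print("rejected:", exc)
```

The round trip costs a full extra parse on every call, so it is probably better suited to tests or to untrusted input than to hot paths.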

Sébastien Boisgérault · Dec 11, 2006

Fredrik Lundh said:
the default serializer doesn't do any validation or well-formedness checks at all; it assumes
that you know what you're doing.

</F>

Fair enough!

Thanks Fredrik.

SB
