Preventing control characters from entering an XML file

Frank Niessink · Jan 1, 2006

Hi list,

First of all, I wish you all a happy 2006. I have a small question that
googling didn't turn up an answer for. So hopefully you'll be kind
enough to send me in the right direction.

I'm developing a desktop application, called Task Coach, that saves its
domain objects (tasks, mostly) in an XML file. Users have reported
that sometimes their Task Coach file would become unreadable by Task
Coach after copying information from some other application into e.g. a
task description. Looking at the 'corrupted' file showed that control
characters ended up in the XML file (Control-K for example). Task Coach
uses xml.dom to create an XML document and save it, like this:

class XMLWriter:
    ...

    def write(self, taskList):
        domImplementation = xml.dom.getDOMImplementation()
        self.document = domImplementation.createDocument(None, 'tasks', None)
        ...
        for task in taskList.rootTasks():
            self.document.documentElement.appendChild(self.taskNode(task))
        self.document.writexml(self.__fd)  # __fd is a file open for writing

    ...

Apparently, the writexml method of xml.dom (which comes from
xml.dom.minidom if pyxml is not installed I think) does not feel that
writing control characters in an XML file is wrong, but the parser does:

Traceback (most recent call last):
  ....
  File "c:\Program Files\Python24\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 77, column 147
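The mismatch between writer and parser can be reproduced in isolation. What follows is only a minimal sketch for a present-day Python 3 interpreter (the thread itself is about Python 2.4), using nothing but the standard library's xml.dom.minidom. The helper name round_trip, the 'task' element and the sample strings are hypothetical and not taken from Task Coach.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def round_trip(description):
    """Serialize a one-task document and report whether it parses again."""
    # Assumption: minidom is the DOM implementation in use, as Frank suspects.
    implementation = minidom.getDOMImplementation()
    document = implementation.createDocument(None, 'tasks', None)
    task = document.createElement('task')  # hypothetical element name
    task.appendChild(document.createTextNode(description))
    document.documentElement.appendChild(task)
    serialized = document.toxml()  # control characters are written unchanged
    try:
        minidom.parseString(serialized)
    except ExpatError:
        return False
    return True


print(round_trip('plain text'))        # ordinary text survives the round trip
print(round_trip('pasted \x0b text'))  # Control-K makes the parser give up
```

The second call fails for the same reason as the traceback above: the serializer escapes markup characters but does not check for characters that XML 1.0 forbids.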

Rightfully so, because ^K is not valid XML 1.0, according to
http://www.w3.org/TR/REC-xml/:

"Legal characters are tab, carriage return, line feed, and the legal
characters of Unicode and ISO/IEC 10646. [...] Consequently, XML
processors MUST accept any character in the range specified for Char.

Character Range
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]"
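For reference, the Char production quoted above translates fairly directly into a regular expression. This is a sketch in present-day Python 3, not something from Task Coach or from the XML specification itself; the names _ILLEGAL_XML_CHARS and strip_illegal_xml_chars are made up for illustration.

```python
import re

# Negated character class: matches anything NOT allowed by
# Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    r'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]')


def strip_illegal_xml_chars(text):
    """Return text with every character outside the Char range removed."""
    return _ILLEGAL_XML_CHARS.sub('', text)


print(repr(strip_illegal_xml_chars('Test\t\x0b\x00ok\n')))
```

Unlike a table that only covers the first 32 code points, this also drops U+FFFE, U+FFFF and lone surrogates, which the production excludes as well.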

So, all this leads me to the following questions:
- Why does the writexml method of the document created by the object
returned by getDOMImplementation() allow control characters? Isn't that a bug?
- What is the easiest/most pythonic (preferably built-in) way of
checking a unicode string for control characters and weeding those
characters out?

Thanks, Frank

Scott David Daniels · Jan 1, 2006

Frank said:
...
Character Range
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]"

- What is the easiest/most pythonic (preferably built-in) way of
checking a unicode string for control characters and weeding those
characters out?

drop_controls = [None] * 0x20
for c in '\t\r\n':
    drop_controls[c] = unichr(c)
...
some_unicode_string = some_unicode_string.translate(drop_controls)

--Scott David Daniels

Frank Niessink · Jan 5, 2006

Scott said:
Frank said:

- What is the easiest/most pythonic (preferably built-in) way of
checking a unicode string for control characters and weeding those
characters out?

drop_controls = [None] * 0x20
for c in '\t\r\n':
    drop_controls[c] = unichr(c)
...
some_unicode_string = some_unicode_string.translate(drop_controls)

Hi Scott,

Your code gave me a "TypeError: an integer is required". Anyway, it was
sufficient to push me in the right direction. This is my version:

UNICODE_CONTROL_CHARACTERS_TO_WEED = {}
for ordinal in range(0x20):
    if chr(ordinal) not in '\t\r\n':
        UNICODE_CONTROL_CHARACTERS_TO_WEED[ordinal] = None

Which lets you pass it to translate() and get, for example:
u'Test\t'
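The same idea carries over to Python 3, where every str is Unicode and the u prefix is no longer needed. The following is a sketch under that assumption, not Frank's actual Task Coach code; the function name weed_control_characters and the sample input are hypothetical.

```python
# Map every control character below 0x20, except tab, CR and LF,
# to None, which str.translate() interprets as "delete this character".
CONTROL_CHARACTERS_TO_WEED = {
    ordinal: None
    for ordinal in range(0x20)
    if chr(ordinal) not in '\t\r\n'
}


def weed_control_characters(text):
    """Return text without the control characters that XML 1.0 rejects."""
    return text.translate(CONTROL_CHARACTERS_TO_WEED)


print(repr(weed_control_characters('Test\t\x0b')))
```

With a dict as the table, any ordinal that is not a key is simply left alone, so ordinary text passes through untouched.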

Thanks, Frank

Scott David Daniels · Jan 6, 2006

Frank said:
Scott said:

Frank said:

- What is the easiest/most pythonic (preferably built-in) way of
checking a unicode string for control characters and weeding those
characters out?

drop_controls = [None] * 0x20
for c in '\t\r\n':
    drop_controls[c] = unichr(c)
...
some_unicode_string = some_unicode_string.translate(drop_controls)

Your code gave me a "TypeError: an integer is required"....

Sorry about that.

    drop_controls[c] = unichr(c)
should have been:
    drop_controls[ord(c)] = unichr(c)
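With that correction applied, the list-based table can be sketched for Python 3 as follows. This is an adaptation, not Scott's original code: unichr no longer exists there, so the table stores the character itself, and the wrapper name drop_control_characters is hypothetical.

```python
# One slot per code point below 0x20; None means "delete".
drop_controls = [None] * 0x20
for c in '\t\r\n':
    # Index by ordinal (the original bug indexed by the character itself).
    drop_controls[ord(c)] = c


def drop_control_characters(text):
    """Remove control characters below 0x20 except tab, CR and LF."""
    # For ordinals >= 0x20 the list lookup raises IndexError, a LookupError,
    # which str.translate() treats as "leave this character unchanged".
    return text.translate(drop_controls)


print(repr(drop_control_characters('a\x0bb\tc')))
```

The list only needs 32 entries because translate() falls back to the identity mapping whenever the lookup fails.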
