F
Frank Niessink
Hi list,
First of all, I wish you all a happy 2006. I have a small question that
googling didn't turn up an answer for. So hopefully you'll be kind
enough to send me in the right direction.
I'm developing a desktop application, called Task Coach, that saves its
domain objects (tasks, mostly in an XML file. Users have reported
that sometimes their Task Coach file would become unreadable by Task
Coach after copying information from some other application into e.g. a
task description. Looking at the 'corrupted' file showed that control
characters ended up in the XML file (Control-K for example). Task Coach
uses xml.dom to create an XML document and save it, like this:
class XMLWriter:
...
def write(self, taskList):
domImplementation = xml.dom.getDOMImplementation()
self.document = domImplementation.createDocument(None, 'tasks',
None)
...
for task in taskList.rootTasks():
self.document.documentElement.appendChild(self.taskNode(task))
self.document.writexml(self.__fd) # __fd is a file open for writing
...
Apparently, the writexml method of xml.dom (which comes from
xml.dom.minidom if pyxml is not installed I think) does not feel that
writing control characters in an XML file is wrong, but the parser does:
Traceback (most recent call last):
....
File "c:\Program Files\Python24\lib\xml\dom\expatbuilder.py", line
207, in parseFile
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 77,
column 147
Rightfully so, because ^K is not valid XML 1.0, according to
http://www.w3.org/TR/REC-xml/:
"Legal characters are tab, carriage return, line feed, and the legal
characters of Unicode and ISO/IEC 10646. [...] Consequently, XML
processors MUST accept any character in the range specified for Char.
Character Range
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]"
So, all this leads me to the following questions:
- Why does the writexml method of the document created by the object
returned by domImplementation() allow control characters? Isn't that a bug?
- What is the easiest/most pythonic (preferably build-in) way of
checking a unicode string for control characters and weeding those
characters out?
Thanks, Frank
First of all, I wish you all a happy 2006. I have a small question that
googling didn't turn up an answer for. So hopefully you'll be kind
enough to send me in the right direction.
I'm developing a desktop application, called Task Coach, that saves its
domain objects (tasks, mostly in an XML file. Users have reported
that sometimes their Task Coach file would become unreadable by Task
Coach after copying information from some other application into e.g. a
task description. Looking at the 'corrupted' file showed that control
characters ended up in the XML file (Control-K for example). Task Coach
uses xml.dom to create an XML document and save it, like this:
class XMLWriter:
...
def write(self, taskList):
domImplementation = xml.dom.getDOMImplementation()
self.document = domImplementation.createDocument(None, 'tasks',
None)
...
for task in taskList.rootTasks():
self.document.documentElement.appendChild(self.taskNode(task))
self.document.writexml(self.__fd) # __fd is a file open for writing
...
Apparently, the writexml method of xml.dom (which comes from
xml.dom.minidom if pyxml is not installed I think) does not feel that
writing control characters in an XML file is wrong, but the parser does:
Traceback (most recent call last):
....
File "c:\Program Files\Python24\lib\xml\dom\expatbuilder.py", line
207, in parseFile
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 77,
column 147
Rightfully so, because ^K is not valid XML 1.0, according to
http://www.w3.org/TR/REC-xml/:
"Legal characters are tab, carriage return, line feed, and the legal
characters of Unicode and ISO/IEC 10646. [...] Consequently, XML
processors MUST accept any character in the range specified for Char.
Character Range
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]"
So, all this leads me to the following questions:
- Why does the writexml method of the document created by the object
returned by domImplementation() allow control characters? Isn't that a bug?
- What is the easiest/most pythonic (preferably build-in) way of
checking a unicode string for control characters and weeding those
characters out?
Thanks, Frank