Problem round-tripping with xml.dom.minidom pretty-printer

Ben Butler-Cole · Feb 29, 2008

Hello

I have run into a problem using minidom. I have an HTML file that I
want to make occasional, automated changes to (adding new links). My
strategy is to parse it with minidom, add a node, pretty print it and
write it back to disk.

However I find that every time I do a round trip minidom's pretty
printer puts extra blank lines around every element, so my file grows
without limit. I have found that normalizing the document doesn't make
any difference. Obviously I can fix the problem by doing without the
pretty-printing, but I don't really like producing non-human readable
HTML.

Here is some code that shows the behaviour:

import xml.dom.minidom as dom

def p(t):
    d = dom.parseString(t)
    d.normalize()
    t2 = d.toprettyxml()
    print(t2)
    p(t2)  # deliberately re-parses and re-prints its own output

p('<a><c/></a>')

Does anyone know how to fix this behaviour? If not, can anyone
recommend an alternative XML tool for simple tasks like this?

Thanks
Ben

Robert Bossy · Feb 29, 2008

Hi,

The last line of p() calls itself: it is an unconditional recursive call
so, no matter what it does, it will never stop. And since p() also
prints something, calling it will print endlessly. By removing this
line, you get something like:

<?xml version="1.0" ?>
<a>

<c/>

</a>

That seems sensible, imo. Was that what you wanted?

An additional thing to keep in mind is that toprettyxml does not print
an XML identical to the original DOM tree: it adds newlines and tabs.
When parsed again these blank characters are inserted in the DOM tree as
character nodes. If you toprettyxml an XML document twice in a row, then
the second one will also add newlines and tabs around the newlines and
tabs added by the first. Since you call toprettyxml an infinite number
of times, it is expected that lots of blank characters appear.
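
The growth described above can be reproduced without the infinite recursion. This is only a minimal sketch, assuming Python 3's xml.dom.minidom; the helper name pretty_round_trip is made up for illustration and is not part of minidom.

```python
import xml.dom.minidom as dom

def pretty_round_trip(text):
    """Parse text and return minidom's pretty-printed serialization."""
    return dom.parseString(text).toprettyxml()

first = pretty_round_trip('<a><c/></a>')
second = pretty_round_trip(first)
third = pretty_round_trip(second)

# Each pass re-indents the whitespace text nodes left behind by the
# previous pass, so the serialized document keeps getting longer.
for label, text in (('first', first), ('second', second), ('third', third)):
    print(label, len(text), repr(text))
```

Printing the repr makes the accumulated newlines and tabs visible, which is easier to read than the blank lines themselves.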

Finally, normalize() is supposed to merge adjacent sibling character
nodes; it never removes character content, even if that content is
blank. That means that several adjacent character nodes will be replaced
by a single one whose content is the concatenation of the contents of
the original nodes. Clear enough?
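
To illustrate what normalize() does and does not do, here is a small sketch, again assuming Python 3's minidom. The describe helper and the sample document are made up for the example.

```python
import xml.dom.minidom as dom

def describe(node):
    """Return the text of a text node, or the tag name of any other node."""
    return node.data if node.nodeType == node.TEXT_NODE else node.nodeName

doc = dom.parseString('<a> <c/> </a>')
root = doc.documentElement

# Append two more text nodes right after the trailing blank text node,
# so that three text nodes sit next to each other.
root.appendChild(doc.createTextNode('foo'))
root.appendChild(doc.createTextNode('bar'))
before = [describe(n) for n in root.childNodes]

root.normalize()
after = [describe(n) for n in root.childNodes]

print(before)
print(after)
```

The adjacent text nodes are merged into one, but the blank text before and after the c element is still there.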

Cheers,
RB

Ben Butler-Cole · Feb 29, 2008

Robert said:
The last line of p() calls itself: it is an unconditional recursive call
so, no matter what it does, it will never stop. And since p() also
prints something, calling it will print endlessly.

Sorry, I wasn't clear. I realize that this recurses endlessly. The
problem is that it also adds blank lines endlessly.

Robert said:
By removing this line, you get something like:

<?xml version="1.0" ?>
<a>

<c/>

</a>

That seems sensible, imo. Was that what you wanted?

Sure. That's fine unless you then re-parse this output and print it
again, in which case you get the behaviour you describe:

Robert said:
An additional thing to keep in mind is that toprettyxml does not print
an XML identical to the original DOM tree: it adds newlines and tabs.
When parsed again these blank characters are inserted in the DOM tree as
character nodes. If you toprettyxml an XML document twice in a row, then
the second one will also add newlines and tabs around the newlines and
tabs added by the first. Since you call toprettyxml an infinite number
of times, it is expected that lots of blank characters appear.

Right. That's the behaviour I'm asking about, which I consider to be
problematic. I would expect a module providing a parser and pretty-
printer (not just for XML parsers) to be able to conservatively round-
trip.

As far as I can see (and your comments back this up) minidom doesn't
have this property. Unless anyone knows how to get it to behave that
way...

Ben

Robert Bossy · Feb 29, 2008

Ben said:
Right. That's the behaviour I'm asking about, which I consider to be
problematic. I would expect a module providing a parser and pretty-
printer (not just for XML parsers) to be able to conservatively round-
trip.

As far as I can see (and your comments back this up) minidom doesn't
have this property. Unless anyone knows how to get it to behave that
way...

minidom --any DOM parser, btw-- has no means to know whether a blank
character is a pretty-print artefact or actual blank content from the
original XML.

You could write a function that strips all-blank text nodes recursively
down the element tree. Before doing so, I suggest you take a look at
section 2.10 ("White Space Handling") of http://www.w3.org/TR/REC-xml/.
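
One possible shape for such a function is sketched below, assuming Python 3's minidom. The names strip_blank_text and round_trip are hypothetical, and the sketch treats every whitespace-only text node as insignificant, which is an assumption that does not hold for documents with mixed content or xml:space="preserve".

```python
import xml.dom.minidom as dom

XML_WHITESPACE = ' \t\r\n'  # the whitespace characters defined by XML 1.0

def strip_blank_text(node):
    """Recursively remove text nodes that contain only XML whitespace."""
    for child in list(node.childNodes):  # copy: the list is mutated below
        if child.nodeType == child.TEXT_NODE and not child.data.strip(XML_WHITESPACE):
            node.removeChild(child)
        else:
            strip_blank_text(child)

def round_trip(text):
    """Parse, drop blank text nodes, then pretty-print."""
    doc = dom.parseString(text)
    strip_blank_text(doc)
    return doc.toprettyxml()

first = round_trip('<a><b><c/></b></a>')
second = round_trip(first)
print(first == second)
```

For an element-only document like this one, stripping before each pretty-print keeps the output the same from one round trip to the next. Text content inside elements is not touched by the stripping step, but how toprettyxml lays it out depends on the Python version, so it is worth checking on a real file before relying on it.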

RB
