Minimally intrusive XML editing using Python

T

Thomas Lotze

I wonder what Python XML library is best for writing a program that makes
small modifications to an XML file in a minimally intrusive way. By that I
mean that information the program doesn't recognize is kept, as are
comments and whitespace, the order of attributes and even whitespace
around attributes. In short, I want to be able to change an XML file while
producing minimal textual diffs.

Most libraries don't allow controlling the order of and the whitespace
around attributes, so what's generally left to do is store snippets of
original text along with the model objects and re-use that for writing the
edited XML if the model wasn't modified by the program. Does a library
exist that helps with this? Does any XML library at all allow structured
access to the text representation of a tag with its attributes?

Thank you very much.
 
S

Stefan Behnel

Thomas Lotze, 18.11.2009 13:55:
I wonder what Python XML library is best for writing a program that makes
small modifications to an XML file in a minimally intrusive way. By that I
mean that information the program doesn't recognize is kept, as are
comments and whitespace, the order of attributes and even whitespace
around attributes. In short, I want to be able to change an XML file while
producing minimal textual diffs.

Take a look at canonical XML (C14N). In short, that's the only way to get a
predictable XML serialisation that can be used for textual diffs. It's
supported by lxml.

Stefan
 
C

Chris Rebert

I wonder what Python XML library is best for writing a program that makes
small modifications to an XML file in a minimally intrusive way. By that I
mean that information the program doesn't recognize is kept, as are
comments and whitespace, the order of attributes and even whitespace
around attributes. In short, I want to be able to change an XML file while
producing minimal textual diffs.

Most libraries don't allow controlling the order of and the whitespace
around attributes, so what's generally left to do is store snippets of
original text along with the model objects and re-use that for writing the
edited XML if the model wasn't modified by the program. Does a library
exist that helps with this? Does any XML library at all allow structured
access to the text representation of a tag with its attributes?

Have you considered using an XML-specific diff tool such as:
* One off this list:
http://www.manageability.org/blog/stuff/open-source-xml-diff-in-java
* xmldiff (it's in Python even): http://www.logilab.org/859
* diffxml: http://diffxml.sourceforge.net/

[Note: I haven't actually used any of these.]

Cheers,
Chris
 
T

Thomas Lotze

Stefan said:
Take a look at canonical XML (C14N). In short, that's the only way to get a
predictable XML serialisation that can be used for textual diffs. It's
supported by lxml.

Thank you for the pointer. IIUC, c14n is about changing an XML document so
that its textual representation is reproducible. While this representation
would certainly solve my problem if I were to deal with input that's
already in c14n form, it doesn't help me handling arbitrarily formatted
XML in a minimally intrusive way.

IOW, I don't want the XML document to obey the rules of a process, but
instead I want a process that respects the textual form my input happens
to have.
 
T

Thomas Lotze

Chris said:
Have you considered using an XML-specific diff tool such as:

I'm afraid I'll have to fall back to using such a thing if I don't find a
solution to what I actually want to do.

I do realize that XML isn't primarily about its textual representation, so
I guess I shouldn't be surprised if what I'm looking for doesn't exist.
Still, it would be nice if it did...
 
A

Anthra Norell

Thomas said:
Chris Rebert wrote:



I'm afraid I'll have to fall back to using such a thing if I don't find a
solution to what I actually want to do.

I do realize that XML isn't primarily about its textual representation, so
I guess I shouldn't be surprised if what I'm looking for doesn't exist.
Still, it would be nice if it did...
Thomas,
I just might have what you are looking for. But I want to be sure I
understand your problem. So, please show a small sample and explain how
you want it modified.
Frederic
 
D

Dave Angel

Thomas said:
Chris Rebert wrote:



I'm afraid I'll have to fall back to using such a thing if I don't find a
solution to what I actually want to do.

I do realize that XML isn't primarily about its textual representation, so
I guess I shouldn't be surprised if what I'm looking for doesn't exist.
Still, it would be nice if it did...
What's your real problem, or use case? Are you just concerned with
diffing, or are others likely to read the xml, and want it formatted the
way it already is? And how general do you need this tool to be? For
example, if the only thing you're doing is modifying existing attributes
or existing tags, the "minimal change" would be pretty unambiguous. But
if you're adding tags, or adding content on what was an empty element,
then the requirement gets fuzzy And finding an existing library for
something "fuzzy" is unlikely.

Sample input, change list, and desired output would be very useful.

DaveA
 
N

Nobody

I wonder what Python XML library is best for writing a program that makes
small modifications to an XML file in a minimally intrusive way. By that I
mean that information the program doesn't recognize is kept, as are
comments and whitespace, the order of attributes and even whitespace
around attributes. In short, I want to be able to change an XML file while
producing minimal textual diffs.

Most libraries don't allow controlling the order of and the whitespace
around attributes, so what's generally left to do is store snippets of
original text along with the model objects and re-use that for writing the
edited XML if the model wasn't modified by the program. Does a library
exist that helps with this? Does any XML library at all allow structured
access to the text representation of a tag with its attributes?

Expat parsers have a CurrentByteIndex field, while SAX parsers have
locators. You can use this to identify the portions of the input which
need to be processed, and just copy everything else. One downside is that
these only report either the beginning (Expat) or end (SAX) of the tag;
you'll have to deduce the other side yourself.

OTOH, "diff" is probably the wrong tool for the job.
 
T

Thomas Lotze

Please consider this a reply to any unanswered messages I received in
response to my original post.

Dave said:
What's your real problem, or use case? Are you just concerned with
diffing, or are others likely to read the xml, and want it formatted the
way it already is?

I'd like to put the XML under revision control along with other stuff.
Other people should be able to make sense of the diffs and I'd rather not
require them to configure their tools to use some XML differ.
And how general do you need this tool to be? For
example, if the only thing you're doing is modifying existing attributes
or existing tags, the "minimal change" would be pretty unambiguous. But
if you're adding tags, or adding content on what was an empty element,
then the requirement gets fuzzy And finding an existing library for
something "fuzzy" is unlikely.

Sure. I guess it's something like an 80/20 problem: Changing attributes in
a way that keeps the rest of the XML intact will go a long way and as
we're talking about XML that is supposed to be looked at by humans, I
would base any further requirements on the assumption that it's
pretty-printed in some way so that removing an element, for example, can
be defined by touching as few lines as possible, and adding one can be
restricted to adding a line in the appropriate place. If more complex
stuff isn't as well-defined, that would be entirely OK with me.
Sample input, change list, and desired output would be very useful.

I'd like to be able to reliably produce a diff like this using a program
that lets me change the value in some useful way, which might be dragging
a point across a map with the mouse in this example:

--- foo.gpx 2009-05-30 19:45:45.000000000 +0200
+++ bar.gpx 2009-11-23 17:41:36.000000000 +0100
@@ -11,7 +11,7 @@
<speed>0.792244</speed>
<fix>2d</fix>
</trkpt>
-<trkpt lat="50.605995000" lon="10.709680000">
+<trkpt lat="50.605985000" lon="10.709680000">
<ele>508.300000</ele>
<time>2009-05-30T16:37:10Z</time>
<course>15.150000</course>
 
N

Nobody

I'd like to put the XML under revision control along with other stuff.
Other people should be able to make sense of the diffs and I'd rather not
require them to configure their tools to use some XML differ.

In which case, the data format isn't "XML", but a subset of it (and
probably an under-defined subset at that, unless you invest a lot of
effort in defining it).

That defeats one of the advantages of using a standardised "container"
format such as XML, i.e. being able to take advantage of existing tools
and libraries (which will be written to the offical standard, not to your
private "standard").

One option is to require the files to be in a canonical form. Depending
upon your revision control system, you may be able to configure it to
either check that updated files are in this form, or even to convert them
automatically.

If your existing files aren't in such a form, there will be a one-time
penalty (i.e. a huge diff) when converting the files.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,995
Messages
2,570,228
Members
46,818
Latest member
SapanaCarpetStudio

Latest Threads

Top