truncating specific lines from xml

foolproofplan · Jan 30, 2007

I have a somewhat simple task I need to do, but since I am new at xml,
I need help:

Right now, I have xml files that are output from tests I do with an
automated testing program. I want to compare these files back to the
originals I have, but there is one little complication: the xml files
have lines of code added in them with unique ids which are included in
the xml file when it is run. These unique ids are currently throwing
off the xml tester. How can I go about getting rid of these lines of
unique ids so that the files compared are the same again?

Thanks in advance!

p.lepin · Jan 30, 2007

> Right now, I have xml files that are output from tests I
> do with an automated testing program. I want to compare
> these files back to the originals I have, but there is
> one little complication: the xml files have lines of code
> added in them with unique ids which are included in the
> xml file when it is run. These unique ids are currently
> throwing off the xml tester. How can I go about getting
> rid of these lines of unique ids so that the files
> compared are the same again?

Your question is pretty much impossible to answer as it is.
You should've provided some (possibly simplified) examples
to get your meaning across to group readers. For one thing,
speaking of 'lines' in XML is quite meaningless.

It sounds as if XSLT would fit the bill, but that would
depend on some factors. If you need to remove some easily
distinguishable nodes, there probably isn't a better
solution than XSLT identity with exclusions. But in case
the stuff you need removed is buried within the text nodes,
XSLT suddenly becomes a much less attractive proposition--
it's just not that good at juggling strings, it was never
meant for that.

Andy Dingley · Jan 30, 2007

> How can I go about getting rid of these lines of
> unique ids so that the files compared are the same again?

You need to suppress these ids (and datestamps / usernames etc.) and
also to canonicalise the XML serialisation. Ideally we wouldn't need
to do the second, we'd just use an XML-aware comparison tool.
However you're probably using some old unix command-line textfile
comparator that doesn't understand XML whitespace equivalence.
Serialise it first to something with each tag unindented on its own
line, and a repeatable text format output for comparable XML input.
XSLT can do this.

Run them through XSLT, using the "identity copy" template (search for
it), modified to recognise the ids and to output nothing for them.
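
A minimal sketch of the same two steps (suppress the ids, then serialise to a repeatable one-element-per-line text form) in Python rather than XSLT, using only the standard library. The function name `flatten`, the output layout, and the choice to ignore only the `id` attribute are assumptions made for illustration; nothing here comes from the poster's actual tooling.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text, ignored_attrs=("id",)):
    """Return one text line per element, in document order.

    Attributes named in ignored_attrs are dropped, the remaining ones are
    sorted by name, and leading/trailing whitespace in element text is
    stripped. Tail text (mixed content) is not handled in this sketch.
    """
    lines = []

    def walk(elem, parent_path):
        path = f"{parent_path}/{elem.tag}"
        attrs = " ".join(
            f'{name}="{value}"'
            for name, value in sorted(elem.attrib.items())
            if name not in ignored_attrs
        )
        text = (elem.text or "").strip()
        lines.append(f"{path} [{attrs}] {text}".rstrip())
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return lines
```

Writing the returned lines of each file to a text file would give a plain text comparator something stable to work on, since indentation and attribute order no longer show up in the output.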

foolproofplan · Jan 30, 2007

The tester is using a Python script (which I did not create) to
compare the xml files. Is there a way we can work with this?

foolproofplan · Jan 30, 2007

Here is an example of two xml files that are exactly the same, except
for the fact that they have different ids:

XML file one:

<?xml version="1.0" encoding="UTF-8"?>

<EnCapta>
<Document type="Part" id=":1156453195:1262379012:" name="New Document" >
<FileName>\New Document</FileName>
<Unit/>
<ApplicationData id=":1156453207:1327785362:" name="CAD_Note" >
<ApplicationReference id_ref=":91005593:790373312:" >
<Name>CAD_Note</Name>
<MajorVersion>0</MajorVersion>
<MinorVersion>0</MinorVersion>
</ApplicationReference>
<Note template_id=":96227828:304003723:" id=":1156453207:1116306377:" name="Note1" >
<Name type="FixedString" >Note1</Name>
<Author type="FixedString" >SHO</Author>
<CreationDate type="DateTime" >2006-08-24T17:00:07</CreationDate>
<ModificationDate type="DateTime" >2006-08-24T17:00:07</ModificationDate>
<RelatingTo type="FixedString" >Engineering</RelatingTo>
<Description type="String" >1234</Description>
</Note>
</ApplicationData>
</Document>
</EnCapta>

XML file two:

<?xml version="1.0" encoding="UTF-8"?>

<EnCapta>
<Document type="Part" id=":1170176183:1209286222:" name="New Document" >
<FileName>\New Document</FileName>
<Unit/>
<ApplicationData id=":1170176190:357510851:" name="CAD_Note" >
<ApplicationReference id_ref=":91005593:790373312:" >
<Name>CAD_Note</Name>
<MajorVersion>0</MajorVersion>
<MinorVersion>0</MinorVersion>
</ApplicationReference>
<Note template_id=":96227828:304003723:" id=":1170176190:655829958:" name="Note1" >
<Name type="FixedString" >Note1</Name>
<Author type="FixedString" >SHO</Author>
<CreationDate type="DateTime" >2000-01-01T12:00:01</CreationDate>
<ModificationDate type="DateTime" >2000-01-01T12:00:01</ModificationDate>
<RelatingTo type="FixedString" >Engineering</RelatingTo>
<Description type="String" >1234</Description>
</Note>
</ApplicationData>
</Document>
</EnCapta>

Andy Dingley · Jan 30, 2007

> The tester is using a python script (which i did not create) to
> compare the xml files. Is there the way we can work with this?

Use XSLT first, as I described.

Or re-write the Python comparator so as to ignore the ids as well as
any other XML whitespace it presumably already ignores.
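
The existing comparator script is not shown in the thread, so the following is not a patch to it, only a hedged sketch of what an id-ignoring comparison could look like with the standard library. The function name `same_xml` and the default of ignoring just the `id` attribute are assumptions.

```python
import xml.etree.ElementTree as ET


def same_xml(a, b, ignored_attrs=("id",)):
    """Compare two elements recursively, ignoring selected attributes.

    Whitespace around text and tail content is ignored as well; child
    order is treated as significant.
    """
    def attrs(elem):
        return {k: v for k, v in elem.attrib.items() if k not in ignored_attrs}

    def text(value):
        return (value or "").strip()

    if a.tag != b.tag or attrs(a) != attrs(b):
        return False
    if text(a.text) != text(b.text) or text(a.tail) != text(b.tail):
        return False
    if len(a) != len(b):
        return False
    return all(same_xml(x, y, ignored_attrs) for x, y in zip(a, b))
```

It would be called on the two parsed roots, for example `same_xml(ET.parse("test1.xml").getroot(), ET.parse("test2.xml").getroot())`.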

p.lepin · Jan 31, 2007

Please don't top-post. Top-posting fixed.

> here is an example of two xml files that are exactly the
> same, except for the fact that they have different ids:

[snip]

It seems it wouldn't be possible without transforming both
files (unless you're willing to write a tool for comparing
them in XSLT). The following transformation strips the id
attributes from all elements:

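<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Identity template: copy every attribute and node unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Exclusion: emit nothing for any attribute named "id". -->
  <xsl:template match="@id"/>
</xsl:stylesheet>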

Testing results:

pavel@debian:~/dev/xslt$ saxon -novw test1.xml strip_id.xsl > test1_prc.xml
pavel@debian:~/dev/xslt$ saxon -novw test2.xml strip_id.xsl > test2_prc.xml

pavel@debian:~/dev/xslt$ diff test1_prc.xml test2_prc.xml
14,15c14,15
< <CreationDate type="DateTime">2006-08-24T17:00:07</CreationDate>
< <ModificationDate type="DateTime">2006-08-24T17:00:07</ModificationDate>
---
> <CreationDate type="DateTime">2000-01-01T12:00:01</CreationDate>
> <ModificationDate type="DateTime">2000-01-01T12:00:01</ModificationDate>

Uh oh. It seems there are a couple more differences in
those files. Anyway, if you know precisely what you need
stripped, the transformation given above should serve as a
good starting point.
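
If the timestamps turn out to be noise as well, the same idea extends from attributes to element content. A rough Python sketch, assuming (and this is only an assumption, the original poster never confirmed it) that `CreationDate` and `ModificationDate` should be ignored along with `id`; the function name `scrub` is made up for the example.

```python
import xml.etree.ElementTree as ET


def scrub(xml_text, volatile_attrs=("id",),
          volatile_elems=("CreationDate", "ModificationDate")):
    """Return XML text with volatile attributes removed and the text of
    volatile elements blanked.

    Which names count as volatile is an assumption; adjust both tuples to
    match what really changes between test runs. Child elements of a
    volatile element are left alone in this sketch.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        for name in volatile_attrs:
            elem.attrib.pop(name, None)
        if elem.tag in volatile_elems:
            elem.text = None
    return ET.tostring(root, encoding="unicode")
```

Both files need to go through the same function before they are compared, just as both went through the stylesheet above.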

Joe Kesselman · Jan 31, 2007

> It seems it wouldn't be possible without transforming both
> files (unless you're willing to write a tool for comparing
> them in XSLT).

Or in another programming language, e.g. by using a SAX or DOM parser and
writing a parallel tree-walker that understands which differences are
meaningful and which aren't.

Note that a text diff is often not the right tool anyway, because there
are things which XML itself doesn't consider meaningful -- order of
attributes, whitespace in some places, that sort of thing. So if you're
doing a serious test suite, you usually wind up having to write some
special-purpose code anyway, or find something you can swipe for the
purpose.

For example: You might want to look at the compare code used in the
Xalan processor's regression test suite, and either adapt that to also
ignore the things you don't consider meaningful or (as Pavel suggested)
preprocess those away before comparing. Another approach I've seen
(which again would require preprocessing) involved canonicalizing the
two documents (which theoretically suppresses most or all of the
insignificant differences) and then doing a text diff against the results.
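
The canonicalize-then-diff approach can be sketched with the Python standard library (`xml.etree.ElementTree.canonicalize`, available from Python 3.8, plus `difflib`). The helper name `canonical_diff` is an assumption, and this sketch does not remove ids or dates on its own; that preprocessing would still have to happen first.

```python
import difflib
import xml.etree.ElementTree as ET


def canonical_diff(xml_a, xml_b):
    """Canonicalize both documents (C14N 2.0) and return unified-diff lines.

    Canonical form sorts attributes and, with strip_text=True, drops
    insignificant whitespace around text, so neither shows up in the diff.
    """
    def canon_lines(xml_text):
        canon = ET.canonicalize(xml_text, strip_text=True)
        # Break between adjacent tags so the diff works line by line.
        return canon.replace("><", ">\n<").splitlines()

    return list(
        difflib.unified_diff(canon_lines(xml_a), canon_lines(xml_b), lineterm="")
    )
```

An empty list means the two documents are the same after canonicalization; anything else is a conventional unified diff of the canonical forms.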
