how to detect that one XML is a subset of another one?

P

Phlip

XMLers:

How, using XSL or some similar generic tool, can we detect that this
XML...

<b> <c d='e' /> </b>

....is a subset of this one?

<ig> <b> <nore> <c d='e' ig='nore'> ig </c> </nore> </b> </ig>

Note we detect success when c is anywhere inside b, not just its
immediate child, and c contains at least one attribute d='e'.

Historically, the best attack I could figure out required recursing
thru the objects in the reference XML, and for each node creating an
XPath specifying each node contained certain arguments, and had a
descendant:: which specified the next inner node.

For extra credit, this item...

<b> <c/> <d/> </b>

....would NOT match this...

<b> <d/> <c/> </b>

....because the nodes' document order matters.

So what's the best algorithm here?
 
P

Peter Flynn

Phlip said:
XMLers:

How, using XSL or some similar generic tool, can we detect that this
XML...

<b> <c d='e' /> </b>

...is a subset of this one?

<ig> <b> <nore> <c d='e' ig='nore'> ig </c> </nore> </b> </ig>

Note we detect success when c is anywhere inside b, not just its
immediate child, and c contains at least one attribute d='e'.

XPath: ig/b[descendant::c[@d='e']]

eg
<xsl:template match="ig">
<xsl:if test="b[descendant::c[@d='e']]">
...something...
</xsl:if>
For extra credit, this item...

<b> <c/> <d/> </b>

...would NOT match this...

<b> <d/> <c/> </b>

...because the nodes' document order matters.

<xsl:template match="b">
<xsl:if test="c/following-sibling::d">
...something...
</xsl:if>
</xsl:template>

or, if there mustn't be anything between c and d,

<xsl:template match="b">
<xsl:if test="c/following-sibling::*[1][name()='d']">
...something...
</xsl:if>
</xsl:template>

///Peter
 
P

Phlip

Peter Flynn wrote:

Txbut!
XPath: =//b[descendant::c[@d='e']]

eg
<xsl:template match="ig">
   <xsl:if test="b[descendant::c[@d='e']]">

Now make it generic, such that /any/ reference XML goes into a
Generator G, and then G matches the sample XML as if it used such
XPaths.

I can write G; I have written one in Ruby and one in Python so far. I
am asking, if XML is a real format and these XSL things are _almost_
real languages, am I overlooking some clever meta-pattern in XSL, so I
can write G in the system that was intended for it? (And I don't mean
G produces a stack of XPaths; I mean it behaves like them.)

Another way to state the problem: How to 'diff' two XMLs and return
only the nodes that differ. If only the sample contains diffs (the
equivalent of many > ticks and no < ticks on a command-line diff of a
normal text file), then the reference XML is a subset of the sample.
So how to write the diff? (_Without_ pretty-printing the XML and
running a command-line diff - that wouldn't detect nesting correctly!)

Apologies for the confusing question. The purpose of G is to write
this assertion:

self.assert_xml(sample, lambda x:
x.b(
x.c(d='e'),
x.f() # must be after c
)
)

The assertion is easy to write, it skips irrelevant intermediate
nodes, it skips irrelevant attributes, it detects nesting correctly,
and it detects relative document order. This would make _unit_ tests
on generated HTML absurdly easy to write! User-developers can then
change the HTML art and layout freely, without damaging the tests that
detect where business logic injects the correct values.
 
J

Joe Kesselman

Phlip said:
Another way to state the problem: How to 'diff' two XMLs and return
only the nodes that differ.

Not hard to do if you do it in a proper programming language. IBM's
Alphaworks site had a particularly nice XML diff/merge for a while, but
it relied upon a nonstandard feature of IBM's original DOM. It could be
recreated, though, and similar tools do still exist from other sources.

--
Joe Kesselman,
http://www.love-song-productions.com/people/keshlam/index.html

{} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
/\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
 
J

Joe Kesselman

Re doing it in XSLT... XSLT is oriented toward tasks where there is only
one "current node" at a time, and is a nonprocedural language. You
could make a diff, but doing so would _not_ be easy.

I'd suggest coding directly against DOM, or against a custom data model.
XSLT/XQuery aren't intended to be the best answer for every possible
task; they're just good "swiss army knives" which handle most of the
most common requirements.
 
P

Phlip

Not hard to do if you do it in a proper programming language. IBM's
Alphaworks site had a particularly nice XML diff/merge for a while, but
it relied upon a nonstandard feature of IBM's original DOM. It could be
recreated, though, and similar tools do still exist from other sources.

Here y'all go, in a proper programming language!

The complete code, in Python via lxml, is below my sig. The highlights
are...

for node in doc.xpath('//*'):
nodes = [self._node_to_predicate(a) for a in
node.xpath('ancestor-or-self::*')]
path = '//' + '/descendant::'.join(nodes)
node = self.assert_xml(sample, path, **kw)
location = len(node.xpath('preceding::*'))
self.assertTrue(doc_order <= location, 'Node out of order!
' + path)
doc_order = location

That converts each node into an XPath representing its
characteristics, complete with sufficient predicates and descendant::
axes to help the XPath skip over irrelevancies.

The XPaths look like these:

//XMLPayRequest
//XMLPayRequest/descendant::RequestData
//XMLPayRequest/descendant::RequestData/
descendant::Vendor[ contains(text(), "LOGIN") ]
//XMLPayRequest/descendant::RequestData/
descendant::partner[ contains(text(), "PayPal") ]
//XMLPayRequest/descendant::RequestData/descendant::Transactions
....

The trick with the preceding::* axis seems to count the nodes between
the head and the current one, so comparing their counts seems to
enforce document order.

For extra credit, someone could think of a clever, recursive way to
cram all that into one humongous XPath. And someone could also find
away to replace all the "%s" strings with replacement tokens, so
embedded " marks cause no trouble.

This method trivially builds each predicate:

def _node_to_predicate(self, node):
path = node.tag

for key, value in node.attrib.items():
path += '[ contains(@%s, "%s") ]' % (key, value)

if node.text:
path += '[ contains(text(), "%s") ]' % node.text

return path

So I just need reassurance that I'm indeed at the industry's coal
face, here, and that nobody else has solved this in some humiliatingly
trivial way that I should use instead!

--
Phlip
http://zeekland.zeroplayer.com/Pigleg_Too/1

def _xml_to_tree(self, xml, forgiving=False):
from lxml import etree
self._xml = xml

if not isinstance(xml, basestring):
self._xml = str(xml)
return xml

if '<html' in xml[:200]:
parser = etree.HTMLParser(recover=forgiving)
return etree.HTML(str(xml), parser)
else:
parser = etree.XMLParser(recover=forgiving)
return etree.XML(str(xml))

def assert_xml(self, xml, xpath, **kw):
'Check that a given extent of XML or HTML contains a given
XPath, and return its first node'

if hasattr(xpath, '__call__'):
return self.assert_xml_tree(xml, xpath, **kw)

tree = self._xml_to_tree(xml, forgiving=kw.get('forgiving',
False))
nodes = tree.xpath(xpath)
self.assertTrue(len(nodes) > 0, xpath + ' should match ' +
self._xml)
node = nodes[0]
if kw.get('verbose', False): self.reveal_xml(node)
return node

def assert_xml_tree(self, sample, block, **kw):
from lxml.builder import ElementMaker
doc = block(ElementMaker())
doc_order = -1

for node in doc.xpath('//*'):
nodes = [self._node_to_predicate(a) for a in
node.xpath('ancestor-or-self::*')]
path = '//' + '/descendant::'.join(nodes)
node = self.assert_xml(sample, path, **kw)
location = len(node.xpath('preceding::*'))
self.assertTrue(doc_order <= location, 'Node out of order!
' + path)
doc_order = location

def _node_to_predicate(self, node):
path = node.tag

for key, value in node.attrib.items():
path += '[ contains(@%s, "%s") ]' % (key, value)

if node.text:
path += '[ contains(text(), "%s") ]' % node.text

return path
 
P

Peter Flynn

Phlip said:
Another way to state the problem: How to 'diff' two XMLs and return
only the nodes that differ. If only the sample contains diffs (the
equivalent of many > ticks and no < ticks on a command-line diff of a
normal text file), then the reference XML is a subset of the sample.
So how to write the diff? (_Without_ pretty-printing the XML and
running a command-line diff - that wouldn't detect nesting
correctly!)

It's actually not quite the same as your original question, but:

onsgmls -wxml xml.dec a.xml >a.esis
onsgmls -wxml xml.dec b.xml >b.esis
diff a.esis b.esis

///Peter
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,982
Messages
2,570,189
Members
46,735
Latest member
HikmatRamazanov

Latest Threads

Top