Not hard to do if you do it in a proper programming language. IBM's
Alphaworks site had a particularly nice XML diff/merge for a while, but
it relied upon a nonstandard feature of IBM's original DOM. It could be
recreated, though, and similar tools do still exist from other sources.
Here y'all go, in a proper programming language!
The complete code, in Python via lxml, is below my sig. The highlights
are...

    for node in doc.xpath('//*'):
        nodes = [self._node_to_predicate(a)
                 for a in node.xpath('ancestor-or-self::*')]
        path = '//' + '/descendant::'.join(nodes)
        node = self.assert_xml(sample, path, **kw)
        location = len(node.xpath('preceding::*'))
        self.assertTrue(doc_order <= location, 'Node out of order! ' + path)
        doc_order = location

That converts each node into an XPath representing its
characteristics, complete with sufficient predicates and descendant::
axes to help the XPath skip over irrelevancies.
The XPaths look like these:

    //XMLPayRequest
    //XMLPayRequest/descendant::RequestData
    //XMLPayRequest/descendant::RequestData/descendant::Vendor[ contains(text(), "LOGIN") ]
    //XMLPayRequest/descendant::RequestData/descendant::Partner[ contains(text(), "PayPal") ]
    //XMLPayRequest/descendant::RequestData/descendant::Transactions
    ....

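
To see where the '/descendant::' chains come from, here is a minimal sketch that rebuilds the tag-only part of those paths with the standard library's ElementTree instead of lxml. The helper name descendant_paths and the tiny sample document are made up for illustration, and the predicates are left out.

```python
import xml.etree.ElementTree as ET

def descendant_paths(root):
    """Return one '//a/descendant::b/...' path per element, in document order."""
    def walk(node, ancestors):
        # chain is the ancestor-or-self list of tags for this node
        chain = ancestors + [node.tag]
        yield '//' + '/descendant::'.join(chain)
        for child in node:
            yield from walk(child, chain)
    return list(walk(root, []))

# Hypothetical sample, shaped like the paths listed above
doc = ET.fromstring(
    '<XMLPayRequest><RequestData><Vendor>LOGIN</Vendor></RequestData></XMLPayRequest>')
paths = descendant_paths(doc)
for p in paths:
    print(p)
```

Because each step uses descendant:: rather than a plain child step, the sample may contain extra wrapper elements between the listed ones and the path should still match.
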
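The trick with the preceding::* axis is that it selects every element
that ends before the current one begins (ancestors are excluded), so
its count never shrinks as you move forward through a document.
Comparing each matched node's count with the previous one therefore
seems to enforce document order.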
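
A rough way to convince yourself: the sketch below emulates len(node.xpath('preceding::*')) using only the standard library, on the assumption that the count equals "elements opened so far, minus ancestors". The function name preceding_counts and the sample document are hypothetical. This is not how lxml computes it, just the same number arrived at by hand.

```python
import xml.etree.ElementTree as ET

def preceding_counts(root):
    """Return [(tag, count)] in document order, where count emulates
    len(node.xpath('preceding::*')) for each element."""
    results = []
    seen = 0  # elements visited so far, in document order

    def walk(node, depth):
        nonlocal seen
        # Everything seen so far either precedes this node or is one of
        # its ancestors; depth is exactly the number of ancestors.
        results.append((node.tag, seen - depth))
        seen += 1
        for child in node:
            walk(child, depth + 1)

    walk(root, 0)
    return results

doc = ET.fromstring('<a><b><c/></b><d/></a>')
counts = preceding_counts(doc)
print(counts)
```

Note that a parent and its first child get the same count, which is presumably why the assertion uses <= rather than a strict <.
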
For extra credit, someone could think of a clever, recursive way to
cram all that into one humongous XPath. And someone could also find
a way to replace all the "%s" strings with replacement tokens, so
embedded " marks cause no trouble.
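
One possible take on the quoting half of that, as a sketch only: XPath 1.0 string literals have no escape character, so a value containing both kinds of quote has to be stitched together with concat(). The helper name xpath_literal is made up here.

```python
def xpath_literal(s):
    """Quote s as an XPath 1.0 string literal (hypothetical helper)."""
    if '"' not in s:
        return '"%s"' % s
    if "'" not in s:
        return "'%s'" % s
    # Both quote kinds present: split on " and glue the pieces back
    # together with concat(), supplying each " as its own '"' literal.
    parts = s.split('"')
    return 'concat(' + ', \'"\', '.join('"%s"' % p for p in parts) + ')'

print(xpath_literal('PayPal'))
print(xpath_literal('say "hi"'))
print(xpath_literal('a"b\'c'))
```

The predicates would then use something like '[ contains(text(), %s) ]' % xpath_literal(node.text), with no quotes around the %s. lxml's xpath() also accepts XPath variables as keyword arguments ($name in the expression), which might be the cleaner route, though the generated paths here would need a fresh variable name per predicate.
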
This method trivially builds each predicate:

    def _node_to_predicate(self, node):
        path = node.tag
        for key, value in node.attrib.items():
            path += '[ contains(@%s, "%s") ]' % (key, value)
        if node.text:
            path += '[ contains(text(), "%s") ]' % node.text
        return path

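
For a quick feel of what comes out, here is the same logic as a free function run against two throwaway elements. It uses the standard library's ElementTree (whose .tag, .attrib and .text behave the same way for this purpose), and the id attribute is invented for the example.

```python
import xml.etree.ElementTree as ET

def node_to_predicate(node):
    # Same steps as _node_to_predicate above, minus self
    path = node.tag
    for key, value in node.attrib.items():
        path += '[ contains(@%s, "%s") ]' % (key, value)
    if node.text:
        path += '[ contains(text(), "%s") ]' % node.text
    return path

# Hypothetical elements: one with an attribute and text, one empty
vendor = ET.fromstring('<Vendor id="7">LOGIN</Vendor>')
empty = ET.fromstring('<Transactions/>')

vendor_predicate = node_to_predicate(vendor)
empty_predicate = node_to_predicate(empty)
print(vendor_predicate)
print(empty_predicate)
```

An element with no attributes and no text collapses to its bare tag, which is why some of the generated XPaths carry no predicate at all.
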
So I just need reassurance that I'm indeed at the industry's coal
face, here, and that nobody else has solved this in some humiliatingly
trivial way that I should use instead!
--
Phlip
http://zeekland.zeroplayer.com/Pigleg_Too/1
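
    def _xml_to_tree(self, xml, forgiving=False):
        from lxml import etree
        self._xml = xml

        # Anything that is not a string is assumed to be an already-parsed
        # tree, so pass it through (str here was basestring under Python 2)
        if not isinstance(xml, str):
            self._xml = str(xml)
            return xml

        if '<html' in xml[:200]:
            parser = etree.HTMLParser(recover=forgiving)
            return etree.HTML(xml, parser)
        else:
            # Pass the parser in, or recover=forgiving is silently ignored
            parser = etree.XMLParser(recover=forgiving)
            return etree.XML(xml, parser)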

    def assert_xml(self, xml, xpath, **kw):
        '''Check that a given extent of XML or HTML contains a given
        XPath, and return its first node'''
        # A callable in place of an XPath string means "build a reference
        # tree and match every node of it"
        if callable(xpath):
            return self.assert_xml_tree(xml, xpath, **kw)

        tree = self._xml_to_tree(xml, forgiving=kw.get('forgiving', False))
        nodes = tree.xpath(xpath)
        self.assertTrue(len(nodes) > 0, xpath + ' should match ' + self._xml)
        node = nodes[0]
        if kw.get('verbose', False):
            self.reveal_xml(node)  # reveal_xml is not shown in this post
        return node

    def assert_xml_tree(self, sample, block, **kw):
        from lxml.builder import ElementMaker
        doc = block(ElementMaker())
        doc_order = -1

        for node in doc.xpath('//*'):
            nodes = [self._node_to_predicate(a)
                     for a in node.xpath('ancestor-or-self::*')]
            path = '//' + '/descendant::'.join(nodes)
            node = self.assert_xml(sample, path, **kw)
            location = len(node.xpath('preceding::*'))
            self.assertTrue(doc_order <= location, 'Node out of order! ' + path)
            doc_order = location

    def _node_to_predicate(self, node):
        path = node.tag
        for key, value in node.attrib.items():
            path += '[ contains(@%s, "%s") ]' % (key, value)
        if node.text:
            path += '[ contains(text(), "%s") ]' % node.text
        return path