XML equality


onetitfemme

Hi *,

I have been looking for a definition or at least some workable concept
of "XML equality".

Searching for "XML equality" in comp.text.xml, microsoft.public.xsl and
microsoft.public.xml resulted in no hits.

Searching the same newsgroups for: XML equality schema (as single words)
gave very few links, none of them to the point.

I have read that the commercial "XMLBooster" now
addresses these issues by generating code to:
- Check for equality among XML instances
- Compute the distance between two XML instances
- Compute the minimal set of changes required to go from one instance
to another, similar in spirit to what the diff Unix command does for
text files.

But it is hard to tell what exactly they mean by "equality among
XML instances" and "distance between two XML instances". I spent some
time at their web site and I think they are just using sales pitches. I
couldn't find any docs specifying or at least clarifying their
claims/terminology.

I know XML is basically (structured) text, and there are no such
definitions for texts or natural-language grammars (where usefulness and
validity are really semantic rather than syntactic matters).

Do you know of works dealing with the definition of such terms?

Thanks
otf
 

onetitfemme

// - - - - - - - - - - - - - - - - - - - -
> Look for "xml diff" instead...

mgungora, this is how I started. Search comp.text.xml for "OSS,
java-based XML Diff?"

I could not find much either; as a matter of fact, no one replied to me.

// - - - - - - - - - - - - - - - - - - - -
> A natural definition would use the infoset. Norm Walsh has a
> definition:

Richard, thank you for pointing me to Norman Walsh's article:

// __
Infoset Equality
19 May 2004 (modified 11 Sep 2005)
Volume 7, Issue 86
by Norman Walsh

http://norman.walsh.name/2004/05/19/infoset-equal
// __

His approach, which treats the concept from the perspective of infosets
(http://www.w3.org/TR/xml-infoset/), is definitely a good start, but
there are a number of issues that I see right away just by looking at
his definitions. For example:

// __ in def. 2:
2. Element Information Items

Two element information items are equal if the following properties
are equal:

- [namespace name]
- [local name]
- [children]
- [attributes]

Children are compared in order, attributes without respect to order.
// __
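
A minimal sketch of the element rule quoted above, assuming Python's standard xml.etree.ElementTree as the parser. ElementTree is not a full infoset implementation: it folds namespace name and local name into one tag string of the form {uri}local, and its default parser drops comments and processing instructions. The function name elements_equal is made up for illustration, and comparing text and tail verbatim is an assumption of this sketch, not something taken from the article.

```python
import xml.etree.ElementTree as ET


def elements_equal(a: ET.Element, b: ET.Element) -> bool:
    """Compare two elements along the lines of the rule quoted above."""
    # ElementTree stores "{namespace}local" in .tag, so one comparison
    # covers both [namespace name] and [local name].
    if a.tag != b.tag:
        return False
    # [attributes]: .attrib is a dict, so equality ignores attribute order.
    if a.attrib != b.attrib:
        return False
    # Character data inside the element and directly after it
    # (assumption: compared verbatim, whitespace included).
    if (a.text or "") != (b.text or "") or (a.tail or "") != (b.tail or ""):
        return False
    # [children]: compared pairwise, in document order.
    if len(a) != len(b):
        return False
    return all(elements_equal(x, y) for x, y in zip(a, b))


# Same attributes in a different order: expected to compare equal.
same = elements_equal(
    ET.fromstring('<n a="1" b="2"><c/><d/></n>'),
    ET.fromstring('<n b="2" a="1"><c/><d/></n>'),
)
# Same children in a different order: expected to compare unequal.
swapped = elements_equal(
    ET.fromstring("<n><c/><d/></n>"),
    ET.fromstring("<n><d/><c/></n>"),
)
print(same, swapped)
```

Because the comparison starts at the root and recurses, two elements are only ever compared when they sit at the same position in their respective trees.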
._ I would also include the path to the element, just the path, NOT
the content of all elements in the path (unless he understands it as
part of the "[namespace name]"). To me, it is very natural to include
the path to an element and I wonder why it escaped his consideration.
._ also, to even compare documents (and/or document sections) they should
first have structural and type affinity on their schemas, at least on
the sections that are being compared,
._ the order of elements of similar children from the same path should
not really matter (this can be easily/practically solved by sorting
them all). These two sections of XML "instances" should be equal:

<node4>
<children>younger child: Paul</children>
<children>older child: Mary</children>
</node4>

and

<node4>
<children>older child: Mary</children>
<children>younger child: Paul</children>
</node4>

._ if an attribute is not mandatory, should these two sections be the
same?

<node4>
<children>older child: Mary</children>
<children>younger child: Paul</children>
</node4>

and

<node4>
<children adopted="true">older child: Mary</children>
<children>younger child: Paul</children>
</node4>
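
The "sort them all" idea can be sketched as follows, assuming Python's xml.etree.ElementTree and assuming the fragments are written with proper </children> end tags so that they parse. The helper name canonical_key is hypothetical. Each element is reduced to a nested tuple in which the keys of siblings are sorted, so sibling order no longer matters. Attributes are still compared, so the adopted="true" fragment comes out as different. Stripping whitespace around text is a further assumption of this sketch.

```python
import xml.etree.ElementTree as ET


def canonical_key(elem: ET.Element) -> tuple:
    """Reduce an element to a nested tuple in which sibling order is ignored."""
    # Sorting the children's keys is what makes sibling order irrelevant.
    children = sorted(canonical_key(child) for child in elem)
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),
        (elem.text or "").strip(),  # tail text is ignored in this sketch
        tuple(children),
    )


paul_first = ET.fromstring(
    "<node4>"
    "<children>younger child: Paul</children>"
    "<children>older child: Mary</children>"
    "</node4>"
)
mary_first = ET.fromstring(
    "<node4>"
    "<children>older child: Mary</children>"
    "<children>younger child: Paul</children>"
    "</node4>"
)
adopted = ET.fromstring(
    "<node4>"
    '<children adopted="true">older child: Mary</children>'
    "<children>younger child: Paul</children>"
    "</node4>"
)

order_ignored = canonical_key(paul_first) == canonical_key(mary_first)
attribute_ignored = canonical_key(mary_first) == canonical_key(adopted)
print(order_ignored, attribute_ignored)
```

Ignoring tail text is only reasonable for data-style documents without mixed content; for prose-like documents, reordering siblings would change the text itself.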

Also, it would be obvious that you should exclude comments while
comparing XML documents, but why ignore processing instructions, when they
give important type and reference info that defines the included data?
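
One way to try out this distinction, assuming Python 3.8 or later, where xml.etree.ElementTree.canonicalize implements C14N 2.0: by default it drops comments, keeps processing instructions, and writes attributes in a fixed order. Comparing canonical strings is a serialization-level test rather than the infoset definition from the article, and the sample documents and the "render" PI target are invented for illustration.

```python
from xml.etree.ElementTree import canonicalize

plain = '<doc><item b="2" a="1">x</item></doc>'
commented = '<doc><!-- note --><item a="1" b="2">x</item></doc>'
with_pi = '<doc><?render mode="fast"?><item a="1" b="2">x</item></doc>'

# Comments are dropped by default (with_comments=False) and attributes
# are written in a fixed order, so these two canonical forms match.
comment_ignored = canonicalize(plain) == canonicalize(commented)

# The processing instruction survives canonicalization, so these two
# documents compare as different.
pi_ignored = canonicalize(plain) == canonicalize(with_pi)

print(comment_ignored, pi_ignored)
```

Passing with_comments=True to canonicalize would make the comment count as a difference as well.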

Thanks
otf
 

Richard Tobin

> ._ I would also include the path to the element, just the path, NOT
> the content of all elements in the path

I don't understand why you would do that. If the elements don't have
the same path from the root, you wouldn't be comparing them at all.

Unless you are considering comparison of fragments of documents, in
which case you probably don't care about the position in the document.

> ._ also, to even compare documents (and/or document sections) they should
> first have structural and type affinity on their schemas, at least on
> the sections that are being compared,

XML documents aren't required to have any kind of schema. This would
be equality on documents+schemas, not documents.

> ._ the order of elements of similar children from the same path should
> not really matter (this can be easily/practically solved by sorting
> them all).

This requires knowledge of the interpretation of the document that is not
inherent in the document itself. Given some kind of schema, it might be
appropriate to interpret the children as a set rather than a sequence,
but in that case you are again not comparing documents themselves, but
the data models resulting from application of a schema to the documents.

> ._ if an attribute is not mandatory, should these two sections be the
> same?

As XML documents, they would be different. According to some
interpretation, they might be the same. Optional attributes
are not always interpreted as supplying optional information: their
absence may be as significant as their presence.

> Also, it would be obvious that you should exclude comments while
> comparing XML documents, but why ignore processing instructions, when they
> give important type and reference info that defines the included data?

Processing instructions are used for many different purposes. But their
obvious canonical use is to specify the processing of (part of) the
document rather than its content.

-- Richard
 

onetitfemme

Richard Tobin wrote ...

> I don't understand why you would do that. If the elements don't have
> the same path from the root, you wouldn't be comparing them at all.

"If the elements don't have the same path from the root, you
wouldn't be comparing them at all"
otf: exactly! Here I might be a little biased, and/or some intuition
artifacts might be kicking in. We theoretical physicists
"naturally" think this way. You may go LOL, but to us, if more
people board a train, it might still reach its destination, but the
trajectory will definitely not be the same ;-)
Jokes aside, to me (in an ontology, i.e. a well-structured, hierarchical,
tree-like dependency) the path to an element is as important as the
element itself.

> Unless you are considering comparison of fragments of documents, in
> which case you probably don't care about the position in the document.

"fragments of documents"
otf: I am considering them, but I still care about the position in the
document.

> XML documents aren't required to have any kind of schema. This would
> be equality on documents+schemas, not documents.

"equality on documents+schemas, not documents."
otf: exactly! "structural and type affinity on their schemas ..."
should be very important to even consider any type of comparison.

> This requires knowledge of the interpretation of the document that is not
> inherent in the document itself. Given some kind of schema, it might be
> appropriate to interpret the children as a set rather than a sequence,
> but in that case you are again not comparing documents themselves, but
> the data models resulting from application of a schema to the documents.

otf: granted! But why would you not interpret the children as a set if
the schema does not explicitly say otherwise?
Also, I would say the data models result from the COMPLIANCE of documents
with a schema, which is what makes them actionable data for an XML
application.

> As XML documents, they would be different. According to some
> interpretation, they might be the same. Optional attributes
> are not always interpreted as supplying optional information: their
> absence may be as significant as their presence.

otf: OK. I think I have started to see that there might not be such a
thing as "XML equality" (as you have e.g. for mathematical
magnitudes), but degrees thereof.

> Processing instructions are used for many different purposes. But their
> obvious canonical use is to specify the processing of (part of) the
> document rather than its content.

// - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
I am thinking of tons of web pages (and/or any other marked-up documents)
as a huge forest of texts where "links" among them are given not only
through URLs, but through their structure as well.
I understood something from your comments, but when you talked about the
"position in the document" (of an element) I think I am missing
something. Even the path to an element might not be enough for an
accurate description of "equality"; since "degrees thereof"
might be important as well, even the closed graphs up to the point where
an element sits should be considered.

Thanks
otf
 

onetitfemme

Just found a really good article that answers my XML diffing questions to
a large extent:

http://www.mulberrytech.com/Extreme/Proceedings/html/2005/Schaffert01/EML2005Schaffert01.html

Structure-Preserving Difference Search for XML Documents
by E. Schubert, S. Schaffert, and F. Bry
abstract:
Current XML differencing applications usually try to find a minimal
sequence of edit operations that transform one XML document to another
XML document (the so-called "edit script"). In our conviction, this
approach often produces increments that are unintuitive for human
readers and do not reflect the actual changes. We therefore propose in
this article a different approach trying to maximize the retained
structure instead of minimizing the edit sequence. Structure is thereby
not limited to the usual tree structure of XML - any kind of structural
relations can be considered (like parent-child, ancestor-descendant,
sibling, document order). In our opinion, this approach is very
flexible and able to adapt to the user's requirements. It produces more
readable results while still retaining a reasonably small edit
sequence.
Keywords: Web; XML; Difference
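
To make the "edit script" notion from the abstract concrete, here is a rough sketch that is explicitly not the paper's algorithm. Each document is flattened into a sequence of (path, text) tokens, and Python's difflib.SequenceMatcher derives edit operations between the two sequences. difflib does not guarantee a minimal script, and a flat sequence diff is only a stand-in for a real tree diff; the flatten helper and the sample documents are invented for illustration.

```python
import difflib
import xml.etree.ElementTree as ET


def flatten(elem: ET.Element, path: str = "") -> list:
    """List (path, text) pairs for an element and its descendants, in document order."""
    here = f"{path}/{elem.tag}"
    tokens = [(here, (elem.text or "").strip())]
    for child in elem:
        tokens.extend(flatten(child, here))
    return tokens


old = ET.fromstring("<list><item>a</item><item>b</item><item>c</item></list>")
new = ET.fromstring("<list><item>a</item><item>c</item><item>d</item></list>")

old_tokens, new_tokens = flatten(old), flatten(new)
matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)

# Keep only the operations that actually change something.
edit_script = [op for op in matcher.get_opcodes() if op[0] != "equal"]
for tag, i1, i2, j1, j2 in edit_script:
    print(tag, old_tokens[i1:i2], new_tokens[j1:j2])

# A crude similarity in [0, 1]; 1.0 means identical token sequences.
similarity = matcher.ratio()
print(round(similarity, 2))
```

A similarity score of this kind is one possible reading of "distance between two XML instances", though how a given product defines that term may differ.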
 
