XML and namespaces

  • Thread starter Wilfredo Sánchez Vega

uche.ogbuji

Wilfredo Sánchez Vega:
"""
I'm having some issues around namespace handling with XML:
'<?xml version="1.0" ?>\n<href/>'

Note that the namespace wasn't emitted. If I have PyXML,
xml.dom.ext.Print does emit the namespace:
<?xml version='1.0' encoding='UTF-8'?><href xmlns='DAV:'/>

Is that a limitation in toxml(), or is there an option to make it
include namespaces?
"""

Getting back to the OP:

PyXML's xml.dom.ext.Print does get things right, and based on
discussion in this thread, the only way you can serialize correctly is
to use that add-on with minidom, or to use a third-party, properly
Namespaces-aware tool such as 4Suite (there are others as well).

Good luck.
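
For anyone following along, the behaviour in the original question can be reproduced in a few lines. This is a minimal sketch, assuming Python 3 and the standard-library xml.dom.minidom; the original poster's exact code is not shown in the thread, so the calls below are a reconstruction rather than his script.

```python
from xml.dom import minidom

# Build a document whose root element is "href" in the "DAV:" namespace.
impl = minidom.getDOMImplementation()
doc = impl.createDocument("DAV:", "href", None)
root = doc.documentElement

# The node itself knows which namespace it belongs to...
print(repr(root.namespaceURI))

# ...but toxml() writes only the attributes that exist on the node,
# and nothing here created an xmlns attribute.
serialised = doc.toxml()
print(serialised)
```

The namespace is stored on the node, yet it does not appear in the serialised text because no declaration attribute was ever added.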
 
Alan Kennedy

[Fredrik Lundh]
> my point was that (unless I'm missing something here), there are at
> least two widely used implementations (libxml2 and the 4DOM domlette
> stuff) that don't interpret the spec in this way.

Libxml2dom is of alpha quality, according to its CheeseShop page anyway.

http://cheeseshop.python.org/pypi/libxml2dom/0.2.4

This can be seen in its incorrect serialisation of the following valid DOM.

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
document = libxml2dom.createDocument(None, "doc", None)
top = document.xpath("*")[0]
elem1 = document.createElementNS("DAV:", "myns:href")
elem1.setAttributeNS(xml.dom.XMLNS_NAMESPACE, "xmlns:myns", "DAV:")
document.replaceChild(elem1, top)
print document.toString()
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Which produces

"""
<?xml version="1.0"?>
<myns:href
xmlns:myns="DAV:"
xmlns:xmlns="http://www.w3.org/2000/xmlns/"
xmlns:myns="DAV:"
/>
"""

Which is not even well-formed XML (duplicate attributes), let alone
namespace well-formed. Note also the invalid xml namespace "xmlns:xmlns"
attribute. So I don't accept that libxml2dom's behaviour is definitive
in this case.
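
The well-formedness claim is easy to check with any conforming parser. A minimal sketch, assuming Python 3 and the expat-based parser behind xml.dom.minidom; the string is the output quoted above joined onto one line.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# The serialisation reported above, joined onto one line.
broken = (
    '<?xml version="1.0"?>'
    '<myns:href xmlns:myns="DAV:" '
    'xmlns:xmlns="http://www.w3.org/2000/xmlns/" '
    'xmlns:myns="DAV:"/>'
)

try:
    minidom.parseString(broken)
    parsed_ok = True
except ExpatError as exc:
    # The parser refuses the document instead of building a DOM.
    parsed_ok = False
    print("rejected:", exc)
```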

The other DOM you refer to, the 4DOM stuff, was written by a participant
in this discussion.

Will you accept Apache Xerces 2 for Java as a widely used DOM
Implementation? I guarantee that it is far more widely used than either
of the DOMs mentioned.

Download Xerces 2 (I am using Xerces 2.7.1), and run the following code
under jython:-

http://www.apache.org/dist/xml/xerces-j/

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
#
# This is a simple adaptation of the DOMGenerate.java
# sample from the Xerces 2.7.1 distribution.
#
from javax.xml.parsers import DocumentBuilder, DocumentBuilderFactory
from org.apache.xml.serialize import OutputFormat, XMLSerializer
from java.io import StringWriter

def create_document():
    dbf = DocumentBuilderFactory.newInstance()
    db = dbf.newDocumentBuilder()
    return db.newDocument()

def serialise(doc):
    format = OutputFormat( doc )
    outbuf = StringWriter()
    serial = XMLSerializer( outbuf, format )
    serial.asDOMSerializer()
    serial.serialize(doc.getDocumentElement())
    return outbuf.toString()

doc = create_document()
root = doc.createElementNS("DAV:", "href")
doc.appendChild( root )
print serialise(doc)
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Which produces

"""
<?xml version="1.0" encoding="UTF-8"?>
<href/>
"""

As I expected it would.
 
Alan Kennedy

[[email protected]]
> You're the one who doesn't seem to clearly understand XML namespaces.
> It's your position that is bewildering, not XML namespaces (well, they
> are confusing, but I have a good handle on all the nuances by now).

So you keep claiming, but I have yet to see the evidence.

> Again, no skin off my back here: I write and use tools that are XML
> namespaces compliant. It doesn't hurt me that Minidom is not. I was
> hoping to help, but again I don't have time for this argument.

If you make statements such as "you're wrong on this ....", "you
misunderstand ....", "you're guessing .....", etc, then you should be
prepared to back them up, not state them and then say "but I'm too busy
and/or important to discuss it with you".

Perhaps you should think twice before making such statements in the future.
 
Fredrik Lundh

Alan said:
> [Fredrik Lundh]
> > my point was that (unless I'm missing something here), there are at
> > least two widely used implementations (libxml2 and the 4DOM domlette
> > stuff) that don't interpret the spec in this way.
>
> Libxml2dom is of alpha quality, according to its CheeseShop page anyway.
>
> http://cheeseshop.python.org/pypi/libxml2dom/0.2.4

but isn't libxml2dom just a binding for libxml2? as I mention above, I had libxml2
in mind when I wrote "widely used", not the libxml2dom binding itself.

> Will you accept Apache Xerces 2 for Java as a widely used DOM
> Implementation?

sure.

but libxml2 is also widely used, so we have at least two ways to interpret the spec.
the de facto interpretation of the spec seems to be that namespace handling during
serialization is "undefined"...

(is there perhaps a DOM library that starts "hack" or "rogue" when you use
namespaces ? ;-)

</F>
 
Alan Kennedy

[Fredrik Lundh]
but isn't libxml2dom just a binding for libxml2? as I mention above, I had libxml2
in mind when I wrote "widely used", not the libxml2dom binding itself.

No, libxml2dom is Paul Boddie's DOM API compatibility layer on top of
the cpython bindings for libxml2. From the CheeseShop page

"""
The libxml2dom package provides a traditional DOM wrapper around the
Python bindings for libxml2. In contrast to the libxml2 bindings,
libxml2dom provides an API reminiscent of minidom, pxdom and other
Python-based and Python-related XML toolkits.
"""

http://cheeseshop.python.org/pypi/libxml2dom

[Alan Kennedy]
[Fredrik Lundh]
sure.

but libxml2 is also widely used, so we have at least two ways to interpret the spec.

Don't confuse libxml2dom with libxml2.

As I showed with a code snippet in a previous message, libxml2dom has
significant defects in relation to serialisation of namespaced
documents, whereby the serialised documents it produces aren't even
well-formed xml.

Perhaps you can show a code snippet in libxml2 that illustrates the
behaviour you describe?
 
Paul Boddie

Alan said:
Libxml2dom is of alpha quality, according to its CheeseShop page anyway.

Given that I gave it that classification, let me explain that its alpha
status is primarily justified by the fact that it doesn't attempt to
cover the entire DOM API. As I mentioned in my original contribution to
this thread, the serialisation is done by libxml2 itself - arguably a
wise choice given the abysmal performance of many Python DOM
implementations when serialising documents.

I'll look into namespace-setting issues in the libxml2 API, but I
imagine that the serialisation mechanisms control much of what you're
seeing, and it's quite possible that they can be configured to perform
in whichever way is desirable.

Paul
 
Paul Boddie

Alan said:
Don't confuse libxml2dom with libxml2.

Well, quite, but perhaps you can explain what I'm doing wrong with this
low-level version of the previously specified code:

import libxml2mod
document = libxml2mod.xmlNewDoc(None)
element = libxml2mod.xmlNewChild(document, None, "href", None)
print libxml2mod.serializeNode(document, None, 1)

This prints the following:

<?xml version="1.0"?>
<href/>

Extending the above code...

ns = libxml2mod.xmlNewNs(element, "DAV:", None)
print libxml2mod.serializeNode(document, None, 1)

This prints the following:

<?xml version="1.0"?>
<href xmlns="DAV:"/>

Note that libxml2mod is as close to the libxml2 C API as you can get in
Python. As far as I can tell, by using that module, you're effectively
driving the C API almost directly. Note also that libxml2mod is nothing
to do with what I've written myself - I'm just using it here, just as
libxml2dom does.

Now, in the first part of the code, we didn't specify a namespace on
the element at all, but in the second part we chose to set a namespace
on the element with a null prefix. As you can see, we get the xmlns
attribute as soon as the namespace is introduced. It is difficult to
say whether this usage of the API is correct or not, judging from the
Web site's material [1], so I'd be happy if someone could point out
improvements or corrections.

Paul

[1] http://xmlsoft.org/
 
Fredrik Lundh

Alan said:
> [Fredrik Lundh]
> > but isn't libxml2dom just a binding for libxml2? as I mention above, I had libxml2
> > in mind when I wrote "widely used", not the libxml2dom binding itself.
>
> No, libxml2dom is Paul Boddie's DOM API compatibility layer on top of
> the cpython bindings for libxml2.

So a binding that just passes things through to another binding is not
a binding? Alright, let's call it a compatibility layer then.

> Don't confuse libxml2dom with libxml2.

As Paul has said several times, libxml2dom is just a thin API compatibility
layer on top of libxml2. It's libxml2 that does all the work, and the libxml2
authors claim that libxml2 implements the DOM level 2 document model,
but with a different API.

Maybe they're wrong, but wasn't the whole point of this subthread that
different developers have interpreted the specification in different ways ?

</F>
 
Alan Kennedy

[Fredrik Lundh]
> It's libxml2 that does all the work, and the libxml2
> authors claim that libxml2 implements the DOM level 2 document model,
> but with a different API.

That statement is meaningless.

The DOM is *only* an API, i.e. an interface. The opening statement on
the W3C DOM page is

"""
What is the Document Object Model?

The Document Object Model is a platform- and language-neutral interface
that will allow programs and scripts to dynamically access and update
the content, structure and style of documents.
"""

http://www.w3.org/DOM/

The interfaces that make up the different levels of the DOM are
described in CORBA IDL - Interface Definition Language.

DOM Implementations are free to implement the methods and properties of
the IDL interfaces as they see fit. Some implementations might maintain
an object model, with separate objects for each node in the tree,
several string variables associated with each node, i.e. node name,
namespace, etc. But they could just as easily store those data in
tables, indexed by some node id. (As an aside, the non-DOM-compatible
Xalan Table Model does exactly that:
http://xml.apache.org/xalan-j/dtm.html).

So when the libxml2 developers say (copied from http://www.xmlsoft.org/)

"""
To some extent libxml2 provides support for the following additional
specifications but doesn't claim to implement them completely:

* Document Object Model (DOM)
http://www.w3.org/TR/DOM-Level-2-Core/ the document model, but it
doesn't implement the API itself, gdome2 does this on top of libxml2
"""

They've completely missed the point: DOM is *only* the API.

> Maybe they're wrong, but wasn't the whole point of this subthread that
> different developers have interpreted the specification in different ways ?

What specification? Libxml2 implements none of the DOM specifications.
 
Alan Kennedy

[Alan Kennedy]
[Paul Boddie]
Well, quite, but perhaps you can explain what I'm doing wrong with this
low-level version of the previously specified code:

Well, if your purpose is to make a point about minidom and DOM standards
compliance in relation to serialisation of namespaces, then what you're
doing wrong is to use a library that bears no relationship to the DOM to
make your point.

Think about it this way: Say you decide to create a new XML document
using a non-DOM library, such as the excellent ElementTree.

So you make a series of ElementTree-API-specific calls to create the
document, the elements, attributes, namespaces, etc, and then serialise
the whole thing.

And the end result is that you end up with a document that looks like this

"""
<?xml version="1.0" encoding="utf-8"?>
<href xmlns="DAV:"/>
"""

It is not possible to use that ElementTree code to make inferences on
how minidom should behave, because the syntax and semantics of the
minidom API calls and the ElementTree API calls are different.

Minidom is constrained to implement the precise semantics of the DOM
APIs, because it claims standards compliance.

ElementTree is free to do whatever it likes, e.g. be pythonic, because
it has no standard to conform to: it is designed solely according to the
experience and intuition of its author, who is free to change it at any
stage if he feels like it.
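
To make the difference in API semantics concrete, here is a minimal sketch assuming Python 3 and the standard-library xml.etree.ElementTree. The generated prefix is whatever the serialiser picks, so the text differs from the hand-written output above while carrying the same namespace information.

```python
import xml.etree.ElementTree as ET

# In ElementTree the namespace is part of the tag name itself,
# written in "{uri}local" form; there is no separate declaration step.
elem = ET.Element("{DAV:}href")

# The serialiser generates a prefix and its declaration on output.
text = ET.tostring(elem, encoding="unicode")
print(text)

# Parsing the result gives back the same namespaced tag.
roundtrip_tag = ET.fromstring(text).tag
print(roundtrip_tag)
```

Nothing in this API corresponds to a DOM xmlns attribute node, which is why its output says little about what a DOM Level 2 implementation should emit.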

s/ElementTree/libxml2/g

If I've completely missed your point and you were talking about something
else entirely, please forgive me. I'd be happy to help with any questions
if I can.
 
Paul Boddie

Paul said:
It is difficult to say whether this usage of the API is correct or not, judging from the
Web site's material

[...]

Some more on this: I found an example on the libxml2 mailing list
(searching for "xmlNewNs default namespace") which is similar to the
one I gave:

http://mail.gnome.org/archives/xml/2004-April/msg00282.html

Meanwhile, the usage of xmlNewNs seems to have some correlation with
the production of xmlns attributes (found in a search for "xmlns
default namespace"):

http://mail.gnome.org/archives/xml/2002-March/msg00111.html

And whilst gdome2 - the GNOME project's DOM wrapper for libxml2 - seems
to create unowned namespaces, adding them to the document as global
namespace declarations (looking at the code for gdome_xmlNewNs and
gdome_xml_doc_createElementNS respectively)...

http://cvs.gnome.org/viewcvs/gdome2/libgdome/gdomecore/gdome-xml-xmlutil.c?rev=1.18&view=markup
http://cvs.gnome.org/viewcvs/gdome2/libgdome/gdomecore/gdome-xml-document.c?rev=1.50&view=markup

...seemingly comparable operations with libxml2mod seem to be no longer
supported:
xmlNewGlobalNs() deprecated function reached

Given that I've recently unsubscribed from some pretty unproductive
mailing lists, perhaps I should make some enquiries on the libxml2
mailing list and possibly report back.

Paul
 
Paul Boddie

Alan said:
Well, if your purpose is to make a point about minidom and DOM standards
compliance in relation to serialisation of namespaces, then what you're
doing wrong is to use a library that bears no relationship to the DOM to
make your point.

Alright. I respectfully withdraw libxml2/libxml2dom as an example of a
DOM Level 2 compatible implementation. Since I only profess to support
"a PyXML-style DOM" in libxml2dom, the course I take in any amendments
to that package will follow whatever Uche decides to do with 4DOM and
PyXML. ;-) Whatever happens, I'll attempt to make it compatible with
qtxmldom in both its flavours (qtxml and KHTML).

As for the various issues with namespaces and the DOM, with memories of
slapping empty xmlns attributes strategically-but-desperately in XSL
processing pipelines to avoid invisible-but-still-present default
namespaces now thankfully receding into the incoherent past, the whole
business merely reinforces my impression of the various standards
committees as a group of corporate delegates meeting regularly to hold
a "measuring competition" amongst themselves.

Paul
 
Paul Boddie

Alan Kennedy wrote:

[Discussing the appearance of xmlns="DAV:"]
But that's incorrect. You have now defaulted the namespace to "DAV:" for
every unprefixed element that is a descendant of the href element.

[Code creating the no_ns element with namespaceURI set to None]
<?xml version="1.0"?>
<href xmlns="DAV:"><no_ns/></href>

I must admit that I was focusing on the first issue rather than this
one, even though it is related, when I responded before. Moreover,
libxml2dom really should respect the lack of a namespace on the no_ns
element, which the current version unfortunately doesn't do. However,
wouldn't the correct serialisation of the document be as follows?

<?xml version="1.0"?>
<href xmlns="DAV:"><no_ns xmlns=""/></href>

As for the first issue - the presence of the xmlns attribute in the
serialised document - I'd be interested to hear whether it is
considered acceptable to parse the serialised document and to find that
no non-null namespaceURI is set on the href element, given that such a
namespaceURI was set when the document was created. In other words, ...

document = libxml2dom.createDocument(None, "doc", None)
top = document.xpath("*")[0]
elem1 = document.createElementNS("DAV:", "href")
document.replaceChild(elem1, top)
elem2 = document.createElementNS(None, "no_ns")
document.xpath("*")[0].appendChild(elem2)
document.toFile(open("test_ns.xml", "wb"))

...as before, followed by this test:

document = libxml2dom.parse("test_ns.xml")
print "Namespace is", repr(document.xpath("*")[0].namespaceURI)

What should the "Namespace is" message produce?

Paul
 
Alan Kennedy

[Paul Boddie]
> However,
> wouldn't the correct serialisation of the document be as follows?
>
> <?xml version="1.0"?>
> <href xmlns="DAV:"><no_ns xmlns=""/></href>

Yes, the correct way to override a default namespace is an xmlns=""
attribute.
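
The effect of the override can be seen from the parsing side. A minimal sketch, assuming Python 3 and xml.dom.minidom; child_namespace is a helper name made up for this illustration.

```python
from xml.dom import minidom

# Same document twice: once with the default namespace reset on the
# child, once without.
with_reset = minidom.parseString('<href xmlns="DAV:"><no_ns xmlns=""/></href>')
without_reset = minidom.parseString('<href xmlns="DAV:"><no_ns/></href>')

def child_namespace(doc):
    """Return the namespaceURI of the root element's first child."""
    return doc.documentElement.firstChild.namespaceURI

print(repr(child_namespace(with_reset)))     # child is in no namespace
print(repr(child_namespace(without_reset)))  # child inherits "DAV:"
```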

[Paul Boddie]
> As for the first issue - the presence of the xmlns attribute in the
> serialised document - I'd be interested to hear whether it is
> considered acceptable to parse the serialised document and to find that
> no non-null namespaceURI is set on the href element, given that such a
> namespaceURI was set when the document was created.

The key issue: should the serialised-then-reparsed document have the
same DOM "content" (XML InfoSet) if the user did not explicitly create
the requisite namespace declaration attributes?

My answer: No, it should not be the same.
My reasoning: The user did not explicitly create the attributes
  => The DOM should not automagically create them (according to
     the L2 spec)
  => such attributes should not be serialised
     - The user didn't create them
     - The DOM implementation didn't create them
     - If the serialisation processor creates them, that gives the
       same end result as if the DOM impl had (wrongly) created them.
  => the serialisation is a faithful/naive representation of the
     (not-namespace-well-formed) DOM constructed by the user (who
     omitted required attributes).
  => The reloaded document is a different DOM to the original, i.e.
     it has a different infoset.
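
The last step of that reasoning can be observed directly. A minimal sketch, assuming Python 3 and xml.dom.minidom rather than the Java toolchain discussed in this post:

```python
from xml.dom import minidom

impl = minidom.getDOMImplementation()

# Create a namespaced root element but no xmlns attribute.
original = impl.createDocument("DAV:", "href", None)

# Serialise and parse again.
reparsed = minidom.parseString(original.toxml())

before = original.documentElement.namespaceURI
after = reparsed.documentElement.namespaceURI
print(repr(before), repr(after))
```

The two values differ: the in-memory node is in "DAV:", while the reloaded node is in no namespace at all.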

The xerces and jython snippet I posted the other day demonstrates this.
If you look closely at that code, the actual DOM implementation and the
serialisation processor used are from different libraries. The DOM is
the inbuilt JAXP DOM implementation, Apache Crimson (the example only
works on JDK 1.4). The serialisation processor is the Apache Xerces
serialiser. The fact that the xmlns="DAV:" attribute didn't appear in
the output document shows that BOTH the (Crimson) DOM implementation AND
the (Xerces) serialiser chose NOT to automagically create the attribute.

If you run that snippet with other DOM implementations, by setting the
"javax.xml.parsers.DocumentBuilderFactory" property, you'll find the
same result.

Serialisation and namespace normalisation are both in the realm of DOM
Level 3, whereas minidom is only L2 compliant. Automagically introducing
L3 semantics into the L2 implementation is the wrong thing to do.

http://www.w3.org/TR/DOM-Level-3-LS/load-save.html
http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/namespaces-algorithms.html

[Paul Boddie]
> In other words, ...
>
> What should the "Namespace is" message produce?

Namespace is None

If you want it to produce,

Namespace is 'DAV:'

and for your code to be portable to other DOM implementations besides
libxml2dom, then your code should look like:-
> document = libxml2dom.createDocument(None, "doc", None)
> top = document.xpath("*")[0]
> elem1 = document.createElementNS("DAV:", "href")

elem1.setAttributeNS(xml.dom.XMLNS_NAMESPACE, "xmlns", "DAV:")
> document.replaceChild(elem1, top)
> elem2 = document.createElementNS(None, "no_ns")

elem2.setAttributeNS(xml.dom.XMLNS_NAMESPACE, "xmlns", "")
> document.xpath("*")[0].appendChild(elem2)
> document.toFile(open("test_ns.xml", "wb"))
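
For comparison, here is the same idea written against minidom. This is a sketch assuming Python 3 and the standard library only; it uses createDocument and an in-memory string in place of the libxml2dom calls and the file above.

```python
import xml.dom
from xml.dom import minidom

impl = minidom.getDOMImplementation()
document = impl.createDocument("DAV:", "href", None)

# Declare the default namespace explicitly on the root element.
elem1 = document.documentElement
elem1.setAttributeNS(xml.dom.XMLNS_NAMESPACE, "xmlns", "DAV:")

# The child is in no namespace, so it resets the default explicitly.
elem2 = document.createElementNS(None, "no_ns")
elem2.setAttributeNS(xml.dom.XMLNS_NAMESPACE, "xmlns", "")
elem1.appendChild(elem2)

serialised = document.toxml()
reparsed = minidom.parseString(serialised)
href = reparsed.documentElement
print("Namespace is", repr(href.namespaceURI))
print("Child namespace is", repr(href.firstChild.namespaceURI))
```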

its-not-about-namespaces-its-about-automagic-ly'yrs,
 
Paul Boddie

Alan said:
Serialisation and namespace normalisation are both in the realm of DOM
Level 3, whereas minidom is only L2 compliant. Automagically introducing
L3 semantics into the L2 implementation is the wrong thing to do.

I think I'll have to either add some configuration support, in order to
let the user specify which standards they have in mind, or to
deny/assert support for one or another of the standards. It's
interesting that minidom plus PrettyPrint seems to generate the xmlns
attributes in the serialisation, though; should that be reported as a
bug?

As for the toxml method in minidom, the subject did seem to be briefly
discussed on the XML-SIG mailing list earlier in the year:

http://mail.python.org/pipermail/xml-sig/2005-July/011157.html

> its-not-about-namespaces-its-about-automagic-ly'yrs

Well, with the automagic, all DOM users get the once in a lifetime
chance to exchange those lead boots for concrete ones. I'm sure there
are all sorts of interesting reasons for assigning namespaces to nodes,
serialising the document, and then not getting all the document
information back when parsing it, but I'd rather be spared all the
"amusement" behind all those reasons and just have life made easier for
just about everyone concerned. I think the closing remarks in the
following message say it pretty well:

http://mail-archives.apache.org/mod_mbox/xml-security-dev/200409.mbox/<1095071819.17967.44.camel%40amida>

And there are some interesting comments on this archived page, too:

http://web.archive.org/web/20010211173643/http://xmlbastard.editthispage.com/discuss/msgReader$6

Anyway, thank you for your helpful commentary on this matter!

Paul
 
Alan Kennedy

[Paul Boddie]
It's
interesting that minidom plus PrettyPrint seems to generate the xmlns
attributes in the serialisation, though; should that be reported as a
bug?

I believe that it is a bug.

[Paul Boddie]
Well, with the automagic, all DOM users get the once in a lifetime
chance to exchange those lead boots for concrete ones. I'm sure there
are all sorts of interesting reasons for assigning namespaces to nodes,
serialising the document, and then not getting all the document
information back when parsing it, but I'd rather be spared all the
"amusement" behind all those reasons and just have life made easier for
just about everyone concerned.

Well, if you have a fair amount of spare time and really want to improve
things, I recommend that you consider implementing the DOM L3 namespace
normalisation algorithm.

http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/namespaces-algorithms.html

That way, everyone can have namespace well-formed documents by simply
calling a single method, and not a line of automagic in sight: just
standards-compliant XML processing.
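
As a rough illustration of what such a method might look like for minidom, here is a heavily simplified sketch. It is not the DOM Level 3 algorithm: it only looks at element names, never renames prefixes, ignores namespaced attributes, and the function name add_missing_declarations is made up for this example.

```python
import xml.dom
from xml.dom import minidom

def add_missing_declarations(element, in_scope=None):
    """Add xmlns attributes so each element's own namespace is declared.

    Simplified sketch: element names only, no prefix renaming,
    namespaced attributes are not examined.
    """
    # Copy the mapping so siblings do not see each other's declarations.
    in_scope = dict(in_scope or {})

    # Record declarations that already exist on this element.
    attrs = element.attributes
    for i in range(attrs.length):
        attr = attrs.item(i)
        if attr.namespaceURI == xml.dom.XMLNS_NAMESPACE:
            declared = attr.localName if attr.prefix == "xmlns" else None
            in_scope[declared] = attr.value or None

    prefix = element.prefix or None
    wanted = element.namespaceURI or None
    # A prefixed name with no namespace cannot be fixed by a declaration,
    # so that case is left alone here.
    if in_scope.get(prefix) != wanted and not (prefix and wanted is None):
        name = "xmlns:" + prefix if prefix else "xmlns"
        element.setAttributeNS(xml.dom.XMLNS_NAMESPACE, name, wanted or "")
        in_scope[prefix] = wanted

    for child in element.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            add_missing_declarations(child, in_scope)

# Usage: a namespaced root with a child in no namespace.
impl = minidom.getDOMImplementation()
doc = impl.createDocument("DAV:", "href", None)
doc.documentElement.appendChild(doc.createElementNS(None, "no_ns"))
add_missing_declarations(doc.documentElement)
fixed = doc.toxml()
print(fixed)
```

The point of the design is that the fixup is an explicit call made by the application, so toxml() itself stays a plain dump of whatever nodes exist.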

> Anyway, thank you for your helpful commentary on this matter!

And thanks to you for actually informing yourself on the issue, and for
taking the time to research and understand it. I wish that your
refreshing attitude was more widespread!

now-i-really-must-get-back-to-work-ly'yrs,
 
and-google

Uche Ogbuji said:
> Andrew Clover also suggested an overly-legalistic argument that current
> minidom behavior is not a bug.

I stick by my language-law interpretation of spec. DOM 2 Core
specifically disclaims any responsibility for namespace fixup and
advises the application writer to do it themselves if they want to be
sure of the right output. W3C knew they weren't going to get all that
standardised by Level 2 so they left it open for future work - if
minidom claimed to support DOM 3 LS it would be a different matter.

> '<?xml version="1.0" ?>\n<ferh/>'
> (i.e. "ferh" rather than "href"), would you not consider that a minidom
> bug?

It's not a *spec* bug, as no spec that minidom claims to conform to
says anything about serialisation. It's a *minidom* bug in that it
fails to conform to the minimal documentation of the method toxml()
which claims to "Return the XML that the DOM represents as a string" -
the DOM does not represent that XML.

However that doc for toxml() says nothing about being namespace-aware.
XML and XML-with-namespaces both still exist, and for the former class
of document the minidom behaviour is correct.

> The reality is that once the poor user has done:
> element = document.createElementNS("DAV:", "href")
> They are following DOM specification that they have created an element
> in a namespace

It's possible that a namespaced node could also be imported/parsed into
a non-namespace document and then serialised; it's particularly likely
this could happen for scripts processing XHTML.

We shouldn't change the existing behaviour for toxml/writexml because
people may be relying on it. One of the reasons I ended up writing a
replacement was that the behaviour of minidom was not only wrong, but
kept changing under my feet with each version.

However, adding the ability to do fixup on serialisation would indeed
be very welcome - toxmlns() maybe, or toxml(namespaces= True)?

> I'll be sure to emphasize heavily to users that minidom is broken
> with respect to Namespaces and serialization, and that they
> abandon it in favor of third-party tools.

Well yes... there are in any case more fundamental bugs than just
serialisation problems.

> can anyone perhaps dig up a DOM L2 implementation that's not written
> by anyone involved in this thread

<g>
 
