lxml/ElementTree and .tail

Fredrik Lundh

Chas said:
> That's flatly unrealistic. If you'll remember, I'm not one of "those
> people" that are specification-driven -- I hadn't even *heard* of
> Infoset until earlier this week!

The rant wasn't directed at you or anyone special, but I don't really
think you got the point of it either. Which is a bit strange, because
it sounded like you *were* working on extracting information from messy
documents, so the "it's about the data, dammit" way of thinking
shouldn't be news to you.

And the routing around is not unrealistic, it's a *fact*; JSON and
POX are killing the full XML/Schema/SOAP stack for communication, XHTML
is pretty much dead as a wire format, people are apologizing in public
for their use of SOAP, AJAX is quickly turning into AJAJ, few people
care about the more obscure details of the XML 1.0 standard (when did
you last see a conditional section? or even a DTD?), dealing with huge
XML data sets is still extremely hard compared to just uploading the
darn thing to a database and doing the crunching in SQL, and nobody uses
XML 1.1 for anything.

Practicality beats purity, and the Internet routes around damage, every
single time.

> overwhelming majority of the developers out there care for nothing
> but the serialization, simply because that's how one plays nicely
> with others.

The problem is if you only stare at the serialization, your code *won't*
play nicely with others. At the serialization level, it's easy to think
that CDATA sections are different from other text, that character
references are different from ordinary characters, that you should
somehow be able to distinguish between <tag></tag> and <tag/>, that
namespace prefixes are more important than the namespace URI, that an
&nbsp; in an XHTML-style stream is different from a U+00A0 character in
memory, and so on. In my experience, serialization-only thinking (at
the receiving end) is the single most common cause for interoperability
problems when it comes to general XML interchange.

But when you focus on the data model, and treat the serialization as an
implementation detail, to be addressed by a library written by someone
who's actually read the specifications a few more times than you have,
all those problems tend to just go away. Things just work.
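For readers who want to see the point above in practice, here is a minimal sketch using the standard library's xml.etree.ElementTree. The sample documents and the `summarize` helper are made up for illustration; the idea is only that serializations which differ byte-for-byte come out as the same data once parsed.

```python
import xml.etree.ElementTree as ET

# Pairs of serializations that look different on the wire but
# should carry the same data once parsed (illustrative samples).
variants = {
    "CDATA vs. escaped text": (
        "<msg><![CDATA[a < b]]></msg>",
        "<msg>a &lt; b</msg>",
    ),
    "character reference vs. literal": (
        "<msg>caf&#233;</msg>",
        "<msg>caf\u00e9</msg>",
    ),
    "empty element, two spellings": (
        "<msg></msg>",
        "<msg/>",
    ),
    "different namespace prefixes": (
        '<a:msg xmlns:a="http://example.com/ns">hi</a:msg>',
        '<b:msg xmlns:b="http://example.com/ns">hi</b:msg>',
    ),
}


def summarize(xml_text):
    """Reduce a one-element document to the data it carries."""
    root = ET.fromstring(xml_text)
    return (root.tag, root.text, dict(root.attrib))


results = {}
for label, (first, second) in variants.items():
    results[label] = summarize(first) == summarize(second)
    print(label, "->", results[label])
```

Code that compares the parsed values rather than the raw bytes does not need to care which spelling the sender happened to use.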

And in practice, of course, most software engineers understand this, and
care about this. After all, good software engineering is about
abstractions and decoupling and designing things so you can focus on one
part of the problem at a time. And about making your customer happy,
and having fun while doing that. Not staying up all night to look for
an obscure interoperability problem that you finally discover is caused
by someone using a CDATA section where you expected a character
reference, in 0.1% of all production records, but in none of the files
in your test data set.

(By the way, did ET fail to *read* your XML documents? I thought your
complaint was that it didn't put the things it read in a place where you
expected them to be, and that you didn't have time to learn how to deal
with that because you had more important things to do, at the time?)

</F>
 
Chas Emerick

> The rant wasn't directed at you or anyone special, but I don't really
> think you got the point of it either. Which is a bit strange, because
> it sounded like you *were* working on extracting information from messy
> documents, so the "it's about the data, dammit" way of thinking
> shouldn't be news to you.

No, it's not any kind of news at all, and I'm very sympathetic to
your specific perspective (and have advocated it in other contexts
and circumstances, where appropriate). And yes, we are in fact
ensuring that we get from the HTML/XHTML/text/PDF/etc serialization
we have to consume to a uniform, normalized, and "clean" data model
in as few steps as possible. However, in those few steps, we have to
recognize the functional reality of how each data representation is
used out in the world in order to translate it into a uniform model
for our own purposes. In concrete terms, that means that an end tag
in an XHTML serialization means that that element is closed, done,
finit. Any other representation of that serialization doesn't
correspond properly with the intent of that HTML document's author.

> And the routing around is not unrealistic, it's a *fact*; JSON and
> POX are killing the full XML/Schema/SOAP stack for communication, XHTML
> is pretty much dead as a wire format, people are apologizing in public
> for their use of SOAP, AJAX is quickly turning into AJAJ, few people
> care about the more obscure details of the XML 1.0 standard (when did
> you last see a conditional section? or even a DTD?), dealing with huge
> XML data sets is still extremely hard compared to just uploading the
> darn thing to a database and doing the crunching in SQL, and nobody uses
> XML 1.1 for anything.
>
> Practicality beats purity, and the Internet routes around damage, every
> single time.

I agree 100% -- but I would have thought that that's a point I would
have made. The model that ET uses seems like a "purified"
representation of a mixed-content serialization, exactly because it
is geared to an ideal rather than the practical realities of mixed
content and expectations thereof.
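For anyone following the thread without ET at hand, this is a small sketch of the mixed-content model being discussed, using the standard library's xml.etree.ElementTree. The sample paragraph and the `flatten` helper are illustrative assumptions, not part of either library: text that follows an end tag is stored on that element's `.tail` rather than as a separate child of the parent.

```python
import xml.etree.ElementTree as ET

# A small mixed-content sample (made up for illustration).
root = ET.fromstring("<p>Hello <b>brave</b> new <i>world</i>!</p>")

b = root.find("b")
i = root.find("i")

# Text before the first child lives on the parent's .text;
# text after a child's end tag lives on that child's .tail.
print(repr(root.text))
print(repr(b.text), repr(b.tail))
print(repr(i.text), repr(i.tail))


def flatten(elem):
    """Rebuild the character data of elem in document order."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(flatten(child))
        parts.append(child.tail or "")
    return "".join(parts)


print(flatten(root))
```

The standard library's `Element.itertext()` performs the same document-order walk, so `"".join(root.itertext())` gives the same string here.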

For what it's worth, our current effort is directed towards providing
significant stores/feeds of XML/PDF/HTML/text/etc in something that
can be dropped into a RDBMS. Perhaps that's the source of the
impedance between us: you view Infoset as a functional replacement
for serialization-dependent XML, whereas we are focussed on what
could be broadly described as a translation from one to the other.

> The problem is if you only stare at the serialization, your code *won't*
> play nicely with others. At the serialization level, it's easy to think
> that CDATA sections are different from other text, that character
> references are different from ordinary characters, that you should
> somehow be able to distinguish between <tag></tag> and <tag/>, that
> namespace prefixes are more important than the namespace URI, that an
> &nbsp; in an XHTML-style stream is different from a U+00A0 character in
> memory, and so on. In my experience, serialization-only thinking (at
> the receiving end) is the single most common cause for interoperability
> problems when it comes to general XML interchange.

I agree with all of that. I would again refer to the pervasive view
of what end tags mean -- that's what I was primarily referring to
with the term 'serialization'.

> (By the way, did ET fail to *read* your XML documents? I thought your
> complaint was that it didn't put the things it read in a place where you
> expected them to be, and that you didn't have time to learn how to deal
> with that because you had more important things to do, at the time?)

No, it doesn't put things in the right places, so I consider that a
failure of the model. I don't see why I should have spent time
learning how to deal with that when another very comprehensive
library is available that does meet expectations. *shrug*

Further, the fact that ET/lxml works the way that it does makes me
think that there may be some other landmines in the underlying model
that we might not have discovered until some days, weeks, etc., had
passed, so there's a much greater comfort level in working with a
library that explicitly supports the model that we expect (and was
assumed when the HTML [now XHTML] documents in question were authored).

- Chas
 
Fredrik Lundh

Chas said:
Further, the fact that ET/lxml works the way that it does makes me
think that there may be some other landmines in the underlying model
that we might not have discovered until some days, weeks, etc., had
passed

so the real reason you posted your original post was to spread some FUD,
not to get help? that's a bit disappointing.

</F>
 
Chas Emerick

> so the real reason you posted your original post was to spread some FUD,
> not to get help? that's a bit disappointing.

<sarcasm>
Yeah, that's exactly it. In fact, if you look back at the head of
this thread, you'll see how I was looking to disparage ET. I
especially wanted to make sure ET's API doesn't get any traction in
the python community. It's especially important that ET not find
popular success and acclaim -- I'd have quite a bit to gain from it
remaining a niche library.
</sarcasm>

Fredrik, I wasn't attempting to spread anything. I was confused, I
posed some illustrative examples, and asked for people's thoughts.
Your reply gave me the right vocabulary to find more information
(i.e. about Infoset), and I replied with an overview of what I had
learned so as to benefit anyone with similar questions or confusion
in the future. A discussion ensued.

ET (and lxml) is obviously extremely successful, widely used, and for
good reason. It's just not right for us, but you incorrectly
surmised that I was simply lazy by not modifying/extending ET/lxml to
make it suitable for our purposes even when other libraries existed
that better meshed with our requirements. I tried to answer as
straightforwardly as possible, and (regrettably, it turns out)
included the fact that I had worried that our apparent conceptual
differences indicated that we might find other instances where ET/
lxml works differently than we would expect. I think that's very
rational, and doesn't speak poorly of ET in any way (especially given
its obvious success elsewhere).

- Chas
 
Uche Ogbuji

Fredrik said:
sure, the computing world is and has always been full of people who want
the simplest thing to look a lot harder than it actually is. after all,
*they* spent lots of time reading all the specifications, they've bought
all the books, and went to all the seminars, so it's simply not fair
when others are cheating.

You sound bitter about something. Don't worry, it's really not all
that serious.

> in reality, *all* interchange formats are easier to understand and use
> if you focus on a (complete or intentionally simplified) data model of
> the things being interchanged, and treat various artifacts of the
> byte-stream used by the wire format as artifacts, historical accidents
> based on what specification happened to be written before the other, or
> what some guy did or did not do in the seventies, as accidents, and
> esoteric arcana disseminated on limited-distribution mailing lists as
> about as relevant for your customer as last week's episode of American Idol.

The fact that the XML Infoset is hardly used outside W3C XML Schema,
and that the XPath data model is far more common, and that focus on the
serialization is even more common than that is a matter of everyday
practicality.

And oh by the way, this thread is all about *your* customer's
complaining. And your response is to give them your philosophical take
on XML. Doesn't that contradict what you're saying above?

Oh never mind. You posted something misleading, and I posted another
point of view. I know you're incapable of any disagreement that
doesn't devolve into a full-scale flame-war. Sometimes I have time for
that sort of thing. This is not one of those times, so this is
probably where I get off.
 
Uche Ogbuji

Paul said:
Thankfully, I'm largely on the periphery of that universe (except for being
a sometimes victim). But it is certainly frustrating to see many of the OMG
concepts of the 90's reimplemented in Java services, and then again in
XML/SOAP, with no detectable awareness that these messaging and
serialization problems have been considered before, and much more
thoroughly.

You'll be surprised at how many XMLers agree that Web services are a
pretty inept reinvention of CORBA. I was pretty much slain by this
take:

http://wanderingbarque.com/nonintersecting/2006/11/15/the-s-stands-for-simple

I think Duncan Grisby of OmniORB put it most succinctly when he pointed
out that SOAP and friends are more complex, more bloated, and less
interoperable than CORBA ever was. But they use XML so they get the
teacher's pet treatment.

> I liked XML when I could read it and hack it out in Notepad.

You still can, and don't let anyone tell you otherwise. I've always
argued that XML doesn't work unless it's Notepad-hackable. I do
usually allow an exception for SVG.

> I like attributes, which puts me on the outs with most XML zealots who
> forswear the use of attributes on purely academic grounds (they defeat the
> future possible expansion of an attribute's value into more complex
> substructure).

Really? Do you have any references for this? I haven't seen much
criticism of attributes since the very early days, and almost all XML
technologies make heavy use of attributes. Here's my take:

http://www.ibm.com/developerworks/xml/library/x-eleatt.html

As you can see, elements and attributes get equal billing.

> I dislike namespaces, especially the default xmlns kind, as they make me
> take extra steps when retrieving nodes via Xpaths; and everyone seems to
> think their application needs namespaces, when there is no threat that these
> tags will ever get mixed up with anyone else's.

Namespaces are possibly the worst thing to have ever happened to XML.
Again, my take:

http://www.ibm.com/developerworks/xml/library/x-namcar.html

And yes, default namespaces are about 50% of the problem with
namespaces. QNames in content (which are of course an abuse of
namespaces) are almost all of the other 50%. I call them "hidden
namespaces":

http://copia.ogbuji.net/blog/2006-08-14/Some_thoug
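To make the "extra steps" complaint about default namespaces concrete, here is a minimal sketch with the standard library's xml.etree.ElementTree. The document, the namespace URI, and the `f` prefix are all invented for the example: once a default xmlns is declared, an unqualified path no longer matches anything.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with a default namespace declaration.
doc = '<feed xmlns="http://example.com/feed"><entry>one</entry></feed>'
root = ET.fromstring(doc)

# The obvious lookup silently finds nothing.
unqualified = root.find("entry")

# The element's real name includes the namespace URI.
qualified = root.find("{http://example.com/feed}entry")

# Alternatively, map a prefix of your own choosing to the URI.
ns = {"f": "http://example.com/feed"}
mapped = root.find("f:entry", ns)

print(unqualified, qualified.text, mapped.text)
```

Note that the prefix used in the lookup is chosen by the caller and need not appear in the document at all; only the URI has to match.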
 
Diez B. Roggisch

> You'll be surprised at how many XMLers agree that Web services are a
> pretty inept reinvention of CORBA. I was pretty much slain by this
> take:
>
> http://wanderingbarque.com/nonintersecting/2006/11/15/the-s-stands-for-simple

Thanks for that! Sums up nicely my experiences, and gave me a good chuckle!

While I liked the idea of AXIS reflecting my java code in the first
place (as long as interoperability only meant "I can test my own code"),
it sucked soooo hard when trying to make it work with anything else
(including python of course).

And I don't know why I've complained about this style of inverse
interface generation on so many other occasions (e.g. COM interfaces in
VStudio, JBuilder GUI design and so on), but could never quite put the
finger on what disturbed me on SOAP.

Probably because looking at a WSDL it immediately made me shrink away
from that mess and hope that there must be _some_ merciful deity that
will produce that crap for me, so that I never asked myself the right
questions....

Diez
 
Fredrik Lundh

Uche said:
> The fact that the XML Infoset is hardly used outside W3C XML Schema,
> and that the XPath data model is far more common, and that focus on
> the serialization is even more common than that is a matter of
> everyday practicality.

everyday interoperability problems, that is. yesterday, someone
reported a bug in Python's xml.dom because he couldn't get it to
serialize the string "&nbsp;" as "&nbsp;". earlier today, someone
asked how to work around an XML parser that didn't understand
namespace prefixes.
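The first of those reports can be reproduced along these lines; this is only a sketch of the likely situation, assuming the reporter built the text node with xml.dom.minidom. At the data-model level the string "&nbsp;" is six ordinary characters, so a correct serializer has to escape the ampersand. What the reporter presumably wanted was a U+00A0 character in the tree.

```python
from xml.dom import minidom

doc = minidom.Document()

# Six literal characters: '&', 'n', 'b', 's', 'p', ';'.
p = doc.createElement("p")
p.appendChild(doc.createTextNode("&nbsp;"))
literal = p.toxml()

# An actual no-break space in the data model.
q = doc.createElement("p")
q.appendChild(doc.createTextNode("\u00a0"))
nbsp = q.toxml()

print(repr(literal))
print(repr(nbsp))
```

Neither output contains the text `&nbsp;`, which is the expected behaviour for an XML serializer: the entity name is an HTML/XHTML DTD convenience, not part of the data.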
> And oh by the way, this thread is all about *your* customer's
> complaining.

from what I can tell, it was *your* customer posting FUD about a
different library, not my customer asking for help with a specific
problem. this is free software; people who use a piece of software
count a *lot* more than people who don't want to use it.
> This is not one of those times, so this is probably where I get off.

I'll be looking forward to your next O'Reilly article.

</F>
 
Chas Emerick

> from what I can tell, it was *your* customer posting FUD about a
> different library, not my customer asking for help with a specific
> problem. this is free software; people who use a piece of software
> count a *lot* more than people who don't want to use it.

Holy hell Fredrik -- I hadn't even *downloaded* 4suite before I
posted my original question. I've tried to be nice, tried to be
complimentary, and tried to be diplomatic, so it would be nice if
*everyone* would stop casting aspersions or otherwise speculating
about my intentions. Flame amongst yourselves, but leave me out of it.

- Chas
 
Fredrik Lundh

Uche said:
The fact that the XML Infoset is hardly used outside W3C XML Schema,
and that the XPath data model is far more common,

and for the bystanders, it should be noted that the Infoset is pretty
much the same thing as the XPath data model; it's mostly just that the
specifications use different names for the same concept. if you cut
through the vocabulary, it's all about a tree of elements, plus text and
attributes and a few more (but usually less interesting) things. it's a
bit like arguing that

    class Person(object):
        __slots__ = ["name"]
        def __init__(self, name):
            self.name = name

and

    class Employee:
        def __init__(self, first_name, last_name):
            self.full_name = first_name + " " + last_name

and

    employee_name = "..."

are entirely different things, and not just three more or less
convenient ways to store exactly the same information.

</F>
 
Damjan

> sure, the computing world is and has always been full of people who want
> the simplest thing to look a lot harder than it actually is. after all,
> *they* spent lots of time reading all the specifications, they've bought
> all the books, and went to all the seminars,

and have been sold all the expensive proprietary tools
 
