ElementTree should parse string and file in the same way

Peter Pei · Dec 31, 2007

One bad design about elementtree is that it has different ways parsing a
string and a file, even worse they return different objects:
1) When you parse a file, you can simply call parse, which returns a
elementtree, on which you can then apply xpath;
2) To parse a string (xml section), you can call XML or fromstring, but both
return element instead of elementtree. This alone is bad. To make it worse,
you have to create an elementtree from this element before you can utilize
xpath.

Paddy · Dec 31, 2007

Peter said:
One bad design about elementtree is that it has different ways parsing a
string and a file, even worse they return different objects:
1) When you parse a file, you can simply call parse, which returns a
elementtree, on which you can then apply xpath;
2) To parse a string (xml section), you can call XML or fromstring, but both
return element instead of elementtree. This alone is bad. To make it worse,
you have to create an elementtree from this element before you can utilize
xpath.

I haven't tried this, but you should be able to wrap your text string
so that it looks like a file using the stringio module and pass that
to elementtree:

http://blog.doughellmann.com/2007/04/pymotw-stringio-and-cstringio.html
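
A minimal sketch of this suggestion, written for current Python 3, where the in-memory text file class lives in `io` rather than in the StringIO module mentioned above. The sample XML and the helper name `parse_text` are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def parse_text(xml_text):
    """Parse XML held in a str by presenting it to parse() as a file object.

    parse_text is a hypothetical helper name, not part of ElementTree.
    """
    return ET.parse(io.StringIO(xml_text))


# parse() returns an ElementTree, the same kind of object it returns for a file.
tree = parse_text("<root><item>a</item><item>b</item></root>")
root = tree.getroot()
```

The result is an ElementTree, so code written for parsed files should work on it unchanged.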

- Paddy.

Stefan Behnel · Dec 31, 2007

Peter said:
One bad design about elementtree is that it has different ways parsing a
string and a file, even worse they return different objects:
1) When you parse a file, you can simply call parse, which returns a
elementtree, on which you can then apply xpath;

ElementTree doesn't support XPath. In case you mean the simpler ElementPath
language that is supported by the find*() methods, I do not see a reason why
you can't use it on elements.
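
As a rough illustration of the point about find*() on elements, here is a sketch using the standard library module. The sample document is invented, and the attribute predicate in the last lookup is supported by current versions of the module but, as far as I know, not by the ElementTree release that was current when this thread was written.

```python
import xml.etree.ElementTree as ET

# fromstring() returns the root Element directly, with no ElementTree wrapper.
root = ET.fromstring(
    "<library>"
    "<book id='1'><title>First</title></book>"
    "<book id='2'><title>Second</title></book>"
    "</library>"
)

# The find*() methods are available on the Element itself.
first_book = root.find("book")                          # first matching child
titles = [t.text for t in root.findall("book/title")]   # path through children
second_title = root.findtext("book[@id='2']/title")     # attribute predicate
```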

2) To parse a string (xml section), you can call XML or fromstring, but
both return element instead of elementtree. This alone is bad. To make
it worse, you have to create an elementtree from this element before you
can utilize xpath.

a) how hard is it to write a wrapper function around fromstring() that wraps
the result Element in an ElementTree object and returns it?

b) the same as above applies: I can't see the problem you are talking about.
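
The wrapper from (a) could look roughly like this. `parse_string` is a hypothetical name chosen for this sketch; it is not something the package provides.

```python
import xml.etree.ElementTree as ET


def parse_string(text):
    """Parse XML text and return an ElementTree, as parse() does for files.

    Hypothetical helper: wraps the Element from fromstring() in an ElementTree.
    """
    return ET.ElementTree(ET.fromstring(text))


tree = parse_string("<root><child/></root>")
```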

Stefan

Peter Pei · Jan 1, 2008

You are talking shit. It is never about whether it is hard to write a
wrapper. It is about bad design. I should be able to parse a string and a
file in exactly same way, and that should be provided as part of the
package.

Looks like you are just a code monkey not a designer, so I forgive you. You
didn't understand the issue I described? That's your issue. You are not at
the same level to talk to me, so chill.

Peter Pei · Jan 1, 2008

To be preise, XPath is not fully supported. Don't be a smart asshole.

Steven D'Aprano · Jan 1, 2008

Peter said:
You are talking shit. It is never about whether it is hard to write a
wrapper. It is about bad design. I should be able to parse a string and
a file in exactly same way, and that should be provided as part of the
package.

Oh my, somebody decided to start the new year with all guns blazing.

Before abusing anyone else, have you considered asking *why* ElementTree
does not treat files and strings the same way? I believe the writer of
ElementTree, Fredrik Lundh, frequents this newsgroup.

It may be that Fredrik doesn't agree with you that you should be able to
parse a string and a file the same way, in which case there's nothing you
can do but work around it. On the other hand, perhaps he just hasn't had
a chance to implement that functionality, and would welcome a patch.

Fredrik, if you're reading this, I'm curious what your reason is. I don't
have an opinion on whether you should or shouldn't treat files and
strings the same way. Over to you...

Stefan Behnel · Jan 1, 2008

Peter said:
To be preise

[...]

Preise the lord, not me.

Happy New Year!

Stefan

Diez B. Roggisch · Jan 1, 2008

Steven said:
Oh my, somebody decided to start the new year with all guns blazing.

Before abusing anyone else, have you considered asking *why* ElementTree
does not treat files and strings the same way? I believe the writer of
ElementTree, Fredrik Lundh, frequents this newsgroup.

It may be that Fredrik doesn't agree with you that you should be able to
parse a string and a file the same way, in which case there's nothing you
can do but work around it. On the other hand, perhaps he just hasn't had
a chance to implement that functionality, and would welcome a patch.

Fredrik, if you're reading this, I'm curious what your reason is. I don't
have an opinion on whether you should or shouldn't treat files and
strings the same way. Over to you...

I think the decision is pretty clear to everybody who is a code-monkey
and not a Peter-Pei-School-of-Excellent-And-Decent-Designers-attendant:

when building an XML document, you start from an Element or ElementTree
and often do things like

root_element = <some_element>
for child in some_objects:
    root_element.append(XML("""<child attribute="%i"/>""" %
                            child.attribute))

Which is such a common usage-pattern that it would be extremely annoying
to get a document from XML/fromstring and then needing to extract the
root-element from it.
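
A runnable version of that pattern might look like the sketch below. The `Item` class and the `some_objects` list are stand-ins invented for the example, since the snippet above leaves them unspecified.

```python
import xml.etree.ElementTree as ET


class Item:
    """Hypothetical stand-in for whatever objects are being serialized."""

    def __init__(self, attribute):
        self.attribute = attribute


some_objects = [Item(1), Item(2), Item(3)]

root_element = ET.Element("root")
for child in some_objects:
    # XML() returns an Element, so it can be appended without unwrapping.
    root_element.append(ET.XML('<child attribute="%i"/>' % child.attribute))

serialized = ET.tostring(root_element, encoding="unicode")
```

If XML() returned an ElementTree instead, every call in the loop would need an extra getroot().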

And codemonkeys know that in python

doc = et.parse(StringIO(string))

is just one import away, which people who attend to
Peter-Pei-School-of-Excellent-And-Decent-Designers may have not learned
yet - because they are busy praising themselves and coating each other
in edible substances before stepping out into the world and having all
code-monkeys lick off their greatness in awe.

Diez

Steven D'Aprano · Jan 1, 2008

Diez said:
And codemonkeys know that in python

doc = et.parse(StringIO(string))

is just one import away

Yes, but to play devil's advocate for a moment,

doc = et.parse(string_or_file)

would be even simpler.

Is there any reason why it should not behave that way? It could be as
simple as adding a couple of lines to the parse method:

if isinstance(arg, str):
    from StringIO import StringIO
    arg = StringIO(arg)

I'm not saying it *should*, I'm asking if there's a reason it *shouldn't*.

"I find it aesthetically distasteful" would be a perfectly acceptable
answer -- not one I would agree with, but I could accept it.

Steven Bethard · Jan 1, 2008

Steven said:
Yes, but to play devil's advocate for a moment,

doc = et.parse(string_or_file)

would be even simpler.

I assume the problem with this is that it would be ambiguous. You can
already use either a string or a file with ``et.parse``. A string is
interpreted as a file name, while a file object is used directly.

How would you differentiate between a string that's supposed to be a
file name, and a string that's supposed to be XML?
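
A small sketch of the existing behaviour, using the standard library module and a temporary file created only for the demonstration. The file name `sample.xml` is arbitrary.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

xml_text = "<root><child/></root>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")
    with open(path, "w", encoding="utf-8") as f:
        f.write(xml_text)

    # A string argument is treated as a file name.
    from_name = ET.parse(path)

    # A file object is read directly.
    with open(path, "rb") as f:
        from_file = ET.parse(f)

    # XML text passed as a string is also taken for a file name,
    # so parse() tries to open it and fails.
    try:
        ET.parse(xml_text)
        xml_text_accepted = True
    except OSError:
        xml_text_accepted = False
```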

Steve

Steven D'Aprano · Jan 1, 2008

Steven Bethard said:
I assume the problem with this is that it would be ambiguous. You can
already use either a string or a file with ``et.parse``. A string is
interpreted as a file name, while a file object is used directly.

Ah! I wasn't aware that parse() operated on either an open file object or
a string file name. That's an excellent reason for not treating strings
the same as files in ElementTree.

How would you differentiate between a string that's supposed to be a
file name, and a string that's supposed to be XML?

Well, naturally I wouldn't.

I *could*, if I assumed that a multi-line string that started with "<"
was XML, and a single-line string with the path separator character or
ending in ".xml" was a file name, but that sort of Do What I Mean coding
is foolish in a library function that can't afford to occasionally Do The
Wrong Thing.
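
To make the failure mode concrete, here is a sketch of that heuristic. `guess_kind` is a hypothetical function written only for this illustration, and treating "/" as a path separator on every platform is an assumption made to keep the example deterministic.

```python
import os

# Assumption: accept "/" everywhere, plus the platform's own separator.
PATH_SEPARATORS = ("/", os.sep)


def guess_kind(text):
    """Guess whether a string is XML or a file name (hypothetical heuristic)."""
    multi_line = "\n" in text
    if multi_line and text.lstrip().startswith("<"):
        return "xml"
    has_separator = any(sep in text for sep in PATH_SEPARATORS)
    if not multi_line and (has_separator or text.endswith(".xml")):
        return "filename"
    return "unknown"


multi_line_doc = guess_kind("<root>\n  <child/>\n</root>")
plain_path = guess_kind("data/input.xml")
# A one-line document contains "/" in its closing tag, so it is misclassified.
one_line_doc = guess_kind("<root></root>")
```

The one-line document is valid XML, yet the heuristic takes it for a file name, which is the kind of Wrong Thing described above.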

Peter Pei · Jan 2, 2008

To answer something posted deep down... It is fine with me if there are two
functions - one to parse a file or file handle and one to parse a string,
yet the returned objects should be consistent.

Fredrik Lundh · Jan 2, 2008

Steven said:
Fredrik, if you're reading this, I'm curious what your reason is. I don't
have an opinion on whether you should or shouldn't treat files and
strings the same way. Over to you...

as Diez shows, it's all about use cases.

and as anyone who's used my libraries or read my code knows, I'm a big
fan of minimalistic but highly composable object API:s and liberal use
of short helper functions to wire them up to fit the task at hand.

kitchen sink API design is a really bad idea, for more reasons than I
can fit in this small editor window.

</F>

Chris Mellon · Jan 2, 2008

Fredrik said:
as Diez shows, it's all about use cases.

and as anyone who's used my libraries or read my code knows, I'm a big
fan of minimalistic but highly composable object API:s and liberal use
of short helper functions to wire them up to fit the task at hand.

kitchen sink API design is a really bad idea, for more reasons than I
can fit in this small editor window.

On that note, I really don't like APIs that take either a file name or
a file object - I can open my own files, thanks. File objects are
fantastic abstractions and open(fname) is even shorter than
StringIO(somedata).

My take on the API decision in question was always that a file is
inherently an XML *document*, while a string is inherently an XML
*fragment*.

Stefan Behnel · Jan 3, 2008

Hi,

Chris said:
On that note, I really don't like APIs that take either a file name or
a file object - I can open my own files, thanks.

... and HTTP URLs, and FTP URLs. In lxml, there is a performance difference
between passing an open file (which is read in Python space using the read()
method) and passing a file name or URL, which is passed on to libxml2 (and
thus doesn't require the GIL at parse time). That's only one reason why I like
APIs that allow me to pass anything that points to a file - be it an open file
object, a local file path or a URL - and they just Do The Right Thing with it.

I find that totally pythonic.

open(fname) is even shorter than StringIO(somedata).

It doesn't serve the same purpose, though.

My take on the API decision in question was always that a file is
inherently an XML *document*, while a string is inherently an XML
*fragment*.

Not inherently, no. I know some people who do web processing with an XML
document coming in as a string (from an HTTP request) and a result XML
document going out as a string. I don't think that's an uncommon use case.

Stefan

Fredrik Lundh · Jan 3, 2008

Stefan said:
Not inherently, no. I know some people who do web processing with an XML
document coming in as a string (from an HTTP request) /.../

in which case you probably want to stream the raw XML through the parser
*as it arrives*, to reduce latency (to do that, either parse from a
file-like object, or feed data directly to a parser instance, via the
consumer protocol).
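
A minimal sketch of feeding a parser instance incrementally, using the standard library's XMLParser. The `chunks` list stands in for data arriving from a socket or request body, and `parse_chunks` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def parse_chunks(chunks):
    """Feed XML to a parser piece by piece and return the root Element."""
    parser = ET.XMLParser()
    for chunk in chunks:
        parser.feed(chunk)   # chunks may split tags at arbitrary points
    return parser.close()    # close() finishes parsing and returns the root


# Simulated network chunks; the boundaries deliberately fall inside tags.
chunks = ["<root><it", "em>a</item><item>", "b</item></root>"]
root = parse_chunks(chunks)
```

Nothing here requires the whole document to exist as one string at any point.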

also, putting large documents in a *single* Python string can be quite
inefficient. it's often more efficient to use lists of string fragments.

</F>

Stefan Behnel · Jan 3, 2008

Fredrik said:
in which case you probably want to stream the raw XML through the parser
*as it arrives*, to reduce latency (to do that, either parse from a
file-like object, or feed data directly to a parser instance, via the
consumer protocol).

It depends on the abstraction the web framework provides. If it allows you to
do that, especially in an event driven way, that's obviously the most
efficient implementation (and both ElementTree and lxml support this use
pattern just fine). However, some frameworks just pass the request content
(such as a POSTed document) in a dictionary or as callback parameters, in
which case there's little room for optimisation.

also, putting large documents in a *single* Python string can be quite
inefficient. it's often more efficient to use lists of string fragments.

That's a pretty general statement. Do you mean in terms of reading from that
string (which at least in lxml is a straightforward extraction of a char*/len
pair which is passed into libxml2), constructing that string (possibly from
partial strings, which temporarily *is* expensive) or just keeping the string
in memory?

At least lxml doesn't benefit from iterating over a list of strings and
passing it to libxml2 step-by-step, compared to reading from a straight
in-memory string. Here are some numbers:

$ cat listtest.py
from lxml import etree

# a list of strings is more memory expensive than a straight string
doc_list = ["<root>"] + ["<a>test</a>"] * 2000 + ["</root>"]
# document construction temporarily ~doubles memory size
doc = "".join(doc_list)

def readlist():
    tree = etree.fromstringlist(doc_list)

def readdoc():
    tree = etree.fromstring(doc)

$ python -m timeit -s 'from listtest import readlist,readdoc' 'readdoc()'
1000 loops, best of 3: 1.74 msec per loop

$ python -m timeit -s 'from listtest import readlist,readdoc' 'readlist()'
100 loops, best of 3: 2.46 msec per loop

The performance difference stays somewhere around 20-30% even for larger
documents. So, as expected, there's a trade-off between temporary memory size,
long-term memory size and parser performance here.

Stefan

Fredrik Lundh · Jan 3, 2008

Stefan said:
That's a pretty general statement. Do you mean in terms of reading from that
string (which at least in lxml is a straightforward extraction of a char*/len
pair which is passed into libxml2), constructing that string (possibly from
partial strings, which temporarily *is* expensive) or just keeping the string
in memory?

overall I/O throughput. it's of course construction and internal
storage that are the main issues here; every extra copy has a cost, and
if you're working with multi-megabyte resources, the extra expenses
quickly become noticeable.

</F>
