Fredrik said:
> in which case you probably want to stream the raw XML through the parser
> *as it arrives*, to reduce latency (to do that, either parse from a
> file-like object, or feed data directly to a parser instance, via the
> consumer protocol).
It depends on the abstraction the web framework provides. If it allows you to
do that, especially in an event-driven way, that's obviously the most
efficient implementation (and both ElementTree and lxml support this usage
pattern just fine). However, some frameworks just pass the request content
(such as a POSTed document) in a dictionary or as callback parameters, in
which case there's little room for optimisation.
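To illustrate the feed approach, here is a minimal sketch using the standard
library's xml.etree.ElementTree (lxml's parser objects offer the same
feed()/close() methods). The chunk list is a made-up stand-in for data
arriving from the network, and parse_chunks is a hypothetical helper name,
not part of either library.

```python
import xml.etree.ElementTree as ET


def parse_chunks(chunks):
    """Feed XML fragments to a parser as they arrive; return the root element."""
    parser = ET.XMLParser()
    for chunk in chunks:
        # Each call hands one fragment to the parser; fragments may split
        # tags or text at arbitrary positions.
        parser.feed(chunk)
    # close() finishes parsing and returns the root of the tree that was built.
    return parser.close()


# Assumption: these fragments stand in for data read from a socket or
# request body, in whatever pieces the framework delivers them.
incoming = ["<root><a>te", "st</a><a>more", "</a></root>"]
root = parse_chunks(incoming)
print(root.tag, len(root))
```

Whether this helps depends, as said above, on the framework actually handing
you the data in pieces rather than as one finished string.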
> also, putting large documents in a *single* Python string can be quite
> inefficient. it's often more efficient to use lists of string fragments.
That's a pretty general statement. Do you mean in terms of reading from that
string (which at least in lxml is a straightforward extraction of a char*/len
pair that is passed into libxml2), constructing that string (possibly from
partial strings, which temporarily *is* expensive) or just keeping the string
in memory?
At least lxml doesn't benefit from iterating over a list of strings and
passing it to libxml2 step-by-step, compared to reading from a straight
in-memory string. Here are some numbers:
$ cat listtest.py
from lxml import etree

# a list of strings is more memory expensive than a straight string
doc_list = ["<root>"] + ["<a>test</a>"] * 2000 + ["</root>"]
# document construction temporarily ~doubles memory size
doc = "".join(doc_list)

def readlist():
    tree = etree.fromstringlist(doc_list)

def readdoc():
    tree = etree.fromstring(doc)

$ python -m timeit -s 'from listtest import readlist,readdoc' 'readdoc()'
1000 loops, best of 3: 1.74 msec per loop
$ python -m timeit -s 'from listtest import readlist,readdoc' 'readlist()'
100 loops, best of 3: 2.46 msec per loop
The performance difference stays somewhere around 20-30% even for larger
documents. So, as expected, there's a trade-off between temporary memory size,
long-term memory size and parser performance here.
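For the long-term memory side of that trade-off, a rough sketch of how one
might compare the two representations. It assumes CPython (sys.getsizeof is
implementation specific and only reports shallow sizes) and it uses distinct
fragments, which is my assumption here: the benchmark list above repeats one
and the same string object 2000 times, so it is not representative of a
document assembled from separately received pieces.

```python
import sys

# Assumption: every fragment is a separate string object, as it would be
# when the pieces were read from the network one by one.
fragments = ["<root>"] + ["<a>item %d</a>" % i for i in range(2000)] + ["</root>"]
doc = "".join(fragments)

# Shallow size of the list object plus the size of each string it refers to.
list_bytes = sys.getsizeof(fragments) + sum(sys.getsizeof(s) for s in fragments)
# Size of the single joined string.
doc_bytes = sys.getsizeof(doc)

print("list of fragments:", list_bytes, "bytes")
print("single string:    ", doc_bytes, "bytes")
```

The per-object overhead of many small strings is what makes the list the
larger of the two here; with few, large fragments the gap shrinks.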
Stefan