Trying to parse a HUGE(1gb) xml file

N

Nobody

Sometimes XML is processed sequentially. When the markup footprint is
large enough it must be. Quite often, as in the case of the OP, you only
want to extract a small piece out of the total data. In those cases,
being forced to read all of the data sequentially is both inconvenient and
and a performance penalty unless there is some way to address the data you
want directly.

Actually, I should have said "must be processed sequentially". Even if you
only care about a small portion of the data, you have to read it
sequentially to locate that portion. IOW, anything you can do with
uncompressed XML can be done with compressed XML; you can't do random
access with either.

If XML has a drawback over application-specific formats, it's the
sequential nature of XML rather than its (uncompressed) size.

OTOH, formats designed for random access tend to be more limited in their
utility. You can only perform random access based upon criteria which
match the format's indexing. Once you step outside that, you often have to
walk the entire file anyhow.
 
S

Stefan Behnel

Tim Harig, 26.12.2010 02:05:
Sometimes XML is processed sequentially. When the markup footprint is
large enough it must be. Quite often, as in the case of the OP, you only
want to extract a small piece out of the total data. In those cases, being
forced to read all of the data sequentially is both inconvenient and and a
performance penalty unless there is some way to address the data you want
directly.

So what? If you only have to do that once, it doesn't matter if you have to
read the whole file or just a part of it. Should make a difference of a
couple of minutes.

If you do it a lot, you will have to find a way to make the access
efficient for your specific use case. So the file format doesn't matter
either, because the data will most likely end up in a fast data base after
reading it in sequentially *once*, just as in the case above.

I really don't think there are many important use cases where you need fast
random access to large data sets and cannot afford to adapt the storage
layout before hand.

Stefan
 
T

Tim Harig

OTOH, formats designed for random access tend to be more limited in their
utility. You can only perform random access based upon criteria which
match the format's indexing. Once you step outside that, you often have to
walk the entire file anyhow.

That may be true and it may not. Even assuming that you have to walk
through a large number of top level elements there may be an advantage
to being able to directly access the next element as opposed to having
to parse through the entire current element once you have determined it
isn't one which you are looking for. To be fair, this may be invalid
preoptimization without taking into account how the hard drive buffers;
but, I would suspect that there is a threshold where the amount of
data skipped starts to outweigh the penalty of overreaching the hard
drives buffers.
 
T

Tim Harig

Tim Harig, 26.12.2010 02:05:

So what? If you only have to do that once, it doesn't matter if you have to
read the whole file or just a part of it. Should make a difference of a
couple of minutes.

Much agreed. I assume that the process needs to be repeated or it
probably would be simpler just to rip out what I wanted using regular
expressions with shell utilities.
If you do it a lot, you will have to find a way to make the access
efficient for your specific use case. So the file format doesn't matter
either, because the data will most likely end up in a fast data base after
reading it in sequentially *once*, just as in the case above.

If the data is just going to end up in a database anyway; then why not
send it as a database to begin with and save the trouble of having to
convert it?
 
S

Stefan Behnel

Tim Harig, 26.12.2010 10:22:
Tim Harig, 26.12.2010 02:05:
On 2010-12-25, Nobody wrote:
On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote:
Of course, one advantage of XML is that with so much redundant text, it
compresses well. We typically see gzip compression ratios of 20:1.
But, that just means you can archive them efficiently; you can't do
anything useful until you unzip them.

XML is typically processed sequentially, so you don't need to create a
decompressed copy of the file before you start processing it.

Sometimes XML is processed sequentially. When the markup footprint is
large enough it must be. Quite often, as in the case of the OP, you only
want to extract a small piece out of the total data. In those cases, being
forced to read all of the data sequentially is both inconvenient and and a
performance penalty unless there is some way to address the data you want
directly.
[...]
If you do it a lot, you will have to find a way to make the access
efficient for your specific use case. So the file format doesn't matter
either, because the data will most likely end up in a fast data base after
reading it in sequentially *once*, just as in the case above.

If the data is just going to end up in a database anyway; then why not
send it as a database to begin with and save the trouble of having to
convert it?

I don't think anyone would object to using a native format when copying
data from one database 1:1 to another one. But if the database formats are
different on both sides, it's a lot easier to map XML formatted data to a
given schema than to map a SQL dump, for example. Matter of use cases, not
of data size.

Stefan
 
T

Tim Harig

Tim Harig, 26.12.2010 10:22:
Tim Harig, 26.12.2010 02:05:
On 2010-12-25, Nobody wrote:
On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote:
Of course, one advantage of XML is that with so much redundant text, it
compresses well. We typically see gzip compression ratios of 20:1.
But, that just means you can archive them efficiently; you can't do
anything useful until you unzip them.

XML is typically processed sequentially, so you don't need to create a
decompressed copy of the file before you start processing it.

Sometimes XML is processed sequentially. When the markup footprint is
large enough it must be. Quite often, as in the case of the OP, you only
want to extract a small piece out of the total data. In those cases, being
forced to read all of the data sequentially is both inconvenient and and a
performance penalty unless there is some way to address the data you want
directly.
[...]
If you do it a lot, you will have to find a way to make the access
efficient for your specific use case. So the file format doesn't matter
either, because the data will most likely end up in a fast data base after
reading it in sequentially *once*, just as in the case above.

If the data is just going to end up in a database anyway; then why not
send it as a database to begin with and save the trouble of having to
convert it?

I don't think anyone would object to using a native format when copying
data from one database 1:1 to another one. But if the database formats are
different on both sides, it's a lot easier to map XML formatted data to a
given schema than to map a SQL dump, for example. Matter of use cases, not
of data size.

Your assumption keeps hinging on the fact that I should want to dump
the data into a database in the first place. Very often I don't.
I just want to rip out the small portion of information that happens to
be important to me. I may not even want to archive my little piece of
the information once I have processed it.

Even assuming that I want to dump all the data into a database,
walking through a bunch of database records to translate them into the
schema for another database is no more difficult then walking through a
bunch of XML elements. In fact, it is even easier since I can use the
relational model to reconstruct the information in an organization that
better fits how the data is actually structured in my database instead
of being constrained by how somebody else wanted to organize their XML.
There is no need to "map a[sic] SQL dump."

XML is great when the data is set is small enough that parsing the
whole tree has negligable costs. I can choose whether I want to parse
it sequentially or use XPath/DOM/Etree etc to make it appear as though
I am making random accesses. When the data set grows so that parsing
it is expensive I loose that choice even if my use case would otherwise
prefer a random access paradigm. When that happens, there are better
ways of communicating that data that doesn't force me into using a high
overhead method of extracting my data.

The problem is that XML has become such a defacto standard that it
used automatically, without thought, even when there are much better
alternatives available.
 
A

Alan Meyer

Adam Tauno Williams, 20.12.2010 20:49: ....

IMHO, this is the worst advice you can give.

Why do you say that? I would have thought that using SAX in this
application is an excellent idea.

I agree that for applications for which performance is not a problem,
and for which we need to examine more than one or a few element types, a
tree implementation is more functional, less programmer intensive, and
provides an easier to understand approach to the data. But with huge
amounts of data where performance is a problem SAX will be far more
practical. In the special case where only a few elements are of
interest in a complex tree, SAX can sometimes also be more natural and
easy to use.

SAX might also be more natural for this application. The O.P. could
tell us for sure, but I wonder if perhaps his 1 GB XML file is NOT a
true single record. You can store an entire text encyclopedia in less
than one GB. What he may have is a large number logically distinct
individual records of some kind, each stored as a node in an
all-encompassing element wrapper. Building a tree for each record could
make sense but, if I'm right about the nature of the data, building a
tree for the wrapper gives very little return for the high cost.

If that's so, then I'd recommend one of two approaches:

1. Use SAX, or

2. Parse out individual logical records using string manipulation on an
input stream, then build a tree for one individual record in memory
using one of the DOM or ElementTree implementations. After each record
is processed, discard its tree and start on the next record.

Alan
 
A

Alan Meyer

On 12/26/2010 3:15 PM, Tim Harig wrote:
....
The problem is that XML has become such a defacto standard that it
used automatically, without thought, even when there are much better
alternatives available.

I agree with you but, as you say, it has become a defacto standard. As
a result, we often need to use it unless there is some strong reason to
use something else.

The same thing can be said about relational databases. There are
applications for which a hierarchical database makes more sense, is more
efficient, and is easier to understand. But anyone who recommends a
database that is not relational had better be prepared to defend his
choice with some powerful reasoning because his management, his
customers, and the other programmers on his team are probably going to
need a LOT of convincing.

And of course there are many applications where XML really is the best.
It excels at representing complex textual documents while still
allowing programmatic access to individual items of information.

Alan
 
S

Stefan Behnel

Alan Meyer, 27.12.2010 21:40:
Why do you say that? I would have thought that using SAX in this
application is an excellent idea.

From my experience, SAX is only practical for very simple cases where
little state is involved when extracting information from the parse events.
A typical example is gathering statistics based on single tags - not a very
common use case. Anything that involves knowing where in the XML tree you
are to figure out what to do with the event is already too complicated. The
main drawback of SAX is that the callbacks run into separate method calls,
so you have to do all the state keeping manually through fields of the SAX
handler instance.

My serious advices is: don't waste your time learning SAX. It's simply too
frustrating to debug SAX extraction code into existence. Given how simple
and fast it is to extract data with ElementTree's iterparse() in a memory
efficient way, there is really no reason to write complicated SAX code instead.

Stefan
 
A

Adam Tauno Williams

Alan Meyer, 27.12.2010 21:40:
From my experience, SAX is only practical for very simple cases where
little state is involved when extracting information from the parse events.
A typical example is gathering statistics based on single tags - not a very
common use case. Anything that involves knowing where in the XML tree you
are to figure out what to do with the event is already too complicated.

I've found that using a stack-model makes traversing complex documents
with SAX quite manageable. For example, I parse BPML files with SAX.
If the document is nested and context sensitive then I really don't see
how iterparse differs all that much.
 
T

Tim Harig

On 12/26/2010 3:15 PM, Tim Harig wrote:
...

I agree with you but, as you say, it has become a defacto standard. As
a result, we often need to use it unless there is some strong reason to
use something else.

XML should be used where it makes sense to do so. As always, use the
proper tool for the proper job. XML became such a defacto standard, in
part, because it was abused for many uses in the first place so using it
because it is a defacto standard is just piling more and more mistakes
on top of each other.
The same thing can be said about relational databases. There are
applications for which a hierarchical database makes more sense, is more
efficient, and is easier to understand. But anyone who recommends a
database that is not relational had better be prepared to defend his
choice with some powerful reasoning because his management, his
customers, and the other programmers on his team are probably going to
need a LOT of convincing.

I have no particular problem with using other database models in
theory. In practice, at least until recently, there were few decent
implementations for alternative model databases. That is starting to
change with the advent of the so-called NoSQL databases. There are a few
models that I really do like; but, there are also a lot of failed models.
A large part of the problem was the push towards object databases which
is one of the failed models IMNSHO. Its failure tended to give some of
the other datase models a bad name.
And of course there are many applications where XML really is the best.
It excels at representing complex textual documents while still
allowing programmatic access to individual items of information.

Much agreed. There are many things that XML does very well. It works
great for XMP-RPC style interfaces. I prefer it over binary formats
for documents. It does suitibly for exporting discreet amounts of
information.

There are however a number of things that it does poorly. I don't
condone its use for configuration files. I don't condone its use as a
data store and when you have data approaching gigabytes, that is exaclty
how you are using it.
 
R

Roy Smith

Alan Meyer said:
On 12/26/2010 3:15 PM, Tim Harig wrote:
I agree with you but, as you say, it has become a defacto standard. As
a result, we often need to use it unless there is some strong reason to
use something else.

This is certainly true. In the rarified world of usenet, we can all
bash XML (and I'm certainly front and center of the XML bashing crowd).
In the real world, however, it's a necessary evil. Knowing how to work
with it (at least to some extent) should be in every software engineer's
bag of tricks.
The same thing can be said about relational databases. There are
applications for which a hierarchical database makes more sense, is more
efficient, and is easier to understand. But anyone who recommends a
database that is not relational had better be prepared to defend his
choice with some powerful reasoning because his management, his
customers, and the other programmers on his team are probably going to
need a LOT of convincing.

This is also true. In the old days, they used to say, "Nobody ever got
fired for buying IBM". Relational databases have pretty much gotten to
that point. Suits are comfortable with Oracle and MS SqlServer, and
even MySQL. If you want to go NoSQL, the onus will be on you to
demonstrate that it's the right choice.

Sometimes, even when it is the right choice, it's the wrong choice. You
typically have a limited amount of influence capital to spend, and many
battles to fight. Sometimes it's right to go along with SQL, even if
you know it's wrong from a technology point of view, simply because
taking the easy way out on that battle may let you devote the energy you
need to win more important battles.

And, anyway, when your SQL database becomes the bottleneck, you can
always go back and say, "I told you so". Trust me, if you're ever
involved in an "I told you so" moment, you really want to be on the
transmitting end.
And of course there are many applications where XML really is the best.
It excels at representing complex textual documents while still
allowing programmatic access to individual items of information.

Yup. For stuff like that, there really is no better alternative. To go
back to my earlier example of

<Parental-Advisory>FALSE</Parental-Advisory>

using 432 bits to store 1 bit of information, stuff like that doesn't
happen in marked-up text documents. Most of the file is CDATA (do they
still use that term in XML, or was that an SGML-ism only?). The markup
is a relatively small fraction of the data. I'm happy to pay a factor
of 2 or 3 to get structured text that can be machine processed in useful
ways. I'm not willing to pay a factor of 432 to get tabular data when
there's plenty of other much more reasonable ways to encode it.
 
A

Alan Meyer

On 12/27/2010 4:55 PM, Stefan Behnel wrote:
....
From my experience, SAX is only practical for very simple cases where
little state is involved when extracting information from the parse
events. A typical example is gathering statistics based on single tags -
not a very common use case. Anything that involves knowing where in the
XML tree you are to figure out what to do with the event is already too
complicated. The main drawback of SAX is that the callbacks run into
separate method calls, so you have to do all the state keeping manually
through fields of the SAX handler instance.

My serious advices is: don't waste your time learning SAX. It's simply
too frustrating to debug SAX extraction code into existence. Given how
simple and fast it is to extract data with ElementTree's iterparse() in
a memory efficient way, there is really no reason to write complicated
SAX code instead.

Stefan

I confess that I hadn't been thinking about iterparse(). I presume that
clear() is required with iterparse() if we're going to process files of
arbitrary length.

I should think that this approach provides an intermediate solution.
It's more work than building the full tree in memory because the
programmer has to do some additional housekeeping to call clear() at the
right time and place. But it's less housekeeping than SAX.

I guess I've done enough SAX, in enough different languages, that I
don't find it that onerous to use. When I need an element stack to keep
track of things I can usually re-use code I've written for other
applications. But for a programmer that doesn't do a lot of this stuff,
I agree, the learning curve with lxml will be shorter and the
programming and debugging can be faster.

Alan
 
A

Alan Meyer

... In the old days, they used to say, "Nobody ever got
fired for buying IBM". Relational databases have pretty much gotten to
that point....

That's _exactly_ the comparison I had in mind too.

I once worked for a company that made a pitch to a big potential client
(the BBC) and I made the mistake of telling the client that I didn't
think a relational database was the best for his particular application.
We didn't win that contract and I never made that mistake again!

Alan
 
A

Alan Meyer

By the way Stefan, please don't take any of my comments as complaints.
I use lxml more and more in my work. It's fast, functional and pretty
elegant.

I've written a lot of code on a lot of projects in my 35 year career but
I don't think I've written anything anywhere near as useful to anywhere
near as many people as lxml.

Thank you very much for writing lxml and contributing it to the community.

Alan
 
S

Stefan Behnel

Roy Smith, 28.12.2010 00:21:
To go back to my earlier example of

<Parental-Advisory>FALSE</Parental-Advisory>

using 432 bits to store 1 bit of information, stuff like that doesn't
happen in marked-up text documents. Most of the file is CDATA (do they
still use that term in XML, or was that an SGML-ism only?). The markup
is a relatively small fraction of the data. I'm happy to pay a factor
of 2 or 3 to get structured text that can be machine processed in useful
ways. I'm not willing to pay a factor of 432 to get tabular data when
there's plenty of other much more reasonable ways to encode it.

If the above only appears once in a large document, I don't care how much
space it takes. If it appears all over the place, it will compress down to
a couple of bits, so I don't care about the space, either.

It's readability that counts here. Try to reverse engineer a binary format
that stores the above information in 1 bit.

Stefan
 
S

Stefan Behnel

Alan Meyer, 28.12.2010 03:18:
By the way Stefan, please don't take any of my comments as complaints.

I don't. After all, this discussion is more about the general data format
than the specific tools.

I use lxml more and more in my work. It's fast, functional and pretty elegant.

I've written a lot of code on a lot of projects in my 35 year career but I
don't think I've written anything anywhere near as useful to anywhere near
as many people as lxml.

Thank you very much for writing lxml and contributing it to the community.

Thanks, I'm happy to read that. You're welcome.

Note that lxml also owes a lot to Fredrik Lundh for designing ElementTree
and to Martijn Faassen for starting to reimplement it on top of libxml2
(and choosing the name :).

Stefan
 
S

Stefan Behnel

Alan Meyer, 28.12.2010 01:29:
I confess that I hadn't been thinking about iterparse(). I presume that
clear() is required with iterparse() if we're going to process files of
arbitrary length.

I should think that this approach provides an intermediate solution. It's
more work than building the full tree in memory because the programmer has
to do some additional housekeeping to call clear() at the right time and
place. But it's less housekeeping than SAX.

The iterparse() implementation in lxml.etree allows you to intercept on a
specific tag name, which is especially useful for large XML documents that
are basically an endless sequence of (however deeply structured) top-level
elements - arguably the most common format for gigabyte sized XML files. So
what I usually do here is to intercept on the top level tag name, clear()
that tag after use and leave it dangling around, like this:

for _, element in ET.iterparse(source, tag='toptagname'):
# ... work on the element and its subtree
element.clear()

That allows you to write simple in-memory tree handling code (iteration,
XPath, XSLT, whatever), while pushing the performance up (compared to ET's
iterparse that returns all elements) and keeping the total amount of memory
usage reasonably low. Even a series of several hundred thousand empty top
level tags don't add up to anything that would truly hurt a decent machine.

In many cases where I know that the XML file easily fits into memory
anyway, I don't even do any housekeeping at all. And the true advantage is:
if you ever find that it's needed because the file sizes grow beyond your
initial expectations, you don't have to touch your tested and readily
debugged data extraction code, just add a suitable bit of cleanup code, or
even switch from the initial all-in-memory parse() solution to an
event-driven iterparse()+cleanup solution.

I guess I've done enough SAX, in enough different languages, that I don't
find it that onerous to use. When I need an element stack to keep track of
things I can usually re-use code I've written for other applications. But
for a programmer that doesn't do a lot of this stuff, I agree, the learning
curve with lxml will be shorter and the programming and debugging can be
faster.

I'm aware that SAX has the advantage of being available for more languages.
But if you are in the lucky position to use Python for XML processing, why
not just use the tools that it makes available?

Stefan
 
B

BartC

Stefan Behnel said:
Roy Smith, 28.12.2010 00:21:

If the above only appears once in a large document, I don't care how much
space it takes. If it appears all over the place, it will compress down to
a couple of bits, so I don't care about the space, either.

It's readability that counts here. Try to reverse engineer a binary format
that stores the above information in 1 bit.

The above typically won't get much below 2 bytes (as one character plus a
separator, eg. in comma-delimited-format). So it's more like 27:1, if you're
going to stay with a text format.

Still, that's 27 times as much as it need be. Readability is fine, but why
does the full, expanded, human-readable textual format have to be stored on
disk too, and for every single instance?

What if the 'Parental-Advisory' tag was even longer? Just how long do these
things have to get before even the advocates here admit that it's getting
ridiculous?

Isn't it possible for XML to define a shorter alias for these tags? Isn't
there a shortcut available for </Parental-Advisory> in simple examples like
this (I seem to remember something like this)?

And why not use 1 and 0 for TRUE and FALSE? Even the consumer appliances in
my house have 1 and 0 on their power switches! With the advantage that they
are internationally recognised.
 
R

Roy Smith

"BartC said:
Still, that's 27 times as much as it need be. Readability is fine, but why
does the full, expanded, human-readable textual format have to be stored on
disk too, and for every single instance?

Well, I know the answer to that one. The particular XML feed I'm
working with is a dump from an SQL database. The element names in the
XML are exactly the same as the column names in the SQL database.

The difference being that in the database, the string
"Parental-Advisory" appears in exactly one place, in some schema
metadata table. In the XML, it appears (doubled!) once per row.

It's still obscene. That fact that I understand the cause of the
obscenity doesn't make it any less so.

Another problem with XML is that some people don't use real XML tools to
write their XML files. DTD? What's that? So you end up with tag soup
that the real XML tools can't parse on the other end.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,166
Messages
2,570,901
Members
47,442
Latest member
KevinLocki

Latest Threads

Top