Trying to parse a HUGE (1 GB) XML file


spaceman-spiff

Hi c.l.p folks

This is a rather long post, but I wanted to include all the details & everything I have tried so far myself, so please bear with me & read the entire boringly long post.

I am trying to parse a ginormous (~1 GB) XML file.


0. I am a Python & XML n00b, & have been relying on the excellent beginner book DIP (Dive Into Python 3) by MP (Mark Pilgrim)... Mark, if you are reading this, you are AWESOME & so is your witty & humorous writing style.


1. Almost all examples of parsing XML in Python that I have seen start off with these 4 lines of code.

import xml.etree.ElementTree as etree
tree = etree.parse('*path_to_ginormous_xml*')
root = tree.getroot() #my huge xml has 1 root at the top level
print root

2. In the 2nd line of code above, as Mark explains in DIP, the parse function builds & returns a tree object, in memory (RAM), which represents the entire document.
I tried this code, which works fine for a small (~1 MB) file, but when I run this simple 4-line .py code in a terminal for my HUGE target file (1 GB), nothing happens.
In a separate terminal, I run the top command, & I can see a python process with memory (the VIRT column) increasing from 100 MB all the way up to 2100 MB.

I am guessing that, as this happens (over the course of 20-30 mins), the tree representing the document is being slowly built in memory, but even after 30-40 mins nothing happens.
I don't get an error, segfault or out-of-memory exception.

My hardware setup: I have a Win7 Pro box with 8 GB of RAM & an Intel Core 2 Quad Q9400.
On this I am running Sun VirtualBox (3.2.12), with Ubuntu 10.10 as the guest OS, with 23 GB of disk space & 2 GB (2048 MB) of RAM assigned to the guest Ubuntu OS.

3. I also tried using lxml, but an lxml tree is much more expensive, as it retains more info about a node's context, including references to its parent.
[http://www.ibm.com/developerworks/xml/library/x-hiperfparse/]

When I ran the same 4-line code above, but with lxml's ElementTree (using the import below in line 1 of the code above)
import lxml.etree as lxml_etree

I can see the memory consumption of the python process (which is running the code) shoot up to ~2700 MB & then python (or the OS?) kills the process as it nears the total system memory (2 GB).

I ran the code from one terminal window (screenshot: http://imgur.com/ozLkB.png)
& ran top from another terminal (http://imgur.com/HAoHA.png).

4. I then investigated some streaming libraries, but am confused: there is SAX [http://en.wikipedia.org/wiki/Simple_API_for_XML] & the iterparse interface [http://effbot.org/zone/element-iterparse.htm].

Which one is best for my situation?

Any & all code snippets/wisdom/thoughts/ideas/suggestions/feedback/comments of the c.l.p community would be greatly appreciated.
Please feel free to email me directly too.

thanks a ton

cheers
ashish

email :
ashish.makani
domain:gmail.com

p.s.
Other useful links on xml parsing in python
0. http://diveintopython3.org/xml.html
1. http://stackoverflow.com/questions/1513592/python-is-there-an-xml-parser-implemented-as-a-generator
2. http://codespeak.net/lxml/tutorial.html
3. https://groups.google.com/forum/?hl...+huge+xml#!topic/comp.lang.python/CMgToEnjZBk
4. http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
5. http://effbot.org/zone/element-index.htm
http://effbot.org/zone/element-iterparse.htm
6. SAX : http://en.wikipedia.org/wiki/Simple_API_for_XML
 

Adam Tauno Williams

Hi c.l.p folks
This is a rather long post, but I wanted to include all the details &
everything I have tried so far myself, so please bear with me & read
the entire boringly long post.
I am trying to parse a ginormous (~1 GB) XML file.

Do that hundreds of times a day.
0. I am a Python & XML n00b, & have been relying on the excellent
beginner book DIP (Dive Into Python 3) by MP (Mark Pilgrim)... Mark, if
you are reading this, you are AWESOME & so is your witty & humorous
writing style.
1. Almost all examples of parsing XML in Python that I have seen start off with these 4 lines of code.
import xml.etree.ElementTree as etree
tree = etree.parse('*path_to_ginormous_xml*')
root = tree.getroot() #my huge xml has 1 root at the top level
print root

Yes, this is a terrible technique; most examples are crap.
2. In the 2nd line of code above, as Mark explains in DIP, the parse
function builds & returns a tree object, in memory (RAM), which
represents the entire document.
I tried this code, which works fine for a small (~1 MB) file, but when I
run this simple 4-line .py code in a terminal for my HUGE target file
(1 GB), nothing happens.
In a separate terminal, I run the top command, & I can see a python
process with memory (the VIRT column) increasing from 100 MB all the
way up to 2100 MB.

Yes, this is using DOM. DOM is evil and the enemy, full-stop.
I am guessing that, as this happens (over the course of 20-30 mins), the
tree representing the document is being slowly built in memory, but even
after 30-40 mins nothing happens.
I don't get an error, segfault or out-of-memory exception.

You need to process the document as a stream of elements; aka SAX.
3. I also tried using lxml, but an lxml tree is much more expensive,
as it retains more info about a node's context, including references
to its parent.
[http://www.ibm.com/developerworks/xml/library/x-hiperfparse/]
When I ran the same 4-line code above, but with lxml's ElementTree
(using the import below in line 1 of the code above)
import lxml.etree as lxml_etree

You're still using DOM; DOM is evil.
Which one is best for my situation?
Any & all
code snippets/wisdom/thoughts/ideas/suggestions/feedback/comments of
the c.l.p community would be greatly appreciated.
Please feel free to email me directly too.

<http://docs.python.org/library/xml.sax.html>

<http://coils.hg.sourceforge.net/hgw...5a211fda/src/coils/foundation/standard_xml.py>
 

Tim Harig

[Wrapped to meet RFC1855 Netiquette Guidelines]
This is a rather long post, but I wanted to include all the details &
everything I have tried so far myself, so please bear with me & read
the entire boringly long post.

I am trying to parse a ginormous (~1 GB) XML file. [SNIP]
4. I then investigated some streaming libraries, but am confused: there
is SAX [http://en.wikipedia.org/wiki/Simple_API_for_XML] & the iterparse
interface [http://effbot.org/zone/element-iterparse.htm]

I have made extensive use of SAX and it will certainly work for low
memory parsing of XML. I have never used "iterparse"; so, I cannot make
an informed comparison between them.
Which one is best for my situation?

Your post was long, but it failed to tell us the most important piece
of information: what does your data look like and what are you trying
to do with it?

SAX is a low-level API that provides a callback interface allowing you to
process various elements as they are encountered. You can therefore
do anything you want with the information as you encounter it, including
outputting and discarding small chunks as you process them; ignoring
most of the data and saving only what you want to in-memory data structures;
or saving all of it to a more random-access database or on-disk data
structure that you can load and process as required.
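
As a concrete illustration of that callback model, here is a minimal sketch using the stdlib's xml.sax. The tag name "record" and the file name are placeholders (the OP has not told us the real element name), and for simplicity it only collects each element's character data; re-serialising complete sub-documents with nested markup takes a bit more bookkeeping.

import xml.sax

class RecordHandler(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.in_record = False   # currently inside a <record> element?
        self.chunks = []         # character data for the current record
        self.records = []        # only what we decided to keep

    def startElement(self, name, attrs):
        if name == 'record':                 # placeholder tag name
            self.in_record = True
            self.chunks = []

    def characters(self, data):
        if self.in_record:
            self.chunks.append(data)

    def endElement(self, name):
        if name == 'record':
            self.in_record = False
            # keep (or write out) just this record; everything else is discarded
            self.records.append(''.join(self.chunks))

handler = RecordHandler()
xml.sax.parse('huge.xml', handler)           # streams the file; memory stays flat
print(len(handler.records))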

What you need to do will depend on what you are actually trying to
accomplish. Without knowing that, I can only affirm that SAX will work
for your needs without providing any information about how you should
be using it.
 

Terry Reedy

Yes, this is a terrible technique; most examples are crap.
Yes, this is using DOM. DOM is evil and the enemy, full-stop.
You're still using DOM; DOM is evil.

For serial processing, DOM is superfluous superstructure.
For random access processing, some might disagree.

For Python (unlike Java), wrapping module functions as class static
methods is superfluous superstructure that only slows things down.

raise Exception(...) # should be something specific like
raise ValueError(...)
 

Stefan Behnel

Adam Tauno Williams, 20.12.2010 20:49:
Do that hundreds of times a day.

Try

import xml.etree.cElementTree as etree

instead. Note the leading "c", which hints at the C implementation of
ElementTree. It's much faster and much more memory friendly than the Python
implementation.

Yes, this is a terrible technique; most examples are crap.


Yes, this is using DOM. DOM is evil and the enemy, full-stop.

Actually, ElementTree is not "DOM", it's modelled after the XML Infoset.
While I agree that DOM is, well, maybe not "the enemy", but not exactly
beautiful either, ElementTree is really a good thing, likely also in this case.

You need to process the document as a stream of elements; aka SAX.

IMHO, this is the worst advice you can give.

Stefan
 

Stefan Behnel

spaceman-spiff, 20.12.2010 21:29:
I am sorry I left out what exactly I am trying to do.

0. Goal: I am looking for a specific element... there are several tens/hundreds of occurrences of that element in the 1 GB XML file.
The contents of the XML are just a dump of config parameters from a packet switch (although IMHO the contents of the XML don't matter).

I need to detect them & then, for each one, I need to copy all the content between the element's start & end tags & create a smaller XML file.

Then cElementTree's iterparse() is your friend. It allows you to basically
iterate over the XML tags while it's building an in-memory tree from them.
That way, you can either remove subtrees from the tree if you don't need
them (to save memory) or otherwise handle them in any way you like, such as
serialising them into a new file (and then deleting them).

Also note that the iterparse implementation in lxml.etree allows you to
specify a tag name to restrict the iterator to these tags. That's usually a
lot faster, but it also means that you need to take more care to clean up
the parts of the tree that the iterator stepped over. Depending on your
requirements and the amount of manual code optimisation that you want to
invest, either cElementTree or lxml.etree may perform better for you.
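
To make that concrete, here is a minimal sketch of the iterparse recipe with cElementTree. The file name and the tag name "record" are placeholders for whatever the OP's switch dump actually uses; with lxml.etree, the tag filter Stefan mentions can replace the manual tag check.

import xml.etree.cElementTree as etree

context = iter(etree.iterparse('huge.xml', events=('start', 'end')))
event, root = next(context)              # grab the root element up front

count = 0
for event, elem in context:
    if event == 'end' and elem.tag == 'record':   # placeholder tag name
        count += 1
        # serialise just this subtree into its own small XML file
        etree.ElementTree(elem).write('record_%06d.xml' % count)
        # prune everything parsed so far so memory stays roughly constant
        root.clear()

Clearing the root after each match is what keeps the in-memory tree from growing to the size of the whole document.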

It seems that you already found the article by Liza Daly about high
performance XML processing with Python. Give it another read, it has a
couple of good hints and examples that will help you here.

Stefan
 

Stefan Sonnenberg-Carstens

On 20.12.2010 20:34, spaceman-spiff wrote:
[original post quoted in full; SNIP]
Normally (what is normal, anyway?) such files are auto-generated,
and have an apparent similarity with a database query
result, encapsulated in XML.
Most of the time the structure is the same for every "row" that's in there.
So, a very unpythonic but fast way would be to let awk pick out the
records and write them in CSV format to stdout,
then pipe that to your Python cruncher of choice and let it do the hard
work.
The awk part can be done in Python anyway, so you could skip that; see the sketch below.
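
For what it's worth, a rough sketch of that pipeline's Python side, under the (big) assumption that every record sits on one line with a fixed shape; the element and attribute names are made up for illustration, and the next reply explains why this assumption is fragile for real XML.

import csv
import re
import sys

# hypothetical one-line record shape: <record id="..." value="..."/>
RECORD = re.compile(r'<record\s+id="([^"]*)"\s+value="([^"]*)"')

writer = csv.writer(sys.stdout)
for line in sys.stdin:
    m = RECORD.search(line)
    if m:
        writer.writerow(m.groups())

That would be run as something like: python extract.py < huge.xml | python cruncher.py (both script names are placeholders).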

And take a look at xmlsh.org; they offer tools for the command line,
like xml2csv. (Needs Java, btw.)

Cheers
 

Nobody

Normally (what is normal, anyway?) such files are auto-generated,
and have an apparent similarity with a database query
result, encapsulated in XML.
Most of the time the structure is the same for every "row" that's in there.
So, a very unpythonic but fast way would be to let awk pick out the
records and write them in CSV format to stdout.

awk works well if the input is formatted such that each line is a record;
it's not so good otherwise. XML isn't a line-oriented format; in
particular, there are many places where both newlines and spaces are just
whitespace. A number of XML generators will "word wrap" the resulting XML
to make it more human readable, so line-oriented tools aren't a good idea.
 

Stefan Sonnenberg-Carstens

On 23.12.2010 21:27, Nobody wrote:
awk works well if the input is formatted such that each line is a record;
it's not so good otherwise. XML isn't a line-oriented format; in
particular, there are many places where both newlines and spaces are just
whitespace. A number of XML generators will "word wrap" the resulting XML
to make it more human readable, so line-oriented tools aren't a good idea.

You shouldn't tell it to awk. I never had the opportunity of seeing awk fail on this task :)

For large datasets I always have huge question marks when someone says "XML".
But I don't want to start a flame war.
 

Steve Holden

For large datasets I always have huge question marks when someone says "XML".
But I don't want to start a flame war.

I agree people abuse the "spirit of XML" using it to transfer gigabytes
of data, but what else are they to use?

regards
Steve
 

Stefan Behnel

Steve Holden, 25.12.2010 16:55:
I agree people abuse the "spirit of XML" using it to transfer gigabytes
of data

I keep reading people say that (and *much* worse). XML may not be the
most tightly tailored solution for data of that size, but it's not inherently
wrong to store gigabytes of data in XML. I mean, XML is a reasonably fast,
versatile, widely used, well-compressing and safe data format with an
extremely ubiquitous and well optimised set of tools available for all
sorts of environments. So as soon as the data is at all complex or the
environments require portable data exchange, I consider XML a reasonable
choice, even for large data sets (which usually implies that the data is
machine-generated output anyway).

Stefan
 

Adam Tauno Williams

Steve Holden said:
I agree people abuse the "spirit of XML" using it to transfer gigabytes
of data,

How so? I think this assertion is bogus. XML works extremely well for large datasets.
but what else are they to use?

If you are sending me data, please use XML. I've gotten 22 GB XML files in the past; they worked without issue and pretty quickly too.

Sure beats trying to figure out whatever goofy document format someone cooks up on their own. XML toolkits are proven and documented.
 

Tim Harig

I would agree; but you don't always have a choice about the data format
you have to work with. You just have to do the best you can with what
they give you.
I agree people abuse the "spirit of XML" using it to transfer gigabytes
of data, but what else are they to use?

Something with an index so that you don't have to parse the entire file
would be nice. SQLite comes to mind. It is not standardized; but, the
implementation is free with bindings for most languages.
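
A minimal sketch of that idea: park the extracted records in an indexed SQLite table once, then query them at random later. The records() generator here is a stand-in for whatever streaming extraction is used (SAX or iterparse, as sketched earlier in the thread), and the table and column names are made up for illustration.

import sqlite3

def records():
    # stand-in: yield (record_id, xml_text) pairs from a streaming parser
    yield ('example-1', '<record id="example-1">...</record>')

conn = sqlite3.connect('records.db')
conn.execute('CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, xml TEXT)')
conn.executemany('INSERT OR REPLACE INTO records VALUES (?, ?)', records())
conn.commit()

# later lookups hit the primary-key index instead of re-parsing the 1 GB file
row = conn.execute('SELECT xml FROM records WHERE id = ?', ('example-1',)).fetchone()
print(row[0])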
 

Roy Smith

Adam Tauno Williams said:
XML works extremely well for large datasets.

Barf. I'll agree that there are some nice points to XML. It is
portable. It is (to a certain extent) human readable, and in a pinch
you can use standard text tools to do ad-hoc queries (i.e. grep for a
particular entry). And, yes, there are plenty of toolsets for dealing
with XML files.

On the other hand, the verbosity is unbelievable. I'm currently working
with a data feed we get from a supplier in XML. Every day we get
incremental updates of about 10-50 MB each. The total data set at this
point is 61 GB. It's got stuff like this in it:

<Parental-Advisory>FALSE</Parental-Advisory>

That's 54 bytes to store a single bit of information. I'm all for
human-readable formats, but bloating the data by a factor of 432 is
rather excessive. Of course, that's an extreme example. A more
efficient example would be:

<Id>1173722</Id>

which is 26 bytes to store an integer. That's only a bloat factor of
6-1/2.

Of course, one advantage of XML is that with so much redundant text, it
compresses well. We typically see gzip compression ratios of 20:1.
But, that just means you can archive them efficiently; you can't do
anything useful until you unzip them.
 

Stefan Sonnenberg-Carstens

On 25.12.2010 20:41, Roy Smith wrote:
[Roy's post quoted in full; SNIP]
Sending complete SQLite databases is absolutely perfect.
For example, Fedora uses (used?) this for their yum catalog updates.
Download it to the right place, point your tool at it, ready.
 

Nobody

One advantage it has over many legacy formats is that there are no
inherent 2^31/2^32 limitations. Many binary formats inherently cannot
support files larger than 2 GiB or 4 GiB due to the use of 32-bit offsets in
indices.
Of course, one advantage of XML is that with so much redundant text, it
compresses well. We typically see gzip compression ratios of 20:1.
But, that just means you can archive them efficiently; you can't do
anything useful until you unzip them.

XML is typically processed sequentially, so you don't need to create a
decompressed copy of the file before you start processing it.
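
For instance (an editorial sketch; the .gz file name is made up), the streaming parsers discussed earlier in the thread will happily read straight from a gzip stream, so the compressed archive never needs to be expanded on disk:

import gzip
import xml.etree.cElementTree as etree

# iterate over elements straight out of the compressed file
for event, elem in etree.iterparse(gzip.open('huge.xml.gz')):
    elem.clear()   # handle each element as needed, then discard it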

If file size is that much of an issue, eventually we'll see a standard for
compressing XML. This could easily result in smaller files than using a
dedicated format compressed with general-purpose compression algorithms,
as a widely-used format such as XML merits more effort than any
application-specific format.
 

Adam Tauno Williams

One advantage it has over many legacy formats is that there are no
inherent 2^31/2^32 limitations. Many binary formats inherently cannot
support files larger than 2 GiB or 4 GiB due to the use of 32-bit offsets in
indices.

And what legacy format has support for code pages, namespaces, schema
verification, or comments? None.
XML is typically processed sequentially, so you don't need to create a
decompressed copy of the file before you start processing it.
Yep.

If file size is that much of an issue,

Which it isn't.
eventually we'll see a standard for
compressing XML. This could easily result in smaller files than using a
dedicated format compressed with general-purpose compression algorithms,
as a widely-used format such as XML merits more effort than any
application-specific format.

Agree; and there actually already is a standard compression scheme -
HTTP compression [supported by every modern web-server]; so the data is
compressed at the only point where it matters [during transfer].

Again: "XML works extremely well for large datasets".
 

BartC

Adam Tauno Williams said:
And what legacy format has support for code pages, namespaces, schema
verification, or comments? None.


Which it isn't.

Only if you're prepared to squander resources that could be put to better
use.

XML is so redundant, anyone (even me :) could probably spend an afternoon
coming up with a compression scheme to reduce it to a fraction of its size.

It can even be a custom format, provided you also send along the few dozen
lines of Python (or whatever language) needed to decompress it. Although if
it's done properly, it might be possible to create an XML library that works
directly on the compressed format, acting as a plug-in replacement for a
conventional library.

That will likely save time and memory.

Anyway there seem to be existing schemes for binary XML, indicating some
people do think it is an issue.

I'm just concerned at the waste of computer power (I used to think HTML was
bad, for example repeating the same long-winded font name hundreds of times
over in the same document. And PDF: years ago I was sent a 1MB document for
a modem; perhaps some substantial user manual for it? No, just a simple
diagram showing how to plug it into the phone socket!).
 

Tim Harig

One advantage it has over many legacy formats is that there are no
inherent 2^31/2^32 limitations. Many binary formats inherently cannot
support files larger than 2 GiB or 4 GiB due to the use of 32-bit offsets in
indices.

That is probably true of many older and binary formats; but, XML
is certainly not the only format that supports arbitrary size.
It certainly doesn't prohibit another format with better handling of
large data sets from being developed. XML's primary benefit is its
ubiquity. While it is an excellent format for a number of uses, I don't
accept ubiquity as the only or preeminent metric when choosing a data
format.
XML is typically processed sequentially, so you don't need to create a
decompressed copy of the file before you start processing it.

Sometimes XML is processed sequentially. When the markup footprint is
large enough it must be. Quite often, as in the case of the OP, you only
want to extract a small piece out of the total data. In those cases, being
forced to read all of the data sequentially is both inconvenient and a
performance penalty unless there is some way to address the data you want
directly.
 

Tim Harig


Sometimes that is true and sometimes it isn't. There are many situations
where you want to access the data nonsequentially or address just a small
subset of it. Just because you never want to access data randomly doesn't
mean others might not. Certainly the OP would be happier using something
like XPath to get just the piece of data that he is looking for.
 
