XML parsing per record

Willem Ligtenberg · Apr 16, 2005

I want to parse a very large (2.4 gig) XML file (bioinformatics, of course),
but I have no clue how to do that. Most things I see read the entire XML
file at once. That isn't going to work here, of course.

So I would like to parse an XML file one record at a time and then be able
to store the information in another object.
How should I do that?

Thanks in advance,

Willem Ligtenberg
A total newbie to python by the way.

Irmen de Jong · Apr 16, 2005

Willem said:
I want to parse a very large (2.4 gig) XML file (bioinformatics ofcourse )
But I have no clue how to do that. Most things I see read the entire xml
file at once. That isn't going to work here ofcourse.

So I would like to parse a XML file one record at a time and then be able
to store the information in another object.
How should I do that?

Thanks in advance,

Willem Ligtenberg
A total newbie to python by the way.

Read about SAX parsers.
This may be of help:
http://www.devarticles.com/c/a/XML/Parsing-XML-with-SAX-and-Python/

Out of curiosity, why is the data stored in an XML file?
XML is not known for its efficiency...

--Irmen
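
For readers who have not used SAX before, here is a minimal sketch of the approach using the standard library's xml.sax module. The element name Gene-track_geneid is taken from the sample posted later in this thread; the handler class name and the tiny inline document are made up for illustration.

```python
import io
import xml.sax


class GeneIdHandler(xml.sax.ContentHandler):
    """Collect the text of every <Gene-track_geneid> element (hypothetical handler)."""

    def __init__(self):
        super().__init__()
        self.gene_ids = []
        self._buffer = None  # None means "not inside a geneid element"

    def startElement(self, name, attrs):
        if name == "Gene-track_geneid":
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "Gene-track_geneid" and self._buffer is not None:
            self.gene_ids.append("".join(self._buffer))
            self._buffer = None


# Tiny made-up document; a real run would pass a file name or an open file.
sample = b"""<Entrezgene-Set>
<Entrezgene><Entrezgene_track-info><Gene-track>
<Gene-track_geneid>9996</Gene-track_geneid>
</Gene-track></Entrezgene_track-info></Entrezgene>
<Entrezgene><Entrezgene_track-info><Gene-track>
<Gene-track_geneid>320632</Gene-track_geneid>
</Gene-track></Entrezgene_track-info></Entrezgene>
</Entrezgene-Set>"""

handler = GeneIdHandler()
xml.sax.parse(io.BytesIO(sample), handler)
print(handler.gene_ids)
```

The parser calls the handler as it reads, so only the pieces the handler chooses to keep stay in memory.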

Ivan Voras · Apr 16, 2005

Irmen said:
XML is not known for its efficiency....

<sarcasm> Surely you are blaspheming, sir! XML's the greatest thing
since peanut butter! </sarcasm>

I'm just *waiting* for the day someone finds its use on the rolls of
toilet paper... oh the glorious day...

Kent Johnson · Apr 16, 2005

Willem said:
I want to parse a very large (2.4 gig) XML file (bioinformatics ofcourse )
But I have no clue how to do that. Most things I see read the entire xml
file at once. That isn't going to work here ofcourse.

So I would like to parse a XML file one record at a time and then be able
to store the information in another object.
How should I do that?

You might be interested in this recipe using ElementTree:
http://online.effbot.org/2004_12_01_archive.htm#element-generator

Kent

Fredrik Lundh · Apr 16, 2005

Kent said:
You might be interested in this recipe using ElementTree:
http://online.effbot.org/2004_12_01_archive.htm#element-generator

if you have ElementTree 1.2.5 or later, the "iterparse" function provides a
more efficient implementation of that pattern:

http://effbot.org/zone/element-iterparse.htm

the cElementTree implementation of "iterparse" is a lot faster than SAX; see
the second table under

http://effbot.org/zone/celementtree.htm#benchmarks

for some figures.

</F>

William Park · Apr 17, 2005

Willem Ligtenberg said:
I want to parse a very large (2.4 gig) XML file (bioinformatics
ofcourse ) But I have no clue how to do that. Most things I see read
the entire xml file at once. That isn't going to work here ofcourse.

So I would like to parse a XML file one record at a time and then be
able to store the information in another object. How should I do
that?

Thanks in advance,

Willem Ligtenberg A total newbie to python by the way.

You may want to try Expat (www.libexpat.org) or the Python wrapper for it.
You can feed it a small piece at a time, say by lines or whatever. Of
course, it all depends on what kind of parsing you have in mind.

Care to post more details?
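
As a rough illustration of feeding Expat a small piece at a time, the sketch below uses the standard library's xml.parsers.expat wrapper. The callback names, the state dictionary, and the inline lines are assumptions made for the example; only the element names come from the thread.

```python
import xml.parsers.expat

gene_ids = []
state = {"in_geneid": False, "text": []}


def start_element(name, attrs):
    if name == "Gene-track_geneid":
        state["in_geneid"] = True
        state["text"] = []


def char_data(data):
    # Text can arrive in several pieces, so collect until the end tag.
    if state["in_geneid"]:
        state["text"].append(data)


def end_element(name):
    if name == "Gene-track_geneid":
        gene_ids.append("".join(state["text"]))
        state["in_geneid"] = False


parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.EndElementHandler = end_element
parser.CharacterDataHandler = char_data

# Made-up input; with a real file you would loop over the open file instead.
lines = [
    b"<Entrezgene-Set>\n",
    b"<Entrezgene>\n",
    b"<Gene-track_geneid>9996</Gene-track_geneid>\n",
    b"</Entrezgene>\n",
    b"</Entrezgene-Set>\n",
]
for line in lines:
    parser.Parse(line, False)  # False: more data will follow
parser.Parse(b"", True)        # True: signal the end of the document

print(gene_ids)
```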

Fredrik Lundh · Apr 17, 2005

William said:
You may want to try Expat (www.libexpat.org) or Python wrapper to it.

Python comes with a low-level expat wrapper (pyexpat).

however, if you want performance, cElementTree (which also uses expat) is a
lot faster than pyexpat. (see my other post for links to benchmarks and code).

</F>

Willem Ligtenberg · Apr 20, 2005

William said:
You may want to try Expat (www.libexpat.org) or Python wrapper to it.
You can feed small piece at a time, say by lines or whatever. Of
course, it all depends on what kind of parsing you have in mind.

Care to post more details?

The XML file I need to parse contains information about genes.
So the first element is a gene, and then there are a lot of sub-elements with
sub-elements. I only need some of the information and want to store it in
an object called gene. Later on this information will be printed into a
file, which in its turn will be fed into some other program.
This is an example of the XML:
<?xml version="1.0"?>
<!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" "NCBI_Entrezgene.dtd">
<Entrezgene-Set>
<Entrezgene>
<Entrezgene_track-info>
<Gene-track>
<Gene-track_geneid>9996</Gene-track_geneid>
<Gene-track_status value="secondary">1</Gene-track_status>
<Gene-track_current-id>
<Dbtag>
<Dbtag_db>LocusID</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>320632</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
<Dbtag>
<Dbtag_db>GeneID</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>320632</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
</Gene-track_current-id>
<Gene-track_create-date>
<Date>
<Date_std>
<Date-std>
<Date-std_year>2003</Date-std_year>
<Date-std_month>8</Date-std_month>
<Date-std_day>28</Date-std_day>
<Date-std_hour>21</Date-std_hour>
<Date-std_minute>39</Date-std_minute>
<Date-std_second>0</Date-std_second>
</Date-std>
</Date_std>
</Date>
</Gene-track_create-date>
<Gene-track_update-date>
<Date>
<Date_std>
<Date-std>
<Date-std_year>2005</Date-std_year>
<Date-std_month>2</Date-std_month>
<Date-std_day>17</Date-std_day>
<Date-std_hour>12</Date-std_hour>
<Date-std_minute>54</Date-std_minute>
<Date-std_second>0</Date-std_second>
</Date-std>
</Date_std>
</Date>
</Gene-track_update-date>
</Gene-track>
</Entrezgene_track-info>
<Entrezgene_type value="protein-coding">6</Entrezgene_type>
<Entrezgene_source>
<BioSource>
<BioSource_genome value="genomic">1</BioSource_genome>
<BioSource_origin value="natural">1</BioSource_origin>
<BioSource_org>
<Org-ref>
<Org-ref_taxname>Mus musculus</Org-ref_taxname>
<Org-ref_common>house mouse</Org-ref_common>
<Org-ref_db>
<Dbtag>
<Dbtag_db>taxon</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>10090</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
</Org-ref_db>
<Org-ref_syn>
<Org-ref_syn_E>mouse</Org-ref_syn_E>
</Org-ref_syn>
<Org-ref_orgname>
<OrgName>
<OrgName_name>
<OrgName_name_binomial>
<BinomialOrgName>
<BinomialOrgName_genus>Mus</BinomialOrgName_genus>
<BinomialOrgName_species>musculus</BinomialOrgName_species>
</BinomialOrgName>
</OrgName_name_binomial>
</OrgName_name>
<OrgName_lineage>Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi; Muridae; Murinae; Mus</OrgName_lineage>
<OrgName_gcode>1</OrgName_gcode>
<OrgName_mgcode>2</OrgName_mgcode>
<OrgName_div>ROD</OrgName_div>
</OrgName>
</Org-ref_orgname>
</Org-ref>
</BioSource_org>
</BioSource>
</Entrezgene_source>
<Entrezgene_gene>
<Gene-ref>
</Gene-ref>
</Entrezgene_gene>
<Entrezgene_gene-source>
<Gene-source>
<Gene-source_src>LocusLink</Gene-source_src>
<Gene-source_src-int>9996</Gene-source_src-int>
<Gene-source_src-str2>9996</Gene-source_src-str2>
<Gene-source_gene-display value="false"/>
<Gene-source_locus-display value="false"/>
<Gene-source_extra-terms value="false"/>
</Gene-source>
</Entrezgene_gene-source>
<Entrezgene_locus>
<Gene-commentary>
<Gene-commentary_type value="genomic">1</Gene-commentary_type>
<Gene-commentary_version>0</Gene-commentary_version>
</Gene-commentary>
</Entrezgene_locus>
<Entrezgene_unique-keys>
<Dbtag>
<Dbtag_db>LocusID</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>9996</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
</Entrezgene_unique-keys>
<Entrezgene_xtra-index-terms>
<Entrezgene_xtra-index-terms_E>LOC320632</Entrezgene_xtra-index-terms_E>
</Entrezgene_xtra-index-terms>
</Entrezgene>
</Entrezgene-Set>

Kent Johnson · Apr 20, 2005

Willem said:
The XML file I need to parse contains information about genes.
So the first element is a gene and then there are a lot sub-elements with
sub-elements. I only need some of the informtion and want to store it in
my an object called gene. Lateron this information will be printed into a
file, which in it's turn will be fed into some other program.
This is an example of the XML
<?xml version="1.0"?>
<!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" "NCBI_Entrezgene.dtd">
<Entrezgene-Set>
<Entrezgene>
<snip>
</Entrezgene>
</Entrezgene-Set>

This should get you started with cElementTree:

import cElementTree as ElementTree

source = 'Entrezgene.xml'

for event, elem in ElementTree.iterparse(source):
    if elem.tag == 'Entrezgene':
        # Process the Entrezgene element
        geneid = elem.findtext('Entrezgene_track-info/Gene-track/Gene-track_geneid')
        print 'Gene id', geneid

        # Throw away the element, we're done with it
        elem.clear()

Kent
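
For anyone reading this on a current Python, the same pattern can be sketched with the standard library's xml.etree.ElementTree, which provides iterparse. The function name iter_gene_ids and the inline sample are illustrative assumptions; the path is the one from Kent's snippet.

```python
import io
import xml.etree.ElementTree as ET


def iter_gene_ids(source):
    """Yield one gene id per <Entrezgene> record without loading the whole file."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Entrezgene":
            gene_id = elem.findtext(
                "Entrezgene_track-info/Gene-track/Gene-track_geneid"
            )
            # Drop the record's children now that we have what we need.
            elem.clear()
            yield gene_id


# Made-up two-record document; pass a file name for a real file.
sample = b"""<Entrezgene-Set>
<Entrezgene><Entrezgene_track-info><Gene-track>
<Gene-track_geneid>9996</Gene-track_geneid>
</Gene-track></Entrezgene_track-info></Entrezgene>
<Entrezgene><Entrezgene_track-info><Gene-track>
<Gene-track_geneid>320632</Gene-track_geneid>
</Gene-track></Entrezgene_track-info></Entrezgene>
</Entrezgene-Set>"""

gene_ids = list(iter_gene_ids(io.BytesIO(sample)))
print(gene_ids)
```

Note that the emptied Entrezgene elements stay attached to the root element. For a file with a very large number of records, a common refinement is to keep a reference to the root and clear it as well after each record.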

Willem Ligtenberg · Apr 21, 2005

I'll first try it using SAX, because I want to have as few dependencies
as possible. I already have BioPython as a dependency, and I personally
don't like to install lots of packages for a program to work, so I don't
want to impose that on other people.
But thanks anyway, and I might go for cElementTree later on if
ordinary SAX proves too slow...

Willem Ligtenberg · Apr 21, 2005

Sorry, I just decided that I want to use your solution, but I am wondering:
is cElementTree part of expat, or is that something different?


Simon Brunning · Apr 21, 2005

Sorry I just decided that I want to use your solution, but I am wondering
is cElementTree in expat or is that something different?

Nope, cElementTree is very much its own man. See
<http://effbot.org/zone/celementtree.htm>.

Paul McGuire · Apr 21, 2005

Don't assume that just because you have a 2.4G XML file that you have
2.4G of data. Looking at these verbose tags, plus the fact that the
XML is pretty-printed (all those leading spaces - not even tabs! - add
up), I'm guessing you only have about 5-10% actual data, and the rest
is just XML tagging/untagging and spaces. (For example, 373 characters
used to represent a date/time - this is a sin!)

As XML goes, this looks pretty dead easy to parse without an XML
parser. It looks like all of your leaf nodes open and close on the same
line, which would be easy to extract with regexps or pyparsing.
Especially since you mention "I only need some of the information", you
don't even have to build a full document tree representation. SAX
parsers would also be good, since you could trigger only on the
matching subset of tags that you are really interested in. Lastly, you
could even try a pyparsing approach. I usually don't recommend
pyparsing for XML since there are already many good XML-targeted tools
out there, but it is very easy to throw together something in pyparsing
that extracts, say, all of the <Object-id_id> entries, or all of the
<Gene-source> structures. What is the subset of information you are
looking to extract?

-- Paul
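
A rough sketch of the line-by-line regexp idea follows. It assumes, as Paul does, that each leaf element opens and closes on a single line; the pattern, the set of wanted tags, and the sample lines are illustrative, and character entities such as &amp; are not decoded.

```python
import re

# Matches <tag ...>text</tag> where the text contains no nested markup.
LEAF = re.compile(
    r"<(?P<tag>[\w.-]+)(?:\s[^>]*)?>(?P<text>[^<]*)</(?P=tag)>"
)

# Hypothetical selection of tags to keep.
wanted = {"Gene-track_geneid", "Object-id_id"}

# Made-up lines in the style of the posted sample; a real run would
# iterate over the open file instead.
lines = [
    "<Gene-track>",
    "<Gene-track_geneid>9996</Gene-track_geneid>",
    '<Gene-track_status value="secondary">1</Gene-track_status>',
    "<Object-id_id>320632</Object-id_id>",
    "</Gene-track>",
]

found = []
for line in lines:
    match = LEAF.search(line)
    if match and match.group("tag") in wanted:
        found.append((match.group("tag"), match.group("text")))

print(found)
```

This breaks as soon as a leaf element's text wraps onto a second line, so it is only safe when the file layout is known to be regular.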

William Park · Apr 22, 2005

Willem Ligtenberg said:
The XML file I need to parse contains information about genes.
So the first element is a gene and then there are a lot sub-elements with
sub-elements. I only need some of the informtion and want to store it in
my an object called gene. Lateron this information will be printed into a
file, which in it's turn will be fed into some other program.

You have to help us a little more here. Which info do you want to
extract from your example?

Willem Ligtenberg · Apr 22, 2005

This is all the info I need from the xml file:
ID --> <Gene-track_geneid>320632</Gene-track_geneid>

Name --> <Gene-ref>
<Gene-ref_locus>Pzp</Gene-ref_locus>

Startbase --> <Gene-commentary_seqs>
<Seq-loc>
<Seq-loc_int>
<Seq-interval>
<Seq-interval_from>126957426</Seq-interval_from>
<Seq-interval_to>126989473</Seq-interval_to>
<Seq-interval_strand>
<Na-strand value="plus"/>
</Seq-interval_strand>
<Seq-interval_id>
<Seq-id>
<Seq-id_gi>51860766</Seq-id_gi>
</Seq-id>
</Seq-interval_id>
</Seq-interval>
</Seq-loc_int>
</Seq-loc>
</Gene-commentary_seqs>
Endbase

Function --> <Prot-ref_name>
<Prot-ref_name_E>U5 snRNP-specific protein, 200 kDa</Prot-ref_name_E>
<Prot-ref_name_E>U5 snRNP-specific protein, 200 kDa (DEXH RNA helicase
family)</Prot-ref_name_E>
</Prot-ref_name>

DBLink --> <Gene-ref_locus-tag>MGI:2444401</Gene-ref_locus-tag>
<Gene-commentary_source>
<Other-source>
<Other-source_src>
<Dbtag>
<Dbtag_db>GO</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>5524</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
</Other-source_src>
<Other-source_anchor>ATP binding</Other-source_anchor>
<Other-source_post-text>evidence: ISS</Other-source_post-text>
</Other-source>
</Gene-commentary_source>

Product-type --> <Entrezgene_type value="protein-coding">6</Entrezgene_type>

gene-comment --> <Gene-ref_desc>activating signal cointegrator 1 complex subunit 3-like
1</Gene-ref_desc>

synonym --> <Gene-ref_syn>
<Gene-ref_syn_E>HELIC2</Gene-ref_syn_E>
<Gene-ref_syn_E>KIAA0788</Gene-ref_syn_E>
<Gene-ref_syn_E>U5-200KD</Gene-ref_syn_E>
<Gene-ref_syn_E>U5-200-KD</Gene-ref_syn_E>
<Gene-ref_syn_E>A330064G03Rik</Gene-ref_syn_E>
</Gene-ref_syn>

EC --> <Prot-ref_ec>
<Prot-ref_ec_E>1.5.1.5</Prot-ref_ec_E>
<Prot-ref_ec_E>3.5.4.9</Prot-ref_ec_E>
</Prot-ref_ec>

Chromosome: <SubSource>
<SubSource_subtype value="chromosome">1</SubSource_subtype>
<SubSource_name>6</SubSource_name>
</SubSource>

Some of these can occur more than once in a record.
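
One possible way to gather a few of these fields into an object per record is sketched below, using the standard library's ElementTree. The Gene class, its field names, the helper functions, and the inline sample (which mixes values from the fragments above) are all assumptions for illustration. Because the exact nesting of some elements is not shown in the thread, the sketch searches at any depth below each Entrezgene record.

```python
import io
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Gene:
    """Hypothetical container for a subset of the wanted fields."""
    gene_id: Optional[str] = None
    name: Optional[str] = None
    product_type: Optional[str] = None
    synonyms: List[str] = field(default_factory=list)
    ec_numbers: List[str] = field(default_factory=list)


def texts(elem, tag):
    """Text of every descendant with the given tag (handles repeated fields)."""
    return [e.text for e in elem.iter(tag)]


def parse_genes(source):
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "Entrezgene":
            continue
        type_elem = elem.find("Entrezgene_type")
        gene = Gene(
            gene_id=elem.findtext(".//Gene-track_geneid"),
            name=elem.findtext(".//Gene-ref_locus"),
            product_type=type_elem.get("value") if type_elem is not None else None,
            synonyms=texts(elem, "Gene-ref_syn_E"),
            ec_numbers=texts(elem, "Prot-ref_ec_E"),
        )
        elem.clear()  # free the record before moving on
        yield gene


# Made-up record for demonstration only.
sample = b"""<Entrezgene-Set>
<Entrezgene>
<Entrezgene_track-info><Gene-track>
<Gene-track_geneid>320632</Gene-track_geneid>
</Gene-track></Entrezgene_track-info>
<Entrezgene_type value="protein-coding">6</Entrezgene_type>
<Entrezgene_gene><Gene-ref>
<Gene-ref_locus>Pzp</Gene-ref_locus>
<Gene-ref_syn>
<Gene-ref_syn_E>HELIC2</Gene-ref_syn_E>
<Gene-ref_syn_E>KIAA0788</Gene-ref_syn_E>
</Gene-ref_syn>
</Gene-ref></Entrezgene_gene>
</Entrezgene>
</Entrezgene-Set>"""

genes = list(parse_genes(io.BytesIO(sample)))
print(genes[0])
```

Searching with ".//" returns the first match in document order, so if the same tag appears in several places within a record, a more specific path would be needed.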

Willem Ligtenberg · Apr 22, 2005

I'm trying to write the code using cElementTree, but I've
stumbled across one problem. Sometimes there are multiple values to
retrieve from one record for the same element. Like this:
<Prot-ref_name_E>ATP-binding cassette, subfamily G, member 1</Prot-ref_name_E>
<Prot-ref_name_E>ATP-binding cassette 8</Prot-ref_name_E>

How do you get not only the first, but the rest as well, so that I can
store them in a list?

Thanks in advance,

Willem Ligtenberg

Willem Ligtenberg · Apr 22, 2005

By the way, I know about findall, but when I iterate through it like:
for x in function:
    print 'function', x

I get:
function <Element 'Prot-ref_name_E' at 0xb7d10cf8>
function <Element 'Prot-ref_name_E' at 0xb7d10d10>

But of course I want the information in there...

Fredrik Lundh · Apr 22, 2005

Willem said:
As I'm trying to write the code using cElementTree.
I stumble across one problem. Sometimes there are multiple values to
retrieve from one record for the same element. Like this:
<Prot-ref_name_E>ATP-binding cassette, subfamily G, member 1</Prot-ref_name_E>
<Prot-ref_name_E>ATP-binding cassette 8</Prot-ref_name_E>

How do you get not only the first, but the rest as well, so that I can
store it in a list.

findall returns a list of matching elements. if "elem" is the parent element,
this gives you a list of the text inside all Prot-ref_name_E child elements:

[e.text for e in elem.findall("Prot-ref_name_E")]

(you have read the elementtree documentation, I hope?)

</F>

Fredrik Lundh · Apr 22, 2005

Willem said:
By the way, I know about findall, but when I iterate thruogh it like:
for x in function:
print 'function', x

I get:
function <Element 'Prot-ref_name_E' at 0xb7d10cf8>
function <Element 'Prot-ref_name_E' at 0xb7d10d10>

But ofcourse I want the information in there...

for x in function:
    print 'function', x.text

</F>
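
To make the difference between an element and its text concrete, here is a small self-contained sketch using the standard library's ElementTree. The inline XML is built from the two values Willem quoted; the wrapper element and variable names are assumptions. It also shows that a plain tag name in findall matches direct children only, while a ".//" prefix searches deeper.

```python
import xml.etree.ElementTree as ET

# Parent element holding the repeated children from the question.
prot_ref_name = ET.fromstring(
    "<Prot-ref_name>"
    "<Prot-ref_name_E>ATP-binding cassette, subfamily G, member 1</Prot-ref_name_E>"
    "<Prot-ref_name_E>ATP-binding cassette 8</Prot-ref_name_E>"
    "</Prot-ref_name>"
)

# findall gives Element objects; .text gives the string inside each one.
names = [e.text for e in prot_ref_name.findall("Prot-ref_name_E")]

# Wrap the parent one level deeper to imitate searching from a whole record.
record = ET.Element("Entrezgene")
record.append(prot_ref_name)

direct = record.findall("Prot-ref_name_E")  # no direct children with that tag
deep = [e.text for e in record.findall(".//Prot-ref_name_E")]

print(names)
print(direct)
print(deep)
```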

Willem Ligtenberg · Apr 22, 2005

As you can read in my other post, my problem was with
iterating through the list. I didn't know that you should use e.text; I
only did print e, not print e.text.
I did read the documentation, but I must admit not all of it.

Anyway, thank you very much!

