XML parsing per record

Willem Ligtenberg

I want to parse a very large (2.4 gig) XML file (bioinformatics, of course :)),
but I have no clue how to do that. Most things I see read the entire XML
file at once. That isn't going to work here, of course.

So I would like to parse an XML file one record at a time and then be able
to store the information in another object.
How should I do that?

Thanks in advance,

Willem Ligtenberg
A total newbie to python by the way.
 
Irmen de Jong

Willem said:
I want to parse a very large (2.4 gig) XML file (bioinformatics, of course :)),
but I have no clue how to do that. Most things I see read the entire XML
file at once. That isn't going to work here, of course.

So I would like to parse an XML file one record at a time and then be able
to store the information in another object.
How should I do that?

Thanks in advance,

Willem Ligtenberg
A total newbie to python by the way.


Read about SAX parsers.
This may be of help:
http://www.devarticles.com/c/a/XML/Parsing-XML-with-SAX-and-Python/

Out of curiosity, why is the data stored in an XML file?
XML is not known for its efficiency....

--Irmen
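
For readers following the SAX suggestion, here is a minimal sketch using the standard-library xml.sax module in current Python 3 syntax. The handler class name and the choice of Gene-track_geneid as the tag of interest are assumptions made for illustration, and the sample data is a cut-down stand-in for the real file. The handler keeps only the text it cares about, so memory use does not grow with file size.

```python
import io
import xml.sax


class GeneIdHandler(xml.sax.ContentHandler):
    """Collect the text of every Gene-track_geneid element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.gene_ids = []
        self._buffer = None  # None means "not inside a tag of interest"

    def startElement(self, name, attrs):
        if name == "Gene-track_geneid":
            self._buffer = []

    def characters(self, content):
        # SAX may deliver the text of one element in several chunks
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "Gene-track_geneid" and self._buffer is not None:
            self.gene_ids.append("".join(self._buffer))
            self._buffer = None


# Tiny stand-in for the real 2.4 gig file
SAMPLE = b"""<Entrezgene-Set>
<Entrezgene><Entrezgene_track-info><Gene-track>
<Gene-track_geneid>9996</Gene-track_geneid>
</Gene-track></Entrezgene_track-info></Entrezgene>
</Entrezgene-Set>"""

handler = GeneIdHandler()
xml.sax.parse(io.BytesIO(SAMPLE), handler)
print(handler.gene_ids)
```

For the real file, a filename can be passed to xml.sax.parse in place of the BytesIO object.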
 
Ivan Voras

Irmen said:
XML is not known for its efficiency....

<sarcasm> Surely you are blaspheming, sir! XML's the greatest thing
since peanut butter! </sarcasm>

I'm just *waiting* for the day someone finds its use on the rolls of
toilet paper... oh the glorious day...
 
Kent Johnson

Willem said:
I want to parse a very large (2.4 gig) XML file (bioinformatics, of course :)),
but I have no clue how to do that. Most things I see read the entire XML
file at once. That isn't going to work here, of course.

So I would like to parse an XML file one record at a time and then be able
to store the information in another object.
How should I do that?

You might be interested in this recipe using ElementTree:
http://online.effbot.org/2004_12_01_archive.htm#element-generator

Kent
 
Fredrik Lundh

William Park

Willem Ligtenberg said:
I want to parse a very large (2.4 gig) XML file (bioinformatics,
of course :)), but I have no clue how to do that. Most things I see read
the entire XML file at once. That isn't going to work here, of course.

So I would like to parse an XML file one record at a time and then be
able to store the information in another object. How should I do
that?

Thanks in advance,

Willem Ligtenberg A total newbie to python by the way.

You may want to try Expat (www.libexpat.org) or the Python wrapper for it.
You can feed it a small piece at a time, say by lines or whatever. Of
course, it all depends on what kind of parsing you have in mind. :)

Care to post more details?
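
The "feed a small piece at a time" idea can be sketched with the standard-library wrapper, xml.parsers.expat, in current Python 3 syntax. This is only an illustration: the sample data is a simplified stand-in for the real record layout, the callback names are made up, and the chunk size is deliberately tiny to show that text may arrive split across callbacks.

```python
import xml.parsers.expat

# Simplified stand-in data; the real records nest the id much deeper
SAMPLE = (
    b"<Entrezgene-Set>"
    b"<Entrezgene><Gene-track_geneid>9996</Gene-track_geneid></Entrezgene>"
    b"<Entrezgene><Gene-track_geneid>320632</Gene-track_geneid></Entrezgene>"
    b"</Entrezgene-Set>"
)

gene_ids = []
state = {"inside": False, "parts": []}


def on_start(name, attrs):
    if name == "Gene-track_geneid":
        state["inside"] = True
        state["parts"] = []


def on_end(name):
    if name == "Gene-track_geneid":
        gene_ids.append("".join(state["parts"]))
        state["inside"] = False


def on_text(data):
    # Text can be delivered in pieces, especially with small chunks
    if state["inside"]:
        state["parts"].append(data)


parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = on_start
parser.EndElementHandler = on_end
parser.CharacterDataHandler = on_text

CHUNK = 16  # deliberately tiny; a real program would read far larger blocks
for i in range(0, len(SAMPLE), CHUNK):
    parser.Parse(SAMPLE[i:i + CHUNK], False)
parser.Parse(b"", True)  # signal end of input

print(gene_ids)
```

With a real file, the loop would read blocks from an open binary file object instead of slicing a bytes literal.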
 
Fredrik Lundh

William said:
You may want to try Expat (www.libexpat.org) or the Python wrapper for it.

Python comes with a low-level expat wrapper (pyexpat).

however, if you want performance, cElementTree (which also uses expat) is a
lot faster than pyexpat. (see my other post for links to benchmarks and code).

</F>
 
Willem Ligtenberg

William said:
You may want to try Expat (www.libexpat.org) or the Python wrapper for it.
You can feed it a small piece at a time, say by lines or whatever. Of
course, it all depends on what kind of parsing you have in mind. :)

Care to post more details?

The XML file I need to parse contains information about genes.
So the first element is a gene and then there are a lot of sub-elements with
sub-elements. I only need some of the information and want to store it in
an object called gene. Later on this information will be printed into a
file, which in its turn will be fed into some other program.
This is an example of the XML:
<?xml version="1.0"?>
<!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" "NCBI_Entrezgene.dtd">
<Entrezgene-Set>
<Entrezgene>
<Entrezgene_track-info>
<Gene-track>
<Gene-track_geneid>9996</Gene-track_geneid>
<Gene-track_status value="secondary">1</Gene-track_status>
<Gene-track_current-id>
<Dbtag>
<Dbtag_db>LocusID</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>320632</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
<Dbtag>
<Dbtag_db>GeneID</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>320632</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
</Gene-track_current-id>
<Gene-track_create-date>
<Date>
<Date_std>
<Date-std>
<Date-std_year>2003</Date-std_year>
<Date-std_month>8</Date-std_month>
<Date-std_day>28</Date-std_day>
<Date-std_hour>21</Date-std_hour>
<Date-std_minute>39</Date-std_minute>
<Date-std_second>0</Date-std_second>
</Date-std>
</Date_std>
</Date>
</Gene-track_create-date>
<Gene-track_update-date>
<Date>
<Date_std>
<Date-std>
<Date-std_year>2005</Date-std_year>
<Date-std_month>2</Date-std_month>
<Date-std_day>17</Date-std_day>
<Date-std_hour>12</Date-std_hour>
<Date-std_minute>54</Date-std_minute>
<Date-std_second>0</Date-std_second>
</Date-std>
</Date_std>
</Date>
</Gene-track_update-date>
</Gene-track>
</Entrezgene_track-info>
<Entrezgene_type value="protein-coding">6</Entrezgene_type>
<Entrezgene_source>
<BioSource>
<BioSource_genome value="genomic">1</BioSource_genome>
<BioSource_origin value="natural">1</BioSource_origin>
<BioSource_org>
<Org-ref>
<Org-ref_taxname>Mus musculus</Org-ref_taxname>
<Org-ref_common>house mouse</Org-ref_common>
<Org-ref_db>
<Dbtag>
<Dbtag_db>taxon</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>10090</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
</Org-ref_db>
<Org-ref_syn>
<Org-ref_syn_E>mouse</Org-ref_syn_E>
</Org-ref_syn>
<Org-ref_orgname>
<OrgName>
<OrgName_name>
<OrgName_name_binomial>
<BinomialOrgName>
<BinomialOrgName_genus>Mus</BinomialOrgName_genus>
<BinomialOrgName_species>musculus</BinomialOrgName_species>
</BinomialOrgName>
</OrgName_name_binomial>
</OrgName_name>
<OrgName_lineage>Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi; Muridae; Murinae; Mus</OrgName_lineage>
<OrgName_gcode>1</OrgName_gcode>
<OrgName_mgcode>2</OrgName_mgcode>
<OrgName_div>ROD</OrgName_div>
</OrgName>
</Org-ref_orgname>
</Org-ref>
</BioSource_org>
</BioSource>
</Entrezgene_source>
<Entrezgene_gene>
<Gene-ref>
</Gene-ref>
</Entrezgene_gene>
<Entrezgene_gene-source>
<Gene-source>
<Gene-source_src>LocusLink</Gene-source_src>
<Gene-source_src-int>9996</Gene-source_src-int>
<Gene-source_src-str2>9996</Gene-source_src-str2>
<Gene-source_gene-display value="false"/>
<Gene-source_locus-display value="false"/>
<Gene-source_extra-terms value="false"/>
</Gene-source>
</Entrezgene_gene-source>
<Entrezgene_locus>
<Gene-commentary>
<Gene-commentary_type value="genomic">1</Gene-commentary_type>
<Gene-commentary_version>0</Gene-commentary_version>
</Gene-commentary>
</Entrezgene_locus>
<Entrezgene_unique-keys>
<Dbtag>
<Dbtag_db>LocusID</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>9996</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
</Entrezgene_unique-keys>
<Entrezgene_xtra-index-terms>
<Entrezgene_xtra-index-terms_E>LOC320632</Entrezgene_xtra-index-terms_E>
</Entrezgene_xtra-index-terms>
</Entrezgene>
</Entrezgene-Set>
 
Kent Johnson

Willem said:
The XML file I need to parse contains information about genes.
So the first element is a gene and then there are a lot of sub-elements with
sub-elements. I only need some of the information and want to store it in
an object called gene. Later on this information will be printed into a
file, which in its turn will be fed into some other program.
This is an example of the XML:
<?xml version="1.0"?>
<!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" "NCBI_Entrezgene.dtd">
<Entrezgene-Set>
<Entrezgene>
<snip>
</Entrezgene>
</Entrezgene-Set>

This should get you started with cElementTree:

import cElementTree as ElementTree

source = 'Entrezgene.xml'

for event, elem in ElementTree.iterparse(source):
    if elem.tag == 'Entrezgene':
        # Process the Entrezgene element
        geneid = elem.findtext('Entrezgene_track-info/Gene-track/Gene-track_geneid')
        print 'Gene id', geneid

        # Throw away the element, we're done with it
        elem.clear()

Kent
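
On current Python versions the same iterparse API ships in the standard library as xml.etree.ElementTree, so no separate cElementTree install is needed. The following is a sketch of the same record-at-a-time loop in Python 3 syntax; the generator name and the sample data are illustrative assumptions. It also clears the root element after each record, since cleared records otherwise stay attached to the root as empty elements, which adds up over a multi-gigabyte file.

```python
import io
import xml.etree.ElementTree as ET


def iter_gene_ids(source):
    """Yield the gene id of each Entrezgene record, one record at a time."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "Entrezgene":
            yield elem.findtext(
                "Entrezgene_track-info/Gene-track/Gene-track_geneid"
            )
            # Drop all finished records from the root to keep memory flat
            root.clear()


# Cut-down stand-in for the real file
SAMPLE = b"""<Entrezgene-Set>
<Entrezgene><Entrezgene_track-info><Gene-track>
<Gene-track_geneid>9996</Gene-track_geneid>
</Gene-track></Entrezgene_track-info></Entrezgene>
<Entrezgene><Entrezgene_track-info><Gene-track>
<Gene-track_geneid>320632</Gene-track_geneid>
</Gene-track></Entrezgene_track-info></Entrezgene>
</Entrezgene-Set>"""

ids = list(iter_gene_ids(io.BytesIO(SAMPLE)))
print(ids)
```

A filename such as 'Entrezgene.xml' can be passed to iter_gene_ids in place of the BytesIO object.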
 
Willem Ligtenberg

I'll first try it using SAX, because I want to have as few dependencies
as possible. I already have BioPython as a dependency, and I personally
don't like to install lots of packages for a program to work. So I don't
want to impose that on other people.
But thanks anyway, and I might go for cElementTree later on, if
ordinary SAX proves too slow...
 
Willem Ligtenberg

Sorry, I just decided that I want to use your solution, but I am wondering:
is cElementTree part of expat, or is that something different?

 
Paul McGuire

Don't assume that just because you have a 2.4G XML file that you have
2.4G of data. Looking at these verbose tags, plus the fact that the
XML is pretty-printed (all those leading spaces - not even tabs! - add
up), I'm guessing you only have about 5-10% actual data, and the rest
is just XML tagging/untagging and spaces. (For example, 373 characters
used to represent a date/time - this is a sin!)

As XML goes, this looks pretty dead easy to parse with non-XML parser
means. It looks like all of your leaf nodes open and close on the same
line, which would be easy to extract with regexps or pyparsing.
Especially since you mention "I only need some of the information", you
don't even have to build a full document tree representation. SAX
parsers would also be good, since you could trigger only on the
matching subset of tags that you are really interested in. Lastly, you
could even try a pyparsing approach. I usually don't recommend
pyparsing for XML since there are already many good XML-targeted tools
out there, but it is very easy to throw together something in pyparsing
that extracts, say, all of the <Object-id_id> entries, or all of the
<Gene-source> structures. What is the subset of information you are
looking to extract?

-- Paul
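
A rough sketch of the line-by-line regular-expression idea, in Python 3 syntax, might look like the following. The pattern and the function name are assumptions for illustration. It only works under the condition stated above, that every leaf element opens and closes on one line; it does not decode entities such as &amp;amp; and it will miss any value that wraps onto a second line, so it is not a general XML parser.

```python
import re

# <Tag optional-attributes>text</Tag> on a single line, no nested markup
LEAF = re.compile(r"<([\w-]+)(?:\s[^>]*)?>([^<]*)</\1>")


def leaf_values(lines, wanted):
    """Yield (tag, text) for each single-line leaf element whose tag is wanted."""
    for line in lines:
        for match in LEAF.finditer(line):
            if match.group(1) in wanted:
                yield match.group(1), match.group(2)


sample_lines = [
    "<Gene-track_geneid>9996</Gene-track_geneid>",
    '<Gene-track_status value="secondary">1</Gene-track_status>',
    "<Dbtag>",
    "<Object-id_id>320632</Object-id_id>",
]

found = list(leaf_values(sample_lines, {"Gene-track_geneid", "Object-id_id"}))
print(found)
```

Passing an open text file as lines lets the same function stream through a large file without loading it whole.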
 
William Park

Willem Ligtenberg said:
The XML file I need to parse contains information about genes.
So the first element is a gene and then there are a lot of sub-elements with
sub-elements. I only need some of the information and want to store it in
an object called gene. Later on this information will be printed into a
file, which in its turn will be fed into some other program.

You have to help us a little more here. Which info do you want to
extract from the example below?
 
Willem Ligtenberg

This is all the info I need from the xml file:
ID --> <Gene-track_geneid>320632</Gene-track_geneid>

Name --> <Gene-ref>
<Gene-ref_locus>Pzp</Gene-ref_locus>

Startbase, Endbase --> <Gene-commentary_seqs>
<Seq-loc>
<Seq-loc_int>
<Seq-interval>
<Seq-interval_from>126957426</Seq-interval_from>
<Seq-interval_to>126989473</Seq-interval_to>
<Seq-interval_strand>
<Na-strand value="plus"/>
</Seq-interval_strand>
<Seq-interval_id>
<Seq-id>
<Seq-id_gi>51860766</Seq-id_gi>
</Seq-id>
</Seq-interval_id>
</Seq-interval>
</Seq-loc_int>
</Seq-loc>
</Gene-commentary_seqs>

Function --> <Prot-ref_name>
<Prot-ref_name_E>U5 snRNP-specific protein, 200 kDa</Prot-ref_name_E>
<Prot-ref_name_E>U5 snRNP-specific protein, 200 kDa (DEXH RNA helicase family)</Prot-ref_name_E>
</Prot-ref_name>

DBLink --> <Gene-ref_locus-tag>MGI:2444401</Gene-ref_locus-tag>
<Gene-commentary_source>
<Other-source>
<Other-source_src>
<Dbtag>
<Dbtag_db>GO</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>5524</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
</Other-source_src>
<Other-source_anchor>ATP binding</Other-source_anchor>
<Other-source_post-text>evidence: ISS</Other-source_post-text>
</Other-source>
</Gene-commentary_source>

Product-type --> <Entrezgene_type value="protein-coding">6</Entrezgene_type>

gene-comment --> <Gene-ref_desc>activating signal cointegrator 1 complex subunit 3-like 1</Gene-ref_desc>

synonym --> <Gene-ref_syn>
<Gene-ref_syn_E>HELIC2</Gene-ref_syn_E>
<Gene-ref_syn_E>KIAA0788</Gene-ref_syn_E>
<Gene-ref_syn_E>U5-200KD</Gene-ref_syn_E>
<Gene-ref_syn_E>U5-200-KD</Gene-ref_syn_E>
<Gene-ref_syn_E>A330064G03Rik</Gene-ref_syn_E>
</Gene-ref_syn>

EC --> <Prot-ref_ec>
<Prot-ref_ec_E>1.5.1.5</Prot-ref_ec_E>
<Prot-ref_ec_E>3.5.4.9</Prot-ref_ec_E>
</Prot-ref_ec>

Chromosome: <SubSource>
<SubSource_subtype value="chromosome">1</SubSource_subtype>
<SubSource_name>6</SubSource_name>
</SubSource>

Some of these can occur more than once in a record.
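
As a sketch of how a few of the fields listed above could be collected into a gene object, the following uses a small dataclass and the standard-library xml.etree.ElementTree in Python 3. The class, its field names, and the helper function are hypothetical, and the sample record is heavily simplified: the values are borrowed from the snippets above, but the nesting is an assumption, so descendant searches are used instead of full paths. Fields that can repeat are stored as lists.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import xml.etree.ElementTree as ET


@dataclass
class Gene:
    """Hypothetical container for a subset of the wanted fields."""
    gene_id: Optional[str] = None
    name: Optional[str] = None
    synonyms: List[str] = field(default_factory=list)
    ec_numbers: List[str] = field(default_factory=list)


def gene_from_element(elem):
    """Build a Gene from one Entrezgene element, searching all descendants."""
    return Gene(
        gene_id=elem.findtext(".//Gene-track_geneid"),
        name=elem.findtext(".//Gene-ref_locus"),
        synonyms=[e.text for e in elem.iter("Gene-ref_syn_E")],
        ec_numbers=[e.text for e in elem.iter("Prot-ref_ec_E")],
    )


# Simplified, illustrative record; real records nest these tags more deeply
RECORD = """<Entrezgene>
<Gene-track_geneid>320632</Gene-track_geneid>
<Gene-ref>
<Gene-ref_locus>Pzp</Gene-ref_locus>
<Gene-ref_syn>
<Gene-ref_syn_E>HELIC2</Gene-ref_syn_E>
<Gene-ref_syn_E>KIAA0788</Gene-ref_syn_E>
</Gene-ref_syn>
</Gene-ref>
<Prot-ref_ec>
<Prot-ref_ec_E>1.5.1.5</Prot-ref_ec_E>
<Prot-ref_ec_E>3.5.4.9</Prot-ref_ec_E>
</Prot-ref_ec>
</Entrezgene>"""

gene = gene_from_element(ET.fromstring(RECORD))
print(gene)
```

Descendant searches are convenient but match a tag wherever it appears in the record, so a tag reused in several places, such as Object-id_id, would still need a full path.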
 
Willem Ligtenberg

While trying to write the code using cElementTree,
I stumbled across one problem. Sometimes there are multiple values to
retrieve from one record for the same element. Like this:
<Prot-ref_name_E>ATP-binding cassette, subfamily G, member 1</Prot-ref_name_E>
<Prot-ref_name_E>ATP-binding cassette 8</Prot-ref_name_E>

How do you get not only the first, but the rest as well, so that I can
store them in a list?

Thanks in advance,

Willem Ligtenberg
 
Willem Ligtenberg

By the way, I know about findall, but when I iterate through it like:
for x in function:
    print 'function', x

I get:
function <Element 'Prot-ref_name_E' at 0xb7d10cf8>
function <Element 'Prot-ref_name_E' at 0xb7d10d10>

But of course I want the information in there...
 
Fredrik Lundh

Willem said:
While trying to write the code using cElementTree,
I stumbled across one problem. Sometimes there are multiple values to
retrieve from one record for the same element. Like this:
<Prot-ref_name_E>ATP-binding cassette, subfamily G, member 1</Prot-ref_name_E>
<Prot-ref_name_E>ATP-binding cassette 8</Prot-ref_name_E>

How do you get not only the first, but the rest as well, so that I can
store them in a list?

findall returns a list of matching elements. if "elem" is the parent element,
this gives you a list of the text inside all Prot-ref_name_E child elements:

[e.text for e in elem.findall("Prot-ref_name_E")]

(you have read the elementtree documentation, I hope?)

</F>
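
One detail worth illustrating: a bare tag name passed to findall matches direct children only. The sketch below, in Python 3 with the standard-library xml.etree.ElementTree, contrasts that with a ".//" descendant search. The Prot-ref wrapper element used as the starting point is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# Prot-ref as the enclosing element is assumed for illustration
record = ET.fromstring("""<Prot-ref>
<Prot-ref_name>
<Prot-ref_name_E>ATP-binding cassette, subfamily G, member 1</Prot-ref_name_E>
<Prot-ref_name_E>ATP-binding cassette 8</Prot-ref_name_E>
</Prot-ref_name>
</Prot-ref>""")

# A bare tag name only matches direct children, so this finds nothing
direct = [e.text for e in record.findall("Prot-ref_name_E")]

# Starting from the actual parent element works
parent = record.find("Prot-ref_name")
from_parent = [e.text for e in parent.findall("Prot-ref_name_E")]

# ".//" searches all descendants, whatever the nesting depth
anywhere = [e.text for e in record.findall(".//Prot-ref_name_E")]

print(direct, from_parent, anywhere)
```

So when starting from a whole Entrezgene record rather than the immediate parent, either a full path or the ".//" form is needed.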
 
Fredrik Lundh

Willem said:
By the way, I know about findall, but when I iterate through it like:
for x in function:
    print 'function', x

I get:
function <Element 'Prot-ref_name_E' at 0xb7d10cf8>
function <Element 'Prot-ref_name_E' at 0xb7d10d10>

But of course I want the information in there...

for x in function:
    print 'function', x.text

</F>
 
Willem Ligtenberg

As you can read in my other post, my problem was with iterating through
the list. I didn't know that you should use e.text; I only printed e,
not e.text.
I did read the documentation, but I must admit not all of it.

Anyway, thank you very much!

 
