XML parsing per record


William Park

Willem Ligtenberg said:
ID --> <Gene-track_geneid>320632</Gene-track_geneid> ....
Product-type --> <Entrezgene_type value="protein-coding">6</Entrezgene_type> ....
EC --> <Prot-ref_ec>
<Prot-ref_ec_E>1.5.1.5</Prot-ref_ec_E>
<Prot-ref_ec_E>3.5.4.9</Prot-ref_ec_E>
</Prot-ref_ec> ....

Some can happen more than once in a record.

Since all your data are contained in unique tags on individual lines,
you can tackle this in many different ways. Okay, that's your input
format. What is your output format?
 

Willem Ligtenberg

Is there an easy way to couple data together? I have discovered an
irritating feature in the XML file.
Sometimes a database reference looks like this:
<Dbtag>
<Dbtag_db>UCSC</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_str>1234</Object-id_str>
</Object-id>
</Dbtag_tag>
</Dbtag>

And sometimes:

<Dbtag>
<Dbtag_db>UCSC</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>1234</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>

So I get a list of database names and two (!) lists of IDs,
and those two are in no way related. Is there an easy way to create a
dictionary like this: DBname --> ID?
If not, I still might need to revert to SAX... :(

As you can read in my other post, my problem was with
iterating through the list. I didn't know that you should use e.text; I
only did print e, not print e.text.
I did read the documentation, but I must admit not all of it.

Anyway, thank you very much!

Willem said:
As I'm trying to write the code using cElementTree,
I stumble across one problem. Sometimes there are multiple values to
retrieve from one record for the same element. Like this:
<Prot-ref_name_E>ATP-binding cassette, subfamily G, member 1</Prot-ref_name_E>
<Prot-ref_name_E>ATP-binding cassette 8</Prot-ref_name_E>

How do you get not only the first, but the rest as well, so that I can
store them in a list?

findall returns a list of matching elements. if "elem" is the parent element,
this gives you a list of the text inside all Prot-ref_name_E child elements:

[e.text for e in elem.findall("Prot-ref_name_E")]

(you have read the elementtree documentation, I hope?)
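
A minimal, self-contained sketch of the findall approach above. It uses the standard-library xml.etree.ElementTree instead of cElementTree, and the wrapper element name "Prot-ref_name" is an assumption made only so that the fragment has a parent element:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the two name elements come from the question,
# the enclosing <Prot-ref_name> wrapper is assumed for illustration.
xml_text = """
<Prot-ref_name>
  <Prot-ref_name_E>ATP-binding cassette, subfamily G, member 1</Prot-ref_name_E>
  <Prot-ref_name_E>ATP-binding cassette 8</Prot-ref_name_E>
</Prot-ref_name>
"""

elem = ET.fromstring(xml_text)

# findall returns every matching direct child, in document order.
names = [e.text for e in elem.findall("Prot-ref_name_E")]
print(names)
```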

</F>
 

Fredrik Lundh

Willem said:
So I get a list of database names and two (!) lists of IDs,
and those two are in no way related. Is there an easy way to create a
dictionary like this: DBname --> ID?

why not just check for both alternatives?

text = elem.findtext("Object-id_str")
if text is None:
    text = elem.findtext("Object-id_id")

(or you can loop over the child elements and map elem.tag through a
dictionary...)
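
One way the second suggestion (looping over the child elements and mapping elem.tag through a dictionary) might look. This is only a sketch: the function name object_id_value and the labels "str" and "id" are hypothetical, and it assumes it is handed an Object-id element directly:

```python
import xml.etree.ElementTree as ET

# Map each known tag variant to a short label (labels are illustrative).
KNOWN_TAGS = {"Object-id_str": "str", "Object-id_id": "id"}

def object_id_value(object_id):
    """Return (label, text) for the first recognised child, else None."""
    for child in object_id:
        label = KNOWN_TAGS.get(child.tag)
        if label is not None:
            return label, child.text
    return None

as_str = ET.fromstring("<Object-id><Object-id_str>1234</Object-id_str></Object-id>")
as_id = ET.fromstring("<Object-id><Object-id_id>1234</Object-id_id></Object-id>")

print(object_id_value(as_str))
print(object_id_value(as_id))
```
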

Willem said:
If not, I still might need to revert to SAX... :(

you still have to check for both alternatives...

(if you find a parsing problem that you cannot solve with a light-weight
DOM, SAX won't help you...)

</F>
 

Willem Ligtenberg

Since more than one database reference is possible per record, you
get per record a list of database names, database strings and
database IDs (where the strings and the IDs are really the same thing...).
So per record you check for both alternatives, but since there could be
more than one, you do findall and get an (unsorted) list back. And now you
don't know which ID belonged to which database...
See my problem?

Cheers,

Willem
 

Fredrik Lundh

Willem said:
Since more than one database reference is possible per record, you
get per record a list of database names, database strings and
database IDs (where the strings and the IDs are really the same thing...).
So per record you check for both alternatives, but since there could be
more than one, you do findall and get an (unsorted) list back.

findall returns matching elements in document order.

Willem said:
And now you don't know which ID belonged to which database...

why not? by looking at each database separately, surely you must be
able to figure out if the subelement holds an ID or a string? sure, if you
do document.findall(".//Object-id_id"), you'll get all IDs in document
order. but if you do record.findall(".//Dbtag"), you get a list of all Dbtag
elements, and can then look inside them to see what they contain.
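
A sketch of that idea, building the DBname --> ID dictionary one Dbtag at a time. The helper name db_references, the <record> wrapper, and the second database name "OtherDB" with ID 5678 are made up for illustration; the nesting of Dbtag_db and Dbtag_tag/Object-id follows the samples in the question:

```python
import xml.etree.ElementTree as ET

record_xml = """
<record>
  <Dbtag>
    <Dbtag_db>UCSC</Dbtag_db>
    <Dbtag_tag><Object-id><Object-id_str>1234</Object-id_str></Object-id></Dbtag_tag>
  </Dbtag>
  <Dbtag>
    <Dbtag_db>OtherDB</Dbtag_db>
    <Dbtag_tag><Object-id><Object-id_id>5678</Object-id_id></Object-id></Dbtag_tag>
  </Dbtag>
</record>
"""

def db_references(record):
    """Map each database name in one record to its ID text."""
    refs = {}
    for dbtag in record.findall(".//Dbtag"):
        name = dbtag.findtext("Dbtag_db")
        # Look inside this Dbtag only, so name and ID stay paired.
        value = dbtag.findtext("Dbtag_tag/Object-id/Object-id_str")
        if value is None:
            value = dbtag.findtext("Dbtag_tag/Object-id/Object-id_id")
        refs[name] = value
    return refs

record = ET.fromstring(record_xml)
refs = db_references(record)
print(refs)
```

Note that a plain dictionary keeps only the last ID if the same database name occurs twice in a record; a dictionary of lists would be needed in that case.
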

Willem said:
See my problem?

I'm afraid not. the document seems to have a clear structure; for some
reason, you don't seem to take that into account in your program.

</F>
 

Fredrik Lundh

order. but if you do record.findall(".//Dbtag"), you get a list of all Dbtag
elements

make that "you get a list of all Dbtag elements in that record"

</F>
 

Kent Johnson

Willem said:
Is there an easy way to couple data together? I have discovered an
irritating feature in the XML file.
Sometimes a database reference looks like this:
<Dbtag>
<Dbtag_db>UCSC</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_str>1234</Object-id_str>
</Object-id>
</Dbtag_tag>
</Dbtag>

And sometimes:

<Dbtag>
<Dbtag_db>UCSC</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>1234</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>

So I get a list of database names and two (!) lists of IDs,
and those two are in no way related. Is there an easy way to create a
dictionary like this: DBname --> ID?
If not, I still might need to revert to SAX... :(

None of your requirements sound particularly difficult to implement. If you would post a complete
example of the data you want to parse and the data you would like to end up with, it would be easier to
help you. The sample data you posted originally does not have many of the fields you want to extract,
and your example of what you want to end up with is not too clear either.

If you are having trouble with ElementTree, I expect you will be completely lost with SAX.
ElementTree is much easier to work with and cElementTree is very fast.
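
For per-record processing, an incremental parse is one option. The following is only a sketch under assumptions not stated in this thread: each record is taken to be an <Entrezgene> element inside an <Entrezgene-Set> root, the helper name iter_gene_ids is hypothetical, and the second gene ID is a placeholder. It uses xml.etree.ElementTree, since the separate cElementTree module is no longer available in recent Python versions:

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for the real file; the record and root tag names are assumed.
sample = b"""<Entrezgene-Set>
<Entrezgene><Gene-track_geneid>320632</Gene-track_geneid></Entrezgene>
<Entrezgene><Gene-track_geneid>2</Gene-track_geneid></Entrezgene>
</Entrezgene-Set>"""

def iter_gene_ids(source, record_tag="Entrezgene"):
    """Yield the gene ID of each record as soon as the record is complete."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            yield elem.findtext(".//Gene-track_geneid")
            elem.clear()  # drop the record's children once processed

gene_ids = list(iter_gene_ids(io.BytesIO(sample)))
print(gene_ids)
```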

Kent
 
