Parsing XML and storing attributes in MySQL using Perl

J

jkugler

Hello,

I am trying to store huge amounts of data from xml files and put them
into a MySQL database. The xml files all are in this format:

<?xml version="1.0" encoding="UTF-8"?>
<snp_submission xsi:schemaLocation="http://www.hapmap.org
http://hapmap.cshl.org/xml-schema/2003-1
1-04/hapmap.xsd"
lsid="urn:LSID:ncbi.nlm.nih.gov:dbSNP/DCC_batch:34_chrom11_DELETE:005"
xmlns="http
://www.hapmap.org"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<batch_info>
<class>DELETE</class>
<group lsid="urn:lsid:dcc.hapmap.cshl.org:Labgroup:DBSNP:1"/>
<date_created>2005-09-15</date_created>
<contact>
<name>Steve Sherry</name>
<email>[email protected]</email>
</contact>
</batch_info>
<snps>
<snp
lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461544:111">
<snp_class>
<bac-overlap/>
</snp_class>
<sequence>

<flank_5>caaaggaatataaatcattctattataaagatacatgcacAGGgctgggtgcagtggctcacacctgtaatccc
agcactttgggaggccaaggcgggtggatcacctgaggacaggagtttgagaccagcctagccaacatggggaaactccatctctactaaaaatacaaa
aattagccaggtatagtggtgcacacctgtaataccagctactttggaggctgaggcaggagaatcgctggaacccaggaggcagaggtcaaagtgagc
caagatcataccattgcactccagcctgggcaacaagagcaaaactccatcttaaaaaaatatatatatatacatatacatacatatatatacacatat
atatacatatatacagatattatatatgtaaatgtatatatatgtgtatatatatacacatatatatacatattataactacatatatatacacacaca
catacatatacatgcacacatatgtttattgcagcactatttacgatagaaaatacatggaatcctcccaaatgcccatcaatgatatattggataaag
aaaatgtgatatatattcaccatggaatactatgcagccgttaaaataaatgagatcatgttctttgcagggacatggatgaagctggaagccatcacc
ctcagcaaactaacacaggaaaagaaaaccaaacaccacatgttctcagtcgtaagagggagttgaacaatgagagcaaacacatggatacatggaggg
gaacaacacacaccagggcctctcagcgggacaggggtaggagaCCATCAGGACAaacacgtggatacatggaggggaacaacacacaccagggcctct
cagggggacagggggtaggagaccatcaagacaaacacgtggatacatggaggggaacaacacacaccagggcctctcagggggacagggggtaggaga
ccatcaggacaaacacgtggatacatggaggggaa</flank_5>
<variation>
<allele base="C"/>
<allele base="T"/>
</variation>

<flank_3>aacacacaccagggcctctcagggggacagggggtaggagaccatcaRgacaaacacgtggRtacatggagggg
aacaacacacaccagggcctctcagggggacggggggtagRagaccatcaggacaaatagctaatgcatgcagggcctcatacctaggtgatgggttga
tgggtgcagcaaaccaccatggcacacatttacctatgtatcaaacctaYactttctgcacgtgtatcccagaacataaaataaaatttaaaaaatata
taCACTGATTCATGATCTCCTttctctccttctgaaacactctttaaaactttttagcatttccccctctgtcttccatgtctcctaactacatgtttc
ttattttccatgtctttattcctgtgttcattttggatagccccttctgacctatattacagtttactagttcactcttcaactgcttctaacatacta
atattctgttaaaaccattcatttgggtttaaatttcaattatgttattctctatggacattctatttgttttcttttaatcttcttggccattctcta
gagtttcctgttccattatgatatttttaattttttgttttactttaaacatactaaatatagttattttattttattttctgtatctgatactttcaa
taactgcagtctttgctagtcttttttctgtgctcttgctcatagtttttttcatttgttttCATGATTagaaaaacagagagagaagaaggagagtaa
agggaggaggcggaggaggagaaaagaagaaagcagagaagaagggacagagaaaaaaaggaagTTGGTTCTAACGTTTCTCTAACAACTGGCTTCAGT
GAAACACTCCCACCTTGTGGATTTTTAGGTTATTGAAATTAACCAGTCTTCtgggtgcagcacaccaacatggcacatgtatacatatgtaacaaacct
gcactttgtgcacatgtaccctaaaacttaaagta</flank_3>
<length_class value="full-length"/>
</sequence>
<genomic_locations>
<genomic_location active="true" type="exact">
<assembly_version>ncbi_build34</assembly_version>
<coordinates chrom="11" start="60749" stop="60749"
strand="forward"/>
</genomic_location>
</genomic_locations>
<neighbour_snps>
<neighbour_snp
lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461545:101"/>
<neighbour_snp
lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461546:101"/>
<neighbour_snp
lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461547:101"/>
<neighbour_snp
lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2686858:100"/>
</neighbour_snps>
</snp>

I am trying to extract the following information: lsid, allele base,
cooordinates chrom, start, and strand. I would ideally like to use a
perl script to pull this data out and put it in MySQL, but I have had
no luck with XML::parser since the information I am looking for is
embedded in the element's attributes.

If anyone has any ideas on how to get this data out of xml and into
MySQL, it would be much appreciated.

Thanks so much,

--James
 
R

robic0

Hello,

I am trying to store huge amounts of data from xml files and put them
into a MySQL database. The xml files all are in this format:
[xml snipped]

Hi James.

If you want you could use RXParse.pm (ver .91) posted on this group.
It would send the same attribute info to a tag handler.
Should be the same as the one you used.
From there you could check the "name" for the "value" data.

If you could hang on a while, I'm writing filter interfaces for
tags, namespace, attributes, content, etc.. The filters can be
either regexp or dos style. Each filter is specific, set by the
user who provides the handler to recieve that specific info.

Whats nice is that the regexp provided by the user is applied to
just the sub-parsed global data, the category is guaranteed.
Tag is tag, attrib is attrib, content is content.

Its almost there. Yours is a basic search but the intent is to
provide search/replace as well as multiple filter capability.
At my stage of development with this its really pretty easy.
The hold up is a base compliant parser, which is almost done.

Anyway, just some info for you. No need to use RXParse. The difference
between my module and all thats out there is its whole purpose is meant
to be interactive. Funny how the simple things you would expect from
high power software just aint there, and its so easy to do!

robic0
(god of porn)
 
J

J. Gleixner

jkugler wrote:
[...]
<neighbour_snps>
<neighbour_snp
lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461545:101"/>
<neighbour_snp
lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461546:101"/>
<neighbour_snp
lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461547:101"/>
<neighbour_snp
lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2686858:100"/>
</neighbour_snps>
</snp>

I am trying to extract the following information: lsid, allele base,
cooordinates chrom, start, and strand. I would ideally like to use a
perl script to pull this data out and put it in MySQL, but I have had
no luck with XML::parser since the information I am looking for is
embedded in the element's attributes.

Use XML::parser to get the element's value, then parse the value using
regular expressions, split, or various other means.

What have you tried? No one can point out coding errors, unless you post
your code.

Look at using DBI, and DBD::mysql for interacting with MySQL. It's all
available on CPAN ( http://search.cpan.org/ ).
 
X

xhoster

jkugler said:
Hello,

I am trying to store huge amounts of data from xml files and put them
into a MySQL database.

What is huge? Is it the number of files, or the number of entries per
file? If the former, I would probably use XML::Simple. If the later,
maybe look into XML::Twig.

The xml files all are in this format:

<?xml version="1.0" encoding="UTF-8"?>
<snp_submission xsi:schemaLocation="http://www.hapmap.org
http://hapmap.cshl.org/xml-schema/2003-1
1-04/hapmap.xsd"
lsid="urn:LSID:ncbi.nlm.nih.gov:dbSNP/DCC_batch:34_chrom11_DELETE:005"
xmlns="http
://www.hapmap.org"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> ....
....
</snp>

I am trying to extract the following information: lsid,


lsid appears at more than one level.

allele base,
cooordinates chrom, start, and strand. I would ideally like to use a
perl script to pull this data out and put it in MySQL, but I have had
no luck with XML::parser since the information I am looking for is
embedded in the element's attributes.

I haven't had any trouble getting element attributes from XML files with
XML::parser. Can you show some of what you have tried so far?
Here is something trivial I whipped up that just shows how to get the
attributes for the allele:


use strict;
use Data::Dumper;
use XML::parser;
my $p1 = new XML::parser(Style =>'Subs');
open my $fh, "test.xml" or die $!;
$p1->parse($fh);

sub allele {
my ($expat,$ele,%att) = @_;
print Dumper \%att
};

__END__
$VAR1 = {
'base' => 'C'
};
$VAR1 = {
'base' => 'T'
};

no element found at line 73, column 0, byte 3851 at
/tools/GNU/perl/5.8.7_64/lib/site_perl/5.8.7/x86_64-linux/XML/Parser.pm
line 187


(That error is only there because the sample file you provided was
incomplete.)

Xho
 
R

robic0

What is huge? Is it the number of files, or the number of entries per
file? If the former, I would probably use XML::Simple. If the later,
maybe look into XML::Twig.
No no! Attribute data is "content", why invoke Simple when its just content?
lsid appears at more than one level. That doesen't matter....



I haven't had any trouble getting element attributes from XML files with
XML::parser. Can you show some of what you have tried so far?
Here is something trivial I whipped up that just shows how to get the
attributes for the allele:


use strict;
use Data::Dumper;
use XML::parser;
my $p1 = new XML::parser(Style =>'Subs');
open my $fh, "test.xml" or die $!;
$p1->parse($fh);

sub allele {
my ($expat,$ele,%att) = @_;
print Dumper \%att
};

__END__
$VAR1 = {
'base' => 'C'
};
$VAR1 = {
'base' => 'T'
};

no element found at line 73, column 0, byte 3851 at
/tools/GNU/perl/5.8.7_64/lib/site_perl/5.8.7/x86_64-linux/XML/Parser.pm
line 187


(That error is only there because the sample file you provided was
incomplete.)

Xho

The attribute is known, any tag handler will do.
Don't make it into a federal case, <gods name=jeezz/>.
 
J

John Bokma

[..]
I am trying to extract the following information: lsid, allele base,
cooordinates chrom, start, and strand. I would ideally like to use a
perl script to pull this data out and put it in MySQL, but I have had
no luck with XML::parser since the information I am looking for is
embedded in the element's attributes.

XML::parser can handle attributes perfectly. What exactly didn't work?
See: http://johnbokma.com/perl/element-id-for-given-parent.html
If anyone has any ideas on how to get this data out of xml and into
MySQL, it would be much appreciated.

XML::Twig? Give me a book? :)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,989
Messages
2,570,207
Members
46,783
Latest member
RickeyDort

Latest Threads

Top