Splitting an XML file on the basis of XML tags

bijeshn

Hi all,

I have an XML file with the following structure:

<r1>
  <r2>            --+
    <r3>            |
      <r4>          |
        ..          |  constitutes one record
        ..          |
      </r4>         |
    </r3>           |
  </r2>           --+
  <r2>
    ..            --+
    ..              |  there are n records in between
    ..            --+
  </r2>
  <r2>            --+
    <r3>            |
      <r4>          |
        ..          |  constitutes one record
        ..          |
      </r4>         |
    </r3>           |
  </r2>           --+
</r1>


Here <r1> is the root tag of the XML, and each <r2>...</r2> element
constitutes one record. What I would like to do is extract everything
(XML tags and data) between the nth <r2> tag and the (n+k)th <r2> tag.
The extracted data is to be written to a separate file.

Thanks...
 
Chris

You could create a generator expression out of it:

txt = """<r1>
<r2><r3><r4>1</r4></r3></r2>
<r2><r3><r4>2</r4></r3></r2>
<r2><r3><r4>3</r4></r3></r2>
<r2><r3><r4>4</r4></r3></r2>
<r2><r3><r4>5</r4></r3></r2>
</r1>
"""
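# Splitting on 'r2>' cuts the text at every <r2> and </r2> tag.
pieces = txt.split('r2>')
last = len(pieces) - 1
# Skip the first and last pieces and the whitespace-only gaps between records.
a = ('<r2>%sr2>' % piece for j, piece in enumerate(pieces)
     if 0 < j < last and piece.replace('>', '').replace('<', '').strip())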

Now you have a generator that you can iterate through with next(a)
(a.next() in Python 2). Alternatively, you can create a list out of it
by replacing the outer parentheses with square brackets.
 
bijeshn

Hmm, I will look into it. Thanks.

The XML file is almost a TB in size, so SAX will have to be the parser.
I'm thinking of doing something to split the file using SAX. Any
suggestions along those lines? If there are any other suitable parsers,
please suggest them.
 
Steve Holden

bijeshn said:
Hmmm... will look into it.. Thanks

the XML file is almost a TB in size...

Good grief. When will people stop abusing XML this way?

bijeshn said:
so SAX will have to be the parser.... i'm thinking of doing something
to split the file using SAX
... Any suggestions on those lines..? If there are any other parsers
suitable, please suggest...

You could try pulldom, but the documentation is disgraceful.

ElementTree.iterparse *might* help.
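
For the pulldom route, a minimal sketch might look like the following. It assumes the record tag is r2, as in the question; SAMPLE and iter_records are made-up names for illustration, and the small in-memory sample stands in for the real file. Only the record being expanded is built as a DOM subtree, so memory use should stay close to the size of one record, though that has not been measured here on large input.

```python
import io
from xml.dom import pulldom

# Tiny stand-in for the real file; tag names follow the question.
SAMPLE = (
    b"<r1>"
    b"<r2><r3><r4>1</r4></r3></r2>"
    b"<r2><r3><r4>2</r4></r3></r2>"
    b"<r2><r3><r4>3</r4></r3></r2>"
    b"</r1>"
)

def iter_records(stream, tag="r2"):
    """Yield each <tag> element as an XML string, one record at a time."""
    events = pulldom.parse(stream)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            events.expandNode(node)  # pull in this record's subtree only
            yield node.toxml()

records = list(iter_records(io.BytesIO(SAMPLE)))
for record in records:
    print(record)
```

pulldom.parse also accepts a file name, so the BytesIO wrapper is only there to keep the sketch self-contained.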

regards
Steve
 
Chris

I abuse it because I can (and because I don't generally work with XML
files larger than 20-30 MB) :)
And the OP never said the XML file was 1 TB in size, which makes things
different.
 
Diez B. Roggisch

Even with small xml-files your advice was not very sound. Yes, it's
tempting to use regexes to process xml. But usually one falls flat on
his face soon - because of whitespace or attribute order or <foo></foo>
versus <foo/> or .. or .. or.

Use an XML parser. That's what they are for. And especially with the
Pythonic ones like ElementTree (and the compatible lxml), it's even more
straightforward than using regexes.
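
To make the point concrete, here is a small sketch. The two sample documents and the helper name summarize are invented for illustration. Both strings describe the same data but differ in attribute order, whitespace inside the tag, and <foo></foo> versus <foo/>; a parser treats them as equivalent, while text matching does not.

```python
import xml.etree.ElementTree as ET

# Two spellings of the same document (made-up sample data).
doc_a = '<rec id="1" kind="x"><foo></foo></rec>'
doc_b = '<rec  kind="x"   id="1" ><foo/></rec>'

def summarize(xml_text):
    """Reduce a document to (tag, attributes, child tags) using a parser."""
    root = ET.fromstring(xml_text)
    return root.tag, dict(root.attrib), [child.tag for child in root]

# The parser sees the same structure in both documents.
same_for_parser = summarize(doc_a) == summarize(doc_b)

# Plain text comparison and substring search are thrown off by the spelling.
same_as_text = doc_a == doc_b
naive_hit_a = "<foo></foo>" in doc_a
naive_hit_b = "<foo></foo>" in doc_b

print(same_for_parser, same_as_text, naive_hit_a, naive_hit_b)
```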


Diez
 
bijeshn

Yeah, I plan to use SAX, but how do you do it with that?

Forget 1 TB for now. How do you split an XML file that is something
like 70-80 GB in the way I need (as described in my first post)?
 
Stefan Behnel

What do you mean by "written down to a separate file"? Do you have a specific
format in mind?

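In general, you can try this inside an ElementTree.iterparse loop, where
root is the document's root element and element is the current element:

    if event == "end" and element.tag == "r2":
        print(ET.tostring(element, encoding="unicode"))  # write record subtree as XML
        root.clear()  # one record done, clean up everything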

http://effbot.org/zone/element-iterparse.htm
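
Spelled out in full, the loop could look roughly like this. It is a sketch under assumptions: SAMPLE and iter_records are hypothetical names, and the in-memory sample stands in for the real file. The first event is used to grab the root element so that it can be cleared after every record.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for the real file; tag names follow the question.
SAMPLE = (
    b"<r1>"
    b"<r2><r3><r4>1</r4></r3></r2>"
    b"<r2><r3><r4>2</r4></r3></r2>"
    b"<r2><r3><r4>3</r4></r3></r2>"
    b"</r1>"
)

def iter_records(source, tag="r2"):
    """Yield each <tag> element as an XML string while parsing incrementally."""
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)  # first event is the start of the root element
    for event, element in context:
        if event == "end" and element.tag == tag:
            yield ET.tostring(element, encoding="unicode")
            root.clear()  # drop finished records so the tree does not grow

records = list(iter_records(io.BytesIO(SAMPLE)))
for record in records:
    print(record)
```

Because iter_records is a generator, a range of records can be picked out with itertools.islice(iter_records(source), n, n + k) without holding the rest in memory.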

You can also do things like

    print(element.findtext("r3/r4"))

Read the ElementTree tutorial to learn how to extract your data:

http://effbot.org/zone/element.htm#searching-for-subelements

Stefan
 
bijeshn

Sorry, it should be extracted into separate "files", i.e. if I have an
XML file containing 10 million records, I need to split the file into
100 files containing 100,000 records each.

I hope this is clearer...
 
bijeshn

Please disregard the above post.

Sorry, it should be extracted into separate "XML files", i.e. if I have an
XML file containing 10 million records, I need to split the file into
100 XML files containing 100,000 records each.

I hope this is clearer...
 
bijeshn

The extracted files are to be XML too. I just need to extract it raw
(tags and data just as they are in the parent XML file).
 
Stefan Behnel

bijeshn said:
the extracted files are to be XML too. I just need to extract it raw
(tags and data just like it is in the parent XML file..)

Ah, so then replace the "print tostring()" line in my example by

ET.ElementTree(element).write("outputfile.xml")

and you're done.
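
For the "100,000 records per file" case, one possible sketch follows. The function name split_records, the part%04d.xml naming scheme, and the choice to wrap each chunk in a fresh <r1> root are all assumptions made for illustration. Note that ElementTree re-serializes each record, so the output is equivalent XML rather than a byte-for-byte copy of the input.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET

def split_records(source, out_dir, per_file, record_tag="r2", root_tag="r1"):
    """Write every `per_file` records to its own XML file; return the paths."""
    paths, out, count = [], None, 0
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)  # keep the root so it can be cleared
    for event, element in context:
        if event != "end" or element.tag != record_tag:
            continue
        if out is None:  # start a new chunk file
            path = os.path.join(out_dir, "part%04d.xml" % len(paths))
            paths.append(path)
            out = open(path, "w", encoding="utf-8")
            out.write("<%s>\n" % root_tag)
        element.tail = "\n"  # put each record on its own line
        out.write(ET.tostring(element, encoding="unicode"))
        count += 1
        root.clear()  # free the record that was just written
        if count % per_file == 0:  # chunk is full, close it
            out.write("</%s>\n" % root_tag)
            out.close()
            out = None
    if out is not None:  # close a final, partly filled chunk
        out.write("</%s>\n" % root_tag)
        out.close()
    return paths

# Demo with five made-up records, two per output file.
SAMPLE = (
    "<r1>"
    + "".join("<r2><r3><r4>%d</r4></r3></r2>" % i for i in range(1, 6))
    + "</r1>"
).encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    paths = split_records(io.BytesIO(SAMPLE), tmp, per_file=2)
    counts = [len(ET.parse(p).getroot().findall("r2")) for p in paths]
    first_values = [e.findtext("r3/r4") for e in ET.parse(paths[0]).getroot()]

print(counts, first_values)
```

For the real file, source would be the input file name and per_file would be 100000.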

Stefan
 
bijeshn

Thanks a lot, Stefan.
I haven't tested out your idea yet.
Will get back as soon as I do.
 
