splitting up huge (1 GB) xml documents

twinkler

Dear all,

I am facing the problem that I have to handle XML documents of approx. 1
GB. Do not ask me what sane architecture allows the creation of such
files - I have no control over the creation and have to live with it.

I need to split this massive document up into smaller chunks of valid
XML. The structure of the XML is quite simple:

<businessHeader>Bla, bla -> only about 10 tags</businessHeader>

<businessInformation>info goes here</businessInformation>
<!-- the businessInformation element is repeated a couple of hundred
thousand times... -->
<businessInformation>info goes here</businessInformation>
<businessFooter>about 10 footer tags</businessFooter>


My current approach is to use SAX to parse the document and write the
businessInformation elements into different files. The header is written
at the start of each file and the footer is appended at the end.
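
In outline, the approach looks roughly like this (a simplified sketch: it
assumes the businessInformation elements contain only plain text as in
the example above, ignores attributes, and the real code would also have
to re-escape the character data and write the real header/footer instead
of the placeholder <chunk> element):

import java.io.*;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxSplitter extends DefaultHandler {
    private static final int RECORDS_PER_FILE = 100000;
    private Writer out;
    private int records = 0, fileNo = 0;
    private boolean inRecord = false;

    public void startElement(String uri, String local, String qName,
                             Attributes atts) throws SAXException {
        if (!"businessInformation".equals(qName)) return;
        try {
            if (records % RECORDS_PER_FILE == 0) nextFile();
            out.write("<businessInformation>");
            inRecord = true;
        } catch (IOException e) { throw new SAXException(e); }
    }

    public void characters(char[] ch, int start, int len) throws SAXException {
        if (!inRecord) return;
        try { out.write(ch, start, len); }   // NB: should re-escape &, <, >
        catch (IOException e) { throw new SAXException(e); }
    }

    public void endElement(String uri, String local, String qName)
            throws SAXException {
        if (!"businessInformation".equals(qName)) return;
        try { out.write("</businessInformation>\n"); }
        catch (IOException e) { throw new SAXException(e); }
        inRecord = false;
        records++;
    }

    public void endDocument() throws SAXException {
        try { if (out != null) { out.write("</chunk>\n"); out.close(); } }
        catch (IOException e) { throw new SAXException(e); }
    }

    // Close the current chunk file and start the next one.
    private void nextFile() throws IOException {
        if (out != null) { out.write("</chunk>\n"); out.close(); }
        out = new BufferedWriter(new FileWriter("chunk" + (fileNo++) + ".xml"));
        out.write("<chunk>\n");   // placeholder for the real header
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                        .parse(new File(args[0]), new SaxSplitter());
    }
}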

This obviously consumes quite a lot of time since the entire file is
parsed sequentially.

Can you think of a way to speed this process up? I was thinking of
jumping randomly into the <businessInformation> section of the file
(RandomAccessFile) and then starting to parse from there with SAX
(potentially in parallel, using threads), but I am not sure if this
works.

Any hint is appreciated.

Cheers

Torsten
 
HK

twinkler said:
Dear all,

I am facing the problem that I have to handle XML documents of approx. 1
GB. Do not ask me what sane architecture allows the creation of such
files - I have no control over the creation and have to live with it.

I need to split this massive document up into smaller chunks of valid
XML. The structure of the XML is quite simple:

You definitely want to use monq.jfa, available from

http://www.ebi.ac.uk/Rebholz-srv/whatizit/software

Download the jar and play with Grep. As an example, use a command like

java -cp monq.jar monq.programs.Grep \
-r '<YourTag[^>]*>' '</YourTag>' \
-rf %0 '%0\n' \
-cr <your_file

It will extract only the YourTag XML elements. The -r option defines
the 'region of interest', -rf says how to handle its start and end,
and -cr requests that every region of interest be printed. You could
also supply regular expressions to fetch only regions containing a
match.

To distribute the elements into different files, you will have to
write a few lines of code yourself. To get started, read the example:

http://www.ebi.ac.uk/Rebholz-srv/whatizit/monq-doc/monq/jfa/package-summary.html#package_description

and use

http://www.ebi.ac.uk/Rebholz-srv/whatizit/monq-doc/monq/jfa/Xml.html#GoofedElement(java.lang.String)

to create regular expressions for the elements you want to
fetch.

Don't hesitate to contact me (see download page) for more
specific questions and hints.

Harald.
 
Thomas Weidenfeller

twinkler said:
I need to split this massive document up into smaller chunks of valid
XML. The structure of the XML is quite simple:

<businessHeader>Bla, bla -> only about 10 tags</businessHeader>

<businessInformation>info goes here</businessInformation>
<!-- the businessInformation element is repeated a couple of hundred
thousand times... -->
<businessInformation>info goes here</businessInformation>
<businessFooter>about 10 footer tags</businessFooter>


My current approach is to use SAX to parse the document and write the
businessInformation elements into different files. The header is
written at the start of each file and the footer is appended at the
end.

This obviously consumes quite a lot of time since the entire file is
parsed sequentially.

Can you think of a way to speed this process up? I was thinking of
jumping randomly into the <businessInformation> section of the file
(RandomAccessFile) and then starting to parse from there with SAX
(potentially in parallel, using threads), but I am not sure if this
works.

First, you have to read the whole file anyhow, so randomly jumping
around doesn't make much sense. Threads shouldn't gain you much
either: if the job is I/O bound (which is likely), your threads would
just sit idle waiting for their next chunk of input data.

I would not use threads. I would not use random access. I would not
use SAX, I would not use any kind of XML parser. In fact, I would not
even use Java.

I would give the XML a very close look. Assuming that it is machine
generated, it should have a regular layout. Based on that layout I
would write a Perl script whose regular expressions (the simplest ones
that could possibly work) identify the different parts of the file and
break it up. XML is not well suited to processing with pattern
matching, but machine-generated XML is usually regular enough for it,
and maybe 20 or 40 lines of Perl are enough to process the file.
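
Whether you end up in Perl or Java, the "simplest patterns that could
possibly work" are little more than this (Java syntax shown here, tag
names taken from the post above):

import java.util.regex.Pattern;

class SplitPatterns {
    // One complete record; DOTALL because a record may span lines.
    static final Pattern RECORD = Pattern.compile(
        "<businessInformation[^>]*>.*?</businessInformation>",
        Pattern.DOTALL);
    // Everything before the first record is header,
    // everything from here on is footer.
    static final Pattern FOOTER = Pattern.compile("<businessFooter[^>]*>");
}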

I would also consider tampering with the way the writing application
generates the data. Under Unix I would try the age-old trick of giving
the writing application an output file name which does not in fact
point to a regular file, but to a FIFO (named pipe). The Perl script
would sit at the reading end of the FIFO and directly write the
chunks, and there would never be a 1 GB file at all.
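
The reading end needs nothing special: on Unix a FIFO opens like an
ordinary file, and the same works from Java if you prefer it over
Perl. A sketch, with made-up path names:

// Shell side, before starting the writing application:
//   mkfifo /tmp/business.xml
//   writing_app > /tmp/business.xml &     (or however it names its output)
import java.io.*;

public class FifoReader {
    public static void main(String[] args) throws IOException {
        BufferedReader in =
            new BufferedReader(new FileReader("/tmp/business.xml"));
        String line;
        while ((line = in.readLine()) != null) {
            // split into chunks here, exactly as with a regular file
        }
        in.close();
    }
}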

/Thomas
 
twinkler

Thomas, I probably do not have to read the entire file, since the
structure is quite regular. After the header follows the huge
<businessInformation> section. My idea was as follows:
- parse the file until the header ends and store the header in a
StringBuffer
- jump somewhere into the file and search for the next complete
<businessInformation> tag
- write the header and the XML section up to that point into a file
- open a new file
- jump further into the file and search for the next complete
<businessInformation> tag
- continue this until the footer is found
- store the footer in a StringBuffer, then open all created files and
append the footer

This way I would only have to read a fraction of the file. The problem
I have is that I am not sure how to use a RandomAccessFile with SAX.
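
Something like this is what I mean by "jump and search" (just a
sketch: it assumes a single-byte encoding and ignores the case where a
tag straddles a buffer boundary):

import java.io.*;

public class SeekScan {
    // Offset of the next "<businessInformation>" at or after pos,
    // or -1 if there is none.
    static long nextRecordStart(RandomAccessFile raf, long pos)
            throws IOException {
        final String tag = "<businessInformation>";
        raf.seek(pos);
        byte[] buf = new byte[64 * 1024];
        long base = pos;
        int n;
        while ((n = raf.read(buf)) > 0) {
            String s = new String(buf, 0, n, "ISO-8859-1");
            int i = s.indexOf(tag);
            if (i >= 0) return base + i;
            base += n;
        }
        return -1;
    }
}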

Cheers
Torsten
 
Johan Poppe

twinkler said:
Thomas, I probably do not have to read the entire file, since the
structure is quite regular. After the header follows the huge
<businessInformation> section. My idea was as follows:
- parse the file until the header ends and store the header in a
StringBuffer
- jump somewhere into the file and search for the next complete
<businessInformation> tag
- write the header and the XML section up to that point into a file
- open a new file
- jump further into the file and search for the next complete
<businessInformation> tag
- continue this until the footer is found
- store the footer in a StringBuffer, then open all created files and
append the footer

This way I would only have to read a fraction of the file. The problem
I have is that I am not sure how to use a RandomAccessFile with SAX.

In general, using RandomAccessFile with SAX makes no sense at all, and
there is no provision for doing it.

Also, in your case you want to write the <businessInformation>
elements out to new files, so you have to read them in somehow. Or how
do you propose to write content out to the smaller files without first
reading that content in from the larger file?

As Thomas said, as long as the XML has a reasonably regular layout,
you may get away without the XML parsing. Unix/Linux provides tools
for things like sending the first n lines of a file to a new file. You
can also do it in Java, if that is easier for you than Perl. Basically
you stick to the algorithm you suggest above, just (a) remove the SAX
parser from the equation and (b) instead of jumping into the file,
read in and write out, as sketched below.
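
A minimal sketch of that, assuming (as your example suggests) that the
file puts each <businessInformation> element on a line of its own,
with the header lines before the first record and the footer lines
after the last one; the chunk file names are made up:

import java.io.*;
import java.util.ArrayList;
import java.util.List;

public class StreamSplitter {
    static final int PER_FILE = 100000;   // records per chunk, tune freely

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        StringBuilder header = new StringBuilder();
        String line;

        // 1. Everything before the first record is header.
        while ((line = in.readLine()) != null
                && !line.contains("<businessInformation")) {
            header.append(line).append('\n');
        }

        // 2. Stream the records out, starting a new chunk every PER_FILE.
        List<String> chunks = new ArrayList<String>();
        PrintWriter out = null;
        int count = 0;
        while (line != null && line.contains("<businessInformation")) {
            if (count % PER_FILE == 0) {
                if (out != null) out.close();
                String name = "chunk" + chunks.size() + ".xml";
                chunks.add(name);
                out = new PrintWriter(new BufferedWriter(new FileWriter(name)));
                out.print(header);
            }
            out.println(line);
            count++;
            line = in.readLine();
        }
        if (out != null) out.close();

        // 3. Everything after the last record is footer;
        //    append it to every chunk (append mode).
        StringBuilder footer = new StringBuilder();
        while (line != null) {
            footer.append(line).append('\n');
            line = in.readLine();
        }
        in.close();
        for (String name : chunks) {
            FileWriter fw = new FileWriter(name, true);
            fw.write(footer.toString());
            fw.close();
        }
    }
}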

Johan
 
Thomas Weidenfeller

twinkler said:
Thomas, I probably do not have to read the entire file, since the
structure is quite regular.

The regularity of the XML structure is of no help if the size of the
elements is not predictable. How do you intend to calculate the jump
width once you have found something?

twinkler said:
- jump somewhere into the file and search for the next complete
<businessInformation> tag

Come up with a formula to calculate that "jump somewhere", and we are
talking. Can you ensure that you will never miss such a tag by
accidentally jumping too far?

twinkler said:
This way I would only have to read a fraction of the file. The problem
I have is that I am not sure how to use a RandomAccessFile with SAX.

From my point of view you would also need a considerable amount of
black magic. Fine with me. But I prefer the conventional way of
programming.

/Thomas
 
