Dear all,
I am facing the problem that I have to handle XML documents of approx. 1
GB. Do not ask me which sane architecture allows the creation of such
files - I have no control over the creation and have to live with it.
I need to split this massive document up into smaller chunks of valid
XML:
The structure of the XML is quite easy:
<businessHeader>Bla, bla -> only about 10 tags</businessHeader>
<businessInformation>info goes here</businessInformation>
<!-- the businessInformation element is repeated a couple of hundred
thousand times... -->
<businessInformation>info goes here</businessInformation>
<businessFooter>about 10 tags footer</businessFooter>
My current approach is to use SAX to parse the document and write the
businessInformation elements into different files; the header is written
to each file first and the footer appended at the end.
This obviously consumes quite a lot of time since the entire file is
parsed sequentially.
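For reference, the splitting approach described above could be sketched roughly as follows. This is a minimal sketch, not your actual code: it assumes a root element named <businessDocument> (the snippet above does not show the root, so that name is made up), the element names shown above, and a chunk size passed to the constructor. Since the footer is only seen at the very end of the input, each chunk is first written with header and records only, and the footer is appended to all chunks in endDocument():

```java
// Sketch of a SAX-based splitter for the structure described above.
// Assumptions (not confirmed by the original post): root element is
// <businessDocument>, element names are businessHeader /
// businessInformation / businessFooter, encoding is UTF-8.
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Splitter extends DefaultHandler {
    private final String outPrefix;        // chunk files: outPrefix + N + ".xml"
    private final int recordsPerFile;
    private final StringBuilder header = new StringBuilder();
    private final StringBuilder footer = new StringBuilder();
    private final StringBuilder records = new StringBuilder();
    private final List<File> chunks = new ArrayList<>();
    private StringBuilder current;         // buffer receiving the element being copied, or null
    private int recordCount = 0;

    public Splitter(String outPrefix, int recordsPerFile) {
        this.outPrefix = outPrefix;
        this.recordsPerFile = recordsPerFile;
    }

    @Override public void startElement(String uri, String local, String qName, Attributes atts) {
        if (qName.equals("businessHeader"))            current = header;
        else if (qName.equals("businessInformation"))  current = records;
        else if (qName.equals("businessFooter"))       current = footer;
        if (current == null) return;                   // e.g. the root element itself
        current.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++)
            current.append(' ').append(atts.getQName(i)).append("=\"")
                   .append(escape(atts.getValue(i))).append('"');
        current.append('>');
    }

    @Override public void characters(char[] ch, int start, int length) {
        if (current != null) current.append(escape(new String(ch, start, length)));
    }

    @Override public void endElement(String uri, String local, String qName) throws SAXException {
        if (current == null) return;
        current.append("</").append(qName).append('>');
        if (qName.equals("businessHeader") || qName.equals("businessFooter")) {
            current = null;
        } else if (qName.equals("businessInformation")) {
            current = null;
            if (++recordCount == recordsPerFile) flush();
        }
    }

    @Override public void endDocument() throws SAXException {
        if (records.length() > 0) flush();
        // The footer is only known once the whole file has been read,
        // so append it (plus the closing root tag) to every chunk now.
        for (File f : chunks) {
            try (Writer w = new OutputStreamWriter(
                    new FileOutputStream(f, true), StandardCharsets.UTF_8)) {
                w.write(footer.toString());
                w.write("</businessDocument>");        // assumed root element name
            } catch (IOException e) {
                throw new SAXException(e);
            }
        }
    }

    private void flush() throws SAXException {
        File f = new File(outPrefix + chunks.size() + ".xml");
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream(f), StandardCharsets.UTF_8)) {
            w.write("<businessDocument>");             // assumed root element name
            w.write(header.toString());
            w.write(records.toString());
        } catch (IOException e) {
            throw new SAXException(e);
        }
        chunks.add(f);
        records.setLength(0);
        recordCount = 0;
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}
```

Note that re-serializing through StringBuilder buffers like this loses comments, CDATA sections, and entity references, which may or may not matter for your data.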
Can you think of a way to speed this process up? I was thinking of
jumping to a random position inside the <businessInformation> section of
the file (via RandomAccessFile) and then parsing from there on with SAX
(potentially in parallel using threads), but I am not sure whether this
works.
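The random-access idea hinges on re-synchronising on a record boundary after an arbitrary seek, since a SAX parser cannot start mid-element. One way to sketch that (my own illustration, not anything from the post) is a raw byte scan for the literal start tag, using overlapping read windows so a tag split across two reads is not missed. Caveats: this assumes an ASCII-compatible encoding such as UTF-8, and it would false-positive if the tag text ever appeared inside a comment or CDATA section:

```java
// Sketch: find the byte offset of the next <businessInformation> start tag
// at or after a given position, so a worker thread could begin there.
// Assumes an ASCII-compatible encoding (e.g. UTF-8) and that the literal
// tag never occurs inside comments or CDATA.
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class RecordSeeker {
    /** Returns the offset of the next "<businessInformation>" tag at or
     *  after {@code from}, or -1 if there is none before end of file. */
    public static long nextRecordOffset(RandomAccessFile raf, long from) throws IOException {
        final byte[] needle = "<businessInformation>".getBytes(StandardCharsets.US_ASCII);
        final byte[] buf = new byte[1 << 16];
        long pos = from;
        while (true) {
            raf.seek(pos);
            int n = raf.read(buf);
            if (n < needle.length) return -1;          // not enough bytes left for a tag
            for (int i = 0; i + needle.length <= n; i++) {
                int j = 0;
                while (j < needle.length && buf[i + j] == needle[j]) j++;
                if (j == needle.length) return pos + i;
            }
            // Advance by less than a full buffer so the windows overlap
            // and a tag straddling two reads is still found.
            pos += n - needle.length + 1;
        }
    }
}
```

With boundary offsets in hand, each thread could copy its byte range verbatim into a chunk file between a shared header and footer, avoiding XML parsing for the bulk of the data entirely. Whether the disk can actually serve several threads faster than one sequential reader is a separate question worth measuring.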
Any hint is appreciated.
Cheers
Torsten