large xml file...

boris · Aug 23, 2011

hi all,
I need to process a large XML file and dump some documents to a different
file based on the content of some elements.

Let's say I need to check the content of <text3> and, depending on it, dump
the whole <doc> to a different file:

<doc>
<text1>
<text2>
<text3> ... etc

</doc>

I'm trying to do this using SAX. Are there any examples of how to do this?
Is SAX OK for this task?
Thanks.

Ian Shef · Aug 23, 2011

What you are asking is unclear to me.
Do you mean that <text3> will determine whether you dump the whole <doc> to
another file?
Do you mean that <text3> will determine what file the whole <doc> will be
dumped to?
Or do you mean that the whole <doc> will be dumped to some other file, and
while you are at it, <text3> will also be checked and reported in some way?

Can you read the "large xml file" twice?
Can you put the whole "large xml file" (or at least the part preceding
<text3>) into memory?
Can you copy the "large xml file" to another file while it is being
processed?

Sorry about the questions, but I need clarification. I have used SAX and
may be able to provide enlightenment. SAX has its uses, but it keeps nothing
it has already parsed, so it is not so good when 'memory' is involved unless
_you_ provide the memory. SAX excels when processing can take place in a
single pass with very little looking backwards. Consequently, it does not
use as much memory as some other methods.

boris · Aug 23, 2011

Do you mean that <text3> will determine whether you dump the whole <doc> to
another file?

Yes.

Can you read the "large xml file" twice?

I would like to read it once.

Can you put the whole "large xml file" (or at least the part preceding
<text3>) into memory?

No.

boris · Aug 23, 2011

no.

No, I can load the whole file. 1 doc is not a problem...

Arne Vajhøj · Aug 23, 2011

SAX or StAX seem like the most obvious choices given the context.

Any textbook SAX example should lead you to working code.

I can post some code, but I doubt that it will show anything
that various books and tutorials do not already show.

Arne
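
For readers who have not seen one, here is a minimal sketch of the kind of textbook SAX handler meant above. It only counts the <text3> elements whose text contains a keyword; it does not copy anything yet. The class name Text3Scanner, the keyword criterion, and the <docs> root element in the sample are assumptions made for illustration, not details from the original question.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts <text3> elements whose text contains a keyword.
final class Text3Scanner extends DefaultHandler {
    private final String keyword;                       // assumed match criterion
    private final StringBuilder text3 = new StringBuilder();
    private boolean inText3;
    private int matches;

    Text3Scanner(String keyword) { this.keyword = keyword; }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (qName.equals("text3")) {
            inText3 = true;
            text3.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one text node in several chunks, so append.
        if (inText3) text3.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (qName.equals("text3")) {
            inText3 = false;
            if (text3.indexOf(keyword) >= 0) matches++;
        }
    }

    // Convenience wrapper for small inputs; a real file would use an InputStream.
    static int countMatches(String xml, String keyword) {
        try {
            Text3Scanner handler = new Text3Scanner(keyword);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.matches;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<docs><doc><text3>keep me</text3></doc>"
                   + "<doc><text3>skip</text3></doc></docs>";
        System.out.println(countMatches(xml, "keep"));
    }
}
```

The handler keeps only the text of the current <text3>, which is why memory use stays small no matter how large the file is.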

Ian Shef · Aug 23, 2011

boris said:
No, I can load the whole file. 1 doc is not a problem...

As you are processing, you can save the XML yourself (e.g. as a List of
String_s).

Based on the result of evaluating <text3>, you can choose to:

Open an output file, copy the List of String_s to the output file, and copy
any succeeding XML to the output file, or discard the List and discontinue
processing.

Alternatively, you can save the XML to a file as you process it. When you
evaluate <text3>, you can choose to continue saving to the file, or delete
the file and discontinue processing.
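
The first option (save the XML yourself, then decide) could be sketched roughly as below. Several things here are assumptions: the class name DocBuffer, the keyword test on <text3>, a StringBuilder standing in for the output file, and the simplification that the decision is acted on at </doc> rather than immediately after <text3>. Because SAX hands over text with entities already decoded, the sketch escapes it again when rebuilding the markup. Comments, processing instructions and CDATA boundaries are not preserved.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: buffers each <doc> as a List of Strings, keeps it on a match.
final class DocBuffer extends DefaultHandler {
    private final String keyword;                           // assumed match criterion
    private final StringBuilder out = new StringBuilder();  // stand-in for the output file
    private final List<String> pieces = new ArrayList<>();  // markup of the current <doc>
    private final StringBuilder text3 = new StringBuilder();
    private boolean inDoc, inText3, keep;

    DocBuffer(String keyword) { this.keyword = keyword; }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (qName.equals("doc")) { inDoc = true; keep = false; pieces.clear(); }
        if (!inDoc) return;
        if (qName.equals("text3")) { inText3 = true; text3.setLength(0); }
        StringBuilder tag = new StringBuilder("<").append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            tag.append(' ').append(atts.getQName(i)).append("=\"")
               .append(escape(atts.getValue(i), true)).append('"');
        }
        pieces.add(tag.append('>').toString());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!inDoc) return;
        String s = new String(ch, start, length);
        pieces.add(escape(s, false));        // re-escape what the parser decoded
        if (inText3) text3.append(s);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (!inDoc) return;
        pieces.add("</" + qName + ">");
        if (qName.equals("text3")) {
            inText3 = false;
            if (text3.indexOf(keyword) >= 0) keep = true;
        }
        if (qName.equals("doc")) {
            if (keep) {
                for (String p : pieces) out.append(p);
                out.append('\n');
            }
            pieces.clear();                  // discard the buffer either way
            inDoc = false;
        }
    }

    // Quotes only need escaping inside attribute values.
    private static String escape(String s, boolean attribute) {
        String r = s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        return attribute ? r.replace("\"", "&quot;") : r;
    }

    static String filter(String xml, String keyword) {
        try {
            DocBuffer handler = new DocBuffer(keyword);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }
}
```

Only one <doc> is ever held in memory, which matches the constraint that a single doc fits but the whole file does not.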

boris · Aug 24, 2011

I tried to accumulate the whole XML (<doc>...</doc>) as a string using
SAX, but in this case all special characters are processed by the parser
and arrive as plain characters, not as "predefined entities" like &quot;

Using StAX, I get correct XML if I print events right away, but if I
store them in a collection and print them later, I don't get the same result.

Andreas Leitgeb · Aug 24, 2011

boris said:
Using StAX, I get correct XML if I print events right away, but if I
store them in a collection and print them later, I don't get the same result.

That sounds more like a bug in your code for "storing" and "printing later"
than a problem with stax itself.
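
Since the code in question was never posted, the cause can only be guessed at. One plausible culprit (an assumption) is keeping references tied to the cursor-style XMLStreamReader, whose state changes on every next(). The XMLEvent objects returned by XMLEventReader are separate objects and can be kept in a List and written out later with an XMLEventWriter. A minimal sketch along those lines, with the class name StaxDocFilter, the keyword test and the per-document writer all being illustrative choices:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical filter: buffers the events of each <doc>, writes them on a match.
final class StaxDocFilter {
    static void copyMatchingDocs(Reader in, Writer out, String keyword)
            throws XMLStreamException, IOException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
        List<XMLEvent> buffer = new ArrayList<>();
        StringBuilder text3 = new StringBuilder();
        boolean inDoc = false, inText3 = false, keep = false;

        while (reader.hasNext()) {
            XMLEvent e = reader.nextEvent();
            if (e.isStartElement()) {
                String name = e.asStartElement().getName().getLocalPart();
                if (name.equals("doc")) { inDoc = true; keep = false; buffer.clear(); }
                if (name.equals("text3")) { inText3 = true; text3.setLength(0); }
            }
            if (inDoc) buffer.add(e);   // store the event object itself
            if (inText3 && e.isCharacters()) text3.append(e.asCharacters().getData());
            if (e.isEndElement()) {
                String name = e.asEndElement().getName().getLocalPart();
                if (name.equals("text3")) {
                    inText3 = false;
                    if (text3.indexOf(keyword) >= 0) keep = true;
                }
                if (name.equals("doc")) {
                    if (keep) {
                        // One writer per <doc>, so each is serialized as its own fragment.
                        XMLEventWriter w = outFactory.createXMLEventWriter(out);
                        for (XMLEvent b : buffer) w.add(b);
                        w.flush();
                        w.close();
                        out.write('\n');
                    }
                    buffer.clear();
                    inDoc = false;
                }
            }
        }
        out.flush();
    }

    // Convenience wrapper for small inputs.
    static String filter(String xml, String keyword) {
        StringWriter out = new StringWriter();
        try {
            copyMatchingDocs(new StringReader(xml), out, keyword);
        } catch (XMLStreamException | IOException e) {
            throw new IllegalArgumentException("could not process XML", e);
        }
        return out.toString();
    }
}
```

The writer re-escapes characters such as & on output, so nothing has to be escaped by hand.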

Arne Vajhøj · Aug 25, 2011

boris said:
I tried to accumulate the whole XML (<doc>...</doc>) as a string using SAX,
but in this case all special characters are processed by the parser
and arrive as plain characters, not as "predefined entities" like &quot;

Using StAX, I get correct XML if I print events right away, but if I
store them in a collection and print them later, I don't get the same
result.

Any correct XML parser should convert the XML &quot; to a " in
a Java String.

Any correct XML formatter/serializer should convert it back again
when generating new XML.

Arne
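
A small round trip illustrates this. The sketch below uses StAX for both directions; the class and method names are made up for the example. It deliberately does not assume which characters the writer escapes in text content, only that re-parsing its output gives back the same string.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper showing parse -> Java String -> serialize.
final class EntityRoundTrip {
    /** Returns the decoded text content of the root element. */
    static String parseText(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.next() != XMLStreamConstants.START_ELEMENT) {
                // skip anything before the root element
            }
            return r.getElementText();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Serializes text as the content of a single element, escaping as needed. */
    static String writeText(String elementName, String text) {
        try {
            StringWriter sw = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(sw);
            w.writeStartElement(elementName);
            w.writeCharacters(text);   // the writer escapes markup characters
            w.writeEndElement();
            w.close();
            return sw.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not write XML", e);
        }
    }

    public static void main(String[] args) {
        String text = parseText("<foo>a &quot;b&quot; &amp; c</foo>");
        System.out.println(text);                    // decoded Java String
        System.out.println(writeText("foo", text));  // escaped again on output
    }
}
```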

Stanimir Stamenkov · Aug 25, 2011

Wed, 24 Aug 2011 19:10:26 -0400, /Arne Vajhøj/:

Any correct XML parser should convert the XML &quot; to a " in
a Java String.

Any correct XML formatter/serializer should convert it back again
when generating new XML.

I think any sane XML serializer should not output " as &quot; in
text content.

RedGrittyBrick · Aug 25, 2011

Stanimir Stamenkov said:
I think any sane XML serializer should not output " as &quot; in text
content.

If you use an XML parser to read '<foo delimiter="&quot;">...' you will
get a structure with an attribute with a value of '"'.

If you serialise that structure back to XML again, I would hope to get
'<foo delimiter="&quot;">...' again. Am I wrong?

Stanimir Stamenkov · Aug 26, 2011

Thu, 25 Aug 2011 10:39:17 +0100, /RedGrittyBrick/:

If you use an XML parser to read '<foo delimiter="&quot;">...' you
will get a structure with an attribute with a value of '"'.

If you serialise that structure back to XML again, I would hope to
get '<foo delimiter="&quot;">...' again. Am I wrong?

The serializer may choose (or be configured) to output:

<foo delimiter='"'>...

But my point was text content, not attribute values:

<foo>&quot;</foo>

and then:

<foo>"</foo>
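
To a parser, the two spellings are the same document content, in attributes as well as in text. A quick check, sketched with StAX (the class name QuoteForms is made up for the example):

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: parses a one-element document and returns
// { value of the "delimiter" attribute (or null), text content }.
final class QuoteForms {
    static String[] parse(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.next() != XMLStreamConstants.START_ELEMENT) {
                // skip anything before the root element
            }
            String attr = r.getAttributeValue(null, "delimiter");
            String text = r.getElementText();
            return new String[] { attr, text };
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String[] escaped = parse("<foo delimiter=\"&quot;\">&quot;</foo>");
        String[] literal = parse("<foo delimiter='\"'>\"</foo>");
        System.out.println(escaped[0].equals(literal[0]));  // same attribute value?
        System.out.println(escaped[1].equals(literal[1]));  // same text content?
    }
}
```

So a serializer is free to pick either form, as long as the result parses back to the same value.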
