large xml file...

boris

hi all,
I need to process a large XML file and dump some documents to a different
file based on the content of some elements.

Let's say I need to check the content of <text3> and, depending on it, dump
the whole <doc> to a different file:

<doc>
<text1>
<text2>
<text3> ... etc

</doc>

I'm trying to do this using SAX. Are there any examples of how to do this?
Is SAX OK for this task?
thanks.
 
Ian Shef

What you are asking is unclear to me.
Do you mean that <text3> will determine whether you dump the whole <doc> to
another file?
Do you mean that <text3> will determine what file the whole <doc> will be
dumped to?
Or do you mean that the whole <doc> will be dumped to some other file, and
while you are at it, <text3> will also be checked and reported in some way?

Can you read the "large xml file" twice?
Can you put the whole "large xml file" (or at least the part preceding
<text3>) into memory?
Can you copy the "large xml file" to another file while it is being
processed?

Sorry about the questions, but I need clarification. I have used SAX and
may be able to provide enlightenment. SAX has its uses, but it does not
remember anything for you: if earlier content is needed later, _you_ must
provide the memory. SAX excels when processing can take place in a single
pass with very little looking backwards. Consequently, it does not use as
much memory as some other methods.
 
boris

> Do you mean that <text3> will determine whether you dump the
> whole <doc> to another file?

Yes.

> Can you read the "large xml file" twice?

I would like to read it once.

> Can you put the whole "large xml file" (or at least the part preceding
> <text3>) into memory?

No.
 
Arne Vajhøj

SAX or StAX seem like the most obvious choices given the context.

Any textbook SAX example should lead you to working code.

I can post some code, but I doubt that it will show anything
that various books and tutorials do not.

Arne
 
Ian Shef

No, I can load the whole file. 1 doc is not a problem...

As you are processing, you can save the XML yourself (e.g. as a List of
String_s).

Based on the result of evaluating <text3>, you can choose to:

Open an output file, copy the List of String_s to the output file, and copy
any succeeding XML to the output file, or discard the List and discontinue
processing.

Alternatively, you can save the XML to a file as you process it. When you
evaluate <text3>, you can choose to continue saving to the file, or delete
the file and discontinue processing.
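
The first suggestion (buffer the current <doc>, then decide) might be sketched with SAX roughly as follows. This is only an illustration under assumptions not stated in the thread: the class name SaxDocFilter, the "contains a keyword" test on <text3>, and the Consumer sink are hypothetical, and <doc> elements are assumed not to nest. Because the parser delivers decoded characters, the sketch escapes &, < and > (plus " in attribute values) again when it rebuilds the markup.

```java
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: hands every <doc> whose <text3> contains a keyword to a sink.
class SaxDocFilter extends DefaultHandler {
    private final String keyword;                             // assumed selection rule
    private final Consumer<String> sink;                      // receives each selected <doc>
    private final StringBuilder doc = new StringBuilder();    // re-serialized current <doc>
    private final StringBuilder text3 = new StringBuilder();  // text seen inside <text3>
    private boolean inDoc, inText3;

    SaxDocFilter(String keyword, Consumer<String> sink) {
        this.keyword = keyword;
        this.sink = sink;
    }

    // The parser has already decoded entities, so markup characters are escaped again.
    static String escape(String s, boolean inAttribute) {
        String r = s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        return inAttribute ? r.replace("\"", "&quot;") : r;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (qName.equals("doc")) {           // a new document starts: reset the buffers
            inDoc = true;
            doc.setLength(0);
            text3.setLength(0);
        }
        if (!inDoc) return;
        if (qName.equals("text3")) inText3 = true;
        doc.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            doc.append(' ').append(atts.getQName(i)).append("=\"")
               .append(escape(atts.getValue(i), true)).append('"');
        }
        doc.append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!inDoc) return;
        String s = new String(ch, start, length);  // may be only a chunk of the text
        if (inText3) text3.append(s);
        doc.append(escape(s, false));
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (!inDoc) return;
        doc.append("</").append(qName).append('>');
        if (qName.equals("text3")) inText3 = false;
        if (qName.equals("doc")) {           // the document is complete: keep or discard
            inDoc = false;
            if (text3.indexOf(keyword) >= 0) sink.accept(doc.toString());
        }
    }

    static void filter(InputSource in, String keyword, Consumer<String> sink) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
            .parse(in, new SaxDocFilter(keyword, sink));
    }
}
```

In real use the InputSource would wrap the large file and the sink would append each string to the output file, so only one <doc> is held in memory at a time.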
 
boris

I tried to accumulate the whole XML (<doc>...</doc>) as a string using
SAX, but in this case all special characters are processed by the parser
and arrive as plain characters, not as "predefined entities" like &quot;

Using StAX, I get correct XML if I print events right away, but if I
store them in a collection and print them later, I don't get the same result.
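
For the StAX case, here is a minimal sketch that stores events and writes them later. It assumes the iterator-style XMLEventReader, whose XMLEvent objects are immutable and safe to keep in a collection. The cursor-style XMLStreamReader is a single object whose state changes on every next(), so storing it would not preserve earlier events (whether that is what happened here is a guess, since the code was not posted). The class name StaxDocFilter, the <selected> wrapper element and the keyword rule are made up for the example.

```java
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: copies every <doc> whose <text3> contains a keyword.
class StaxDocFilter {
    // Returns the number of <doc> elements copied to the output.
    static int copyMatching(Reader in, Writer out, String keyword) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        XMLEventFactory events = XMLEventFactory.newInstance();
        List<XMLEvent> buffer = new ArrayList<>();   // events of the current <doc>
        StringBuilder text3 = new StringBuilder();   // text seen inside <text3>
        boolean inDoc = false, inText3 = false;
        int copied = 0;

        // Wrap the selected documents in one root element (an assumption of this sketch).
        writer.add(events.createStartElement("", "", "selected"));
        while (reader.hasNext()) {
            XMLEvent e = reader.nextEvent();
            if (e.isStartElement()) {
                String name = e.asStartElement().getName().getLocalPart();
                if (name.equals("doc")) {
                    inDoc = true;
                    buffer.clear();
                    text3.setLength(0);
                } else if (name.equals("text3")) {
                    inText3 = true;
                }
            }
            if (inDoc) buffer.add(e);                // store now, write later
            if (inText3 && e.isCharacters()) text3.append(e.asCharacters().getData());
            if (e.isEndElement()) {
                String name = e.asEndElement().getName().getLocalPart();
                if (name.equals("text3")) {
                    inText3 = false;
                } else if (name.equals("doc")) {
                    inDoc = false;
                    if (text3.indexOf(keyword) >= 0) {
                        for (XMLEvent stored : buffer) writer.add(stored);
                        copied++;
                    }
                }
            }
        }
        writer.add(events.createEndElement("", "", "selected"));
        writer.flush();
        return copied;
    }
}
```

The writer escapes character data again on output, so an & that was read from &amp; is written back as &amp; rather than as a bare ampersand.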
 
Andreas Leitgeb

boris said:
Using StAX, I get correct XML if I print events right away, but if I
store them in a collection and print them later, I don't get the same result.

That sounds more like a bug in your code for "storing" and "printing later"
than a problem with stax itself. ;)
 
Arne Vajhøj

Any correct XML parser should convert the XML &quot; to a " in
a Java String.

Any correct XML formatter/serializer should convert it back again
when generating new XML.

Arne
 
Stanimir Stamenkov

I think any sane XML serializer should not output " as &quot; in
text content.
 
RedGrittyBrick

If you use an XML parser to read '<foo delimiter="&quot;">...' you will
get a structure with an attribute with a value of '"'.

If you serialise that structure back to XML again, I would hope to get
'<foo delimiter="&quot;">...' again. Am I wrong?
 
Stanimir Stamenkov

The serializer may choose (or be configured) to output:

<foo delimiter='"'>...

But my point was text content, not attribute values:

<foo>&quot;</foo>

and then:

<foo>"</foo>
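
The attribute versus text distinction can be checked with a small round trip. This is a sketch that assumes the StAX implementation bundled with the JDK; the class and method names are invented for illustration. Which escaped form the writer emits is up to the implementation, but either form should parse back to the same string.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical demo class: writes a value as attribute and as text, then reads it back.
class QuoteRoundTrip {
    // Serializes <foo delimiter="VALUE">VALUE</foo>; the writer decides how to escape.
    static String write(String value) throws Exception {
        StringWriter sw = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(sw);
        w.writeStartElement("foo");
        w.writeAttribute("delimiter", value);
        w.writeCharacters(value);
        w.writeEndElement();
        w.flush();
        w.close();
        return sw.toString();
    }

    // Parses the XML and returns { attribute value, text content }.
    static String[] read(String xml) throws Exception {
        XMLStreamReader r =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        String attr = null;
        StringBuilder text = new StringBuilder();
        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                attr = r.getAttributeValue(null, "delimiter");
            } else if (event == XMLStreamConstants.CHARACTERS) {
                text.append(r.getText());   // text may arrive in several chunks
            }
        }
        return new String[] { attr, text.toString() };
    }
}
```

Calling read(write("\"")) should give back a plain " for both the attribute and the text, and so should reading <foo delimiter="&quot;">&quot;</foo> or <foo delimiter='"'>"</foo> directly.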
 
