Java and huge XML file to be parsed

Katrin Tomanek

Hi everybody,

I've got a really big XML file (about 215 MB) which I have to parse.

So, my question is: what would be the best solution: DOM, SAX, JDOM?
Anything else? And is it possible at all to parse XML files this
huge?

I already tried JDOM. I set my JVM to 512 MB of RAM, but after
one hour I still got an out-of-memory exception.

I thought that maybe SAX might be better, since it is not tree-based.
What do you think, given a 215 MB file?

OK, I am happy about every answer and hint I can get. Thanks in advance.

Katrin
 
Sudsy

Katrin said:
Hi everybody,

I've got a really big XML File (about 215 MBytes), which I have to parse.

SAX is really your only option. DOM has to build the document in memory.
Even if you have a 64-bit processor with GBs of virtual memory...
SAX is also good if you need to process data "on-the-fly"; DOM requires
the document to be complete before the parser returns.
Different tools for different scenarios.
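
To make the SAX side of that comparison concrete, here is a minimal sketch of a streaming handler. It only counts element names, which is an assumption made for illustration; the class and method names are hypothetical and nothing in the thread says what the real file contains. The point is that the parser hands over one event at a time, so memory use does not grow with the size of the file.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxElementCounter {

    /** Streams the input once and counts how often each element name occurs. */
    static Map<String, Integer> countElements(InputStream in) {
        final Map<String, Integer> counts = new TreeMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Called once per start tag; no document tree is built.
                counts.merge(qName, 1, Integer::sum);
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(in, handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Tiny made-up document; for a real file, pass a FileInputStream instead.
        String xml = "<records><record id=\"1\"/><record id=\"2\"/></records>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(countElements(in)); // prints {record=2, records=1}
    }
}
```

Whatever per-record work is needed would go into the handler callbacks; anything the handler keeps in fields is the only state that stays in memory.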
 
Malcolm Dew-Jones

Katrin Tomanek ([email protected]) wrote:
: Hi everybody,

: I've got a really big XML File (about 215 MBytes), which I have to parse.

: So, my question is: what would be the best solution: DOM, SAX, JDOM?
: Anything else? And is it possible at all to parse XML files this
: huge?

: I already tried JDOM. I set my JVM to 512 MB of RAM, but after
: one hour I still got an out-of-memory exception.

: I thought that maybe SAX might be better, since it is not tree-based.
: What do you think, given a 215 MB file?

: OK, I am happy about every answer and hint I can get. Thanks in advance.

I would think that this is exactly the sort of situation for which
SAX is intended.
 
Roedy Green

I've got a really big XML File (about 215 MBytes), which I have to parse.
ARRGH. That file is probably 20 times the size it would be if stored
in some sensible format. It will take 100 times as long to parse as
some sensible binary format would.


PHOOEY ON XML! I knew this insanity would happen.

See http://mindprod.com/jgloss/xml.html
 
Roedy Green

It uses HTML's fluffy system of entities such as &nbsp;

"&nbsp;" has no specific meaning in XML:

If you can discern that from that endlessly recursive XML spec, more
power to you.
 
Sudsy

Roedy Green wrote:
....
ARRGH. That file is probably 20 times the size it would be if stored
in some sensible format. It will take 100 times as long to parse as
some sensible binary format would.


PHOOEY ON XML! I knew this insanity would happen.

C'mon, Roedy: XML has a place in the overall scheme of things. I
wouldn't use it for database replication, and 215 MB seems a tad
excessive, but at least it's a lingua franca for inter-connected
systems. We can be free of the bonds of proprietary formats and
encoded approaches like EDI. Try modifying those with a simple
text editor!
 
Roedy Green

Try modifying those with a simple
text editor!

Why use an ancient tool like that? It is like doing data entry with
NOTEPAD. For heaven's sake. Surely we could create an editor that
created, edited and searched a compact XML-like representation that
made it IMPOSSIBLE to create syntax errors and almost correct data.

It is not as though we failed to notice what a MESS HTML became from
lack of such a representation. The idiots took the worst features of
HTML.

It is amazing that such an IDIOTIC format caught on.

It is proof of man's attraction to the trashy -- along with McDonald's
fast food success.
 
Sudsy

Roedy said:
Why use an ancient tool like that? It is like doing data entry with
NOTEPAD. For heaven's sake. Surely we could create an editor that
created, edited and searched a compact XML-like representation that
made it IMPOSSIBLE to create syntax errors and almost correct data.

Again, look to the genesis of the specification. While nobody can be
reasonably expected to mentally decode base64 content, the basis for
XML is that it is human-readable. As such, it is editable using the
most basic tools.
You seem to be promoting tools which operate at a much higher level
rather than the LCD (lowest common denominator).
But can everyone afford to shell out for the latest version of
Macromedia Flash MX? It's priced between US$500-700, depending on
whether you choose the basic or "Professional" version. Should you
expect everyone to pony up that kind of money?
Ever look at how much it costs to create/serve RealPlayer or
QuickTime streaming?
If you want to make bags of money and promote your own proprietary
format/protocol (kind of reminds me of the M$ "commoditization" of
established network protocols) then be my guest.
I stand by my assertion: XML provides a platform-neutral exchange
framework.
FWIW, Web Services (and, by definition, SOA) utilizes a foundation
of XML.
So although you might detest it, XML has a place in the "bigger
picture" and is one of the prime candidates for bridging to the
"dark side", also known as .NET (tm, sm, whatever...)
 
Tim Ward

Katrin Tomanek said:
I've got a really big XML File (about 215 MBytes), which I have to parse.

Why is it in XML, how often does it change, and what do you have to do with
it when you've parsed it (and other such problem scoping questions, such as,
why are you assuming that the solution is some Java code)? Once that is
known there's a whole range of possible solutions including but not limited
to:

(1) Java and SAX.
(2) Convert it to a proper database first, then do the queries in SQL.
(3) A DOM approach in C++.
(4) ...
 
Katrin Tomanek

Hi again,

....coming up with new problems.

After most people told me to solve the problem with SAX, I did
that and got a new problem.
I have a very simple SAXParser with a DefaultHandler, nothing special.
When I just try to go through the whole 215 MB file, I get an error which
reads roughly like this in English:
org.xml.sax.SAXParseException: The parser has reached the limit of
"64,000" for entity expansion, which was set by the
application.

(Sorry for the rough translation; for some strange reason I get a German
error message saying:
org.xml.sax.SAXParseException: Der Parser hat den von der Anwendung
gesetzten Grenzwert "64.000" für die Erweiterung der Entität erreicht.)

Does anyone have an idea what this means, how I could change this value,
and why this error occurs?

thx again...
Katrin
 
mromarkhan

Katrin said:
Hi again,

org.xml.sax.SAXParseException: The parser has reached the limit of
"64,000" for entity expansion, which was set by the
application.

(Sorry for the rough translation; for some strange reason I get a German
error message saying:
org.xml.sax.SAXParseException: Der Parser hat den von der Anwendung
gesetzten Grenzwert "64.000" für die Erweiterung der Entität erreicht.)

Does anyone have an idea what this means, how I could change this value,
and why this error occurs?

Peace be unto you.




Your XML file contains entities, and the parser has hit its limit on how many it will expand.

Solution:

java -Xms512m -Xmx512m -DentityExpansionLimit=512000 ThreadMessages
Author: DrClap
http://forum.java.sun.com/thread.jsp?forum=34&thread=515796&tstart=60&trange=15

or

System.setProperty("entityExpansionLimit", "512000");
Author: jatiin
http://forum.java.sun.com/thread.jsp?forum=34&thread=515796&tstart=60&trange=15

"The entityExpansionLimit system property lets existing applications
constrain the total number of entity expansions without recompiling
the code. The parser throws a fatal error once it has reached the
entity expansion limit. (By default, no limit is set.)

To set the entity expansion limit using the system property, use
an option like the following on the java command line:
-DentityExpansionLimit=100000"
http://java.sun.com/webservices/docs/1.2/jaxp/ReleaseNotes.html

Have a good evening.
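
For anyone who prefers the programmatic route, the following is a rough sketch of where the System.setProperty call would sit relative to parser creation. The property name "entityExpansionLimit" and the value 512000 come from the posts quoted above; the class name, method name and the tiny test document are made up for illustration. Newer JDKs also document a property called jdk.xml.entityExpansionLimit, so check which name your runtime honours.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

class EntityLimitSketch {

    /** Parses the XML after setting the limit and returns how many text characters were seen. */
    static int parseWithLimit(String xml, int limit) {
        // Set the property before the parser is created so the parser can pick it up.
        System.setProperty("entityExpansionLimit", Integer.toString(limit));
        final int[] textChars = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                textChars[0] += length; // expanded entity text arrives here
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return textChars[0];
    }

    public static void main(String[] args) {
        // Hypothetical document: one internal entity referenced 100 times.
        StringBuilder xml = new StringBuilder("<!DOCTYPE r [<!ENTITY e \"ab\">]><r>");
        for (int i = 0; i < 100; i++) {
            xml.append("&e;");
        }
        xml.append("</r>");
        System.out.println(parseWithLimit(xml.toString(), 512000));
    }
}
```

The command-line form (-DentityExpansionLimit=...) has the advantage that no code needs to change; the programmatic form only helps if it runs before the first parser is created.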
 
Roedy Green

And what specifically is wrong with allowing someone to edit it
with the simplest of tools? That isn't even an option with a
binary format.

Because that introduces the option of error. If you use the proper
tool you don't litter the Internet with malformed files.

Look at the mess HTML is in because we allow hand editing and
publishing. If HTML had to go through a processor before being
published it would be very unlikely you would have malformed published
files, and browsers would not have to deal with such crap.

You don't use notepad to edit your Oracle files. You should not be
using it on any other form of structured data either. It is like using a
word processor to do your accounting. You defeat the possible error
checking.
 
Roedy Green

I find that to be a gross exaggeration, but neither of us has
hard data. I would also say that the development time for coding
the parser and editor for a binary format is 100 times that of
using XML.

Of course, BUT in a sane world XML would be a binary format and there
would be generic parsers available. Then you would solve three of
XML's biggest problems:

1. fluffiness.
2. malformed files being passed around.
3. complicated parsers just to read it. You want something much faster
and simpler for handheld units.
 
Stan Berka

Looks like StAX would be a good choice. Am not sure where to find an
implementation, though.

Stan Berka
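
For readers landing here later: StAX was added to the standard library as javax.xml.stream in Java SE 6, so on such a JDK no separate implementation is needed. Below is a minimal sketch of the pull style it offers, where the application asks for the next event instead of receiving callbacks. The class name, method name and the choice to count elements are assumptions made for illustration only.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxSketch {

    /** Pulls events one at a time and counts start tags with the given local name. */
    static int countElements(Reader in, String name) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            int count = 0;
            while (reader.hasNext()) {
                // next() advances to the following event and returns its type.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && name.equals(reader.getLocalName())) {
                    count++;
                }
            }
            reader.close();
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // Tiny made-up document; a real run would wrap a FileReader or InputStream.
        String xml = "<records><record/><record/><record/></records>";
        System.out.println(countElements(new StringReader(xml), "record")); // prints 3
    }
}
```

Like SAX, this reads the document as a stream, so memory use stays flat for a 215 MB file; the difference is only in who drives the loop.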
 
Dimitri Maziuk

Roedy Green sez:
Because that introduces the option of error. If you use the proper
tool you don't litter the Internet with malformed files.

Look at the mess HTML is in because we allow hand editing and
publishing. If HTML had to go through a processor before being
published it would be very unlikely you would have malformed published
files, and browsers would not have to deal with such crap.

LOL. Roedy, either you've never looked at output of any "HTML processor",
or you're posting from a parallel universe.

Dima
 
Roedy Green

LOL. Roedy, either you've never looked at output of any "HTML processor",
or you're posting from a parallel universe.

You are missing my point. I believe that for both XML and HTML, the thing
actually posted should be a binary format. No one would ever read or
edit it directly; it would be guaranteed to meet the spec, preparsed. Anything
hand-coded with notepad is guaranteed to have some errors. Even
though I validate my HTML daily, you will always find some HTML errors
in there, and also some quasi errors that I tell the verifier to
ignore. My site is very clean compared with most.

See http://mindprod.com/jgloss/xml.html and
http://mindprod.com/projects/htmlcompactor.html for the sort of
formats I had in mind.


When you want to view the HTML/XML you use a viewer or editor.
Traditionalists could fluff it up to something like conventional HTML or
XML for viewing. I would prefer something more graphic like a JTree or
WYSIWYG.

How many of you are old enough to remember WordStar? It was
conceptually easy to understand because you embedded visible tags in
your text. Then Word came along and hid the tags, and just let you
think in terms of the final outcome. It drove everyone mad at first
since Word did such a bad job of the internal tags, but in the long
run the impossibility of getting invalid or unbalanced tags won out.

XML is just about data, so you don't have that same problem. With
HTML it would be a lot easier to collapse and clean up a preparsed tree.
 
