Luc Mercier
Hi Folks,
I'm new here, and I need some advice on which tool to use.
I'm using XML for benchmarking purposes. I'm writing some scientific
programs which I want to analyze. My program generates large XML logs
giving semi-structured information on the flow of the program. The XML
tree looks like the method calls tree, but at a much higher level, and I
add many values of some variables.
There is no predefined schema, and often as I modify my program I will
add some new tags and new information to put into the log.
Once a log is written, I never modify the document.
To analyze the data, I had an /almost/ perfect solution: from Matlab, I
would call the methods of the Java library dom4j. Typically, I would
load a document, then dump values of attributes matching an XPath
expression into a Matlab array, then do some stats or plotting. I'm very
happy with the comfort and the ease of this solution: no DB to set up,
just load a document, and Matlab gives you an environment in which
you can call java methods without creating a java program, so it's very
easy to debug the XPath expressions you pass to dom4j's "selectNodes"
method.
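To make that workflow concrete, here is a minimal sketch of the "load, select by XPath, dump to an array" step. It is only an illustration: it uses the JDK's built-in javax.xml.xpath API rather than dom4j so that it runs without extra jars, and the element and attribute names (run, step, time) as well as the class and method names are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathDump {
    /**
     * Parses the whole XML text into an in-memory DOM tree and returns the
     * numeric values of all nodes selected by the XPath expression.
     * (Hypothetical helper; the dom4j equivalent would go through selectNodes.)
     */
    static double[] dumpValues(String xml, String xpathExpr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(xpathExpr, doc, XPathConstants.NODESET);

        double[] values = new double[nodes.getLength()];
        for (int i = 0; i < nodes.getLength(); i++) {
            // For attribute nodes, getNodeValue() returns the attribute's text.
            values[i] = Double.parseDouble(nodes.item(i).getNodeValue());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Tiny made-up log; the real logs are far larger.
        String log = "<run><step name='init' time='0.5'/>"
                   + "<step name='solve' time='12.25'/></run>";
        for (double t : dumpValues(log, "//step/@time")) {
            System.out.println(t);
        }
    }
}
```

The point to notice is that the whole tree is built in memory before any XPath is evaluated, which is exactly what stops scaling once the documents get large.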
Now, the problem is, it's perfect for documents of a few tens of
megabytes, but now I would like to process documents of several hundred
MB to, let's say, maybe 10 GB (that's a fairly large upper bound).
It seems I have to give up on dom4j for that. I have tried to use
eXist to create a DB with my documents, and all I got was a lot of
(rather violent) crashes when I tried to run the first example they give
in the doc for retrieving a document via the XML:DB API. Then I tried
BerkeleyDB XML, which I have not been able to install. I then tried
xmlDB, but as I tried to import a first document into a collection I got
a "java.lang.OutOfMemoryError: Java heap space" and found no mention in
the doc of how to specify the heap space.
After these three unsuccessful attempts, I'd like to ask for some advice!
To summarize, my needs are:
* Processing (very) large XML documents
* Need for XPath
* Java API, to be able to call from Matlab
* Read-only processing
* Single user, no security issues, no remote access needed
* Platform: Java if possible, otherwise Linux/Debian on x86.
I welcome any suggestion.
- Luc Mercier.