SAX succeeds, but StAX fails

Kai Schlamp

Hi!

I tried to parse PubMed (a biomedical article database) with both SAX
and StAX. The latter failed, but I am not sure why (see the exception
below).
Why does SAX succeed while StAX doesn't?
The XML document seems to be fine (see
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=11748933&retmode=xml).
Any suggestions?

Kai

StAX example:
    String address = "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=11748933&retmode=xml";
    URL url = new URL(address);

    XMLInputFactory factory = XMLInputFactory.newInstance();
    XMLStreamReader parser =
        factory.createXMLStreamReader(url.openConnection().getInputStream());

    while (parser.hasNext()) {
        switch (parser.getEventType()) {
            // event handling omitted
        }
        parser.next();
    }

Error message:
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[50,39]
Message: A '(' character or an element type is required in the
declaration of element type "PubMedPubDate".

SAX example:
    SAXParserFactory parserFactory = SAXParserFactory.newInstance();
    parserFactory.setValidating(true);
    parserFactory.setNamespaceAware(true);
    SAXParser parser = parserFactory.newSAXParser();
    parser.parse(url.openConnection().getInputStream(), new PubmedEFetchHandler());

(PubmedEFetchHandler is a simple DefaultHandler with some debugging
output.)
 
GArlington

> Hi!
>
> I tried to parse PubMed (a biomedical article database) with both SAX
> and StAX. The latter failed, but I am not sure why (see the exception
> below).
> Why does SAX succeed while StAX doesn't?
> The XML document seems to be fine (see http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=11...)

As far as I can see this request DOES NOT generate valid xml (or any
xml).
 
Kai Schlamp

OK, I checked the new link again and the problem remains. When I click
the link and it opens in Firefox, it is indeed not XML.
But when you then press the "Go To" button (the green button to the
right of the URL input field), the valid XML appears. I am not sure why
this happens, but it has nothing to do with my original problem. It
seems to be a minor Firefox issue.
 
GArlington

OK, I tried accessing it with IE and it worked the first time. I
thought I had tried it in IE yesterday too, but...
I fetched your URL and parsed it (with my own methods) and it works,
so I suspect that there is a problem with StAX...
The only thing I can suggest is to dump what you get from your URL
BEFORE you try to parse it, and then dump the data at each step until
you reach your error. This will help you find where the problem first
shows its ugly head...
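
The dump-then-parse suggestion above could look roughly like the following sketch. The class name DumpBeforeParse and the helper readFully are made up for illustration, and an in-memory string stands in for url.openConnection().getInputStream() so that the sketch runs without network access.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

class DumpBeforeParse {

    // Hypothetical helper: copies the whole stream into memory so that the
    // same bytes can be printed first and parsed afterwards.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for url.openConnection().getInputStream(); any InputStream works.
        InputStream source = new ByteArrayInputStream(
                "<?xml version=\"1.0\"?><doc><item>1</item></doc>"
                        .getBytes(StandardCharsets.UTF_8));

        // Step 1: dump exactly what was received (assumes UTF-8 for display).
        byte[] raw = readFully(source);
        System.out.println(new String(raw, StandardCharsets.UTF_8));

        // Step 2: parse the very same bytes and log every event with its line,
        // so the last line printed before an exception shows where it broke.
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new ByteArrayInputStream(raw));
        while (reader.hasNext()) {
            int event = reader.next();
            System.out.println("event " + event + " at line "
                    + reader.getLocation().getLineNumber());
        }
        reader.close();
    }
}
```

Buffering the response also rules out differences between what was dumped and what was parsed, because both steps read the same byte array.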
 
Kai Schlamp

I still have the same problem with StAX. I dumped the output of the
URL before parsing it, and it seems to be fine and well-formed.
But parsing with StAX still gives me an exception in the very first
loop iteration (SAX seems to work fine).
Below is a small test class. Can someone explain to me why this
happens?
I also tried copying the output of the URL into a file and parsing it
directly from disk, but that didn't solve the problem.
Perhaps I should try another StAX provider. I found one on the net
named Woodstox. Are there any others? What is the default
implementation? An Apache project?

The error output of the test class below:

START_DOCUMENT: 1.0
beforeNext
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[50,39]
Message: A '(' character or an element type is required in the
declaration of element type "PubMedPubDate".
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:588)
    at StaxTester.main(StaxTester.java:49)

The test class:

import java.net.URL;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxTester {

    public static void main(String[] args) {
        try {
            String address = "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=11748933";
            //String address = "http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pmc&term=stem+cells+AND+free+fulltext[filter]";
            URL url = new URL(address);

            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader parser =
                factory.createXMLStreamReader(url.openConnection().getInputStream());

            while (parser.hasNext()) {
                // Report the current event, then advance to the next one.
                switch (parser.getEventType()) {
                    case XMLStreamConstants.START_DOCUMENT:
                        System.out.println("START_DOCUMENT: " + parser.getVersion());
                        break;

                    case XMLStreamConstants.END_DOCUMENT:
                        System.out.println("END_DOCUMENT: ");
                        parser.close();
                        break;

                    case XMLStreamConstants.NAMESPACE:
                        System.out.println("NAMESPACE: " + parser.getNamespaceURI());
                        break;

                    case XMLStreamConstants.START_ELEMENT:
                        System.out.println("START_ELEMENT: " + parser.getLocalName());
                        break;

                    case XMLStreamConstants.CHARACTERS:
                        if (!parser.isWhiteSpace())
                            System.out.println("CHARACTERS: " + parser.getText());
                        break;

                    case XMLStreamConstants.END_ELEMENT:
                        System.out.println("END_ELEMENT: " + parser.getLocalName());
                        break;

                    default:
                        break;
                }
                System.out.println("beforeNext");
                parser.next();
                System.out.println("afterNext");
            }

            /** SAX succeeds. Why is that? */
            // SAXParserFactory parserFactory = SAXParserFactory.newInstance();
            // parserFactory.setValidating(true);
            // parserFactory.setNamespaceAware(true);
            // SAXParser parser = parserFactory.newSAXParser();
            // parser.parse(url.openConnection().getInputStream(), new PubmedEFetchHandler());
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}
 
Owen Jacobson

> Hi!
>
> I tried to parse PubMed (a biomedical article database) with both SAX
> and StAX. The latter failed, but I am not sure why (see the exception
> below).
> Why does SAX succeed while StAX doesn't?
> The XML document seems to be fine (see http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=11...)
> Any suggestions?
>
> ...
>
>     String address = "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=11748933&retmode=xml";
>     URL url = new URL(address);
> ...
>
> Error message:
> javax.xml.stream.XMLStreamException: ParseError at [row,col]:[50,39]
> Message: A '(' character or an element type is required in the
> declaration of element type "PubMedPubDate".

The XML document itself is fine, but non-validating due to problems in
the DTD; StAX by default attempts to validate input documents. SAX is
ignoring the DTD associated with the XML document, and therefore
doesn't notice that the DTD is invalid.

-o
 
Kai Schlamp

Thanks for the answer.
So disabling DTD validation should solve the problem?
I tried
factory.setProperty("javax.xml.stream.isValidating", false);
(false is already the default, according to the Javadoc), but that
didn't solve the problem either.

Another thing: I just tried the Woodstox implementation (I only added
it to the classpath), and everything works fine, even without changing
any property. So it seems that there is a specific problem with the
reference implementation.
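
To see which StAX implementation is actually picked up from the classpath, and to try a factory with DTD processing switched off, something like the following sketch may help. The class StaxProviderCheck and its two methods are made up for illustration; IS_VALIDATING and SUPPORT_DTD are the standard XMLInputFactory property constants. Whether turning off SUPPORT_DTD avoids the error reported in this thread is an untested assumption, and the sample document deliberately has no DOCTYPE.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class StaxProviderCheck {

    // Hypothetical helper: a factory that neither validates nor processes DTDs.
    static XMLInputFactory newLenientFactory() {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_VALIDATING, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        return factory;
    }

    // Hypothetical helper: walks the document using the return value of next()
    // and counts the START_ELEMENT events.
    static int countStartElements(XMLInputFactory factory, String xml) throws Exception {
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        int count = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = newLenientFactory();
        // The class name reveals the provider, e.g. the JDK's bundled one or Woodstox.
        System.out.println("StAX provider: " + factory.getClass().getName());
        System.out.println("start elements: "
                + countStartElements(factory, "<a><b/><b/></a>"));
    }
}
```

The stack trace earlier in the thread names com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl, which is the implementation bundled with the JDK. With Woodstox on the classpath, the printed provider class name should change accordingly.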
 
