Interested in System ID only, not the whole parsing ...

  • Thread starter Dhurandhar Bhatvadekar

Dhurandhar Bhatvadekar

I am not sure if this is a naive question, but I have an arbitrarily
long document in which I know that a DOCTYPE
declaration exists. I am not interested in "parsing" the document. All
I am interested in is finding out what the
system ID and public ID of the document are.

One way I can think of is to write an entity resolver and arrange for
its implementation of resolveEntity() to
return an appropriate InputSource while recording the system ID, since
the system and public IDs are passed to that method.

If that's the only way to achieve it, my question is:
- will doing it this way incur a performance overhead,
because I have to call the parse() method?

If there are other ways of achieving this (again, noting that I am
only interested in the declaration part), please
let me know.

Thank you!
 

Joe Kesselman

Dhurandhar said:
I am not sure if this is a naive question, but I have an arbitrarily
long document in which I know that a DOCTYPE
declaration exists. I am not interested in "parsing" the document. All
I am interested in is finding out what the
system ID and public ID of the document are.

Outside of writing a parser yourself for that much of the document...

Run a SAX parser, and as soon as you've gotten that information have
your handler throw an exception to crash the parser. (Obviously the code
that calls the parser will want to catch and recognize this particular
exception as a "normal abnormal exit.")

However, when I proposed that to one manager, he held his nose and
insisted that I let the parser finish spinning instead. And I can't
_entirely_ disagree with him.
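
A minimal sketch of the "throw to stop" idea in Java, assuming a SAX parser that supports the standard lexical-handler property (the JDK's built-in one does). The class name DoctypeSniffer, the method readDoctypeIds, and the marker exception StopParsing are made up for illustration. It uses LexicalHandler.startDTD(), which receives the public and system IDs directly, rather than an entity resolver.

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

final class DoctypeSniffer {

    /** Hypothetical marker exception: a deliberate early stop, not a parse error. */
    static final class StopParsing extends SAXException {
        StopParsing() {
            super("stopped on purpose after the DOCTYPE declaration");
        }
    }

    /** Returns {publicId, systemId}; either entry may be null. */
    static String[] readDoctypeIds(InputStream in) throws Exception {
        final String[] ids = new String[2];
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();

        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void startDTD(String name, String publicId, String systemId)
                    throws SAXException {
                ids[0] = publicId;
                ids[1] = systemId;
                throw new StopParsing(); // we have what we came for
            }

            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) throws SAXException {
                // Reaching the root element means there was no DOCTYPE at all.
                throw new StopParsing();
            }
        };

        reader.setContentHandler(handler);
        // Standard SAX2 property for registering a LexicalHandler.
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);

        try {
            reader.parse(new InputSource(in));
        } catch (StopParsing expected) {
            // The "normal abnormal exit": nothing to report.
        }
        return ids;
    }
}
```

Calling DoctypeSniffer.readDoctypeIds(new FileInputStream(file)) should read only as far as the DOCTYPE declaration (or the root element, if there is none). Any other SAXException still propagates as a real error.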
 

Dhurandhar Bhatvadekar

Hi Joe,

Thanks for your reply. So, here is some code-review time for you. Can
you please let me know if the following
will work? With my preliminary tests it appears to work. But I want to
be sure.

----------------------------------------------
private String getSystemIdFromDtd() {
    // Use a streaming (SAX) parser; any failure is rethrown as a RuntimeException
    BufferedInputStream bis = null;
    try {
        bis = new BufferedInputStream(new FileInputStream(xml)); // xml is defined elsewhere
        final XMLReader xr = XMLReaderFactory.createXMLReader();
        final InputSource is = new InputSource(bis);
        xr.setEntityResolver(new EntityResolver() {
            public InputSource resolveEntity(final String pid, final String sid)
                    throws SAXException, IOException {
                if (sid != null) {
                    mSystemId = sid.trim(); // mSystemId is defined elsewhere
                    // resolve the entities locally somehow and
                    // return a meaningful InputSource instance
                } // else default resolution
                return ( null );
            }
        });
        xr.parse(is);
        return ( mSystemId );
    } catch (final Exception e) {
        throw new RuntimeException(e);
    } finally {
        try {
            if (bis != null)
                bis.close();
        } catch (Exception ee) {
            // squelching ee on purpose
        }
    }
}
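
A variation on the resolver approach, sketched under assumptions: it records only the first pair of IDs it is handed (assuming the first external entity the parser asks for is the external DTD subset) and returns an empty InputSource so the parser does not try to fetch the DTD. The class name RecordingResolver and the helper parseWith are hypothetical. The parser still reads the whole document, and documents that rely on entities declared in the DTD may fail to parse against the empty replacement.

```java
import java.io.InputStream;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

final class RecordingResolver implements EntityResolver {
    private String publicId;
    private String systemId;
    private boolean seen;

    @Override
    public InputSource resolveEntity(String pid, String sid) {
        if (!seen) {
            // Keep only the first request; later external entities are ignored.
            seen = true;
            publicId = pid;
            systemId = sid;
        }
        // An empty stream stands in for the entity, so nothing is fetched.
        return new InputSource(new StringReader(""));
    }

    String publicId() {
        return publicId;
    }

    String systemId() {
        return systemId;
    }

    /** Hypothetical helper: parses the whole stream and returns the resolver. */
    static RecordingResolver parseWith(InputStream in) throws Exception {
        RecordingResolver resolver = new RecordingResolver();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(resolver);
        reader.parse(new InputSource(in));
        return resolver;
    }
}
```

Note that the system ID handed to resolveEntity() is normally already resolved to an absolute URI, so a relative identifier in the document may come back in expanded form.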
 

Peter Flynn

Dhurandhar said:
I am not sure if this is a naive question, but I have an arbitrarily
long document in which I know that a DOCTYPE
declaration exists. I am not interested in "parsing" the document. All
I am interested in is finding out what the
system ID and public ID of the document are.

All XML tools conduct a formal parse, either for well-formedness or for
validity as well. This implies they read to the end of the file. Most
XML tools don't provide for fragmentary reading, so the penalty when you
"just" want something from the top of the file is enormous unless you do
the "crash me when I find it" trick.

If you can guarantee that the entire Document Type Declaration will be
contained in the first nn lines of the file, and that the double quote
has been used to delimit the identifiers, then the following Unix
commands will do the job, returning two lines. For a SYSTEM declaration
the first line is the system identifier and the second is empty; for a
PUBLIC declaration the first line is the FPI and the second is the
system identifier:

head -nn yourfile.xml | tr '\012\015<' '\040\040\012' |
  grep -m 1 '^!DOCTYPE' | awk -F\" '{print $2 "\n" $4}'

The commands head, tr, grep, and awk are also available for Windows.
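
The same prefix-scanning idea can be sketched in Java with a regular expression. This is not an XML parser: it assumes the DOCTYPE declaration sits within the first maxBytes bytes, that the encoding is ASCII-compatible (so not UTF-16), and that no comment or processing instruction before it contains text resembling a DOCTYPE. The class name DoctypePrefixScan and its methods are made up for illustration. Unlike the pipeline above, it accepts either quote character.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DoctypePrefixScan {
    // Matches  PUBLIC "fpi" "sysid"  or  SYSTEM "sysid", with ' or " as delimiter.
    private static final Pattern DOCTYPE = Pattern.compile(
        "<!DOCTYPE\\s+\\S+\\s+(?:PUBLIC\\s+([\"'])(.*?)\\1\\s+([\"'])(.*?)\\3"
        + "|SYSTEM\\s+([\"'])(.*?)\\5)",
        Pattern.DOTALL);

    /** Returns {publicId, systemId} from the text, or null if no external ID is found. */
    static String[] scan(String prefix) {
        Matcher m = DOCTYPE.matcher(prefix);
        if (!m.find()) {
            return null;
        }
        if (m.group(1) != null) {
            // PUBLIC form: FPI first, then the system identifier.
            return new String[] { m.group(2), m.group(4) };
        }
        // SYSTEM form: there is no public identifier.
        return new String[] { null, m.group(6) };
    }

    /** Reads at most maxBytes from the stream and scans that prefix (Java 11+). */
    static String[] scan(InputStream in, int maxBytes) throws IOException {
        byte[] head = in.readNBytes(maxBytes);
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives intact.
        return scan(new String(head, StandardCharsets.ISO_8859_1));
    }
}
```

A call such as DoctypePrefixScan.scan(new FileInputStream(file), 4096) reads a fixed number of bytes no matter how long the document is; the 4096 is an arbitrary guess at how far down the declaration can appear.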

///Peter
 

Jürgen Kahrs

Joe said:
Outside of writing a parser yourself for that much of the document...

Run a SAX parser, and as soon as you've gotten that information have
your handler throw an exception to crash the parser. (Obviously the code

Following Joe's idea (and assuming there always
_is_ a DOCTYPE declaration in your file), I
implemented this in XMLgawk:

XMLSTARTDOCT {
    print XMLATTR["PUBLIC"], XMLATTR["SYSTEM"]
    exit
}

The "exit" statement ensures that the XML data
will only be read up to the point where the
DOCTYPE declaration is. Immediately after this,
parsing will be terminated. I described such an
approach in the XMLgawk doc:

http://home.vrweb.de/~juergen.kahrs/gawk/XML/xmlgawk.html#Dealing-with-DTDs
 
