Interested in System ID only, not the whole parsing ...

Dhurandhar Bhatvadekar · Mar 3, 2007

I am not sure if this is a naive question, but I have an arbitrarily
long document in which I know that a DOCTYPE declaration exists. I am
not interested in "parsing" the document. All I am interested in is
finding out what the system ID and public ID of the document are.

One way I can think of is to write an entity resolver and arrange for
its implementation of resolveEntity() to return an appropriate
InputSource while recording the system ID, because the system and
public IDs are passed to that method.

If that's the only way to achieve it, my question is:
- will doing it this way carry a performance cost, because I have to
call the parse() method?

If there are other ways of achieving this (again, noting that I am
only interested in the declaration part), please let me know.

Thank you!

Joe Kesselman · Mar 3, 2007

Dhurandhar said:
I am not sure if this is a naive question, but I have an arbitrarily
long document in which I know that a DOCTYPE declaration exists. I am
not interested in "parsing" the document. All I am interested in is
finding out what the system ID and public ID of the document are.

Outside of writing a parser yourself for that much of the document...

Run a SAX parser, and as soon as you've gotten that information have
your handler throw an exception to crash the parser. (Obviously the code
that calls the parser will want to catch and recognize this particular
exception as a "normal abnormal exit.")

However, when I proposed that to one manager, he held his nose and
insisted that I let the parser finish spinning instead. And I can't
_entirely_ disagree with him.
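
A minimal Java sketch of the early-exit approach described above, under
some assumptions: the class name DoctypeSniffer, its Ids holder, and the
StopParsing marker exception are hypothetical names invented for
illustration. It registers a SAX LexicalHandler (through DefaultHandler2)
so that the identifiers arrive in startDTD(), then throws to end the
parse. The entity resolver that returns an empty source is only a
precaution so that no external DTD is ever fetched.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

public class DoctypeSniffer {

    /** Holder for the two identifiers (hypothetical helper type). */
    public static final class Ids {
        public final String publicId;
        public final String systemId;

        Ids(String publicId, String systemId) {
            this.publicId = publicId;
            this.systemId = systemId;
        }
    }

    /** Marker exception for the deliberate "normal abnormal exit". */
    private static final class StopParsing extends SAXException {
        StopParsing() {
            super("stop: prolog has been read");
        }
    }

    /** Returns the DOCTYPE identifiers, or null if there is no DOCTYPE. */
    public static Ids sniff(InputSource in)
            throws IOException, SAXException, ParserConfigurationException {
        final Ids[] found = new Ids[1];

        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void startDTD(String name, String publicId, String systemId)
                    throws SAXException {
                found[0] = new Ids(publicId, systemId);
                throw new StopParsing(); // we have what we came for
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts)
                    throws SAXException {
                // The root element has started, so there was no DOCTYPE.
                throw new StopParsing();
            }
        };

        XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.setProperty(
                "http://xml.org/sax/properties/lexical-handler", handler);
        // Precaution: never fetch an external DTD or entity.
        reader.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));

        try {
            reader.parse(in);
        } catch (StopParsing expected) {
            // Deliberate early exit; any other SAXException still propagates.
        }
        return found[0];
    }
}
```

Calling DoctypeSniffer.sniff(new InputSource(new FileInputStream(path)))
should return once the prolog has been read, so the cost should not grow
with the length of the document body.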

Dhurandhar Bhatvadekar · Mar 3, 2007

Hi Joe,

Thanks for your reply. So, here is some code-review time for you. Can
you please let me know if the following will work? In my preliminary
tests it appears to work, but I want to be sure.

----------------------------------------------
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

private String getSystemIdFromDtd() {
    // Use a streaming (SAX) XML parser; any failure, including a
    // parsing error, is rethrown as a RuntimeException
    BufferedInputStream bis = null;
    try {
        bis = new BufferedInputStream(new FileInputStream(xml)); // xml is defined elsewhere
        final XMLReader xr = XMLReaderFactory.createXMLReader();
        final InputSource is = new InputSource(bis);
        xr.setEntityResolver(new EntityResolver() {
            public InputSource resolveEntity(final String pid, final String sid)
                    throws SAXException, IOException {
                if (sid != null) {
                    mSystemId = sid.trim(); // mSystemId is defined elsewhere
                    // resolve the entities locally somehow and
                    // return a meaningful InputSource instance
                } // else default resolution
                return ( null );
            }
        });
        xr.parse(is);
        return ( mSystemId );
    } catch (final Exception ioe) {
        throw new RuntimeException(ioe);
    } finally {
        try {
            if (bis != null)
                bis.close();
        } catch (Exception ee) {
            // squelching ee on purpose
        }
    }
}

Joe Kesselman · Mar 3, 2007

Sorry, but code review goes beyond what you get for free.

Peter Flynn · Mar 4, 2007

Dhurandhar said:
I am not sure if this is a naive question, but I have an arbitrarily
long document in which I know that a DOCTYPE declaration exists. I am
not interested in "parsing" the document. All I am interested in is
finding out what the system ID and public ID of the document are.

All XML tools conduct a formal parse, either for well-formedness or for
validity as well. This implies they read to the end of the file. Most
XML tools don't provide for fragmentary reading, so the penalty when you
"just" want something from the top of the file is enormous unless you do
the "crash me when I find it" trick.

If you can guarantee that the entire Document Type Declaration will be
contained in the first nn lines of the file, and that the double quote
has been used to delimit the identifiers, then the following Unix
commands will do the job, printing two lines. For a PUBLIC declaration,
the first line is the FPI and the second is the SYSTEM identifier; for a
SYSTEM declaration, the first line is the SYSTEM identifier and the
second is empty:

head -nn yourfile.xml | tr '\012\015<' '\040\040\012' |
  grep -m 1 '^!DOCTYPE' | awk -F\" '{print $2 "\n" $4}'

The commands head, tr, grep, and awk are also available for Windows.

///Peter
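
For comparison, the same top-of-file idea can be sketched in Java
without a parser. This is a heuristic under stated assumptions, not a
conforming XML scan: the class name DoctypeHeadScan and the maxChars
limit are hypothetical, the caller is assumed to supply a Reader with
the right encoding, and a DOCTYPE-like string inside a comment ahead of
the real declaration would fool it. Unlike the awk version, it accepts
single-quoted as well as double-quoted identifiers.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DoctypeHeadScan {

    // A quoted literal: group n holds a double-quoted value,
    // group n+1 a single-quoted one.
    private static final String LITERAL = "(?:\"([^\"]*)\"|'([^']*)')";

    // Groups 1-2: public ID, groups 3-4: system ID after PUBLIC,
    // groups 5-6: system ID after SYSTEM.
    private static final Pattern DOCTYPE = Pattern.compile(
            "<!DOCTYPE\\s+[^\\s\\[>]+\\s+(?:PUBLIC\\s+" + LITERAL
            + "\\s+" + LITERAL + "|SYSTEM\\s+" + LITERAL + ")");

    /**
     * Reads at most maxChars characters and returns {publicId, systemId},
     * or null if no DOCTYPE with an external identifier was seen.
     * publicId is null for a SYSTEM declaration.
     */
    public static String[] scan(Reader in, int maxChars) throws IOException {
        char[] buf = new char[maxChars];
        int len = 0;
        int n;
        while (len < maxChars
                && (n = in.read(buf, len, maxChars - len)) != -1) {
            len += n;
        }
        Matcher m = DOCTYPE.matcher(new String(buf, 0, len));
        if (!m.find()) {
            return null;
        }
        if (m.group(5) != null || m.group(6) != null) {
            return new String[] { null, pick(m, 5) };   // SYSTEM form
        }
        return new String[] { pick(m, 1), pick(m, 3) }; // PUBLIC form
    }

    // Returns whichever of the two quote alternatives matched.
    private static String pick(Matcher m, int group) {
        return m.group(group) != null ? m.group(group) : m.group(group + 1);
    }
}
```

The trade-off is the same as with head: the caller has to pick a limit
large enough to contain the whole declaration.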

Jürgen Kahrs · Mar 4, 2007

Joe said:
Outside of writing a parser yourself for that much of the document...

Run a SAX parser, and as soon as you've gotten that information have
your handler throw an exception to crash the parser. (Obviously the code

Following Joe's idea (and assuming there always
_is_ a DOCTYPE declaration in your file), I
implemented this in XMLgawk:

XMLSTARTDOCT {
    print XMLATTR["PUBLIC"], XMLATTR["SYSTEM"]
    exit
}

The "exit" statement ensures that the XML data
will only be read up to the point where the
DOCTYPE declaration is. Immediately after this,
parsing will be terminated. I described such an
approach in the XMLgawk doc:

http://home.vrweb.de/~juergen.kahrs/gawk/XML/xmlgawk.html#Dealing-with-DTDs
