Kaidi
Hello guys,
I get an "invalid XML character" error when using Xerces to parse
an XML file. I know that XML maps &, <, >, and " to special
strings such as "&lt;" and "&gt;". However, what if the XML file really
needs to contain some text like: ""? (as
content of a tag)
The story is:
I am writing a program to parse some XML files produced by another program.
That program grabs webpages and saves each page's URL and
content into an XML file, something like (for each webpage):
<pageurl>http://www.cs.waikato.ac.nz/~ml/weka/agridatasets.jar</pageurl>
<pagecontent> the_page_HTML_content </pagecontent>
This works fine, since that program replaces &, <, >, etc. with &amp;, &lt;,
&gt;, etc.
However, some URLs point to files such as .zip or .pdf files. The
program just "prints" the file content as text and puts it in the XML
file. In this case, the content of <pagecontent> will look like:
PK ÈR<+ &#
......
(Just think of what you will see if you open a .pdf file in Notepad!)
As a result, when I use an XML parser (Xerces) to parse the file, I get
errors like:
FATAL: line 5079: Character reference "" is an invalid XML character.
org.xml.sax.SAXParseException: Character reference "" is an invalid XML character.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
at org.apache.xerces.impl.XMLScanner.scanCharReferenceValue(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanCharReference(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
So, any idea how I can make this work?
How can I tell the Xerces parser to ignore the "&xx;" sequences (except
those for <, >, ", etc.) and parse them as plain text?
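To make the question concrete, here is a rough sketch of a workaround I could imagine, not something I know to be the recommended approach. It assumes the file can be read as text and pre-processed before it reaches the parser. The class name AmpersandEscaper and the method escapeStrayAmpersands are made up for illustration. The idea is to rewrite every "&" that does not start one of the five predefined entities as "&amp;", so that sequences like "&#3;" arrive as plain text:

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Xerces or any library.
public class AmpersandEscaper {
    // Matches any '&' that does NOT begin one of the five predefined
    // XML entities (&amp; &lt; &gt; &quot; &apos;).
    private static final Pattern STRAY_AMP =
        Pattern.compile("&(?!(?:amp|lt|gt|quot|apos);)");

    // Escapes stray ampersands so "&#3;" becomes "&amp;#3;",
    // which the parser then reports as the literal text "&#3;".
    public static String escapeStrayAmpersands(String rawXml) {
        return STRAY_AMP.matcher(rawXml).replaceAll("&amp;");
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample input containing an invalid character reference.
        String raw = "<pagecontent>PK &#3; &lt;b&gt; A&B</pagecontent>";
        String safe = escapeStrayAmpersands(raw);

        // Parse the cleaned text with the standard JAXP entry point.
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(safe)));

        // Prints: PK &#3; <b> A&B
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Two limitations of this sketch: it also turns valid numeric references such as "&#233;" into literal text, and it does nothing about raw control characters (actual bytes rather than "&#...;" references) that binary content may put in the file, which XML 1.0 also rejects.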
Thanks a lot.