this code: &#x3, an invalid XML character error.

Kaidi

Hello guys,
I get the "an invalid XML character" error when using Xerces to parse
an XML file. I know that XML maps &, <, >, " to special strings like
"&gt;" and "&lt;". However, what if the XML file really needs to
contain some text like "&#x3;" (as the content of a tag)?

The story is:
I am writing a program to parse some XML files produced by another
program. That program grabs webpages and saves each page's URL and
content into an XML file, something like this (for each webpage):

<pageurl>http://www.cs.waikato.ac.nz/~ml/weka/agridatasets.jar</pageurl>
<pagecontent> the_page_HTML_content </pagecontent>

This works fine, since that program replaces &, <, >, etc. with &amp;,
&lt;, &gt;, etc.

However, some web urls point to files: .zip, .pdf file, etc. The
program just "prints" the .pdf content as text and puts it in the XML
file. In this case, the content of <pagecontent> will look like:

PKÈR&lt;+&#
......
(Just think what you will see if you open a .pdf file in notepad!)

As a result, when I use an XML parser (Xerces) to parse it, I get
errors like:

FATAL: line 5079: Character reference "&#x3" is an invalid XML character.
org.xml.sax.SAXParseException: Character reference "&#x3" is an invalid XML character.
    at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
    at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
    at org.apache.xerces.impl.XMLScanner.scanCharReferenceValue(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanCharReference(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)

So, any idea how I can make it work?
How can I tell the xerces parser to ignore the "&xx;" pairs (except
those for <,>,", etc) and parse them just as plain text?

Thanks a lot.
 
Patrick TJ McPhee

% I get the "an invalid XML character" error when using Xerces to parse
% an XML file. I know that XML maps &, <, >, " to special strings like
% "&gt;" and "&lt;". However, what if the XML file really needs to
% contain some text like "&#x3;" (as the content of a tag)?

The only valid characters in an XML file are the non-control code points
from Unicode, tab, carriage-return, and line-feed. Even if you enter
them as numeric character references, other control characters (such as
&#x3;) are not allowed. I suggest encoding binary data using one of
the schemes recognised in MIME, such as quoted-printable (for text with
the odd control character) or base64.

% However, some web urls point to files: .zip, .pdf file, etc. The
% program just "prints" the .pdf content as text and puts it in the XML
% file. In this case, the content of <pagecontent> will look like:

For these, use base64.
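
A minimal sketch of what that could look like on the producing side, assuming Java 8 or later (for java.util.Base64) and assuming the producer can be changed. The class and method names are hypothetical.

```java
import java.util.Base64;

// Hypothetical helper showing how binary content could be made XML-safe.
final class BinaryPayload {
    /** Encodes raw bytes as base64 text, which contains only XML-safe ASCII characters. */
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    /** Decodes the element text back to the original bytes. */
    static byte[] decode(String text) {
        // The basic decoder rejects whitespace, so trim what the parser hands back.
        return Base64.getDecoder().decode(text.trim());
    }

    public static void main(String[] args) {
        byte[] zipHeader = {0x50, 0x4B, 0x03, 0x04};   // "PK" followed by two control bytes
        String safe = encode(zipHeader);
        System.out.println("<pagecontent>" + safe + "</pagecontent>");
        System.out.println(decode(safe).length);       // 4
    }
}
```

The encoded text uses only letters, digits, '+', '/' and '=', so it needs no further escaping inside <pagecontent>.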
 
Johannes Koch

Kaidi said:
The program just "prints" the .pdf content as text and puts it in the XML
file. In this case, the content of <pagecontent> will look like:

PKÈR&lt;+&#
......
(Just think what you will see if you open a .pdf file in notepad!)

In this way, when I use a XML parser (xerces) to parse it,

Why do you want to parse PDF with an XML parser? When downloading the
resources, you may store the content-type and make XML parsing dependent
on the content-type.
 
Kaidi

Johannes Koch said:
Why do you want to parse PDF with an XML parser? When downloading the
resources, you may store the content-type and make XML parsing dependent
on the content-type.

Yes, if I were writing the whole program, I would do it that way. The
problem is that the existing program (which I cannot change) works
like this: it just puts .jar, .pdf, etc. content into one XML file. I
need to process this XML file. :-(
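
When the producer cannot be changed, one possible workaround is to pre-process the file before it reaches the parser. The sketch below is only that: it is not a Xerces feature, and the class and method names are hypothetical. It rewrites numeric character references that are illegal in XML 1.0 so the parser sees them as plain text (for example "&#x3;" becomes "&amp;#x3;") and leaves legal references and named entities alone. It assumes the document fits in a String and does not treat CDATA sections or comments specially.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical pre-processing step, applied before the text is handed to the parser.
final class CharRefSanitizer {
    // Numeric character references: decimal (&#3;) or hexadecimal (&#x3;).
    private static final Pattern CHAR_REF = Pattern.compile("&#(x[0-9A-Fa-f]+|[0-9]+);");

    static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Escapes the '&' of every character reference whose value is not legal in XML 1.0. */
    static String escapeInvalidCharRefs(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            boolean valid;
            try {
                int cp = body.startsWith("x")
                    ? Integer.parseInt(body.substring(1), 16)
                    : Integer.parseInt(body, 10);
                valid = isValidXml10Char(cp);
            } catch (NumberFormatException e) {
                valid = false;  // value too large to be any code point
            }
            // Keep legal references; turn illegal ones into literal text.
            String replacement = valid ? m.group() : "&amp;" + m.group().substring(1);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String broken = "<pagecontent>PK&#x3;&#x4;&lt;</pagecontent>";
        String fixed = escapeInvalidCharRefs(broken);
        System.out.println(fixed);  // <pagecontent>PK&amp;#x3;&amp;#x4;&lt;</pagecontent>

        // The sanitized text now parses with the JDK's default JAXP parser.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(fixed)));
        System.out.println(doc.getDocumentElement().getTextContent());  // PK&#x3;&#x4;<
    }
}
```

Note that this only makes the file parseable. The binary content was already damaged when it was "printed" as text, so the original .pdf or .jar bytes cannot be recovered this way.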
 
