this code: &#x3, an invalid XML character error.

Kaidi · Sep 27, 2004

Hello guys,
I get the "an invalid XML character" error when using xerces to parse
an XML file. I know that XML maps &, <, >, and " to special
strings like "&gt;" and "&lt;". However, what if the XML file really
needs to contain some text like "&#x3;" (as the
content of a tag)?

The story is:
I am writing a program to parse some XML files produced by another program.
That program grabs webpages and saves the pages' URLs and
content into an XML file, something like this (for each webpage):

<pageurl>http://www.cs.waikato.ac.nz/~ml/weka/agridatasets.jar</pageurl>
<pagecontent> the_page_HTML_content </pagecontent>

This works fine since that program will replace &, <, >, etc. with
&amp;, &lt;, &gt;, etc.

However, some web urls point to files: .zip, .pdf file, etc. The
program just "prints" the .pdf content as text and puts it in the XML
file. In this case, the content of <pagecontent> will look like:

PKÃˆR<+&#
......
(Just think what you will see if you open a .pdf file in notepad!)

In this way, when I use an XML parser (xerces) to parse it, I get
errors like:

FATAL: line 5079: Character reference "&#x3" is an invalid XML character.
org.xml.sax.SAXParseException: Character reference "&#x3" is an invalid XML character.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
at org.apache.xerces.impl.XMLScanner.scanCharReferenceValue(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanCharReference(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)

So, any idea how I can make it work?
How can I tell the xerces parser to ignore the "&xx;" pairs (except
those for <,>,", etc) and parse them just as plain text?

Thanks a lot.

Patrick TJ McPhee · Sep 27, 2004

% I get the "an invalid XML character" error when using xerces to parse
% an XML file. I know that XML maps &, <, >, and " to special
% strings like "&gt;" and "&lt;". However, what if the XML file really
% needs to contain some text like "&#x3;" (as the
% content of a tag)?

The only valid characters in an XML file are the non-control code points
from Unicode, tab, carriage-return, and line-feed. Even if you enter
them as numeric character references, other control characters (such as
&#x3;) are not allowed. I suggest encoding binary data using one of
the schemes recognised in MIME, such as quoted-printable (for text with
the odd control character) or base64.
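
A minimal sketch of that rule, assuming XML 1.0 (the ranges below follow the "Char" production of the XML 1.0 specification). The class and method names are made up for illustration and are not part of xerces:

```java
class XmlChars {
    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXmlChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    public static void main(String[] args) {
        // U+0003 is a control character, so a reference to it is rejected
        System.out.println("U+0003 valid? " + isValidXmlChar(0x3));
        // Tab is one of the three permitted control characters
        System.out.println("U+0009 valid? " + isValidXmlChar(0x9));
    }
}
```

The same check applies whether the character appears literally or as a numeric character reference.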

% However, some web urls point to files: .zip, .pdf file, etc. The
% program just "prints" the .pdf content as text and puts it in the XML
% file. In this case, the content of <pagecontent> will look like:

For these, use base64.
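
A rough sketch of what the producing side could do, assuming it can be changed and runs on Java 8 or later (for `java.util.Base64`). The `encoding="base64"` attribute and the class and method names are hypothetical, chosen only so a reader of the file can tell encoded content from plain text:

```java
import java.util.Base64;

class Base64Content {
    // Wrap raw bytes in a <pagecontent> element as base64 text.
    // Base64 output uses only A-Z, a-z, 0-9, '+', '/' and '=',
    // all of which are legal XML characters and need no escaping.
    static String toPageContent(byte[] raw) {
        String encoded = Base64.getEncoder().encodeToString(raw);
        return "<pagecontent encoding=\"base64\">" + encoded + "</pagecontent>";
    }

    // Recover the original bytes from the element's text content.
    static byte[] fromBase64(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // First four bytes of a zip/jar file, including the 0x03 control byte
        byte[] zipHeader = {0x50, 0x4B, 0x03, 0x04};
        System.out.println(toPageContent(zipHeader));
    }
}
```

The consumer would read the element text with the XML parser as usual and call the decoder only when the attribute is present.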

Johannes Koch · Sep 27, 2004

Kaidi said:
The program just "prints" the .pdf content as text and puts it in the XML
file. In this case, the content of <pagecontent> will look like:

PKÃˆR<+&#
......
(Just think what you will see if you open a .pdf file in notepad!)

In this way, when I use an XML parser (xerces) to parse it,

Why do you want to parse PDF with an XML parser? When downloading the
resources, you may store the content-type and make XML parsing dependent
on the content-type.

Kaidi · Sep 27, 2004

Johannes Koch said:
Why do you want to parse PDF with an XML parser? When downloading the
resources, you may store the content-type and make XML parsing dependent
on the content-type.

Yes, if I were writing the whole program, I would do it that way. The
problem is that the existing program (which I cannot change) does it
this way: it just puts .jar, .pdf, etc. content into one XML file. I need to
process this XML file. :-(
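
When the producer cannot be changed, one possible workaround is to clean the text before it reaches the parser. The sketch below is an assumption about how that could look, not a xerces feature: it drops numeric character references that point at characters XML 1.0 forbids and leaves all other references alone. The class and method names are hypothetical. It also rewrites references inside CDATA sections and comments, and it does not touch literal control characters, so treat it as a starting point. The binary content it strips is lost either way.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InvalidRefFilter {
    // Matches decimal (&#9;) and hexadecimal (&#x3;) character references
    private static final Pattern NUMERIC_REF =
        Pattern.compile("&#(?:x([0-9A-Fa-f]+)|([0-9]+));");

    // XML 1.0 Char production
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Remove references to invalid characters; keep everything else as-is
    static String stripInvalidRefs(String xml) {
        Matcher m = NUMERIC_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            boolean valid;
            try {
                int cp = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
                valid = isValidXmlChar(cp);
            } catch (NumberFormatException e) {
                valid = false; // value too large to be any code point
            }
            m.appendReplacement(out, valid ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "<pagecontent>PK&#x3;&#x4;data&#x41;</pagecontent>";
        System.out.println(stripInvalidRefs(dirty));
    }
}
```

The cleaned string can then be handed to the parser through an `org.xml.sax.InputSource` wrapping a `java.io.StringReader`.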
