Special characters in docs

Robert M. Gary · Nov 20, 2004

I receive an XML document in which one of the text nodes contains a
character not in the character set (in this case it's a ^L). The DOM that
creates the document converts it to a &#0x. However, I cannot get a parser
to accept the document with this character in it. Each time the parser gets
to &X0c it dies. I'm using DOM in Java 1.4 and Xerces 2.6 in C++ (Solaris
Sparc).
I've also created test programs in both Java and C++. My test program
creates the DOM, generates a document and then tries to parse the document
it just created. In each case, it fails!!! How crazy! The parser can't read
the doc it just created!?
BTW: In the actual problem I'm trying to solve, I really don't have any
control over the document I'm receiving.
Are there any options on the parser I can try???
Thank you so much!

-Robert

Richard Tobin · Nov 20, 2004

Robert M. Gary said:
I receive an XML document in which one of the text nodes contains a
character not in the character set (in this case it's a ^L).

XML 1.0 documents can't contain that character, either literally or as
a character reference. So the solution is to get whoever's providing
this so-called XML document to remove it.

XML 1.1 documents can use &#12;, but using XML 1.1 may limit the
applications you can use. If you decide to do this, you will
have to put an XML declaration with version="1.1" at the top of
the document.
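
To see the difference between the two versions, here is a minimal sketch in Java. It assumes a JAXP parser with XML 1.1 support (the parsers bundled with later JDKs should qualify; the Java 1.4 builder discussed in this thread may not). The class name FormFeedDemo and the helper rootText are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library.
final class FormFeedDemo {
    private FormFeedDemo() {}

    /** Returns the root element's text, or null if the parser rejects the document. */
    static String rootText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (SAXException e) {
            return null; // not well-formed for the declared XML version
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String v10 = "<?xml version=\"1.0\"?><a>&#12;</a>";
        String v11 = "<?xml version=\"1.1\"?><a>&#12;</a>";
        System.out.println("1.0 accepted: " + (rootText(v10) != null));
        System.out.println("1.1 accepted: " + (rootText(v11) != null));
    }
}
```

Only the version in the XML declaration differs between the two inputs, which is the change described above.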

BTW: In the actual problem I'm trying to solve, I really don't have any
control over the document I'm receiving.

Tell whoever does have control of it to fix it!

-- Richard

ExGuardianReader · Nov 20, 2004

Richard said:
XML 1.0 documents can't contain that character, either literally or as
a character reference. So the solution is to get whoever's providing
this so-called XML document to remove it.

You can't have &#12;?

Where is this documented?

I had a problem sending XML documents in which some text contained
characters < 31. I encoded them as character references, but Xerces
complained when it saw two of them, which happened to be the first
characters in the two elements in question, so I switched to using base64
for that data.

What was going on?

Robert M. Gary · Nov 22, 2004

Richard Tobin said:
XML 1.0 documents can't contain that character, either literally or as
a character reference. So the solution is to get whoever's providing
this so-called XML document to remove it.

XML 1.1 documents can use &#12;, but using XML 1.1 may limit the
applications you can use. If you decide to do this, you will
have to put an XML declaration with version="1.1" at the top of
the document.

But why does Xerces create that character then? Xerces apparently went to
great lengths to convert my ^L into a &#x0c, why can't it parse what it
produced?

Tell whoever does have control of it to fix it!

Right now, this is the Java 1.4 XML DOM builder. I suspect this is actually
Xerces repackaged.

-Robert

Richard Tobin · Nov 22, 2004

ExGuardianReader said:
You can't have &#12;?

Correct.

Where is this documented?

In the XML specification.

Section 4.1 (http://www.w3.org/TR/REC-xml/#sec-references) says:

[66] CharRef ::= '&#' [0-9]+ ';'
               | '&#x' [0-9a-fA-F]+ ';'    [WFC: Legal Character]

Well-formedness constraint: Legal Character

Characters referred to using character references MUST match the
production for Char.

and section 2.2 says:

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
           | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

I had a problem sending XML documents in which some text contained
characters < 31.

Yes; in XML 1.0 the only characters < 32 that are allowed are CR, LF
and TAB.
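
Production [2] translates almost directly into code. The following is a sketch in Java; the class name Xml10Chars and the method isLegal are made up for illustration and are not part of Xerces or the JDK.

```java
// Hypothetical helper: a direct transcription of production [2] Char (XML 1.0).
final class Xml10Chars {
    private Xml10Chars() {}

    /** Returns true if the code point matches production [2] Char. */
    static boolean isLegal(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isLegal(0x0C)); // form feed (^L): prints false
        System.out.println(isLegal(0x09)); // TAB: prints true
    }
}
```

Because of the Legal Character constraint, the same check applies to the number inside a character reference, so &#12; fails for the same reason a literal ^L does.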

-- Richard

Richard Tobin · Nov 22, 2004

Robert M. Gary said:
But why does Xerces create that character then?

I have no idea.

Right now, this is the Java 1.4 XML DOM builder. I suspect this is actually
Xerces repackaged.

Someone must have given DOM builder the ^L in the first place. Tell
them to stop doing it.

-- Richard

Robert M. Gary · Nov 22, 2004

I think I mistyped the substitute character in this email. The DOM is
changing ^L to be &#12 but then if you try to parse the resultant document
into another DOM, it says it can't understand the &#12 that the other DOM
just created!!!

Keith M. Corbett · Nov 24, 2004

Robert M. Gary said:
I think I mistyped the substitute character in this email. The DOM is
changing ^L to be &#12 but then if you try to parse the resultant document
into another DOM, it says it can't understand the &#12 that the other DOM
just created!!!

The DOM serializer you're using to save the output is trying its best to
handle what you gave it.

The parser you're using to read the resulting instance is doing its best to
conform to the XML specification.

Just another case of garbage out, garbage in.

/kmc

Keith M. Corbett · Nov 24, 2004

Robert M. Gary said:
BTW: In the actual problem I'm trying to solve, I really don't have any
control over the document I'm receiving.

Knowing that your program input may contain illegal characters, you could
pre-process to transform or eliminate the offending data. This is more or
less trivial depending on the character encoding(s) you need to support.
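
One possible shape for such a pre-processing step, sketched in Java. It assumes the input has already been decoded into a String (so the encoding question is handled elsewhere) and that dropping the offending data is acceptable. The class name Xml10Sanitizer and its methods are hypothetical, not an existing API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-processor: removes characters and character references
// that XML 1.0 forbids, before the text is handed to a parser.
final class Xml10Sanitizer {
    // Matches decimal (&#12;) and hexadecimal (&#xC;) character references.
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

    private Xml10Sanitizer() {}

    /** XML 1.0 production [2] Char. */
    static boolean isLegal(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String sanitize(String xml) {
        // Pass 1: drop character references that name an illegal code point.
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer refsCleaned = new StringBuffer();
        while (m.find()) {
            int cp;
            try {
                cp = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2), 10);
            } catch (NumberFormatException e) {
                cp = -1; // too large to be any Unicode code point
            }
            String replacement = isLegal(cp) ? m.group() : "";
            m.appendReplacement(refsCleaned, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(refsCleaned);

        // Pass 2: drop literal characters that are illegal in XML 1.0.
        StringBuilder out = new StringBuilder(refsCleaned.length());
        for (int i = 0; i < refsCleaned.length(); ) {
            int cp = refsCleaned.codePointAt(i);
            if (isLegal(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

For example, sanitize("<a>x&#12;y</a>") returns "<a>xy</a>". Note that this sketch is not markup-aware: it would also remove text that merely looks like a character reference inside a CDATA section or comment, so treat it as a starting point rather than a complete solution.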

/kmc
