Special characters in docs

Robert M. Gary

I receive an XML document in which one of the text nodes contains a
character not in the character set (in this case it's a ^L). The DOM that
creates the document converts it to a &#0x. However, I cannot get a parser
to accept the document with this character in it. Each time the parser gets
to &X0c it dies. I'm using DOM in Java 1.4 and Xerces 2.6 in C++ (Solaris
Sparc).

I've also created test programs in both Java and C++. My test program
creates the DOM, generates a document and then tries to parse the document
it just created. In each case, it fails!!! How crazy! The parser can't read
the doc it just created!?

BTW: In the actual problem I'm trying to solve, I really don't have any
control over the document I'm receiving.

Are there any options on the parser I can try???

Thank you so much!

-Robert
 
Richard Tobin

Robert M. Gary said:
I receive an XML document in which one of the text nodes contains a
character not in the character set (in this case it's a ^L).

XML 1.0 documents can't contain that character, either literally or as
a character reference. So the solution is to get whoever's providing
this so-called XML document to remove it.

XML 1.1 documents can use it (as a character reference only), but using
XML 1.1 may limit the applications you can use. If you decide to do this,
you will have to put an XML declaration with version="1.1" at the top of
the document.
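
To illustrate the version difference, here is a minimal Java sketch. It assumes a JAXP parser with XML 1.1 support and a current JDK rather than Java 1.4; the class and method names are invented for the example. It parses the same one-element document, containing a reference to the form feed, under whichever version declaration is passed in.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; not part of Xerces or the JDK.
class FormFeedVersionDemo {

    // Returns true if the default JAXP parser accepts <a>&#12;</a>
    // when the document declares the given XML version.
    static boolean parses(String version) {
        String xml = "<?xml version=\"" + version + "\"?><a>&#12;</a>";
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException notWellFormed) {
            // The parser rejected the document (fatal well-formedness error).
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Not expected for an in-memory string; surfaced as a setup failure.
            throw new IllegalStateException(e);
        }
    }
}
```

FormFeedVersionDemo.parses("1.0") should return false, because the reference violates the Legal Character constraint. parses("1.1") is expected to return true on a parser that implements XML 1.1. The parser may also print its own error message to stderr for the rejected case.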

Robert M. Gary said:
BTW: In the actual problem I'm trying to solve, I really don't have any
control over the document I'm receiving.

Tell whoever does have control of it to fix it!

-- Richard
 
ExGuardianReader

Richard said:
XML 1.0 documents can't contain that character, either literally or as
a character reference. So the solution is to get whoever's providing
this so-called XML document to remove it.

You can't have ?

Where is this documented?

I had a problem sending XML documents in where there was some text with
characters < 31. I encoded them with , but Xerces complained when it
saw and  Which happened to be the first characters in the two
elements in question so I switched to using base64 for those data.

What was going on?
 
Robert M. Gary

Richard Tobin said:
XML 1.0 documents can't contain that character, either literally or as
a character reference. So the solution is to get whoever's providing
this so-called XML document to remove it.

XML 1.1 documents can use it (as a character reference only), but using
XML 1.1 may limit the applications you can use. If you decide to do this,
you will have to put an XML declaration with version="1.1" at the top of
the document.

But why does Xerces create that character then? Xerces apparently went to
great lengths to convert my ^L into a &#x0c, why can't it parse what it
produced?

Richard Tobin said:
Tell whoever does have control of it to fix it!

Right now, this is the Java 1.4 XML DOM builder. I suspect this is actually
Xerces repackaged.

-Robert
 
Richard Tobin

ExGuardianReader said:
You can't have ?

Correct.

ExGuardianReader said:
Where is this documented?

In the XML specification.

Section 4.1 (http://www.w3.org/TR/REC-xml/#sec-references) says:

[66] CharRef ::= '&#' [0-9]+ ';'
               | '&#x' [0-9a-fA-F]+ ';'    [WFC: Legal Character]

Well-formedness constraint: Legal Character

Characters referred to using character references MUST match the
production for Char.

and section 2.2 says:

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
           | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

ExGuardianReader said:
I had a problem sending XML documents in where there was some text with
characters < 31.

Yes; in XML 1.0 the only characters < 32 that are allowed are CR, LF
and TAB.
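
As a rough illustration, the Char production quoted above translates directly into a range check. The Java sketch below assumes a current JDK; the class and method names are hypothetical and only meant to show the idea. The second method reports where the first offending character sits, which may help when asking the document's producer to fix it.

```java
// Hypothetical helper class; the names are illustrative only.
class Xml10Chars {

    // Mirrors production [2] Char of XML 1.0:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Returns the index (in Java chars) of the first code point that is
    // not a legal XML 1.0 Char, or -1 if every code point is legal.
    static int firstIllegal(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

With this check, TAB, LF and CR pass, while ^L (0xC) and every other character below 0x20 fail.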

-- Richard
 
Richard Tobin

Robert M. Gary said:
But why does Xerces create that character then?

I have no idea.

Robert M. Gary said:
Right now, this is the Java 1.4 XML DOM builder. I suspect this is actually
Xerces repackaged.

Someone must have given DOM builder the ^L in the first place. Tell
them to stop doing it.

-- Richard
 
Robert M. Gary

I think I mistyped the substitute character in this email. The DOM is
changing ^L to be &#12 but then if you try to parse the resultant document
into another DOM, it says it can't understand the &#12 that the other DOM
just created!!!
 
Keith M. Corbett

Robert M. Gary said:
I think I mistyped the substitute character in this email. The DOM is
changing ^L to be &#12 but then if you try to parse the resultant document
into another DOM, it says it can't understand the &#12 that the other DOM
just created!!!

The DOM serializer you're using to save the output is trying its best to
handle what you gave it.

The parser you're using to read the resulting instance is doing its best to
conform to the XML specification.

Just another case of garbage out, garbage in. :)

/kmc
 
Keith M. Corbett

Robert M. Gary said:
BTW: In the actual problem I'm trying to solve, I really don't have any
control over the document I'm receiving.

Knowing that your program input may contain illegal characters, you could
pre-process to transform or eliminate the offending data. This is more or
less trivial depending on the character encoding(s) you need to support.
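
One possible shape for such a pre-processing step is sketched below in Java. It is only a sketch under several assumptions: the input has already been decoded into a String (so the encoding question is settled elsewhere), a current JDK is available, and dropping the offending data is acceptable. XmlSanitizer and its methods are made-up names, not an existing API. It removes both literal illegal characters and numeric character references that resolve to one. It does not understand CDATA sections or comments, so text there that merely looks like a reference would be altered too.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-processor; not part of Xerces or the JDK.
class XmlSanitizer {

    // Numeric character references, decimal (&#12;) or hex (&#x0C;).
    private static final Pattern CHAR_REF =
        Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

    // XML 1.0 production [2] Char.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the input with XML 1.0-illegal characters and references removed.
    static String sanitize(String xml) {
        // Pass 1: drop references that resolve to an illegal character.
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer refsFixed = new StringBuffer();
        while (m.find()) {
            boolean legal;
            try {
                int cp = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
                legal = isXml10Char(cp);
            } catch (NumberFormatException tooLarge) {
                legal = false; // value does not fit in an int, so not a code point
            }
            m.appendReplacement(refsFixed,
                legal ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(refsFixed);

        // Pass 2: drop literal illegal characters, walking by code point.
        StringBuilder out = new StringBuilder(refsFixed.length());
        for (int i = 0; i < refsFixed.length(); ) {
            int cp = refsFixed.codePointAt(i);
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

The cleaned string can then be handed to the parser, for example through an InputSource wrapping a StringReader. Replacing the offending characters with a space instead of deleting them would be an equally reasonable choice, depending on what the data means.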

/kmc
 
