XML parse validation

sarosh.shirazi

Hi,

I'm facing an illegal character problem when I read an XML file. The
code below was used to do the reading.

XmlReaderSettings settings = new XmlReaderSettings();
settings.CheckCharacters = false;

string fXmlFileName = _FilePath;
XmlReader reader = XmlReader.Create(fXmlFileName, settings);
XML = new XPathDocument(reader);

The exception comes on the constructor of XPathDocument. I want to
read the file overlooking the illegal characters (UTF-8 encoding). A
solution pointed out to me was to parse it manually by reading it as
ASCII and replacing the characters, but this hurts performance,
so I want to avoid it. Any suggestion in this regard would be
most welcome. How can I avoid validation?
 
Joseph Kesselman

> The exception comes on the constructor of XPathDocument. I want to
> read the file overlooking the illegal characters (UTF-8 encoding).

This isn't a validation issue, but a well-formedness issue. That
character is not legal in XML; if it is present, your file is simply not
an XML file.

Change the code which is generating the XML to avoid putting forbidden
characters into the document in the first place (if you really need to
express random binary data, the usual workaround is to encode it as
something like base-64 before putting it into the XML and decode it
before using it).
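
As a rough illustration of the base-64 workaround, here is a minimal C# sketch. The class name `Base64Payload` and the element name used in the usage line are made up for the example; only `Convert.ToBase64String`, `Convert.FromBase64String` and `XElement` are real framework APIs.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class Base64Payload
{
    // Store arbitrary bytes as base-64 text, which only uses characters
    // that are legal in XML.
    public static XElement Wrap(string elementName, byte[] data)
    {
        return new XElement(elementName, Convert.ToBase64String(data));
    }

    // Recover the original bytes from the element's text content.
    public static byte[] Unwrap(XElement element)
    {
        return Convert.FromBase64String(element.Value);
    }
}
```

For example, `Base64Payload.Wrap("payload", new byte[] { 0x00, 0x0F, 0x41 })` produces an element whose text is `AA9B`, and `Unwrap` returns the same three bytes. This only helps if you control the code that writes the file.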

The alternative, as you pointed out, is to prefilter the data before it
gets to the XML parser. I don't know enough about C# to give you any
advice, but in Java setting up a filtered input stream is quite
straightforward.
 
Joseph Kesselman

(Note: I'm assuming you're not working in Java because you spelled
"string" with a lowercase S. If that was just a typo, and you are using
Java, then a filter would do the job. But the real question remains: Why
are you generating broken XML in the first place, and shouldn't you fix
that rather than trying to work around it?)
 
Martin Honnen

If you set CheckCharacters to false then the XmlReader allows character
references to otherwise illegal characters, so I am not sure why you get
a parse error. Are you sure you have character references? If you have
such characters literally in the document then CheckCharacters set to
false does not help. In that case the XML APIs do not help at all, you
need to preprocess the document to get rid of those characters.
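
A preprocessing step does not have to mean a separate pass over the file. One possible approach, sketched below under some assumptions, is to wrap the input in a `TextReader` that drops illegal characters on the fly and hand that to `XmlReader.Create`. The class name `XmlCharFilterReader` is invented for this example, and the sketch assumes the file really is UTF-8, because the encoding declaration in the file is not consulted when a `TextReader` is supplied. It also lets surrogate halves through without checking that they are correctly paired.

```csharp
using System.IO;

// Hypothetical filter, not a framework class: removes characters that
// XML 1.0 does not allow before the parser ever sees them.
public sealed class XmlCharFilterReader : TextReader
{
    private readonly TextReader _inner;

    public XmlCharFilterReader(TextReader inner)
    {
        _inner = inner;
    }

    // XML 1.0 Char production, expressed over UTF-16 code units.
    public static bool IsLegal(char c)
    {
        return c == '\t' || c == '\n' || c == '\r'
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD)
            || char.IsSurrogate(c); // pairing is not verified here
    }

    public override int Peek()
    {
        int c;
        // Discard illegal characters until a legal one (or end) is next.
        while ((c = _inner.Peek()) != -1 && !IsLegal((char)c))
            _inner.Read();
        return c;
    }

    public override int Read()
    {
        int c;
        while ((c = _inner.Read()) != -1 && !IsLegal((char)c)) { }
        return c;
    }

    public override int Read(char[] buffer, int index, int count)
    {
        int kept = 0;
        // Keep reading until at least one legal character survives,
        // so that a return value of 0 still means end of input.
        while (kept == 0)
        {
            int n = _inner.Read(buffer, index, count);
            if (n <= 0) return 0;
            for (int i = 0; i < n; i++)
            {
                char c = buffer[index + i];
                if (IsLegal(c)) buffer[index + kept++] = c;
            }
        }
        return kept;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}
```

With this in place the original code would change to something like `XmlReader.Create(new XmlCharFilterReader(new StreamReader(_FilePath, Encoding.UTF8)), settings)`. The filtering costs one comparison per character while streaming, so it should be much cheaper than loading and rewriting the whole file first, although that has not been measured here.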
 
sarosh.shirazi

To Joseph: It's part of the requirement that such characters would
come up, so I'll have to put up with it :)
To Martin: Yes, these characters appear literally in the
file.
Is there any way other than ASCII preprocessing or preparsing? I know
which tags will contain these characters. Can I somehow have these
particular tags and their data simply ignored by the XML parser?
 
Andy Dingley

> To Joseph: It's part of the requirement that such characters would
> come up.

I doubt this very much. The _character_ / codepoint "&x00" is a
different concept to the byte or octet "&x00". Although Unicode
encodings may well involve such a byte value at the level of the raw
wire protocol, they certainly don't allow it as a valid character
(sic, codepoint).

XML, at the level you describe it, is a character stream. In XML a
numeric character reference is a reference to this possible (albeit
forbidden) 00 value as a _character_, not just a raw byte.

It sounds as if your problem here is an encoding problem (i.e. a
Unicode problem, not an XML problem), even before it gets as far as
being an XML well-formedness issue. Raw bytes 0f 00 are just bytes
(which might have some correct place in the encoding you're using) but
they're not intended to encode a resultant _character_ of 00, or the
same thing as a numeric character reference to that value.
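
The byte-versus-character distinction can be seen with a small C# sketch. The helper name `ByteVsChar.Hex` is invented for the example; `Encoding.Unicode` is .NET's name for UTF-16 little-endian.

```csharp
using System;
using System.Text;

// Hypothetical helper: shows the raw bytes an encoding produces for a string.
public static class ByteVsChar
{
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }
}

// ByteVsChar.Hex(Encoding.Unicode, "A") returns "41-00"
// ByteVsChar.Hex(Encoding.UTF8, "A")    returns "41"
```

The string "A" contains no character with value 0, yet its UTF-16 form contains a 00 byte. So if a UTF-16 file is read as though it were UTF-8 or ASCII, every 00 byte turns into an apparent NUL character, which is one way to end up with this kind of error.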
 
Peter Flynn

On Tue, 08 Jan 2008 22:29:35 -0800, sarosh.shirazi wrote:

[snip]
> To Joseph: It's part of the requirement that such characters would come
> up...so i'll have to bear the heck :)

Then as Joseph said, your file is not an XML file, so you must use
non-XML software to process it.

> Is there any way other than ascii preprocessing or preparsing.

Not as far as I am aware.

> I know
> the tags which shall have these chars. Can i somehow have these
> particular tags and their data simply ignored in XML?

No, because (as already explained) your file is not an XML file.
You cannot use XML software and methods on non-XML files in this
way (apart from the method Martin suggested).

If you can fix it by exchanging the invalid characters on a 1:1 basis,
then just use a simple inline filter like tr, which is extremely fast.

Alternatively, change all the invalid characters to some form of markup,
eg <junk char="0"/> so that they can be transformed back again after
processing. A stream editor like sed is very fast for this kind of thing.
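
For anyone who would rather stay in C# than shell out to sed, the markup substitution could look roughly like the sketch below. The class name `JunkMarkup` is invented, the `junk` element follows the example above, and the sketch only handles control characters below 0x20. It also assumes the bad characters occur in element content, not inside tags or attribute values, where inserting an element would itself break the document.

```csharp
using System.Text;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class JunkMarkup
{
    // Replace each forbidden control character with <junk char="N"/>.
    public static string Escape(string raw)
    {
        var sb = new StringBuilder(raw.Length);
        foreach (char c in raw)
        {
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r')
                sb.Append("<junk char=\"").Append((int)c).Append("\"/>");
            else
                sb.Append(c);
        }
        return sb.ToString();
    }

    // Rebuild the original text of an element after parsing, turning
    // each <junk/> child back into the character it stands for.
    public static string RestoreText(XElement element)
    {
        var sb = new StringBuilder();
        foreach (XNode node in element.Nodes())
        {
            if (node is XText text)
                sb.Append(text.Value);
            else if (node is XElement child && child.Name == "junk")
                sb.Append((char)(int)child.Attribute("char"));
        }
        return sb.ToString();
    }
}
```

The restored string contains the forbidden characters again, so it can be used by the application but cannot be written back out as XML without escaping it once more.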

And tell your data source that their data will process more easily if
they generate well-formed XML. A "requirement" like the one you mention
is simply evidence of bad planning on their part.

///Peter
 
