SAX and invalid chars

Christian

Hello

My problem is that I have to parse an XML file that contains some invalid
chars (e.g. 0x0E or 0x1E).

Running it through the parser as-is breaks parsing.
The easy solution I could think of would be to create a stream that pipes
the input through a filter which strips the low bytes out.
The problem is that if my XML is not in windows-1252 but in some other char
encoding, I might break the encoding by doing this.

Is there any ready-made solution to the problem?


Christian
 
Arne Vajhøj

Christian said:
My problem is that I have to parse an XML file that contains some invalid
chars (e.g. 0x0E or 0x1E).

Running it through the parser as-is breaks parsing.
The easy solution I could think of would be to create a stream that pipes
the input through a filter which strips the low bytes out.
The problem is that if my XML is not in windows-1252 but in some other char
encoding, I might break the encoding by doing this.

Is there any ready-made solution to the problem?

Do the same as the XML parser.

Read the XML header and get encoding from there.

Arne
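
For illustration, here is a minimal sketch of reading the encoding from the XML header. It assumes the file is in some superset of ASCII, so the first bytes can be decoded as ISO-8859-1 without loss. The class name DeclaredEncoding and the method sniff are made up for this example, and the 256-byte limit is an arbitrary choice.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any XML API.
final class DeclaredEncoding {
    // Matches the encoding pseudo-attribute of an XML declaration at the start of the input.
    private static final Pattern ENCODING = Pattern.compile(
        "^<\\?xml[^>]*?\\sencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // head: the first bytes of the file (a few hundred are enough).
    // Returns the declared encoding name, or "UTF-8" if none is declared.
    static String sniff(byte[] head) {
        int start = 0;
        // Skip a UTF-8 byte order mark (EF BB BF) if present.
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            start = 3;
        }
        int len = Math.min(head.length - start, 256);
        // ISO-8859-1 maps every byte to one char, so ASCII text survives unchanged.
        String prolog = new String(head, start, len, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(prolog);
        // XML defaults to UTF-8 when there is no encoding declaration.
        return m.find() ? m.group(1) : "UTF-8";
    }
}
```

One way to use it would be to read the first bytes from a BufferedInputStream after calling mark, call sniff, then reset the stream before handing it on.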
 
Mike Schilling

Arne said:
Do the same as the XML parser.

Read the XML header and get encoding from there.

Which is easy if you know that the XML file is in some superset of
ASCII, since the entire XML header will then be in ASCII. It's
trickier if the XML file might be in any encoding at all (e.g. EBCDIC,
UTF-16, etc.). In the latter case, look at Appendix F
(http://www.w3.org/TR/REC-xml/#sec-guessing) for some useful tips.
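
As a rough sketch of the Appendix F idea, the first four bytes can be compared against the byte patterns listed there. The class name EncodingGuess, the method guessFamily and the returned labels are invented for this example. The result only names an encoding family, and the XML declaration still has to be read to get the exact encoding.

```java
// Hypothetical helper following the byte patterns in Appendix F of the XML spec.
final class EncodingGuess {
    static String guessFamily(byte[] b) {
        if (b.length < 4) return "unknown";
        int b0 = b[0] & 0xFF, b1 = b[1] & 0xFF, b2 = b[2] & 0xFF, b3 = b[3] & 0xFF;

        // With a byte order mark. The UCS-4 checks must come before UTF-16,
        // because FF FE 00 00 also starts with the UTF-16LE mark FF FE.
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFE && b3 == 0xFF) return "UCS-4BE";
        if (b0 == 0xFF && b1 == 0xFE && b2 == 0x00 && b3 == 0x00) return "UCS-4LE";
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";

        // Without a byte order mark: look for "<" or "<?xm" in each family.
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0x00 && b3 == 0x3C) return "UCS-4BE";
        if (b0 == 0x3C && b1 == 0x00 && b2 == 0x00 && b3 == 0x00) return "UCS-4LE";
        if (b0 == 0x00 && b1 == 0x3C && b2 == 0x00 && b3 == 0x3F) return "UTF-16BE";
        if (b0 == 0x3C && b1 == 0x00 && b2 == 0x3F && b3 == 0x00) return "UTF-16LE";
        // "<?xm" in ASCII: UTF-8, ISO 8859, windows-1252 and other ASCII supersets.
        if (b0 == 0x3C && b1 == 0x3F && b2 == 0x78 && b3 == 0x6D) return "ASCII-compatible";
        // "<?xm" in EBCDIC.
        if (b0 == 0x4C && b1 == 0x6F && b2 == 0xA7 && b3 == 0x94) return "EBCDIC";

        // Per the spec this is UTF-8 without a declaration, or mislabeled or corrupt data.
        return "unknown";
    }
}
```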
 
Christian

Mike said:
Which is easy if you know that the XML file is in some superset of
ASCII, since the entire XML header will then be in ASCII. It's
trickier if the XML file might be in any encoding at all (e.g. EBCDIC,
UTF-16, etc.). In the latter case, look at Appendix F
(http://www.w3.org/TR/REC-xml/#sec-guessing) for some useful tips.

Thanks for your pointers.

That solution seems too heavy, though. As I am only expecting
UTF-8 and windows-1252, I will probably make do with the hack of just removing the
bytes (and I'll search the API now for an easy way to throw an
exception if neither of these encodings is used).

Thanks
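
A possible sketch of that hack, for reference. It relies on the fact that in both UTF-8 and windows-1252 a byte below 0x20 always stands for the control character itself (UTF-8 continuation bytes are all 0x80 or higher), so dropping such bytes cannot damage other characters. It would not be safe for UTF-16 or EBCDIC. The class name ControlByteFilter and the method requireSupported are made up for this example, and the name check is a plain string comparison that does not resolve charset aliases.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;

// Hypothetical filter: drops bytes that XML 1.0 does not allow as characters.
final class ControlByteFilter extends FilterInputStream {
    ControlByteFilter(InputStream in) {
        super(in);
    }

    // Below 0x20, XML 1.0 allows only tab, line feed and carriage return.
    private static boolean allowed(int b) {
        return b >= 0x20 || b == 0x09 || b == 0x0A || b == 0x0D;
    }

    // Throws if the declared encoding is not one this filter is safe for.
    static void requireSupported(String declared) throws UnsupportedEncodingException {
        if (!declared.equalsIgnoreCase("UTF-8") && !declared.equalsIgnoreCase("windows-1252")) {
            throw new UnsupportedEncodingException(declared);
        }
    }

    @Override
    public int read() throws IOException {
        int b;
        do {
            b = super.read();
        } while (b != -1 && !allowed(b));
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        while (true) {
            int n = super.read(buf, off, len);
            if (n <= 0) return n;
            // Compact the allowed bytes to the front of the region just read.
            int kept = 0;
            for (int i = 0; i < n; i++) {
                byte b = buf[off + i];
                if (allowed(b & 0xFF)) buf[off + kept++] = b;
            }
            // If everything was filtered out, read again instead of returning 0.
            if (kept > 0) return kept;
        }
    }
}
```

The filtered stream could then be wrapped in an org.xml.sax.InputSource and passed to the SAX parser, which still reads the encoding declaration itself.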
 
