XML-Parsing with UTF-8 Byte-Order-Mark (BOM)

Patrick.Gebhardt · Jun 25, 2007

Hello,

I have a really weird problem.

The environment is a client-server application in which the client
reads a UTF-8 encoded XML file (containing, e.g., Cyrillic characters) and
sends it to the server, where it is parsed in two different ways:
first with a normal SAX parser, then via Castor (which uses the
_same_ parser library).

Relevant libs: xercesImpl 2.9.0, castor 0.9.5

The client XML file is UTF-8 with BOM (hex: EF BB BF).

The client sends this file to the server via a commons-httpclient POST call
with the correct content type.
I have verified on the server side that the file is received correctly: I can
read the Cyrillic characters in the log file after doing the following
in the servlet.

The following is obviously pseudocode:

doPost() {
request.setCharacterEncoding("UTF8");
InputStream in = request.getInputStream();

ByteArrayOutputStream baos = new ByteArrayOutputStream();
byte[] buffer = new byte[1024];

int count = in.read(buffer);
while( count != -1) {
baos.write(buffer, 0, count);
count = in.read(buffer);
}

byte[] xml = baos.toByteArray();
String s = new String(xml, "UTF8"); // string is correct, contains Cyrillic characters

--- Up to here, everything is fine.

--- Now I have to parse the XML to find a node attribute and, depending
--- on its value, decide into which Castor classes to unmarshal the XML.
--- To call Castor, I need a second input stream for Castor to use,
--- so I copy the byte[] and create a second stream.
--- (The files are really small, so I don't expect memory problems.)

byte[] xmlCastor = new byte[xml.length];
System.arraycopy(xml, 0, xmlCastor, 0, xml.length);

ByteArrayInputStream bais = new ByteArrayInputStream(xml);
ByteArrayInputStream baisCastor = new
ByteArrayInputStream(xmlCastor);

-- I can verify in the log file that these two byte arrays contain the
-- same Cyrillic characters.

-- Now I call the SAX parser with the first stream and receive the
-- node attribute.
-- Then I pass the second stream to Castor ... and bummer ...

Caused by: org.xml.sax.SAXException: Parsing Error: Content is not
allowed in prolog.

-- That is because of the byte-order mark; the parser does not like it.
-- Two identical streams (as far as I can tell), handled by the same
-- parser ... one runs into an exception, the other does not.
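
A minimal sketch of one possible workaround, assuming the BOM really is what trips the second parse and that dropping the three leading bytes EF BB BF is acceptable before the bytes are handed to Castor. The class name BomStripper and the method stripUtf8Bom are hypothetical helpers for illustration; they are not part of Castor or Xerces.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BomStripper {

    /**
     * Returns a copy of the data without a leading UTF-8 BOM (EF BB BF).
     * If no BOM is present, the original array is returned unchanged.
     */
    public static byte[] stripUtf8Bom(byte[] data) {
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            return Arrays.copyOfRange(data, 3, data.length);
        }
        return data;
    }

    public static void main(String[] args) {
        // Build a small UTF-8 document with a BOM in front (illustrative data only).
        byte[] body = "<root name=\"\u041F\u0440\u0438\u0432\u0435\u0442\"/>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] withBom = new byte[body.length + 3];
        withBom[0] = (byte) 0xEF;
        withBom[1] = (byte) 0xBB;
        withBom[2] = (byte) 0xBF;
        System.arraycopy(body, 0, withBom, 3, body.length);

        // Strip the BOM and wrap the result in the stream that would go to Castor.
        byte[] cleaned = stripUtf8Bom(withBom);
        InputStream forCastor = new ByteArrayInputStream(cleaned);
        System.out.println("bytes before: " + withBom.length + ", after: " + cleaned.length);
    }
}
```

In the servlet above, this would mean building baisCastor from stripUtf8Bom(xml) instead of from the plain copy. Whether that is enough depends on how Castor wraps the stream internally, which I have not verified.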

-- I have _exactly one_ parser in my Tomcat in WEB-INF/lib, and that
-- is xercesImpl-2.9.0.jar.
-- Is it somehow possible that Tomcat provides a different version? I
-- cannot verify how Castor chooses its XML parser, but I obtain mine
-- the following way:

SAXParserFactory pf = SAXParserFactory.newInstance();
XMLReader parser = pf.newSAXParser().getXMLReader();
parser.parse(new InputSource(bais));
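
To check which parser implementation the JAXP lookup above actually returns, and from which jar it was loaded, something like the following sketch could be logged from the servlet. ParserInfo and describe are hypothetical names used only for illustration. The sketch covers only the SAXParserFactory path; it does not show which parser Castor picks internally.

```java
import java.security.CodeSource;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

public class ParserInfo {

    /** Describes the concrete class of an object and where that class was loaded from. */
    public static String describe(Object o) {
        Class<?> c = o.getClass();
        CodeSource src = c.getProtectionDomain().getCodeSource();
        // Classes shipped with the JDK itself may have no code source or no location.
        String location = (src == null || src.getLocation() == null)
                ? "JDK / bootstrap class path"
                : src.getLocation().toString();
        return c.getName() + " loaded from " + location;
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory pf = SAXParserFactory.newInstance();
        XMLReader reader = pf.newSAXParser().getXMLReader();
        System.out.println(describe(pf));
        System.out.println(describe(reader));
    }
}
```

If the reported location is not the xercesImpl-2.9.0.jar in WEB-INF/lib, a different parser is being picked up by the class loader.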

Any helpful tips appreciated!

P.S.: I can't change very much of the infrastructure ... Castor, e.g., is
definitely a fixed requirement.
