XML and Invalid byte UTF-8

R

Hello everybody.

I have a problem with fatal error while parsing XML.

I have a server and a client.
My server builds XML from a web page requested by the client. After the
page has been converted to XML, the content is sent to the client.

this is client code:

// read text from socket
while (null != line)
{
    sb.append(line);
    line = br.readLine();
}

// debug - this works I can see my XML response!
// System.out.println(sb.toString());

// parse my String back to DOM Document
DocumentBuilder xdb2 = XMLParserUtils.getXMLDocBuilder();
ByteArrayInputStream bais =
    new ByteArrayInputStream(sb.toString().getBytes());
Document doc = xdb.parse(new InputSource(bais));

and then I receive this fatal error:

[Fatal Error] :1:1335: Invalid byte 1 of 1-byte UTF-8 sequence.
org.xml.sax.SAXParseException: Invalid byte 1 of 1-byte UTF-8 sequence.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:264)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:292)

How can I avoid this problem?
Should I encode the text sent through the socket, and if so, how?

Thanks in advance for your help
Best regards
R
 
Ross Bamford

> Hello everybody.
>
> I have a problem with fatal error while parsing XML.
>
> ... [snip] ...
>
> [Fatal Error] :1:1335: Invalid byte 1 of 1-byte UTF-8 sequence.
> org.xml.sax.SAXParseException: Invalid byte 1 of 1-byte UTF-8 sequence.
>     at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:264)
>     at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:292)
>
> How can I avoid this problem?

Sounds like invalid XML. Do you have the XML declaration
("<?xml version='1.0'?>") at the start of your data?

Hope that helps,
Ross
 
R

The thing is that the String read from the socket does have the XML prolog:

<?xml version="1.0" encoding="UTF-8"?>

Any idea what I should do?

Could it be the encoding?

Thanks in advance
best regards
R
 
R

Hm...

I think the encoding is broken (but I don't know how to fix it).

The XML is in UTF-8.

// debug
// System.out.println(sb.toString());
ByteArrayInputStream bais =
    new ByteArrayInputStream(sb.toString().getBytes());
Document doc = xdb.parse(new InputSource(bais));

Maybe the byte conversion after sb.toString() is why
xdb.parse(new InputSource(bais)) raises the fatal error?
(Maybe the UTF-8 text is converted to the Polish ISO-8859-2?)

Am I right?
If so, how can it be fixed? (I'm a newbie and not very familiar with
Java.)

Thanks for your help
best regards
R
 
Ross Bamford

> I think that encoding is broken (but I don't know how to fix it)

Hmm, it's possible... First, looking back at your first message, I
notice you assigned the DocumentBuilder to 'xdb2' but then parsed with
'xdb'. I assume this was a typo (otherwise, check it)?

Without seeing more of your code, I'm not sure where you're getting the
data from (a socket, I think you said?). If so, why not just pass the
original InputStream to the parser? There is a parse(InputStream)
overload that should handle your encoding correctly. If you have text
input, there are decorators in java.io that will help.
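
A minimal sketch of that suggestion, assuming the stock JAXP DocumentBuilderFactory rather than the XMLParserUtils helper from the original post. The class name StreamParseDemo, the rootText helper, and the sample document are made up for illustration, and a ByteArrayInputStream stands in for the socket's InputStream so the example runs on its own:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class StreamParseDemo {

    /** Parses XML straight from a byte stream; the parser works out the encoding itself. */
    static Document parse(InputStream in)
            throws ParserConfigurationException, SAXException, IOException {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
    }

    /** Hypothetical demo helper: root element text, with failures rethrown unchecked. */
    static String rootText(InputStream in) {
        try {
            return parse(in).getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for socket.getInputStream(): UTF-8 bytes as they would arrive on the wire.
        byte[] wire = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<msg>\u0142\u00F3d\u017A</msg>").getBytes(StandardCharsets.UTF_8);
        System.out.println(rootText(new ByteArrayInputStream(wire)));
    }
}
```

Because no String is built in between, the default charset never gets involved. Note that the parser reads until end of stream, so this presumably only fits the socket case if the server closes the connection (or its output side) after sending the document, which the original readLine() loop relies on as well.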

Generally speaking, you don't want to convert things into bytes
unless you really need to. Leave that to the lower-level code in the
JDK (et al.), which has advanced support for encodings :)

Apart from that, back to my first suggestion - strip your input down to
the bare minimum and see if that helps.

Cheers,
Ross
 
A. Bolmarcich

> hm...
>
> I think that encoding is broken (but I don't know how to fix it)
>
> XML is in UTF-8
>
> // debug
> // System.out.println(sb.toString());
> ByteArrayInputStream bais = new
> ByteArrayInputStream(sb.toString().getBytes());
> Document doc = xdb.parse(new InputSource(bais));
>
> sb.toString() - maybe this is why xdb.parse(new InputSource(bais));
> raises fatal error?
> (maybe UTF-8 is converted to polish ISO-8859-2?)
>
> am I right?

The expression sb.toString().getBytes() uses the default encoding,
which for you may be ISO-8859-2.

> If so - how can it be fixed? (I'm newbie and not quite familiar with
> Java)

Chances are the encoding given in the XML declaration is UTF-8
(implicitly or explicitly). Create the ByteArrayInputStream by using
the expression sb.toString().getBytes("UTF-8") so that the bytes are
the UTF-8 encoding of the Unicode characters of sb.toString().
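
To make the difference concrete, here is a small sketch that encodes the same text once as UTF-8 and once as ISO-8859-2 and hands both byte arrays to the parser. The class EncodingDemo, the tryParse helper, and the sample document are hypothetical; it also assumes the JVM ships the ISO-8859-2 charset, which is common but not guaranteed by the Java specification:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class EncodingDemo {

    // Hypothetical sample; the declaration promises UTF-8, the text contains Polish letters.
    static final String XML = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<msg>za\u017C\u00F3\u0142\u0107</msg>";

    /** Encodes xml with the given charset, parses the bytes, returns the root text or null on rejection. */
    static String tryParse(String xml, Charset charset) {
        try {
            byte[] bytes = xml.getBytes(charset);
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new ByteArrayInputStream(bytes)));
            return doc.getDocumentElement().getTextContent();
        } catch (SAXException | IOException e) {
            // Bytes that are not valid UTF-8 end up here.
            return null;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Bytes match the declared encoding: the document parses.
        System.out.println(tryParse(XML, StandardCharsets.UTF_8) != null);
        // Bytes are ISO-8859-2 but the declaration says UTF-8: the parser rejects them.
        System.out.println(tryParse(XML, Charset.forName("ISO-8859-2")) != null);
    }
}
```

The second call mimics getBytes() on a machine whose default charset is ISO-8859-2. The fix only helps if the String itself was decoded correctly when it was read from the socket; if the Reader behind br was also created with the default charset, that side may need an explicit "UTF-8" too.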
 
Thomas Weidenfeller

R said:
> Hello everybody.
>
> I have a problem with fatal error while parsing XML.
>
> I have a server and a client.
> My server creates XML from web page given by the client, after parsing
> it to XML the content is being sent to client.
>
> this is client code:
>
> // read text from socket
> while (null != line)
> {
>     sb.append(line);
>     line = br.readLine();
> }

I really don't see why you first read all the data into a string and
then feed that string into the parser. Why don't you parse the data
directly? I also don't see why you want to go through the sequence of

byte data from socket
-> decoded to text in a String(Buffer)
-> back to byte data for XML parser
-> wrapped as an InputSource

None of this is necessary. You can provide the InputStream from the
Socket to the XML parser.

But if you really want to first read all the data:
> // debug - this works I can see my XML response!
> // System.out.println(sb.toString());
>
> // parse my String back to DOM Document
> DocumentBuilder xdb2 = XMLParserUtils.getXMLDocBuilder();
> ByteArrayInputStream bais = new
> ByteArrayInputStream(sb.toString().getBytes());

From the String.getBytes() documentation:

| Encodes this String into a sequence of bytes using the
| platform's default charset, storing the result into a new byte array.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^

Is your platform's default charset UTF-8? I doubt it. You want the
getBytes(String charsetName) method instead. But ...

> Document doc = xdb.parse(new InputSource(bais));

... did you notice that InputSource can also read character data
directly? Wrap your String in a StringReader and pass that to the
InputSource, and no conversion back to bytes is needed.
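
A short sketch of parsing from the String itself, using InputSource's Reader constructor with a StringReader. The class StringParseDemo, the rootText helper, and the sample text are illustrative only, and the stock JAXP factory is assumed in place of the poster's XMLParserUtils:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class StringParseDemo {

    /** Parses XML held in a String; no byte conversion, so no charset is involved. */
    static Document parse(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        InputSource source = new InputSource(new StringReader(xml));
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source);
    }

    /** Hypothetical demo helper: root element text, with failures rethrown unchecked. */
    static String rootText(String xml) {
        try {
            return parse(xml).getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // In the original client this would be sb.toString().
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<msg>\u017Curaw</msg>";
        System.out.println(rootText(xml));
    }
}
```

When the parser is given a character stream, the encoding named in the XML declaration no longer drives any byte decoding, so the only place a charset can still go wrong is where the socket bytes are first turned into characters.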

/Thomas
 
This one works...

Yes, you are right. This solution fixed my problem, very smart!

Mingwei


A. Bolmarcich said:
On 2005-05-09, R <[email protected]> wrote:
> hm...
>
> I think that encoding is broken (but I don't know how to fix it)
>
> XML is in UTF-8
>
> // debug
> // System.out.println(sb.toString());
> ByteArrayInputStream bais = new
> ByteArrayInputStream(sb.toString().getBytes());
> Document doc = xdb.parse(new InputSource(bais));
>
> sb.toString() - maybe this is why xdb.parse(new InputSource(bais));
> raises fatal error?
> (maybe UTF-8 is converted to polish ISO-8859-2?)
>
> am I right?


The expression sb.toString().getBytes() uses the default encoding,
which for you may be ISO-8859-2.

> If so - how can it be fixed? (I'm newbie and not quite familiar with
> Java)


Chances are the encoding given in the XML declaration is UTF-8
(implicitly or explicitly). Create the ByteArrayInputStream by using
the expression sb.toString().getBytes("UTF-8") so that the bytes are
the UTF-8 encoding of the Unicode characters of sb.toString().
 
