parsing xml from a stream

Peter Horlock

Hi,

I am using apache xmlbeans 2.2 to parse XML from an InputStream
and to create Java Beans from it.

The input is ISO-8859-1 encoded. The first 3 lines, as well as
the last 3 lines, are empty lines, and I can't (currently) change that.
Before, we were using method.getResponseBodyAsString().trim();
and gave the result to xmlbeans - that worked, but resulted in a lot of
warnings in the server logs, as the input can sometimes be pretty big.

Here's what I am doing now:
InputStream inputStream = method.getResponseBodyAsStream();

XmlOptions xmlOptions = new XmlOptions();
xmlOptions.setCharacterEncoding("ISO-8859-1");
xmlOptions.setLoadStripComments();
xmlOptions.setLoadTrimTextBuffer();
xmlOptions.setLoadStripWhitespace();

org.apache.xmlbeans.SchemaType type =
(org.apache.xmlbeans.SchemaType);
org.apache.xmlbeans.XmlBeans.getContextTypeLoader().parse
( inputStream, type, xmlOptions );

This however, throws the following error:
[...]
Caused by: java.io.CharConversionException: Malformed UTF-8 character: 0xfc 0x72 0x6b 0x65
at org.apache.xmlbeans.impl.piccolo.xml.UTF8XMLDecoder.decode(UTF8XMLDecoder.java:141)
at org.apache.xmlbeans.impl.piccolo.xml.XMLStreamReader$FastStreamDecoder.read(XMLStreamReader.java:762)
at org.apache.xmlbeans.impl.piccolo.xml.XMLStreamReader.read(XMLStreamReader.java:162)
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yy_refill(PiccoloLexer.java:3474)
at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:3958)
at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3435)
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1270)
at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1257)
at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)

------------
When I instead used
method.getResponseBodyAsString().trim();
and created an InputStream based on the trimmed String, then it worked.
So I assume something is wrong with the empty lines at the beginning and
end of the document. How can I get rid of them without converting the
entire stream to a String (e.g. getResponseBodyAsString())?

Thanks in advance,

Peter
 
Mike Schilling

Peter said:
When I instead used
method.getResponseBodyAsString().trim();
and created an InputStream based on the trimmed String, then it
worked. So I assume something is wrong with the empty lines at the
beginning and end of the document. How can I get rid of them without
converting the entire stream to a String (e.g. getResponseBodyAsString())?

Write a subclass of FilterInputStream that trims off any leading
whitespace. I suspect the trailing whitespace won't cause any
problems, which is good, because it's harder to recognize.

This is very odd, though. If the input is ISO-8859-1, and you've told
the parser that it's ISO-8859-1, what the hell is it complaining about
malformed UTF-8 characters for? The blank lines can't be causing it,
because they'd be ASCII characters, which have the same values in
ISO-8859-1 and UTF-8.
 
Peter Horlock

Mike said:
Write a subclass of FilterInputStream that trims off any leading
whitespace. I suspect the trailing whitespace won't cause any
problems, which is good, because it's harder to recognize.

Hm - but if I did that - wouldn't that then parse the entire stream
just before it is handled by xmlbeans? I mean, I changed from
getResponseBodyAsString to getResponseBodyAsStream for performance
reasons - now I wouldn't want to do the same thing manually! ;-)

Mike said:
This is very odd, though. If the input is ISO-8859-1, and you've told
the parser that it's ISO-8859-1, what the hell is it complaining about
malformed UTF-8 characters for? The blank lines can't be causing it,
because they'd be ASCII characters, which have the same values in
ISO-8859-1 and UTF-8.

Yeah, I don't get it either. I wish I could print out the character(s)
it complains about:
Malformed UTF-8 character: 0xfc 0x72 0x6b 0x65

Maybe it's something completely different?!

Cheers,

Peter
 
Mayeul

Peter said:
Yeah, I don't get it either. I wish I could print out the character(s)
it complains about:
Malformed UTF-8 character: 0xfc 0x72 0x6b 0x65

String s = new String(new byte[] {(byte)0xfc, 0x72, 0x6b, 0x65},
"iso-8859-1");
System.out.println(s);

This prints ürke, a perfectly plausible fragment of ordinary text. So the
offending bytes are most likely fine when taken as ISO-8859-1 encoded
text, but they are indeed malformed when taken as UTF-8 encoded text.

To me, it does look like the parser is ignoring your ISO-8859-1
configuration.
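The same check can be made stricter. The following is only a sketch (the class and method names are made up for illustration): it decodes the four bytes from the exception message with a decoder that reports malformed input instead of silently replacing it, once as ISO-8859-1 and once as UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class DecodeCheck {
    // Decodes strictly: malformed input raises an exception
    // instead of being replaced by U+FFFD.
    static String strictDecode(byte[] bytes, Charset charset)
            throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws Exception {
        // The bytes quoted in the CharConversionException message.
        byte[] offending = {(byte) 0xfc, 0x72, 0x6b, 0x65};

        // Every byte value is a valid ISO-8859-1 character, so this succeeds.
        System.out.println(strictDecode(offending, StandardCharsets.ISO_8859_1));

        // 0xfc is not a legal lead byte in UTF-8, so this throws.
        try {
            strictDecode(offending, StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-8: " + e);
        }
    }
}
```

If the second call fails while the first succeeds, the data is consistent with ISO-8859-1 and the problem lies in which decoder the parser picked.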
 
Mayeul

Peter said:
Hm - but if I did that - wouldn't that then parse the entire stream
just before it will be handled by xmlbeans?

No, a FilterInputStream's general purpose is to filter what is read from
the original InputStream, passing it along with or without modifications.

It reads from one stream and lets its caller read a stream in turn.

It normally does not need to read the entire InputStream just to decide
whether the current content must be filtered, and it is definitely
not the best tool when that is what you need.

The suggested approach here is to detect at the start whether you're
reading empty lines and filter them out (by not passing them along and
reading on until they're over), then stop filtering anything.
 
Mike Amling

Steven said:
Something that occurs to me is that XML without an <?xml encoding="..."
?> declaration at the very start has to be treated as UTF-8, unless you
have an out-of-band setting (which the OP does). It sounds like
setCharacterEncoding() isn't being passed down to the parser (of a
stream), so it's defaulting to UTF-8.


Since there appears to be a parse(Reader, ...) method, and the charset
is known, why not use new
InputStreamReader(method.getResponseBodyAsStream(), "ISO-8859-1"), and
pass that?
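A minimal sketch of that idea follows. The helper names and the sample element are invented for illustration, and a ByteArrayInputStream stands in for method.getResponseBodyAsStream(). It only shows the decoding step; the call into xmlbeans is left as a comment because the exact parse(Reader, ...) overload should be checked against the xmlbeans version in use.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class ReaderDemo {
    // Wraps a byte stream so the bytes are decoded as ISO-8859-1
    // before any parser gets to guess the encoding.
    static Reader latin1Reader(InputStream in) {
        return new InputStreamReader(in, StandardCharsets.ISO_8859_1);
    }

    // Drains a Reader into a String; used here only to inspect the result.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the HTTP response body; the content is made up.
        byte[] body = "<name>T\u00fcrke</name>".getBytes(StandardCharsets.ISO_8859_1);
        try (Reader reader = latin1Reader(new ByteArrayInputStream(body))) {
            // In the real code the reader would be handed to the
            // parse(Reader, ...) overload instead of being drained.
            System.out.println(readAll(reader));
        }
    }
}
```

The reader still decodes incrementally, so the response is not buffered as one big String. Note that this only settles the encoding question; blank lines in front of the XML declaration would still reach the parser.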

--Mike Amling
 
Mike Schilling

Peter said:
Hm - but if I did that - wouldn't that then parse the entire stream
just before it will be handled by xmlbeans?

No. The FilterInputStream would simply refuse to pass characters to
its caller until it saw non-whitespace for the first time. Very
roughly, its read method would look like

private boolean nonWsSeen;

public int read() throws IOException
{
    while (true)
    {
        // "in" is the wrapped stream inherited from FilterInputStream
        int i = in.read();
        if (nonWsSeen || i < 0)
            return i;
        char c = (char)i;
        if (!Character.isWhitespace(c))
        {
            nonWsSeen = true;
            return c;
        }
    }
}
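For completeness, here is one way the rough method above could be fleshed out into a whole class. This is a sketch, not tested against xmlbeans; the class name is invented. One detail worth knowing: FilterInputStream.read(byte[], int, int) delegates straight to the wrapped stream rather than to read(), so the bulk method has to be overridden too, otherwise a parser that reads in blocks would bypass the filter.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class LeadingWhitespaceSkippingInputStream extends FilterInputStream {
    private boolean nonWsSeen;

    LeadingWhitespaceSkippingInputStream(InputStream in) {
        super(in);
    }

    // Space, tab, LF and CR have the same byte values in ASCII,
    // ISO-8859-1 and UTF-8, so they can be tested before decoding.
    private static boolean isXmlWhitespace(int b) {
        return b == ' ' || b == '\t' || b == '\n' || b == '\r';
    }

    @Override
    public int read() throws IOException {
        while (true) {
            int b = in.read();
            if (nonWsSeen || b < 0 || !isXmlWhitespace(b)) {
                nonWsSeen = true;
                return b;
            }
            // Otherwise: still inside the leading whitespace, drop it.
        }
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (nonWsSeen) {
            return in.read(buf, off, len); // fast path once the prefix is gone
        }
        if (len == 0) {
            return 0;
        }
        int first = read(); // skips the whitespace prefix
        if (first < 0) {
            return -1;
        }
        buf[off] = (byte) first; // a short read of one byte is allowed
        return 1;
    }

    @Override
    public boolean markSupported() {
        return false; // mark/reset would complicate the skip state
    }

    public static void main(String[] args) throws IOException {
        // Made-up document with three blank lines before the declaration.
        byte[] body = "\n\n\n<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n<a/>\n\n\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        StringBuilder sb = new StringBuilder();
        try (InputStream in =
                new LeadingWhitespaceSkippingInputStream(new ByteArrayInputStream(body))) {
            int b;
            while ((b = in.read()) != -1) {
                sb.append((char) b); // bytes map 1:1 to chars in ISO-8859-1
            }
        }
        System.out.print(sb);
    }
}
```

Only the leading whitespace is consumed eagerly; everything after the first non-whitespace byte is passed through untouched, so the stream is never buffered as a whole.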
 
Mike Schilling

Mike said:
Could there be an explicit erroneous <?xml ... encoding="UTF-8"?> in
the stream and the parser is letting it override the xmlOptions?

Could be, though out-of-band settings are supposed to override in-band
settings. But if so, Steven's suggestion of using an
InputStreamReader to do the conversion is the right workaround.
 
Peter Horlock

Thanks guys, I solved it. It definitely was the leading whitespace.
The parser must have taken UTF-8, even though the document said
ISO-8859-1 in the third line -
but it must be the first line.
So I found an implementation of an InputStreamReader that trims off
leading whitespace, and gave that to a new InputStream - voila - it's
working now! :)
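Peter does not say which implementation he found. As a rough illustration only, a reader that drops leading whitespace can be built from the standard PushbackReader; the class and method names below are invented.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

class LeadingWhitespace {
    // Consumes leading whitespace and returns a reader positioned
    // at the first non-whitespace character (or at end of input).
    static Reader skipLeadingWhitespace(Reader source) throws IOException {
        PushbackReader reader = new PushbackReader(source, 1);
        int c;
        while ((c = reader.read()) != -1) {
            if (!Character.isWhitespace(c)) {
                reader.unread(c); // put the first real character back
                break;
            }
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        // Made-up input: three blank lines, then the XML declaration.
        Reader in = skipLeadingWhitespace(new StringReader(
                "\n\n\n<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n<a/>\n"));
        int c;
        while ((c = in.read()) != -1) {
            System.out.print((char) c);
        }
    }
}
```

Combined with an InputStreamReader that decodes as ISO-8859-1, this puts the XML declaration at the very first character, which is where a parser expects it.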

Peter
 
Roedy Green

[...]
Caused by: java.io.CharConversionException: Malformed UTF-8 character:

Sounds like somewhere along the line Java thinks the encoding is UTF-8
when it is something else.

See http://mindprod.com/jgloss/encoding.html

Check every place in your program where you specify an encoding or
accept a default -- opening a reader, converting bytes to String,
inside the XML ...
--
Roedy Green Canadian Mind Products
http://mindprod.com

"Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin. For, as has been pointed out several times, there is no such thing as a random number — there are only methods to produce random numbers, and a strict arithmetic procedure of course is not such a method."
~ John von Neumann (born: 1903-12-28 died: 1957-02-08 at age: 53)
 
