Strangeness with Japanese, XML, Java

Robert M. Gary

I'm using JRE 1.5 on Japanese Solaris (SPARC). The JVM reports its default
character set as EUC-JP.
I'm seeing two strange things when using Japanese character sets...

1) If I write a program that does
System.out.println("$^%$%^^"); // assume those are Japanese characters that are multibyte under EUC-JP
the resulting output looks NOTHING like the characters I typed in.
Apparently the character set being used to read the literal is different
from the default.

2) If I create an XML document using the built-in DOM which contains
elements with values in Japanese, I get strangeness when I transform that
into an XML document. If I do not set the character set in the transformer,
the document will say it's in UTF-8 (the XML header will). However, the
actual document is NOT UTF-8. I downloaded IBM's ICU character set utilities
(they know nothing of XML, just character sets), and when I read the
document with uconv, telling it the input is UTF-8, it claims it is invalid
UTF-8. However, if I tell it the document is EUC-JP, it says it's good.
Also, when I change the transformer to use EUC-JP it creates the same
document bit-for-bit (other than changing the XML header to say EUC-JP).
Other character sets (UTC, etc) result in a different document.
So, my conclusion is that by default the XML DOM says it's UTF-8 in the
header, but ALWAYS uses the platform default unless you specify something
else (UTC for example).

Has anyone else seen this??
Here is my transformer...

Document new_document = documentBuilder.parse("japan2.xml");
System.out.println("I just read japan2.xml");
DOMSource new_source = new DOMSource(new_document);
StringWriter new_writer = new StringWriter();
StreamResult new_result = new StreamResult(new_writer);

Properties p = transformer.getOutputProperties();
//try explicit EUC
//p.setProperty(OutputKeys.ENCODING, "EUC-JP");

//try default (EUC)
//p.setProperty(OutputKeys.ENCODING,
// new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding());

//try UTF explicitly
//p.setProperty(OutputKeys.ENCODING, "UTF-8");

transformer.setOutputProperties(p);
Properties p2 = transformer.getOutputProperties();
p2.list(System.out);

transformer.transform(new_source, new_result);

String new_text_doc = new_writer.toString();
System.out.println("XML doc is "+new_text_doc );


Resulting document...
XML doc is <?xml version="1.0" encoding="UTF-8"?><GenAlertsReq
confirmed="true"
invokeId="2"><AlertList><Alert><Name>ja_alert-¤È¤Á¤Ä¤Ê¤Î¤Ë</Name><AffectedObjects
type="Obj"><Obj><Name>ja_mo-¤¢¤¨¤¤¤ª¤¦</Name></Obj></AffectedObjects><Properties><Property><Name>Severity</Name><Value>major</Value></Property><Property><Name>Manager</Name><Value>NetExpert</Value></Property></Properties></Alert></AlertList><AttrList><Attr
name="TOD"><Int32>1112980583</Int32></Attr><Attr
name="DMPAlarmObject"><Str>ja_mo-¤¢¤¨¤¤¤ª¤¦</Str></Attr><Attr
name="CLASS"><Str>NetExpert</Str></Attr><Attr
name="MANAGER"><Str>NetExpert</Str></Attr><Attr
name="DMPAlarmName"><Str>ja_alert-¤È¤Á¤Ä¤Ê¤Î¤Ë</Str></Attr><Attr
name="ARCHIVE_LENGTH"><Int32>0</Int32></Attr><Attr
name="DMPAlarmSeverity"><Str>major</Str></Attr><Attr
name="MsgType"><Str>Alarm</Str></Attr><Attr
name="MGR_PORT_KEY"><Int32>93</Int32></Attr><Attr
name="ARCHIVE_OFFSET"><Int32>0</Int32></Attr></AttrList></GenAlertsReq>

When I try to read it using IBM's ICU character set tool uconv I get the
following...
=> uconv -f UTF-8 ~/test/xml/japan.xml
Conversion to Unicode from codepage failed at input byte position 116.
Bytes: a4 Error: Illegal character found
<?xml version="1.0" encoding="UTF-8"?>
<GenAlertsReq confirmed="true"
invokeId="1"><AlertList><Alert><Name>ja_alert-

However, when I tell it the document is EUC-JP it works...
=> uconv -f EUC-JP ~/test/xml/japan.xml
<?xml version="1.0" encoding="UTF-8"?>
<GenAlertsReq confirmed="true" invokeId=......

So, the document appears to be EUC-JP even though the Java DOM says it's
UTF-8.
-Robert
 
Soren Kuula

Hi,

Robert M. Gary said:
I'm using JRE 1.5 on Solaris Japanese (Sparc). The JVM claims its default
character set is EUC-JP
I'm seeing two strange things when using Japanese character sets...
1) If I write a program that does
System.out.println("$^%$%^^"); // assume those are Japanese characters that are multibyte under EUC-JP
the resulting output looks NOTHING like the characters I typed in.
Apparently the character set being used to read the literal is different
from the default.

1) Find out which encoding your Java source editor uses when it saves your
source files. Check your result.

2) Compile with javac -encoding, passing that encoding.

Robert M. Gary said:
2) If I create an XML document using the built in DOM which contains
elements with values in Japanese, I get strangeness when I transform that
into an XML document. If I do not set the character set in the transformer
the document will say its in UTF-8 (the XML header will). However, the
actual document is NOT UTF-8. I downloaded IBM's ICU character set utilities
(it knows nothing of XML, just character sets) and when I try to read the
document when telling uconv it is UTF-8 it claims it is invalid UTF-8.
However, if I try to read it telling it the document is EUC-JP it says its
good.

How do you serialize your DOMs? I guess you will have
UTF-8-decode(EUC-JP-encode(UTF-8-decode(EUC-JP-encode(literals))))
if you edit in EUC-JP, compile as UTF-8 and run your data through a
Writer that takes the platform default encoding... that's a mess :)

Check that you override the platform default encoding and really go
UTF-8 when you serialize.
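
One way to be sure the bytes really are UTF-8 is to let the transformer write to an OutputStream rather than a Writer, so the transformer itself does the encoding. What follows is only a minimal sketch, assuming Java 7 or later; the class name Utf8Serialize and the helper sample() are made up for illustration and are not from the original post.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class Utf8Serialize {
    // Serializes the DOM straight to bytes; the transformer picks the
    // byte encoding from OutputKeys.ENCODING, not from the platform default.
    static byte[] toUtf8Bytes(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    // Hypothetical sample document with hiragana text, built in memory.
    static Document sample() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element name = doc.createElement("Name");
        name.setTextContent("ja_mo-\u3042\u3048\u3044\u304a\u3046");
        doc.appendChild(name);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = toUtf8Bytes(sample());
        // Decode explicitly as UTF-8 to inspect the result.
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

With a StringWriter as the result, the transformer only produces characters; the byte encoding is then decided by whatever later turns that String into bytes.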
Also, when I change the transformer to use EUC-JP it creates the same
document bit-for-bit (other than changing the XML header to say EUC-JP).

Problem is where you serialize the document, not where you construct,
modify or transform it. And possibly in the decoding (by javac) of your
program text literals.
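
The javac part can be imitated in a few lines: encode a literal the way the editor saved it, then decode it the way the compiler might read it. This is only a sketch, assuming Java 7 or later and a JDK that ships the EUC-JP charset; SourceEncodingDemo and decodeAs are illustrative names, and ISO-8859-1 is just an example of a wrong charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SourceEncodingDemo {
    // Simulates a compiler reading source bytes with a charset that may
    // differ from the one the editor used when saving the file.
    static String decodeAs(String text, Charset savedAs, Charset readAs) {
        byte[] onDisk = text.getBytes(savedAs); // what the editor wrote
        return new String(onDisk, readAs);      // what the compiler sees
    }

    public static void main(String[] args) {
        Charset eucJp = Charset.forName("EUC-JP");
        String literal = "\u3042\u3044\u3046"; // three hiragana characters

        // Same charset on both sides: the literal survives.
        System.out.println(literal.equals(decodeAs(literal, eucJp, eucJp)));

        // Saved as EUC-JP but read as ISO-8859-1: every two-byte character
        // turns into two unrelated Latin-1 characters.
        String garbled = decodeAs(literal, eucJp, StandardCharsets.ISO_8859_1);
        System.out.println(literal.equals(garbled));
        System.out.println(garbled.length());
    }
}
```

Passing the editor's encoding to javac -encoding is what makes the two sides agree.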
Other character sets (UTC, etc) result in a different document.

Probably the document is read in correctly .. anything else than unicode
and EUC will not be able to contain all the Japanese, and will bust.
So, my conclusion is that by default the XML DOM says its UTF-8 in the
header, but ALWAYS uses the platform default unless you specify something
else (UTC for example).

I'm pretty sure the error is where you output the data (you haven't
shown it..)
Has anyone else seen this??

All the time...
Document new_document = documentBuilder.parse("japan2.xml");

Verify until you are bloody sure what the encoding is of your input
document, and that it really matches with what the header says.
I think a mismatch will not result in an exception or anything, only bad
contents...
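
If you want to do that check from Java instead of uconv, a strict CharsetDecoder can tell you whether a byte sequence is valid in a given encoding. A rough sketch follows, assuming a JDK that ships the EUC-JP charset; EncodingCheck and isValid are names invented here, and the sample bytes are built in memory rather than read from japan2.xml.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class EncodingCheck {
    // Returns true only if every byte sequence in data is legal in the
    // named charset; REPORT makes the decoder throw instead of silently
    // substituting replacement characters.
    static boolean isValid(byte[] data, String charsetName) {
        try {
            Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Stand-in for file contents: ASCII plus two hiragana in EUC-JP.
        byte[] eucBytes = "ja_mo-\u3042\u3048".getBytes(Charset.forName("EUC-JP"));
        System.out.println("valid UTF-8?  " + isValid(eucBytes, "UTF-8"));
        System.out.println("valid EUC-JP? " + isValid(eucBytes, "EUC-JP"));
    }
}
```

A pass only means the bytes are legal in that encoding, not that it is the encoding the author intended, so treat it as a sanity check.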
System.out.println("I just read japan2.xml");
DOMSource new_source = new DOMSource(new_document);
StringWriter new_writer = new StringWriter();
StreamResult new_result = new StreamResult(new_writer);
Properties p = transformer.getOutputProperties();
//try explicit EUC
//p.setProperty(OutputKeys.ENCODING, "EUC-JP");

//try default (EUC)
//p.setProperty(OutputKeys.ENCODING,
// new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding());

//try UTF explicitly
//p.setProperty(OutputKeys.ENCODING, "UTF-8");

transformer.setOutputProperties(p);
Properties p2 = transformer.getOutputProperties();
p2.list(System.out);

transformer.transform(new_source, new_result);

String new_text_doc = new_writer.toString();
System.out.println("XML doc is "+new_text_doc );

Please show us how it got into that file.
Resulting document...
XML doc is <?xml version="1.0" encoding="UTF-8"?><GenAlertsReq
confirmed="true"
....

Soren
 
Soren Kuula

Hi, Robert and myself,
Soren said:
Problem is where you serialize the document, not where you construct,
modify or transform it. And possibly in the decoding (by javac) of your
program text literals.


Probably the document is read in correctly .. anything else than unicode
and EUC will not be able to contain all the Japanese, and will bust.

Sorry, I misunderstood you there .. you mean, the OUTput is identical
except for the header?

I would take that as an indication that whatever you use for serializing
the DOM to a byte sequence (file) does not look at what you set the
transformer to use. You will have to control that elsewhere.

Are you by any chance instantiating your own Writers when serializing?
Have you tried giving them different encoding settings?
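
To illustrate what I mean: once the XML is a String, the Writer that turns it into bytes decides the encoding, whatever the header says. A small sketch, assuming a JDK that ships the EUC-JP charset; WriterEncodingDemo and writeVia are made-up names, and the XML string is a shortened stand-in for your document.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.Arrays;

public class WriterEncodingDemo {
    // Writes the same characters through a Writer bound to the given
    // charset and returns the bytes that actually come out.
    static byte[] writeVia(String xml, String charsetName) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Writer w = new OutputStreamWriter(out, charsetName);
        w.write(xml);
        w.close();
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // The header claims UTF-8 no matter which Writer is used below.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><Name>\u3042</Name>";

        byte[] asEuc = writeVia(xml, "EUC-JP");
        byte[] asUtf8 = writeVia(xml, "UTF-8");

        // Same characters, different bytes: only the UTF-8 version
        // matches what the header promises.
        System.out.println("EUC-JP bytes: " + asEuc.length);
        System.out.println("UTF-8 bytes:  " + asUtf8.length);
        System.out.println("identical?    " + Arrays.equals(asEuc, asUtf8));
    }
}
```

A Writer created without an explicit charset (FileWriter, or System.out on your box) would behave like the EUC-JP case on a platform whose default is EUC-JP.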

Soren
 
