Robert M. Gary
I'm using JRE 1.5 on Solaris Japanese (Sparc). The JVM claims its default character set is EUC-JP.
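(For reference, a quick way to double-check the default; Charset.defaultCharset() is new in 1.5:)

// Both of these report EUC-JP on this machine
System.out.println(java.nio.charset.Charset.defaultCharset().name());
System.out.println(System.getProperty("file.encoding"));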
I'm seeing two strange things when using Japanese character sets...
1) If I write a program that does
System.out.println("$^%$%^^"); // assume those are Japanese characters that are multibyte under EUC-JP
The resulting output looks NOTHING like the characters I typed in. Apparently the character set being used to read the literal is different from the default.
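A minimal sketch of where the source charset comes into play (the file name Moji.java is made up): javac's -encoding flag controls how the compiler decodes the literal, and it falls back to the platform default when omitted.

// Moji.java, saved to disk as EUC-JP bytes
public class Moji {
    public static void main(String[] args) {
        // If javac decodes this file with the wrong charset, the literal is
        // corrupted at compile time, before the program ever runs.
        System.out.println("\u65e5\u672c\u8a9e"); // escapes like these are charset-proof
    }
}
// javac -encoding EUC-JP Moji.java   (decode the source as EUC-JP)
// javac Moji.java                    (fall back to the platform default)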
2) If I create an XML document using the built-in DOM which contains elements with Japanese values, I get strangeness when I transform it into an XML document. If I do not set the character set in the transformer, the document will say it's in UTF-8 (in the XML header). However, the actual document is NOT UTF-8. I downloaded IBM's ICU character set utilities (they know nothing of XML, just character sets), and when I tell uconv the document is UTF-8, it claims the document is invalid UTF-8. However, if I tell it the document is EUC-JP, it says it's good.
Also, when I change the transformer to use EUC-JP, it creates the same document bit-for-bit (other than changing the XML header to say EUC-JP). Other character sets (UTF-16, etc.) result in a different document.
So, my conclusion is that by default the XML DOM says UTF-8 in the header, but ALWAYS uses the platform default unless you specify something else (UTF-16, for example).
Has anyone else seen this??
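For comparison, here is a minimal sketch of the two kinds of StreamResult targets (transformer and new_document as in the code below; the file name out.xml is made up). My understanding is that with a Writer the transformer produces chars, so OutputKeys.ENCODING only affects the header, while with an OutputStream the serializer itself encodes the bytes:

// Target 1: chars. Whoever later turns this String into bytes picks the
// real encoding; the header's "UTF-8" claim is not enforced.
StringWriter sw = new StringWriter();
transformer.transform(new DOMSource(new_document), new StreamResult(sw));

// Target 2: bytes. The serializer encodes, honoring OutputKeys.ENCODING,
// so the header and the actual bytes agree.
FileOutputStream os = new FileOutputStream("out.xml");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(new DOMSource(new_document), new StreamResult(os));
os.close();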
Here is my transformer...
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import java.util.Properties;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// standard JAXP setup (exception handling omitted)
DocumentBuilder documentBuilder =
    DocumentBuilderFactory.newInstance().newDocumentBuilder();
Transformer transformer = TransformerFactory.newInstance().newTransformer();

Document new_document = documentBuilder.parse("japan2.xml");
System.out.println("I just read japan2.xml");
DOMSource new_source = new DOMSource(new_document);
StringWriter new_writer = new StringWriter();
StreamResult new_result = new StreamResult(new_writer);
Properties p = transformer.getOutputProperties();
// try explicit EUC-JP
//p.setProperty(OutputKeys.ENCODING, "EUC-JP");
// try the platform default (EUC-JP)
//p.setProperty(OutputKeys.ENCODING,
//    new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding());
// try UTF-8 explicitly
//p.setProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setOutputProperties(p);
Properties p2 = transformer.getOutputProperties();
p2.list(System.out);
transformer.transform(new_source, new_result);
String new_text_doc = new_writer.toString();
System.out.println("XML doc is " + new_text_doc);
Resulting document...
XML doc is <?xml version="1.0" encoding="UTF-8"?><GenAlertsReq
confirmed="true"
invokeId="2"><AlertList><Alert><Name>ja_alert-¤È¤Á¤Ä¤Ê¤Î¤Ë</Name><AffectedObjects
type="Obj"><Obj><Name>ja_mo-¤¢¤¨¤¤¤ª¤¦</Name></Obj></AffectedObjects><Properties><Property><Name>Severity</Name><Value>major</Value></Property><Property><Name>Manager</Name><Value>NetExpert</Value></Property></Properties></Alert></AlertList><AttrList><Attr
name="TOD"><Int32>1112980583</Int32></Attr><Attr
name="DMPAlarmObject"><Str>ja_mo-¤¢¤¨¤¤¤ª¤¦</Str></Attr><Attr
name="CLASS"><Str>NetExpert</Str></Attr><Attr
name="MANAGER"><Str>NetExpert</Str></Attr><Attr
name="DMPAlarmName"><Str>ja_alert-¤È¤Á¤Ä¤Ê¤Î¤Ë</Str></Attr><Attr
name="ARCHIVE_LENGTH"><Int32>0</Int32></Attr><Attr
name="DMPAlarmSeverity"><Str>major</Str></Attr><Attr
name="MsgType"><Str>Alarm</Str></Attr><Attr
name="MGR_PORT_KEY"><Int32>93</Int32></Attr><Attr
name="ARCHIVE_OFFSET"><Int32>0</Int32></Attr></AttrList></GenAlertsReq>
When I try to read it using IBM's ICU character set tool uconv, I get the following...
=> uconv -f UTF-8 ~/test/xml/japan.xml
Conversion to Unicode from codepage failed at input byte position 116.
Bytes: a4 Error: Illegal character found
<?xml version="1.0" encoding="UTF-8"?>
<GenAlertsReq confirmed="true"
invokeId="1"><AlertList><Alert><Name>ja_alert-
However, when I tell it the document is EUC-JP, it works...
=> uconv -f EUC-JP ~/test/xml/japan.xml
<?xml version="1.0" encoding="UTF-8"?>
<GenAlertsReq confirmed="true" invokeId=......
So, the document appears to be EUC-JP even though the Java DOM says it's UTF-8.
-Robert