change ISO8859-1 to GB2312

moonhkt

Hi All

Our database code page is ISO8859-1, but some of the data was entered as
GB2312. When we export the data, it is written out as ISO8859-1 even though
it contains GB2312 bytes. Is it possible to convert it from ISO8859-1 to
GB2312?

The machine runs AIX.


I tried the code below, but it does not work.

import java.nio.charset.Charset ;
import java.io.*;
import java.lang.String;
public class read_iso {
    public static void main(String[] args) {
        File aFile = new File("abc.txt");
        try {
            String str = "";
            BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(aFile),
                    "iso8859-1"));

            while (( str = in.readLine()) != null )
            {
                System.out.println(str);
                System.out.println(new String (str.getBytes("iso8859-1")));
                System.out.println(new String
                    (str.getBytes("iso-8859-1"),"GB2312")); /* not */
            }
        } catch (UnsupportedEncodingException e) {
        } catch (IOException e) {
        }

    }
}
 
Lew

Our database codepage is iso8859-1. Some data input with GB2312 data.
When export data to iso8859-1 format with GB2312 data, Is it possible
to change iso8859-1 to GB2312 format ?

Machine AIX.


I try below coding not work.

import java.nio.charset.Charset ;
import java.io.*;
import java.lang.String;
public class read_iso {

You should follow the Java naming conventions.
public static void main(String[] args) {
File aFile = new File("abc.txt");
try {

.... and indentation conventions.
String str = "";

And not initialize to values that are never used, only discarded.
BufferedReader in = new BufferedReader(
new InputStreamReader(new FileInputStream(aFile),
"iso8859-1"));

while (( str = in.readLine()) != null )
{
System.out.println(str);
System.out.println(new String (str.getBytes("iso8859-1")));

Didn't you say the data was input in GB2312 encoding?

Whatever, this constructs a string using the platform native encoding from
bytes encoded using ISO-8859-1. If that isn't the native encoding, you got
worries.
System.out.println(new String
(str.getBytes("iso-8859-1"),"GB2312")); /* not */

Now you're decoding bytes using GB2312 from bytes encoded using ISO-8859-1.
That can't work.

System.out always uses the platform default string encoding.
}
} catch (UnsupportedEncodingException e) {
} catch (IOException e) {
}

Don't silently eat exceptions.

My approach to the encoding would be a lot more straightforward. None of this
wacky "new String()" stuff.

<sscce source="eegee/FooCoder.java">
package eegee;

import java.io.*;
import org.apache.log4j.Logger;
import static org.apache.log4j.Logger.getLogger;

public class FooCoder
{
  private transient final Logger logger = getLogger( FooCoder.class );

  public static void main( String[] args )
  {
    new FooCoder().recode();
  }

  public void recode()
  {
    final BufferedReader rin;
    final BufferedWriter owt;
    try
    {
      rin = new BufferedReader( new InputStreamReader(
        getClass().getResourceAsStream( "temp.txt" ),
        "ISO-8859-1" ));
      owt = new BufferedWriter( new OutputStreamWriter(
        System.out, "GB2312" ));
    }
    catch ( IOException exc )
    {
      logger.error( exc );
      return;
    }
    try
    {
      for ( String str; (str = rin.readLine()) != null; )
      {
        owt.write( str );
        owt.newLine();
      }
      owt.flush();
    }
    catch ( IOException exc )
    {
      logger.error( exc );
    }
    finally
    {
      try
      {
        rin.close();
        owt.close();
      }
      catch ( IOException exc )
      {
        logger.error( exc );
      }
    }
  }
}
</sscce>
 
moonhkt

Hi Lew,
Thanks a lot.
How can I check the platform's native encoding?
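
On checking the platform's native encoding from Java, a minimal sketch follows. The class name ShowDefaultEncoding is hypothetical; Charset.defaultCharset() and the file.encoding system property are standard, but the values they report depend on the JVM version and the locale settings of the machine.

```java
import java.nio.charset.Charset;

class ShowDefaultEncoding {
    /** Returns the charset the JVM falls back to when no encoding is named explicitly. */
    static Charset platformDefault() {
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        // The charset used by calls such as new String(bytes) or new FileReader(file)
        System.out.println("Default charset: " + platformDefault().name());
        // The system property the default is normally derived from
        System.out.println("file.encoding  : " + System.getProperty("file.encoding"));
    }
}
```

Whatever this prints, naming the encoding explicitly on every Reader and Writer avoids depending on it.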

I changed your code as shown below. My test file now converts to UTF-8;
viewed in Reflection with UTF-8 emulation, the characters display correctly.
Viewed in IE, the characters also display correctly.

temp.txt file
| 10 TEST1 |测试1
| |
| 11 TEST2 |测试2
| |
| 12 TEST3 |测试3
| |
| 13 TEST4 |测试4
| |
| 14 TEST5 |测试5
| |


import java.io.*;
public class conv_ig
{
  public static void main( String[] args )
  {
    new conv_ig().recode();
  }
  public void recode()
  {
    final BufferedReader rin;
    final BufferedWriter owt;
    try
    {
      rin = new BufferedReader( new InputStreamReader(
        /* getClass().getResourceAsStream( "temp.txt" ),
        "ISO-8859-1" ));
        owt = new BufferedWriter( new OutputStreamWriter(System.out,
        "GB2312" ));
        */
        getClass().getResourceAsStream( "temp.txt" ),"GB2312" ));
      owt = new BufferedWriter( new OutputStreamWriter(
        System.out, "UTF-8" ));
    }
    catch ( IOException exc )
    {
      /* logger.error( exc ); */
      return;
    }
    try
    {
      for ( String str; (str = rin.readLine()) != null; )
      {
        owt.write( str );
        owt.newLine();
      }
      owt.flush();
    }
    catch ( IOException exc )
    {
      /* logger.error( exc ); */
    }
    finally
    {
      try
      {
        rin.close();
        owt.close();
      }
      catch ( IOException exc )
      {
        /* logger.error( exc ); */
      }
    }
  }
}
 
Lew

moonhkt said:
Change your code as below. My test file can conv to UTF-8, view in
Reflection UTF-8 Emulation, the font is ok.

What is "Reflection UTF-8"?

Not a bad job there, but I have to wonder why you ruined the indentation and
still are flouting the naming conventions. Code should be readable.

Also, it is exceedingly bad that you eliminated logging. You should keep the
logging. Switch to java.util.logging if you don't like log4j or don't care to
add the JAR, but for Pete's sake keep the logging. Yikes.
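
As an illustration of keeping the logging without adding a JAR, here is a minimal sketch using java.util.logging. The class LoggedReader and its countLines method are hypothetical names chosen for this example; only Logger.getLogger and Logger.log(Level, String, Throwable) come from the JDK.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.logging.Level;
import java.util.logging.Logger;

class LoggedReader {
    private static final Logger LOGGER = Logger.getLogger(LoggedReader.class.getName());

    /** Counts lines, logging (rather than swallowing) any I/O failure. Returns -1 on error. */
    static int countLines(BufferedReader reader) {
        int count = 0;
        try {
            while (reader.readLine() != null) {
                count++;
            }
        } catch (IOException exc) {
            // The exception is recorded with its stack trace instead of being discarded
            LOGGER.log(Level.SEVERE, "Could not read input", exc);
            return -1;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countLines(new BufferedReader(new StringReader("a\nb\n"))));
    }
}
```

By default java.util.logging writes to the console, so a failed conversion at least leaves a trace.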

Here's a pop quiz for you - given that few code examples I've seen use the
idiom I did of a separate try block for opening the Reader and Writer from the
one for using them, why do you think I bothered?

Is it better or worse than the common idiom, or simply a matter of style and
more power to you for whichever?
 
moonhkt

Sorry about this. It was a quick and dirty way to test the code. Reflection
is Telnet software; I used its UTF-8 emulation to check the string
encoding.
I will look into how to use java.util.logging.

Can you give an example of where I "ruined the indentation"? And what
about the naming conventions?
 
Lew

moonhkt said:
public class conv_ig
{
public static void main( String[] args )
{
new conv_ig().recode();
}
public void recode()
{
....

Please do not quote sigs.
Sorry about this. This is dirty method to test the code. Reflection
is Telnet software using UTF-8 Emulation to check the the string
encoding.

Oh, THAT Reflection.
I will check How to using java.util.logging .

Can you give some example where "ruined the indentation " ? and what
about the the naming conventions ?

I apologize about the indentation comment - apparently I was seeing an
artifact of word wrap imposed by the posting software and not something that
you did.

As for the naming conventions:
<http://java.sun.com/docs/codeconv/index.html>

You named the class conv_ig.
The convention is to name a class with an initial upper-case letter and camel
case (mixed case, first letter of each word within the compound capitalized
and the rest lower-case), as explained in the Java Code Conventions document.

Methods and non-constant variables (or, more conventionally, non-final
variables) begin with a lower-case letter and are otherwise in camel case.

Underscores should only be used in names that comprise all upper-case letters,
namely those of constant (or more conventionally, final) member variables.
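
The conventions above might look like this in practice. This is only an illustrative sketch; EncodingConverter and all of its members are invented names, not code from this thread.

```java
// Class name: initial upper-case letter, camel case (EncodingConverter, not conv_ig)
class EncodingConverter {
    // Constant (static final) member: all upper case, words separated by underscores
    static final String DEFAULT_SOURCE_ENCODING = "GB2312";

    // Non-constant variable: initial lower-case letter, camel case
    private final String targetEncoding;

    EncodingConverter(String targetEncoding) {
        this.targetEncoding = targetEncoding;
    }

    // Method name: initial lower-case letter, camel case
    String describeConversion() {
        return DEFAULT_SOURCE_ENCODING + " -> " + targetEncoding;
    }

    public static void main(String[] args) {
        System.out.println(new EncodingConverter("UTF-8").describeConversion());
    }
}
```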
 
RedGrittyBrick

Sorry about this. This is dirty method to test the code. Reflection
is Telnet software using UTF-8 Emulation to check the the string
encoding.

There's much wrong in the above.

Reflection is a *terminal-emulator* marketed by Attachmate (who
presumably absorbed WRQ, its original developers).

Reflection does not *emulate* UTF-8, Reflection handles several
character encodings amongst which is UTF-8. Reflection doesn't *check*
the encoding (AFAIK), it just *uses* the configured encoding to
determine which glyph to display for a received byte sequence.

What Reflection *does* emulate is a variety of serial character-mode
terminals such as VT220, Wyse-50 and varieties of ANSI "standard" terminals.

Telnet is only one of several application layers supported by Reflection
for host communication, though I suppose it is the principal one. FTP
and SSH are others.
 
moonhkt

Hi All,
Thanks for explaining how Reflection works.

Our database is in ISO8859-1 format but holds some GB2312 and other
non-ISO8859-1 data. Now we want to print the GB2312 text on work order
routings. We are planning to purchase a Chinese line printer for printing
GB2312. The line printer can print the file directly under UNIX. Why does
the output file not need to be converted to GB2312 before printing?
Any suggestions? And can a Java conversion program convert my output to
UTF-8?

moonhkt
 
moonhkt

You don't provide any details so I can only guess. My guess is that the
Database thinks it has (for example) six European letters when in fact
it has three Chinese characters. The database is happy to store and
retrieve the bytes sequences that would, under 8859-1 encoding represent
six European letters. When the retrieved byte sequences are sent to the
printer, because the printer is configured to use the GB2312 encoding,
it interprets those same byte sequences, not as six European letters but
as three Chinese characters.
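
The effect described above can be seen in a small sketch: the same stored bytes are read once as ISO-8859-1 and once as GB2312. It assumes the JVM ships the GB2312 charset (standard JDK builds normally do); the class and method names are hypothetical, and the sample text is the "test" word from this thread, written as Unicode escapes (U+6D4B U+8BD5) so the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SameBytesTwoReadings {
    static final Charset GB2312 = Charset.forName("GB2312");

    /** Decodes the given bytes with the given charset; the bytes themselves are untouched. */
    static String readAs(byte[] stored, Charset charset) {
        return new String(stored, charset);
    }

    /** Formats bytes as space-separated two-digit hex values. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Two Chinese characters, encoded the way the input program stored them
        byte[] stored = "\u6d4b\u8bd5".getBytes(GB2312);
        System.out.println("stored bytes       : " + hex(stored));
        // The database's view: one Latin-1 character per byte
        System.out.println("chars as ISO-8859-1: " + readAs(stored, StandardCharsets.ISO_8859_1).length());
        // The printer's view: one Chinese character per byte pair
        System.out.println("chars as GB2312    : " + readAs(stored, GB2312).length());
    }
}
```

Nothing is converted at any point; only the interpretation of the bytes changes.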

On the other hand, so far as I know, Unix/Linux printing systems like
CUPS allow you to specify a character encoding as an option to commands
like lp. They also pick them up from the locale (see environment
variables). This allows CUPS to do whatever is needed to print those
characters correctly.


I'm sure it can. If a Java program knows what encodings are to be used
for data input and data output then the standard classes allow you to
handle data correctly*. How that would help in your situation I don't
know. If your database thinks it is handing 8859-1 encoded European
characters to your Java program when in fact some of that needs to be
interpreted as GB2312 then I expect you will have to do something ugly
in Java. UTF-8 is, in general, a good thing. Configuring your database,
your programs, your locale and your printer for UTF-8 might well be a
good thing to do.

Hi All,
Today our printer vendor suggested that we provide Hanzi EBCDIC code for
testing Chinese printing, because IBM hosts all support Hanzi EBCDIC code.
How can I convert GB2312/UTF-8 to EBCDIC?

I tried cp1047 on cp1838; all the ASCII codes come out as before. I used
diff to compare the outputs and check the differences.
 
RedGrittyBrick

Hi All
Today, Our printer vendor suggest us provide Hanzi EBCDIC code for
testing Chinease printing.
Due to IBM Hosts All support Hanzi EBCDIC code.

You have an IBM System z?

Throwing EBCDIC code-pages into the mix with 8859-1, GB2312 and UTF-8
seems to me to be making your life more complex when you need to make it
simpler. Still, presumably your printer vendor's salesman has your best
interests at heart.

How to Convert GB2312/UTF-8 to EBCDID

I try cp1047 on cp1838, All ASCII code like before. By compare using
diff to check the different.

What JCL did you use to run diff?
 
moonhkt

Hi All,

Our system is a P630.
No. Suppose there are just two character sets in the file: ISO8859-1/GB2312,
to be converted to UTF-8 or EBCDIC.
I compare the different outputs using the UNIX diff command.

moonhkt
 
moonhkt

Your task can be broken down into three elements:
1) Read ISO-8859-1 encoded text from database.
2) Convert incorrectly encoded text back into Unicode UTF-16
3) Convert UTF-16 to UTF-8 (or EBCDIC)

For the first part, Your JDBC drivers should provide a way to make sure
the correct encoding conversion is performed so that whatever encoding
the database is using is known to the driver and it can convert text to
the UTF-16 encoding used by Java. See your DBMS documentation.

The second part is tricky. Your database thinks the GB2312 data is
ISO-8859-1 (because you lied to it). Now Java is under the same illusion
and has done the arithmetic that would normally convert from ISO-8859-1
to Unicode/UTF-16. This has probably made an unholy mess of the GB2312
data. You have to reverse this. It's late, I'm tired and I just don't
care enough at the moment to think about how this would be done. (later)
I think I would use java.lang.String's methods to convert to byte[]
using ISO-8859-1 conversion then restore to String form using GB2312
conversion. I'm assuming the GB2312 data pretending to be ISO-8859-1 is
in a separate field in a table and hence in a separate
ResultSet.getString() result. If not ... oh dear.
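
The second step, as described, could be sketched like this. It is an illustration under assumptions, not tested against a real database: it assumes the text really was decoded with ISO-8859-1 (not, say, windows-1252, which does not map every byte), that nothing altered the bytes on the way, and that the JVM provides the GB2312 charset. MislabelledTextRepair and repair are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MislabelledTextRepair {
    static final Charset GB2312 = Charset.forName("GB2312");

    /**
     * Undoes a wrong ISO-8859-1 decode: encoding with ISO-8859-1 recovers the
     * original bytes one for one, and those bytes are then decoded as GB2312.
     */
    static String repair(String wronglyDecoded) {
        byte[] originalBytes = wronglyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, GB2312);
    }

    public static void main(String[] args) {
        String intended = "\u6d4b\u8bd5" + "1";
        // Simulate what a driver would hand back for GB2312 bytes stored in an ISO-8859-1 column
        String fromDatabase = new String(intended.getBytes(GB2312), StandardCharsets.ISO_8859_1);
        String repaired = repair(fromDatabase);
        System.out.println("repaired correctly: " + repaired.equals(intended));
    }
}
```

The repaired String is ordinary Unicode text, so it can then be written out in any encoding that contains its characters.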

The last part is easy - see below. I just output some GB2312 characters
using EUC-CN encoding into a HTML file because my web-browser, Firefox,
understands GB2312 - it's a convenient way to check the correctness of
the conversion. You want UTF-8 or EBCDIC not GB2312 but the principle is
the same.

-------------------------------8<------------------------------
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;

public class TestGB2312 {

   public static void main(String[] args) {
     /*
      * Note: The fun characters are specified as Unicode escapes.
      * We later get Java to convert to GB2312 in EUC_CN encoding.
      */
     String data = "<html><head><meta charset=\"gb2312\"></head><body>"
           + "<p>Character set:GB2312</p>" + "<p>Encoding: EUC_CN</p>"
           + "<p>Roman Numerals: \u2160\u2161\u2162\u2163</p>"
           + "<p>Han (Numerals): \u3220\u3221\u3222\u3223</p>"
           + "</body></html>";

     writeFileAsGB2312("GB2312.html", data);
   }

   private static void writeFileAsGB2312(String fileName, String data) {
     PrintWriter pw;
     try {
       pw = new PrintWriter(fileName, "GB2312");
       pw.println(data);
       pw.close();
     } catch (FileNotFoundException e) {
       e.printStackTrace();
     } catch (UnsupportedEncodingException e) {
       e.printStackTrace();
     }
   }

}

-------------------------------8<------------------------------

Where I've got "GB2312" and "gb2312" you might want "UTF-8" and "utf8".
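
The same principle, generalised to any pair of encodings, might look like the sketch below. Transcoder and transcode are hypothetical names. As far as I know, Files.newBufferedReader and Files.newBufferedWriter report malformed or unmappable input with an exception instead of silently substituting a placeholder, which is useful here. For EBCDIC you would pass whatever EBCDIC charset name your JVM supports; which names are available is something to check with Charset.availableCharsets(), not something this sketch guarantees.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class Transcoder {
    /** Copies source to target, decoding with 'from' and re-encoding with 'to'. */
    static void transcode(Path source, Charset from, Path target, Charset to) throws IOException {
        // try-with-resources closes both streams even if the copy fails part way
        try (BufferedReader in = Files.newBufferedReader(source, from);
             BufferedWriter out = Files.newBufferedWriter(target, to)) {
            char[] buffer = new char[8192];
            for (int n; (n = in.read(buffer)) != -1; ) {
                out.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 4) {
            System.err.println("usage: Transcoder <inFile> <fromCharset> <outFile> <toCharset>");
            return;
        }
        transcode(Paths.get(args[0]), Charset.forName(args[1]),
                  Paths.get(args[2]), Charset.forName(args[3]));
    }
}
```

For example, "java Transcoder temp.txt GB2312 temp.utf8 UTF-8" would rewrite a GB2312 file as UTF-8.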

See
<http://java.sun.com/javase/6/docs/technotes/guides/intl/encoding.doc....>

I imagine you knew all the above and were hoping for help with the part
which I numbered 2.

Thanks. I am not testing with JDBC.
I tried converting a GB2312 file to UTF-8 and then to BIG5.

10 TEST1 |测试1
11 TEST2 |测试2
13 TEST4 |测试4

It converts to UTF-8.

When converting from UTF-8 to BIG5, it does not work. Do you know why?

Checked with IE, the BIG5 characters are displayed as "?".


import java.io.*;
public class Conv_cp
{
  public static void help ()
  {
    System.out.println("Missing parameter");
    System.out.println("1- Input file name ");
    System.out.println("2- FromCode ");
    System.out.println("3- ToCode ");
    System.exit(0);
  }
  public static void main( String[] args )
  {
    if ( args.length < 3 ) {
      help ();
    }
    new Conv_cp().recode(args[0] , args[1] , args[2] );
  }

  public void recode(String fnin, String cpf , String cpt)
  {
    final BufferedReader rin;
    final BufferedWriter owt;
    try
    {
      rin = new BufferedReader( new InputStreamReader(
        /* getClass().getResourceAsStream( "temp.txt" ),
        "ISO-8859-1" ));
        owt = new BufferedWriter( new
        OutputStreamWriter(System.out, "GB2312" ));
        */
        getClass().getResourceAsStream( fnin ),cpf ));
      owt = new BufferedWriter( new OutputStreamWriter(
        System.out, cpt ));
    }
    catch ( IOException exc )
    {
      /* logger.error( exc ); */
      return;
    }
    try
    {
      for ( String str; (str = rin.readLine()) != null; )
      {
        owt.write( str );
        owt.newLine();
      }
      owt.flush();
    }
    catch ( IOException exc )
    {
      /* logger.error( exc ); */
    }
    finally
    {
      try
      {
        rin.close();
        owt.close();
      }
      catch ( IOException exc )
      {
        /* logger.error( exc ); */
      }
    }
  }
}
 
RedGrittyBrick

Thank [you]. I am not testing [with] JDBC.

When you wrote "Our database is ISO8859-1 format with some GB2312 and
other non ISO8859-1 data." I got the impression that a DBMS was
involved. If you were using Hibernate or some other framework rather
than JDBC, the same principles would apply.

But tired to GB2312 file , to UTF-8 then BIG5

BIG5! Another character set and encoding! I think that makes seven
you've mentioned in this thread! Any more?

10 TEST1 |测试1
11 TEST2 |测试2
13 TEST4 |测试4

[the program below] can conv[ert a file containing the above data] to UTF-8

When [it] conv[erts from] UTF-8 to BIG5, [it] can not [successfully convert
all characters]. Do you know why ?

You are ignoring exceptions. Exceptions might be telling you something
you really need to know about. Don't ignore exceptions.

I'm not familiar with GB2312 and Big5 but I expect that there are
characters in GB2312 that are not in Big5. It is almost certain.

GB2312 originated in the People's Republic of China, where simplified
Chinese characters were mandatory. I think this policy has been relaxed now.

I suspect Big5 originated in either the British colony of Hong Kong or
in the Republic of China (Taiwan/Formosa). In both these places,
Traditional Chinese characters were (and still are) used.

Whether the conversion from GB2312 to UTF-16 and then to Big5 can
convert a simplified character to a traditional counterpart is unknown
to me. Perhaps this causes conversion problems?

Checked [the resulting file] with IE, the BIG5 code is [displayed as] "?"


You have to tell IE what encoding to use to display the file. That was
why I wrote HTML markup containing <meta charset="gb2312">. You can
probably force an encoding using a menu option in IE. You certainly can
in Firefox.

If IE does not have access to a font containing the required glyph, it
will display a placeholder character. I don't use IE much so I'm not
certain what the placeholder IE displays, a small box, a question-mark
or something else.

If Java writes a character that is not present in the specified output
character set then I expect it might also substitute a placeholder
character.
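
Whether a given character exists in the output character set can be checked before writing. The sketch below is an illustration only; it assumes the JVM provides a "Big5" charset, and UnmappableCheck, fitsIn and encodeStrictly are hypothetical names. The Chinese text is written as Unicode escapes: U+6D4B U+8BD5 is the simplified form of "test" and U+6E2C U+8A66 the traditional form.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

class UnmappableCheck {
    /** True if every character of text can be represented in the target charset. */
    static boolean fitsIn(String text, Charset target) {
        return target.newEncoder().canEncode(text);
    }

    /** Encodes text, throwing instead of substituting a placeholder for missing characters. */
    static byte[] encodeStrictly(String text, Charset target) throws CharacterCodingException {
        CharsetEncoder encoder = target.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer encoded = encoder.encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[encoded.remaining()];
        encoded.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        Charset big5 = Charset.forName("Big5");
        System.out.println("simplified fits in Big5 : " + fitsIn("\u6d4b\u8bd5", big5));
        System.out.println("traditional fits in Big5: " + fitsIn("\u6e2c\u8a66", big5));
        try {
            encodeStrictly("\u6d4b\u8bd5", big5);
        } catch (CharacterCodingException exc) {
            System.out.println("strict encoding refused : " + exc);
        }
    }
}
```

By contrast, String.getBytes and a plain OutputStreamWriter substitute the charset's replacement bytes for such characters without raising an error, which would be consistent with the "?" seen in the browser.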

Also Big5 is weird, apparently it doesn't exactly encode characters, it
encodes logograms or parts of graphical characters. It also has to be
paired with a single-byte character-set that isn't specified in the Big5
standard. Also there are variants of Big5. Lots of scope for encoding
issues. Maybe Java and IE disagree about Big5 variants?
<http://en.wikipedia.org/wiki/Big5>

P.S. IE6 is old and a security hazard, I'd upgrade.
 
moonhkt

Our ISO8859-1 database (a Progress database) holds some Japanese, Korean,
Simplified Chinese and Traditional Chinese text. Text in those languages is
imported by a lookup function: for example, when a user enters "G", the
lookup program fetches "Green" in the corresponding language's character set.
I also checked another Progress database that uses GB2312: the encoded value
of "测试" ("TEST" in English) is the same as in the ISO8859-1 database. I
checked this with the UNIX tool "od -ct x1 file_name".

As for the BIG5 conversion, I was just testing how to change GB2312 to BIG5.
My boss asked me to check what the encoded value of "TEST" is in GB2312 and
in BIG5, so I want to convert to BIG5 to see the encoded value there.

I will add the exception handling back.

Thanks a lot.


moonhkt
 
RedGrittyBrick

"测试" is simplified Chinese.
"測試" is traditional Chinese.

So far as I know:
GB2312 is simplified Chinese.
Big5 is traditional Chinese.

Therefore:
You cannot write "测试" in Big5
You cannot write "測試" in GB2312

Unless I am mistaken.


One simplified Chinese character may correspond to several traditional
Chinese characters. Java cannot translate "测试" to "測試" because that
is a process that requires artistic skill, literary skill and an
understanding of the context.

I do not read, write, speak nor understand Chinese so I only offer the
above as my somewhat uninformed understanding of the situation.
 
moonhkt

Hi RGB,


"测试" is in GB2312 and "測試" is in BIG5.

My test converts GB2312 to UTF-8 (OK), then UTF-8 to BIG5. This second
conversion is not OK.
Is something missing, or is there some other reason?

"One simplified Chinese character may correspond to several traditional
Chinese characters." That may not be true.
Anyway, thanks for your help.
 
RedGrittyBrick

"测试" in GB2312 and "測試" in BIG5.

Yes. Different characters. Not the same.
My testing is Change GB2312 to UTF-8 (OK).

Yes. Because Unicode includes all characters that are in GB2312.
Then UTF-8 to BIG5, This change not OK.

No, because Big5 is a lot smaller than Unicode and does not include 测
or 试 characters*
Is some missing or other reason ?

Yes, 测 and 试 characters are missing from Big5*
One simplified Chinese character may correspond to several traditional
Chinese characters. It may not true.

It is true for some characters. For example:
台 = 臺 or 台 or 檯 or 枱 or 颱

There is a list at
<http://en.wikipedia.org/wiki/Multip...ing_Simplified_Chinese_to_Traditional_Chinese>

I suspect Java, for this reason, does not attempt to translate a
simplified Chinese character to a traditional Chinese character.



* I haven't checked because finding Chinese characters in enormous lists
is hard work for me. So I might be wrong :)
 
RedGrittyBrick

See http://www.chinesetools.eu/tools/gb2big5/chinese-convert.js
You can probably adapt this to Java.
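
A table-driven conversion in Java might be sketched as follows. This does not reproduce the linked script, whose contents are not shown in this thread. The class SimplifiedToTraditional is hypothetical and its table holds only the two characters of the test word; a real table would need thousands of entries, and a one-to-one table cannot handle simplified characters that have several traditional counterparts. Map.of requires Java 9 or later.

```java
import java.nio.charset.Charset;
import java.util.Map;

class SimplifiedToTraditional {
    // Illustrative two-entry table: U+6D4B -> U+6E2C and U+8BD5 -> U+8A66
    private static final Map<Character, Character> TABLE =
            Map.of('\u6d4b', '\u6e2c', '\u8bd5', '\u8a66');

    /** Replaces each character found in the table; all other characters pass through unchanged. */
    static String convert(String simplified) {
        StringBuilder sb = new StringBuilder(simplified.length());
        for (int i = 0; i < simplified.length(); i++) {
            char c = simplified.charAt(i);
            char mapped = TABLE.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String traditional = convert("10 TEST1 |\u6d4b\u8bd5" + "1");
        // After the table lookup, the text should be representable in Big5
        boolean encodable = Charset.forName("Big5").newEncoder().canEncode(traditional);
        System.out.println("encodable in Big5: " + encodable);
    }
}
```

The lookup has to happen on the decoded String, between reading the GB2312 input and writing the Big5 output.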
 
