New rules for literal characters in source code?

Stefan Ram

When you use a Windows-1252 editor to edit Java source and
then the Java process prints it to a Windows CP-850 console,
umlauts, like »ü«, will not be rendered correctly, because
the process will print the character »³« that has the code
in CP 850 that »ü« has in Windows 1252.

ü ---Windows 1252---> 252 ---CP 850---> ³
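
The mapping in the diagram can be reproduced in a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and it assumes the JDK at hand ships the charsets "windows-1252" and "IBM850" (Java's name for CP 850). The umlaut is written as a Unicode escape so that the encoding of this source file does not influence the result.

```java
import java.nio.charset.Charset;

public class MojibakeDemo {

    // Encodes text with one charset, then decodes the same bytes with another,
    // which is what happens when editor and console disagree.
    static String reinterpret(String text, String writtenAs, String readAs) {
        byte[] bytes = text.getBytes(Charset.forName(writtenAs));
        return new String(bytes, Charset.forName(readAs));
    }

    // Returns the unsigned value of the first byte that encodes the text.
    static int singleByteCode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName))[0] & 0xFF;
    }

    public static void main(String[] args) {
        String uUmlaut = "\u00fc"; // u-umlaut, escaped on purpose

        System.out.println(singleByteCode(uUmlaut, "windows-1252")); // 252
        System.out.println(singleByteCode(uUmlaut, "IBM850"));       // 129

        // Byte 252 written by a Windows-1252 editor, shown by a CP 850 console:
        String shown = reinterpret(uUmlaut, "windows-1252", "IBM850");
        System.out.println(shown.equals("\u00b3")); // true: superscript three
    }
}
```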

Up to Windows XP and Java 1.6, one could edit the Java
source code in a console with CP 850 (using the Windows
console program »EDIT«), thus writing the »"ü"« using CP
850. When such source code is then executed, it prints
a literal »ü« in CP 850, because the literal byte in the
source code has the value »129«, which is the code of »ü« in
CP 850. I used this in my classes as a quick way to show
that Java is able to print umlauts to the console: one
just needs to use an editor with the same codepage as the
console.

In Windows 7 with Java 1.7, this now gives an error: Java
tries to outsmart me, detects that the byte for »ü« in the
CP-850 source code does not exist in the supposed charset
Windows-1252, and gives me an error message. If I then
compile with »-encoding CP-850«, the error is gone, but
Java is too smart: it detects that the byte means »ü« in CP
850 and converts the literal byte value 129 from the source
code to the value »ü« has in Unicode. It then prints this
to the CP-850 console as if the console used Windows-1252,
neutralizing the intended effect of using the CP-850 editor
»EDIT« and giving me »³« again.
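
Both outcomes described above can be imitated outside the compiler. The following is a rough sketch, not what javac does internally: the class name and the helper strictDecode are hypothetical, and the charset names "windows-1252" and "IBM850" are assumed to be available in the JDK. It decodes the single byte 129 strictly, once under each charset, and then re-encodes the result the way a Windows-1252 default output encoding would.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class SourceByteDemo {

    // Decodes strictly: returns null instead of silently substituting a
    // replacement character when a byte has no meaning in the charset.
    static String strictDecode(byte[] bytes, String charsetName) {
        try {
            return Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] literal = { (byte) 129 }; // the byte EDIT stores for u-umlaut

        // Read as Windows-1252, byte 129 is undefined (the compile error case).
        System.out.println(strictDecode(literal, "windows-1252") == null); // true

        // Read as CP 850, the byte is recognised as U+00FC.
        String decoded = strictDecode(literal, "IBM850");
        System.out.println(decoded.equals("\u00fc")); // true

        // Printed with a Windows-1252 output encoding, it becomes byte 252,
        // which a CP 850 console displays as superscript three.
        int printed = decoded.getBytes(Charset.forName("windows-1252"))[0] & 0xFF;
        System.out.println(printed); // 252
    }
}
```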

So, this change might make some source code invalid or
change its behavior.

Well.
 
Ian Pilcher

> When you use a Windows-1252 editor to edit Java source and
> then the Java process prints it to a Windows CP-850 console,
> umlauts, like »ü«, will not be rendered correctly, because
> the process will print the character »³« that has the code
> in CP 850 that »ü« has in Windows 1252.

Windows still isn't using UTF-8?

Good grief!
 
Arne Vajhøj

> Windows still isn't using UTF-8?

Most Windows editors, like editors on other operating systems,
support multiple character sets, including CP-1252 and UTF-8.

Arne
 
Arne Vajhøj

> So, this change might make some source code invalid or
> change its behavior.

I don't think it will affect that much code.

But interesting little gem.

Arne
 
BGB

> Windows still isn't using UTF-8?
>
> Good grief!

most things in Windows are done 1 of 2 ways:
using ASCII and codepages;
using UTF-16.

granted, it wouldn't likely be all that difficult to write a UTF-8 ->
UTF-16 console printer, but it will involve the relevant parts of the
Win32 API.


so, the issue may not be so much Windows, but more what the particular
JVM does regarding console output.

most likely, it does the least effort thing, which is to directly emit
bytes, which in turn means ASCII.


if it really matters, there is always JNI and the ability to overload
the PrintStream class...
 
Arne Vajhøj

> most things in Windows are done 1 of 2 ways:
> using ASCII and codepages;
> using UTF-16.

There is also UTF-8 support.

Even Notepad can read and write UTF-8.

But the console is special. MS wanted it to be DOS compatible.
So it is typically CP-437 or CP-850.

Arne
 
Andreas Leitgeb

Stefan Ram said:
> If I then
> compile with »-encoding CP-850«, the error will be gone, but
> Java will be too smart: It detects that »ü« means »ü« in CP
> 850 and converts the literal byte value 129 from the source
> code to the value »ü« has in Unicode,

So far it is not "too smart", but just simply "correct" and
doing the right thing.

> then it will print
> this to the CP-850 console,

This is where it goes awry. There should be a way to tell java
(the jvm) to use CP-850 as default encoding instead of cp-1252
when running in console.
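
One way to get that effect from inside the program, without relying on the JVM's default, is to replace System.out with a PrintStream that encodes explicitly. What follows is a minimal sketch under assumptions: the class and helper names are invented for illustration, and "IBM850" is Java's canonical name for CP 850 (with "Cp850" as an alias), which is assumed to be present in the JDK in use.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ConsoleEncodingDemo {

    // Wraps an output stream in an auto-flushing PrintStream that encodes
    // characters with the named charset instead of the platform default.
    static PrintStream printStreamFor(OutputStream out, String charsetName) {
        try {
            return new PrintStream(out, true, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    // Captures the bytes such a stream would send to the console.
    static byte[] bytesPrinted(String text, String charsetName) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream ps = printStreamFor(buffer, charsetName);
        ps.print(text);
        ps.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // The u-umlaut leaves the program as byte 129, the CP 850 code.
        System.out.println(bytesPrinted("\u00fc", "IBM850")[0] & 0xFF); // 129

        // From here on, System.out encodes for a CP 850 console.
        System.setOut(printStreamFor(System.out, "IBM850"));
        System.out.println("\u00fc");
    }
}
```

Hard-coding the code page is only reasonable for a classroom demo; a console set to a different code page would again show the wrong character.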
 
BGB

> There is also UTF-8 support.
>
> Even Notepad can read and write UTF-8.

yes, notepad can, but I meant more in terms of most of the Win32 API
calls, which tend to support either ASCII+codepages or UTF-16, meaning
if one wants UTF-8 they typically have to support it manually (such as
by converting the string and passing it to the UTF-16 version of the call).

"WriteConsoleA()" expects ASCII, and "WriteConsoleW()" does UTF-16...

granted, converting UTF-8 -> UTF-16 and similar is fairly trivial...
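
To illustrate how small that conversion is, here is a sketch in Java rather than against the Win32 API. The class and method names are hypothetical, and the choice of little-endian UTF-16 is an assumption about what a wide-character Windows call would expect.

```java
import java.nio.charset.StandardCharsets;

public class Utf8ToUtf16 {

    // Decodes UTF-8 bytes to a String (UTF-16 internally), then emits the
    // code units as little-endian bytes without a byte order mark.
    static byte[] utf8ToUtf16le(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] utf8 = { (byte) 0xC3, (byte) 0xBC }; // u-umlaut in UTF-8
        StringBuilder hex = new StringBuilder();
        for (byte b : utf8ToUtf16le(utf8)) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim()); // FC 00
    }
}
```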

> But the console is special. MS wanted it to be DOS compatible.
> So it is typically CP-437 or CP-850.

yes, by default...

one can still use Unicode though, at least according to the MSDN /
Platform SDK documentation, it just requires using the appropriate call.

now, what about whatever the JVM's "println()" call does by default?...


worst case, one could use JNI to make the "WriteConsoleW()" calls and
similar, and implement a custom PrintStream wrapper (and use it in place
of the default one in "System.out").

granted, there may well be simpler and cleaner ways to do it, but I am
no real expert on the JVM.


or such...
 
Arne Vajhøj

Why would you do such a stupid thing?

Most people working in IT do not have religious aversions
to specific character sets.

CP-1252 is just a character set like so many others.

Arne
 
Arne Vajhøj

> yes, notepad can, but I meant more in terms of most of the Win32 API
> calls, which tend to support either ASCII+codepages or UTF-16, meaning
> if one wants UTF-8 they typically have to support it manually (such as
> by converting the string and passing it to the UTF-16 version of the call).
>
> "WriteConsoleA()" expects ASCII, and "WriteConsoleW()" does UTF-16...

That is true.

TCHAR, _T, #define UNICODE and all that good stuff.

But this sub thread started about using an editor.

Arne
 
BGB

> Most people working in IT do not have religious aversions
> to specific character sets.
>
> CP-1252 is just a character set like so many others.

yep.

in an ideal world, probably everything would default to UTF-8, but alas...


I thought I remembered there being a codepage option in Notepad, but all
I seem to be seeing now is ASCII / UTF-8 / Unicode / Unicode Big-Endian.

or such...
 
