UTF-8 problems with Windows

Michael Jung

I have the following code fragment in a tiny webserver:

...
os = sock.socket().getOutputStream();
osr = new PrintWriter(new PrintStream(os, true, "UTF-8"));
osr.println("HTTP/1.1 200 OK");
osr.println("Content-Type: text/html; charset=utf-8");
osr.println();
osr.println(test());
...

private String test() {
    String ret = null;
    try {
        StringBuffer tmpl = new StringBuffer
            ("<html><head></head><body>H\u00e2n</body></html>");
        ret = tmpl.toString();
    }
    catch (Exception e) {
        e.printStackTrace();
    }
    System.out.println(ret);
    return ret;
}

With Linux, Firefox and Opera there is no problem and
the a with circumflex is printed nicely.

On Windows XP I get neither Firefox nor IE to work correctly.

Firefox shows some FFFD square, but when I change from the (detected)
UTF-8 encoding to ISO-8859-1, it displays things correctly. But that
would be the wrong encoding!?

IE shows some empty rectangle in the main browser window, but when
looking at the page source, everything is shown correctly!?

I have seen the correct output, but don't remember how I got it; so
it's not missing glyphs.

This is probably not a Java question, as I suspect some Windows magic
is happening here. Maybe it has something to do with the infamous BOM?
(I tried setting "file.encoding" to "UTF-8", for what it's worth. The
cmd prompt output from the out.println then shows an o with circumflex,
but that's due to the Windows legacy encoding, I think.)

Michael
 
Knute Johnson

Michael:

I've been playing around with this and I can't get it to work correctly
on Windows or Linux. I tried just putting a file with the 0xE2
character on my web server (which is set to default to UTF-8) and I get
a black square rotated 45 degrees with a white ? in it. If I reset the
character encoding to ISO-8859-1 on the browser the character appears
correctly. There is something I don't understand here and hopefully you
will get a better answer.
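
One possible explanation, sketched in Java. It assumes the file on the server held the single byte 0xE2 (the ISO-8859-1 encoding of the character) rather than the two-byte UTF-8 sequence 0xC3 0xA2. A lone 0xE2 is not valid UTF-8, so a UTF-8 decoder substitutes U+FFFD, the replacement character that browsers draw as a question mark in a diamond. The class and method names are invented for illustration.

```java
import java.nio.charset.StandardCharsets;

public class LoneByteDemo {
    // 'H', the single byte 0xE2, 'n': valid ISO-8859-1, malformed UTF-8.
    static final byte[] LATIN1_BYTES = { 'H', (byte) 0xE2, 'n' };

    // Decoding as UTF-8 replaces the malformed byte with U+FFFD.
    static String decodeAsUtf8() {
        return new String(LATIN1_BYTES, StandardCharsets.UTF_8);
    }

    // Decoding as ISO-8859-1 maps byte 0xE2 straight to U+00E2.
    static String decodeAsLatin1() {
        return new String(LATIN1_BYTES, StandardCharsets.ISO_8859_1);
    }

    // Print code points as hex so the console encoding cannot distort them.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println("as UTF-8:      " + codePoints(decodeAsUtf8()));
        System.out.println("as ISO-8859-1: " + codePoints(decodeAsLatin1()));
    }
}
```

If that assumption holds, switching the browser to ISO-8859-1 "fixes" the display only because the bytes were ISO-8859-1 all along.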
 
Roedy Green

With Linux, firefox and opera there is no problem and
the a with circumflex is printed nicely.


0x00e2 is supposed to be &acirc; in Unicode and ISO-8859-1 (where it is the
single byte 0xE2); in UTF-8 it is encoded as the two bytes 0xC3 0xA2.

However, in a proprietary windows encoding, it could be anything. What
encoding is your System.out.println using?

To find out, dump a set of chars 0 .. 255 to System.out and redirect
them to a file. Then look at the file with the EncodingRecogniser
utility.
See http://mindprod.com/jgloss/encoding.html

You might find windows-1252, Cp437, Cp850...

Also try dumping out the character as hex. You will see it is likely
just fine. It is just System.out screwing it up.
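
A minimal sketch of such a hex dump, assuming the test string from the original post. The helper name `hex` is made up for this example. Instead of printing the character itself, it prints the bytes each charset produces, so the console encoding cannot get in the way.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class HexDump {
    // Encode the text with the given charset and render each byte as two hex digits.
    static String hex(String text, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "H\u00e2n";
        System.out.println("UTF-8:      " + hex(s, StandardCharsets.UTF_8));      // 48 C3 A2 6E
        System.out.println("ISO-8859-1: " + hex(s, StandardCharsets.ISO_8859_1)); // 48 E2 6E
        // The platform default is whatever the JVM was started with; no output is assumed here.
        System.out.println("default (" + Charset.defaultCharset() + "): "
                + hex(s, Charset.defaultCharset()));
    }
}
```

Comparing the "default" line with the other two shows which encoding an unqualified writer would use on a given machine.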

--
Roedy Green Canadian Mind Products
http://mindprod.com

"You can have quality software, or you can have pointer arithmetic; but you cannot have both at the same time."
~ Bertrand Meyer (born: 1950 age: 59) 1989, creator of design by contract and the Eiffel language.
 
Roedy Green

On Windows xp I get neither firefox nor IE to work correctly.

Some other things to try:

1. Use Wireshark to snoop on the messages your server is sending. See
if the problem is in the server or the client browser. Make sure your
headers and body are encoded as you intended.

See http://mindprod.com/jgloss/wireshark.html

2. Check the font. If your font does not support &acirc; it won't
support an embedded 0x00e2. Try embedding &acirc; (the entity, not
the hex) in your text body.
Use http://mindprod.com/jgloss/fontshower.html to make sure the font
supports &acirc;
 
Michael Jung

This looks rather strange. I'd prefer to go for something like this:

new PrintWriter(new OutputStreamWriter(os, "UTF-8"))

I used to have new PrintWriter(os), but wanted to enforce the encoding
and PrintWriter doesn't take one. *That* would be a convenience
constructor worth having.

Here's what I suspect:

* PrintStream is an OutputStream, so most of its methods just take
  bytes, and it happens to have a few more which take chars and
  Strings. These extra methods will do the char->UTF-8 conversion
  (an internal OutputStreamWriter is created), but the byte-based
  methods can't - they're already bytes.
* PrintWriter can take an OutputStream. If it does so, it will also
  insert its own OutputStreamWriter (using the local system's charset).
* Chars passed to the PrintWriter are converted using its
  OutputStreamWriter, and never get passed on to the
  char/String-based methods of the PrintStream, so its charset
  encoder does not get used.

Result: you're writing using the native encoding of your server,
regardless of what you tell the PrintStream.
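
The suggested fix can be sketched as follows. This is only an illustration: the class name, `utf8Writer` and `render` are invented here, and a ByteArrayOutputStream stands in for the socket's output stream. The point is that the charset is given to the OutputStreamWriter that actually performs the char-to-byte conversion, so nothing depends on the platform default.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

public class ExplicitEncoding {
    // The encoder sits directly under the PrintWriter, so every char goes through UTF-8.
    static PrintWriter utf8Writer(OutputStream os) {
        return new PrintWriter(new OutputStreamWriter(os, StandardCharsets.UTF_8), true);
    }

    // Write the text through the UTF-8 writer and return the bytes that reached the stream.
    static byte[] render(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintWriter out = utf8Writer(sink);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : render("H\u00e2n")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```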

Now that you mention it, this is what I found in the PrintStream Javadoc:

"All characters printed by a PrintStream are converted into bytes
using the platform's default character encoding. The PrintWriter class
should be used in situations that require writing characters rather
than bytes."

It even says so in the Javadoc of the constructor I used. *blush*

Thank you very much.

Bonus question: what is the encoding parameter good for in the
constructor of the PrintStream? It actually led me down the wrong
track.

Michael
 
jolz

osr.println("HTTP/1.1 200 OK");
osr.println("Content-Type: text/html; charset=utf-8");
osr.println();
osr.println(test());
With Linux, firefox and opera there is no problem and
the a with circumflex is printed nicely.

I don't think it is required to work even with plain ASCII, especially
on Linux:

1.
public void println()

Terminate the current line by writing the line separator string.
The line separator string is defined by the system property
line.separator, and is not necessarily a single newline character ('\n').

2.
Response = Status-Line ; Section 6.1
*(( general-header ; Section 4.5
| response-header ; Section 6.2
| entity-header ) CRLF) ; Section 7.1
CRLF
[ message-body ] ; Section 7.2

CRLF = CR LF

CR = <US-ASCII CR, carriage return (13)>
LF = <US-ASCII LF, linefeed (10)>
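
A minimal sketch of a response writer that follows the grammar above by emitting CRLF explicitly instead of relying on println() and line.separator. The class and method names are invented for this example, and the Content-Length header is an addition that was not in the original code. The headers are written as US-ASCII and the body as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class CrlfResponse {
    static final String CRLF = "\r\n";

    // Write status line, headers, the empty line and the body with explicit CRLF line ends.
    static void writeResponse(OutputStream os, String html) throws IOException {
        byte[] body = html.getBytes(StandardCharsets.UTF_8);
        String head = "HTTP/1.1 200 OK" + CRLF
                + "Content-Type: text/html; charset=utf-8" + CRLF
                + "Content-Length: " + body.length + CRLF // assumption: not in the original post
                + CRLF;
        os.write(head.getBytes(StandardCharsets.US_ASCII));
        os.write(body);
        os.flush();
    }

    // Convenience for trying it out without a socket.
    static byte[] render(String html) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            writeResponse(sink, html);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] wire = render("<html><head></head><body>H\u00e2n</body></html>");
        System.out.println(wire.length + " bytes written");
    }
}
```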
 
Michael Jung

Thomas Pornin said:
In the Javadoc of JDK-1.1.8, PrintStream was documented as
being deprecated. Both public constructors include the comment:
"Note: PrintStream() is deprecated." and go on to state that
PrintWriter should be used.

In JDK-1.3.1, the comments about deprecation are gone (I do not have
the Javadoc for JDK-1.2, so I cannot check there). PrintStream got
"reprecated". At some point between 1.1.8 and 1.3.1, Sun realized
that explicit deprecation is not enough to get rid of a troublesome
class, and that too much code was using PrintStream to allow for
a simple removal (it would break too much existing code).

It would not be enough, but it would help. Or does the danger of
refactoring wrongly (by people trying to get rid of every warning in
sight) outweigh the benefits of a cleaner interface with deprecated parts?

Michael
 
Lew

Thomas said:
Backward compatibility goes to
a great extent to explain why Java is as it is nowadays. Examples
of quirks include the following: ....
-- There are both java.net.URI and java.net.URL, with oh-so-slightly
different handlings of nominally invalid URLs (especially when there
are spaces in the string).

That one doesn't belong on your list. The classes exist to handle the
functional differences between URIs generally and URLs specifically. As the
URI Javadocs state:
 
Mike Schilling

I'd put this one as "chars are fixed at 16 bits rather than simply
'big enough to hold all Unicode characters'". 24 bits would be
sufficient to get rid of surrogates.
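
A small illustration of the surrogate issue, using U+1F600 as an arbitrary example of a code point above U+FFFF (the class and constant names are made up). Because a char is 16 bits, such a code point occupies two chars in a String.

```java
public class SurrogateDemo {
    // One code point outside the Basic Multilingual Plane.
    static final String GRINNING_FACE = new String(Character.toChars(0x1F600));

    public static void main(String[] args) {
        String s = GRINNING_FACE;
        // length() counts 16-bit chars, codePointCount() counts Unicode code points.
        System.out.println("chars:       " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));
        System.out.printf("surrogates:  %04X %04X%n", (int) s.charAt(0), (int) s.charAt(1));
    }
}
```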

And I'd add:
NullPointerExceptions in a language that insists it doesn't have
pointers.

In DOM, the null namespace is represented by a null String. In SAX,
by an empty string.

That one doesn't belong on your list. The classes exist to handle the
functional differences between URIs generally and URLs specifically.
As the URI Javadocs state:

It belongs on a different list, one where Java accurately models a
historical quirk in a different domain.
 
Michael Jung

jolz said:
osr.println("HTTP/1.1 200 OK");
osr.println("Content-Type: text/html; charset=utf-8");
osr.println();
osr.println(test());
[...]

I don't think it is required to work even with plain ASCII, especially
on Linux:
public void println()
Terminate the current line by writing the line separator
string. The line separator string is defined by the system property
line.separator, and is not necessarily a single newline character
('\n').
Response = Status-Line ; Section 6.1
*(( general-header ; Section 4.5
| response-header ; Section 6.2
| entity-header ) CRLF) ; Section 7.1
CRLF
[ message-body ] ; Section 7.2

1. What I described would have been a strange phenomenon of this error
indeed.
2. You are right.
3. I have yet to meet a client that complains.

Michael
 
Lew

Mike said:
And I'd add:
NullPointerExceptions in a language that insists it doesn't have
pointers.

What language is that? Not Java.

Java certainly does not "insist" that it doesn't have pointers. Java most
assuredly does have pointers. There is even an index entry in the JLS for
"pointers" and the JLS uses the term in

§4.3.1: "The reference values (often just references) are pointers to these objects ..."
§6.8.7: "... as in buf holding a pointer to a buffer of some kind ..."
§15.13.2: "... the check for a null pointer ..."

and, of course, the dozens of references in the JLS to 'NullPointerException'
itself.

That is averring that that language has pointers, the exact opposite of
"insist[ing] it doesn't have pointers". Heck, it's practically shouting to
anyone who will listen that the language does have pointers.

I don't know how this canard that Java doesn't have pointers ever got started.
 
Alan Morgan

What language is that? Not Java.

Java certainly does not "insist" that it doesn't have pointers. Java most
assuredly does have pointers. There is even an index entry in the JLS for
"pointers" and the JLS uses the term in §4.3.1

Java doesn't have explicit pointers. Java has references that are
implemented using pointers (but what language doesn't have some feature
implemented using pointers?).

The NullPointerException is more like a NullReferenceException. I note
that the JLS says "Integer operators can throw a NullPointerException if
unboxing conversion of a null reference is required".
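
The quoted JLS sentence can be seen in a few lines. This is just an illustration with invented names: adding 1 to a null Integer forces an unboxing conversion, and that is where the NullPointerException comes from.

```java
public class UnboxingNpe {
    // Returns true if unboxing a null Integer throws NullPointerException.
    static boolean unboxingNullThrows() {
        Integer boxed = null;
        try {
            int sum = boxed + 1; // unboxing conversion of a null reference
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("NullPointerException on unboxing: " + unboxingNullThrows());
    }
}
```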

Also, the JLS talks about types and mentions primitive types and reference
types (where pointers are mentioned), but doesn't mention pointer types
anywhere (and, in the index, for "pointers" it says "see references").
Almost every mention of "pointer" in that doc is in the context of
NullPointerException.

One could be forgiven for thinking that pointers are an implementation
detail and not part of the language proper.

[snipped references, heh, to pointers]
I don't know how this canard that Java doesn't have pointers ever got started.

Probably from Java programmers and various *other* bits of documentation
like http://www.j2ee.me/docs/white/langenv/Simple.doc2.html

which says "You no longer have dangling pointers and trashing of memory
because of incorrect pointers, because there are no pointers in Java".

Alan
 
Lew

Alan said:
Java doesn't have explicit pointers. Java has references that are
implemented using pointers (but what language doesn't have some feature
implemented using pointers?).

That's not what the JLS says. The JLS says that references *are* pointers.
It is not a question of implementation but of language definition. Your
discussion of implementation is by the wayside.

The NullPointerException is more like a NullReferenceException. I note

The terms are synonymous, according to the JLS.

that the JLS says "Integer operators can throw a NullPointerException if
unboxing conversion of a null reference is required".

Yes, because the pointer might be null.

Also, the JLS talks about types and mentions primitive types and reference
types (where pointers are mentioned), but doesn't mention pointer types
anywhere (and, in the index, for "pointers" it says "see references").

Again, because according to the JLS references are pointers. Why did you
elide that quote from your response?

Almost every mention of "pointer" in that doc is in the context of
NullPointerException.

So? (Note that you say "almost" - that only counts in horseshoes and hand
grenades.)

One could be forgiven for thinking that pointers are an implementation
detail and not part of the language proper.

Why? It's not accurate to think that.

[snipped references, heh, to pointers]
I don't know how this canard that Java doesn't have pointers ever got started.

Probably from Java programmers and various *other* bits of documentation
like http://www.j2ee.me/docs/white/langenv/Simple.doc2.html
which says "You no longer have dangling pointers and trashing of memory
because of incorrect pointers, because there are no pointers in Java".

The statement "a language that insists it doesn't have pointers", and the
question at large of whether the language has a feature, is handled by
reference to the definition of the language itself, i.e., the JLS, which
clearly states that the language does have pointers, not by reference to some
mistaken writer's wrong statements in a non-normative document at a
non-authoritative site. Showing that the canard has started as you just did
does not answer the question of how it got started.

The JLS states outright that the Java language has pointers, as a language and
without regard for its implementation. The JLS is the definition of the Java
language, ergo the conclusion is definitive.
 
Roedy Green

I used to have new PrintWriter(os), but wanted to enforce the encoding
and PrintWriter doesn't take one. *That* would be a convenience
constructor needed.

The PrintWriter constructor has a csn parameter (charset/encoding
parameter)

see http://mindprod.com/applet/fileio.html
for sample code.

 
Roedy Green

I don't know how this canard that Java doesn't have pointers ever got started.

Go back to Java 1.0. Almost every day some C++ programmer would ask,
"How could you write any serious code without pointers?"

Then someone would explain that there were pointers, they were just called
"references" to distinguish them from C++'s wild pointers because they
had some safety features like no pointer arithmetic, and no pointing
into the middle of objects.


 
Michael Jung

PrintWriter constructor has a csn parameter (charset/encoding
parameter)

You know that that one is meaningless in the context the question
arose in?

Michael
 
Roedy Green

You know that that one is meaningless in the context the question
arose in?

For me, reading most posts is like reading a foreign language. I can
pick out a few keywords, and I guess at the general problem area, and
I hand out some standard advice. Posted code is usually so atrocious
it would be like sifting through dog poo to find bugs in it.

I find it difficult to distinguish between somebody explaining
something complicated, and somebody who has no clue. I don't have the
patience for detailed analysis. Others are much better at it. I
usually find out the O.P. is not an addled newbie when someone else
responds with advanced information.

The way I see it, whether what I say solves the O.P.'s particular
problem is secondary. I am talking to the broader audience of people
who have similar problems, and people who may later find the thread
with google.

I feel disgust at an O.P. who complains when someone posts information
he already knows or that is not relevant to his particular problem, as
if the whole point of the newsgroup were to serve him alone.

Who does he think he is, the CEO of a company I work for?

If my newsreader had picons, as Galahad of old had, it would be much
easier to keep track of the skill sets of the various participants. As
it is, except for a handful, all turn into a blur.
 
Roedy Green

I used to have new PrintWriter(os), but wanted to enforce the encoding
and PrintWriter doesn't take one.

If you meant by "enforce", "use a uniform encoding for all your
PrintWriters", use a named constant in all your constructors. In a
large project, I often have a Configure class with almost nothing in
it but named constants that people might likely want to change.

Or of course you could extend PrintWriter to force the default. It is
not a final class.

There are really two logical choices, UTF-8 for interplatform
communication or the platform default.
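
Both ideas can be combined in a short sketch: a subclass that forces the encoding, with the charset held in one named constant. The class name `Utf8PrintWriter` and the constant `ENCODING` are hypothetical, chosen only for this example.

```java
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8PrintWriter extends PrintWriter {
    /** Single place to change the encoding used by every writer of this type. */
    public static final Charset ENCODING = StandardCharsets.UTF_8;

    // Wrap the stream in an OutputStreamWriter with the fixed charset; autoflush on println.
    public Utf8PrintWriter(OutputStream os) {
        super(new OutputStreamWriter(os, ENCODING), true);
    }

    public static void main(String[] args) {
        Utf8PrintWriter out = new Utf8PrintWriter(System.out);
        out.println("encoding in use: " + ENCODING);
    }
}
```

A subclass keeps call sites short, while a plain factory method using the same constant would avoid inheritance; which one fits better is a matter of taste.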
 
Wojtek

Roedy Green wrote:
If my newsreader had picons, as Galahad of old had, it would be much
easier to keep track of the skill sets of the various participants. As
it is, except for a handful, all turn into a blur.

I use MesNews. It has rules which can be applied for various criteria.
For instance, my favourite posters are shown in red text in the message
thread subject column.

Not a true picon, but it does work...
 
Tom Anderson

And I'd add:
NullPointerExceptions in a language that insists it doesn't have
pointers.

Where in the JLS does it insist - or even vaguely affirm - that it doesn't
have pointers?

From the index:

pointers
See references

ISTM that the JLS considers the two words to be synonyms.

tom
 
