Still having charset problems with Tomcat 5 on Windows

bdobby

Hi, I'm back trying to sort out what happens to £ (UK currency symbol)
in a JSP form running on Tomcat 5 under Windows. I have reduced the
problem to a simple example, which I enclose below. If I enter £ in
the textarea and submit the form, the £ gets prefixed with an accented
A. The A also appears in the query string in the browser's address bar
as %C2. However, if I save the source of the displayed JSP as an HTML
file, submitting the form displays only the £ (%A3) in the query
string.
Any help would be GREATLY appreciated.
TIA
Brian

Here is the JSP:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<%@page contentType="text/html;charset=UTF-8"%>
<%@page pageEncoding="UTF-8"%>
<%@ taglib prefix="c" uri="http://java.sun.com/jsp/jstl/core" %>
<html>
<head><title>JSP Page</title></head>
<body>
<form>
£<br>
<textarea name=text1>
<c:out value="${param.text1}"/>
</textarea><br>
<input type=submit name=submit value='submit'/>
</form>
</body>
</html>

Here is the URL displayed after submitting the form:
http://localhost:8084/Test/index.jsp?text1=£&submit=submit
 
Gerard Krupa

bdobby said:
Hi, I'm back trying to sort out what happens to £ (UK currency symbol)
in a JSP form running on Tomcat 5 under Windows. I have reduced the
problem to a simple example, which I enclose below. If I enter £ in
the textarea and submit the form, the £ gets prefixed with an accented
A. The A also appears in the query string in the browser's address bar
as %C2. However, if I save the source of the displayed JSP as an HTML
file, submitting the form displays only the £ (%A3) in the query
string.
Any help would be GREATLY appreciated.
TIA
Brian

The accented A is the lead byte (0xC2, with its MSB set) of a two-byte
UTF-8 sequence: in UTF-8 the pound sign is encoded as two bytes (C2 A3)
instead of one. This is normal behaviour for UTF-8 and nothing to worry
about. It occurs in this case because you have made the page encoding
UTF-8 (this is sent in the HTTP headers and will not be present when the
page is saved to file). Try setting the encoding to ISO-8859-1 in the two
<%@page> directives and see what happens.
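
To see the two encodings side by side, here is a minimal Java sketch that percent-encodes the pound sign the way a browser would for a UTF-8 page and for an ISO-8859-1 page. The class and method names are illustrative only and are not part of the JSP above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class PoundEncodingDemo {
    // Percent-encode a form value using the given charset name.
    // (Hypothetical helper; wraps the checked exception for convenience.)
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // The pound sign, written as an escape to avoid source-file encoding issues.
        String pound = "\u00A3";
        System.out.println(encode(pound, "UTF-8"));      // prints %C2%A3
        System.out.println(encode(pound, "ISO-8859-1")); // prints %A3
    }
}
```

The %C2 seen in the address bar is therefore just the first of the two UTF-8 bytes, not an extra character added by the browser.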

There are fundamental flaws in specifying and detecting the character
set used for submitted form data so you can't always assume that the
data will be passed in the same character set that was used to deliver
the page. The link
http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html has some tips
on how to overcome this.

HTH
Gerard
 
VK

Nothing wrong with your script, it's a browser (at least IE) flaw.

Look at this search query from google ("pound", "sign", <pound sign>):

http://www.google.com/search?hl=en&lr=&q=pound+sign+£&btnG=Search

The problem lies in very unstable Unicode handling for chars whose first
byte equals 0.

Somehow the system gets lost with such chars when the coding is set to UTF-8.
It cannot "get" that %A3 or such is really %00A3.
Instead the system tries to "guess" the right Unicode table.
Strangely enough 99% of its guesses are Korean, so it's prefixing the chars
with %C2 - right in the middle of Hangul (Korean syllable alphabet).
More about the special Korean meaning in IE (which seems to be debugging
trash left by one of the IE developers) you can read in comp.lang.javascript;
look for the thread by the keywords "Bizarre JS brackets bug".

The situation is not so desperate though: at least YOU know what table to
use, so drop C2 (or whatever trash you'll get) and re-prefix it with 00.
Another solution would be to use char-entities instead wherever
possible.
 
Gerard Krupa

VK said:
Somehow the system gets lost with such chars when the coding is set to UTF-8.
It cannot "get" that %A3 or such is really %00A3.
Instead the system tries to "guess" the right Unicode table.
Strangely enough 99% of its guesses are Korean, so it's prefixing the chars
with %C2 - right in the middle of Hangul (Korean syllable alphabet).
More about the special Korean meaning in IE (which seems to be debugging
trash left by one of the IE developers) you can read in comp.lang.javascript;
look for the thread by the keywords "Bizarre JS brackets bug".

C2 A3 is the correct UTF-8 encoding for the pound sign (correctly passed by
the browser, as specified in the page encoding); see
http://www1.tip.nl/~t876506/utf8tbl.html. When this is converted into a
java.lang.String, the system is probably using the default ISO-8859-1
(Latin-1) string encoding and performing a single-byte conversion. I don't
believe that any 16-bit Unicode matching is being performed at all.
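
The single-byte conversion described here can be reproduced outside the container. The sketch below is only an illustration using java.net.URLDecoder, not Tomcat's actual decoding code, and the class and method names are made up. It decodes the same query value with both charsets, then shows a common workaround that turns the mis-decoded string back into bytes and re-reads them as UTF-8.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class PoundDecodingDemo {
    // Decode a percent-encoded value with the given charset name.
    // (Hypothetical helper; wraps the checked exception for convenience.)
    static String decode(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charset, e);
        }
    }

    // Workaround sketch: recover the original bytes of a string that was
    // wrongly decoded as ISO-8859-1, then decode those bytes as UTF-8.
    static String repair(String misdecoded) {
        byte[] rawBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String raw = "%C2%A3"; // what the browser sent for the pound sign
        String wrong = decode(raw, "ISO-8859-1"); // two chars: U+00C2 U+00A3
        String right = decode(raw, "UTF-8");      // one char:  U+00A3
        System.out.println(wrong.length());             // prints 2
        System.out.println(right.length());             // prints 1
        System.out.println(repair(wrong).equals(right)); // prints true
    }
}
```

The repair step only works because ISO-8859-1 maps every byte to exactly one character, so no information is lost in the wrong decoding.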

I have performed a quick test with IE by adding the following to a form:
<input type="hidden" name="_charset_" />
This is an IE-only trick that can tell you the encoding of submitted
parameters. This confirms that the data is being passed using UTF-8.
In fact, IE continues to encode form data in UTF-8 even if the page
encoding is changed to UTF-16.
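
On the server side, the reported value could be used to pick the charset for decoding. This is a rough sketch under the assumption that you have the raw (still percent-encoded) parameter value and the `_charset_` value in hand; the class, the method names and the fallback are all hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;

public class CharsetFieldDemo {
    // Return the charset reported by the browser if the JVM supports it,
    // otherwise the fallback.
    static String pickCharset(String reported, String fallback) {
        if (reported == null || reported.trim().isEmpty()) {
            return fallback;
        }
        String name = reported.trim();
        try {
            return Charset.isSupported(name) ? name : fallback;
        } catch (IllegalArgumentException e) {
            // Thrown for syntactically illegal charset names.
            return fallback;
        }
    }

    // Decode a raw percent-encoded value using the reported charset.
    static String decodeWith(String raw, String reported, String fallback) {
        String charset = pickCharset(reported, fallback);
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // not expected after isSupported
        }
    }

    public static void main(String[] args) {
        // With _charset_ reported as UTF-8, %C2%A3 decodes to a single pound sign.
        System.out.println(decodeWith("%C2%A3", "UTF-8", "ISO-8859-1").length()); // prints 1
    }
}
```

Since the hidden field is only filled in by some browsers, the fallback matters as much as the reported value.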

Regards,
Gerard
 
Ian Pilcher

Gerard said:
There are fundamental flaws in specifying and detecting the character
set used for submitted form data so you can't always assume that the
data will be passed in the same character set that was used to deliver
the page. The link
http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html has some tips
on how to overcome this.

You may also want to see

https://bugzilla.mozilla.org/show_bug.cgi?id=241540

 
bdobby

Thanks, Gerard. Changing the page-encoding to ISO-8859-1 did the trick.
Thanks again
Brian
 
