I manage an application that consists of a web front end to a MS-SQL
database for input of data. The application is in English and I only
want it to work in English, displaying English characters and
accepting English characters as input.
"English" is a language. Your problem is more about character sets.
Wikipedia is pretty readable on these topics and you might start from
the article on "Windows-1250".
One of the users has a Polish version of Windows running Internet
Explorer. Can someone explain why, on the Polish installation, junk
characters are returned from the database?
No, we need more information. A URL would be good, but telling us
details such as which Windows codepage they're using, which HTTP
request headers their browser sends, what your server returns and also
whether your server actually changes its behaviour depending on the
headers in the request, or if it's just coded to return the same to
all requests. Even knowing the database, web server and web scripting
language would be good.
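If it helps collect that information, here's one way to pull the charset out of a Content-Type header value with Python's standard library (the header value below is a made-up sample, not anything your server actually sent):

```python
# Extract the charset parameter from a Content-Type header value,
# as you might when comparing what the server returns to different
# browsers. The header value is an invented example.
from email.message import EmailMessage

msg = EmailMessage()
msg["Content-Type"] = "text/html; charset=ISO-8859-1"
print(msg.get_content_charset())  # prints "iso-8859-1"
```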
Most of all, samples of the HTML content returned are pretty important
- although it's hard to show these, as you'd have to deliver them to
us through a medium that's "encoding clean" and wouldn't change them
further (this is always a problem in remote debugging this sort of
bug).
I'm assuming that the "Polish browser" is running on Windows codepage
1250 (a UK computer would probably be running 1252 instead). This
means I'd expect it to send an HTTP request header that looks a bit
like this:
Accept-Language: pl
Accept-Charset: ISO-8859-2
I can't say more than this with any confidence, without seeing your
examples. However:
I might assume your server _doesn't_ do anything with these headers.
It assumes that everyone is English, uses the "english" character sets
and it will return the same content no matter who asked for it. In
that case, it would return content in the English language, would use
an "English-friendly" character set that's probably ISO-8859-1 (could
be others, but that's popular), and would label it as being in the
character set that it had actually used.
If that's the case, then everything would "work". Your Poles would see
correctly displayed English. They wouldn't see Polish language, and
they wouldn't even be able to save Polish-specific characters (such as
their own names) into the database and retrieve them correctly; those
characters (and those alone) would come back corrupted. However they
would have workable read-only access to English content, without
garbage.
Now, as I understand you, this isn't what's happening. Instead
your Polish read-only users are seeing English text being corrupted.
That's weird - it should never happen in a correctly implemented
system, even a system that makes no attempt to support anything beyond
English in ASCII.
It looks like you might be falling foul of Yoda's Law of Character
Encoding here, "Do, or do not. There is no 'try.'"
You can build a system that _doesn't_ do foreign encodings, and it
will work. Or you can build a system that _does_ do foreign encodings,
and it will work, and it will work for its foreign encodings too.
Where things go "wrong" (meaning garbage, not just languages that
deliberately aren't supported) it's usually caused by "half-encoding"
something: either not encoding things at all but sending headers as
if they had been, or vice versa. It's trying to do encoding and only
implementing half of it that causes the trouble. A favourite: allowing
characters to be input from a <form>, using the server's built-in
features to recognise the browser's "non-English" encoding before
storing, but then spitting those same octets back under an "English"
encoding regardless of how they were intended.
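To make that failure mode concrete, here's a sketch (in Python, not your actual stack) of a Polish name submitted from a Windows-1250 form, stored as raw octets, then served back labelled with an "English" encoding:

```python
# A Polish surname as a Windows-1250 browser would submit it.
name = "Wałęsa"
raw = name.encode("cp1250")   # the octets the server receives

# The server stores the octets untouched, then serves them back
# labelled as ISO-8859-1. The ASCII letters survive; the Polish
# ones (and only those) come back as garbage.
garbled = raw.decode("iso-8859-1")
print(garbled)   # "Wa" and "sa" intact, junk in the middle
```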
It obviously interprets the
characters as Polish, but I don't want it to interpret anything, just
display them like on the English version of Windows.
This is odd, because ASCII is consistent across most encodings around
(most non-ASCII encodings work by using the "upper" characters above
127). I really shouldn't try to guess any more without hard data,
particularly if Jukka is watching.
However (as a wild guess) there _are_ a few differences between the
encodings used for Windows-1250 and ISO-8859-2. It's possible that a
web server _might_ receive content in Windows-1250, recognise it thus
as being sufficiently "Polish" to not treat it as English any more, but
then send it back as ISO-8859-2 as its favoured approach for handling
Polish. But that's a guess - we'd have to see headers.
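That guess is at least checkable: the two encodings agree on the Polish letters themselves but disagree on a number of other byte positions, which you can enumerate with a short script (using Python's codec names for the two):

```python
# List the byte values where Windows-1250 and ISO-8859-2 decode
# to different characters. A handful of bytes are undefined in
# cp1250, which counts as a difference here too.
diffs = []
for b in range(0x80, 0x100):
    raw = bytes([b])
    try:
        cp = raw.decode("cp1250")
    except UnicodeDecodeError:
        cp = None
    if cp != raw.decode("iso-8859-2"):
        diffs.append(b)

print(len(diffs), "high byte values differ")
# The Polish letters mostly agree between the two; for example the
# byte for the letter "ł" is the same in both:
print("ł:", "ł".encode("cp1250"), "ł".encode("iso-8859-2"))
```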
How do I explain a solution/workaround to the developers.
For a happy, peaceful life you do three things:
* You abandon serious support for old browsers (where "old" currently
means "old enough this just isn't a problem any more")
* You code on the server in a language that understands Unicode.
* You switch to using Unicode with UTF-8 encoding throughout (for the
HTTP at least). No Windows encodings. No ISO-8859-* encodings.
* Everything Just Works. Really. It's great - so much easier than
doing it the old way.
* You rigidly police this through your developers (this is my day job,
I hate it - I simultaneously support English, Spanish, Czech,
Afrikaans, Arabic and a bunch more languages). They _will_ keep
breaking your encoding, even though it's the simplest way to work. Cut
their hands off if they do. I insist on "Copyright © FooCo" comments
at the top of ALL source files, just to keep an eye on the encoding.
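In concrete terms, "UTF-8 throughout" just means the bytes you send and the charset you declare always agree. A minimal sketch with the Python standard library (a real application would set this in its framework's configuration, and the page text here is invented):

```python
# A handler that serves every response as UTF-8 and says so in the
# Content-Type header, so browsers on any Windows codepage render
# it identically.
from http.server import BaseHTTPRequestHandler

PAGE = ("<!DOCTYPE html>\n"
        "<html><head><meta charset=\"utf-8\"></head>\n"
        "<body>Zażółć gęślą jaźń renders fine, and so does English"
        "</body></html>")

class UTF8Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGE.encode("utf-8")  # one encoding, everywhere
        self.send_response(200)
        # The declared charset and the actual bytes must agree;
        # that agreement is the whole trick.
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```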