Setting language to UTF-8

Bertilo Wennergren

Terence Parker:
And why don't I use UTF-8 for everything? Because, while that is the
ideal for compatibility between languages, fact of the matter is UTF-8
has entered the world too late. Languages such as BIG5 / GB have
become so dominant in Asia that these are native to most software, NOT
UTF. And that goes for websites in this part of the world too.

Can you name a web browser (still in use) that can handle BIG5/GB but
not UTF-8?
 
Toby A Inkster

Andreas said:
No, why? Because you need three bytes instead of two bytes for one
character?

But if a page is *primarily* in an East Asian language, then it is a
significant difference, so UTF-16 is preferable.
 
Toby A Inkster

Andreas said:
- UTF-16 would blow up the HTML markup by 100 %.

It would increase the size of element and attribute names as well as the
fairly common '<', '>', '&' and '"' characters.

OTOH the normal flow of text and attribute values (alt text, table
summaries, etc) come down in size. On many pages these parts are a
significant majority of the total page size.
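
The trade-off described above can be checked with a few lines of Python. This is only a rough sketch: the two sample strings are made up for illustration, and real pages will mix markup and text in very different proportions.

```python
# Hypothetical sample strings, chosen only for illustration.
markup = '<p class="intro"></p>'   # ASCII-only markup
cjk_text = "統一碼"                 # three CJK characters from the BMP


def sizes(s: str) -> tuple:
    """Return (utf8_bytes, utf16_bytes) for s, counted without a BOM."""
    return len(s.encode("utf-8")), len(s.encode("utf-16-le"))


markup_sizes = sizes(markup)
text_sizes = sizes(cjk_text)

# ASCII markup doubles in UTF-16, while these CJK characters
# shrink from three bytes each to two.
print("markup (utf-8, utf-16):", markup_sizes)
print("text   (utf-8, utf-16):", text_sizes)
```

Which encoding wins for a whole page depends on the ratio of markup to running text, which is the point being argued here.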
 
Terence Parker

Again, languages are not the issue; character encodings are, though
naturally the language has an impact on the repertoire of feasible
encodings. If you have pages with different encodings, then the
simplest way, on Apache, is to put files in one encoding into one
directory and create a .htaccess file into that directory, with a
suitable directive to Apache in it, e.g.
AddType text/html;charset=utf-8 HTML

This is not always suitable if you are providing hosting to various people
and don't want to give them permission to use .htaccess files. I would much
rather the character set be defined within the HTML itself rather than
dished out by the server. Also, if you are providing hosting for other
people, you cannot really force people to use different directories to
separate character sets - they may have a website which has all the files in
one single directory, for example, but using multiple character sets.
Whether you can do that depends on Apache 2. Have you checked its
documentation? I would guess that using an AddType without a charset
parameter would do it. But that's really _not_ the WWW way. The WWW way
is to specify the encoding in actual HTTP headers, and <meta> tags are
just surrogates that some people need to resort to (and that _might_ be
included for certain reasons even when you have made the server send
adequate headers).

It may _not_ be the WWW way, but sometimes the proper way doesn't fulfill
one's requirements. And I think much of the world that is used to using
romanised alphabets sees the difficulties faced in countries such as China
when it comes to language representation in an oversimplistic way.
Or too early. But it is true that UTF-8 is _inefficient_ for most East
Asian languages.

Naah... too late. If the world had adopted UTF-8 before anything else (such
as Big5, GB) were devised then everyone would be using it and sticking to
it. The problem now is that so much software and so many websites are
already made and utilising GB/BIG5 as the encoding format, not to mention
that Windows releases still use them natively, that people are rather
reluctant to change.
Again, encodings, not languages. And the software needs to grow up.
UTF-8 is the way the WWW and the Internet are going, in the sense that
support for UTF-8 is the primary goal (according to official IETF
policy) - any new protocols and software _should_ support it and
_may_ support other encodings.

You're quite pedantic about this language/charset thing aren't you? Yes
okay, I mean character set and not language. Still, I think you realised
that quite early on in my post.

I've got nothing against people using UTF-8 to represent Chinese characters,
but the fact of the matter is that many still aren't - never mind what
*should* be used. Using UTF vs. more native formats isn't just an easy
matter of taking the text and dumping it on a website - then setting the
character set. The text itself, as I'm sure you already know, has to be in
the right encoding to begin with - but unless UTF is explicitly selected,
most native versions of Windows (which, let's face it, most people use, vs.
say Linux or Mac OS) will use GB/Big5 to save the text. Then there is a
problem of the font. UTF fonts are not the same as the Big5 fonts - of which
there are many which are well established already in this part of the world.
If you slap on UTF-8 text on the web browser and expect a certain Big5 font
to be used - it obviously won't work. Yet, the choice of UTF-8 fonts that
comes standard with IE/Windows isn't exactly very inspiring.
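
The point about the text needing to be in the right encoding before the charset label is changed can be illustrated with a short transcoding sketch. It assumes the source bytes really are Big5 (the `transcode` helper and the sample string are hypothetical, not taken from any tool mentioned in this thread) and it does nothing about the separate font issue.

```python
def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target."""
    return data.decode(source).encode(target)


# Simulate text saved by an editor that defaults to Big5.
big5_bytes = "中文".encode("big5")

# Convert the bytes themselves; only then is a UTF-8 charset label truthful.
utf8_bytes = transcode(big5_bytes, "big5")

print(len(big5_bytes), "bytes as Big5,", len(utf8_bytes), "bytes as UTF-8")
```

Relabelling the Big5 bytes as UTF-8 without this step would leave the page unreadable, since the byte sequences differ.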

Also when it comes to fonts - trying to view English using UTF-8 in, say,
Netscape 7 results in ugly looking text. And trying to view English in Big5
also occasionally looks strange. So it's better to be able to set different
character sets to different pages - not just to different directories, which
isn't flexible enough.

Can you name a web browser (still in use) that can handle BIG5/GB but
not UTF-8?

No... but... (read above) - that's not really the point.

Terence
 
Terence Parker

Whether you can do that depends on Apache 2. Have you checked its
documentation? I would guess that using an AddType without a charset
parameter would do it. But that's really _not_ the WWW way. The WWW way
is to specify the encoding in actual HTTP headers, and <meta> tags are
just surrogates that some people need to resort to (and that _might_ be
included for certain reasons even when you have made the server send
adequate headers).

I forgot to add that no, I haven't checked the Apache2 documentation yet
because I haven't had time. I'm sure that the charset can be turned off
(which I will do) but I have not got round to looking yet.

Having said that, before anyone tells me to RTFM, I never actually asked for
help in achieving this because I know it's probably something very trivial
and I just don't have time to deal with it now, that's all.

Just so you know.

Terence
 
Toby A Inkster

Terence said:
I would much
rather the character set be defined within the HTML itself rather than
dished out by the server.

The main problem with such a situation is that to parse the HTML to find
out the character set, the web browser must already *know* the character
set!

It's a Catch 22 situation.

If you specify the character set in the HTTP headers this problem doesn't
arise.
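
A rough Python sketch of the difference: the header can be read without touching the body, while the <meta> fallback only works if the bytes before the tag can be read as plain ASCII, which is an assumption and not something the client can know in advance. The function names and the 1024-byte limit are illustrative choices, not any browser's actual implementation.

```python
import re
from email.message import Message
from typing import Optional

# Matches charset=... inside a <meta> tag, scanning raw bytes as ASCII.
META_RE = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def charset_from_header(content_type: str) -> Optional[str]:
    """Read the charset parameter of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def charset_from_meta(body: bytes, limit: int = 1024) -> Optional[str]:
    """Guess the charset from a <meta> tag near the start of the body."""
    match = META_RE.search(body[:limit])
    return match.group(1).decode("ascii").lower() if match else None


def pick_charset(content_type: str, body: bytes) -> Optional[str]:
    """Prefer the HTTP header; fall back to the in-document guess."""
    return charset_from_header(content_type) or charset_from_meta(body)
```

A body sent as UTF-16 would defeat `charset_from_meta` entirely, because its markup is not ASCII bytes, whereas the header route is unaffected.
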
This is not always suitable if you are providing hosting to various
people and don't want to give them permissions to use .htaccess files.

True. So what a sensible server administrator should do is enable
Apache MultiViews and then use lots of AddCharset directives to associate
particular character sets with file extensions.

That way, users can upload files:

one_two_three.html.iso8859-1
ichi_ni_san.html.big5


and link to them like this:

<a href="one_two_three.html">Count in English</a>
<a href="ichi_ni_san.html">Count in Japanese</a>

The server will then know the correct charsets for each file. The
web designer doesn't need to use .htaccess files and everyone is happy.
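
The idea of tying a charset to a file extension can be sketched in Python as follows. This is not how Apache implements AddCharset; the mapping table and the `content_type_for` function are hypothetical and only mimic the outcome described above.

```python
# Hypothetical extension-to-charset table, standing in for a set of
# AddCharset-style rules chosen by the server administrator.
CHARSET_BY_EXTENSION = {
    "iso8859-1": "ISO-8859-1",
    "big5": "Big5",
    "utf8": "UTF-8",
}


def content_type_for(filename: str, media_type: str = "text/html") -> str:
    """Build a Content-Type value from any charset extension in the filename."""
    extensions = filename.lower().split(".")[1:]
    for ext in extensions:
        if ext in CHARSET_BY_EXTENSION:
            return f"{media_type}; charset={CHARSET_BY_EXTENSION[ext]}"
    # No charset extension found: send the media type without a charset.
    return media_type


for name in ("one_two_three.html.iso8859-1", "ichi_ni_san.html.big5"):
    print(name, "->", content_type_for(name))
```

Files in the same directory can then carry different charsets, which addresses the single-directory case raised earlier in the thread.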
 
Bertilo Wennergren

Terence Parker:
No... but... (read above) - that's not really the point.

I thought the point was you trying to choose which encoding to use in
your pages. What practical problems would arise with your pages if they
were encoded in UTF-8? Where, when and how would something not work well?
 
Bertilo Wennergren

Toby said:
True. So what a sensible server administrator should do is enable
Apache MultiViews and then use lots of AddCharset directives to associate
particular character sets with file extensions.

That way, users can upload files:

one_two_three.html.iso8859-1
ichi_ni_san.html.big5


and link to them like this:

<a href="one_two_three.html">Count in English</a>
<a href="ichi_ni_san.html">Count in Japanese</a>
The server will then know the correct charsets for each file. The
web designer doesn't need to use .htaccess files and everyone is happy.

Perhaps not everyone. A lot of page authors would find it awkward to
handle files with extensions such as ".iso8859-1" on their computers.

They would find that they have to use ".html" while authoring, and then
change the extension before or after uploading the file. That could be quite
a hassle, and almost unavoidably the extension change would sometimes be
forgotten.
 
