A typical Chinese character will take up 16 bits in a UTF-16 file,
but 24 bits in a UTF-8 file. Thus a UTF-8 file may be up to 50%
bigger than UTF-16. Most Western characters only use 8 bits in
UTF-8, but 16 in UTF-16, so for Western languages, UTF-8 can be up
to 50% smaller than UTF-16.
Yes, but don't forget the markup!
I just tried saving the BBC News Chinese front page in the two
encodings:
81672 May 27 13:34 bbc-chinese-utf16.html
43759 May 27 13:33 bbc-chinese-utf8.html
In case that might be an unfair choice, I tried a Bank of China site:
141476 May 27 13:43 boc-tw-utf16.html
73144 May 27 13:44 boc-tw-utf8.html
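For anyone who wants to check the arithmetic without saving whole pages, here is a minimal Python sketch. The sample strings are made up for illustration (they are not taken from either site), and the sizes are counted without a BOM.

```python
# Hypothetical sample: a little HTML markup wrapped around Chinese text.
text_only = "中国银行新闻"
page = '<div class="headline"><a href="/news/1.html">' + text_only + "</a></div>"


def sizes(s):
    """Return (utf8_bytes, utf16_bytes) for s, with no BOM on either."""
    return len(s.encode("utf-8")), len(s.encode("utf-16-le"))


text_utf8, text_utf16 = sizes(text_only)
page_utf8, page_utf16 = sizes(page)

# Bare Chinese text favours utf-16 (3 bytes vs 2 per character) ...
print("text only:   utf-8", text_utf8, "bytes, utf-16", text_utf16, "bytes")
# ... but the ASCII markup costs 1 byte in utf-8 and 2 in utf-16.
print("with markup: utf-8", page_utf8, "bytes, utf-16", page_utf16, "bytes")
```

How far the balance tips depends entirely on the ratio of markup to text, so a real page may of course behave differently from this toy string.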
I'm no expert in CJK issues, so anything that I say about those
details would need to be confirmed with more-authoritative sources.
If you're better informed about this then feel free to say so, and
I'll concede. But I would make a few points.
There were already well-established local encodings for different
varieties of Chinese, producing the preferred glyphs for respective
users. AIUI, the Han Unification involved in the Unicode
representation of CJK has not been to everyone's taste.[0]
The established codings are still widely used, e.g. the BoC site was in
Big5 before I used Mozilla Composer's "save and change encoding"
option to produce the above Unicode-encoded variants.
But the more HTML-technical aspect would be, how well supported is
utf-16, not only for rendering pages but also for forms submission
etc.? How well do the web search services index documents served in
utf-16? There's little doubt in my mind that utf-8 has been supported
for quite some years now in a wide range of browsers, and search
engine support has also been good recently; but widespread support for
the encoding schemes[1] of utf-16 has been more recent.
In due course I'd expect any remaining difficulties to be overcome,
but I'm uneasy about a blanket recommendation to use utf-16. Even if
you choose it as your compact storage encoding, there might be
something to be said for transcoding to utf-8 when you serve it out to
the web.
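As a rough sketch of what such transcoding amounts to, assuming Python and a stored file that starts with a BOM (the function name is my own, purely illustrative):

```python
def transcode_utf16_to_utf8(data: bytes) -> bytes:
    """Re-encode utf-16 bytes as utf-8.

    The "utf-16" codec honours a leading BOM. Without one, Python assumes
    the platform's native byte order, so BOM-less input is better decoded
    with "utf-16-le" or "utf-16-be" explicitly.
    """
    return data.decode("utf-16").encode("utf-8")


# Stands in for the contents of a file stored as utf-16 with a BOM.
stored = "中文网页".encode("utf-16")
served = transcode_utf16_to_utf8(stored)
```

Whatever charset the server advertises in the HTTP header, and any meta charset inside the document, would need to say utf-8 to match; the sketch doesn't touch those.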
Even *if* you're worried about the file size, you might want to use
gzip compression, which is very widely supported for HTML nowadays.
10525 May 27 13:34 bbc-chinese-utf16.html.gz
9510 May 27 13:33 bbc-chinese-utf8.html.gz
11899 May 27 13:43 boc-tw-utf16.html.gz
10115 May 27 13:44 boc-tw-utf8.html.gz
As you can see, after gzip the files are of very similar sizes, which
isn't so surprising knowing that they contain the same information.
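The same effect can be reproduced with a small Python sketch using the standard gzip module. The page below is synthetic and far more repetitive than a real one, so the absolute numbers mean little; the point is only that the gap between the encodings shrinks after compression.

```python
import gzip

# Synthetic, hypothetical page: 200 list items of markup plus Chinese text.
rows = [
    f'<li class="headline"><a href="/news/{i}.html">新闻标题 {i}</a></li>\n'
    for i in range(200)
]
page = "<ul>\n" + "".join(rows) + "</ul>\n"

raw8 = page.encode("utf-8")
raw16 = page.encode("utf-16")   # includes a BOM

gz8 = gzip.compress(raw8)
gz16 = gzip.compress(raw16)

print("raw:     utf-8", len(raw8), "utf-16", len(raw16))
print("gzipped: utf-8", len(gz8), "utf-16", len(gz16))
```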
cheers
[0]
http://en.wikipedia.org/wiki/Han_unification ,
http://tclab.kaist.ac.kr/~otfried/Mule/unihan.html etc.
[1] there are four flavours of utf-16 to cope with: utf-16LE and
utf-16BE, which carry no BOM, and utf-16 with a BOM, which in turn
may be little- or big-endian.
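To make footnote [1] concrete, here is a small Python sketch of the four byte sequences for the single letter "A". The codec names are Python's; other tools may label them differently.

```python
import codecs

s = "A"

le = s.encode("utf-16-le")   # b"A\x00", little-endian, no BOM
be = s.encode("utf-16-be")   # b"\x00A", big-endian, no BOM

# The two BOM-prefixed flavours, built explicitly.
bom_le = codecs.BOM_UTF16_LE + le
bom_be = codecs.BOM_UTF16_BE + be

# Python's plain "utf-16" encoder writes a BOM followed by the platform's
# native byte order, so its output is one of the two values above.
native = s.encode("utf-16")

# A decoder labelled "utf-16" reads the BOM to pick the byte order.
for data in (bom_le, bom_be, native):
    print(data, "->", data.decode("utf-16"))
```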