Austin said:
You're right. And I'm saying that I don't care.
Well, I suspect most other people want to maintain backwards
compatibility. Hence the existence of UTF-8.
People need to stop
thinking in terms of bytes (octets) and start thinking in terms of
characters. I'll say it flat out here: the POSIX filesystem definition
is going to badly limit what can be done with Unix systems.
Why? POSIX gives nearly binary-transparent file names; the only
exceptions are the octets 0x00 and 0x2F ('/'). Considering the 1:1 mapping between
UTF-8 and other Unicode encodings, how can the choice of one or another
"badly limit" what can be done?
Change an environment variable and watch your programs that had worked so
well with Unicode break. *That* is the stone age that I refer to.
dd if=/dev/urandom of=/lib/ld-linux.so.2 and watch all my programs
break, too. What's your point?
It is always possible to break a computer system if you try hard enough
(or, all too often, not hard at all); but if the user actively attempts
to make his machine malfunction, that's not the OS's problem.
I'm also guessing that you don't do much with long Japanese filenames
or deep paths that involve *anything* except US-ASCII (a subset of
UTF-8).
Well, I have Japanese file names (though not that many in the grand
scheme of things), and have a lot of files and directories named in non
US-ASCII. Yeah, I know that file name length and path length limits
suck, but that's an implementation limitation of e.g. ext3, nothing
fundamental.
This last statement is true only because you use the term "octet."
You're correct; that isn't what I meant to say. Something along the
lines of the following is better worded:
UTF-8 can take more than one octet to represent a
character; UTF-16 can take more than two; UTF-32
more than four; etc.
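A quick Python 3 illustration of the octet counts involved (the sample
characters are arbitrary; the last one is a combining sequence, which is
why even UTF-32 can need more than four octets per user-perceived
character):

for s in ("A", "\u00e9", "\u8a9e", "\U0001d11e", "e\u0301"):
    print(repr(s),
          len(s.encode("utf-8")),      # 1, 2, 3, 4, 3 octets
          len(s.encode("utf-16-le")),  # 2, 2, 2, 4, 4 octets
          len(s.encode("utf-32-le")))  # 4, 4, 4, 4, 8 octets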
It's a useless term here, because UTF-8 only has any level of
efficiency for US-ASCII.
English, I've heard, is a rather common language.
Even if you step to European content, UTF-8
is no longer perfectly efficient,
Of course not --- but still generally better than UTF-16, I think.
Spanish, I've heard, is also a rather common language.
and when you step to Asian content,
UTF-8 is so bloody inefficient that most folks who have to deal with
it would rather work in a native encoding (EUC-JP or SJIS, anyone?), which
uses one or two bytes per character, or do everything in UTF-16.
Yes, for CJK, UTF-8 is fairly inefficient: a full 50% bigger than
UTF-16 (three octets per character instead of two).
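A rough Python 3 comparison, for whatever it's worth (the sample text is
arbitrary, but purely Japanese, so it shows the worst case for UTF-8):

text = "日本語のテキストです" * 100        # 1000 characters of pure CJK
for enc in ("euc_jp", "shift_jis", "utf-16-le", "utf-8"):
    print(enc, len(text.encode(enc)), "octets")
# euc_jp, shift_jis and utf-16-le all come out at 2000 octets;
# utf-8 comes out at 3000, i.e. the 50% figure above.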
OTOH, it has some nice advantages over UTF-16, like being backwards
compatible with C strings, being resynchronizable (if an octet is lost),
not having byte-order issues, etc.
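Here's a minimal sketch of the resynchronization point, in Python 3 (the
string and the dropped octet are arbitrary):

data = bytearray("日本語".encode("utf-8"))
del data[0]                              # simulate losing one octet
# A continuation octet always matches 10xxxxxx, so a decoder just skips
# forward to the next lead octet and drops at most one character.
i = 0
while i < len(data) and 0x80 <= data[i] <= 0xBF:
    i += 1
print(bytes(data[i:]).decode("utf-8"))   # "本語": only one character lost
# Byte order, by contrast, does matter for UTF-16: the same string has two
# serializations (often plus a BOM), while UTF-8 has exactly one.
print("日本語".encode("utf-16-le").hex())
print("日本語".encode("utf-16-be").hex())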
Now, honestly, what portion of your hard disk is taken up by file names?