David Masover
For example, an expression like

  s1 = s2 + s3

where s2 and s3 are both Strings will always work and do the obvious
thing in 1.8, but in 1.9 it may raise an exception. Whether it does
depends not only on the encodings of s2 and s3 at that point, but also
their contents (properties "empty?" and "ascii_only?")
In 1.8, if those strings aren't in the same encoding, it will blindly
concatenate them as binary values, which may result in a corrupt and
nonsensical string.
It seems to me that the obvious thing is to raise an error when there's an
error, instead of silently corrupting your data.
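To make the 1.9 behaviour concrete, here is a minimal sketch, assuming Ruby 1.9 or later. The variable names are made up for illustration. It shows that concatenation goes through when one operand is empty or ASCII-only, and raises when both operands carry non-ASCII content in different encodings:

```ruby
# The same text held in two different encodings.
utf8   = "caf\u00E9"                      # UTF-8 (forced by the \u escape)
latin1 = utf8.encode("ISO-8859-1")        # same characters, different bytes

# An ASCII-only or empty operand is treated as compatible, so these succeed
# and the result takes the encoding of the non-ASCII operand.
ascii_ok = "abc".encode("ISO-8859-1") + utf8
empty_ok = "".encode("ISO-8859-1") + utf8

# Non-ASCII content on both sides, in different encodings, raises.
concat_error = nil
begin
  utf8 + latin1
rescue Encoding::CompatibilityError => e
  concat_error = e
end

puts ascii_ok.encoding          # UTF-8
puts concat_error.class         # Encoding::CompatibilityError
```

This is the content-dependence being complained about, and also the error I would rather see than a silently mangled string.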
This
means the same program with the same data may work on your machine, but
crash on someone else's.
Better, again, than working on my machine, but corrupting on someone else's.
At least if it crashes, hopefully there's a bug report and even a fix _before_
it corrupts someone's data, not after.
From your soapbox.rb:
* Whether or not you can reason about whether your program works, you will
want to test it. 'Unit testing' is generally done by running the code with
some representative inputs, and checking if the output is what you expect.

Again, with 1.8 and the simple line above, this was easy. Give it any two
strings and you will have sufficient test coverage.
Nope. All that proves is that you can get a string back. It says nothing about
whether the resultant string makes sense.
More relevantly:
* It solves a non-problem: how to write a program which can juggle multiple
string segments all in different encodings simultaneously. How many
programs do you write like that? And if you do, can't you just have
a wrapper object which holds the string and its encoding?
Let's see... Pretty much every program, ever, particularly web apps. The
end-user submits something in the encoding of their choice. I may have to
convert it to store it in a database, at the very least. It may make more
sense to store it as whatever encoding it is, in which case, the simple act of
displaying two comments on a website involves exactly this sort of
concatenation.
Or maybe I pull from multiple web services. Something as simple and common as
a "trackback" would again involve concatenating multiple strings from
potentially different encodings.
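As a rough sketch of the two-comments case, assuming Ruby 1.9 or later: the comment strings and their encodings below are invented for illustration. Each string knows its own encoding, so it can be transcoded to a common one just before the pieces are joined:

```ruby
# Hypothetical input: two comments stored in the encodings they arrived in.
comment_a = "na\u00EFve".encode("ISO-8859-1")
comment_b = "\u65E5\u672C\u8A9E".encode("Shift_JIS")

# Transcode each to a common encoding, then join them for display.
page = [comment_a, comment_b].map { |c| c.encode("UTF-8") }.join("\n")

puts page.encoding          # UTF-8
puts page.valid_encoding?   # true
```

Because the encoding travels with the string, the code doing the joining does not need a separate wrapper object to know how to convert each piece.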
* It's pretty much obsolete, given that the whole world is moving to UTF-8
anyway. All a programming language needs is to let you handle UTF-8 and
binary data, and for non-UTF-8 data you can transcode at the boundary.
For stateful encodings you have to do this anyway.
Java at least did this sanely -- UTF16 is at least a fixed width. If you're
going to force a single encoding, why wouldn't you use fixed-width strings?
Oh, that's right -- UTF16 wastes half your RAM when dealing with mostly ASCII
characters. So UTF-8 makes the most sense... in the US.
The whole point of having multiple encodings in the first place is that other
encodings make much more sense when you're not in the US.
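The size trade-off can be checked directly. This is only an illustrative sketch, assuming Ruby 1.9 or later, with sample strings chosen for the purpose; it compares byte counts for a short ASCII string and a short Japanese string:

```ruby
ascii_text    = "hello world"            # 11 ASCII characters
japanese_text = "\u65E5\u672C\u8A9E"     # 3 characters

sizes = {
  ascii_utf8:     ascii_text.bytesize,
  ascii_utf16:    ascii_text.encode("UTF-16LE").bytesize,
  japanese_utf8:  japanese_text.bytesize,
  japanese_utf16: japanese_text.encode("UTF-16LE").bytesize,
  japanese_sjis:  japanese_text.encode("Shift_JIS").bytesize
}

sizes.each { |name, bytes| puts "#{name}: #{bytes} bytes" }
# ascii_utf8: 11, ascii_utf16: 22
# japanese_utf8: 9, japanese_utf16: 6, japanese_sjis: 6
```

For the ASCII sample UTF-8 is half the size of UTF-16; for the Japanese sample it is half as large again as either UTF-16 or Shift_JIS.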
* It's ill-conceived. Knowing the encoding is sufficient to pick characters
out of a string, but other operations (such as collation) depend on the
locale. And in any case, the encoding and/or locale information is often
carried out-of-band (think: HTTP; MIME E-mail; ASN1 tags), or within the
string content (think: <?xml charset?>)
How does any of this help me once I've read the string?
* It's too stateful. If someone passes you a string, and you need to make
it compatible with some other string (e.g. to concatenate it), then you
need to force its encoding.
You only need to do this if the string was in the wrong encoding in the first
place. If I pass you a UTF-16 string, it's not polite at all (whether you dup
it first or not) to just stick your fingers in your ears, go "la la la", and
pretend it's UTF-8 so you can concatenate it. The resultant string will be
neither, and I can't imagine what it'd be useful for.
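Here is a small sketch of the difference, assuming Ruby 1.9 or later; the variable names are mine. `force_encoding` only changes the label on the bytes, while `encode` actually converts them:

```ruby
utf16 = "h\u00E9llo".encode("UTF-16LE")

# Relabelling: the bytes are untouched, they are just claimed to be UTF-8.
relabelled = utf16.dup.force_encoding("UTF-8")

# Transcoding: the bytes are converted to real UTF-8.
transcoded = utf16.encode("UTF-8")

# The relabelled string now concatenates without complaint, but the result
# is not valid UTF-8 (and is no longer UTF-16 either).
glued = relabelled + "!"

puts relabelled.valid_encoding?   # false
puts glued.valid_encoding?        # false
puts transcoded + "!"             # héllo!
```

So forcing the encoding gets rid of the exception without fixing the data, whereas transcoding gives a string that is safe to concatenate.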
You do seem to have some legitimate complaints, but they are somewhat
undermined by the fact that you seem to want to pretend Unicode doesn't exist.
As you noted:
"However I am quite possibly alone in my opinion. Whenever this pops up on
ruby-talk, and I speak out against it, there are two or three others who
speak out equally vociferously in favour. They tell me I am doing the
community a disservice by warning people away from 1.9."
Warning people away from 1.9 entirely, and from character encoding in
particular, because of the problems you've pointed out, does seem incredibly
counterproductive. It'd make a lot more sense to try to fix the real problems
you've identified -- if it really is "buggy as hell", I imagine the ruby-core
people could use your help.