Marc said:
> But it has to be. There is no automagic guessing possible.
Automagic guessing isn't necessary if strings keep track of what encoding
their data is in. And why shouldn't they? We're a long way from the day
when a "string" was nothing more than an array of bytes. Adding a teeny
bit of metadata makes life much easier.
> IMHO a strange design decision.
I get that you don't grok it, but I think that's because you haven't
worked with it. RB added encoding data to its strings years ago, and
changed the default string encoding to UTF-8 at about the same time, and
life has been delightful since then. The only time you ever have to
think about it is when you're importing a string from some unknown
source (e.g. a socket), at which point you need to tell RB what encoding
it is. From that point on, you can pass that string around, extract
substrings, split it into words, concatenate it with other strings,
etc., and it all Just Works (tm).
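To make the idea concrete, here is a minimal Python sketch of a string that carries its own encoding tag. This is not how RB implements it; the `TaggedString` class and its `text` and `words` methods are hypothetical names invented for illustration, and a real implementation would avoid the decode/encode round trip.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedString:
    """Hypothetical string type: raw bytes plus the name of their encoding."""
    data: bytes
    encoding: str

    def text(self) -> str:
        # Decode only when the characters are actually needed.
        return self.data.decode(self.encoding)

    def words(self) -> list:
        # Split via the decoded text, then re-tag each piece with the same
        # encoding so the metadata travels with every substring.
        return [TaggedString(w.encode(self.encoding), self.encoding)
                for w in self.text().split()]


# Bytes arrive from an unknown source (e.g. a socket); tag them once.
raw = "naïve café".encode("utf-8")
s = TaggedString(raw, "utf-8")
first, second = s.words()
```

After the single tagging step, `first` and `second` still know they are UTF-8, so nothing downstream has to be told again.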
In comparison, Python requires a lot more thought on the part of the
programmer to keep track of what's what (unless, as you point out, you
convert everything into unicode strings as soon as you get them, but
that can be a very expensive operation to do on, say, a 500MB UTF-8 text
file).
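For the large-file case, one way to soften the cost in Python is to decode in chunks rather than all at once. The sketch below uses the standard `codecs.getincrementaldecoder`; the `iter_decoded` helper and its `chunk_size` parameter are assumptions made for illustration. It bounds memory use but does not make the decoding work itself go away.

```python
import codecs
import io


def iter_decoded(stream, encoding="utf-8", chunk_size=64 * 1024):
    """Yield decoded text from a binary stream one chunk at a time."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # The decoder buffers a multi-byte character that straddles
        # a chunk boundary and emits it with the next chunk.
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush; raises UnicodeDecodeError if the input ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# A tiny chunk size forces splits in the middle of multi-byte characters.
payload = "héllo wörld".encode("utf-8")
pieces = list(iter_decoded(io.BytesIO(payload), chunk_size=2))
```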
> A lot more hassle compared to an opaque
> unicode string type which uses some internal encoding that makes
> operations like getting a character at a given index easy or
> concatenating without the need to reencode.
No. RB supports UCS-2 encoding, too, and is smart enough to take
advantage of the fixed character width of any encoding when that's what
a string happens to be. And no reencoding is used when it's not
necessary (e.g., concatenating two strings of the same encoding, or
adding an ASCII string to a string using any ASCII superset, such as
UTF-8). There's nothing stopping you from converting all your strings
to UCS-2 when you get them, if that's your preference.
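The two shortcuts described above can be sketched in Python as follows. This is an illustration, not RB's actual code: `concat`, `char_at_fixed_width`, and the `ASCII_SUPERSETS` set are hypothetical names, and since Python has no "ucs-2" codec, "utf-16-le" stands in for it on the assumption that the text contains only BMP characters.

```python
ASCII_SUPERSETS = {"ascii", "utf-8", "latin-1"}  # illustrative, not exhaustive


def concat(a, a_enc, b, b_enc):
    """Join two encoded byte strings, transcoding only when required."""
    if a_enc == b_enc:
        return a + b, a_enc  # same encoding: plain byte concatenation
    if b_enc == "ascii" and a_enc in ASCII_SUPERSETS:
        return a + b, a_enc  # ASCII bytes are already valid in a_enc
    if a_enc == "ascii" and b_enc in ASCII_SUPERSETS:
        return a + b, b_enc  # the result simply takes the wider encoding
    # Otherwise transcode the right operand into the left one's encoding.
    return a + b.decode(b_enc).encode(a_enc), a_enc


def char_at_fixed_width(data, index, width=2, encoding="utf-16-le"):
    """Index a character directly when every character is `width` bytes."""
    start = index * width
    return data[start:start + width].decode(encoding)


joined, enc = concat("é".encode("utf-8"), "utf-8", b"abc", "ascii")
mixed, mixed_enc = concat("é".encode("utf-8"), "utf-8",
                          "ñ".encode("latin-1"), "latin-1")
third = char_at_fixed_width("naïve".encode("utf-16-le"), 2)
```

Only the mixed UTF-8/Latin-1 case pays for a transcode; the other paths are plain byte operations.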
But saying that having only one string type that knows it's Unicode, and
another string type that hasn't the foggiest clue how to interpret its
data as text, is somehow easier than every string knowing what it is and
doing the right thing -- well, that's just silly.
Best,
- Joe