Ruby 1.8 vs 1.9


David Masover

For example, an expression like

s1 = s2 + s3

where s2 and s3 are both Strings will always work and do the obvious
thing in 1.8, but in 1.9 it may raise an exception. Whether it does
depends not only on the encodings of s2 and s3 at that point, but also
their contents (properties "empty?" and "ascii_only?")

In 1.8, if those strings aren't in the same encoding, it will blindly
concatenate them as binary values, which may result in a corrupt and
nonsensical string.
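A minimal sketch of this content-dependence in 1.9-style Ruby (the string values here are illustrative):

```ruby
# Whether concatenation works depends on the operands' contents, not just
# their declared encodings.
utf8  = "caf\u00E9"                              # UTF-8, ascii_only? == false
latin = "caf\xE9".force_encoding("ISO-8859-1")   # same text as ISO-8859-1 bytes
ascii = "cafe".force_encoding("ISO-8859-1")      # ascii_only? == true

# ASCII-only operands are compatible with anything:
puts((utf8 + ascii).encoding)                    # prints "UTF-8"

# Two non-ASCII operands in different encodings raise:
begin
  utf8 + latin
rescue Encoding::CompatibilityError => e
  puts e.class                                   # prints "Encoding::CompatibilityError"
end
```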

It seems to me that the obvious thing is to raise an error when there's an
error, instead of silently corrupting your data.
This
means the same program with the same data may work on your machine, but
crash on someone else's.

Better, again, than working on my machine, but corrupting on someone else's.
At least if it crashes, hopefully there's a bug report and even a fix _before_
it corrupts someone's data, not after.

From your soapbox.rb:

* Whether or not you can reason about whether your program works, you will
want to test it. 'Unit testing' is generally done by running the code with
some representative inputs, and checking if the output is what you expect.

Again, with 1.8 and the simple line above, this was easy. Give it any two
strings and you will have sufficient test coverage.

Nope. All that proves is that you can get a string back. It says nothing about
whether the resultant string makes sense.

More relevantly:

* It solves a non-problem: how to write a program which can juggle multiple
string segments all in different encodings simultaneously. How many
programs do you write like that? And if you do, can't you just have
a wrapper object which holds the string and its encoding?

Let's see... Pretty much every program, ever, particularly web apps. The end-
user submits something in the encoding of their choice. I may have to convert
it to store it in a database, at the very least. It may make more sense to
store it as whatever encoding it is, in which case, the simple act of
displaying two comments on a website involves exactly this sort of
concatenation.

Or maybe I pull from multiple web services. Something as simple and common as
a "trackback" would again involve concatenating multiple strings from
potentially different encodings.

* It's pretty much obsolete, given that the whole world is moving to UTF-8
anyway. All a programming language needs is to let you handle UTF-8 and
binary data, and for non-UTF-8 data you can transcode at the boundary.
For stateful encodings you have to do this anyway.

Java at least did this sanely -- UTF16 is at least a fixed width. If you're
going to force a single encoding, why wouldn't you use fixed-width strings?

Oh, that's right -- UTF16 wastes half your RAM when dealing with mostly ASCII
characters. So UTF-8 makes the most sense... in the US.

The whole point of having multiple encodings in the first place is that other
encodings make much more sense when you're not in the US.

* It's ill-conceived. Knowing the encoding is sufficient to pick characters
out of a string, but other operations (such as collation) depend on the
locale. And in any case, the encoding and/or locale information is often
carried out-of-band (think: HTTP; MIME E-mail; ASN1 tags), or within the
string content (think: <?xml charset?>)

How does any of this help me once I've read the string?

* It's too stateful. If someone passes you a string, and you need to make
it compatible with some other string (e.g. to concatenate it), then you
need to force its encoding.

You only need to do this if the string was in the wrong encoding in the first
place. If I pass you a UTF-16 string, it's not polite at all (whether you dup
it first or not) to just stick your fingers in your ears, go "la la la", and
pretend it's UTF-8 so you can concatenate it. The resultant string will be
neither, and I can't imagine what it'd be useful for.
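That "neither encoding" outcome is easy to demonstrate in modern Ruby, where `force_encoding` merely relabels the bytes without transcoding (a sketch; the sample text is illustrative):

```ruby
# force_encoding only relabels the bytes; it does not transcode them.
utf16      = "caf\u00E9".encode("UTF-16LE")
mislabeled = utf16.dup.force_encoding("UTF-8")

puts mislabeled.valid_encoding?   # prints "false" -- the bytes are now neither encoding
puts utf16.encode("UTF-8")        # prints "café"  -- the polite way: actually transcode
```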

You do seem to have some legitimate complaints, but they are somewhat
undermined by the fact that you seem to want to pretend Unicode doesn't exist.
As you noted:

"However I am quite possibly alone in my opinion. Whenever this pops up on
ruby-talk, and I speak out against it, there are two or three others who
speak out equally vociferously in favour. They tell me I am doing the
community a disservice by warning people away from 1.9."

Warning people away from 1.9 entirely, and from character encoding in
particular, because of the problems you've pointed out, does seem incredibly
counterproductive. It'd make a lot more sense to try to fix the real problems
you've identified -- if it really is "buggy as hell", I imagine the ruby-core
people could use your help.
 

Josh Cheek


On Wed, Nov 24, 2010 at 12:20 PM, Phillip Gawlowski wrote:
You conveniently left out that Ruby thinks dividing by 0.0 results in
infinity.
That's not just wrong, but absurd to the extreme.


Its wrongness is an interpretation (I would also prefer that it just break,
but I can certainly see why some would say it should be infinity). And it
doesn't apply only to Ruby:

Java:
public class Infinity {
    public static void main(String[] args) {
        System.out.println(1.0/0.0); // prints "Infinity"
    }
}

JavaScript:
document.write(1.0/0.0) // prints "Infinity"

C:
#include <stdio.h>
int main() {
    printf("%f\n", 1.0/0.0); // prints "inf"
    return 0;
}
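And for completeness, Ruby itself behaves the same way for floats while treating integer division differently:

```ruby
puts 1.0 / 0.0          # prints "Infinity" -- IEEE 754, like the examples above

begin
  1 / 0                 # integer division, by contrast, is checked
rescue ZeroDivisionError => e
  puts e.message        # prints "divided by 0"
end
```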
 

Phillip Gawlowski

Its wrongness is an interpretation (I would also prefer that it just break,
but I can certainly see why some would say it should be infinity). And it
doesn't apply only to Ruby:

It cannot be infinity. It does, quite literally not compute. There's
no room for interpretation, it's a fact of (mathematical) life that
something divided by nothing has an undefined result. It doesn't
matter if it's 0, 0.0, or -0.0. Undefined is undefined.

That other languages have the same issue makes matters worse, not
better (but at least it is consistent, so there's that).

--
Phillip Gawlowski

Though the folk I have met,
(Ah, how soon!) they forget
When I've moved on to some other place,
There may be one or two,
When I've played and passed through,
Who'll remember my song or my face.
 

Josh Cheek

On Wed, Nov 24, 2010 at 1:16 PM, Phillip Gawlowski wrote:
It cannot be infinity. It does, quite literally not compute. There's
no room for interpretation, it's a fact of (mathematical) life that
something divided by nothing has an undefined result. It doesn't
matter if it's 0, 0.0, or -0.0. Undefined is undefined.
From my Calculus book (goo.gl/D7PoI)

"by observing from the table of values and the graph of y = 1/x² in Figure
1, that the values of 1/x² can be made arbitrarily large by taking x close
enough to 0. Thus the values of f(x) do not approach a number, so lim_(x->0)
1/x² does not exist. To indicate this kind of behaviour we use the
notation lim_(x->0) 1/x² = ∞"

Since floats define infinity, regardless of its not being a number, it is
not "absurd to the extreme" to result in that value when doing floating
point math.


That other languages have the same issue makes matters worse, not
better (but at least it is consistent, so there's that).
The question was "Is there anything in the above which applies only to Ruby
and not to floating point computation in any other mainstream
programming language?" The answer isn't "other languages have the same
issue", it's "no".
 

Yuri Tzara

Phillip Gawlowski wrote in post #963658:
It cannot be infinity. It does, quite literally not compute. There's
no room for interpretation, it's a fact of (mathematical) life that
something divided by nothing has an undefined result. It doesn't
matter if it's 0, 0.0, or -0.0. Undefined is undefined.

It is perfectly reasonable, mathematically, to assign infinity to 1/0.
To geometers and topologists, infinity is just another point. Look up
the one-point compactification of R^n. If we join infinity to the real
line, we get a circle, topologically. Joining infinity to the real plane
gives a sphere, called the Riemann sphere. These are rigorous
definitions with useful results.

I'm glad that IEEE floating point has infinity included, otherwise I
would run into needless error handling. It's not an error to reach one
pole of a sphere (the other pole being zero).

Infinity is there for good reason; its presence was well-considered by
the quite knowledgeable IEEE designers.
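The "needless error handling" point can be illustrated in Ruby: IEEE infinity participates in ordinary float arithmetic, comparison, and sorting without special cases (a small sketch):

```ruby
inf = Float::INFINITY

puts 1.0 / inf                            # prints "0.0" -- back from the pole to zero
puts inf > 1.0e308                        # prints "true" -- compares sanely
puts [3.0, inf, -inf, 0.0].sort.inspect   # prints "[-Infinity, 0.0, 3.0, Infinity]"
```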
 

David Masover

On Wed, Nov 24, 2010 at 1:16 PM, Phillip Gawlowski wrote:

From my Calculus book (goo.gl/D7PoI)

"by observing from the table of values and the graph of y = 1/x² in
Figure 1, that the values of 1/x² can be made arbitrarily large by
taking x close enough to 0. Thus the values of f(x) do not approach a
number, so lim_(x->0) 1/x² does not exist. To indicate this kind of
behaviour we use the notation lim_(x->0) 1/x² = ∞"

Specifically, the _limit_ is denoted as infinity, which is not a real number.
Since floats define infinity, regardless of its not being a number, it is
not "absurd to the extreme" to result in that value when doing floating
point math.

Ah, but it is, for two reasons:

First, floats represent real numbers. Having exceptions to that, like NaN or
Infinity, is pointless and confusing -- it would be like making nil an
integer. And having float math produce something which isn't a float doesn't
really make sense.

Second, 1/0 is just undefined, not infinity. It's the _limit_ of 1/x as x goes
to 0 which is infinity. This only has meaning in the context of limits,
because limits are just describing behavior -- all the limit says is that as x
gets arbitrarily close to 0, 1/x gets arbitrarily large, but you still can't
_actually_ divide x by 0.

They didn't teach me that in Calculus, they're teaching me that in proofs.

The question was "Is there anything in the above which applies only to Ruby
and not to floating point computation in any other mainstream
programming language?" the answer isn't "other languages have the same
issue", it's "no".

I don't know that there's anything in the above that applies only to Ruby.
However, Ruby does a number of things differently, and arguably better, than
other languages -- for example, Ruby's integer types transmute into Bignum
rather than overflowing.
 

Adam Ms.

Phillip Gawlowski wrote in post #963658:
It cannot be infinity. It does, quite literally not compute. There's
no room for interpretation, it's a fact of (mathematical) life that
something divided by nothing has an undefined result. It doesn't
matter if it's 0, 0.0, or -0.0. Undefined is undefined.

That other languages have the same issue makes matters worse, not
better (but at least it is consistent, so there's that).

This is not even wrong.

From the definitive source:
http://en.wikipedia.org/wiki/Division_by_zero

The IEEE floating-point standard, supported by almost all modern
floating-point units, specifies that every floating point arithmetic
operation, including division by zero, has a well-defined result. The
standard supports signed zero, as well as infinity and NaN (not a
number). There are two zeroes, +0 (positive zero) and −0 (negative zero)
and this removes any ambiguity when dividing. In IEEE 754 arithmetic, a
÷ +0 is positive infinity when a is positive, negative infinity when a
is negative, and NaN when a = ±0. The infinity signs change when
dividing by −0 instead.
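Those signed-zero rules are observable directly in Ruby:

```ruby
puts 1.0 /  0.0        # prints "Infinity"
puts 1.0 / -0.0        # prints "-Infinity" -- the sign of zero matters
puts 0.0 /  0.0        # prints "NaN"
```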

 

Jörg W Mittag

David said:
Java at least did this sanely -- UTF16 is at least a fixed width. If you're
going to force a single encoding, why wouldn't you use fixed-width strings?

Actually, it's not. It's simply mathematically impossible, given that
there are more than 65536 Unicode codepoints. AFAIK, you need (at the
moment) at least 21 bits to represent all Unicode codepoints. UTF-16
is *not* fixed-width, it encodes every Unicode codepoint as either one
or two UTF-16 "characters", just like UTF-8 encodes every Unicode
codepoint as 1, 2, 3 or 4 octets.
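This is easy to check from modern Ruby (assuming an interpreter with the usual transcoders):

```ruby
# UTF-16 is variable-width: code points above U+FFFF need a surrogate pair.
bmp    = "A"           # U+0041, inside the Basic Multilingual Plane
astral = "\u{1F600}"   # U+1F600, outside the BMP

puts bmp.encode("UTF-16BE").bytesize      # prints "2" (one 16-bit unit)
puts astral.encode("UTF-16BE").bytesize   # prints "4" (two 16-bit units)
```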

The only two Unicode encodings that are fixed-width are the obsolete
UCS-2 (which can only encode the lower 65536 codepoints) and UTF-32.

You can produce corrupt strings and slice into a half-character in
Java just as you can in Ruby 1.8.
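The same corruption-by-slicing hazard can be reproduced in modern Ruby by slicing at the byte level (a sketch):

```ruby
s = "\u{1F600}"               # one character, four octets in UTF-8
puts s.bytesize               # prints "4"

half = s.byteslice(0, 2)      # cut in the middle of the character
puts half.valid_encoding?     # prints "false" -- a corrupt half-character
```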
Oh, that's right -- UTF16 wastes half your RAM when dealing with mostly ASCII
characters. So UTF-8 makes the most sense... in the US.

Of course, that problem is even more pronounced with UTF-32.

German text blows up about 5%-10% when encoded in UTF-8 instead of
ISO8859-15. Arabic, Persian, Indian, Asian text (which is, after all,
much more than European) is much worse. (E.g. Chinese blows up *at
least* 50% when encoding UTF-8 instead of Big5 or GB2312.) Given that
the current tendency is that devices actually get *smaller*, bandwidth
gets *lower* and latency gets *higher*, that's simply not a price
everybody is willing to pay.
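The blow-up is easy to measure; here with Big5, which the post mentions (assumes a Ruby build with the Big5 transcoder):

```ruby
zh = "\u4E2D\u6587"              # "中文"

puts zh.bytesize                 # prints "6" -- three octets per character in UTF-8
puts zh.encode("Big5").bytesize  # prints "4" -- two octets per character in Big5
```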
The whole point of having multiple encodings in the first place is that other
encodings make much more sense when you're not in the US.

There's also a lot of legacy data, even within the US. On IBM systems,
the standard encoding, even for greenfield systems that are being
written right now, is still pretty much EBCDIC all the way.

There simply does not exist a single encoding which would be
appropriate for every case, not even the majority of cases. In fact,
I'm not even sure that there is even a single encoding which is
appropriate for a significant minority of cases.

We tried that One Encoding To Rule Them All in Java, and it was a
failure. We tried it again with a different encoding in Java 5, and it
was a failure. We tried it in .NET, and it was a failure. The Python
community is currently in the process of realizing it was a failure. 5
years of work on PHP 6 were completely destroyed because of this. (At
least they realized it *before* releasing it into the wild.)

And now there's a push for a One Encoding To Rule Them All in Ruby 2.
That's *literally* insane! (One definition of insanity is repeating
behavior and expecting a different outcome.)

jwm
 

James Edward Gray II

The only two Unicode encodings that are fixed-width are the obsolete
UCS-2 (which can only encode the lower 65536 codepoints) and UTF-32.

And even UTF-32 would have the complications of "combining characters."
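A small illustration of that point in modern Ruby: one user-visible character can be several code points, so fixed code-point width still doesn't mean fixed character width:

```ruby
precomposed = "\u00E9"    # "é" as a single code point
combining   = "e\u0301"   # "e" followed by a combining acute accent

puts precomposed == combining                          # prints "false"
puts combining.unicode_normalize(:nfc) == precomposed  # prints "true"
```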

James Edward Gray II
 

Robert Klemme

Phillip Gawlowski wrote in post #963602:

This may be true for the western world but I believe I remember one of
our Japanese friends stating that Unicode does not cover all Asian
character sets completely; it could have been a remark about Java's
implementation of Unicode though, I am not 100% sure.
But that basically is my point. In order to make your program
comprehensible, you have to add extra incantations so that strings are
tagged as UTF-8 everywhere (e.g. when opening files).

However this in turn adds *nothing* to your program or its logic, apart
from preventing Ruby from raising exceptions.

Checking input and ensuring that data reaches the program in proper
ways is generally good practice for robust software. IMHO dealing
explicitly with encodings falls into the same area as checking whether
an integer entered by a user is strictly positive or a string is not
empty.
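In 1.9+ that checking can be done where the data enters, e.g. by declaring encodings when opening a file (a sketch using a temp file for self-containment):

```ruby
require "tempfile"

# Declare the external encoding at the boundary; Ruby transcodes on read.
Tempfile.create("demo") do |tmp|
  tmp.binmode
  tmp.write("caf\xE9".b)      # "café" stored as ISO-8859-1 bytes
  tmp.flush

  text = File.read(tmp.path, mode: "r:ISO-8859-1:UTF-8")
  puts text.encoding          # prints "UTF-8"
  puts text                   # prints "café"
end
```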

And I don't think you have to do it for one off scripts or when
working in your local environment only. So there is no effort
involved.

Brian, it seems you want to avoid the complex matter of i18n - by
ignoring it. But if you work in a situation where multiple encodings
are mixed you will be forced to deal with it - sooner or later. With
1.9 you get proper feedback while 1.8 may simply stop working at some
point - and you may not even notice it quickly enough to avoid damage.

Kind regards

robert
 

Phillip Gawlowski

This is not even wrong.

From the definitive source:
http://en.wikipedia.org/wiki/Division_by_zero

For certain values of "definitive", anyway.
The IEEE floating-point standard, supported by almost all modern
floating-point units, specifies that every floating point arithmetic
operation, including division by zero, has a well-defined result. The
standard supports signed zero, as well as infinity and NaN (not a
number). There are two zeroes, +0 (positive zero) and -0 (negative zero)
and this removes any ambiguity when dividing. In IEEE 754 arithmetic, a
÷ +0 is positive infinity when a is positive, negative infinity when a
is negative, and NaN when a = ±0. The infinity signs change when
dividing by -0 instead.

Yes, the IEEE 754 standard defines it that way.

The IEEE standard, however, does *not* define how mathematics work.
Mathematics does that. In math, x_0/0 is *undefined*. It is not
infinity (David kindly explained the difference between limits and
numbers), it is not negative infinity, it is undefined. Division by
zero *cannot* happen. If it could, we would be able to build, for
example, perpetual motion machines.

So, from a purely mathematical standpoint, the IEEE 754 standard is
wrong by treating the result of division by 0.0 any different than
dividing by 0 (since floats are only different in their nature to
*computers* representing everything in binary [which cannot represent
floating point numbers at all, much less any given irrational
number]).


--
Phillip Gawlowski

 

Phillip Gawlowski

This may be true for the western world but I believe I remember one of
our Japanese friends stating that Unicode does not cover all Asian
character sets completely; it could have been a remark about Java's
implementation of Unicode though, I am not 100% sure.

Since UTF-8 is a subset of UTF-16, which in turn is a subset of
UTF-32, and Unicode is future-proofed (at least, ISO learned from the
mess created in the 1950s to 1960s) so that new glyphs won't ever
collide with existing glyphs, my point still stands. ;)

--
Phillip Gawlowski

 

Robert Klemme

Since UTF-8 is a subset of UTF-16, which in turn is a subset of
UTF-32,

I tried to find more precise statement about this but did not really
succeed. I thought all UTF-x were just different encoding forms of
the same universe of code points.
and Unicode is future-proofed

Oh, so then ISO committee actually has a time machine? Wow! ;-)
(at least, ISO learned from the
mess created in the 1950s to 1960s) so that new glyphs won't ever
collide with existing glyphs, my point still stands. ;)

Well, I support your point anyway. That was just meant as a caveat so
people are watchful (and test rather than believe). :) But as I
think about it, it more likely was a statement about Java's
implementation (because a char has only 16 bits which is not
sufficient for all Unicode code points).

Kind regards

robert
 

Phillip Gawlowski

I tried to find a more precise statement about this but did not really
succeed. I thought all UTF-x were just different encoding forms of
the same universe of code points.

It's an implicit feature, rather than an explicit one:
Western languages get the first 8 bits for encoding. Glyphs going
beyond the Latin alphabet get the next 8 bits. If that isn't enough, an
additional 16 bits are used for encoding purposes.

Thus, UTF-8 is a subset of UTF-16, which is a subset of UTF-32. Thus, also,
the future-proofing, in case even more glyphs are needed.

Well, I support your point anyway. That was just meant as a caveat so
people are watchful (and test rather than believe). :) But as I
think about it, it more likely was a statement about Java's
implementation (because a char has only 16 bits which is not
sufficient for all Unicode code points).

Of course, test your assumptions. But first, you need an assumption to
start from. ;)

--
Phillip Gawlowski

 

Robert Klemme

It's an implicit feature, rather than an explicit one:
Western languages get the first 8 bits for encoding. Glyphs going
beyond the Latin alphabet get the next 8 bits. If that isn't enough, an
additional 16 bits are used for encoding purposes.

What bits are you talking about here, bits of code points or bits in
the encoding? It seems you are talking about bits of code points.
However, how these are put into any UTF-x encoding is a different
story and also because UTF-8 uses multibyte sequences it's not
immediately clear whether UTF-8 can only hold a subset of what UTF-16
can hold.
Thus, UTF-8 is a subset of UTF-16, which is a subset of UTF-32. Thus, also,
the future-proofing, in case even more glyphs are needed.

Quoting from http://tools.ietf.org/html/rfc3629#section-3

Char. number range   |        UTF-8 octet sequence
   (hexadecimal)     |              (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

So we have for code point encoding

7 bits
6 + 5 = 11 bits
2 * 6 + 4 = 16 bits
3 * 6 + 3 = 21 bits

This makes 2164864 (0x210880) possible code points in UTF-8. And the
pattern can be extended.
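The arithmetic can be reproduced mechanically (same numbers as above):

```ruby
widths = [7, 11, 16, 21]               # payload bits of the four UTF-8 forms
total  = widths.map { |b| 2**b }.sum   # summing each form's code space

puts total                   # prints "2164864"
printf("0x%X\n", total)      # prints "0x210880"
```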

Looking at http://tools.ietf.org/html/rfc2781#section-2.1 we see that
UTF-16 (at least this version) supports code points up to 0x10FFFF.
This is less than what UTF-8 can hold theoretically.

Coincidentally 0x10FFFF has 21 bits which is what fits into UTF-8.

I stay unconvinced that UTF-8 can only handle a subset of the code points
that UTF-16 can handle.

I also stay unconvinced that UTF-8 encodings are a subset of UTF-16
encodings. This cannot be true because in UTF-8 the encoding unit is
one octet, while in UTF-16 it's two octets. As a practical example
the sequence "a" will have length 1 octet in UTF-8 (because it happens
to be an ASCII character) and length 2 octets in UTF-16.
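That octet-count example is easy to confirm:

```ruby
puts "a".encode("UTF-8").bytesize     # prints "1"
puts "a".encode("UTF-16BE").bytesize  # prints "2"
```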

"All standard UCS encoding forms except UTF-8 have an encoding unit
larger than one octet, [...]"
http://tools.ietf.org/html/rfc3629#section-1
Of course, test your assumptions. But first, you need an assumption to
start from. ;)

:)

Cheers

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 

James Edward Gray II

Thus, UTF-8 is a subset of UTF-16, which is a subset of UTF-32. Thus, also,
the future-proofing, in case even more glyphs are needed.

You are confusing us.

UTF-8, UTF-16, and UTF-32 are encodings of Unicode code points. They
are all capable of representing all code points. Nothing in this
discussion is a subset of anything else.

James Edward Gray II
 

James Edward Gray II

But as I think about it it more likely was a statement about Java's
implementation (because a char has only 16 bits which is not
sufficient for all Unicode code points).

I believe you are referring to the complaints the Asian cultures
sometimes raise against Unicode. If so, I'll try to recap the issues,
as I understand them.

First, Unicode is a bit larger than their native encodings. Typically
they get everything they need into two bytes where Unicode requires more
for their languages.

The Unicode team also made some controversial decisions that affected
the Asian languages, like Han Unification
(http://en.wikipedia.org/wiki/Han_unification).

Finally, they have a lot of legacy data in their native encodings and
perfect conversion is sometimes tricky due to some context sensitive
issues.

I think the Asian cultures have warmed a bit to Unicode over time (my
opinion only), but it's important to remember that adopting it involved
more challenges for them.

James Edward Gray II
 

Robert Klemme

I believe you are referring to the complaints the Asian cultures
sometimes raise against Unicode. If so, I'll try to recap the issues,
as I understand them.
First, Unicode is a bit larger than their native encodings. Typically
they get everything they need into two bytes where Unicode requires
more for their languages.
The Unicode team also made some controversial decisions that affected
the Asian languages, like Han Unification
(http://en.wikipedia.org/wiki/Han_unification).
Finally, they have a lot of legacy data in their native encodings and
perfect conversion is sometimes tricky due to some context sensitive
issues.

James, thanks for the summary. It is much appreciated.
I think the Asian cultures have warmed a bit to Unicode over time (my
opinion only), but it's important to remember that adopting it involved
more challenges for them.

I believe that is in part due to our western ignorance. If we dealt
with encodings properly we would probably feel a similar pain - at
least it would cause more pain for us. I have frequently seen i18n
aspects being ignored (my pet peeve is time zones). Usually this
breaks your neck as soon as people from other cultures start using
your application - or such simple things happen as a change of a
database server's timezone which then differs from the application
server's. :)

Kind regards

robert

 

Philip Rhoades

James,


You are confusing us.

UTF-8, UTF-16, and UTF-32 are encodings of Unicode code points. They
are all capable of representing all code points. Nothing in this
discussion is a subset of anything else.


This is all really interesting but I don't understand what you mean by
"code points" - is what you have said expressed diagrammatically somewhere?

Thanks,

Phil.
--
Philip Rhoades

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: (e-mail address removed)
 