Strange behaviour of Strings in Range

Michael Neumann · May 1, 2004

Hi,

r1 = ("\000" .. "\377") # all characters?

r1.to_a
# => ..... "6", "7", "8", "9"]

r1.to_a.size
# => 58

Hm, I guess this is because "9".succ gives "10", and "10" has a size
of two.

But why does "9".succ result in "10"?
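
For reference, the carry rule behind this can be sketched as follows. String#succ increments the rightmost alphanumeric character and carries to the left like an odometer, growing the string when the leftmost alphanumeric overflows; only when a string contains no alphanumerics is the last character simply bumped to the next value. The sample strings are arbitrary illustrations.

```ruby
# String#succ treats digits and letters like odometer wheels.
examples = {
  "9"  => "9".succ,   # digit overflow adds a new leading digit
  "az" => "az".succ,  # 'z' wraps to 'a' and carries into the first letter
  "zz" => "zz".succ,  # carry out of the leftmost letter grows the string
  "a9" => "a9".succ,  # digit wraps, carry goes into the letter
  "/"  => "/".succ    # no alphanumerics: just the next character value
}

examples.each { |from, to| puts "#{from.inspect}.succ => #{to.inspect}" }
```

Because "10" is longer than the range's end point, iteration over the string range stops right after "9".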

Regards,

Michael

Robert Klemme · May 1, 2004

Michael Neumann said:
Hi,

r1 = ("\000" .. "\377") # all characters?

r1.to_a
# => ..... "6", "7", "8", "9"]

r1.to_a.size
# => 58

Hm, I guess this is because "9".succ gives "10", and "10" has a size
of two.

But why does "9".succ result in "10"?

IMHO this is a Perlism, so that you can count with strings:

irb(main):010:0> ("0".."20").to_a
=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12",
"13", "14", "15", "16", "17", "18", "19", "20"]

robert

Yukihiro Matsumoto · May 2, 2004

Hi,

In message "Strange behaviour of Strings in Range"

| r1 = ("\000" .. "\377") # all characters?
|
| r1.to_a
| # => ..... "6", "7", "8", "9"]
|
| r1.to_a.size
| # => 58
|
|Hm, I guess this is because "9".succ gives "10", and "10" has a size
|of two.
|
|But why does "9".succ result in "10"?

It's caused by "succ" magic. Let me think about either subtracting
magic, or adding more magic.

matz.

Hal Fulton · May 2, 2004

Yukihiro said:
Hi,

In message "Strange behaviour of Strings in Range"

| r1 = ("\000" .. "\377") # all characters?
|
| r1.to_a
| # => ..... "6", "7", "8", "9"]
|
| r1.to_a.size
| # => 58
|
|Hm, I guess this is because "9".succ gives "10", and "10" has a size
|of two.
|
|But why does "9".succ result in "10"?

It's caused by "succ" magic. Let me think about either subtracting
magic, or adding more magic.

"9" is not really a character anyway, but a string consisting of
one character.

In current Ruby, 0..0377 would work, since a character is essentially
a Fixnum.
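
A minimal sketch of that integer-range approach: iterate over the byte values and convert each one with Integer#chr, so String#succ never gets involved. It is written so that it also runs on Ruby 1.9 and later, where characters are no longer Fixnums; the variable name is just for illustration.

```ruby
# Build every one-byte string from its numeric value.
all_bytes = (0..0377).map { |code| code.chr }  # 0377 is octal for 255

puts all_bytes.size                       # how many one-byte strings we got
puts all_bytes["9".ord].inspect           # the entry for the digit nine
puts all_bytes["9".ord + 1].inspect       # its neighbour is the next byte, not "10"
```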

Will Rite have a better-defined notion of "character"? Perhaps including
Unicode and such?

Hal

Yukihiro Matsumoto · May 3, 2004

Hi,

In message "Re: Strange behaviour of Strings in Range"

|Will Rite have a better-defined notion of "character"? Perhaps including
|Unicode and such?

No. The definition of "character" should belong to the application
domain, I believe. Considering internationalization, any particular
definition of character can not satisfy all.

matz.

Hal Fulton · May 3, 2004

Yukihiro said:
Hi,

In message "Re: Strange behaviour of Strings in Range"

|Will Rite have a better-defined notion of "character"? Perhaps including
|Unicode and such?

No. The definition of "character" should belong to the application
domain, I believe. Considering internationalization, any particular
definition of character can not satisfy all.

These are two or three separate issues, I believe.

I know that no one encoding scheme will suffice for Asian languages
as well as European. Unicode in that sense is largely a dream, as I
understand it.

And I do not favor a Char class, which seems unnecessary to me.

But here are some related questions, to get more specific:

1. Will str[0] always be a Fixnum?

2. Will ?x always be a Fixnum?

3. In addition to each_byte, would each_char make sense? As I see it,
it would default to be the same as each_byte, but would be replaced
for a wide-char or multibyte variable-length encoding.

But I18N is one of the areas of my greatest ignorance in Ruby.

Thanks,
Hal

Yukihiro Matsumoto · May 3, 2004

Hi,

In message "Re: Strange behaviour of Strings in Range"

|But here are some related questions, to get more specific:
|
|1. Will str[0] always be a Fixnum?

Rite gives a 1-char string for str[0].

|2. Will ?x always be a Fixnum?

It will be a 1-char string.

|3. In addition to each_byte, would each_char make sense? As I see it,
|it would default to be the same as each_byte, but would be replaced
|for a wide-char or multibyte variable-length encoding.

It makes sense, but I've not yet decided whether to add it.
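
For readers on Ruby 1.9 or later: those versions behave the way these answers describe, and each_char exists alongside each_byte. A small sketch of the three points follows. The sample string is an arbitrary illustration, written with a \u escape so that it does not depend on the source file's encoding.

```ruby
s = "h\u00e9llo"    # "hello" with an e-acute; that letter takes two bytes in UTF-8

first  = s[0]       # a one-character String, not a Fixnum
letter = ?x         # the ?x literal is also a one-character String
code   = s[0].ord   # the old integer value is still available via ord

chars = s.each_char.to_a   # iterates character by character
bytes = s.each_byte.to_a   # iterates byte by byte

puts first.inspect, letter.inspect, code
puts "#{chars.size} chars, #{bytes.size} bytes"
```

For a pure ASCII string the two iterations yield the same number of items; they only diverge for multibyte encodings.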

matz.

Robert Klemme · May 3, 2004

Yukihiro Matsumoto said:
Hi,

In message "Re: Strange behaviour of Strings in Range"

|Will Rite have a better-defined notion of "character"? Perhaps including
|Unicode and such?

No. The definition of "character" should belong to the application
domain, I believe. Considering internationalization, any particular
definition of character can not satisfy all.

So then what's Unicode for in the first place? I thought the aim was to
have a universal encoding for all chars. Did I miss something?

IMHO Ruby as it is today determines the notion of "character" by the way
strings and regexps are handled and thus a char is a byte. IMHO
characters are so basic that you can't delegate that to the application
domain. You can delegate transformations but not having an internal
standard representation strikes me as difficult.

Maybe I'm overlooking something, if so, please let me know.

Regards

robert

gabriele renzi · May 3, 2004

So then what's Unicode for in the first place? I thought the aim was to
have a universal encoding for all chars. Did I miss something?

I found this article really interesting, maybe it can help you too.
http://www.joelonsoftware.com/articles/Unicode.html

Robert Klemme · May 3, 2004

gabriele renzi said:
I found this article really interesting, maybe it can help you too.
http://www.joelonsoftware.com/articles/Unicode.html

Nicely written but nothing I didn't know already. Still the question
remains what Ruby does about handling mixed content internally. IMHO the
most efficient way is to store code points internally. An alternative
would be to store a raw binary stream together with its encoding but that
would make comparisons (which happen all the time, just think of hash
lookups) slow for strings with different encodings.
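
To make the trade-off concrete, here is a small sketch of the "raw bytes plus encoding tag" alternative. It uses String#encoding and String#encode, which only arrived in Ruby 1.9 and did not exist at the time of this thread. The sample character and the choice of EUC-JP are assumptions for illustration.

```ruby
utf8 = "\u3042"                # HIRAGANA LETTER A, stored as UTF-8 bytes
euc  = utf8.encode("EUC-JP")   # same character, different bytes and tag

puts utf8.encoding, utf8.bytes.inspect
puts euc.encoding,  euc.bytes.inspect

# Comparison looks at the tagged bytes, so the two strings are not equal
# until one of them is transcoded to the other's encoding.
same_raw        = (utf8 == euc)
same_transcoded = (utf8 == euc.encode("UTF-8"))
puts same_raw, same_transcoded
```

The extra transcoding step before a meaningful comparison is the cost described above; storing code points internally would avoid it at the price of converting on every input and output.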

IMHO the Java approach* (although it burns memory by using 16 bits per char)
is the most practical among current programming languages. And I wouldn't
mind Ruby borrowing that - especially when considering attempts to use
Java bytecode and a JVM as runtime system.

Regards

robert

* Characters are stored internally with 16 bits, thus allowing a lot
(although not all) of the Unicode code points to be representable. Input
and output always use an encoding (either an explicit one or, implicitly,
the platform's default encoding). There's built-in support for a number of
well-known encodings, including UTF-8, UTF-16, ISO-8859-1 etc.
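
The "not all" caveat can be illustrated with UTF-16, the form in which a 16-bit unit such as Java's char is used: code points above U+FFFF need two 16-bit units (a surrogate pair). The sketch below is in Ruby (1.9 or later) rather than Java, and the two sample characters are arbitrary choices.

```ruby
bmp    = "\u3042"      # U+3042, fits in a single 16-bit unit
astral = "\u{1D11E}"   # U+1D11E MUSICAL SYMBOL G CLEF, does not fit

[bmp, astral].each do |ch|
  units = ch.encode("UTF-16BE").bytesize / 2   # number of 16-bit code units
  puts format("U+%04X: %d bits for the code point, %d UTF-16 unit(s)",
              ch.ord, ch.ord.bit_length, units)
end
```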

Yukihiro Matsumoto · May 3, 2004

Hi,

In message "Re: Strange behaviour of Strings in Range"

|So then what's Unicode for in the first place? I thought the aim was to
|have a universal encoding for all chars. Did I miss something?

It's _their_ intention. Whether it succeeds or not is another story.
I think they tried their best, but it is virtually impossible to
satisfy all requirements for internationalization.

matz.

Robert Klemme · May 3, 2004

Yukihiro Matsumoto said:
Hi,

In message "Re: Strange behaviour of Strings in Range"

|So then what's Unicode for in the first place? I thought the aim was to
|have a universal encoding for all chars. Did I miss something?

It's _their_ intention. Whether it succeeds or not is another story.
I think they tried their best, but it is virtually impossible to
satisfy all requirements for internationalization.

But does that mean one shouldn't try? I mean, Java shows that it can work
quite well (though I don't know about using Japanese "characters" with
Java). I know, it's a difficult topic especially since people stuck
with ASCII for such a long time, but I've always felt that encodings are a
weak spot of Ruby. But then, maybe I'm overlooking something or some
feature...

Kind regards

robert

ts · May 3, 2004

R> * Characters are stored internally with 16 bits, thus allowing a lot

What do you do when you need 24 bits ?

R> (although not all) of the Unicode code points to be representable. Input
R> and output always uses an encoding (either explicit or implicit the
R> platform's default encoding). There's built in support for a number of
R> well known encodings, including UTF-8, UTF-16, ISO-8859-1 etc.

only western, like I see :-)

Guy Decoux

Yukihiro Matsumoto · May 3, 2004

Hi,

In message "Re: Strange behaviour of Strings in Range"

|But does that mean one shouldn't try?

Did I say such a thing? Trying is a good thing.

matz.

Gavin Sinclair · May 3, 2004

Hi,

In message "Re: Strange behaviour of Strings in Range"
on 04/05/03, "Robert Klemme" writes:

|So then what's Unicode for in the first place? I thought the aim was to
|have a universal encoding for all chars. Did I miss something?

It's _their_ intention. Whether it succeeds or not is another story.
I think they tried their best, but it is virtually impossible to
satisfy all requirements for internationalization.

Why is that? Is there not enough room for every character known to
man, or is there some other problem?

Gavin

Robert Klemme · May 3, 2004

ts said:
R> * Characters are stored internally with 16 bits, thus allowing a lot

What do you do when you need 24 bits ?

As far as I can see, currently 20 bits are sufficient

http://www.unicode.org/charts/

And anything after "Special" looks really quite special to me. At least
western languages as well as Kanji, Hiragana and Katakana are supported.
IMHO pragmatically 16 bits are good enough.

R> (although not all) of the Unicode code points to be representable. Input
R> and output always uses an encoding (either explicit or implicit the
R> platform's default encoding). There's built in support for a number of
R> well known encodings, including UTF-8, UTF-16, ISO-8859-1 etc.

only western, like I see :-)

I didn't send the complete list. Apart from that, UTF-8 and UTF-16 handle
*all* Unicode chars. See
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf

Regards

robert

Yukihiro Matsumoto · May 3, 2004

Hi,

In message "Re: Strange behaviour of Strings in Range"

|Why is that? Is there not enough room for every character known to
|man, or is there some other problem?

Some other problems. I really wish things were that simple.

matz.

Robert Klemme · May 3, 2004

Yukihiro Matsumoto said:
Hi,

In message "Re: Strange behaviour of Strings in Range"

|But does that mean one shouldn't try?

Did I say such a thing? Trying is a good thing.

Your note "The definition of "character" should belong to the application
domain" sounded to me like you didn't consider enhancing unicode treatment
in Ruby. I'm sorry if I misread you.

Then what's the approach planned at the moment?

Kind regards

robert

Yukihiro Matsumoto · May 3, 2004

Hi,

In message "Re: Strange behaviour of Strings in Range"

|Your note "The definition of "character" should belong to the application
|domain" sounded to me like you didn't consider enhancing unicode treatment
|in Ruby. I'm sorry if I misread you.
|
|Then what's the approach planned at the moment?

Basic idea is your "alternative" in [ruby-talk:99089].
We will prove it's not insane through making a prototype.

Could you search the ruby-talk archive with the keyword I18N for more
details? Or you can check the ruby_m17n branch in CVS.

matz.

ts · May 4, 2004

R> As far as I can see, currently 20 bits are sufficient

R> http://www.unicode.org/charts/

What do you do with documents in Japanese EUC encoding?

R> I didn't send the complete list. Apart from that, UTF-8 and UTF-16 handle
R> *all* Unicode chars. See

Like I've said previously: a European-centric vision ...

Guy Decoux
