Strange behaviour of Strings in Range

  • Thread starter Michael Neumann

Michael Neumann

Hi,

r1 = ("\000" .. "\377") # all characters?

r1.to_a
# => ..... "6", "7", "8", "9"]

r1.to_a.size
# => 58

Hm, I guess this is because "9".succ gives "10", and "10" has a size
of two.

But why does "9".succ result in "10"?

Regards,

Michael
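
A short sketch of the behaviour discussed above. String#succ treats the rightmost alphanumeric character like a digit in a counter and carries to the left, so the string grows when the leftmost position overflows. The helper walk_by_succ below is hypothetical (it is not part of Ruby and not how Range is implemented); it only imitates the stopping rule guessed at in the post, namely that iteration ends once the successor is longer than the end string. Later Ruby versions may iterate single-character string ranges differently.

```ruby
# String#succ increments the rightmost alphanumeric and carries leftwards.
p "8".succ    # "9"
p "9".succ    # "10"  (the carry adds a new leading digit)
p "az".succ   # "ba"
p "Zz".succ   # "AAa"

# Hypothetical helper, for illustration only: step with succ and stop as
# soon as the current string is longer than the end string.
def walk_by_succ(first, last)
  result = []
  current = first
  loop do
    break if current.length > last.length
    result << current
    break if current == last
    current = current.succ
  end
  result
end

digits = walk_by_succ("0", "z")
p digits.size   # 10
p digits.last   # "9"
```

Under this rule the walk never reaches "z": after "9" comes "10", which is two characters long, so the walk ends there.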
 
Robert Klemme

Michael Neumann said:
Hi,

r1 = ("\000" .. "\377") # all characters?

r1.to_a
# => ..... "6", "7", "8", "9"]

r1.to_a.size
# => 58

Hm, I guess this is because "9".succ gives "10", and "10" has a size
of two.

But why does "9".succ result in "10"?

IMHO this is a Perlism, so that you can count with strings:

irb(main):010:0> ("0".."20").to_a
=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12",
"13", "14", "15", "16", "17", "18", "19", "20"]

robert
 
Yukihiro Matsumoto

Hi,

In message "Strange behaviour of Strings in Range"

| r1 = ("\000" .. "\377") # all characters?
|
| r1.to_a
| # => ..... "6", "7", "8", "9"]
|
| r1.to_a.size
| # => 58
|
|Hm, I guess this is because "9".succ gives "10", and "10" has a size
|of two.
|
|But why does "9".succ result in "10"?

It's caused by "succ" magic. Let me think about either subtracting
magic, or adding more magic.

matz.
 
Hal Fulton

Yukihiro said:
Hi,

In message "Strange behaviour of Strings in Range"

| r1 = ("\000" .. "\377") # all characters?
|
| r1.to_a
| # => ..... "6", "7", "8", "9"]
|
| r1.to_a.size
| # => 58
|
|Hm, I guess this is because "9".succ gives "10", and "10" has a size
|of two.
|
|But why does "9".succ result in "10"?

It's caused by "succ" magic. Let me think about either subtracting
magic, or adding more magic.

"9" is not really a character anyway, but a string consisting of
one character.

In current Ruby, 0..0377 would work, since a character is essentially
a Fixnum.
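
A minimal sketch of the integer-range idea, assuming a later Ruby where ?x is no longer a number, so the byte values are written as plain integers and converted with Integer#chr when a string is needed:

```ruby
# 0377 is an octal literal, i.e. 255, so this range covers every byte value.
codes = (0..0377).to_a
p codes.size                # 256
p codes.first, codes.last   # 0 and 255

# Integer#succ is plain arithmetic, so there is no string-style carry.
p 57.chr    # "9"
p 57.succ   # 58
p 58.chr    # ":"
```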

Will Rite have a better-defined notion of "character"? Perhaps including
Unicode and such?


Hal
 
Yukihiro Matsumoto

Hi,

In message "Re: Strange behaviour of Strings in Range"

|Will Rite have a better-defined notion of "character"? Perhaps including
|Unicode and such?

No. The definition of "character" should belong to the application
domain, I believe. Considering internationalization, any particular
definition of character can not satisfy all.

matz.
 
Hal Fulton

Yukihiro said:
Hi,

In message "Re: Strange behaviour of Strings in Range"

|Will Rite have a better-defined notion of "character"? Perhaps including
|Unicode and such?

No. The definition of "character" should belong to the application
domain, I believe. Considering internationalization, any particular
definition of character can not satisfy all.

These are two or three separate issues, I believe.

I know that no one encoding scheme will suffice for Asian languages
as well as European. Unicode in that sense is largely a dream, as I
understand it.

And I do not favor a Char class, which seems unnecessary to me.

But here are some related questions, to get more specific:

1. Will str[0] always be a Fixnum?

2. Will ?x always be a Fixnum?

3. In addition to each_byte, would each_char make sense? As I see it,
it would default to be the same as each_byte, but would be replaced
for a wide-char or multibyte variable-length encoding.


But I18N is one of the areas of my greatest ignorance in Ruby.


Thanks,
Hal
 
Yukihiro Matsumoto

Hi,

In message "Re: Strange behaviour of Strings in Range"

|But here are some related questions, to get more specific:
|
|1. Will str[0] always be a Fixnum?

Rite gives 1 char string for str[0].

|2. Will ?x always be a Fixnum?

It will be 1 char string.

|3. In addition to each_byte, would each_char make sense? As I see it,
|it would default to be the same as each_byte, but would be replaced
|for a wide-char or multibyte variable-length encoding.

It makes sense, but I've not decided yet to add it.

matz.
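
For readers on later versions: Ruby 1.9 and newer behave as described in these answers, and they also provide String#each_char. A small sketch, assuming a UTF-8 string (the non-ASCII letter is written as an escape so the source file itself stays ASCII):

```ruby
s = "h\u00e9llo"   # "héllo"

p s[0]                    # "h"  (a one-character string, not an integer)
p ?x                      # "x"
p s.each_char.to_a.size   # 5    (characters)
p s.each_byte.to_a.size   # 6    (the accented letter takes two bytes in UTF-8)
```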
 
Robert Klemme

Yukihiro Matsumoto said:
Hi,

In message "Re: Strange behaviour of Strings in Range"

|Will Rite have a better-defined notion of "character"? Perhaps including
|Unicode and such?

No. The definition of "character" should belong to the application
domain, I believe. Considering internationalization, any particular
definition of character can not satisfy all.

So then what's Unicode for in the first place? I thought the aim was to
have a universal encoding for all chars. Did I miss something?

IMHO Ruby as it is today determines the notion of "character" by the way
strings and regexps are handled and thus a char is a byte. IMHO
characters are so basic that you can't delegate that to the application
domain. You can delegate transformations but not having an internal
standard representation strikes me as difficult.

Maybe I'm overlooking something, if so, please let me know.

Regards

robert
 
Robert Klemme

gabriele renzi said:
I found this article really interesting, maybe it can help you too.
http://www.joelonsoftware.com/articles/Unicode.html

Nicely written, but nothing I didn't know already. Still, the question
remains what Ruby does about handling mixed content internally. IMHO the
most efficient way is to store code points internally. An alternative
would be to store a raw binary stream together with its encoding, but that
would make comparisons (which happen all the time, just think of hash
lookups) slow for strings with different encodings.
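
The comparison concern can be sketched with the per-string encodings of later Ruby versions (1.9 and newer). This is an illustration under that assumption, not the behaviour of the Ruby discussed in this thread:

```ruby
utf8   = "caf\u00e9"                 # "café" stored as UTF-8 bytes
latin1 = utf8.encode("ISO-8859-1")   # same text, different bytes

# Same code points, different byte sequences.
p utf8.codepoints                # [99, 97, 102, 233]
p latin1.codepoints              # [99, 97, 102, 233]
p utf8.bytes == latin1.bytes     # false

# Equality and hash lookup work on bytes plus encoding, so the strings
# only match after transcoding to a common encoding.
p utf8 == latin1                     # false
p utf8 == latin1.encode("UTF-8")     # true

lookup = { utf8 => :found }
p lookup[latin1]                     # nil
p lookup[latin1.encode("UTF-8")]     # :found
```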

IMHO the Java approach* (although it burns memory by using 16 bits per char)
is the most practical among current programming languages. And I wouldn't
mind Ruby borrowing that, especially when considering attempts to use
Java bytecode and a JVM as the runtime system.

Regards

robert


* Characters are stored internally with 16 bits, thus allowing a lot
(although not all) of the Unicode code points to be representable. Input
and output always use an encoding (either an explicit one or, implicitly,
the platform's default encoding). There's built-in support for a number of
well-known encodings, including UTF-8, UTF-16, ISO-8859-1 etc.
 
Yukihiro Matsumoto

Hi,

In message "Re: Strange behaviour of Strings in Range"

|So then what's Unicode for in the first place? I thought the aim was to
|have a universal encoding for all chars. Did I miss something?

It's _their_ intention. Whether it succeeds or not is another story.
I think they tried their best, but it is virtually impossible to
satisfy all requirements for internationalization.

matz.
 
Robert Klemme

Yukihiro Matsumoto said:
Hi,

In message "Re: Strange behaviour of Strings in Range"

|So then what's Unicode for in the first place? I thought the aim was to
|have a universal encoding for all chars. Did I miss something?

It's _their_ intention. Whether it succeeds or not is another story.
I think they tried their best, but it is virtually impossible to
satisfy all requirements for internationalization.

But does that mean one shouldn't try? I mean, Java shows that it can work
quite well (though I don't know about using Japanese "characters" with
Java). I know it's a difficult topic, especially since people stuck
with ASCII for such a long time, but I've always felt that encodings are a
weak spot of Ruby. But then, maybe I'm overlooking something or some
feature...

Kind regards

robert
 
ts

R> * Characters are stored internally with 16 bits, thus allowing a lot

What do you do when you need 24 bits ?

R> (although not all) of the Unicode code points to be representable. Input
R> and output always uses an encoding (either explicit or implicit the
R> platform's default encoding). There's built in support for a number of
R> well known encodings, including UTF-8, UTF-16, ISO-8859-1 etc.

only western, like I see :))


Guy Decoux
 
Yukihiro Matsumoto

Hi,

In message "Re: Strange behaviour of Strings in Range"

|But does that mean one shouldn't try?

Did I say such a thing? Trying is a good thing.

matz.
 
Gavin Sinclair

Yukihiro Matsumoto said:
In message "Re: Strange behaviour of Strings in Range"
on 04/05/03, "Robert Klemme" <[email protected]> writes:

|So then what's Unicode for in the first place? I thought the aim was to
|have a universal encoding for all chars. Did I miss something?

It's _their_ intention. Whether it succeeds or not is another story.
I think they tried their best, but it is virtually impossible to
satisfy all requirements for internationalization.

Why is that? Is there not enough room for every character known to
man, or is there some other problem?

Gavin
 
Robert Klemme

ts said:
R> * Characters are stored internally with 16 bits, thus allowing a lot

What do you do when you need 24 bits ?

As far as I can see, currently 20 bits are sufficient :)
http://www.unicode.org/charts/

And anything after "Special" looks really quite special to me. At least
Western languages as well as Kanji, Hiragana and Katakana are supported.
IMHO pragmatically 16 bits are good enough.

R> (although not all) of the Unicode code points to be representable. Input
R> and output always uses an encoding (either explicit or implicit the
R> platform's default encoding). There's built in support for a number of
R> well known encodings, including UTF-8, UTF-16, ISO-8859-1 etc.

only western, like I see :))

I didn't send the complete list. Apart from that, UTF-8 and UTF-16 handle
*all* Unicode chars. See
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf
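
To make the 16-bit question concrete, here is a sketch using a code point above U+FFFF. It assumes a later Ruby (1.9 or newer) with its built-in transcoding; U+20BB7 is picked only as an example of a CJK ideograph outside the 16-bit range:

```ruby
ch = "\u{20BB7}"   # a CJK ideograph beyond U+FFFF

p ch.codepoints.first > 0xFFFF   # true: it does not fit in one 16-bit unit
p ch.bytesize                    # 4 bytes in UTF-8

# UTF-16 represents it as a surrogate pair, i.e. two 16-bit units.
utf16 = ch.encode("UTF-16BE")
p utf16.unpack("n*").map { |unit| unit.to_s(16) }   # ["d842", "dfb7"]

# Both encodings round-trip the character without loss.
p utf16.encode("UTF-8") == ch    # true
```

So UTF-8 and UTF-16 can both carry the character, but a representation with a fixed 16 bits per character cannot hold it in a single unit.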

Regards

robert
 
Yukihiro Matsumoto

Hi,

In message "Re: Strange behaviour of Strings in Range"

|Why is that? Is there not enough room for every character known to
|man, or is there some other problem?

Some other problems. I really wish things were that simple.

matz.
 
Robert Klemme

Yukihiro Matsumoto said:
Hi,

In message "Re: Strange behaviour of Strings in Range"

|But does that mean one shouldn't try?

Did I say such a thing? Trying is a good thing.

Your note "The definition of "character" should belong to the application
domain" sounded to me like you didn't consider enhancing unicode treatment
in Ruby. I'm sorry if I misread you.

Then what's the approach planned at the moment?

Kind regards

robert
 
Yukihiro Matsumoto

Hi,

In message "Re: Strange behaviour of Strings in Range"

|Your note "The definition of "character" should belong to the application
|domain" sounded to me like you didn't consider enhancing unicode treatment
|in Ruby. I'm sorry if I misread you.
|
|Then what's the approach planned at the moment?

The basic idea is your "alternative" in [ruby-talk:99089].
We are proving it's not insane by making a prototype.

Could you search the ruby-talk archive with the keyword I18N for more
detail? Or you can check the ruby_m17n branch in CVS.


matz.
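
The "raw bytes plus an encoding" idea can be sketched with the String API of later Ruby versions (1.9 and newer), where each string carries its own encoding. This is an illustration under that assumption and says nothing about the state of the ruby_m17n prototype at the time:

```ruby
utf8 = "\u65e5\u672c\u8a9e"     # three Japanese characters, written as escapes
euc  = utf8.encode("EUC-JP")    # the same text as EUC-JP bytes

p euc.encoding.name               # "EUC-JP"
p euc.valid_encoding?             # true
p euc.length                      # 3 characters
p euc.bytesize                    # 6 bytes
p euc.each_char.map(&:bytesize)   # [2, 2, 2]

# The string stays in EUC-JP until something asks for another encoding.
p euc.encode("UTF-8") == utf8     # true
```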
 
ts

R> As far as I can see, currently 20 bits are sufficient :)
R> http://www.unicode.org/charts/

What do you do with documents in Japanese EUC encoding?

R> I didn't send the complete list. Apart from that, UTF-8 and UTF-16 handle
R> *all* Unicode chars. See

As I said previously: a European-centric vision ...


Guy Decoux
 
