Hi,
As you suggested I read the article:
http://www.joelonsoftware.com/articles/Unicode.html
I didn't find anything new. It's just explaining character sets in a
rather non-specific way. ASCII uses 7 bits, so it can store 128
characters, so it can't store all the characters in the world, so
other character sets are needed (really? I would never have guessed
that). UTF-16 stores most characters in 2 bytes (so more of the
world's characters fit), while UTF-8 is variable-width: it uses 1 byte
for characters up to 127, and 2 to 4 bytes beyond that (the original
spec allowed up to 6).
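Just to be concrete about what I mean, here's a quick Ruby sketch of my own (the string is just an example, nothing from the article) showing how the byte layouts differ:

```ruby
# The same ASCII-only string encoded two ways.
s = "fooobar"

utf8  = s.encode("UTF-8")
utf16 = s.encode("UTF-16LE")

puts utf8.bytes.inspect   # one byte per ASCII character
puts utf16.bytes.inspect  # two bytes per character; every other byte is 0x00
puts utf8.bytesize        # 7
puts utf16.bytesize       # 14
```

Same seven characters, twice the bytes in UTF-16, with a 0x00 riding along next to every ASCII letter.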
So what exactly am I looking for here?
What is there to know about Unicode? There are a couple of character
sets, use UTF-8, and remember that one character != one byte. Is there
anything else for practical purposes?
I'm sorry if I'm being rude, but I really don't like it when people
tell me to read stuff I already know.
My question is still there:
Let's say I want to rename a file "fooobar" and remove the third "o",
but the name is encoded in UTF-16 while Ruby only supports UTF-8. In
UTF-16 each ASCII letter is 2 bytes, one of them 0x00, so if I remove
just the "o" byte there will of course still be a stray 0x00 left in
there. That's assuming the string is recognized at all.
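Here's a rough Ruby sketch of the failure I'm describing (the file name and the indices are just my example):

```ruby
# UTF-16LE name treated as raw bytes, the way a UTF-8-only tool might see it.
raw = "fooobar".encode("UTF-16LE").dup.force_encoding("BINARY")

# Characters are 2 bytes each in UTF-16LE, so the third "o"
# (character index 3) starts at byte index 6.
raw.slice!(6, 1)          # removes only the "o" byte, not its 0x00 pair
puts raw.bytes.inspect    # a stray 0x00 now sits mid-string

# Encoding-aware deletion removes the whole 2-byte character instead:
fixed = "fooobar".encode("UTF-16LE")
fixed.slice!(3, 1)        # character index, not byte index
puts fixed.encode("UTF-8")  # => "foobar"
```

The byte-wise version leaves 13 bytes with a lone 0x00 in the middle, which is exactly the corruption I'm asking about.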
Why is there no issue with UTF-16 if only UTF-8 is supported?
I don't mind reading some more if I can actually find the answer.
Best regards.