unicode in ruby

Richard Gyger

i'm using IO.foreach to parse the lines in a file. now i'm trying to get
it to work with unicode encoded files. does ruby support unicode? how do
i compare a variable with a unicode constant string?

the script goes something like:

IO.foreach("myfile.txt") { |line|
if line.downcase[0,2] == "id"
 
Michal Suchanek

On 3/8/06, Richard Gyger wrote:
> i'm using IO.foreach to parse the lines in a file. now i'm trying to get
> it to work with unicode encoded files. does ruby support unicode? how do
> i compare a variable with a unicode constant string?
>
> the script goes something like:
>
> IO.foreach("myfile.txt") { |line|
>    if line.downcase[0,2] == "id"

To get unicode downcase you probably want icu4r. To handle the cases
you are interested in you could write your own. However, the []
operator of ruby strings returns bytes, not characters.

hth

Michal
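
For readers on a current Ruby, here is a minimal sketch of the prefix comparison from the original question. It assumes Ruby 2.4 or later (not the 1.8 series discussed in this thread), where strings carry an encoding, `[]` indexes characters rather than bytes, and `downcase` also maps non-ASCII letters. The helper name `lines_starting_with` and the sample file contents are made up for illustration.

```ruby
require "tempfile"

# Hypothetical helper: returns the lines whose leading characters,
# compared case-insensitively, equal the given prefix.
def lines_starting_with(path, prefix)
  matches = []
  # Declare the file's encoding explicitly so each line is tagged as UTF-8.
  File.foreach(path, encoding: "UTF-8") do |line|
    matches << line.chomp if line.downcase[0, prefix.length] == prefix.downcase
  end
  matches
end

# Sample data (assumed), written to a temporary file so the sketch is self-contained.
Tempfile.create(["myfile", ".txt"]) do |file|
  file.write("ID: 42\nÉtat: ok\nid: 7\n")
  file.flush
  p lines_starting_with(file.path, "id")
  p lines_starting_with(file.path, "ét")
end
```

Passing the encoding to `File.foreach` is a deliberate choice here: it keeps the comparison independent of the locale the script happens to run under.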
 
Richard Gyger

so, you guys are telling me a language developed since the year 2000
doesn't support unicode strings natively? in my opinion, that's a pretty
glaring problem.

i'm using IO.foreach [.. no \n ]

you don't make use of "\n" at uni-berlin.de when wrapping ?

could be more readable ;-)
 
Logan Capaldo

Richard Gyger said:
so, you guys are telling me a language developed since the year
2000 doesn't support unicode strings natively? in my opinion,
that's a pretty glaring problem.

Ruby doesn't really support any strings natively. It just happens to
have a bytevector class that acts a lot like a string ;) Having said
that, have you tried:
# Assumes the source file is encoded as UTF-8; affects literal strings, regexps, etc.
$KCODE = "u"

If your source file is UTF16 or some other non-UTF8 encoding you'll
have to use iconv to get into UTF8 to compare with the literals in
your source.
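
As a point of comparison, here is a hedged sketch of the same "convert to UTF-8 first, then compare" idea on a newer Ruby. It assumes Ruby 2.0 or later, where `String#encode` does the transcoding (Iconv is no longer in the standard library there and `$KCODE` no longer has any effect). The helper name `utf16le_to_utf8` and the sample bytes are hypothetical, and the sketch assumes the data is little-endian UTF-16 without a byte order mark.

```ruby
# Hypothetical helper: takes raw bytes read from a UTF-16LE file and
# returns a UTF-8 string that can be compared with UTF-8 literals.
def utf16le_to_utf8(raw_bytes)
  # dup so the caller's string keeps its original encoding tag.
  raw_bytes.dup.force_encoding("UTF-16LE").encode("UTF-8")
end

# Simulated file contents (assumed): the bytes of "ID: café" in UTF-16LE.
raw = "ID: café".encode("UTF-16LE").b

line = utf16le_to_utf8(raw)
puts line.encoding
puts line.downcase[0, 2] == "id"
```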
 
Michal Suchanek

On 3/8/06, Richard Gyger wrote:
> so, you guys are telling me a language developed since the year 2000
> doesn't support unicode strings natively? in my opinion, that's a pretty
> glaring problem.

For me it is a problem as well. But getting unicode right is hard.
Look at the size of the icu library and the size of ruby itself.
Anyway, unicode regexps are planned for ruby 2.0 iirc.

Thanks

Michal
 
Eric Jacoboni

Logan Capaldo said:
Ruby doesn't really support any strings natively. It just happens to
have a bytevector class that acts a lot like a string ;)

... that acts a lot like a string /of ASCII chars/, actually. Rather
anachronistic, imho.

I can't consider that "il était une fois".length == 18 is the way it
should be with a string in a modern language.

Of course, tweaking with -K and jcode and/or other third-party
modules and/or various hacks brings some improvements (we have a
jlength method that seems to work), but that's nothing to write home
about, either (case methods support only ASCII chars, etc.)

Waiting for a plain support in Rite (much more important to me than
the "end" issues...).
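
To make the character count versus byte count distinction concrete, here is a small sketch. It assumes Ruby 2.4 or later rather than the 1.8 series under discussion; on those versions `length` counts characters (17 for this phrase), `bytesize` counts UTF-8 bytes (18), and the case methods map non-ASCII letters. The helper name `measure` is made up for illustration.

```ruby
# Hypothetical helper: reports how a string measures in characters and in bytes.
def measure(text)
  { chars: text.length, bytes: text.bytesize }
end

phrase = "il était une fois"

p measure(phrase)   # the two numbers differ because "é" takes two bytes in UTF-8
p phrase.upcase     # case mapping beyond ASCII
```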
 
rtilley

Eric said:
Waiting for a plain support in Rite (much more important to me than
the "end" issues...)

Speaking of Rite... is there a timeline on its release yet? One year?
Two years? More?
 
Richard Gyger

exactly. utf-8 doesn't mean one byte per char necessarily.

how have folks solved this problem when writing web sites in rails?
 
PJ Hyett

It's a huge f*cking pain in the ass. We've been trying to convert
Wayfaring.com over to UTF8 off and on for about a month and it's
completely useless. Either you start the site using UTF8 (using crappy
hacks IMO) or forgetaboutit. We're about to break ground on a new site
and I almost don't want to do it until ruby 2.0 comes out with the
unicode support built in.

-PJ
http://pjhyett.com
 
Anthony DeRobertis

Austin said:
Unix support for
Unicode is still in the stone ages because of the nonsense that POSIX
put on Unix ages ago. (When Unix filesystems can write UTF-16 as their
native filename format, then we're going to be much better. That will,
however, break some assumptions by really stupid programs.)

Ummm, no. UTF-16 filenames would break *every* correctly-implemented
UNIX program: UTF-16 allows the octet 0x00, which has always been the
end-of-string marker.

Personally, my file names have been in UTF-8 for quite some time now,
and it works well: What exactly is this 'stone age' you refer to?

UTF-8 can take multiple octets to represent a character. So can UTF-16,
UTF-32, and every other variation of Unicode.

Depending on content, a string in UTF-8 can consume more octets than
the same string in UTF-16, or vice versa.

Ah! But wait. I can see an advantage to UTF-16. With UTF-8, you don't
get to have the fun of picking between big- and little-endian!
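
The zero-octet and byte-order points can be checked directly. This is a sketch assuming a Ruby with `String#encode` (1.9 or later); the helper names `octets` and `contains_nul?` and the sample file name are hypothetical.

```ruby
# Hypothetical helper: lists the octets of a string in a given encoding.
def octets(text, encoding)
  text.encode(encoding).bytes
end

# A C-style string ends at the first zero octet, so check whether one appears.
def contains_nul?(text, encoding)
  octets(text, encoding).include?(0)
end

p octets("a", "UTF-8")     # => [97]
p octets("a", "UTF-16BE")  # => [0, 97]
p octets("a", "UTF-16LE")  # => [97, 0]

p contains_nul?("readme.txt", "UTF-8")     # => false
p contains_nul?("readme.txt", "UTF-16LE")  # => true
```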
 
Bill Kelly

Hi,
I do not care about the space inefficiency. Be it inefficiency in
storing Czech text, Japanese text, English text, or any other. It has
nothing to do with the fact I do not speak Japanese.

I'm writing a cross-platform app in ruby that will include text editing and
also p2p chat. I'd like to handle extended character sets. Do we have any
recipes or best practices for handling non-ASCII character encodings
in present-day ruby v1.8.4?

How are folks currently handling non-ASCII "wide" character encodings
in Ruby?


Thanks,

Bill
 
Anthony DeRobertis

Austin said:
You're right. And I'm saying that I don't care.

Well, I suspect most other people want to maintain backwards
compatibility. Hence the existence of UTF-8.

Austin said:
People need to stop
thinking in terms of bytes (octets) and start thinking in terms of
characters. I'll say it flat out here: the POSIX filesystem definition
is going to badly limit what can be done with Unix systems.

Why? POSIX gives nearly binary-transparent file names; the only
exception is the single octet 0x00. Considering the 1:1 mapping between
UTF-8 and other Unicode encodings, how can the choice of one or another
"badly limit" what can be done?

Austin said:
Change an environment variable and watch your programs break that had
worked so well with Unicode. *That* is the stone age that I refer to.

dd if=/dev/urandom of=/lib/ld-linux.so.2 and watch all my programs
break, too. What's your point?

It is always possible to break a computer system if you try hard enough
(or, all too often, not hard at all); but if the user actively attempts
to make his machine malfunction, that's not the OS's problem.

Austin said:
I'm also guessing that you don't do much with long Japanese filenames
or deep paths that involve *anything* except US-ASCII (a subset of
UTF-8).

Well, I have Japanese file names (though not that many in the grand
scheme of things), and have a lot of files and directories named in
non-US-ASCII. Yeah, I know that file name length and path length limits
suck, but that's an implementation limitation of e.g. ext3, nothing
fundamental.

Austin said:
This last statement is true only because you use the term "octet."

You're correct; that isn't what I meant to say. Something along the
lines of the following is better worded:

UTF-8 can take more than one octet to represent a
character; UTF-16 can take more than two; UTF-32
more than four; etc.

Austin said:
It's a useless term here, because UTF-8 only has any level of
efficiency for US-ASCII.

English, I've heard, is a rather common language.

Austin said:
Even if you step to European content, UTF-8
is no longer perfectly efficient,

Of course not --- but still generally better than UTF-16, I think.
Spanish, I've heard, is also a rather common language.

Austin said:
and when you step to Asian content,
UTF-8 is so bloody inefficient that most folks who have to deal with
it would rather work in a native encoding (EUC-JP or SJIS, anyone?)
which is 1..2 bytes or do everything in UTF-16.

Yes, for CJK, UTF-8 is fairly inefficient. A full 50% bigger than
UTF-16 (three octets per character instead of two).

OTOH, it has some nice advantages over UTF-16, like being backwards
compatible with C strings, being resynchronizable (if an octet is lost),
not having byte-order issues, etc.

Now, honestly, what portion of your hard disk is taken up by file names?
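
The size trade-off is easy to measure. This is a sketch assuming a Ruby with `String#encode` (1.9 or later); the helper name `encoded_sizes` and the three sample words are chosen only for illustration.

```ruby
# Hypothetical helper: byte counts of the same text in three Unicode encodings.
def encoded_sizes(text)
  {
    utf8: text.encode("UTF-8").bytesize,
    utf16: text.encode("UTF-16LE").bytesize,
    utf32: text.encode("UTF-32LE").bytesize
  }
end

# English, Spanish, and Japanese samples (assumed).
["hello", "español", "日本語"].each do |sample|
  puts "#{sample}: #{encoded_sizes(sample)}"
end
```

For the ASCII and Spanish samples UTF-8 comes out smallest, while for the Japanese sample UTF-16 does, which matches the "depending on content" caveat raised earlier in the thread.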
 
Bill Kelly

Austin Ziegler said:
No. UTF-32 does not have surrogates. Unicode is perfectly
representable in either 20 or 21 bits. A single character is *always*
representable in a uint32_t sized space with UTF-32.

Hi, I have zero background in non-ASCII character representations,
but the following post has been echoing in my head as a data point
for... can't believe it's been three-and-a-half years:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/46284

Does that have any relation to your current context? Curt seems to
be talking not of surrogates, but saying "combining characters"
mean variable-length issues still exist with UTF-32 ?


Regards,

Bill
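
One practical consequence of combining characters is that two strings can look identical and still compare unequal. Here is a hedged sketch, assuming Ruby 2.2 or later for `String#unicode_normalize`; the helper name `same_text?` is hypothetical.

```ruby
# Hypothetical helper: compares two strings after normalizing both to NFC.
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

precomposed = "\u00E9"   # "é" as a single code point
decomposed  = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

p precomposed == decomposed               # => false
p same_text?(precomposed, decomposed)     # => true
p decomposed.encode("UTF-32LE").bytesize  # => 8, two code units for one visible letter
```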
 
Andreas

I don't get it guys. Supporting (not exclusively using) Unicode
transparently should be a no-brainer for a serious programming language
these days. I love Ruby but multi-byte strings are a pain. And they are
everywhere. There's no logic in resisting. There are more chars in the
world than on your keyboard. Even in the US, there are official and
*correct* chars for quotation marks that are not in the US-ASCII set.
Using the inch sign for quotes is plain wrong. Come on, we're in the
21st century and the world is a global place. Open source people should
know that best. It can't be so difficult technically - others do it,
why can't you?

All we want is a Unicode safe Ruby.

Best,
Andreas
 
Anthony DeRobertis

Austin said:
No. UTF-32 does not have surrogates. Unicode is perfectly
representable in either 20 or 21 bits. A single character is *always*
representable in a uint32_t sized space with UTF-32.

Depends on what you call a character; in the technical way Unicode uses
the term, yes, UTF-32 can represent every character at present.

In the way that users understand characters (what the unicode standard
calls a "grapheme"), the way text-processing software needs to
manipulate characters, no it can't.

d̈̅ is not three characters to the user.

Austin said:
POSIX is outdated and needs to be scrapped or fixed.

So far, you have provided no evidence of this, just assertions that
somehow UTF-8 is horribly limiting.
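
The gap between code points and user-perceived characters can be shown with a short sketch. It assumes Ruby 2.5 or later for `String#grapheme_clusters`; the helper name `counts` is hypothetical, and the sample is a "d" followed by two combining marks (U+0308 and U+0305).

```ruby
# Hypothetical helper: counts a string three ways.
def counts(text)
  {
    code_points: text.length,                      # what Unicode calls characters
    graphemes: text.grapheme_clusters.length,      # what a user sees as characters
    utf32_bytes: text.encode("UTF-32LE").bytesize  # four octets per code point
  }
end

sample = "d\u0308\u0305"  # d + COMBINING DIAERESIS + COMBINING OVERLINE

p counts(sample)  # => {code_points: 3, graphemes: 1, utf32_bytes: 12} (hash formatting varies by Ruby version)
```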
 
