ruby 1.9.1: Encoding trouble: broken US-ASCII String


Tom Link

Hi,

Right now, I'm not exactly thrilled by the way ruby 1.9 handles
encodings. Could somebody please explain things or point me to some
reference material:

I have the following files:

testEncoding.rb:
#!/usr/bin/env ruby
# encoding: ISO-8859-1

p __ENCODING__

text = File.read("text.txt")
text.each_line do |line|
  p line =~ /foo/
end


text.txt:
Foo äöü bar.

I use: ruby 1.9.1 (2008-12-01 revision 20438) [i386-cygwin]

If I run: ruby19 testEncoding.rb, I get:
#<Encoding:ISO-8859-1>
testEncoding.rb:8:in `block in <main>': broken US-ASCII string
(ArgumentError)

Ruby detects the encoding line but nevertheless assumes the text file
is 7-bit ASCII. The source file's encoding is only respected if I add
the command-line option -E ISO-8859-1. I could also set the encoding
explicitly for each string, but ...

I found some hints that the default charset for external sources is
deduced from the locale. So I set LANG to de_AT, de_AT.ISO-8859-1, and
some more variants, to no avail.

How exactly is this supposed to work? What other options do I have to
make ASCII-8BIT or Latin-1 the default encoding without having to
supply an extra command-line option and without having to rely on an
environment variable? Why isn't ASCII-8BIT the default in the first
place? Why isn't __ENCODING__ a global variable I can assign a value
to?

Thanks,
Thomas.
 

Brian Candler

Tom said:
Right now, I'm not exactly thrilled by the way ruby 1.9 handles
encodings. Could somebody please explain things or point me to some
reference material:

I asked the same over at ruby-core recently. There were some useful
replies:

http://www.ruby-forum.com/topic/173179#759661

But the upshot is that this is all pretty much undocumented so far.
(Well, it might be documented in the 3rd ed. Pickaxe, but I'm not
buying that yet.)

text = File.read("text.txt")

This should work:

text = File.read("text.txt", :encoding=>"ISO-8859-1")

I still don't know how the default is worked out though.

Regards,

Brian.
 

Tom Link

text = File.read("text.txt", :encoding=>"ISO-8859-1")

Unfortunately, this isn't compatible with ruby 1.8. A script that uses
such a construct runs only with ruby 1.9. Sigh.
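A possible workaround, just a sketch: guard on String#force_encoding,
which only 1.9 defines, so the same script still runs under 1.8:

text = File.read("text.txt")
# On 1.9, tag the bytes as Latin-1; on 1.8 the method doesn't exist
# and the string stays a plain byte string, as before.
text.force_encoding("ISO-8859-1") if text.respond_to?(:force_encoding)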

Many thanks for the pointer to the other thread over at ruby core.

Regards,
Thomas.
 

Brian Candler

Tom said:
Unfortunately, this isn't compatible with ruby 1.8. A script that uses
such a construct runs only with ruby 1.9. Sigh.

If all else fails, read the source.

I see that the encoding falls back to rb_default_external_encoding(),
which returns default_external, setting it if necessary from
rb_enc_from_index(default_external_index).

This in turn is set from rb_enc_set_default_external.

This in turn is set from cmdline_options.ext.enc.name.

And this in turn is set from the -E flag (or certain legacy settings of
-K). So:

$ ruby19 -E ISO-8859-1 -e 'puts File.open("/etc/passwd").gets.encoding'
ISO-8859-1

Yay. However, if it is possible to set the default external encoding
programmatically (i.e. not via the command-line options), I couldn't
see how.
 

Brian Candler

Brian said:
$ ruby19 -E ISO-8859-1 -e 'puts File.open("/etc/passwd").gets.encoding'
ISO-8859-1

D'oh. I see from the original post that you knew this already.

It seems that Ruby keeps state for:
- default external encoding (e.g. for files being read in)
- default internal encoding (not sure what this is; you can set it
using -E too, but it defaults to nil)

and these are independent from the encodings of source files, which use
the magic comments to declare their encoding.

You can read these using Encoding.default_external and
Encoding.default_internal, but there don't appear to be setters for
them.
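
You can at least inspect them:

p Encoding.default_external  # presumably #<Encoding:US-ASCII> on the OP's build
p Encoding.default_internal  # nil unless set, e.g. via -E ext:int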
 

Brian Candler

Ah, there is a preview here:

http://books.google.co.uk/books?id=...X&oi=book_result&resnum=4&ct=result#PPA358,M1

Something like this may do the trick:

text = File.open("..") do |f|
  f.set_encoding("ISO-8859-1") rescue nil
  f.read
end

But then you may as well just do:

text.force_encoding("ISO-8859-1") rescue nil

I'm not sure in which way the regexp is incompatible with the data read.
I would have thought that a US-ASCII regexp should be able to match
ISO-8859-1 data, and perhaps vice versa, but it seems not.

I can't really replicate without a hexdump of your text.txt. But it
would be interesting to see the result of:

text.each_line do |line|
  p line.encoding
  p /foo/.encoding
  p line =~ /foo/
end

Maybe what's really needed is a sort of "anti-/u" option which means "my
regexp literals are meant to match byte-at-a-time, not
character-at-a-time"

Anyway, I'm afraid all this increases my inclination to stick with ruby
1.8.6 :-(
 

James Gray

But the upshot is that this is all pretty much undocumented so far.
(Well it might be documented in the 3rd ed Pickaxe, but I'm not buying
that yet)

The Pickaxe does cover a lot of the new encoding behavior.

James Edward Gray II
 

James Gray

- default internal encoding (not sure what this is; you can set it
using -E too, but it defaults to nil)

Default internal is the encoding IO objects will transcode incoming
data into, by default. So you could set this to UTF-8 and then read
from various different encodings (specifying each type in the open()
call), but only work with Unicode in your script.
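
For example (a sketch; "latin1.txt" is a made-up name):

# "r:EXTERNAL:INTERNAL" reads Latin-1 bytes but hands you a UTF-8 string.
text = File.open("latin1.txt", "r:ISO-8859-1:UTF-8") { |f| f.read }
p text.encoding  # => #<Encoding:UTF-8>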

James Edward Gray II
 

James Gray

I would have thought that a US-ASCII regexp should be able to match
ISO-8859-1 data, and perhaps vice versa, but it seems not.

It does:

$ ruby_dev -e 'p "résumé".encode("ISO-8859-1") =~ /foo/'
nil
$ ruby_dev -e 'p "résumé foo".encode("ISO-8859-1") =~ /foo/'
7

Maybe what's really needed is a sort of "anti-/u" option which means "my
regexp literals are meant to match byte-at-a-time, not
character-at-a-time"

That's what BINARY means.

Anyway, I'm afraid all this increases my inclination to stick with ruby
1.8.6 :-(

Perhaps it's a bit early to make this judgement since you've just
started learning about the new system?

There's a lot going on here, so it's a lot to take in. In places, the
behavior is a little complex. However, the core team has put a lot of
effort into making the system easier to use. It's getting there.

Also, even in its current draft form, the Pickaxe answers every
question you've thrown at both mailing lists. Thus it should be a big
help when you decide the time is right to pick it up.

James Edward Gray II
 

Brian Candler

James said:
It does:

$ ruby_dev -e 'p "résumé".encode("ISO-8859-1") =~ /foo/'
nil
$ ruby_dev -e 'p "résumé foo".encode("ISO-8859-1") =~ /foo/'
7

I found that too, but was confused by the "broken US-ASCII string"
exception which the OP saw.

I suppose the external_encoding is defaulting to US-ASCII on that
system.

This means his program will break on every file passed into it which has
a character with the top bit set. You can argue that's "failsafe", in
the sense of bombing out rather than continuing processing with the
wrong encoding, and it therefore forces you to change your program or
the command-line args to specify the actual encoding in use.

However, that's pretty unforgiving. I can use Unix grep on a file with
unknown character set or broken UTF-8 characters and it works quite
happily.

Wouldn't it be kinder to default to BINARY if the encoding is
unspecified?

irb(main):011:0> s = "foo\xff\xff\xffbar".force_encoding("BINARY")
=> "foo\xFF\xFF\xFFbar"
irb(main):012:0> s =~ /foo/
=> 0

That's what BINARY means.

On the String side, yes.

I was thinking of an option on the Regexp: /foo/b or somesuch.
(In contrast to /foo/u in 1.8 meaning 'this Regexp matches unicode')

Or can you set BINARY encoding on the Regexp too? I couldn't see how.
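
Two guesses I haven't verified on 1.9.1: the /n flag ("no encoding"),
and Regexp.new with a pre-tagged source string:

r = /foo/n
p r.encoding
p Regexp.new("foo".force_encoding("ASCII-8BIT")).encoding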
 

Tom Link

There's a lot going on here, so it's a lot to take in. In places, the
behavior is a little complex. However, the core team has put a lot of
effort into making the system easier to use. It's getting there.

It would have been nice though if the defaults had been chosen so that
they don't break 1.8 scripts -- or if some 8-bit-clean encoding were
used when the data contains 8-bit characters, instead of throwing an
error.
 

James Gray

It would have been nice though if the defaults had been chosen so that
they don't break 1.8 scripts -- or if some 8-bit-clean encoding were
used when the data contains 8-bit characters, instead of throwing an
error.

I think it's probably more important to get this encoding interface
right than to worry about 1.8 compatibility. We knew 1.9 was going to
break some things, so the time was right.

Also, if you've been using the -KU switch in Ruby 1.8 and working with
UTF-8 data, 1.9 may work pretty well for you:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/19552

That's a pretty common "best practice" in the Ruby community, from
what I've seen. Even Rails pushes this approach now.

If you have gone this way though, you may want to migrate to the even
better -U switch in 1.9.
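
e.g. (output assumed here, not re-checked on this build):

$ ruby19 -U -e 'p Encoding.default_internal'
#<Encoding:UTF-8>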

James Edward Gray II
 

James Gray

Wouldn't it be kinder to default to BINARY if the encoding is
unspecified?

The default encoding is pulled from your environment: LANG or
LC_CTYPE, I believe. This is very important and it makes simple
scripting fit in well with the environment.
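
So something like this should follow the locale (assuming the locale
is installed and the platform's locale lookup works; as it turns out
below, on Cygwin it may not):

$ LANG=de_AT.ISO-8859-1 ruby19 -e 'p Encoding.default_external'
#<Encoding:ISO-8859-1>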

James Edward Gray II
 

Brian Candler

Wouldn't it be kinder to default to BINARY if the encoding is
unspecified?

The default encoding is pulled from your environment: LANG or
LC_CTYPE, I believe. This is very important and it makes simple
scripting fit in well with the environment.

The code seems to say:
- if an encoding is chosen in the environment but is unknown to Ruby,
use ASCII-8BIT (aka BINARY)
- if Ruby was built on a system where it doesn't know how to ask the
environment for a language, then use US-ASCII

So I would read from this that the OP has either fallen foul of the
US-ASCII fallback (e.g. no langinfo.h when building under Cygwin), or
else his environment has explicitly picked US-ASCII.

There must have been a good reason why US-ASCII was chosen, rather than
ASCII-8BIT, for systems without langinfo.h.

Regards,

Brian.

rb_encoding *
rb_locale_encoding(void)
{
    VALUE charmap = rb_locale_charmap(rb_cEncoding);
    int idx;

    if (NIL_P(charmap))
        idx = rb_usascii_encindex();
    else if ((idx = rb_enc_find_index(StringValueCStr(charmap))) < 0)
        idx = rb_ascii8bit_encindex();

    if (rb_enc_registered("locale") < 0) enc_alias("locale", idx);

    return rb_enc_from_index(idx);
}

...

VALUE
rb_locale_charmap(VALUE klass)
{
#if defined NO_LOCALE_CHARMAP
    return rb_usascii_str_new2("ASCII-8BIT");
#elif defined HAVE_LANGINFO_H
    char *codeset;
    codeset = nl_langinfo(CODESET);
    return rb_usascii_str_new2(codeset);
#elif defined _WIN32
    return rb_sprintf("CP%d", GetACP());
#else
    return Qnil;
#endif
}
 

Ollivier Robert

Perhaps it's a bit early to make this judgement since you've just
started learning about the new system?

From what I've seen while experimenting with 1.9 for a few months, my
main gripe is that the whole encoding support is overly complex. I know
m17n is not solved by waving the magic Unicode wand, but I'd love to
have a simpler way.
 

Brian Candler

Yukihiro said:
The whole picture must be complex, since encoding support itself is
VERY complex indeed. History sucks. But for daily use, just remember
to specify the encoding if you are not sure what the default_encoding
is, e.g.

f = open(path, "r:iso-8859-1")

It seems to go against DRY to have to write "r:binary" or "rb:binary"
when opening lots of binary files. But if I remember to use
#!/usr/bin/ruby -Knw everywhere that should be OK.

However, I also don't like the unstated assumption that all Strings
contain text.

In RFC2045 (MIME), there is a distinction made between 7bit text, 8bit
text, and binary data.

But if you label a string as "binary", Ruby changes this to
"ASCII-8BIT". I think that is a misrepresentation of that data, if it is
not actually ASCII-based text. I would much rather it made no assertion
about the content than a wrong assertion.
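
You can see the relabelling in irb (1.9):

irb(main):001:0> Encoding.find("BINARY")
=> #<Encoding:ASCII-8BIT>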
 

Dave Thomas

It seems to go against DRY to have to write "r:binary" or "rb:binary"
when opening lots of binary files. But if I remember to use
#!/usr/bin/ruby -Knw everywhere that should be OK.

You used to have to do that. In recent HEADs, "rb" sets the binary
encoding automatically (unless overridden).


Dave
 

Tom Link

Also, if you've been using the -KU switch in Ruby 1.8 and working with
UTF-8 data, 1.9 may work pretty well for you

Well, I'm still stuck with Latin-1. It's interesting though that
according to B. Candler the fallback for unknown encodings should be
8-bit clean and that US-ASCII should be used only as a last resort.
Maybe it's just a Cygwin thing?

Could we/I please get more information on how exactly the charset is
chosen, depending on which environment variable, and whether this
applies to Cygwin too? It appears to me that neither LANG nor LC_CTYPE
has any effect on charset selection. But maybe I'm doing it wrong.

Regards,
Thomas.
 

Brian Candler

Yukihiro said:
open(path, "rb") is your friend. It sets encoding to binary.

Thanks.

"rb" is now performing two jobs then: preventing line-ending
translation (on those platforms which do it) and setting the encoding
to binary. Something to remember.
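
A quick sketch (recent 1.9 build assumed; "image.png" is only a
placeholder name):

data = File.open("image.png", "rb") { |f| f.read }
p data.encoding  # expected: #<Encoding:ASCII-8BIT>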
 

Tom Link

So I would read from this that the OP has either fallen foul of the
US-ASCII fallback (e.g. no langinfo.h when building under Cygwin), or
else his environment has explicitly picked US-ASCII.

Somebody mentions on http://bugs.python.org/issue3824 that:
"And nl_langinfo(CODESET) is useless on cygwin because it's always
US-ASCII."

And here: http://svn.xiph.org/trunk/vorbis-tools/intl/localcharset.c
"Cygwin 2006 does not have locales. nl_langinfo (CODESET) always
returns "US-ASCII"."

If I understood you right, this could cause the problems I
encountered.

Cygwin 1.7 is currently in beta. Maybe this improves things in this
respect?

Regards,
Thomas.
 
