A question about Ruby 1.9's "external encoding"

Albert Schlef

I have the following program:

p Encoding.default_external
File.open('testing', 'w') do |f|
p f.external_encoding
end

and when I run it I get the following output:

#<Encoding:UTF-8>
nil

In other words, the file's "external encoding" is nil. What does this
mean? Shouldn't this be "UTF-8", the default external encoding?

BTW, "ruby1.9.1 -v" gives me:

ruby 1.9.1p378 (2010-01-10 revision 26273) [i486-linux]

I'm using Ubuntu 10.04.1, and that's the most updated version of Ruby
1.9.1.
 
Robert Klemme

--------------------------------------------------- IO#external_encoding
io.external_encoding => encoding

From Ruby 1.9.1
------------------------------------------------------------------------
Returns the Encoding object that represents the encoding of the
file. If io is write mode and no encoding is specified, returns
+nil+.

I'd say it means that the default encoding is used.
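
The documented behaviour quoted above can be checked directly. This is only a minimal sketch: the helper name external_encodings and the scratch directory are made up for illustration, and it assumes Encoding.default_internal is nil, as in the session shown in this thread.

```ruby
require "tmpdir"

# Report IO#external_encoding for a few open modes on a scratch file.
# (Hypothetical helper, not part of Ruby.)
def external_encodings
  Dir.mktmpdir do |dir|
    path = File.join(dir, "testing")
    {
      # write mode, no encoding given: the case from the question
      write_plain:    File.open(path, "w")       { |f| f.external_encoding },
      # write mode with an explicit external encoding
      write_explicit: File.open(path, "w:UTF-8") { |f| f.external_encoding },
      # read mode, no encoding given
      read_plain:     File.open(path, "r")       { |f| f.external_encoding },
    }
  end
end

p Encoding.default_external
p external_encodings
```

Only the plain write-mode handle should report nil; the read-mode handle reports whatever Encoding.default_external is on the machine running it.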

irb(main):001:0> Encoding.default_external
=> #<Encoding:UTF-8>
irb(main):002:0> Encoding.default_internal
=> nil
irb(main):003:0> File.open("x","w"){|io| p io.external_encoding; io.puts "aä"}
nil
=> nil
irb(main):004:0> File.open("x","r:UTF-8"){|io| p io.external_encoding; io.read}
#<Encoding:UTF-8>
=> "aä\n"

Apparently the file *is* encoded in UTF-8 because I can read it without
errors and get what I expect.

Kind regards

robert
 
Brian Candler

Albert Schlef wrote in post #988363:
In other words, the file's "external encoding" is nil. What does this
mean? Shouldn't this be "UTF-8", the default external encoding?

Depends what you mean by "shouldn't be". The rules for encodings in ruby
1.9 are (IMO) arbitrary and inconsistent.

In the case of external encodings: yes, they default to nil for files
opened in write mode. This means that no transcoding is done on output.
For example, if you have a String which happens to contain binary, or
ISO-8859-1, it will be written out unchanged (i.e. the sequence of bytes
in the String is the same sequence of bytes which will end up in the
file).

If you want to transcode on output, you have to set the external
encoding explicitly.
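
To make the difference concrete, here is a small sketch that writes the same text three ways and looks at the raw bytes that land on disk. The helper name bytes_written and the scratch file are assumptions for illustration only.

```ruby
require "tmpdir"

# Write +string+ using +mode+ and return the raw bytes found in the file.
# (Hypothetical helper, not part of Ruby.)
def bytes_written(string, mode)
  Dir.mktmpdir do |dir|
    path = File.join(dir, "sample.txt")
    File.open(path, mode) { |f| f.write(string) }
    File.binread(path).bytes
  end
end

# "aä" tagged UTF-8: bytes 61 C3 A4
def utf8_sample
  "a\u00E4"
end

# The same text transcoded to ISO-8859-1: bytes 61 E4
def latin1_sample
  utf8_sample.encode("ISO-8859-1")
end

p bytes_written(utf8_sample,   "w")             # external encoding nil
p bytes_written(latin1_sample, "w")             # external encoding nil
p bytes_written(utf8_sample,   "w:ISO-8859-1")  # explicit external encoding
```

With plain "w" each string's own bytes go out untouched, whatever its tag says. Only the "w:ISO-8859-1" handle converts the UTF-8 string on the way out.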

Since none of this is documented anywhere officially, I attempted to
reverse engineer it. I've documented about 200 behaviours here:
https://github.com/candlerb/string19/blob/master/string19.rb

For my own code, I still use ruby 1.8 exclusively.
 
Brian Candler

Robert K. wrote in post #988404:
I'd say it means that the default encoding is used.

No, it doesn't.

Apparently the file *is* encoded in UTF-8 because I can read it without
errors

ruby 1.9 does not give errors if you read a file which is not UTF-8
encoded when the external encoding is UTF-8. You will just get strings
for which valid_encoding? returns false.

It will give errors if you attempt UTF-8 regexp matches on the data
though.

The rules for which methods give errors and which don't are pretty odd.
For example, string[n] doesn't give an exception, even if the string is
invalid.
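
A sketch of the three behaviours just described, assuming a Ruby new enough to have File.binwrite (it is not in 1.9.1). The helper names and the scratch file are invented for the example.

```ruby
require "tmpdir"

# Store the ISO-8859-1 bytes for "aä" (61 E4), then read them back while
# claiming the file is UTF-8. (Hypothetical helper.)
def read_latin1_as_utf8
  Dir.mktmpdir do |dir|
    path = File.join(dir, "latin1.txt")
    File.binwrite(path, [0x61, 0xE4].pack("C*"))
    File.open(path, "r:UTF-8") { |f| f.read }
  end
end

# True if a regexp match against +string+ raises ArgumentError.
def regexp_match_raises?(string)
  string =~ /a/
  false
rescue ArgumentError
  true
end

s = read_latin1_as_utf8       # the read itself raises nothing
p s.encoding                  # tagged UTF-8 regardless of content
p s.valid_encoding?           # the mismatch only shows up here
p s[1].bytes                  # indexing works on the invalid string
p regexp_match_raises?(s)     # a regexp match is where it blows up
```

So a read that "works" says nothing about the file's real encoding; valid_encoding? is the check that does.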
 
Robert Klemme

Robert K. wrote in post #988404:

No, it doesn't.

So, which encoding is used then? An encoding *has* to be used because
you cannot write to a file without a particular encoding. There needs
to be a defined mapping between character data and bytes in the file.

ruby 1.9 does not give errors if you read a file which is not UTF-8
encoded when the external encoding is UTF-8. You will just get strings
for which valid_encoding? returns false.

I could see in the console that the file was read properly. Also:

irb(main):001:0> File.open("x","w"){|io| p io.external_encoding; io.puts "aä"}
nil
=> nil
irb(main):002:0> s = File.open("x","r:UTF-8"){|io| p io.external_encoding; io.read}
#<Encoding:UTF-8>
=> "aä\n"
irb(main):003:0> s.valid_encoding?
=> true

It will give errors if you attempt UTF-8 regexp matches on the data
though.

The rules for which methods give errors and which don't are pretty odd.
For example, string[n] doesn't give an exception, even if the string is
invalid.

I would concede that encodings in Ruby are pretty complex. It's easier
in Java where String never has a particular encoding and only reading
and writing uses encodings. However, Java's Strings were not capable of
handling all Asian symbols as I have learned on this list. Since 1.5
they managed to increase the range of Unicode codepoints which can be
covered - at the cost of making String handling a mess:

http://download.oracle.com/javase/6/docs/api/java/lang/String.html#codePointAt(int)

Now suddenly String.length() no longer returns the length in real
characters (code points) but rather the length in chars. I figure,
Ruby's solution might not be so bad after all.

Kind regards

robert
 
Brian Candler

Robert K. wrote in post #988429:
So, which encoding is used then?

None.

An encoding *has* to be used because
you cannot write to a file without a particular encoding.

Untrue. In Unix, read() and write() just work on sequences of bytes, and
have no concept of encoding.

Perhaps you are thinking of a language like Python 3, where there is a
distinction between "characters" and "bytes representing those
characters" (maybe Java has that distinction too, I don't know enough
about Java to say).

In ruby 1.9, every String is a bunch of bytes plus an encoding tag. When
you write this out to a file, and the external encoding is nil, then
just the bytes are written, and the encoding is ignored.
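
The "bytes plus a tag" model can be seen by retagging a string without touching its bytes. A rough sketch follows; tagged_pair and bytes_on_disk are names made up for this example.

```ruby
require "tmpdir"

# Write +string+ in plain "w" mode (external encoding nil) and return the
# bytes that ended up in the file. (Hypothetical helper.)
def bytes_on_disk(string)
  Dir.mktmpdir do |dir|
    path = File.join(dir, "x")
    File.open(path, "w") { |f| f.write(string) }
    File.binread(path).bytes
  end
end

# The same three bytes carrying two different encoding tags.
def tagged_pair
  utf8   = "a\u00E4"                              # 61 C3 A4, tagged UTF-8
  latin1 = utf8.dup.force_encoding("ISO-8859-1")  # same bytes, new tag only
  [utf8, latin1]
end

utf8, latin1 = tagged_pair
p utf8.bytes == latin1.bytes                      # identical byte content
p [utf8.length, latin1.length]                    # the tag changes the character count
p bytes_on_disk(utf8) == bytes_on_disk(latin1)    # the tag has no effect on the file
```

force_encoding changes only the tag, which is why the character counts differ while the files come out byte-for-byte the same.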

I could see in the console that the file was read properly.

What you see in the console in irb does not necessarily mean much in
ruby 1.9, because STDOUT.external_encoding is nil by default too.

irb(main):001:0> File.open("x","w"){|io| p io.external_encoding; io.puts "aä"}
nil
=> nil
irb(main):002:0> s = File.open("x","r:UTF-8"){|io| p io.external_encoding; io.read}
#<Encoding:UTF-8>
=> "aä\n"
irb(main):003:0> s.valid_encoding?
=> true

Now, that's more complex, and *does* show that the data is valid UTF-8.
(I wasn't arguing that it wasn't; I was arguing that your logic was
flawed, because even if the data were not valid UTF-8, your program
would have run without raising an error. Therefore the fact that it runs
without error is insufficient to show that the data is valid UTF-8.)

[In Java]
Now suddenly String.length() no longer returns the length in real
characters (code points) but rather the length in chars. I figure,
Ruby's solution might not be so bad after all.

Of course, even in Unicode, the number of code points is not necessarily
the same as the number of glyphs or "printable characters".
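
A quick illustration of that last point, assuming a much newer Ruby than the 1.9 discussed here (String#grapheme_clusters was only added in 2.5). The method name combining_e_acute is just for the example.

```ruby
# "é" spelled as a base letter followed by U+0301 COMBINING ACUTE ACCENT.
# (Hypothetical helper.)
def combining_e_acute
  "e\u0301"
end

s = combining_e_acute
p s.bytesize                   # bytes in the UTF-8 representation
p s.codepoints.length          # Unicode code points
p s.grapheme_clusters.length   # user-perceived characters
```

The three counts all differ for this one visible character, so "length" depends on which unit you are counting in.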
