Raw bytes in 1.9

Kless · Jul 27, 2009

I want to stick raw bytes (0-255) into a variable.

This works in Ruby 1.8, since it always assumes that the characters in
strings are exactly bytes. But I'm not sure about Ruby 1.9, as it has
per-string character encodings.

David A. Black · Jul 27, 2009

Hi --

Kless said:
I want to stick raw bytes (0-255) into a variable.

This works in Ruby 1.8, since it always assumes that the characters in
strings are exactly bytes. But I'm not sure about Ruby 1.9, as it has
per-string character encodings.

In 1.9 you can iterate through strings by line, byte, character, or
codepoint. I'm not sure exactly what you want to do but have a look at
String#bytes (aka #each_byte).
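As a rough illustration of byte-wise iteration, here is a minimal sketch. The helper names (byte_values, sample_bytes) and the sample data are made up for this example; only String#each_byte, String#bytesize and String#force_encoding are real methods.

```ruby
# Collect the numeric value of every byte in a string.
def byte_values(str)
  values = []
  str.each_byte { |b| values << b }
  values
end

# A three-byte sample, tagged as binary (ASCII-8BIT) so that no
# character interpretation is attached to it.
def sample_bytes
  "A\x00\xff".dup.force_encoding("ASCII-8BIT")
end

p byte_values(sample_bytes)   # the three byte values as integers
p sample_bytes.bytesize       # number of bytes, regardless of encoding
```

each_byte and bytesize look only at the underlying bytes, so they should give the same answer however the string is tagged.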

David

--
David A. Black / Ruby Power and Light, LLC / http://www.rubypal.com
Q: What's the best way to get a really solid knowledge of Ruby?
A: Come to our Ruby training in Edison, New Jersey, September 14-17!
Instructors: David A. Black and Erik Kastner
More info and registration: http://rubyurl.com/vmzN

Kless · Jul 27, 2009

I need to store raw strings like this one:
"V-\243\230mJ\262.\031\023-4\301\324\241Y"
and I would like to know whether there will be any problem with Ruby 1.9.

David A. Black · Jul 27, 2009

Hi --

Kless said:
I need to store raw strings like this one:
"V-\243\230mJ\262.\031\023-4\301\324\241Y"
and I would like to know whether there will be any problem with Ruby 1.9.

$ ruby191 -e 'p "V-\243\230mJ\262.\031\023-4\301\324\241Y"'
"V-\xA3\x98mJ\xB2.\x19\x13-4\xC1\xD4\xA1Y"

Why not install Ruby 1.9.1, or get an account on ruby-versions.net?

David


Brian Candler · Jul 27, 2009

Kless said:
I need to store raw strings like this one:
"V-\243\230mJ\262.\031\023-4\301\324\241Y"
and I would like to know whether there will be any problem with Ruby 1.9.

The answer is, "that depends": Ruby 1.9's string handling is extremely
complicated.

* If the string is a literal within the program source, then adding a
comment

# encoding: ASCII-8BIT

as the very first line of your program (or the second line if you have a
shebang line) will make literals have this encoding by default. Having
said that, strings with backslash-escapes like that will probably get
ASCII-8BIT by default.

* If the string comes from reading a file, then you need to open it in
binary mode: File.open("xxx","rb") { |f| ... }

* If the string comes from reading from a socket, then I believe it will
be ASCII-8BIT by default

* If the string comes from reading STDIN, then you will have to be very
careful; for safety you need something like

STDIN.set_encoding "ASCII-8BIT"

Your program may or may not work without these changes, because Ruby
1.9's behaviour at runtime depends on settings in your environment. That
is, the same program with the same data might work on one computer but
crash on another computer. Using the above incantations is your first
line of defense against this stupidity.
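To make the file case concrete, here is a small sketch that writes a few raw bytes to a scratch file and reads them back in binary mode. The helper name read_back_binary is invented for this example, and Tempfile is used only to keep it self-contained.

```ruby
require "tempfile"

# Write the given byte values to a scratch file, then read them back
# with mode "rb" so the result is tagged ASCII-8BIT.
def read_back_binary(byte_values)
  tmp = Tempfile.new("raw")
  tmp.binmode
  tmp.write(byte_values.pack("C*"))
  tmp.close
  File.open(tmp.path, "rb") { |f| f.read }
ensure
  tmp.close! if tmp   # close and delete the scratch file
end

data = read_back_binary([0x56, 0x2d, 0xa3, 0x98])
p data.encoding
p data.bytes.to_a
```

Opening with "rb" tags the data as ASCII-8BIT whatever the locale or default encodings of the machine happen to be, which is the point of the incantation.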

Then you need to be sure that every single method that you call in other
people's libraries, which takes string arguments or returns string
values, behaves in the way you want. For example, if you call
Library.foo and it returns a string whose encoding is UTF-8 and contains
characters with the high bit set, and you try to concatenate it with one
of your own binary strings, the program will crash.

Here's a somewhat contrived example:

-------- main.rb (your program) --------
# encoding: ASCII-8BIT

require 'library'
binary_data = "\xff\xee\xdd"
msg = Library.err_to_str
binary_data << [msg.bytesize].pack("N")
binary_data << msg

-------- library.rb (someone else's code that you don't control) --------
# encoding: UTF-8

module Library
  def self.err_to_str
    "über-error"
  end
end

$ ruby19 main.rb
main.rb:7:in `<main>': incompatible character encodings: ASCII-8BIT and
UTF-8 (Encoding::CompatibilityError)

Your only way to protect against this is to force encodings at every
point where two strings of differing provenance might encounter each
other. e.g.

msg = Library.err_to_str
binary_data << [msg.bytesize].pack("N")
msg.force_encoding "ASCII-8BIT"
binary_data << msg
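The same idea can be wrapped in a helper so that the retagging happens in one place. This is only a sketch: library_message is a hypothetical stand-in for Library.err_to_str, and append_bytes, build_frame and naive_append_fails? are names invented here. It works on a copy, so the caller's string keeps its original encoding.

```ruby
# Hypothetical stand-in for the library call: a UTF-8 string that
# contains one non-ASCII character (written as a \u escape).
def library_message
  "\u00fcber-error"
end

# A fresh binary buffer holding three high-bit bytes.
def new_buffer
  "\xff\xee\xdd".dup.force_encoding("ASCII-8BIT")
end

# Append str to buffer as raw bytes, whatever encoding str carries.
def append_bytes(buffer, str)
  buffer << str.dup.force_encoding("ASCII-8BIT")
end

# Length-prefixed frame: 3 marker bytes, 4-byte length, message bytes.
def build_frame
  buffer = new_buffer
  msg = library_message
  buffer << [msg.bytesize].pack("N")
  append_bytes(buffer, msg)
end

# Returns true if appending without retagging raises.
def naive_append_fails?
  new_buffer << library_message
  false
rescue Encoding::CompatibilityError
  true
end

p naive_append_fails?
p build_frame.bytesize
```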

Beware also that ruby 1.9's documentation is often either missing or
misleading when it comes to character encodings. For example, ri19
Array#pack says:

Directive    Meaning
---------------------------------------------------------------
    @     |  Moves to absolute position
    A     |  arbitrary binary string (space padded, count is width)
    a     |  arbitrary binary string (null padded, count is width)

So you might expect that an arbitrary String can be packed using a*:

# encoding: ASCII-8BIT

require 'library'
binary_data = "\xff\xee\xdd"
msg = Library.err_to_str
binary_data << [msg.bytesize,msg].pack("Na*") # CRASH
puts binary_data.inspect

No, you still need a msg.force_encoding "ASCII-8BIT" before the pack.
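A sketch of the pack with the encoding forced first, plus the matching unpack. The names pack_message and unpack_message are invented for this example, and the assumption that the receiver wants UTF-8 back is mine. Whether the unforced version actually raises may depend on the Ruby build; retagging first should be harmless either way.

```ruby
# Pack a string as a 4-byte big-endian length followed by its bytes.
# The copy is retagged as binary before it goes anywhere near pack.
def pack_message(msg)
  bytes = msg.dup.force_encoding("ASCII-8BIT")
  [bytes.bytesize, bytes].pack("Na*")
end

# Reverse of pack_message; assumes the payload was UTF-8 text.
def unpack_message(frame)
  length, rest = frame.unpack("Na*")
  rest[0, length].force_encoding("UTF-8")
end

frame = pack_message("\u00fcber-error")
p frame.encoding
p unpack_message(frame) == "\u00fcber-error"
```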

If all this scares you - and it does me - then remember that staying
with ruby 1.8 is a reasonable alternative. Ruby 1.8.6 is going to be
maintained for a long time going forward, thanks to the people at
EngineYard and Phusion Passenger.

HTH,

Brian.

Brian Candler · Jul 27, 2009

I should add: if ruby 1.9 *always* gave an exception when an ASCII-8BIT
string encountered a UTF-8 String, it wouldn't be a problem: your unit
tests would pick up the failure quickly.

But maybe in this library you're using, 99% of the error messages don't
have any extended characters (i.e. those with the top bit set). Those
will work fine, even if tagged as UTF-8. It's only on the occasion where
the library decides to return a string which is tagged UTF-8 *and*
contains extended characters that the runtime crash will occur - and
this means you're always wondering whether you have sufficient coverage.
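The data-dependence can be seen directly with Encoding.compatible?, which returns nil when two strings cannot be combined. A minimal sketch; mixes_with_binary? is a name invented for this example.

```ruby
# True if str could be appended to a binary string that contains
# high-bit bytes without raising Encoding::CompatibilityError.
def mixes_with_binary?(str)
  binary = "\xff".dup.force_encoding("ASCII-8BIT")
  !Encoding.compatible?(binary, str).nil?
end

plain    = "plain error".dup.force_encoding("UTF-8")
extended = "\u00fcber-error"   # \u escape gives a UTF-8 string

p mixes_with_binary?(plain)      # ASCII-only content
p mixes_with_binary?(extended)   # contains a non-ASCII character
p plain.ascii_only?
p extended.ascii_only?
```

Both strings carry the same UTF-8 tag; only their content decides whether the concatenation goes through.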

As a workaround, you might have to add extra unit tests which stub out
the library and force it to return a message with high-bit characters in
it, and check that your program behaves as expected. But mocking every
single library API which might return a string is really painful.
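One possible shape for such a test, as a sketch only. Here Library is a local stand-in for the third-party module, frame_error is a hypothetical function under test that accepts the message source as a parameter, and HighBitStub is the stub that forces the awkward case. Plain raise is used instead of a test framework to keep it short.

```ruby
# Local stand-in for the third-party module; normally returns ASCII.
module Library
  def self.err_to_str
    "plain error"
  end
end

# Hypothetical code under test: builds a binary frame from whatever
# the message source returns. The source is injectable for testing.
def frame_error(source = Library)
  msg = source.err_to_str.dup.force_encoding("ASCII-8BIT")
  frame = "\xff\xee\xdd".dup.force_encoding("ASCII-8BIT")
  frame << [msg.bytesize].pack("N") << msg
end

# Stub that always returns a UTF-8 message with a high-bit character.
class HighBitStub
  def err_to_str
    "\u00fcber-error"
  end
end

# The everyday path and the forced high-bit path should both succeed.
raise "plain path broke" unless frame_error.encoding == Encoding::ASCII_8BIT
raise "high-bit path broke" unless frame_error(HighBitStub.new).encoding == Encoding::ASCII_8BIT
```

Passing the source in as a parameter is just one way to make the stub possible; it does nothing about the number of library calls that would need the same treatment.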

Brian Candler · Jul 27, 2009

Brian said:
Having said that, strings with backslash-escapes like that will probably get
ASCII-8BIT by default.

P.S: to check this you must actually write and run a standalone program
file. irb is not a good predictor of behaviour, nor is piping a program
to ruby on stdin.

$ irb19 --simple-prompt
$ ruby19
p "\xff".encoding
^D
#<Encoding:UTF-8>

$ cat >test.rb
p "\xff".encoding
^D
$ ruby19 test.rb
#<Encoding:ASCII-8BIT>
$

This is with:

$ ruby19 -v
ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]

compiled from source under Ubuntu Jaunty.

Marc Heiler · Jul 27, 2009

What would be cool is a Ruby 1.9 where the whole encoding stuff is
completely optional, so that things work as they do in Ruby 1.8.
