Reasons to use a buffer in IO::read?

Steve Midgley

Hi Ruby people,

I'm wondering what the functional and performance differences might be
between the two statements below? Assume 'io' is an IO instance with
gobs of data in it. Assume 'file' is an open file instance with write
access:

until io.eof? do
  file.write(io.read(10485760))
end

buffer = ''
until io.eof? do
  buffer = io.read(10485760)
  file.write(buffer)
end

I see that Ruby provides for a buffer and I'm wondering what the
reason is? I read this article but am still not clear on the benefit
of a buffer at all:

http://rcoder.net/content/fast-ruby-io

I'm wondering if providing a buffer might reduce malloc issues and
speed things up? I can't see any other reason to use one..

Thanks in advance for any information!

Steve
 
MonkeeSage

$ ri IO#read
---------------------------------------------------------------- IO#read
ios.read([length [, buffer]]) => string, buffer, or nil
------------------------------------------------------------------------
Reads at most _length_ bytes from the I/O stream, or to the end of
file if _length_ is omitted or is +nil+. _length_ must be a
non-negative integer or nil. If the optional _buffer_ argument is
present, it must reference a String, which will receive the data.

At end of file, it returns +nil+ or +""+ depend on _length_.
+_ios_.read()+ and +_ios_.read(nil)+ returns +""+.
+_ios_.read(_positive-integer_)+ returns nil.

f = File.new("testfile")
f.read(16) #=> "This is line one"

So...

buffer = ""
file.write(io.read(nil, buffer))
print "I read this stuff ", buffer, "\n"
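
A minimal sketch of what the ri text above describes, assuming a StringIO stands in for the IO object (StringIO#read accepts the same length and buffer arguments). The sample text and the variable names are made up for illustration:

```ruby
require 'stringio'

# Assumption: StringIO is used here only so the example runs without a file.
io = StringIO.new("This is line one\nThis is line two\n")
buffer = String.new                  # a mutable string that will receive the data

result = io.read(16, buffer)         # fills buffer with the first 16 bytes
same_object = result.equal?(buffer)  # read hands back the buffer itself
first_chunk = buffer.dup             # copy, because buffer is overwritten below

rest = io.read(nil, buffer)          # nil length: read to the end, into the same buffer

at_eof_with_length    = io.read(16)  # nil at end of file for a positive length
at_eof_without_length = io.read      # "" at end of file when length is omitted

puts "same object?   #{same_object}"
puts "first chunk:   #{first_chunk.inspect}"
puts "rest:          #{rest.inspect}"
puts "EOF, length:   #{at_eof_with_length.inspect}"
puts "EOF, no length: #{at_eof_without_length.inspect}"
```

The nil return at end of file is what makes a `while io.read(length, buffer)` loop terminate.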

Regards,
Jordan
 
Steve Midgley

Thanks Jordan. How is your code different (if at all) from:

buffer = io.read
file.write(buffer)
print "I read this stuff ", buffer, "\n"

Am I missing something? I just don't see why buffer is useful - is it
a performance benefit or some kind of syntax improvement that I'm
missing? The only thing I can see is that it has some kind of low
level malloc optimization if the same string size is passed in
repeatedly during partial writes.

Steve
 
MonkeeSage

I don't know if there is any optimization in the back end, but it lets
you pass the results of io.read to another method and also put them in
buffer at the same time. But since you can do that with assignment, I
don't really see any point to it (I was just trying to give an example
as the docs describe). To me, unless, as you say, there is some
optimization going on in the back end, this code...

buffer = ""
file.write(io.read(nil, buffer))
print "I read this stuff ", buffer, "\n"

...looks the same as this code...

file.write(buffer = io.read)
print "I read this stuff ", buffer, "\n"

Regards,
Jordan
 
Robert Klemme

buffer = ''

This line above is completely superfluous.

until io.eof? do
  buffer = io.read(10485760)
  file.write(buffer)
end

Thanks Jordan. How is your code different (if at all) from:

buffer = io.read
file.write(buffer)
print "I read this stuff ", buffer, "\n"

Am I missing something? I just don't see why buffer is useful - is it
a performance benefit or some kind of syntax improvement that I'm
missing?

Yes, the string referenced by buffer is reused. This leads to
improved performance for the typical application which is like this:

buffer = ""
while io.read(1024, buffer)
  file.write buffer
end

The only thing I can see is that it has some kind of low
level malloc optimization if the same string size is passed in
repeatedly during partial writes.

Exactly (see above). Note that it is very inefficient to read with
such a large chunk size as you use in your original posting. If you
want to read the whole file you can simply do io.read.
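
The loop above can be tried out as a small, self-contained sketch. The names `copy_in_chunks` and `CHUNK_SIZE` are hypothetical, and StringIO objects stand in for `io` and `file` so that it runs without touching the disk:

```ruby
require 'stringio'

CHUNK_SIZE = 1024  # hypothetical chunk size, matching the 1024 used above

# Copies src to dst in fixed-size chunks, reusing a single String as the buffer.
# Returns the number of bytes written.
def copy_in_chunks(src, dst, chunk_size = CHUNK_SIZE)
  buffer = String.new
  total = 0
  # read returns nil at end of file, which ends the loop
  while src.read(chunk_size, buffer)
    total += dst.write(buffer)
  end
  total
end

src = StringIO.new("x" * 5000)   # stand-in for 'io'
dst = StringIO.new               # stand-in for 'file'
copied = copy_in_chunks(src, dst)

puts "copied #{copied} bytes"
```

For a plain copy with no per-chunk processing, Ruby's built-in IO.copy_stream (available since 1.9) is usually the simpler choice.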

Kind regards

robert
 
Jano Svitok

I'd *assume* the former (passing the buffer to read) saves you a bunch
of allocations when looping through a file (I assume the buffer is
reused instead of allocating a new one for each iteration).

i.e.

buffer = ""
File.open('xxx', 'r') do |f|
  while f.read(1024, buffer) do
    process(buffer)
  end
end

vs.

File.open('xxx', 'r') do |f|
  while true do
    buffer = f.read(1024)
    break if buffer.nil?   # read(length) returns nil at end of file
    process(buffer)
  end
end
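
One rough way to check that assumption on a current Ruby is to count object allocations around each loop. This is only a sketch: the `allocations` helper is hypothetical, a temporary file stands in for 'xxx', and it relies on `GC.stat(:total_allocated_objects)`, which exists in Ruby 2.2 and later. The exact counts will vary by Ruby version, so none are shown.

```ruby
require 'tempfile'

# Hypothetical helper: number of Ruby objects allocated while the block runs.
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

reused = fresh = nil

Tempfile.create('read_buffer_demo') do |f|
  f.write('x' * (1024 * 200))   # 200 chunks of 1024 bytes
  f.flush

  # Variant 1: one String, passed in as the buffer on every read
  f.rewind
  reused = allocations do
    buffer = String.new
    bytes = 0
    bytes += buffer.bytesize while f.read(1024, buffer)
  end

  # Variant 2: read returns a brand-new String on every iteration
  f.rewind
  fresh = allocations do
    bytes = 0
    while (chunk = f.read(1024))
      bytes += chunk.bytesize
    end
  end
end

puts "objects allocated with a reused buffer: #{reused}"
puts "objects allocated without a buffer:     #{fresh}"
```

This counts Ruby objects only. It says nothing about how often the underlying memory is reallocated.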
 
MonkeeSage

I'd *assume* the former saves you a bunch of allocations when looping
through a file
(I assume the buffer is reused instead of allocating a new one for
each iteration).

I'm not the smartest C programmer (or the smartest anything
programmer), but I'm not seeing any optimization in the actual C code.
Please correct me if I'm wrong.

First, io_read() is the function called in the backend from IO#read.
The relevant lines are:

====
    rb_scan_args(argc, argv, "02", &length, &str);

    if (NIL_P(length)) {
        if (!NIL_P(str)) StringValue(str);
        GetOpenFile(io, fptr);
        rb_io_check_readable(fptr);
        return read_all(fptr, remain_size(fptr), str);
    }
    len = NUM2LONG(length);
    if (len < 0) {
        rb_raise(rb_eArgError, "negative length %ld given", len);
    }

    if (NIL_P(str)) {
        str = rb_tainted_str_new(0, len);
    }
    else {
        StringValue(str);
        rb_str_modify(str);
        rb_str_resize(str,len);
    }
====

So we see that we get a new string from rb_tainted_str_new if buffer
is not passed in to IO#read; otherwise str is used and we call
StringValue on it.

So what is StringValue? A macro defined in ruby.h:

====
#define StringValue(v) rb_string_value(&(v))
====

And what is rb_string_value()? A function from string.c:

====
static char *null_str = "";

VALUE
rb_string_value(ptr)
    volatile VALUE *ptr;
{
    VALUE s = *ptr;
    if (TYPE(s) != T_STRING) {
        s = rb_str_to_str(s);
        *ptr = s;
    }
    if (!RSTRING(s)->ptr) {
        FL_SET(s, ELTS_SHARED);
        RSTRING(s)->ptr = null_str;
    }
    return s;
}
====

So if it's not a string, we convert it to one; and if it has no data
pointer yet, we point it at the empty null_str.

But the interesting lines are back up in io_read():

====
    rb_str_modify(str);
    rb_str_resize(str,len);
====

Now rb_str_modify() (string.c) is called with our string. And it in
turn calls str_make_independent() (as far as I can tell, only when the
string still shares its storage):

====
static void
str_make_independent(str)
    VALUE str;
{
    char *ptr;

    ptr = ALLOC_N(char, RSTRING(str)->len+1);
    if (RSTRING(str)->ptr) {
        memcpy(ptr, RSTRING(str)->ptr, RSTRING(str)->len);
    }
    ptr[RSTRING(str)->len] = 0;
    RSTRING(str)->ptr = ptr;
    RSTRING(str)->aux.capa = RSTRING(str)->len;
    FL_UNSET(str, STR_NOCAPA);
}
====

And finally, rb_str_resize is called:

====
VALUE
rb_str_resize(str, len)
    VALUE str;
    long len;
{
    if (len < 0) {
        rb_raise(rb_eArgError, "negative string size (or size too big)");
    }

    rb_str_modify(str);
    if (len != RSTRING(str)->len) {
        if (RSTRING(str)->len < len || RSTRING(str)->len - len > 1024) {
            REALLOC_N(RSTRING(str)->ptr, char, len+1);
            if (!FL_TEST(str, STR_NOCAPA)) {
                RSTRING(str)->aux.capa = len;
            }
        }
        RSTRING(str)->len = len;
        RSTRING(str)->ptr[len] = '\0'; /* sentinel */
    }
    return str;
}
====

Now, like I said, I'm not the greatest C programmer...but I fail to
see how, if I'm reading the code above correctly, passing in a buffer
string to IO#read is any more optimal than creating a new string (even
when looping many times), since it appears to me to be doing the same
thing (compare str_new from string.c, which is what rb_tainted_str_new
calls).
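
Seen from the Ruby side, the resize step can be observed without reading any C. The following is a sketch only, assuming a StringIO behaves like the IO in question for this purpose; the sample data and variable names are invented:

```ruby
require 'stringio'

# Assumption: StringIO stands in for a real IO so the example is self-contained.
io = StringIO.new("abcdefghij")
buffer = "previous contents that are much longer".dup
id_before = buffer.object_id

io.read(4, buffer)               # buffer is resized to 4 bytes, old contents are gone
after_first = buffer.dup

io.read(100, buffer)             # only 6 bytes remain, so buffer ends up 6 bytes long
after_second = buffer.dup

eof_result = io.read(4, buffer)  # at end of file: returns nil and leaves buffer empty
same_object = buffer.object_id == id_before

puts "after first read:  #{after_first.inspect}"
puts "after second read: #{after_second.inspect}"
puts "at EOF: result=#{eof_result.inspect}, buffer=#{buffer.inspect}"
puts "same String object throughout? #{same_object}"
```

The String object stays the same across all three calls. Only its length and contents change, which is the resize path rather than the new-string path.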

Regards,
Jordan

----
References:

http://svn.ruby-lang.org/repos/ruby/branches/ruby_1_8/io.c
http://svn.ruby-lang.org/repos/ruby/branches/ruby_1_8/ruby.h
http://svn.ruby-lang.org/repos/ruby/branches/ruby_1_8/string.c
 
MonkeeSage

Oh...wait...I'm completely dense. Duh! io_read() is going to create /
re-initialize a string anyway to put its results in. So if I create
a new string independently to store the return value of IO#read, then
I'm causing an extra allocation and copy. Sorry for wasting space.
Have pity on mentally handicapped people like me. :p

Regards,
Jordan
 
Robert Klemme

2007/12/7 said:
Oh...wait...I'm completely dense. Duh! io_read() is going to create /
re-initialize a string anyway to put its results in. So if I create
a new string independently to store the return value of IO#read, then
I'm causing an extra allocation and copy. Sorry for wasting space.
Have pity on mentally handicapped people like me. :p

LOL

Also, allocating a String instance involves not only the raw malloc of
the memory but also the bookkeeping needed for GC, so it is more
expensive than a simple resize. Note also that if you loop with code
like the one I showed, the length of the string instance is adjusted
only *once*, because all chunks have the same length or are shorter
(the last one, potentially).

Kind regards

robert
 
