C
Curt Sampson
I'm writing a C extension that involves fast scanning through and
parsing of tab-delimited files. Basically, I mmap the file, figure out
where the row and column boundaries are, and for each row end up with
an array of strings (pointer and length) that I then pass on to other
C or Ruby code. The array and its strings are not supposed to be
modified by the callees, only read. I can also live with the callees
being required to make their own copies of the strings and arrays if
they need to keep the data accessible after the call, if I can figure
out some way to enforce that.
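To make the scanning step concrete, here is a rough sketch of the kind of
row splitter described above. It is not the actual extension code: the
names field_view and split_row are made up for illustration, it assumes
rows end in '\n' and fields are separated by '\t', and it touches no Ruby
API at all. It only records pointer/length pairs into the buffer (which
would be the mmap'd region), so nothing is allocated or copied.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type: a borrowed view into the mapped buffer.
 * No copy is made and the bytes are not NUL-terminated. */
typedef struct {
    const char *ptr;
    size_t      len;
} field_view;

/*
 * Hypothetical helper: split the row starting at `row` (never reading
 * at or past `end`) into at most `max_fields` views. Fields beyond
 * `max_fields` are silently dropped. Stores the number of views written
 * in *nfields and returns the start of the next row, or `end` if this
 * was the last row.
 */
static const char *
split_row(const char *row, const char *end,
          field_view *fields, size_t max_fields, size_t *nfields)
{
    assert(row != NULL && end != NULL && row <= end);
    assert(fields != NULL && nfields != NULL);

    /* The row stops at the next newline, or at the end of the buffer. */
    const char *eol = memchr(row, '\n', (size_t)(end - row));
    const char *row_end = eol ? eol : end;
    const char *p = row;
    size_t n = 0;

    while (n < max_fields) {
        /* Each field stops at the next tab, or at the end of the row. */
        const char *tab = memchr(p, '\t', (size_t)(row_end - p));
        const char *field_end = tab ? tab : row_end;

        fields[n].ptr = p;
        fields[n].len = (size_t)(field_end - p);
        n++;

        if (tab == NULL)
            break;
        p = tab + 1;
    }

    *nfields = n;
    return eol ? eol + 1 : end;
}
```

The views are only valid while the underlying mapping is, which is
exactly the lifetime problem I'm asking about: any callee that wants to
keep a field past the call has to copy it.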
It appears to me that this means I don't really have any need to
copy the data; I ought to just be able to set up a bunch of (likely
frozen) String objects and then tweak the ptr and len on them and pass
them around, avoiding any allocations or data copies. From a bit of
experimentation, I can see that dropping several calls to rb_str_new for
each row results in an enormous speed increase--about ten-fold--in how
fast I can scan through the file.
Does anybody have any suggestions on a reasonably safe way to do this?
cjs