Question of reference and (sub)strings.

S

Steve [RubyTalk]

Imagine comparing using C++ and Ruby to manipulate some strings. I would
like to confirm some assumptions I am making about in-memory efficiency
with Ruby. (This explains why the example seems contrived and I don't
want to be told to tackle the problem a different way, such as splitting
up the input as it is read from file/sockets etc... :) )

Assume I've got a large in-memory sequence of bytes (this would be
represented in C(++) using a malloc block) and in ruby as a String. I
have a pre-defined (non-trivial, potentially computationally expensive)
function which calculates a sequence of offsets into the String subject
to some arbitrary criteria... and subsequently I wish to reference
sub-strings (i.e. strings between two successive offsets) as if they
were independent of the original string (though, of course, each having
a fixed length.) N.B. This could be done 'cheaply' using pointers into
the original string if using C/C++.

Given that I only want to compute the offsets once, an obvious solution
would be to construct an Array of String - each element representing a
sub-string of the original... but this would double memory use. What
would be the best way to avoid duplicating the character sequences and
causing run-time bloat?

By corollary, if I had a large number of Strings, what would be the most
memory efficient way to represent their concatenation? If I had n mK
stings, would I need another n*mK contiguous block of memory to
represent their concatenation?
 
A

Adam Shelly

Given that I only want to compute the offsets once, an obvious solution
would be to construct an Array of String - each element representing a
sub-string of the original... but this would double memory use. What
would be the best way to avoid duplicating the character sequences and
causing run-time bloat?

You don't need to create the substrings until you need them: Here's a
test which only allocates memory for the substrings as needed. As far
as I can tell, the substrings are released back to the GC as soon as
you are done with them.

class IndexedString < String
def initialize s, &indexer
super s
@offsets =3D yield self
self
end
def substring n
(0...(@offsets.size-1)) =3D=3D=3D n ?
self[@offsets[n]...@offsets[n+1]] :
nil
end
end

def build_index s
indexes =3D [0]
while (n=3Ds.index('.',indexes.last+1)) do indexes << n+1 end
indexes << s.length+1
end

s =3D IndexedString.new( (("TESTINGTESTING!!"*1024)+".")*100) do |str|
build_index(str)
end #=3D +1600K
s.substring 10; #+16K
s.substring 95; #+16K

By corollary, if I had a large number of Strings, what would be the most
memory efficient way to represent their concatenation? If I had n mK
stings, would I need another n*mK contiguous block of memory to
represent their concatenation?

Not necessarily: You can build a class which takes an array of
strings and returns
"superstrings" only as needed. Here are my passing test cases, but
the class still needs a little work before I'll post it:

def test_MegaString
small_strings =3D %w( this is a collection of small strings)
mega =3D MegaString.new(small_strings)
assert_equal("t",mega[0])
assert_equal("this",mega[0,4])
assert_equal("hisisaco",mega[1,8])
assert_equal("ollect",mega[8,6])
assert_equal("a", mega[6...7])
assert_equal("col", mega[7..10])
assert_equal("lstrings", mega[23,-1])
assert_equal("lstrings", mega[23,230])
assert_equal("thisisacollectionofsmallstrings", mega.to_s)
end



-Adam
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,995
Messages
2,570,236
Members
46,825
Latest member
VernonQuy6

Latest Threads

Top