My hack at the substring problem (nice one, James) is based on suffix
trees.
Suffix trees and suffix arrays and a great tool for substring
operations. The idea is to create a list of all the possible suffixes
in the original string. For "banana" this would be
banana
anana
nana
ana
na
a
Because the list contains an entry starting at every position in the
string, we know that all possible substrings of the original string
must occur at the start of one of these list elements. The substring
"nan", for example, occurs at the start of the 3rd entry.
Now sort them
a
ana
anana
banana
na
nana
Now we know that all substrings that start with the same sequence of
characters must occur at the start of items in the list that are
adjacent. Searching through to list to find these is a linear operation.
A suffix array is basically this data structure. For efficiency,
though, it doesn't store the actual strings. Instead it stores the
offset of the start of the string (for our example, this would be [6,
4, 2, 1, 5, 3]). You'll find a whole bunch of stuff on using suffix
arrays for searching and string matching, particularly in the area of
DNA sequencing.
It turns out that in Ruby, you don't always have to go to the trouble
of constructing the suffix array. When you take a substring in Ruby,
it doesn't copy the string data. Instead, it just constructs a new
string object (just a few words of memory) that references into the
original string. Only when either string is modified does the string
content get copied. This means the code for constructing the suffix
list is both simple and relatively efficient:
def initialize(text)
@suffix_list = []
len = text.length
len.times { |i| @suffix_list << text[i, len] } # the ,len] is a
hack...
@suffix_list.sort!
end
The only ugly part is the use of [i, len] to create the substring.
Technically it should be 1..len, but that constructs a new,
unnecessary range object. Using 'len' as the final index does the same
thing, because the substring stops at the end of the input string, but
it can be misleading to read.
The code to scan the suffix list looks at each adjacent pair, and sees
if it can find a longer matching substring at the start of each that
it has previously found. Because James disallowed overlapping
substrings, you have to be careful to look at no more that 1/2 of the
longest string. The basic loop looks like this:
s1 = @suffix_list[0]
1.upto(@suffix_list.size - 1) do |i|
s2 = @suffix_list
max_possible = s2.length / 2 # stop them overlapping
while # quick sanity check - saves doing the more expensive
substring if it fails
s1[max_length_so_far] == s2[max_length_so_far] &&
# stop strings from overlapping
max_length_so_far < max_possible &&
# brute force comparison
s1[0,max_plus_one] == s2[0,max_plus_one]
max_length_so_far = max_plus_one
max_plus_one += 1
found_at = i
end
s1 = s2
end
The while loop has three conditions. The last one is the money earner:
it looks for a match at the start of the two strings which is one
longer that the longest match found so far. If it finds it, it records
this as the new longest match, and then tries for a even longer one
with the current pair.
The second condition on the while loop stops us considering more than
1/2 the second string. Because our strings are sorted, if thereis
overlap, the second string will always be the one that contains the
overlapping segment of the first.
The first condition is just an optimization: it stops us doing the
more expensive substring comparison if the last characters of the two
substrings we're comparing don't match.
So, put it all together, and we have
# - - - - - - - - - - - - -
class CommonSeq
def initialize(text)
@suffix_list = []
len = text.length
len.times { |i| @suffix_list << text[i, len] } # the ,len] is a
hack...
@suffix_list.sort!
end
def find_substrings
max_length_so_far = 0
max_plus_one = 1 # save a little math in the loop
found_at = nil
# Look at all adjacent pairs of suffices.
s1 = @suffix_list[0]
1.upto(@suffix_list.size - 1) do |i|
s2 = @suffix_list
max_possible = s2.length / 2 # stop them overlapping
while # quick sanity check - saves doing the more expensive
substring if it fails
s1[max_length_so_far] == s2[max_length_so_far] &&
# stop strings from overlapping
max_length_so_far < max_possible &&
# brute force comparison
s1[0,max_plus_one] == s2[0,max_plus_one]
max_length_so_far = max_plus_one
max_plus_one += 1
found_at = i
end
s1 = s2
end
if found_at
suffix = @suffix_list[found_at]
[max_length_so_far, suffix[0, max_length_so_far]]
else
nil
end
end
end
if __FILE__ == $0
seq = CommonSeq.new(STDIN.read.chomp)
puts seq.find_substrings
end
# - - - - - - - - - - - - -
Run times are shown for the Illiad and War and Peace:
dave[tmp/seq] time ruby seq.rb <homer.txt
168
who are
perishing and coming to a bad end. We will, however, since you so
bid us, refrain from actual fighting, but we will make serviceable
suggestions to the Argives
ruby seq.rb < homer.txt 2.82s user 0.05s system 99% cpu 2.872 total
dave[tmp/seq] time ruby seq.rb <war.txt
63
The Project Gutenberg Literary Archive Foundation has been
ruby seq.rb < war.txt 12.49s user 0.17s system 99% cpu 12.737 total
Here's the somewhat basic set of tests I used
# - - - - - - - - - - - - -
require 'seq'
require 'test/unit'
class CommonSeq
attr_reader :suffix_list
end
class TestSeq < Test::Unit::TestCase
def test_basic_suffix_creation
cs = CommonSeq.new("banana")
assert_equal(%w{a ana anana banana na nana}, cs.suffix_list)
end
def test_empty
cs = CommonSeq.new("")
assert_nil cs.find_substrings
end
def test_length_one
cs = CommonSeq.new("a")
assert_nil cs.find_substrings
end
def test_length_two_no_match
cs = CommonSeq.new("ab")
assert_nil cs.find_substrings
end
def test_length_two_with_match
cs = CommonSeq.new("aa")
assert_equal [ 1, "a"], cs.find_substrings
end
def test_length_three_no_match
cs = CommonSeq.new("abc")
assert_nil cs.find_substrings
end
def test_length_three_adjacent_match
cs = CommonSeq.new("aab")
assert_equal [ 1, "a"], cs.find_substrings
end
def test_length_three_separated_match
cs = CommonSeq.new("aba")
assert_equal [ 1, "a"], cs.find_substrings
end
def test_does_not_find_overlapping_match_length_one
cs = CommonSeq.new("aaa")
assert_equal [ 1, "a"], cs.find_substrings
end
def test_does_not_find_overlapping_match_length_three
cs = CommonSeq.new("aaaa")
assert_equal [ 2, "aa"], cs.find_substrings
end
def test_does_not_find_overlapping_match_length_two
cs = CommonSeq.new("ababa")
assert_equal [ 2, "ab"], cs.find_substrings
end
def test_banana
cs = CommonSeq.new("banana")
assert_equal [ 2, "an"], cs.find_substrings
end
end
# - - - - - - - - - - - - -
I wouldn't be surprised if the idea of searching only 1/2 of the
second string to prevent overlaps is wrong..
Dave