[QUIZ] Longest Repeated Substring (#153)

James Gray · Jan 19, 2008

i would think that

'aaaaaa'.longest_repeating_substring #=> 'aaaaa'

the quiz did not say that the two strings could not overlap.

is this correct james?

Actually, it did. Here's the relevant line from the quiz:

"Repeated substrings may not overlap."

James Edward Gray II

Rados³aw Bu³at · Jan 19, 2008

PiAnYWFhYWFhJy5sb25nZXN0X3JlcGVhdGluZ19zdWJzdHJpbmcgIz0+ICdhYWFhYScKPgo+IHRo
ZSBxdWl6IGRpZCBub3Qgc2F5IHRoYXQgdGhlIHR3byBzdHJpbmdzIGNvdWxkIG5vdCBvdmVybGFw
Lgo+Cj4gaXMgdGhpcyBjb3JyZWN0IGphbWVzPwoKTm8sIGl0IGlzbid0LiBMb29rIGF0IGV4YW1w
bGUgZnJvbSBKYW1lczoKCkV4YW1wbGU6CgogICAgICAgJCBlY2hvIGJhbmFuYSB8IHJ1YnkgbG9u
Z2VzdF9yZXBlYXRlZF9zdWJzdHJpbmcucmIKICAgICAgIGFuCgogICAgICAgT1IKCiAgICAgICAk
IGVjaG8gYmFuYW5hIHwgcnVieSBsb25nZXN0X3JlcGVhdGVkX3N1YnN0cmluZy5yYgogICAgICAg
bmEKCkZvbGxvd2luZyB5b3VyIHJlYXNvbmluZyAgJ2JhbmFuYScgd291bGQgZ2l2ZSAnYW5hJywg
cmF0aGVyIHRoYW4gJ2FuJyAob3IgJ25hJykuCi0tIApSYWRvc7NhdyBCdbNhdAoKaHR0cDovL3Jh
ZGFyZWsuam9nZ2VyLnBsIC0gbfNqIGJsb2cK

ara.t.howard · Jan 19, 2008

"Repeated substrings may not overlap."

sorry. missed that...

cheers.

a @ http://drawohara.com/

ara.t.howard · Jan 19, 2008

Anybody found a good long text to work with yet? The text of the =20
U.S. Constitution perhaps? Tristram Shandy?

I would say that you should handle the biggest text you possibly =20
can.

James Edward Gray II

http://www.fordham.edu/halsall/ancient/homer-illiad.txt

a @ http://codeforpeople.com/

fw · Jan 19, 2008

http://www.fordham.edu/halsall/ancient/homer-illiad.txt

a @ http://codeforpeople.com/

There's also War and Peace: http://www.gutenberg.org/dirs/etext01/wrnpc12.txt

$ wc wrnpc12.txt
67403 564514 3285165 wrnpc12.txt

yermej · Jan 19, 2008

http://www.fordham.edu/halsall/ancient/homer-illiad.txt

a @http://codeforpeople.com/

For The Illiad, I got:
' who are
perishing and coming to a bad end. We will, however, since you so
bid us, refrain from actual fighting, but we will make serviceable
suggestions to the Argives' (168 characters)

yermej · Jan 19, 2008

There's also War and Peace:http://www.gutenberg.org/dirs/etext01/wrnpc12.txt

$ wc wrnpc12.txt
67403 564514 3285165 wrnpc12.txt

For War & Peace:
'

The Project Gutenberg Literary Archive Foundation has been ' (63
characters)

That one took awhile - 5m29s real time vs 1m2s real time for the
Illiad - though the increase is close to linear given the number of
characters in each text (3285165 vs. 806921).

Dave Thomas · Jan 19, 2008

That one took awhile - 5m29s real time vs 1m2s real time for the
Illiad - though the increase is close to linear given the number of
characters in each text (3285165 vs. 806921).

I've got the same output for both examples. Current times are 2.767s
and 12.894s elapsed on a 3GHz processor.

Ruby's clever copy-on-write string handling really shines here.

Dave

James Gray · Jan 19, 2008

So we need to take care of unicode too?

As always, it's your choice. That's a nice bonus though.

James Edward Gray II

Denis Hennessy · Jan 19, 2008

I've got the same output for both examples. Current times are 2.767s
and 12.894s elapsed on a 3GHz processor.

Ruby's clever copy-on-write string handling really shines here.

Hmm, I was happy with my solution (similiar times to yermej on 1.8GHz)
until I saw Dave's times. I'm really looking forward to seeing the code.

/dh

Raf Coremans · Jan 20, 2008

MjAwOC8xLzE4LCBKYW1lcyBHcmF5IDxqYW1lc0BncmF5cHJvZHVjdGlvbnMubmV0PjoKPgo+IE9u
IEphbiAxOCwgMjAwOCwgYXQgMzoxMCBQTSwgUmFkb3PFgmF3IEJ1xYJhdCB3cm90ZToKPgo+ID4g
T24gSmFuIDE4LCAyMDA4IDEwOjAwIFBNLCB5ZXJtZWogPHllcm1lakBnbWFpbC5jb20+IHdyb3Rl
Ogo+ID4KPiA+IE15IHF1ZXN0aW9uIGlzIChJJ20gbm90IGZhbWlsaWFyIHdpdGggUnVieVF1aXog
dG9vIG11Y2ggOikpOiBlcGlzb2RlCj4gPiBmb2N1cyBvbiBhbGdvcml0aG0gKHNwZWVkKSBvciBz
b3VyY2UgY29kZSAocmVhZGFibGUpPwo+Cj4gSG9wZWZ1bGx5IGJvdGguICA6KQo+Cj4gSmFtZXMg
RWR3YXJkIEdyYXkgSUkKPgoKCk15IHNvbHV0aW9uIG9mZmVycyBzcGVlZCBub3IgcmVhZGFiaWxp
dHk7IEkgd2VudCBmb3IgYSBvbmUtbGluZXIgKGlmIHlvdQpkaXNjb3VudCB0aGUgcmVxdWlyZSk6
Cg==

Denis Hennessy · Jan 20, 2008

Here's my solution to the quiz. It's has O(n) behaviour and takes
about 1 minute on the illiad and 5 minutes on war and peace on my
1.8GHz linux box (much longer on my powerbook).

It works by constructing a map from every letter in the text to an
array of its locations. It then iterates, increasing each string
(which sometimes creates splits) and removing strings which don't have
at least 2 locations. Thus, for each length n, the algorithm only has
to deal with the patterns which already matched with length n-1. This
is easiest to see by running it with the verbose option:

$ echo banana | ruby -v find_repeats.rb
ruby 1.8.6 (2007-09-23 patchlevel 110) [powerpc-darwin9.0.0]
Initial: {"a"=>[1, 3, 5], "b"=>[0], "n"=>[2, 4], "\n"=>[6]}
Filter (len=1): {"a"=>[1, 3, 5], "n"=>[2, 4]}
Grow (len=2): {"an"=>[1, 3], "na"=>[2, 4], "a\n"=>[5]}
Filter (len=2): {"an"=>[1, 3], "na"=>[2, 4]}
Grow (len=3): {"na\n"=>[4], "nan"=>[2], "ana"=>[1, 3]}
Filter (len=3): {}
an

Cheers,
Denis

find_repeats.rb:

#!/usr/bin/env ruby

text = ARGF.read
size = text.size

# Build a map from each (1-character) string to a list of its positions
roots = {}
size.times do |o|
s = text[o,1]
if roots.has_key? s
roots << o
else
roots = [o]
end
end

puts "Initial: #{roots.inspect}" if $VERBOSE
len = 1
first = nil
while true do
# Remove entries which don't have at least 2 non-overlapping
occurances
roots.delete_if do |s,offsets|
count = 0
last = nil
offsets.each do |o|
next if last && last+len > o
last = o
count += 1
end
count < 2
end
puts "Filter (len=#{len}): #{roots.inspect}" if $VERBOSE
break if roots.size == 0
first = roots[roots.keys[0]][0]

# Increase len by 1 and replace each existing root with the set of
longer roots
len += 1
new_roots = {}
roots.each do |s,offsets|
offsets.each do |o|
next if o > size - len
s = text[o,len]
if new_roots.has_key? s
new_roots << o
else
new_roots = [o]
end
end
end
roots = new_roots
puts "Grow (len=#{len}): #{roots.inspect}" if $VERBOSE
end

exit if first == nil

puts text[first,len-1]

Justin Ethier · Jan 20, 2008

[Note: parts of this message were removed to make it a legal post.]

Here is my solution. It is easy to follow but breaks down on larger inputs:

# Find first non-overlapping repeated substring contained in the input
string.
# Search space is smaller for longer substrings, so search for longest ones
first.
# Returns - Longest repeated substring, or nil if none
def longest_repeated_substring(input)
len = input.size / 2 # Max size is half total length, since strings cannot
overlap

while len > 0
# Find all substrings of given length
sub_strings = {}
for i in 0...input.size-len
sub_str = input[i..i+len]

if not sub_strings.has_key?(sub_str)
sub_strings[sub_str] = i+len # Add to list, track end pos for
overlaps
elsif sub_strings[sub_str] < i
return sub_str # First non-overlapping match ties for longest
end
end

len -= 1
end

nil
end

puts longest_repeated_substring(ARGV[0])

Thanks,

Justin

Dave Thomas · Jan 20, 2008

My hack at the substring problem (nice one, James) is based on suffix
trees.

Suffix trees and suffix arrays and a great tool for substring
operations. The idea is to create a list of all the possible suffixes
in the original string. For "banana" this would be

banana
anana
nana
ana
na
a

Because the list contains an entry starting at every position in the
string, we know that all possible substrings of the original string
must occur at the start of one of these list elements. The substring
"nan", for example, occurs at the start of the 3rd entry.

Now sort them

a
ana
anana
banana
na
nana

Now we know that all substrings that start with the same sequence of
characters must occur at the start of items in the list that are
adjacent. Searching through to list to find these is a linear operation.

A suffix array is basically this data structure. For efficiency,
though, it doesn't store the actual strings. Instead it stores the
offset of the start of the string (for our example, this would be [6,
4, 2, 1, 5, 3]). You'll find a whole bunch of stuff on using suffix
arrays for searching and string matching, particularly in the area of
DNA sequencing.

It turns out that in Ruby, you don't always have to go to the trouble
of constructing the suffix array. When you take a substring in Ruby,
it doesn't copy the string data. Instead, it just constructs a new
string object (just a few words of memory) that references into the
original string. Only when either string is modified does the string
content get copied. This means the code for constructing the suffix
list is both simple and relatively efficient:

def initialize(text)
@suffix_list = []
len = text.length
len.times { |i| @suffix_list << text[i, len] } # the ,len] is a
hack...
@suffix_list.sort!
end

The only ugly part is the use of [i, len] to create the substring.
Technically it should be 1..len, but that constructs a new,
unnecessary range object. Using 'len' as the final index does the same
thing, because the substring stops at the end of the input string, but
it can be misleading to read.

The code to scan the suffix list looks at each adjacent pair, and sees
if it can find a longer matching substring at the start of each that
it has previously found. Because James disallowed overlapping
substrings, you have to be careful to look at no more that 1/2 of the
longest string. The basic loop looks like this:

s1 = @suffix_list[0]

1.upto(@suffix_list.size - 1) do |i|

s2 = @suffix_list
max_possible = s2.length / 2 # stop them overlapping

while # quick sanity check - saves doing the more expensive
substring if it fails
s1[max_length_so_far] == s2[max_length_so_far] &&
# stop strings from overlapping
max_length_so_far < max_possible &&
# brute force comparison
s1[0,max_plus_one] == s2[0,max_plus_one]

max_length_so_far = max_plus_one
max_plus_one += 1
found_at = i
end
s1 = s2
end

The while loop has three conditions. The last one is the money earner:
it looks for a match at the start of the two strings which is one
longer that the longest match found so far. If it finds it, it records
this as the new longest match, and then tries for a even longer one
with the current pair.

The second condition on the while loop stops us considering more than
1/2 the second string. Because our strings are sorted, if thereis
overlap, the second string will always be the one that contains the
overlapping segment of the first.

The first condition is just an optimization: it stops us doing the
more expensive substring comparison if the last characters of the two
substrings we're comparing don't match.

So, put it all together, and we have

# - - - - - - - - - - - - -

class CommonSeq

def initialize(text)
@suffix_list = []
len = text.length
len.times { |i| @suffix_list << text[i, len] } # the ,len] is a
hack...
@suffix_list.sort!
end

def find_substrings
max_length_so_far = 0
max_plus_one = 1 # save a little math in the loop
found_at = nil

# Look at all adjacent pairs of suffices.
s1 = @suffix_list[0]

1.upto(@suffix_list.size - 1) do |i|

s2 = @suffix_list
max_possible = s2.length / 2 # stop them overlapping

while # quick sanity check - saves doing the more expensive
substring if it fails
s1[max_length_so_far] == s2[max_length_so_far] &&
# stop strings from overlapping
max_length_so_far < max_possible &&
# brute force comparison
s1[0,max_plus_one] == s2[0,max_plus_one]

max_length_so_far = max_plus_one
max_plus_one += 1
found_at = i
end
s1 = s2
end

if found_at
suffix = @suffix_list[found_at]
[max_length_so_far, suffix[0, max_length_so_far]]
else
nil
end
end
end

if __FILE__ == $0
seq = CommonSeq.new(STDIN.read.chomp)
puts seq.find_substrings
end

# - - - - - - - - - - - - -

Run times are shown for the Illiad and War and Peace:

dave[tmp/seq] time ruby seq.rb <homer.txt
168
who are
perishing and coming to a bad end. We will, however, since you so
bid us, refrain from actual fighting, but we will make serviceable
suggestions to the Argives
ruby seq.rb < homer.txt 2.82s user 0.05s system 99% cpu 2.872 total

dave[tmp/seq] time ruby seq.rb <war.txt
63

The Project Gutenberg Literary Archive Foundation has been
ruby seq.rb < war.txt 12.49s user 0.17s system 99% cpu 12.737 total

Here's the somewhat basic set of tests I used

# - - - - - - - - - - - - -
require 'seq'
require 'test/unit'

class CommonSeq
attr_reader :suffix_list
end

class TestSeq < Test::Unit::TestCase

def test_basic_suffix_creation
cs = CommonSeq.new("banana")
assert_equal(%w{a ana anana banana na nana}, cs.suffix_list)
end

def test_empty
cs = CommonSeq.new("")
assert_nil cs.find_substrings
end

def test_length_one
cs = CommonSeq.new("a")
assert_nil cs.find_substrings
end

def test_length_two_no_match
cs = CommonSeq.new("ab")
assert_nil cs.find_substrings
end

def test_length_two_with_match
cs = CommonSeq.new("aa")
assert_equal [ 1, "a"], cs.find_substrings
end

def test_length_three_no_match
cs = CommonSeq.new("abc")
assert_nil cs.find_substrings
end

def test_length_three_adjacent_match
cs = CommonSeq.new("aab")
assert_equal [ 1, "a"], cs.find_substrings
end

def test_length_three_separated_match
cs = CommonSeq.new("aba")
assert_equal [ 1, "a"], cs.find_substrings
end

def test_does_not_find_overlapping_match_length_one
cs = CommonSeq.new("aaa")
assert_equal [ 1, "a"], cs.find_substrings
end

def test_does_not_find_overlapping_match_length_three
cs = CommonSeq.new("aaaa")
assert_equal [ 2, "aa"], cs.find_substrings
end

def test_does_not_find_overlapping_match_length_two
cs = CommonSeq.new("ababa")
assert_equal [ 2, "ab"], cs.find_substrings
end

def test_banana
cs = CommonSeq.new("banana")
assert_equal [ 2, "an"], cs.find_substrings
end
end
# - - - - - - - - - - - - -

I wouldn't be surprised if the idea of searching only 1/2 of the
second string to prevent overlaps is wrong..

Dave

Guillaume Carbonneau · Jan 20, 2008

Here is my solution using Suffix arrays

class SuffixArray
attr_accessor :suffix_array
def initialize(the_string)
@the_string = the_string
@suffix_array = Array.new
#build the suffixes
last_index = the_string.length-1
(0..last_index).each do |i|
the_suffix = the_string[i..last_index]
the_position = i
# << is the append (or push) operator for arrays in Ruby
@suffix_array << { :suffix=>the_suffix,

osition=>the_position }
end

#sort the suffix array
@suffix_array.sort! { |a,b| a[:suffix] <=> b[:suffix] }
end

end

text = STDIN.read

highest_count = 0
longest_string = ""
sa = SuffixArray.new(text)
sa.suffix_array.each_with_index do |s,i|
j = 1
if sa.suffix_array[i+1]
while sa.suffix_array[:suffix][0,j] ==
sa.suffix_array[i+1][:suffix][0,j]
if j > highest_count
highest_count = j
longest_string = sa.suffix_array[:suffix][0,j]
end
j += 1
end
end

end
p longest_string

---------------------

I get the following times

ILLIAD :
$ time cat homer-illiad.txt | ruby suffix.rb
" who are\nperishing and coming to a bad end. We will, however, since you
so\nbid us, refrain from actual fighting, but we will make
serviceable\nsuggestions to the Argives"

real 1m15.566s
user 1m14.810s
sys 0m0.410s

WAR AND PEACE
$ time cat wrnpc12.txt | ruby suffix.rb
"\r\n\r\nThe Project Gutenberg Literary Archive Foundation has been "

real 4m55.145s
user 4m49.979s
sys 0m1.951s

Eric I. · Jan 20, 2008

I wouldn't be surprised if the idea of searching only 1/2 of the
second string to prevent overlaps is wrong..

I think you're right in that it's wrong.

If you submit the string "ababab" to your program, it comes back with
"aba" as the longest non-overlapping substring, but the answer should
be "ab". When you compare the consecutive sorted suffixes "abab" and
"ababab", you allow strings up to 3 ("ababab".size / 2) to be used,
but in fact, they overlap in all but 2 characters.

I'll post my solution in a reply, which is very similar to your except
in the overlap prevention code, which, I have to admit, is pretty
ugly. And I'm not even convinced that I got it right!

Eric

JJ · Jan 20, 2008

I think you're right in that it's wrong.
....snip

I'll post my solution in a reply, which is very similar to your
except in the overlap prevention code, which, I have to admit, is
pretty ugly. And I'm not even convinced that I got it right!

Dave's code can be corrected by realizing that since all suffix
strings end at the same place (the end of the string), then of the two
adjacent strings being tested, one is a suffix of the other.

This means that to detect overlap, the following test can be used:

if prefix.length + s1.length > s2.length then
# Overlap
end

where "prefix" is the current prefix being checked in the two adjacent
suffix strings.

Here is a picture. Pretend in the "ababab" case, we are checking the
adjacent strings "abab" and "ababab". Since one is a suffix of the
other, they can be lined up as they appeared in the original string
(in your mind):

abab
ababab

Now, the prefix being checked might be "aba". It matches both strings,
but if you check "aba".length + s1.length (7), it's too long to fit in
s2.length (6). In other words, they line up like this:

ababab # s2
abab # s1
aba # prefix, lined up with s2
^
`---- # overlap exists because the prefix
# as lined up with s2 overlaps with s1
# when s1 is lined up with s2 as they
# appear in the original string. In other
# words, the "aba" in s2 goes past the
# beginning of the "aba" in s1.

Adding this test (instead of the s2.length / 2 test) and also testing
adjacent strings that start with the prefix currently being searched
(to find later matches if earlier ones overlap) would correct Dave's
solution and shouldn't be much more complicated.

-JJ

Dido Sevilla · Jan 21, 2008

T24gSmFuIDE4LCAyMDA4IDk6MjMgUE0sIFJ1YnkgUXVpeiA8amFtZXNAZ3JheXByb2R1Y3Rpb25z
Lm5ldD4gd3JvdGU6Cj4gWW91ciBwcm9ncmFtIHdpbGwgYmUgcGFzc2VkIHNvbWUgdGV4dCBvbiBT
VERJTiBhbmQgaXMgZXhwZWN0ZWQgdG8gcHJpbnQgdGhlCj4gbG9uZ2VzdCByZXBlYXRlZCBzdWJz
dHJpbmcgd2l0aGluIHRoYXQgdGV4dCB0byBTVERPVVQuCgpJJ20gZ3Vlc3NpbmcgdGhhdCBzb21l
b25lIHdhbnRzIHRvIGJyZWFrIGEgVmlnZW5lcmUgY2lwaGVyLiAgVGhlCmxvbmdlc3QgcmVwZWF0
ZWQgc3Vic3RyaW5nIGluIHRleHQgZW5jaXBoZXJlZCB3aXRoIGEgVmlnZW5lcmUgY2lwaGVyCmhh
cyBhIGNsb3NlIHJlbGF0aW9uc2hpcCB0byB0aGUgbGVuZ3RoIG9mIHRoZSBrZXksIGFuZCBvbmNl
IHRoZSBrZXkKbGVuZ3RoIGlzIGRldGVybWluZWQsIHRoZSBjaXBoZXIgaXMgaGFsZndheSBicm9r
ZW4uIEF0IGxlYXN0IHRoYXQncwpvbmUgYXBwbGljYXRpb24gZm9yIHRoaXMgc29ydCBvZiBhbGdv
cml0aG0uLi4KCi0tIArmma7pgJrjgZjjgoPjgarjgYTjga7jgYzlvZPnhLbjgarjgonnrZTjgYjj
govnp4Hjga/kvZXjgYzjgafjgY3jgovvvJ8K5pmu6YCa44Gn44KC5pmu6YCa44GY44KD44Gq44GP
44Gm5oSf44GY44KL44G+44G+5oSf44GY44KL44GT44Go44Gg44GR44KS44GZ44KL44KI77yBCmh0
dHA6Ly9zdG9ybXd5cm0uYmxvZ3Nwb3QuY29tCg==

Todd Benson · Jan 21, 2008

2008/1/18 yermej said:
For The Illiad, I got:
' who are
perishing and coming to a bad end. We will, however, since you so
bid us, refrain from actual fighting, but we will make serviceable
suggestions to the Argives' (168 characters)

Using Eric's code, for "The Iliad" downloaded from the Gutenberg project...

"On this the rest of the Acheans with one voice were for respecting
the priest and taking the ransom that he offered; but not so
Agamemnon, who spoke fiercely to him and sent him roughly away. " (193
characters)

Todd

Denis Hennessy · Jan 21, 2008

Using Eric's code, for "The Iliad" downloaded from the Gutenberg
project...

"On this the rest of the Acheans with one voice were for respecting
the priest and taking the ransom that he offered; but not so
Agamemnon, who spoke fiercely to him and sent him roughly away. " (193
characters)

Todd

In my copy of the Illiad (there's a phrase I never thought I'd use!),
those two sections differ in whitespace / linebreaks:

On this the rest of the Achaeans with one voice were for
respecting the priest and taking the ransom that he offered; but not
so Agamemnon, who spoke fiercely to him and sent him roughly away.
"Old man," said he, "let me not find you tarrying about our ships, nor
yet coming hereafter.

and

"On this the rest of the Achaeans with one voice were for respecting
the priest and taking the ransom that he offered; but not so
Agamemnon, who spoke fiercely to him and sent him roughly away. So
he went back in anger, and Apollo, who loved him dearly, heard his
prayer.

Did you get yours from http://www.fordham.edu/halsall/ancient/homer-illiad.txt
?

/dh

[SUMMARY] Longest Repeated Substring (#153)	0	Jan 24, 2008
[QUIZ] Word Loop (#149)	44	Dec 7, 2007
[QUIZ] Found Senryu (#224)	2	Nov 13, 2009
[QUIZ] Gathering Ruby Quiz 2 Data (#189)	10	Jan 23, 2009
[QUIZ] Ruby BASIC (#228)	1	Jan 30, 2010
[QUIZ] IP to Country (#139)	37	Sep 14, 2007
[QUIZ] URL Shenanigans (#220)	1	Sep 14, 2009
[QUIZ] The Matrix: Code Screen (#230)	0	Mar 20, 2010

[QUIZ] Longest Repeated Substring (#153)

James Gray

Rados³aw Bu³at

ara.t.howard

ara.t.howard

fw

yermej

yermej

Dave Thomas

James Gray

Denis Hennessy

Raf Coremans

Denis Hennessy

Justin Ethier

Dave Thomas

Guillaume Carbonneau

Eric I.

JJ

Dido Sevilla

Todd Benson

Denis Hennessy

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads