Determining the common prefix for several strings

E

Eric Promislow

The obligatory solution using zip...

# data contains a list of strings
aa = data.map{ |x| x.split(//)}
bz = aa[0].zip(*(aa[1 .. -1])).map{ |x| x.join("")}
target = bz.index( bz.find{ |x| x.squeeze.size != 1})
answer = (target.nil? ? aa[0] :
target >= 1 ? aa[0][0 .. target - 1] :
[]).join("")

- Eric
 
E

Eric Promislow

The obligatory solution usingzip...

# data contains a list of strings
aa = data.map{ |x| x.split(//)}
bz = aa[0].zip(*(aa[1 .. -1])).map{ |x| x.join("")}
target = bz.index( bz.find{ |x| x.squeeze.size != 1})
answer = (target.nil? ? aa[0] :
target >= 1 ? aa[0][0 .. target - 1] :
[]).join("")

- Eric

Doh! This isn't correct. It's hard to improve over
Michael's routine back in the initial response. You
need to find any one of the shortest strings, remove it,
and use that instead of aa[0] as above.
 
W

Wolfgang Nádasi-donner

Xavier said:
El Aug 3, 2007, a las 2:31 AM, Todd Burch
escribi�:
item 5
item 10
itemize this

I want to parse these names and determine that "item" is the common
prefix.

I know how to armstrong my way through it a character at a time,
but I'm
thinking that Ruby probably has a very elegant way of doing this that
I'm not familiar with yet.

This is a solution that avoids going char by char:

items = [
'item001',
'item004',
'item002',
'item002b',
'item002C',
'item 5',
'item 10',
'itemize this'
]

min_length = items.map {|i| i.length}.min
cuts = items.map {|i| i.split(//)[0, min_length]}
cuts.transpose.each_with_index do |t, i|
if t.uniq.length != 1
puts cuts.first[0, i].join('')
break
end
end

We first check which is the length of the shortest string in the
array, then split by chars to store those many ones, and then
transpose the matrix. Note that transposing is OK because we cut the
strings. That's longer to explain than to write it in Ruby as you
see. Once you get the transpose you can go row by row to detect the
one that has more than one distinct element. Again that's easy
delegating the job to uniq.

-- fxn

A variant of this is:

list = <<-EOS
item001
item004
item002
item002b
item002C
item 5
item 10
itemize this
EOS
s = list.split("\n").sort{|a,b|a.length<=>b.length}
pre =
s.collect{|e|e.split('')[0...(s[0].length)]}.transpose.collect{|e|e.uniq}.reject{|e|e.length>1}.flatten.join('')
p pre # => "item"

Wolfgang N#adasi-Donner
 
W

Wolfgang Nádasi-donner

Wolfgang Nádasi-donner wrote:
EOS
s = list.split("\n").sort{|a,b|a.length<=>b.length}
pre =
s.collect{|e|e.split('')[0...(s[0].length)]}.transpose.collect{|e|e.uniq}.reject{|e|e.length>1}.flatten.join('')
p pre # => "item"

Wolfgang N#adasi-Donner

:oops: - I made an error!

The corrected version is...

list = <<-EOS
item0aa
item1aa
item3a
EOS
s = list.split("\n").sort{|a,b|a.length<=>b.length}
pre =
s.collect{|e|e.split('')[0...(s[0].length)]}.transpose.collect{|e|e.uniq}.
inject(''){|r,c|(c.length>1)?(break r):(r<<c[0])}
p pre # => "item"


Wolfgang Nádasi-Donner
 
G

Gavin Sinclair

item001
item004
item002
item002b
item002C
item 5
item 10
itemize this

I want to parse these names and determine that "item" is the common
prefix.

I know this has been done to death, but here's my preferred code:

list = %{
item001
item004
item002
item002b
item002C
item 5
item 10
itemize this
}.strip.map { |line| line.strip }.sort_by { |str| str.length }

prefix = list.first.dup

until prefix.empty? or list.all? { |str| str =~ /^#{prefix}/ }
prefix.chop!
end

p prefix


I didn't think of using chop! until someone else posted a solution
with that in it. It's the first really good use of chop! I've seen.

This solution also demonstrates the lovely #emtpy? and #all? methods.
(I know #all? has been seen before in this thread, but it's such a
handy method.)

Cheers,
Gavin
 
G

Gavin Sinclair

min = list.sort!{|a,b| a.size <=> b.size}.first
break if list.all?{|a| a =~ /^#{min}/} while min.chop!

I just realised that the solution I just posted elsewhere in this
thread is very very similar to this one.

Be careful that using bang methods in chains is sometimes problematic,
as they can return nil if nothing changes (sort! doesn't have this
problem). For interested readers, find the problem with this code:

title = "the moon in june"
title.downcase!.capitalize!

Cheers,
Gavin
 
P

Peña, Botp

From: Gavin Sinclair [mailto:[email protected]]=20
# > min =3D list.sort!{|a,b| a.size <=3D> b.size}.first
# > break if list.all?{|a| a =3D~ /^#{min}/} while min.chop!
# I just realised that the solution I just posted elsewhere in this
# thread is very very similar to this one.
# Be careful that using bang methods in chains is sometimes problematic,
# as they can return nil if nothing changes (sort! doesn't have this
# problem). =20

indeed. as long as the array has no nil elements, w're ok w sort!

irb(main):009:0> s=3D"test"
=3D> "test"
irb(main):010:0> s.downcase!.downcase!
NoMethodError: undefined method `downcase!' for nil:NilClass
from (irb):10
from :0
irb(main):011:0> a=3D[1,1,1]
=3D> [1, 1, 1]
irb(main):012:0> a.sort!.sort!
=3D> [1, 1, 1]

i realized just now that i posted the first code. That should be,

min =3D list.sort!{|a,b| a.size <=3D> b.size}.shift # <--shift instead =
of first
break if list.all?{|a| a =3D~ /^#{min}/} while min.chop!=20

but anyway, after a couple of tests, the sort does not help much, so

min =3D list.min{|a,b| a.size <=3D> b.size}
break if list.all?{|a| a =3D~ /^#{min}/} while min.chop!=20

hmm.. Gavin, your until code just gave me a hint. if we reverse logic, =
we can save the break and make it more readable too.

min =3D list.min{|a,b| a.size <=3D> b.size}
min.chop! until list.all?{|a| a =3D~ /^#{min}/}

and we can use switch w #any? anytime too :)

min =3D list.min{|a,b| a.size <=3D> b.size}
min.chop! while list.any?{|a| a !~ /^#{min}/}

ruby power, indeed.

kind regards -botp
 
R

Robert Klemme

------=_Part_111584_14226503.1186400979032
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

2007/8/6 said:
From: Gavin Sinclair [mailto:[email protected]]
# > min =3D list.sort!{|a,b| a.size <=3D> b.size}.first
# > break if list.all?{|a| a =3D~ /^#{min}/} while min.chop!
# I just realised that the solution I just posted elsewhere in this
# thread is very very similar to this one.
# Be careful that using bang methods in chains is sometimes problematic,
# as they can return nil if nothing changes (sort! doesn't have this
# problem).

indeed. as long as the array has no nil elements, w're ok w sort!

irb(main):009:0> s=3D"test"
=3D> "test"
irb(main):010:0> s.downcase!.downcase!
NoMethodError: undefined method `downcase!' for nil:NilClass
from (irb):10
from :0
irb(main):011:0> a=3D[1,1,1]
=3D> [1, 1, 1]
irb(main):012:0> a.sort!.sort!
=3D> [1, 1, 1]

i realized just now that i posted the first code. That should be,

min =3D list.sort!{|a,b| a.size <=3D> b.size}.shift # <--shift instead o= f first
break if list.all?{|a| a =3D~ /^#{min}/} while min.chop!

but anyway, after a couple of tests, the sort does not help much, so

min =3D list.min{|a,b| a.size <=3D> b.size}
break if list.all?{|a| a =3D~ /^#{min}/} while min.chop!

hmm.. Gavin, your until code just gave me a hint. if we reverse logic, w=
e can save the break and make it more readable too.
min =3D list.min{|a,b| a.size <=3D> b.size}
min.chop! until list.all?{|a| a =3D~ /^#{min}/}

and we can use switch w #any? anytime too :)

min =3D list.min{|a,b| a.size <=3D> b.size}
min.chop! while list.any?{|a| a !~ /^#{min}/}

ruby power, indeed.

All these solutions have the issue that they modify the list which
might or might not be ok. I tend to prefer to view the item list as
read only. So I'd do:

min =3D list.min{|a,b| a.size <=3D> b.size}.dup
min.chop! until list.all?{|a| a =3D~ /^#{min}/}

I have added some interesting benchmarking:

$ ./prefix.rb
user system total real
1000 * list.rx 0.016000 0.000000 0.016000 ( 0.014000)
1000 * list.rx with conversion 0.031000 0.000000 0.031000 ( 0.030000)
1000 * list.one_pass 0.219000 0.000000 0.219000 ( 0.218000)
1000 * list.min 0.328000 0.000000 0.328000 ( 0.337000)
1000 * list.first.dup 0.640000 0.000000 0.640000 ( 0.641000)
1 * large.rx 0.000000 0.000000 0.000000 ( 0.000000)
1 * large.rx with conversion 0.000000 0.000000 0.000000 ( 0.000000)
1 * large.one_pass 25.985000 0.000000 25.985000 ( 26.298000)
1 * large.min 34.953000 0.000000 34.953000 ( 35.098000)
1 * large.first.dup 27.719000 0.000000 27.719000 ( 27.732000)


(see attachment for code)

Kind regards

robert

------=_Part_111584_14226503.1186400979032
Content-Type: application/x-ruby; name="prefix.rb"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="prefix.rb"
X-Attachment-Id: f_f50wi4qz

IyFydWJ5DQoNCnJlcXVpcmUgJ2JlbmNobWFyaycNCg0KUkVQID0gMTAwDQoNCmxpc3QgPSA8PFhY
WC50b19hLmVhY2gge3x4fCB4LmNob21wITsgeC5mcmVlemV9LmZyZWV6ZQ0KaXRlbTAwMQ0KaXRl
bTAwNA0KaXRlbTAwMg0KaXRlbTAwMmINCml0ZW0wMDJDDQppdGVtIDUNCml0ZW0gMTANCml0ZW1p
emUgdGhpcw0KWFhYDQoNCmxhcmdlID0gbGlzdC5kdXANCmxhcmdlLmNvbmNhdCBsYXJnZSB3aGls
ZSBsYXJnZS5zaXplIDwgMV8wMDBfMDAwDQpsYXJnZS5mcmVlemUNCg0KQmVuY2htYXJrLmJtIDMw
IGRvIHx4fA0KICB7Imxpc3QiID0+IFtsaXN0LCAxXzAwMF0sICJsYXJnZSIgPT4gW2xhcmdlLCAx
XX0uZWFjaCBkbyB8bGFiZWwsIChzZXQsIHJlcCl8DQoNCiAgICBzdHJpbmcgPSBsaXN0LmpvaW4o
IlxuIikuZnJlZXplDQoNCiAgICB4LnJlcG9ydCAiI3tyZXB9ICogI3tsYWJlbH0ucngiIGRvDQog
ICAgICByZXAudGltZXMgZG8NCiAgICAgICAgbWluID0gL1xBKC4qKS4qKFxuXDEuKikqXFovLm1h
dGNoKHN0cmluZylbMV0NCiAgICAgIGVuZA0KICAgIGVuZA0KDQogICAgeC5yZXBvcnQgIiN7cmVw
fSAqICN7bGFiZWx9LnJ4IHdpdGggY29udmVyc2lvbiIgZG8NCiAgICAgIHJlcC50aW1lcyBkbw0K
ICAgICAgICBzdHJpbmcgPSBsaXN0LmpvaW4oIlxuIikuZnJlZXplDQogICAgICAgIG1pbiA9IC9c
QSguKikuKihcblwxLiopKlxaLy5tYXRjaChzdHJpbmcpWzFdDQogICAgICBlbmQNCiAgICBlbmQN
Cg0KICAgIHgucmVwb3J0ICIje3JlcH0gKiAje2xhYmVsfS5vbmVfcGFzcyIgZG8NCiAgICAgIHJl
cC50aW1lcyBkbw0KICAgICAgICBtaW4gPSBzZXQuZmlyc3QuZHVwDQogICAgICAgIHNldC5lYWNo
IGRvIHxzfA0KICAgICAgICAgIGkgPSAwDQogICAgICAgICAgd2hpbGUgbWluW2ldID09IHNbaV0g
JiYgaSA8IG1pbi5zaXplDQogICAgICAgICAgICBpICs9IDENCiAgICAgICAgICBlbmQNCiAgICAg
ICAgICBtaW4uc2xpY2UhIGkuLi0xDQogICAgICAgIGVuZA0KICAgICAgZW5kDQogICAgZW5kDQoN
CiAgICB4LnJlcG9ydCAiI3tyZXB9ICogI3tsYWJlbH0ubWluIiBkbw0KICAgICAgcmVwLnRpbWVz
IGRvDQogICAgICAgIG1pbiA9IHNldC5taW57fGEsYnwgYS5zaXplIDw9PiBiLnNpemV9LmR1cA0K
ICAgICAgICBtaW4uY2hvcCEgdW50aWwgc2V0LmFsbD97fGF8IGEgPX4gL14je21pbn0vfQ0KICAg
ICAgZW5kDQogICAgZW5kDQoNCiAgICB4LnJlcG9ydCAiI3tyZXB9ICogI3tsYWJlbH0uZmlyc3Qu
ZHVwIiBkbw0KICAgICAgcmVwLnRpbWVzIGRvDQogICAgICAgIG1pbiA9IHNldC5maXJzdC5kdXAN
CiAgICAgICAgbWluLmNob3AhIHVudGlsIHNldC5hbGw/e3xhfCBhID1+IC9eI3ttaW59L30NCiAg
ICAgIGVuZA0KICAgIGVuZA0KDQogIGVuZA0KZW5kDQo=
------=_Part_111584_14226503.1186400979032--
 
S

Stefan Rusterholz

Todd said:
I have several strings to analyze. The strings are the names of a list
of things that were once ordered and now are not. A typical array of
names might look like this:

item001
item004
item002
item002b
item002C
item 5
item 10
itemize this

I want to parse these names and determine that "item" is the common
prefix.

I know how to armstrong my way through it a character at a time, but I'm
thinking that Ruby probably has a very elegant way of doing this that
I'm not familiar with yet.

Thanks, Todd

Wow, I'm surprised nobody had the idea of simply using Array#abbrev,
which is included in the stdlib.

Regards
Stefan
 
T

Todd Burch

Stefan said:
Wow, I'm surprised nobody had the idea of simply using Array#abbrev,
which is included in the stdlib.

Regards
Stefan

Well that looks simple enough! Looks like I would take the shortest
key returned and chop! the last character to get the common prefix. If
the shortest key length was 1, there is no common prefix.

I had never used Abbrev before - thanks for pointing it out. This
wasn't a contest, but if it were - I think you won!

Todd
 
X

Xavier Noria

El Aug 6, 2007, a las 5:13 PM, Stefan Rusterholz escribi=F3:
Wow, I'm surprised nobody had the idea of simply using Array#abbrev,
which is included in the stdlib.

There's no guarantee you can obtain the maximum common prefix from =20
its output, because it computes _unambiguous_ prefixes:

%w{ item1 item2 }.abbrev
=3D> {"item1"=3D>"item1", "item2"=3D>"item2"}

See, no "item" there, you end up basically with the information you =20
had before the call.

-- fxn
 
P

Peña, Botp

From: Robert Klemme [mailto:[email protected]]=20
# I have added some interesting benchmarking:
#=20
# $ ./prefix.rb
# user system total =
real
# 1000 * list.rx 0.016000 0.000000 0.016000 ( =
0.014000)
# 1000 * list.rx with conversion 0.031000 0.000000 0.031000 ( =
0.030000)
# 1 * large.rx 0.000000 0.000000 0.000000 ( =
0.000000)
# 1 * large.rx with conversion 0.000000 0.000000 0.000000 ( =
0.000000)

Robert, thanks for the benchmark. I included Brian|Martin's solution in =
the mix but still the regex scheme is way a lot better w regards to =
speed.

Could some kind soul enlighten me why the regex solution works better, =
and even best using large strings?

kind regards -botp
 
P

Peña, Botp

From: Logan Capaldo [mailto:[email protected]]=20
# I've edited out some false starts. Please feel free to point=20
# out any flaws
# that I've missed
#=20
# Abbrev.abbrev(items).sort_by { |s,| s.length }.first.first[0..-2]

I cannot prove it wrong, hacker Logan :)

i just put some cosmetics though by getting the extract fr min,

irb(main):099:0> list=3D[
irb(main):100:1* "item two of those",
irb(main):101:1* "item two",
irb(main):102:1* "item1",
irb(main):103:1* "item22",
irb(main):104:1* "item333",
irb(main):105:1* "item 444",
irb(main):106:1* "item 5555",
irb(main):107:1* "item6a",
irb(main):108:1* "item6b",
irb(main):109:1* "item---x",
irb(main):110:1* "itemy",
irb(main):111:1* "item whatever :)",
irb(main):112:1* ]
=3D> ["item two of those", "item two", "item1", "item22", "item333", =
"item 444", "item 5555", "item6a", "item6b", "item---x", "itemy", "item =
whatever :)"]
irb(main):113:0> =
list.abbrev.min{|(a1,),(b1,)|a1.length<=3D>b1.length}.first[0..-2]
=3D> "item"

so,=20
list.abbrev.min{|(a1,),(b1,)|a1.length<=3D>b1.length}.first[0..-2]

your magic number "0..2" is an amazingly simple hack (it looks like 42 =
:). but i'm not sure how this abbrev scheme stands stands against the =
regex solutions though.

kind regards -botp
 
R

Robert Klemme

From: Robert Klemme [mailto:[email protected]]
# I have added some interesting benchmarking:
#
# $ ./prefix.rb
# user system total real
# 1000 * list.rx 0.016000 0.000000 0.016000 ( 0.014000)
# 1000 * list.rx with conversion 0.031000 0.000000 0.031000 ( 0.030000)
# 1 * large.rx 0.000000 0.000000 0.000000 ( 0.000000)
# 1 * large.rx with conversion 0.000000 0.000000 0.000000 ( 0.000000)

Robert, thanks for the benchmark. I included Brian|Martin's solution inthe mix but still the regex scheme is way a lot better w regards to speed.

Could some kind soul enlighten me why the regex solution works better, and even best using large strings?

My theory is that it's completely done in C and there is only *one*
invocation. All other solutions need some form of iteration implemented
in Ruby.

Kind regards

robert
 
X

Xavier Noria

El Aug 6, 2007, a las 5:13 PM, Stefan Rusterholz escribi=F3:


There's no guarantee you can obtain the maximum common prefix from
its output, because it computes _unambiguous_ prefixes:

%w{ item1 item2 }.abbrev
=3D> {"item1"=3D>"item1", "item2"=3D>"item2"}

See, no "item" there, you end up basically with the information you
had before the call.



irb(main):027:0> items =3D [
irb(main):028:1* 'item001',
irb(main):029:1* 'item004',
irb(main):030:1* 'item002',
irb(main):031:1* 'item002b',
irb(main):032:1* 'item002C',
irb(main):033:1* 'item 5',
irb(main):034:1* 'item 10',
irb(main):035:1* 'itemize this'
irb(main):036:1> ]
irb(main):039:0> Abbrev.abbrev(items).sort_by { |s,| =20
s.length}.first.first[0..-
2]
=3D> "item"
irb(main):040:0> items =3D ["happy", "happening", "hapless"]
=3D> ["happy", "happening", "hapless"]
irb(main):041:0> Abbrev.abbrev(items).sort_by { |s,| =20
s.length}.first.first[0..-
2]
=3D> "hap"
irb(main):042:0> items =3D ["c", "b", "a"]
=3D> ["c", "b", "a"]
irb(main):043:0> Abbrev.abbrev(items).sort_by { |s,| =20
s.length}.first.first[0..-
2]
=3D> ""
irb(main):044:0>

I've edited out some false starts. Please feel free to point out =20
any flaws
that I've missed

If I understand correctly that approach that says: "since I can get =20
some absolute shortest (perhaps not unique as fas as length is =20
concerned) unambiguous minimum, belonging to string S, that shortest =20
minus one is ambiguous". That's the key idea, and I think there's =20
something into it.

What happens is that, as far as I can see, you can only conclude =20
there's ambiguity with some other string, which is not the same than =20
saying that you've got the maximum common prefix. See for example:

%w{abc acb abd acd}.abbrev.sort_by {|s,| s.length}.first.first[0..-2]
=3D> "ac"

-- fxn=20

PS: I think abbrev does too much work compared to more specific =20
solutions, but it is worth exploring for the sake of thinking =20
nonetheless.=
 
P

Peña, Botp

From: Xavier Noria [mailto:[email protected]]=20
# %w{abc acb abd acd}.abbrev.sort_by {|s,| =
s.length}.first.first[0..-2]
# =3D> "ac"

duh, i should read further on what abrrev really does. Thanks for the =
enlightenment.

# PS: I think abbrev does too much work compared to more specific =20
# solutions, but it is worth exploring for the sake of thinking =
nonetheless.

yes. this has been an enlightening thread nonetheless.=20
kind regards -botp
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,266
Messages
2,571,318
Members
47,998
Latest member
GretaCjy4

Latest Threads

Top