Count substrings from a string

Z

Zangief Ief

Hi,

I am able to know if there is a substring in a string with:

if string =~ /#{word}/:
puts "match"
else
puts "do not match"
end

But I would like to count all the substring from a string...

Thanks for help :)
 
W

Wolfgang Nádasi-Donner

Zangief said:
Hi,

I am able to know if there is a substring in a string with:

if string =~ /#{word}/:
puts "match"
else
puts "do not match"
end

But I would like to count all the substring from a string...

Thanks for help :)

irb(main):001:0> "one word is one word, never be three
words".scan(/word/).length
=> 3

:), Wolfgang Nádasi-Donner
 
R

Robert Dober

irb(main):001:0> "one word is one word, never be three
words".scan(/word/).length
=3D> 3

:), Wolfgang N=E1dasi-Donner
It is not clear what you want, I use an idiom like the following to
allow for overlap

class String
def all_subs substr, overlap =3D false
return scan(substr) unless overlap
(0...size).inject([]){ |r, i|
r <<
( substr.match(self[i..-1])[0] rescue nil )
}.compact
end
end


p "aaaa".all_subs( /aaa/ )
p "aaaa".all_subs( /aaa/, true )

too bad scan does not have an overlap parameter

HTH
Robert


--=20
what do I think about Ruby?
http://ruby-smalltalk.blogspot.com/
 
W

Wolfgang Nádasi-Donner

Robert said:
too bad scan does not have an overlap parameter

It has the possibility :)


irb(main):001:0> "aaaaaa".scan(/(?=aa)/) do
irb(main):002:1* puts "#{$~.pre_match}-#{$~[0]}-#{$~.post_match}"
irb(main):003:1> end
--aaaaaa
a--aaaaa
aa--aaaa
aaa--aaa
aaaa--aa
=> "aaaaaa"
irb(main):004:0> "aaaaaa".scan(/(?=aa)/).length
=> 5

With the look ahead features severals things can be done.

Wolfgang Nádasi-Donner
 
R

Robert Dober

It has the possibility :)
I am not willing to pay that price;)
but let me see, in case of sparse overlaps ( and that is often the
case ) the lookahead might be less costly than my
brute force approach, hmm, not sure, I will try

38/38 > cat substr.rb && ruby substr.rb
# vim: sw=3D2 ts=3D2 ft=3Druby expandtab tw=3D0 nu syn:
#
class String
def bf_sub substr
(0...size).inject([]){ |r, i|
r <<
( substr.match(self[i..-1])[0] rescue nil )
}.compact
end

def la_sub substr
scan /(?=3D#{substr})/
end
end

require 'benchmark'

N =3D 1_000
A =3D "a" * 100
S =3D /aa/
B =3D "b " * 20
T =3D /b\s/
Benchmark::bmbm do |bm|
bm.report "look ahead on many occurrances" do
N.times do
A.la_sub S
end
end
bm.report "brute force on many occurrances" do
N.times do
A.bf_sub S
end
end

bm.report "look ahead on sparse occurrances" do
N.times do
B.la_sub T
end
end
bm.report "brute force on sparse occurrances" do
N.times do
B.bf_sub T
end
end
end

Rehearsal -----------------------------------------------------------------=
----
look ahead on many occurrances 0.150000 0.000000 0.150000 ( 0.240=
576)
brute force on many occurrances 1.020000 0.030000 1.050000 ( 1.327=
992)
look ahead on sparse occurrances 0.050000 0.000000 0.050000 ( 0.125=
577)
brute force on sparse occurrances 2.720000 0.090000 2.810000 ( 4.508=
353)
------------------------------------------------------------ total: 4.06000=
0sec

user system total r=
eal
look ahead on many occurrances 0.150000 0.000000 0.150000 ( 0.184=
002)
brute force on many occurrances 0.890000 0.050000 0.940000 ( 1.243=
384)
look ahead on sparse occurrances 0.050000 0.000000 0.050000 ( 0.051=
307)
brute force on sparse occurrances 2.620000 0.090000 2.710000 ( 3.080=
540)
---------------------------------------------------------------------------=
--------------------------------
$stdout.sub("urranc","urrenc")

Turns out lookahead is quite ok, brute force hopeless even in the case
suiting it best :(, wonder how this would scale if there
were a Regexp#match allowing for an offset.

Cheers
Robert


--=20
what do I think about Ruby?
http://ruby-smalltalk.blogspot.com/
 
W

Wolfgang Nádasi-Donner

Robert said:
...wonder how this would scale if there were a Regexp#match allowing for an offset.


...which will come in Ruby 1.9:

C:\Dokumente und Einstellungen\wolfgang>irb19
irb(main):001:0> "abcdef".match(/./,3)
=> #<MatchData "d">

Wolfgang Nádasi-Donner
 
R

Robert Dober

r an offset.


...which will come in Ruby 1.9:

C:\Dokumente und Einstellungen\wolfgang>irb19
irb(main):001:0> "abcdef".match(/./,3)
=3D> #<MatchData "d">
Great :)

I have written a 1.9 test and still lookahead is much better, I am
surpprised as I got rid of all fancy stuff in the
bf_sub method going back to my Regexp stuff happily again.

def bf_sub substr
r =3D []
size.times do |i|
s =3D
substr.match(self,i)
r << ( s && s[0] )
end
r.compact
end

512/12 > ruby1.9 substr-1.9.rb
Rehearsal -----------------------------------------------------------------=
----
look ahead on many occurrances 0.200000 0.000000 0.200000 ( 0.308=
822)
brute force on many occurrances 0.530000 0.000000 0.530000 ( 0.649=
798)
look ahead on sparse occurrances 0.070000 0.000000 0.070000 ( 0.171=
571)
brute force on sparse occurrances 1.260000 0.000000 1.260000 ( 1.517=
635)
------------------------------------------------------------ total: 2.06000=
0sec

user system total r=
eal
look ahead on many occurrances 0.200000 0.000000 0.200000 ( 0.531=
080)
brute force on many occurrances 0.530000 0.000000 0.530000 ( 0.641=
892)
look ahead on sparse occurrances 0.070000 0.000000 0.070000 ( 0.158=
519)
brute force on sparse occurrances 1.240000 0.000000 1.240000 ( 1.389=
619)



Original test
507/7 > ruby substr.rb
Rehearsal -----------------------------------------------------------------=
----
look ahead on many occurrances 0.140000 0.000000 0.140000 ( 0.148=
194)
brute force on many occurrances 0.770000 0.020000 0.790000 ( 1.049=
865)
look ahead on sparse occurrances 0.040000 0.000000 0.040000 ( 0.059=
984)
brute force on sparse occurrances 2.130000 0.050000 2.180000 ( 2.316=
258)
------------------------------------------------------------ total: 3.15000=
0sec

user system total r=
eal
look ahead on many occurrances 0.120000 0.000000 0.120000 ( 0.147=
802)
brute force on many occurrances 0.770000 0.030000 0.800000 ( 0.838=
135)
look ahead on sparse occurrances 0.040000 0.000000 0.040000 ( 0.057=
320)
brute force on sparse occurrances 2.070000 0.070000 2.140000 ( 2.398=
068)
Wolfgang N=E1dasi-Donner


R.
--=20
what do I think about Ruby?
http://ruby-smalltalk.blogspot.com/
 
R

Robert Klemme

Great :)

I have written a 1.9 test and still lookahead is much better, I am
surpprised as I got rid of all fancy stuff in the
bf_sub method going back to my Regexp stuff happily again.

IMHO it is not surprising because the lookahead method uses one method
(scan) which is implemented in C while you create a lot of temporary
objects and also start scanning at *every* position in the string.
Usually regexp engines are implemented much smarter than we can do in
short method. I am sure you have heard of Boyer Moore and Knuth Morris
Pratt algorithms... :)
def bf_sub substr
r = []
size.times do |i|
s =
substr.match(self,i)
r << ( s && s[0] )
end
r.compact
end

Kind regards

robert
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,968
Messages
2,570,153
Members
46,699
Latest member
AnneRosen

Latest Threads

Top