Search for a string in another string allowing mismatches

J

Janus Bor

Hi all,

I'm sure this is a very common problem. But can't find a simple &
efficient way of solving it.

Basically, I want to find out if a string contains my search string as a
substring. However, a certain amount of mismatches has to be allowed.
Here's an example:

query string:
"acgt"

subject string
"acctaggg"

If no mismatch was allowed, there would be 0 hits.
If 1 mismatch was allowed, the query string would match "acct".
If 2 mismatches were allowed, the query string would match "acct" and
"aggg".

If possible, I would like to do this using regular expressions, as my
search string can also contain ambiguous characters. So a real search
pattern might look something like this: /[ac][c][cg][t][acgt][c][gt][t]/

Unfortunately, performance is also very important, as I will have to
perform thousands of searches in strings that contain several million
characters.

I'd be very grateful for any suggestions!

Cheers,
Janus
 
D

Dave Howell

Hi all,
=20
I'm sure this is a very common problem. But can't find a simple &
efficient way of solving it.

Hmm. I wouldn't think this is common at all. I have dealt with a =
somewhat similar situation in the past in two ways:

1. Let the user do partial searches.

2. Using the 'soundex' tools in a database

Neither of those matches your situation, which appears to be scanning =
for near-matches in genetic material.
Unfortunately, performance is also very important, as I will have to
perform thousands of searches in strings that contain several million
characters.

This certainly isn't my field of expertise, but I don't expect a =
fuzzy-logic search and 'performance' to be particularly compatible. Hope =
I'm wrong!
If possible, I would like to do this using regular expressions, as my
search string can also contain ambiguous characters. So a real search
pattern might look something like this: =
/[ac][c][cg][t][acgt][c][gt][t]/

Gah.=20

What came to my mind was something like=20

fuzzysearch("agct", 1)

where "1" is the number of mismatches allowed. The code in question =
would run the following searches:
/.gct/
/a.ct/
/ag.t/
/agc./

for fuzzysearch("agct",2) it would run
/..ct/
/.g.t/
/.gc./
/a..t/
/a.c./
/ag../

I have no idea what the overhead of wildcards are; if there's some =
ultra-simple ultra-fast string matching tool somewhere that you can call =
from Ruby or invoke through a command line, then you could take =
advantage of the tiny set of possible alternatives, and break the fuzzy =
search into 13 static searches:
agct
cgct
ggct
tgct
aact
acct
atct
 
H

Harry Kakueki

Basically, I want to find out if a string contains my search string as a
substring. However, a certain amount of mismatches has to be allowed.
Here's an example:

query string:
"acgt"

subject string
"acctaggg"

If no mismatch was allowed, there would be 0 hits.
If 1 mismatch was allowed, the query string would match "acct".
If 2 mismatches were allowed, the query string would match "acct" and
"aggg".

This does not use regular expressions.
It does not tell you which characters match. It tells you how many.

I don't know if it is useful or fast enough but maybe it will give you
some ideas.


str = "acctaggg"

(0..str.length-4).each do |y|
p str.unpack("@#{y}a4")[0].split(//).zip("acgt".split(//)).map{|r|
r[0]<=>r[1]}.select{|f| f==0}.size
end



Harry
 
H

Harry Kakueki

Basically, I want to find out if a string contains my search string as a
substring. However, a certain amount of mismatches has to be allowed.
Here's an example:

query string:
"acgt"

subject string
"acctaggg"

If no mismatch was allowed, there would be 0 hits.
If 1 mismatch was allowed, the query string would match "acct".
If 2 mismatches were allowed, the query string would match "acct" and
"aggg".

This does not use regular expressions.
It does not tell you which characters match. It tells you how many.

I don't know if it is useful or fast enough but maybe it will give you
some ideas.


str =3D "acctaggg"

(0..str.length-4).each do |y|
=A0p str.unpack("@#{y}a4")[0].split(//).zip("acgt".split(//)).map{|r|
r[0]<=3D>r[1]}.select{|f| f=3D=3D0}.size
end



Harry

A little modification of my previous post.
Again, (lack of) speed may be a problem.


class String
def fuz(m,n) # (string to match, number of matches)
a =3D []
(0..size-4).each do |y|
t =3D self[y,4]
if t.split(//).zip(m.split(//)).map{|r| r[0]<=3D>r[1]}.select{|f|
f=3D=3D0}.size>=3Dn
a << t
end
end
a
end
end

str =3D "acctaggg"

p str.fuz("acgt",2) #> ["acct", "aggg"]


Harry
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,968
Messages
2,570,154
Members
46,702
Latest member
LukasConde

Latest Threads

Top