newb: comapring two strings

manstey

Hi,

Is there a clever way to see if two strings of the same length vary by
only one character, and what that character is in both strings?

E.g. str1=yaqtil str2=yaqtel

They differ at str1[4] and the difference is ('i','e').

But if it were str1=yiqtol and str2=yaqtel, I am not interested.

Can anyone suggest a simple way to do this?

My next problem is, I have a list of 300,000+ words and I want to find
every pair of such strings. I thought I would first sort on length of
string, but how do I iterate through the following:

str1
str2
str3
str4
str5

so that I compare str1 & str2, str1 & str3, str1 & str4, str1 & str5,
str2 & str3, str2 & str4, str2 & str5, str3 & str4, str3 & str5, and
str4 & str5?

Thanks in advance,
Matthew
 

Diez B. Roggisch

Is there a clever way to see if two strings of the same length vary by
only one character, and what that character is in both strings?

E.g. str1=yaqtil str2=yaqtel

They differ at str1[4] and the difference is ('i','e').

But if it were str1=yiqtol and str2=yaqtel, I am not interested.

Can anyone suggest a simple way to do this?

Use the Levenshtein distance.
http://en.wikisource.org/wiki/Levenshtein_distance
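
For reference, a minimal sketch of the standard dynamic-programming
recurrence for the Levenshtein distance (an editorial illustration, not
code from the linked page):

def levenshtein(a, b):
    # current[j] holds the edit distance between a[:i] and b[:j]
    current = range(len(b) + 1)
    for i in range(1, len(a) + 1):
        previous, current = current, [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            insert = previous[j] + 1
            delete = current[j - 1] + 1
            change = previous[j - 1] + (a[i - 1] != b[j - 1])
            current[j] = min(insert, delete, change)
    return current[len(b)]

print levenshtein("yaqtil", "yaqtel")   # 1
print levenshtein("yiqtol", "yaqtel")   # 2

For the OP's restricted problem (equal lengths, substitutions only) a
plain character-by-character loop is cheaper, as discussed further down
the thread.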

My next problem is, I have a list of 300,000+ words and I want to find
every pair of such strings. I thought I would first sort on length of
string, but how do I iterate through the following:

str1
str2
str3
str4
str5

so that I compare str1 & str2, str1 & str3, str1 & str4, str1 & str5,
str2 & str3, str2 & str4, str2 & str5, str3 & str4, str3 & str5, and
str4 & str5?

Decorate-sort-undecorate is the idiom for this:

l = <list of strings>

l = [(len(w), w) for w in l]
l.sort()
l = [w for _, w in l]
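
Once sorted this way, each length class can be processed separately; a
small editorial follow-on, using the sorted list l from above and
itertools.groupby:

from itertools import groupby

for length, group in groupby(l, key=len):
    same_length = list(group)
    # ... compare only the words within this group ...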


Diez
 

Serge Orlov

manstey said:
Hi,

Is there a clever way to see if two strings of the same length vary by
only one character, and what that character is in both strings?

E.g. str1=yaqtil str2=yaqtel

They differ at str1[4] and the difference is ('i','e').

But if it were str1=yiqtol and str2=yaqtel, I am not interested.

Can anyone suggest a simple way to do this?

My next problem is, I have a list of 300,000+ words and I want to find
every pair of such strings. I thought I would first sort on length of
string, but how do I iterate through the following:

str1
str2
str3
str4
str5

so that I compare str1 & str2, str1 & str3, str1 & str4, str1 & str5,
str2 & str3, str2 & str4, str2 & str5, str3 & str4, str3 & str5, and
str4 & str5?

If your strings are pretty short you can do it like this even without
sorting by length first:

def fuzzy_keys(s):
    # yield one key per position, with that position masked out
    for pos in range(len(s)):
        yield s[0:pos] + chr(0) + s[pos+1:]

def fuzzy_insert(d, s):
    for fuzzy_key in fuzzy_keys(s):
        if fuzzy_key in d:
            strings = d[fuzzy_key]
            if type(strings) is list:
                strings.append(s)
            else:
                d[fuzzy_key] = [strings, s]
        else:
            d[fuzzy_key] = s

def gather_fuzzy_matches(d):
    for strings in d.itervalues():
        if type(strings) is list:
            yield strings

acc = {}
fuzzy_insert(acc, "yaqtel")
fuzzy_insert(acc, "yaqtil")
fuzzy_insert(acc, "oaqtil")
print list(gather_fuzzy_matches(acc))

prints

[['yaqtil', 'oaqtil'], ['yaqtel', 'yaqtil']]
 

johnzenger

manstey said:
Hi,

Is there a clever way to see if two strings of the same length vary by
only one character, and what that character is in both strings?

You want zip.

def diffbyonlyone(string1, string2):
    diffcount = 0
    for c1, c2 in zip(string1, string2):
        if c1 != c2:
            diffcount += 1
            if diffcount > 1:
                return False
    return diffcount == 1

print diffbyonlyone("yaqtil","yaqtel") # True
print diffbyonlyone("yiqtol","yaqtel") # False

If your strings are long, it might be faster/more memory efficient to
use itertools.izip instead.
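
The OP also wants to know where the two strings differ and which
characters are involved; a small variant in the same spirit (the name
diffposition and its shape are editorial, not from the post):

def diffposition(string1, string2):
    # return (index, (char1, char2)) if the strings differ at exactly
    # one position, or None otherwise
    diff = None
    for i, (c1, c2) in enumerate(zip(string1, string2)):
        if c1 != c2:
            if diff is not None:
                return None   # second mismatch - not interested
            diff = (i, (c1, c2))
    return diff

print diffposition("yaqtil", "yaqtel")   # (4, ('i', 'e'))
print diffposition("yiqtol", "yaqtel")   # None
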
My next problem is, I have a list of 300,000+ words and I want to find
every pair of such strings. I thought I would first sort on length of
string, but how do I iterate through the following:

str1
str2
str3
str4
str5

so that I compare str1 & str2, str1 & str3, str1 & str4, str1 & str5,
str2 & str3, str2 & str4, str2 & str5, str3 & str4, str3 & str5, and
str4 & str5?

for index1 in xrange(len(words)):
    for index2 in xrange(index1 + 1, len(words)):
        if diffbyonlyone(words[index1], words[index2]):
            print words[index1] + " -- " + words[index2]

....but by all means run that only on sets of words that you have
already identified, pursuant to some criteria like word length, to be
likely matches. Do the math; that's a lot of comparisons!
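
(An editorial aside: on Python 2.6 or later the same pairwise loop can
be written with itertools.combinations, which yields each unordered pair
exactly once:)

from itertools import combinations   # Python 2.6+

for w1, w2 in combinations(words, 2):
    if diffbyonlyone(w1, w2):
        print w1 + " -- " + w2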
 

Paul McGuire

manstey said:
Hi,

Is there a clever way to see if two strings of the same length vary by
only one character, and what that character is in both strings?

E.g. str1=yaqtil str2=yaqtel

They differ at str1[4] and the difference is ('i','e').
def offByNoMoreThanOneCharacter(a, b):
    return len(a) == len(b) and \
           sum(map(lambda (x, y): x == y, zip(a, b))) >= len(a) - 1

print offByNoMoreThanOneCharacter("yaqtil", "yaqtel")   # True
print offByNoMoreThanOneCharacter("yiqtol", "yaqtel")   # False

This takes advantage of the Python convention that True evaluates to 1 and
False evaluates to 0.
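
For instance:

print sum([True, False, True])   # prints 2, since True == 1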

-- Paul
 

Paul Rubin

Paul McGuire said:
... return len(a)==len(b) and sum(map(lambda (x,y): x==y, zip(a,b))) >=
len(a)-1

Yikes! How about (untested):

def offByNoMoreThanOneCharacter(a, b):
    return len(a) == len(b) and \
           len([i for i in xrange(len(a)) if a[i] != b[i]]) <= 1

which seems more direct. I'm not sure it's correct since I'd
consider "apple" and "apples" to be one character apart even though
their lengths are unequal.
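
(An editorial sketch of that looser notion - treating a single insertion
as one character apart too; the name offByOneInsertion is not Paul's:)

def offByOneInsertion(shorter, longer):
    # True if deleting exactly one character from `longer` gives `shorter`
    if len(longer) - len(shorter) != 1:
        return False
    for i in xrange(len(longer)):
        if longer[:i] + longer[i+1:] == shorter:
            return True
    return False

print offByOneInsertion("apple", "apples")   # True
print offByOneInsertion("apple", "orange")   # False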

But I think this type of function is useless for the problem at hand,
since there are around N=300,000 words, comparing every pair of them
is O(N**2) which is a bit painful.

I'd use a decorate-sort-undecorate approach like:

# generate the word itself plus all the ways to drop one character
def generate_all_variants(word):
    wlower = word.lower()
    yield (wlower, word)
    for i in xrange(len(word)):
        yield (wlower[:i] + wlower[i+1:], word)

Use this to generate all the variants of all the words in the
dictionary, and write those out into a file, each line containing a
variant plus the original word. Then use a sorting utility like the
unix "sort" program to sort the file. Those programs work efficiently
even if the file is too large to fit in memory. Then read the sorted
file to find equivalence classes on the variants.
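
(A hedged sketch of that pipeline - the file names and glue code are
editorial choices, not Paul's; `words` stands for the 300,000-word list:)

from itertools import groupby

# phase 1: write one "variant original" line per variant
out = open("variants.txt", "w")
for word in words:
    for variant, original in generate_all_variants(word):
        out.write("%s %s\n" % (variant, original))
out.close()

# phase 2: sort externally, e.g.:  sort variants.txt > variants.sorted

# phase 3: words sharing a variant form an equivalence class
def variant_of(line):
    return line.split()[0]

for variant, lines in groupby(open("variants.sorted"), key=variant_of):
    originals = set(line.split()[1] for line in lines)
    if len(originals) > 1:
        print " ".join(sorted(originals))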
 

Peter Otten

manstey said:
Is there a clever way to see if two strings of the same length vary by
only one character, and what that character is in both strings?

E.g. str1=yaqtil str2=yaqtel

They differ at str1[4] and the difference is ('i','e').

But if it were str1=yiqtol and str2=yaqtel, I am not interested.

Can anyone suggest a simple way to do this?

My next problem is, I have a list of 300,000+ words and I want to find
every pair of such strings. I thought I would first sort on length of
string, but how do I iterate through the following:

Not sure if it can handle 300000 words, but here is something to play with.

import sys

def find_similars(words, lookup=None, dupes=None):
    if lookup is None:
        lookup = {}
    if dupes is None:
        dupes = set()
    for word in words:
        low_word = word.lower()
        if low_word not in dupes:
            dupes.add(low_word)
            last = None
            for i, c in enumerate(low_word):
                if c == last: continue
                key = low_word[:i], low_word[i+1:]
                if key in lookup:
                    lookup[key].append(word)
                else:
                    lookup[key] = [word]
                last = c
    return (group for group in lookup.itervalues() if len(group) > 1)

def main():
    import optparse
    parser = optparse.OptionParser()
    parser.usage += " infile"
    parser.add_option("-n", "--limit", type="int",
                      help="process at most LIMIT words")
    options, args = parser.parse_args()
    if args:
        words = (w.strip() for fn in args for w in open(fn))
    else:
        words = (w.strip() for w in sys.stdin)
    if options.limit is not None:
        from itertools import islice
        words = islice(words, options.limit)

    for group in find_similars(words):
        print " ".join(sorted(group))

if __name__ == "__main__":
    main()

Peter
 

John Machin

Use the Levenshtein distance.

Given the constraint that the two strings are the same length, I'm
assuming (as other posters appear to have done) that "vary by only one
character" precludes insertion and deletion operations.

In that case, the problem can be solved in O(n) time by a simple loop
which counts the number of differences and notes the position of the
first (if any) difference. Any full-distance Levenshtein method that
does no pre-processing must take O(n**2) time. It is possible to take
advantage of the fact that the OP doesn't care about what the distance
is exactly if it is not 1; 0 is just as bad as 2, 3, 4, etc. In this
case it is possible to achieve O(n) time [ref. Ukkonen]. However (a)
one still needs to track where the difference arose (b) we are talking
about extreme overkill for the OP's problem.
 

Paul McGuire

In that case, the problem can be solved in O(n) time by a simple loop
which counts the number of differences and notes the position of the
first (if any) difference. Any full-distance Levenshtein method that
does no pre-processing must take O(n**2) time. It is possible to take
advantage of the fact that the OP doesn't care about what the distance
is exactly if it is not 1; 0 is just as bad as 2, 3, 4, etc.

Here is a class that implements 4 different approaches to this comparison
problem, with a configurable number of maximum mismatches (where a and b are
the strings being compared):

1. use sum(map(lambda (x,y): x!=y, itertools.izip(a,b))) to count
mismatches - no short-circuiting
2. use sum(map(abs, itertools.starmap(cmp, itertools.izip(a,b)))) to count
mismatches - no short-circuiting
3. use an explicit for loop to compare each tuple returned from
itertools.izip(a,b) - short-circuits when the number of mismatches exceeds
the allowable number
4. use a for loop over itertools.takewhile to count mismatches -
short-circuits when the number of mismatches exceeds the allowable number

Also note the short-circuiting in the initializer - if the strings being
compared are shorter in length than the number of allowed mismatches, then
they will always match, no comparisons required. (In the OP's specific
case, any two one-character strings will match within one character.)

Of course this does nothing to address the larger issue of the program
design - I just wanted to learn a little more about itertools. :)

-- Paul


from itertools import izip, starmap, takewhile

class offByNoMoreThanOneCharacter(object):
    def __init__(self, a, b, maxMismatches=1):
        len_a = len(a)
        self.val = None
        if len_a != len(b):
            self.val = False
        elif len_a <= maxMismatches:
            self.val = True
        else:
            self.a = a
            self.b = b
            self.maxMismatches = maxMismatches

    def eval1(self):
        # one-liner using map with lambda - sum counts mismatches
        self.val = (sum(map(lambda (x, y): x != y, izip(self.a, self.b)))
                    <= self.maxMismatches)

    def eval2(self):
        # one-liner using map with abs and cmp - sum counts mismatches
        self.val = (sum(map(abs, starmap(cmp, izip(self.a, self.b))))
                    <= self.maxMismatches)

    def eval3(self):
        # explicit traversal, with short-circuit when too many mismatches found
        mismatches = 0
        for (ac, bc) in izip(self.a, self.b):
            if ac != bc:
                mismatches += 1
                if mismatches > self.maxMismatches:
                    self.val = False
                    break
        else:
            self.val = True

    def eval4(self):
        # traversal using takewhile, also short-circuits when too many
        # mismatches found
        numMismatches = 0
        maxMismatches = self.maxMismatches
        stillOk = lambda (s, t)=(None, None): numMismatches <= maxMismatches
        for (ac, bc) in takewhile(stillOk, izip(self.a, self.b)):
            numMismatches += (ac != bc)
        self.val = stillOk()

    # special instance method to return "boolean-ness" of this object
    def __nonzero__(self):
        if self.val is None:
            # change this line to use whichever eval method you want to test
            self.eval1()
        return self.val


def test(a, b):
    print a, b
    print bool(offByNoMoreThanOneCharacter(a, b))
    print

test("abc", "abd")
test("abc", "axd")
test("yaqtil", "yaqtel")
test("yiqtol", "yaqtel")
 

bearophileHUGS

I think a way to solve the problem may be:
1) create a little Python script to separate the original words into many
files, each one containing only words of the same length. Every
filename can contain the relative word length.
2) write a little C program with two nested loops that scans all the
pairs of words in a single file, looking for single-char
differences using a scan of the chars. Probably there are many
possible tricks to speed up this search (like separating the words into
subgroups, or using some assembly-derived tricks, etc.) but maybe you
don't need them. As the C data structure you may simply use a single block
(because you know the length of a single word, so the files produced by
Python can be without spaces and newlines).

I have suggested C because if the words are all of the same length then
you have 300000^2 = 90 000 000 000 pairs to test.
If you want, before creating the C program you can create a little
Python+Psyco prototype that uses array.array of chars (because
sometimes Psyco handles them quite quickly).

Bye,
bearophile
 

bearophileHUGS

I have suggested C because if the words are all of the same length then
you have 300000^2 = 90 000 000 000 pairs to test.

Sorry, you have (n*(n-1))/2 pairs to test (~ 45 000 000 000).

bearophile
 

John Machin

Still terrible. Use a better algorithm!

To put all this in perspective, here's a very similar real-world
problem:

You have a customer database with 300,000 records. Some of them are
duplicated, because there are differences in recording the customers'
names, like a one-keystroke typo. The task is to locate pairs of rows
which are likely to be duplicates. 45 billion comaprisons[1] may be
ecxessive.

And here's a clue for a better algorithm: Knuth's TAOCP vol 3 [1973
edition] chapter 6 (searching) third page "as if the words were spelled
backwards". If you find yourself reading about soundex, you've
definitely gone too far :)
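
(An editorial sketch of one way to read that hint, not John's code:
bucket the words - per length - by their first half and by their last
half. Two same-length words differing in exactly one position must agree
exactly on at least one of the two halves, so only words sharing a
bucket need to be compared:)

def find_one_char_pairs(words):
    buckets = {}
    for w in words:
        h = len(w) // 2
        buckets.setdefault((0, len(w), w[:h]), []).append(w)
        buckets.setdefault((1, len(w), w[h:]), []).append(w)
    seen = set()
    for group in buckets.itervalues():
        for i in xrange(len(group)):
            for j in xrange(i + 1, len(group)):
                a, b = group[i], group[j]
                if (a, b) not in seen:
                    seen.add((a, b))
                    if sum(c1 != c2 for c1, c2 in zip(a, b)) == 1:
                        print a, "--", b

This is still quadratic within a bucket, but for real word lists the
buckets are small, so far fewer than 45 billion comparisons are made.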

[1]: "comapring" in the subject: is this a game of 'Spot the deliberate
mistake"? Is the OP counting a transposition of adjacent characters as
"varying by one character"?

Cheers
 

bearophileHUGS

Paul Rubin>Still terrible. Use a better algorithm!<

I agree, it's O(n^2), but if you need to run this program just once,
and the program is in C, and you don't want to spend much time thinking
about and coding a better algorithm (or you aren't able to do it), then
maybe that naive solution can be enough (if the running time is around
30-90 minutes).
Total time to solve a one-shot problem is often the sum of thinking
time + programming time + running time :)


Paul Rubin>Use this to generate all the variants of all the words in
the dictionary, and write those out into a file, each line containing a
variant plus the original word. Then use a sorting utility like the
unix "sort" program to sort the file. Those programs work efficiently
even if the file is too large to fit in memory. Then read the sorted
file to find equivalence classes on the variants.<

If the words are all 5 characters long, then I think you are creating:
(len(lowercase)-1) * 5 * 300000 = 37500000
strings of len 5+5 (plus a newline if you save them to a file), which means
a file of about 375 MB. All those strings may even fit in memory if you
have a lot of it.
Your solution looks nice :) (It reminds me a little of the word signature
trick used to find anagrams.)
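
(For the curious, an editorial aside - the signature trick alluded to
above: sort each word's letters to form a key, and anagrams collide on
it:)

def anagram_groups(words):
    # anagrams share the same sorted-letter signature
    groups = {}
    for w in words:
        groups.setdefault("".join(sorted(w.lower())), []).append(w)
    return [g for g in groups.itervalues() if len(g) > 1]

print anagram_groups(["listen", "silent", "enlist", "python"])
# prints [['listen', 'silent', 'enlist']]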

Bye,
bearophile
 
