how to find difference in number of characters

H

harryos

hi
I am trying to write a compare method which takes two strings and find
how many characters have changed.


def compare_strings(s1,s2):
pass


text1="goat milk"
text2="cow milk"
print compare_strings(text1,text2)

This must give 3 ,since 3 characters are changed between strings.I was
advised to use levenshtein algorithm ..but then the matrix ops take a
long time for nontrivial strings of say 20000 characters ..Can this
comparison be implemented using difflib module?..I am at a loss as to
how to implement this using difflib .Is there some way I can get the
difference as a number ?
Can somebody help?
thanks
harry
 
P

Peter Otten

harryos said:
I am trying to write a compare method which takes two strings and find
how many characters have changed.


def compare_strings(s1,s2):
pass


text1="goat milk"
text2="cow milk"
print compare_strings(text1,text2)

This must give 3 ,since 3 characters are changed between strings.I was
advised to use levenshtein algorithm ..but then the matrix ops take a
long time for nontrivial strings of say 20000 characters ..Can this

I tried it with a string of 20000 chars and the python-levenshtein package
that comes with ubuntu. It took about one second to calculate the distance:

import functools
import random
import string
import time

from Levenshtein import distance

def make_strings(n, delete, insert, swap, replace,
charset=string.ascii_letters):
def gen_chars():
while True:
yield random.choice(charset)
chars = gen_chars()
a = [next(chars) for i in xrange(n)]
s = "".join(a)
for i in range(delete):
del a[random.randrange(len(a))]
for i in range(insert):
a.insert(random.randrange(len(a)+1), next(chars))
for i in range(swap):
p = random.randrange(len(a)-1)
a[p], a[p+1] = a[p+1], a[p]
for i in range(replace):
a[random.randrange(len(a))] = next(chars)
t = "".join(a)
return s, t

N = 20000
M = 100
ms = functools.partial(make_strings, N, M, M, M, M)

def measure(f, *args):
start = time.time()
try:
return f(*args)
finally:
print time.time() - start

if __name__ == "__main__":
import sys
args = sys.argv[1:]
if args:
N, M = map(int, args)

s, t = make_strings(N, M, M, M, M)
print measure(distance, s, t)

$ python levenshtein_demo.py 10000 1000
0.225363969803
3644
$ python levenshtein_demo.py 20000 1000
1.05217313766
4197
$ python levenshtein_demo.py 30000 1000
2.38736391068
4390
$ python levenshtein_demo.py 40000 1000
4.1686527729
4558

What would be an acceptable time?

Peter
 
H

harryos

What would be an acceptable time?

Thanks for the reply Peter,
I was using python functions I came across the net..not cpython
implementations..Probably my low config machine is also to blame..(I
am no expert at judging algorithm performance either),but is there a
way I can use difflib module to do this job?Even though I went through
the docs I couldn't make out how..

regards
harry
 
P

Peter Otten

harryos said:
but is there a way I can use difflib module to do this job?

I'm afraid I can't help you with that.

You might get more/better answers if you tell us more about the context of
the problem and add some details that may be relevant.

Peter
 
H

harryos

You might get more/better answers if you tell us more about the context of
the problem and add some details that may be relevant.

Peter

I am trying to determine if a wep page is updated by x number of
characters..Mozilla firefox plugin 'update scanner' has a similar
functionality ..A user can specify the x ..I think this would be done
by reading from the same url at two different times and finding the
change in body text..I was wondering if difflib could offer something
in the way of determining the size of delta..
Thanks again for the reply..
harry
 
S

Stefan Behnel

harryos, 09.10.2010 14:24:
I am trying to determine if a wep page is updated by x number of
characters..Mozilla firefox plugin 'update scanner' has a similar
functionality ..A user can specify the x ..I think this would be done
by reading from the same url at two different times and finding the
change in body text.

"Number of characters" sounds like a rather useless measure here. I'd
rather apply an XPath, CSS selector or PyQuery expression to the parsed
page and check if the interesting subtree of it has changed at all or not,
potentially disregarding any structural changes by stripping all tags and
normalising the resulting text to ignore whitespace and case differences.

Stefan
 
H

harryos

"Number of characters" sounds like a rather useless measure here.

What I meant by number of characters was the number of edits happened
between the two versions..Levenshtein distance may be one way for
this..but I was wondering if difflib could do this
regards

harry
 
E

Emmanuel Surleau

What I meant by number of characters was the number of edits happened
between the two versions..Levenshtein distance may be one way for
this..but I was wondering if difflib could do this
regards

As pointed out above, you also need to consider how the structure of the web
page has changed. If you are only looking at plain text, the Levenshtein
distance measures the number of edit operations (insertion, deletion or
substition) necessary to transform string A into string B.

Cheers,

Emm
 
S

Seebs

What I meant by number of characters was the number of edits happened
between the two versions..

Consider two strings:

Hello, world!

Yo, there.

What is the "number of edits happened between the two versions"? It could
be:

* Zero. I just typed them both from scratch, no editing occurred between
them.
* Two. Two words are different.
* Ten or so -- counting changed characters.
* Three. Two words and a punctuation mark are different.

In other words, your problem here is that you haven't actually described
what you want. Slow down. Think! Describe what you want clearly enough
that any other person who reads your description can always come up with
the same answer you would for a given set of inputs.

-s
 
G

geremy condra

Consider two strings:

Hello, world!

Yo, there.

What is the "number of edits happened between the two versions"?  It could
be:

* Zero.  I just typed them both from scratch, no editing occurred between
 them.
* Two.  Two words are different.
* Ten or so -- counting changed characters.
* Three.  Two words and a punctuation mark are different.

In other words, your problem here is that you haven't actually described
what you want.  Slow down.  Think!  Describe what you want clearly enough
that any other person who reads your description can always come up with
the same answer you would for a given set of inputs.

-s

He mentioned L distance earlier, I'm sure he means 'number of edits'
in that context...

Geremy Condra
 
D

Diez B. Roggisch

harryos said:
I am trying to determine if a wep page is updated by x number of
characters..Mozilla firefox plugin 'update scanner' has a similar
functionality ..A user can specify the x ..I think this would be done
by reading from the same url at two different times and finding the
change in body text..I was wondering if difflib could offer something
in the way of determining the size of delta..

If you normalize the data, this might be worth trying.

Make all tags appear on one single line, possibly re-order attributes so
that they are in alphabetical order. Each text child git's also
normalized, by replacing all whitespace with a single space.

Then run difflib over these, and count the number of diffrences.


Diez
 
D

D'Arcy J.M. Cain

In other words, your problem here is that you haven't actually described
what you want. Slow down. Think! Describe what you want clearly enough
that any other person who reads your description can always come up with
the same answer you would for a given set of inputs.

Better yet, write the unit tests for us.
 
D

Dennis Lee Bieber

Consider two strings:

Hello, world!

Yo, there.

What is the "number of edits happened between the two versions"? It could
be:

* Zero. I just typed them both from scratch, no editing occurred between
them.

Well, if those are supposed to be the same "file" at different times
in its life... it requires ONE edit: select all, replace with new text.

As for the original "cow milk" vs "goat milk"... I'd consider that a
One Edit difference too... replacing the first word. Otherwise you get
into things like: replace character 1, replace character 3, insert
character at position 4 (implies shifting rest, unless you want to count
all shifted characters as a change -- which a line-oriented DIFF would).

I'm not likely to care about the operations needed to make a change
so much as the final contents of the change... one word changed <period>

If talking a web page, is the markup to be considered or ignored?

<p><b>This is IT!</b></p>
vs
<h1>This is IT!</h1>

If markup is considered, I'd consider that two "edits"... Change
paragraph to heading-1, and remove bolding. If markup is irrelevant,
then no edits have occurred. I would not consider it four edits (change
<p> to <h1>, delete <b>, delete </b>, change </p> to </h1>) since a
decent HTML editor should allow for selecting the matching tags and
replacing them (or at least, deleting the bounding tags and wrapping the
remains in a different tag).
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,989
Messages
2,570,207
Members
46,783
Latest member
RickeyDort

Latest Threads

Top