how to find difference in number of characters

harryos · Oct 9, 2010

hi
I am trying to write a compare method which takes two strings and find
how many characters have changed.

def compare_strings(s1,s2):
pass

text1="goat milk"
text2="cow milk"
print compare_strings(text1,text2)

This must give 3 ,since 3 characters are changed between strings.I was
advised to use levenshtein algorithm ..but then the matrix ops take a
long time for nontrivial strings of say 20000 characters ..Can this
comparison be implemented using difflib module?..I am at a loss as to
how to implement this using difflib .Is there some way I can get the
difference as a number ?
Can somebody help?
thanks
harry

Peter Otten · Oct 9, 2010

harryos said:
I am trying to write a compare method which takes two strings and find
how many characters have changed.

def compare_strings(s1,s2):
pass

text1="goat milk"
text2="cow milk"
print compare_strings(text1,text2)

This must give 3 ,since 3 characters are changed between strings.I was
advised to use levenshtein algorithm ..but then the matrix ops take a
long time for nontrivial strings of say 20000 characters ..Can this

I tried it with a string of 20000 chars and the python-levenshtein package
that comes with ubuntu. It took about one second to calculate the distance:

import functools
import random
import string
import time

from Levenshtein import distance

def make_strings(n, delete, insert, swap, replace,
charset=string.ascii_letters):
def gen_chars():
while True:
yield random.choice(charset)
chars = gen_chars()
a = [next(chars) for i in xrange(n)]
s = "".join(a)
for i in range(delete):
del a[random.randrange(len(a))]
for i in range(insert):
a.insert(random.randrange(len(a)+1), next(chars))
for i in range(swap):
p = random.randrange(len(a)-1)
a[p], a[p+1] = a[p+1], a[p]
for i in range(replace):
a[random.randrange(len(a))] = next(chars)
t = "".join(a)
return s, t

N = 20000
M = 100
ms = functools.partial(make_strings, N, M, M, M, M)

def measure(f, *args):
start = time.time()
try:
return f(*args)
finally:
print time.time() - start

if __name__ == "__main__":
import sys
args = sys.argv[1:]
if args:
N, M = map(int, args)

s, t = make_strings(N, M, M, M, M)
print measure(distance, s, t)

$ python levenshtein_demo.py 10000 1000
0.225363969803
3644
$ python levenshtein_demo.py 20000 1000
1.05217313766
4197
$ python levenshtein_demo.py 30000 1000
2.38736391068
4390
$ python levenshtein_demo.py 40000 1000
4.1686527729
4558

What would be an acceptable time?

Peter

harryos · Oct 9, 2010

What would be an acceptable time?

Thanks for the reply Peter,
I was using python functions I came across the net..not cpython
implementations..Probably my low config machine is also to blame..(I
am no expert at judging algorithm performance either),but is there a
way I can use difflib module to do this job?Even though I went through
the docs I couldn't make out how..

regards
harry

Peter Otten · Oct 9, 2010

harryos said:
but is there a way I can use difflib module to do this job?

I'm afraid I can't help you with that.

You might get more/better answers if you tell us more about the context of
the problem and add some details that may be relevant.

Peter

harryos · Oct 9, 2010

You might get more/better answers if you tell us more about the context of
the problem and add some details that may be relevant.

Peter

I am trying to determine if a wep page is updated by x number of
characters..Mozilla firefox plugin 'update scanner' has a similar
functionality ..A user can specify the x ..I think this would be done
by reading from the same url at two different times and finding the
change in body text..I was wondering if difflib could offer something
in the way of determining the size of delta..
Thanks again for the reply..
harry

Stefan Behnel · Oct 9, 2010

harryos, 09.10.2010 14:24:

I am trying to determine if a wep page is updated by x number of
characters..Mozilla firefox plugin 'update scanner' has a similar
functionality ..A user can specify the x ..I think this would be done
by reading from the same url at two different times and finding the
change in body text.

"Number of characters" sounds like a rather useless measure here. I'd
rather apply an XPath, CSS selector or PyQuery expression to the parsed
page and check if the interesting subtree of it has changed at all or not,
potentially disregarding any structural changes by stripping all tags and
normalising the resulting text to ignore whitespace and case differences.

Stefan

harryos · Oct 9, 2010

"Number of characters" sounds like a rather useless measure here.

What I meant by number of characters was the number of edits happened
between the two versions..Levenshtein distance may be one way for
this..but I was wondering if difflib could do this
regards

harry

Emmanuel Surleau · Oct 9, 2010

What I meant by number of characters was the number of edits happened
between the two versions..Levenshtein distance may be one way for
this..but I was wondering if difflib could do this
regards

As pointed out above, you also need to consider how the structure of the web
page has changed. If you are only looking at plain text, the Levenshtein
distance measures the number of edit operations (insertion, deletion or
substition) necessary to transform string A into string B.

Cheers,

Emm

Seebs · Oct 9, 2010

What I meant by number of characters was the number of edits happened
between the two versions..

Consider two strings:

Hello, world!

Yo, there.

What is the "number of edits happened between the two versions"? It could
be:

* Zero. I just typed them both from scratch, no editing occurred between
them.
* Two. Two words are different.
* Ten or so -- counting changed characters.
* Three. Two words and a punctuation mark are different.

In other words, your problem here is that you haven't actually described
what you want. Slow down. Think! Describe what you want clearly enough
that any other person who reads your description can always come up with
the same answer you would for a given set of inputs.

-s

geremy condra · Oct 9, 2010

Consider two strings:

Hello, world!

Yo, there.

What is the "number of edits happened between the two versions"? It could
be:

* Zero. I just typed them both from scratch, no editing occurred between
them.
* Two. Two words are different.
* Ten or so -- counting changed characters.
* Three. Two words and a punctuation mark are different.

In other words, your problem here is that you haven't actually described
what you want. Slow down. Think! Describe what you want clearly enough
that any other person who reads your description can always come up with
the same answer you would for a given set of inputs.

-s

He mentioned L distance earlier, I'm sure he means 'number of edits'
in that context...

Geremy Condra

Diez B. Roggisch · Oct 9, 2010

harryos said:
I am trying to determine if a wep page is updated by x number of
characters..Mozilla firefox plugin 'update scanner' has a similar
functionality ..A user can specify the x ..I think this would be done
by reading from the same url at two different times and finding the
change in body text..I was wondering if difflib could offer something
in the way of determining the size of delta..

If you normalize the data, this might be worth trying.

Make all tags appear on one single line, possibly re-order attributes so
that they are in alphabetical order. Each text child git's also
normalized, by replacing all whitespace with a single space.

Then run difflib over these, and count the number of diffrences.

Diez

D'Arcy J.M. Cain · Oct 9, 2010

In other words, your problem here is that you haven't actually described
what you want. Slow down. Think! Describe what you want clearly enough
that any other person who reads your description can always come up with
the same answer you would for a given set of inputs.

Better yet, write the unit tests for us.

Dennis Lee Bieber · Oct 9, 2010

Consider two strings:

Hello, world!

Yo, there.

What is the "number of edits happened between the two versions"? It could
be:

* Zero. I just typed them both from scratch, no editing occurred between
them.

Well, if those are supposed to be the same "file" at different times
in its life... it requires ONE edit: select all, replace with new text.

As for the original "cow milk" vs "goat milk"... I'd consider that a
One Edit difference too... replacing the first word. Otherwise you get
into things like: replace character 1, replace character 3, insert
character at position 4 (implies shifting rest, unless you want to count
all shifted characters as a change -- which a line-oriented DIFF would).

I'm not likely to care about the operations needed to make a change
so much as the final contents of the change... one word changed <period>

If talking a web page, is the markup to be considered or ignored?

<p><b>This is IT!</b></p>
vs
<h1>This is IT!</h1>

If markup is considered, I'd consider that two "edits"... Change
paragraph to heading-1, and remove bolding. If markup is irrelevant,
then no edits have occurred. I would not consider it four edits (change
<p> to <h1>, delete <b>, delete </b>, change </p> to </h1>) since a
decent HTML editor should allow for selecting the matching tags and
replacing them (or at least, deleting the bounding tags and wrapping the
remains in a different tag).

How to find number of characters in a unicode string?	8	Sep 18, 2006
How to use ufixed when it involves multiplication a number of times?(VHDL question)	0	Aug 22, 2016
how to count the total number of strings (in the list) used in Python?	0	Apr 16, 2012
need help with a cart I inherited, need to increase number of total characters allowed	3	Oct 22, 2007
Idiot Q: How to find index number of HASH match?	25	Jul 21, 2006
How to get the number of Elements in a Matrix (Multi-Dimisional array)?	3	Apr 9, 2006
JUnit - How to aggregate differences instead of asserting just the first difference ?	0	Sep 14, 2005
Expected Token Density in Random Stream	0	Dec 7, 2011

how to find difference in number of characters

harryos

Peter Otten

harryos

Peter Otten

harryos

Stefan Behnel

harryos

Emmanuel Surleau

Seebs

geremy condra

Diez B. Roggisch

D'Arcy J.M. Cain

Dennis Lee Bieber

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads