Diff of Text

G

GZ

Hi All,

I am looking for an algorithm that can compare to source code files
line by line and find the minimum diff. I have looked at the difflib
included in python. The problem is that it is designed to make the
diff results easier for humans to read, instead of minimize the size
of the output differencial. I would like an algorithm implementation
that gives the absolute minimum difference between the two files.

Can you help me?

Thanks,
gz
 
P

Patrick Maupin

Hi All,

I am looking for an algorithm that can compare to source code files
line by line and find the minimum diff. I have looked at the difflib
included in python. The problem is that it is designed to make the
diff results easier for humans to read, instead of minimize the size
of the output differencial. I would like an algorithm implementation
that gives the absolute minimum difference between the two files.

Can you help me?

Thanks,
gz

There's an "rsync.py" module in pypi -- one would think that would
have to solve that same problem...

Regards,
Pat
 
G

GZ

Hi Pat,

There's an "rsync.py" module in pypi -- one would think that would
have to solve that same problem...

Regards,
Pat

No, rsync does not solve my problem.

I want a library that does unix 'diff' like function, i.e. compare two
strings line by line and output the difference. Python's difflib does
not work perfectly for me, because the resulting differences are
pretty big. I would like an algorithm that generates the smallest
differences.
 
L

Lie Ryan

Hi Pat,



No, rsync does not solve my problem.

I want a library that does unix 'diff' like function, i.e. compare two
strings line by line and output the difference. Python's difflib does
not work perfectly for me, because the resulting differences are
pretty big. I would like an algorithm that generates the smallest
differences.

is n=0 not short enough?

pprint.pprint(list(difflib.context_diff(s, t, n=0)))
 
G

GZ

On06/05/10 07:51, GZ wrote:









is n=0 not short enough?

pprint.pprint(list(difflib.context_diff(s, t, n=0)))

This still does not do what I want it to do. It only displays the diff
results in a different format. I want a different algorithm to
generate a smaller diff -- in other words less differences
 
L

Lie Ryan

This still does not do what I want it to do. It only displays the diff
results in a different format. I want a different algorithm to
generate a smaller diff -- in other words less differences

No, I meant I was confirming that you already have turned off context
lines (i.e. the n=0 part), right?



Also, what's the nature of the changes? You might be able to minimize
difflib's output by using word-based or character-based diff-ing instead
of traditional line-based diff-ing.



diff output is fairly compressable, so you might want to look at zipping
the output.
 
S

Steven D'Aprano

This still does not do what I want it to do. It only displays the diff
results in a different format. I want a different algorithm to generate
a smaller diff -- in other words less differences

Can you give a *short* example, showing the output from Python's diff and
what you would prefer instead?
 
G

GZ

Hi Lie,

No, I meant I was confirming that you already have turned off context
lines (i.e. the n=0 part), right?

Also, what's the nature of the changes? You might be able to minimize
difflib's output by using word-based or character-based diff-ing instead
of traditional line-based diff-ing.

diff output is fairly compressable, so you might want to look at zipping
the output.- Hide quoted text -

- Show quoted text -

Thanks for your response.

The verboseness of the format is not really my problem and I only care
about line by line comparison for now.

Let me think of a better way to express what I mean by a "smaller
diff." After I diff the two strings, I will have something like this:

AAA
- BBB
+ CCC
+ DDD
- EEE

It means the first line does not change, the second line is replaced
by the third line, the forth line is new, and the fifth line is
deleted.

I define the "smallness" of the diff algorithm as "the sum of the
total number of minuses and pluses". In my above example, it is 4 (two
minuses and 2 pluses). Note that no matter what format we use to
represent the diff, this number is the same.

Python's difflib does not really minimize this number. It tries to
make this number small, but also tries to yield matches that “look
right” to people at the cost of increasing this number. (http://
docs.python.org/library/difflib.html).

What I am looking for is an algo that can really minimize this number.
 
D

Dave Angel

GZ said:
Thanks for your response.

The verboseness of the format is not really my problem and I only care
about line by line comparison for now.

Let me think of a better way to express what I mean by a "smaller
diff." After I diff the two strings, I will have something like this:

AAA
- BBB
+ CCC
+ DDD
- EEE

It means the first line does not change, the second line is replaced
by the third line, the forth line is new, and the fifth line is
deleted.

I define the "smallness" of the diff algorithm as "the sum of the
total number of minuses and pluses". In my above example, it is 4 (two
minuses and 2 pluses). Note that no matter what format we use to
represent the diff, this number is the same.

Python's difflib does not really minimize this number. It tries to
make this number small, but also tries to yield matches that “look
right” to people at the cost of increasing this number. (http://
docs.python.org/library/difflib.html).

What I am looking for is an algo that can really minimize this number.
The - lines aren't needed, any more than the context lines are needed,
so that will typically cut your results in half. But perhaps the real
algorithm you're describing is one that permits lines to be moved as
well as inserted and deleted. If you then consider that information to
be free (it's not part of the measure you're specifying), you may do
best with the following algorithm:

Scan the two files looking for lines that appear exactly once in each
file. Consider those lines the first invariants, and everything else a
potential change. Now, starting with each pair (which I call a bridge
between islands), look at the line previous and the line following, and
if either or both are also matched, expand the two islands to include
them. As an island grows to butt up against another island, merge them.

Now, each bridge is a "move" operation, and every line left over in
either file is either a plus or a minus, an insert or a delete. For most
editing transactions, there will be relatively few of these.

For example, if three functions were reordered, from A, B, C, to A, C,
B, there would be no plus or minus lines at all.

DaveA
 
G

GZ

Are you drawing a distinction between:

  * “line FOO is replaced by line BAR”
  * “line FOO is deleted, line BAR is added”

Your wording seems to make that distinction, but I don't see how it's
useful or meaningful in a discussion aboutdiff. Are they not exactly
the same?

--
 \     “Injustice is relatively easy to bear; what stings is justice.” |
  `\                                                 —Henry L. Mencken |
_o__)                                                                  |
Ben Finney

I should distinguish between modifications and additions. In my above
example, one line is modified/replaced, one line is added and one line
is deleted. There are a total of 3 edits. I am looking for an
alternative python library other than difflib that minimizes this
number (edit distance).
 
B

Bryan

GZ said:
I should distinguish between modifications and additions. In my above
example, one line is modified/replaced, one line is added and one line
is deleted. There are a total of 3 edits. I am looking for an
alternative python library other than difflib that minimizes this
number (edit distance).

Since you know the lingo, "edit distance", I expect you've already
Googled up the general results. Depending on what kinds of edits we
count, the problem of finding a minimal diff can be low-order poly-
time, NP-hard, or incomputable.

Gonzalo Navarro's 2001 survey, "A Guided Tour to Approximate String
Matching", is now available on-line for free:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.96.7225&rep=rep1&type=pdf
The paper notes, "issues have been put aside to keep a reasonable
scope," but presents more than most of us ever wanted to know about
the subject.

I don't know of a module that does what GZ asks. There are scores of
line-oriented diff implementations, and there are minimal-edit-
distance diffs, but the combination is rare at best. Problem domains
that call for a true minimal diff tend not to work in units of lines
of text.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,997
Messages
2,570,240
Members
46,828
Latest member
LauraCastr

Latest Threads

Top