GZ said:
Thanks for your response.
The verboseness of the format is not really my problem and I only care
about line by line comparison for now.
Let me think of a better way to express what I mean by a "smaller
diff." After I diff the two strings, I will have something like this:
AAA
- BBB
+ CCC
+ DDD
- EEE
It means the first line does not change, the second line is replaced
by the third line, the forth line is new, and the fifth line is
deleted.
I define the "smallness" of the diff algorithm as "the sum of the
total number of minuses and pluses". In my above example, it is 4 (two
minuses and 2 pluses). Note that no matter what format we use to
represent the diff, this number is the same.
Python's difflib does not really minimize this number. It tries to
make this number small, but also tries to yield matches that “look
right” to people at the cost of increasing this number. (http://
docs.python.org/library/difflib.html).
What I am looking for is an algo that can really minimize this number.
The - lines aren't needed, any more than the context lines are needed,
so that will typically cut your results in half. But perhaps the real
algorithm you're describing is one that permits lines to be moved as
well as inserted and deleted. If you then consider that information to
be free (it's not part of the measure you're specifying), you may do
best with the following algorithm:
Scan the two files looking for lines that appear exactly once in each
file. Consider those lines the first invariants, and everything else a
potential change. Now, starting with each pair (which I call a bridge
between islands), look at the line previous and the line following, and
if either or both are also matched, expand the two islands to include
them. As an island grows to butt up against another island, merge them.
Now, each bridge is a "move" operation, and every line left over in
either file is either a plus or a minus, an insert or a delete. For most
editing transactions, there will be relatively few of these.
For example, if three functions were reordered, from A, B, C, to A, C,
B, there would be no plus or minus lines at all.
DaveA