DiffLib Question

whitewave · May 2, 2007

Hi Guys,
I'm a bit confused by difflib. In most cases the differences
found using difflib work well, but I have come across a problem
with the following pair of texts:
.... problem even for the simple forms of the differential equation and
the nonlocal conditions. Due to these facts, some serious difficulties
arise in the application of the classical methods to such a
problem.'''

.... adjoint problem even for the simple forms of the differential
equation and the nonlocal conditions. Due
.... to these facts, some serious difficulties arise in the application
of the classical methods to such a
.... problem. '''

Using this line of code:

import difflib

a = difflib.Differ().compare(d1, d2)
dif = []
for i in a:
    dif.append(i)
s = ''.join(dif)

I get the following output:

' I n a d d i t i o n , t h e c o n s i
d e r e d p r o b l e m d o e s n o t
h a v e a m e a n i n g f u l t r a d i
t i o n a l t y p e o f- + \n a d j o i n t+
+ p+ r+ o+ b+ l+ e+ m+ + e+ v+ e+ n+ + f+ o+ r+ + t+ h+ e+ + s+ i+
m+ p+ l+ e+ + f+ o+ r+ m+ s+ + o+ f+ + t+ h+ e+ + d+ i+ f+ f+ e+ r
+ e+ n+ t+ i+ a+ l+ + e+ q+ u+ a+ t+ i+ o+ n+ + a+ n+ d+ + t+ h+ e
+ + n+ o+ n+ l+ o+ c+ a+ l+ + c+ o+ n+ d+ i+ t+ i+ o+ n+ s+ .+ + D+
u+ e \n+ t+ o+ + t+ h+ e+ s+ e+ + f+ a+ c+ t+ s+ ,+ + s+ o+ m+ e+
+ s+ e+ r+ i+ o+ u+ s+ + d+ i+ f+ f+ i+ c+ u+ l+ t+ i+ e+ s+ + a+ r+
i+ s+ e+ + i+ n+ + t+ h+ e+ + a+ p+ p+ l+ i+ c+ a+ t+ i+ o+ n+ + o
+ f+ + t+ h+ e+ + c+ l+ a+ s+ s+ i+ c+ a+ l+ + m+ e+ t+ h+ o+ d+ s
+ + t+ o+ + s+ u+ c+ h+ + a+ \n p r o b l e m- - e- v- e-
n- - f- o- r- - t- h- e- - s- i- m- p- l- e- - f- o- r- m- s- -
o- f- - t- h- e- - d- i- f- f- e- r- e- n- t- i- a- l- - e- q- u-
a- t- i- o- n- - a- n- d- - t- h- e- - n- o- n- l- o- c- a- l- -
c- o- n- d- i- t- i- o- n- s . - D- u- e- - t- o- - t- h- e- s-
e- - f- a- c- t- s- ,- - s- o- m- e- - s- e- r- i- o- u- s- - d-
i- f- f- i- c- u- l- t- i- e- s- - a- r- i- s- e- - i- n- - t- h-
e- - a- p- p- l- i- c- a- t- i- o- n- - o- f- - t- h- e- - c- l-
a- s- s- i- c- a- l- - m- e- t- h- o- d- s- - t- o- - s- u- c- h-
- a- - p- r- o- b- l- e- m- .'

How come the rest of the text after the word "adjoint" is marked as
added (and the other copy as deleted) when in fact that text is
contained in both d1 and d2? The only difference is where the
newlines fall. Am I missing something? Is there a way for me to disregard
the newlines and spaces?

Python 2.3
WINXP

Thanks.
Jen

Michele Simionato · May 2, 2007

whitewave said:
Is there a way for me to disregard
the newlines and spaces?

HTH:

Help on method __init__ in module difflib:

__init__(self, linejunk=None, charjunk=None) unbound difflib.Differ method
    Construct a text differencer, with optional filters.

    The two optional keyword parameters are for filter functions:

    - `linejunk`: A function that should accept a single string argument,
      and return true iff the string is junk. The module-level function
      `IS_LINE_JUNK` may be used to filter out lines without visible
      characters, except for at most one splat ('#'). It is recommended
      to leave linejunk None; as of Python 2.3, the underlying
      SequenceMatcher class has grown an adaptive notion of "noise" lines
      that's better than any static definition the author has ever been
      able to craft.

    - `charjunk`: A function that should accept a string of length 1. The
      module-level function `IS_CHARACTER_JUNK` may be used to filter out
      whitespace characters (a blank or tab; **note**: bad idea to include
      newline in this!). Use of IS_CHARACTER_JUNK is recommended.
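
As a rough illustration of the two filters described in the help text, the sketch below feeds Differ two lists of lines and passes the module-level IS_CHARACTER_JUNK as charjunk, leaving linejunk at None as recommended. The sample lines and variable names are made up for this example.

```python
import difflib

# Hypothetical sample data: each element is one line, newline included.
old_lines = ["one\n", "two words\n", "three\n"]
new_lines = ["one\n", "two  words\n", "three\n", "four\n"]

# charjunk is consulted only when two similar lines are compared
# character by character; linejunk stays None as the help text advises.
differ = difflib.Differ(charjunk=difflib.IS_CHARACTER_JUNK)
result = list(differ.compare(old_lines, new_lines))

# Every output line starts with a two-character code:
# "  " common, "- " only in old, "+ " only in new, "? " intraline hint.
print(''.join(result))
```

Note that both arguments here are lists of lines, which is the input shape the two-level filters are designed around.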

Michele Simionato

whitewave · May 2, 2007

Hi,
Thank you for your reply. But I don't fully understand what
charjunk and linejunk are all about. I'm a bit of a newbie in Python and
difflib. Am I using the right code here? How would I implement
linejunk and charjunk using the following code?

import difflib

a = difflib.Differ().compare(d1, d2)
dif = []
for i in a:
    dif.append(i)
s = ''.join(dif)

Thanks
Jen

Gabriel Genellina · May 2, 2007

whitewave said:
Thank you for your reply. But I don't fully understand what
charjunk and linejunk are all about. I'm a bit of a newbie in Python and
difflib. Am I using the right code here? How would I implement
linejunk and charjunk using the following code?

Usually, Differ receives two sequences of lines, each line being a
string (itself a sequence of characters). It uses a SequenceMatcher to compare
lines; the linejunk argument is used to ignore certain lines. For each
pair of similar lines, it uses another SequenceMatcher to compare
characters inside lines; the charjunk argument is used to ignore characters.
As you are feeding Differ with a single string (not a list of text lines),
the "lines" it sees are just characters. To ignore whitespace and
newlines, in this case one should use the linejunk argument:

import difflib

def ignore_ws_nl(c):
    # Junk filter: true for blanks, tabs and line-end characters.
    return c in " \t\n\r"

a = difflib.Differ(linejunk=ignore_ws_nl).compare(d1, d2)
dif = list(a)
print(''.join(dif))

I n a d d i t i o n , t h e c o n s i d e
r e
d p r o b l e m d o e s n o t h a v e
a m
e a n i n g f u l t r a d i t i o n a l t y
p e
o f- +
a d j o i n t-
+ p r o b l e m e v e n f o r t h e s i
m p
l e f o r m s o f t h e d i f f e r e n t
i a
l e q u a t i o n a n d t h e n o n l o
c a l
c o n d i t i o n s . D u e- +
t o t h e s e f a c t s , s o m e s e r
i o
u s d i f f i c u l t i e s a r i s e i n
t h
e a p p l i c a t i o n o f t h e c l a
s s i
c a l m e t h o d s t o s u c h a- +
p r o b l e m .+
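
The point that the "lines" are single characters can be checked on a tiny input. The following is a minimal sketch with made-up strings (not the full texts from this thread): when Differ receives plain strings, each output entry is a two-character code followed by exactly one character.

```python
import difflib

# Hypothetical short inputs; str objects are passed instead of lists of lines.
d1 = "of adjoint\nproblem"
d2 = "of\nadjoint problem"

entries = list(difflib.Differ().compare(d1, d2))

# Each entry is a 2-character code ("  ", "- " or "+ ") plus ONE character,
# because iterating over a string yields its characters.
for entry in entries:
    print(repr(entry))

# Dropping the additions and stripping the codes gives back d1;
# dropping the deletions gives back d2.
rebuilt_d1 = ''.join(e[2:] for e in entries if not e.startswith('+ '))
rebuilt_d2 = ''.join(e[2:] for e in entries if not e.startswith('- '))
```

This is also why the joined output looks like letters separated by blanks: each character carries its own code.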

I hope this is what you were looking for.

whitewave · May 4, 2007

Thanks! It works fine, but I was wondering why the result isn't
consistent. I am comparing two huge documents with several paragraphs
in them. Some parts of a paragraph are diffed perfectly but
others aren't. I am confused.

Thanks.
Jen

Gabriel Genellina · May 6, 2007

whitewave said:
Thanks! It works fine, but I was wondering why the result isn't
consistent. I am comparing two huge documents with several paragraphs
in them. Some parts of a paragraph are diffed perfectly but
others aren't. I am confused.

Differ objects do a two-level diff; depending on what kind of differences
you are interested in, you feed them different things.
If the "line" concept is important to you (that is, you want to see which
"lines" were added, removed or modified), then feed the Differ with a
sequence of lines (file.readlines() would be fine).
Note that this way, if someone inserts a few words inside a paragraph and the
remaining lines have to be reflowed, you'll see many changes from words
that were at the end of one line moving to the start of the next.
If you are more concerned about "paragraphs" and words, feed the Differ
with a sequence of "paragraphs". Maybe your editor can handle it; assuming
a paragraph ends with two linefeeds, you can get a list of paragraphs in
Python using file.read().split("\n\n").
A third alternative would be to consider the text as absolutely plain, and
just feed Differ with file.read(), as mentioned in an earlier post.
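
The three choices above can be sketched as follows. The two texts are made-up in-memory strings standing in for the result of file.read(), and the variable names are only for illustration.

```python
import difflib

# Hypothetical documents standing in for the contents of two files.
text1 = "First paragraph,\nsecond line.\n\nSecond paragraph.\n"
text2 = "First paragraph, second\nline.\n\nSecond paragraph, edited.\n"

differ = difflib.Differ()

# 1. Line level: what file.readlines() would give (newlines kept).
by_line = list(differ.compare(text1.splitlines(True), text2.splitlines(True)))

# 2. Paragraph level: paragraphs assumed to be separated by a blank line.
by_paragraph = list(differ.compare(text1.split("\n\n"), text2.split("\n\n")))

# 3. Character level: the strings themselves, one character per "line".
by_char = list(differ.compare(text1, text2))

print(len(by_line), len(by_paragraph), len(by_char))
```

The same Differ is used each time; only the shape of the input changes, and with it the unit that is reported as added or removed.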

whitewave · May 6, 2007

Hello,

I am currently doing the third option: calling file.read() on both files
to be compared, then feeding the results to the compare function.

Let me give you a brief sample of what I want to achieve.

Using this code

import difflib

fileDiff = difflib.Differ().compare(f1, f2)
diffline = list(fileDiff)
finStr = ''.join(diffline)

With the following strings for comparison:
.... problems for ordinary differential equations with sufficiently
smooth coefficients have been
.... investigated in detail by other authors
\cite{CR1,CR2,CR3,CR4,CR5}.'''

.... differential equations with sufficiently smooth coefficients have
been investigated in detail by many
.... authors \cite{CR1,CR2,CR3,CR4,CR5}.'''

I get this result:

T h e s o l v a b+ i l- e+ i+ t+ y c o n d i t
i o n s a n d t h e G r e e n ' s f u
n c t i o n s o f l i n e a r b o u n d
a r y v a l u e+ + p+ r+ o+ b+ l+ e+ m+ s+ + f+ o+ r+ + o
+ r+ d+ i+ n+ a+ r+ y
+ d+ i+ f+ f- p- r- o- b- l e+ r+ e+ n+ t+ i+ a+ l+ + e+ q+ u+ a+ t+
i+ o+ n+ s+ + w+ i+ t+ h+ + s+ u+ f+ f+ i+ c+ i+ e+ n+ t+ l+ y+ +
s m- s- - f o- r- o- r- d- i- n- a- r- y- - d- i- f- f- e- r-
e- n t- i- a- l- - e- q- u- a- t- i- o- n- s- - w- i- t h - s-
u- f- f- i c- i+ o e+ f+ f+ i+ c+ i+ e n t- l- y- s+ + h+ a+ v
+ e+ + b+ e+ e+ n+ + i+ n+ v+ e+ s+ t+ i+ g+ a+ t+ e+ d+ + i+ n+ +
d+ e+ t+ a+ i+ l+ + b+ y+ m- o- o- t- h- - c- o- e- f- f- i- c-
i- e- n- t- s- - h a- v- e- - b- e- e n+ y
- i- n- v- e- s- t- i- g- a- t- e- d- - i- n- - d- e- t- a- i- l- -
b- y- - o- t- h- e- r- a u t h o r s \ c i t e
{ C R 1 , C R 2 , C R 3 , C R 4 , C R 5 } .

Whereas, this is my expected result:

T h e s o l v a b+ i l- e+ i+ t+ y c o n d i t
i o n s a n d t h e G r e e n ' s f u
n c t i o n s o f l i n e a r b o u n d
a r y v a l u e-
+ p r o b l e m s f o r o r d i n a r y- +
d i f f e r e n t i a l e q u a t i o n s
w i t h s u f f i c i e n t l y s m o o t
h c o e f f i c i e n t s h a v e b e e
n-
+ i n v e s t i g a t e d i n d e t a i
l b y + m- o- t- h- e- r- a+ n+ y+
+ a u t h o r s \ c i t e { C R 1 , C R 2 , C
R 3 , C R 4 , C R 5 } .

Thanks,
Jen

Gabriel Genellina · May 7, 2007

whitewave said:
I am currently doing the third option: calling file.read() on both files
to be compared, then feeding the results to the compare function.

Let me give you a brief sample of what I want to achieve.

Using this code

diffline=[]
fileDiff = difflib.Differ().compare(f1, f2)
diffline = list(fileDiff)
finStr = ''.join(diffline)

So you are concerned with character differences, ignoring higher order
structures. Pass a linejunk filter function to the Differ constructor, as
shown in my post last Wednesday, to ignore "\n" characters when matching.
That is:

import difflib
# Treat line-end characters as junk when matching.
def ignore_eol(c): return c in "\r\n"
fileDiff = difflib.Differ(linejunk=ignore_eol).compare(f1, f2)
print(''.join(fileDiff))

you get:

- T h e s o l v a b+ i l- e+ i+ t+ y c o n d i t
i o
n s a n d t h e G r e e n ' s f u n c t i
o n
s o f l i n e a r b o u n d a r y v a l
u e-
+ p r o b l e m s f o r o r d i n a r y- +
d i f f e r e n t i a l e q u a t i o n s w
i t
h s u f f i c i e n t l y s m o o t h c o e
f f
i c i e n t s h a v e b e e n-
+ i n v e s t i g a t e d i n d e t a i l
b y
+ m+ a+ n+ y+
- o- t- h- e- r- a u t h o r s \ c i t e { C R 1 ,
C R
2 , C R 3 , C R 4 , C R 5 } .
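
If line breaks and spacing should be disregarded altogether, a different approach from the linejunk filter is to split both texts on whitespace and diff the resulting word lists. This is only a sketch of that alternative, not what the posts above use; the two short strings are made up, loosely modeled on the example in this thread.

```python
import difflib

# Hypothetical inputs that differ in one word and in where the line break falls.
f1 = "have been\ninvestigated in detail by other authors"
f2 = "have been investigated in detail by many\nauthors"

# str.split() with no argument splits on any run of whitespace,
# so spaces, tabs and newlines all disappear before the comparison.
words1 = f1.split()
words2 = f2.split()

word_diff = list(difflib.Differ().compare(words1, words2))

# Keep only the real changes ("- " removed, "+ " added); skip "? " hint lines.
changes = [entry for entry in word_diff if entry[:2] in ("- ", "+ ")]
print(changes)
```

The trade-off is that the diff is reported per word rather than per character, so a change inside a word shows up as a whole-word replacement.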
