DiffLib Question

W

whitewave

Hi Guys,
I'm a bit confused in difflib. In most cases, the differences
found using difflib works well but when I have come across the
following set of text:
.... problem even for the simple forms of the differential equation and
the nonlocal conditions. Due to these facts, some serious difficulties
arise in the application of the classical methods to such a
problem.'''.... adjoint problem even for the simple forms of the differential
equation and the nonlocal conditions. Due
.... to these facts, some serious difficulties arise in the application
of the classical methods to such a
.... problem. '''

Using this line of code:
a = difflib.Differ().compare(d1,d2)
dif =[]
for i in a:
.... dif.append(i)
.... s = ''.join(dif)

I get the following output:

' I n a d d i t i o n , t h e c o n s i
d e r e d p r o b l e m d o e s n o t
h a v e a m e a n i n g f u l t r a d i
t i o n a l t y p e o f- + \n a d j o i n t+
+ p+ r+ o+ b+ l+ e+ m+ + e+ v+ e+ n+ + f+ o+ r+ + t+ h+ e+ + s+ i+
m+ p+ l+ e+ + f+ o+ r+ m+ s+ + o+ f+ + t+ h+ e+ + d+ i+ f+ f+ e+ r
+ e+ n+ t+ i+ a+ l+ + e+ q+ u+ a+ t+ i+ o+ n+ + a+ n+ d+ + t+ h+ e
+ + n+ o+ n+ l+ o+ c+ a+ l+ + c+ o+ n+ d+ i+ t+ i+ o+ n+ s+ .+ + D+
u+ e \n+ t+ o+ + t+ h+ e+ s+ e+ + f+ a+ c+ t+ s+ ,+ + s+ o+ m+ e+
+ s+ e+ r+ i+ o+ u+ s+ + d+ i+ f+ f+ i+ c+ u+ l+ t+ i+ e+ s+ + a+ r+
i+ s+ e+ + i+ n+ + t+ h+ e+ + a+ p+ p+ l+ i+ c+ a+ t+ i+ o+ n+ + o
+ f+ + t+ h+ e+ + c+ l+ a+ s+ s+ i+ c+ a+ l+ + m+ e+ t+ h+ o+ d+ s
+ + t+ o+ + s+ u+ c+ h+ + a+ \n p r o b l e m- - e- v- e-
n- - f- o- r- - t- h- e- - s- i- m- p- l- e- - f- o- r- m- s- -
o- f- - t- h- e- - d- i- f- f- e- r- e- n- t- i- a- l- - e- q- u-
a- t- i- o- n- - a- n- d- - t- h- e- - n- o- n- l- o- c- a- l- -
c- o- n- d- i- t- i- o- n- s . - D- u- e- - t- o- - t- h- e- s-
e- - f- a- c- t- s- ,- - s- o- m- e- - s- e- r- i- o- u- s- - d-
i- f- f- i- c- u- l- t- i- e- s- - a- r- i- s- e- - i- n- - t- h-
e- - a- p- p- l- i- c- a- t- i- o- n- - o- f- - t- h- e- - c- l-
a- s- s- i- c- a- l- - m- e- t- h- o- d- s- - t- o- - s- u- c- h-
- a- - p- r- o- b- l- e- m- .'

How come the rest of the text after the "adjoint" word is marked as an
additional text (while others is deleted) while in fact those text are
contained in both d1 and d2?The only difference is that it has a
newline. I'm I missing something? Is there a way for me to disregard
the newlines and spaces?

Python 2.3
WINXP

Thanks.
Jen
 
M

Michele Simionato

Is there a way for me to disregard
the newlines and spaces?

Python 2.3
WINXP

Thanks.
Jen
HTH:
Help on method __init__ in module difflib:

__init__(self, linejunk=None, charjunk=None) unbound difflib.Differ
method
Construct a text differencer, with optional filters.

The two optional keyword parameters are for filter functions:

- `linejunk`: A function that should accept a single string
argument,
and return true iff the string is junk. The module-level
function
`IS_LINE_JUNK` may be used to filter out lines without visible
characters, except for at most one splat ('#'). It is
recommended
to leave linejunk None; as of Python 2.3, the underlying
SequenceMatcher class has grown an adaptive notion of "noise"
lines
that's better than any static definition the author has ever
been
able to craft.

- `charjunk`: A function that should accept a string of length 1.
The
module-level function `IS_CHARACTER_JUNK` may be used to filter
out
whitespace characters (a blank or tab; **note**: bad idea to
include
newline in this!). Use of IS_CHARACTER_JUNK is recommended.


Michele Simionato
 
W

whitewave

Hi,
Thank you for your reply. But I don't fully understand what the
charjunk and linejunk is all about. I'm a bit newbie in python using
the DiffLib. I'm I using the right code here? I will I implement the
linejunk and charjunk using the following code?
a = difflib.Differ().compare(d1,d2)
dif =[]
for i in a:

.... dif.append(i)
.... s = ''.join(dif)

Thanks
Jen
 
G

Gabriel Genellina

Thank you for your reply. But I don't fully understand what the
charjunk and linejunk is all about. I'm a bit newbie in python using
the DiffLib. I'm I using the right code here? I will I implement the
linejunk and charjunk using the following code?

Usually, Differ receives two sequences of lines, being each line a
sequence of characters (strings). It uses a SequenceMatcher to compare
lines; the linejunk argument is used to ignore certain lines. For each
pair of similar lines, it uses another SequenceMatcher to compare
characters inside lines; the charjunk is used to ignore characters.
As you are feeding Differ with a single string (not a list of text lines),
the "lines" it sees are just characters. To ignore whitespace and
newlines, in this case one should use the linejunk argument:

def ignore_ws_nl(c):
return c in " \t\n\r"

a = difflib.Differ(linejunk=ignore_ws_nl).compare(d1,d2)
dif = list(a)
print ''.join(dif)

I n a d d i t i o n , t h e c o n s i d e
r e
d p r o b l e m d o e s n o t h a v e
a m
e a n i n g f u l t r a d i t i o n a l t y
p e
o f- +
a d j o i n t-
+ p r o b l e m e v e n f o r t h e s i
m p
l e f o r m s o f t h e d i f f e r e n t
i a
l e q u a t i o n a n d t h e n o n l o
c a l
c o n d i t i o n s . D u e- +
t o t h e s e f a c t s , s o m e s e r
i o
u s d i f f i c u l t i e s a r i s e i n
t h
e a p p l i c a t i o n o f t h e c l a
s s i
c a l m e t h o d s t o s u c h a- +
p r o b l e m .+

I hope this is what you were looking for.
 
W

whitewave

Usually, Differ receives two sequences of lines, being each line a
sequence of characters (strings). It uses a SequenceMatcher to compare
lines; the linejunk argument is used to ignore certain lines. For each
pair of similar lines, it uses another SequenceMatcher to compare
characters inside lines; the charjunk is used to ignore characters.
As you are feeding Differ with a single string (not a list of text lines),
the "lines" it sees are just characters. To ignore whitespace and
newlines, in this case one should use the linejunk argument:

def ignore_ws_nl(c):
return c in " \t\n\r"

a =difflib.Differ(linejunk=ignore_ws_nl).compare(d1,d2)
dif = list(a)
print ''.join(dif)

I n a d d i t i o n , t h e c o n s i d e
r e
d p r o b l e m d o e s n o t h a v e
a m
e a n i n g f u l t r a d i t i o n a l t y
p e
o f- +
a d j o i n t-
+ p r o b l e m e v e n f o r t h e s i
m p
l e f o r m s o f t h e d i f f e r e n t
i a
l e q u a t i o n a n d t h e n o n l o
c a l
c o n d i t i o n s . D u e- +
t o t h e s e f a c t s , s o m e s e r
i o
u s d i f f i c u l t i e s a r i s e i n
t h
e a p p l i c a t i o n o f t h e c l a
s s i
c a l m e t h o d s t o s u c h a- +
p r o b l e m .+

Thanks! It works fine but I was wondering why the result isn't
consistent. I am comparing two huge documents with several paragraphs
in it. Some parts in the paragraph returns the diff perfectly but
others aren't. I am confused.

Thanks.
Jen
 
G

Gabriel Genellina

Thanks! It works fine but I was wondering why the result isn't
consistent. I am comparing two huge documents with several paragraphs
in it. Some parts in the paragraph returns the diff perfectly but
others aren't. I am confused.

Differ objects do a two-level diff; depending on what kind of differences
you are interested in, you feed it with different things.
If the "line" concept is important to you (that is, you want to see which
"lines" were added, removed or modified), then feed the Differ with a
sequence of lines (file.readlines() would be fine).
This way, if someone inserts a few words inside a paragraph and the
remaining lines have to be reflushed, you'll see many changes from words
that were at end of lines now moving to the start of next line.
If you are more concerned about "paragraphs" and words, feed the Differ
with a sequence of "paragraphs". Maybe your editor can handle it; assuming
a paragraph ends with two linefeeds, you can get a list of paragraphs in
Python using file.read().split("\n\n").
A third alternative would be to consider the text as absolutely plain, and
just feed Differ with file.read(), as menctioned in an earlier post.
 
W

whitewave

Hello,

I am currently doing the third option. Doing file.read() to both file
to be compared then feed the result to the compare function.

Let me give you a brief sample of what I want to achieve.

Using this code
diffline=[]
fileDiff = difflib.Differ().compare(f1, f2)
diffline = list(fileDiff)
finStr = ''.join(diffline)

With the following strings for comparison:
.... problems for ordinary differential equations with sufficiently
smooth coefficients have been
.... investigated in detail by other authors
\cite{CR1,CR2,CR3,CR4,CR5}.'''.... differential equations with sufficiently smooth coefficients have
been investigated in detail by many
.... authors \cite{CR1,CR2,CR3,CR4,CR5}.'''

I get this result:

T h e s o l v a b+ i l- e+ i+ t+ y c o n d i t
i o n s a n d t h e G r e e n ' s f u
n c t i o n s o f l i n e a r b o u n d
a r y v a l u e+ + p+ r+ o+ b+ l+ e+ m+ s+ + f+ o+ r+ + o
+ r+ d+ i+ n+ a+ r+ y
+ d+ i+ f+ f- p- r- o- b- l e+ r+ e+ n+ t+ i+ a+ l+ + e+ q+ u+ a+ t+
i+ o+ n+ s+ + w+ i+ t+ h+ + s+ u+ f+ f+ i+ c+ i+ e+ n+ t+ l+ y+ +
s m- s- - f o- r- o- r- d- i- n- a- r- y- - d- i- f- f- e- r-
e- n t- i- a- l- - e- q- u- a- t- i- o- n- s- - w- i- t h - s-
u- f- f- i c- i+ o e+ f+ f+ i+ c+ i+ e n t- l- y- s+ + h+ a+ v
+ e+ + b+ e+ e+ n+ + i+ n+ v+ e+ s+ t+ i+ g+ a+ t+ e+ d+ + i+ n+ +
d+ e+ t+ a+ i+ l+ + b+ y+ m- o- o- t- h- - c- o- e- f- f- i- c-
i- e- n- t- s- - h a- v- e- - b- e- e n+ y
- i- n- v- e- s- t- i- g- a- t- e- d- - i- n- - d- e- t- a- i- l- -
b- y- - o- t- h- e- r- a u t h o r s \ c i t e
{ C R 1 , C R 2 , C R 3 , C R 4 , C R 5 } .

Whereas, this is my expected result:

T h e s o l v a b+ i l- e+ i+ t+ y c o n d i t
i o n s a n d t h e G r e e n ' s f u
n c t i o n s o f l i n e a r b o u n d
a r y v a l u e-
+ p r o b l e m s f o r o r d i n a r y- +
d i f f e r e n t i a l e q u a t i o n s
w i t h s u f f i c i e n t l y s m o o t
h c o e f f i c i e n t s h a v e b e e
n-
+ i n v e s t i g a t e d i n d e t a i
l b y + m- o- t- h- e- r- a+ n+ y+
+ a u t h o r s \ c i t e { C R 1 , C R 2 , C
R 3 , C R 4 , C R 5 } .


Thanks,
Jen
 
G

Gabriel Genellina

I am currently doing the third option. Doing file.read() to both file
to be compared then feed the result to the compare function.

Let me give you a brief sample of what I want to achieve.

Using this code
diffline=[]
fileDiff = difflib.Differ().compare(f1, f2)
diffline = list(fileDiff)
finStr = ''.join(diffline)

So you are concerned with character differences, ignoring higher order
structures. Use a linejunk filter function to the Differ constructor -as
shown in my post last Wednesday- to ignore "\n" characters when matching.
That is:

def ignore_eol(c): return c in "\r\n"
fileDiff = difflib.Differ(linejunk=ignore_eol).compare(f1, f2)
print ''.join(fileDiff)

you get:

- T h e s o l v a b+ i l- e+ i+ t+ y c o n d i t
i o
n s a n d t h e G r e e n ' s f u n c t i
o n
s o f l i n e a r b o u n d a r y v a l
u e-
+ p r o b l e m s f o r o r d i n a r y- +
d i f f e r e n t i a l e q u a t i o n s w
i t
h s u f f i c i e n t l y s m o o t h c o e
f f
i c i e n t s h a v e b e e n-
+ i n v e s t i g a t e d i n d e t a i l
b y
+ m+ a+ n+ y+
- o- t- h- e- r- a u t h o r s \ c i t e { C R 1 ,
C R
2 , C R 3 , C R 4 , C R 5 } .
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,999
Messages
2,570,243
Members
46,838
Latest member
KandiceChi

Latest Threads

Top