Nick Matzke
Hi all,
So I have an interesting challenge. I want to compare two book
chapters, which I have in plain text format, and find out (a) percentage
similarity and (b) what has changed.
Some features make this problem different from what seems to be the
standard text-matching problem solvable with, e.g., difflib. Here is what
I mean:
* There is no guarantee that single lines from each file will be
directly comparable. E.g., if a few words are inserted into a
sentence, a chunk of that sentence gets pushed onto the next line,
a chunk of that line onto the one after, etc.
* Also, there are cases where paragraphs have been moved around,
sections re-ordered, etc. So it can't just be a "linear" match.
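To make the first point concrete, here is a small sketch (the sample strings and the helper names `line_ratio` and `word_ratio` are made up for illustration). It inserts one word into a re-wrapped sentence and compares the two versions with difflib's `SequenceMatcher`, once line by line and once word by word:

```python
import difflib

# Hypothetical sample text: the same sentence, with one word inserted
# and the line breaks shifted as a result.
old = "The quick brown fox jumps over\nthe lazy dog and runs away\ninto the forest."
new = "The quick brown fox suddenly jumps\nover the lazy dog and runs\naway into the forest."

def line_ratio(a, b):
    """Similarity when whole lines are the units of comparison."""
    return difflib.SequenceMatcher(None, a.splitlines(), b.splitlines()).ratio()

def word_ratio(a, b):
    """Similarity when words are the units; line breaks are ignored."""
    return difflib.SequenceMatcher(None, a.split(), b.split()).ratio()

print("line-based:", line_ratio(old, new))
print("word-based:", round(word_ratio(old, new), 3))
```

No line survives unchanged, so the line-based score collapses even though only one word differs, while the word-based score stays high. Splitting on whitespace gets around the re-wrapping, but it does nothing for the second point: a moved paragraph still looks like a deletion plus an insertion.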
I imagine this kind of thing can't be all that hard in the grand scheme
of things, but I couldn't find an easily applicable solution readily
available. I have advanced-beginner Python skills but am not quite
at the point where I could do this kind of thing from scratch without
some guidance about the likely functions, libraries, etc. to use.
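For the re-ordering problem, one possible starting point is to compare paragraphs rather than the whole file, pairing each old paragraph with whichever new paragraph it resembles most. This is only a rough sketch, not a tested solution: it assumes paragraphs are separated by blank lines, and the function names `paragraphs`, `best_matches`, and `overall_similarity` are invented here.

```python
import difflib
import re

def paragraphs(text):
    """Split on blank lines (an assumption) and reduce each paragraph to a word list."""
    return [p.split() for p in re.split(r"\n\s*\n", text) if p.strip()]

def best_matches(old_text, new_text):
    """For each old paragraph, return (old_index, new_index, ratio) for the
    most similar new paragraph, wherever it sits in the new chapter."""
    old_paras = paragraphs(old_text)
    new_paras = paragraphs(new_text)
    results = []
    for i, old_words in enumerate(old_paras):
        best_j, best_ratio = None, 0.0
        for j, new_words in enumerate(new_paras):
            ratio = difflib.SequenceMatcher(None, old_words, new_words).ratio()
            if ratio > best_ratio:
                best_j, best_ratio = j, ratio
        results.append((i, best_j, best_ratio))
    return results

def overall_similarity(old_text, new_text):
    """Average of the best-match ratios, weighted by old paragraph length in words."""
    old_paras = paragraphs(old_text)
    total_words = sum(len(p) for p in old_paras)
    if total_words == 0:
        return 0.0
    matches = best_matches(old_text, new_text)
    return sum(len(old_paras[i]) * ratio for i, _, ratio in matches) / total_words
```

Each pair could then go through a word-level `SequenceMatcher.get_opcodes()` call to list what changed inside it. Two caveats: the score is measured from the old chapter's side, so paragraphs that appear only in the new chapter are not counted, and every old paragraph is compared against every new one, which may be slow for long chapters.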
PS: I am going to have to do this for multiple book chapters, so various
software packages, e.g. for Windows, are not really usable.
Any help is much appreciated!!
Cheers,
Nick
--
====================================================
Nicholas J. Matzke
Ph.D. student, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley
Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page:
http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: (e-mail address removed)
Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140
-----------------------------------------------------
"[W]hen people thought the earth was flat, they were wrong. When people
thought the earth was spherical, they were wrong. But if you think that
thinking the earth is spherical is just as wrong as thinking the earth
is flat, then your view is wronger than both of them put together."
Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer,
14(1), 35-44. Fall 1989.
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
====================================================