from a file to an array of words

Dr.Ruud · Nov 28, 2005

ngoc wrote:

I want to compare two text (HTML) files. The "diff" command in Linux
compares only by line.
In Perl, I can use "getlines" of a FileHandle object and "split" later.
Is there a single function (method) to go from a file to an array of
words? Thanks

Or normalize the files, and then use diff.

Anno Siegel · Nov 28, 2005

ngoc said:
I want to compare two text (HTML) files. The "diff" command in Linux compares
only by line.
In Perl, I can use "getlines" of a FileHandle object and "split" later. Is
there a single function (method) to go from a file to an array of words?

There is no function named "getlines" in Perl. Are you using a module?
Which one?

Of course you can split the content of a file into words. Defining
"word" as any string of non-whitespace characters, this does it

my @words = map split, <$file>;
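
For readers who want the one-liner above spelled out, here is a minimal sketch. The helper name words_from_handle and the sample text are made up for illustration, and an in-memory filehandle stands in for a real file; "word" still means any run of non-whitespace characters.

```perl
use strict;
use warnings;

# Hypothetical helper: return every whitespace-separated word
# that can be read from an already opened filehandle.
sub words_from_handle {
    my ($fh) = @_;
    my @words;
    while (my $line = <$fh>) {
        # split ' ' splits on runs of whitespace and ignores leading whitespace
        push @words, split ' ', $line;
    }
    return @words;
}

# An in-memory filehandle stands in for a real file (assumption for the demo).
my $text = "I want to compare\ntwo  text files.\n";
open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
my @words = words_from_handle($fh);
close $fh;

print scalar(@words), " words: @words\n";   # 7 words: I want to compare two text files.
```

With a real file, open it with open my $fh, '<', $path and pass the handle in the same way.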

However, having a list of words, or two, doesn't immediately tell you the
difference between the lists. The algorithm used by the diff command
is a complex beast. You'd probably be best off looking for a CPAN
module. A search for "diff" brings up a few likely candidates.
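
To give an idea of what such a comparison involves, here is a simplified sketch of a word-level diff built on a longest-common-subsequence (LCS) table. It is not the algorithm the diff command implements and not the API of any CPAN module; the name word_diff and the '=', '-', '+' markers are invented for this example, and the table needs memory proportional to the product of the two word counts.

```perl
use strict;
use warnings;

# Hypothetical helper: word-level diff of two array refs via an LCS table.
# Returns [op, word] pairs: '=' kept, '-' only in old, '+' only in new.
sub word_diff {
    my ($old, $new) = @_;
    my ($n, $m) = (scalar @$old, scalar @$new);

    # $lcs[$i][$j] = length of the LCS of @$old[$i..] and @$new[$j..]
    my @lcs = map { [ (0) x ($m + 1) ] } 0 .. $n;
    for my $i (reverse 0 .. $n - 1) {
        for my $j (reverse 0 .. $m - 1) {
            if ($old->[$i] eq $new->[$j]) {
                $lcs[$i][$j] = $lcs[$i + 1][$j + 1] + 1;
            }
            else {
                my ($down, $right) = ($lcs[$i + 1][$j], $lcs[$i][$j + 1]);
                $lcs[$i][$j] = $down >= $right ? $down : $right;
            }
        }
    }

    # Walk the table from the start, emitting kept, deleted and added words.
    my @out;
    my ($i, $j) = (0, 0);
    while ($i < $n && $j < $m) {
        if ($old->[$i] eq $new->[$j]) {
            push @out, [ '=', $old->[$i] ];
            $i++;
            $j++;
        }
        elsif ($lcs[$i + 1][$j] >= $lcs[$i][$j + 1]) {
            push @out, [ '-', $old->[$i++] ];
        }
        else {
            push @out, [ '+', $new->[$j++] ];
        }
    }
    push @out, [ '-', $old->[$i++] ] while $i < $n;
    push @out, [ '+', $new->[$j++] ] while $j < $m;
    return @out;
}

my @old = qw(read the manual);
my @new = qw(read the fine manual today);
print join(' ', map { $_->[0] . $_->[1] } word_diff(\@old, \@new)), "\n";
# =read =the +fine =manual +today
```

Unlike a set comparison, this keeps word order and position, so it can say where a word was added or removed.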

Anno

Paul Lalli · Nov 28, 2005

Anno said:
There is no function named "getlines" in Perl. Are you using a module?
Which one?

Er. Seems pretty clear to me which one he's using....

Paul Lalli

Anno Siegel · Nov 28, 2005

ngoc said:
I want to pick out those words added by my colleagues. So from two
arrays of words, I use Set::Scalar module to compare.

That would be set comparison, which is entirely different from what the
diff command does. Mentioning "diff" in your original posting was a red
herring.
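
A small sketch of the difference, using a plain hash as a stand-in for Set::Scalar (an assumption made so the example has no non-core dependencies). The helper name added_words and the sample words are made up. A set comparison only reports words that never occurred in the old text; it cannot see a repeated word or say where a word was inserted.

```perl
use strict;
use warnings;

# Hypothetical helper: words that occur in @$new but nowhere in @$old.
# Each added word is reported once, in order of first appearance.
sub added_words {
    my ($old, $new) = @_;
    my %seen = map { $_ => 1 } @$old;
    return grep { !$seen{$_}++ } @$new;
}

my @old = qw(the manual is fine);
my @new = qw(the fine manual is the fine one);

my @added = added_words(\@old, \@new);
print "added: @added\n";   # added: one
```

The second "the fine" in the new text goes unreported, which is exactly what a positional diff would catch.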

Anno

ngoc · Nov 28, 2005

I want to compare two text (HTML) files. The "diff" command in Linux compares
only by line.
In Perl, I can use "getlines" of a FileHandle object and "split" later. Is
there a single function (method) to go from a file to an array of words?
Thanks

ngoc · Nov 28, 2005

There is no function named "getlines" in Perl. Are you using a module?
Which one?

http://search.cpan.org/~nwclark/perl-5.8.7/lib/FileHandle.pm
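
For reference, the "getlines, then split" approach with that module looks roughly like the sketch below. The temporary file and its contents are made up so the example is self-contained; FileHandle and File::Temp both ship with Perl.

```perl
use strict;
use warnings;
use FileHandle;
use File::Temp qw(tempfile);

# Write a small sample file so the example can run on its own (demo data).
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "first line here\nsecond line\n";
close $tmp;

# getlines must be called in list context; it returns all remaining lines.
my $fh = FileHandle->new($path, 'r') or die "Cannot open $path: $!";
my @words;
push @words, split(' ', $_) for $fh->getlines;
$fh->close;

print scalar(@words), " words: @words\n";   # 5 words: first line here second line
```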

ngoc · Nov 28, 2005

However, having a list of words, or two, doesn't immediately tell you the
difference between the lists. The algorithm used by the diff command
is a complex beast. You'd probably be best off looking for a CPAN
module. A search for "diff" brings up a few likely candidates.

I want to pick out those words added by my colleagues. So from two
arrays of words, I use Set::Scalar module to compare.

Ingo Menger · Nov 28, 2005

ngoc wrote:

My whole meaning is, I want to compare 2 html documents in words (not
lines).

You should define what it means to you to compare "in words". I suggest
that perl (as any other language) compares strings "in characters" or
even "in bits".

"diff" command does not solve my problem, because it is line based.

This sounds like nonsense.
Suppose, you have the following output from diff:

232c232
< the fucking manual
---
> the fine manual

Does this not tell you which word was changed? If you don't like it,
write a perl program that replaces every sequence of spaces with a
newline, then you can use diff on the result, since then a word and a
line are the same.
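
A minimal sketch of that normalizing step, assuming "word" means any run of non-whitespace characters. The helper name one_word_per_line is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite text as one word per line, so that a
# line-based diff of the result effectively compares words.
sub one_word_per_line {
    my ($text) = @_;
    return join '', map { $_ . "\n" } split ' ', $text;
}

print one_word_per_line("the  fine\tmanual\n");
# prints:
# the
# fine
# manual
```

Run both versions of the document through this, save the results to two files, and compare those files with diff. From the shell, perl -lane 'print for @F' does the same job for a single file.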

So I try to write a perl program.
The algorithm is: from 2 files -> 2 arrays. Compare 2 arrays. The result
of comparison shows what words are added by my colleagues.

What if they deleted some words? Or changed some?

ngoc · Nov 28, 2005

As noted by Anno, your articles are misleading. Personally, I would
not be as kind and would rate your articles as gibberish.

Your original article mentions html. Seems logical you would have a copy
of the original html page, and a copy of the current modified page in order
to compare the two. It would seem logical you employ a device to keep
track of modifications on a per modification basis. Are you doing this?

No. But cvs does not help in this case. We have only 2 versions.

Are modifications appended to your original page? If so, you do not
need to compare; you only need to "tail" for changes.

Not appended, but adding words in html documents

The term "words" used by you indicates you have developed two arrays
which contain one word per element, not a sentence per element. One
array contains a list of original words, and the other a list of words after
modification over a period of time. Is this so?

Yes

You do not indicate if you remove html tags which may contain words which
match your original words. You also do not indicate if others are adding new
html tags which contain different markup words or the same markup words
as original. Are those to be considered?

None of your articles indicate if new words added could be duplicates of
the original words. Does this happen?

Other parameters of the nature listed, are also missing.

Bottom line is your articles are gibberish.

Work towards writing articles which are clear, concise and coherent.

My whole meaning is, I want to compare 2 html documents in words (not
lines).
"diff" command does not solve my problem, because it is line based.
So I try to write a perl program.
The algorithm is: from 2 files -> 2 arrays. Compare 2 arrays. The result
of comparison shows what words are added by my colleagues.

I posted to forum so that some smart guys out there, teach me the
shortest way from a file to an array of words.

My job is web page layout, because my colleagues do not know HTML. I also
want to know what words are added, and in which parts. I am
responsible for content too (in addition to the layout of the web page).

usenet · Nov 29, 2005

ngoc said:
My whole meaning is, I want to compare 2 html documents in words (not lines).

I posted to forum so that some smart guys out there, teach me the
shortest way from a file to an array of words.

Well, that's easy (push @array, split ' ', $_ while <FH>). But I
don't think that's what you really want to know.

Your definition of "word" is unclear in the broader context of your
requirements. Do you mean printable characters separated by whitespace
(which is what I assumed in my code above)? Think about it a bit.
Consider this bit of HTML:

< a href = 'www.google.com' >
Google
< /a >
<a href='www.yahoo.com'>Yahoo</a>

The link to Google is ten "words", whereas the link to Yahoo is only
two "words." Yet they both represent the same sort of HTML data
structure with the same level of "complexity."
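
The counts are easy to check with a short sketch that applies the whitespace definition of "word" to the two links above. The helper name count_words is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count whitespace-separated "words" in a string.
sub count_words {
    my ($text) = @_;
    my @words = split ' ', $text;
    return scalar @words;
}

# The two links from the HTML example, copied with their original spacing.
my $google = "< a href = 'www.google.com' >\nGoogle\n< /a >\n";
my $yahoo  = "<a href='www.yahoo.com'>Yahoo</a>\n";

printf "Google link: %d words\n", count_words($google);   # Google link: 10 words
printf "Yahoo link:  %d words\n", count_words($yahoo);    # Yahoo link:  2 words
```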

Suppose your original file looks like the example above, but some
dimwit changes it to look like this:

<a href='www.google.com' >Google</a>
<a href= "askjeeves.com" > The best search engine ever!!!!!!
</a>
<a href =
'www.yahoo.com' >
Yahoo < /a >

Exactly WHAT do you want your comparison script to print? (notice that
the Google link is now three "words" instead of ten, even though it is
functionally identical, and, of course, a number of "words" are added
regarding AskJeeves, and the Yahoo link is now several "words" even
though it is functionally identical).

What would your script print in this case???

robic0 · Nov 29, 2005

ngoc wrote:

Or normalize the files, and then use diff.

What do you mean by "normalize"?
