from a file to an array of words

Dr.Ruud

ngoc wrote:
I want to compare two text(html) files. "diff" command in Linux
compares only by line.
In Perl, I can use "getlines" of FileHandle object and "split" later.
Is there a single function(method) to go from a file to an array of
words. Thanks

Or normalize the files, and then use diff.
 
Anno Siegel

ngoc said:
I want to compare two text(html) files. "diff" command in Linux compares
only by line.
In Perl, I can use "getlines" of FileHandle object and "split" later. Is
there a single function(method) to go from a file to an array of words.

There is no function named "getlines" in Perl. Are you using a module?
Which one?

Of course you can split the content of a file into words. Defining
"word" as any string of non-whitespace characters, this does it

my @words = map split, <$file>;
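For readers who want the same idea with explicit file handling, here is a minimal sketch. The sub name read_words is made up for illustration, and "word" still means any run of non-whitespace characters.

```perl
use strict;
use warnings;

# Hypothetical helper: return every whitespace-separated word read from
# an already opened file handle.
sub read_words {
    my ($fh) = @_;
    my @words;
    while (my $line = <$fh>) {
        # split ' ' skips leading whitespace and splits on any run of
        # spaces, tabs, or newlines.
        push @words, split ' ', $line;
    }
    return @words;
}
```

It could be called as: open my $fh, '<', 'old.html' or die "old.html: $!"; my @words = read_words($fh); (the file name is only an example).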

However, having a list of words, or two, doesn't immediately tell you the
difference between the lists. The algorithm used by the diff command
is a complex beast. You'd probably be best off looking for a CPAN
module. A search for "diff" brings up a few likely candidates.
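To give a feel for what such a comparison involves, here is a rough sketch based on a plain longest-common-subsequence table. It is simpler and slower than what the diff command does, and it is not the interface of any CPAN module; the sub name word_diff and the [tag, word] output format are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: compare two word lists (array refs) and return a
# list of [tag, word] pairs, where tag is ' ' (unchanged), '-' (only in
# the old list) or '+' (only in the new list).
sub word_diff {
    my ($old, $new) = @_;
    my ($n, $m) = (scalar @$old, scalar @$new);

    # $len[$i][$j] = length of the longest common subsequence of
    # $old->[$i .. $n-1] and $new->[$j .. $m-1].
    my @len = map { [ (0) x ($m + 1) ] } 0 .. $n;
    for my $i (reverse 0 .. $n - 1) {
        for my $j (reverse 0 .. $m - 1) {
            if ($old->[$i] eq $new->[$j]) {
                $len[$i][$j] = $len[$i + 1][$j + 1] + 1;
            }
            else {
                my ($skip_old, $skip_new) = ($len[$i + 1][$j], $len[$i][$j + 1]);
                $len[$i][$j] = $skip_old >= $skip_new ? $skip_old : $skip_new;
            }
        }
    }

    # Walk the table from the start, emitting one pair per word.
    my ($i, $j, @out) = (0, 0);
    while ($i < $n && $j < $m) {
        if ($old->[$i] eq $new->[$j]) {
            push @out, [ ' ', $old->[$i] ];
            $i++;
            $j++;
        }
        elsif ($len[$i + 1][$j] >= $len[$i][$j + 1]) {
            push @out, [ '-', $old->[$i++] ];
        }
        else {
            push @out, [ '+', $new->[$j++] ];
        }
    }
    # Whatever is left over on either side was deleted or added.
    push @out, [ '-', $old->[$i++] ] while $i < $n;
    push @out, [ '+', $new->[$j++] ] while $j < $m;
    return @out;
}
```

Printing the result with print "$_->[0] $_->[1]\n" for word_diff(\@old, \@new); gives one word per line with its tag in front. The table needs memory proportional to the product of the two list lengths, so this is only reasonable for small documents.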

Anno
 
Paul Lalli

Anno said:
There is no function named "getlines" in Perl. Are you using a module?
Which one?

Er. Seems pretty clear to me which one he's using....

Paul Lalli
 
Anno Siegel

ngoc said:
I want to pick out those words added by my colleagues. So from two
arrays of words, I use Set::Scalar module to compare.

That would be set comparison, which is entirely different from what the
diff command does. Mentioning "diff" in your original posting was a red
herring.
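The distinction can be illustrated with a small sketch that uses a plain hash instead of Set::Scalar. The sub name added_words is hypothetical. Note what a set comparison cannot see: word order, and extra copies of a word that was already present.

```perl
use strict;
use warnings;

# Hypothetical helper: words that occur in @$new but nowhere in @$old.
# Each word is reported once, in order of first appearance in @$new.
sub added_words {
    my ($old, $new) = @_;
    my %seen = map { $_ => 1 } @$old;
    return grep { !$seen{$_}++ } @$new;
}
```

With the old list (the fine manual) and the new list (the very very fine manual the), this reports only "very": the second "very" and the extra "the" go unnoticed, which a diff-style comparison would catch.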

Anno
 
ngoc

I want to compare two text(html) files. "diff" command in Linux compares
only by line.
In Perl, I can use "getlines" of FileHandle object and "split" later. Is
there a single function(method) to go from a file to an array of words.
Thanks
 
ngoc

However, having a list of words, or two, doesn't immediately tell you the
difference between the lists. The algorithm used by the diff command
is a complex beast. You'd probably be best off looking for a CPAN
module. A search for "diff" brings up a few likely candidates.

I want to pick out those words added by my colleagues. So from two
arrays of words, I use Set::Scalar module to compare.
 
Ingo Menger

ngoc wrote:

My whole meaning is, I want to compare 2 html documents in words (not
lines).

You should define what it means to you to compare "in words". I suggest
that perl (as any other language) compares strings "in characters" or
even "in bits".

"diff" command does not solve my problem, because it is line based.

This sounds like nonsense.
Suppose, you have the following output from diff:

232c232
< the fucking manual
---
> the fine manual

Does this not tell you which word was changed? If you don't like it,
write a perl program that replaces every sequence of spaces with a
newline, then you can use diff on the result, since then a word and a
line are the same.
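A minimal sketch of such a program follows. The sub name one_word_per_line and the file names in the usage note are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: copy text from the handle $in to the handle $out,
# writing one whitespace-separated word per line.
sub one_word_per_line {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        print {$out} "$_\n" for split ' ', $line;
    }
}
```

Saved in a script (say words.pl) that ends with one_word_per_line(\*STDIN, \*STDOUT); it could be run as perl words.pl < old.html > old.words, and likewise for the new file, after which diff old.words new.words reports changes word by word.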

So I try to write a perl program.
The algorithm is: from 2 files -> 2 arrays. Compare 2 arrays. The result
of comparison shows what words are added by my colleagues.

What if they deleted some words? Or changed some?
 
ngoc

As noted by Anno, your articles are misleading. Personally, I would
not be as kind and would rate your articles as gibberish.

Your original article mentions html. Seems logical you would have a copy
of the original html page, and a copy of the current modified page in order
to compare the two. It would seem logical you employ a device to keep
track of modifications on a per modification basis. Are you doing this?

No. But cvs does not help in this case. We have only 2 versions.

Are modifications appended to your original page? If so, you do not
need to compare; you only need to "tail" for changes.

Not appended, but adding words in html documents.

The term "words" used by you indicates you have developed two arrays
which contain one word per element, not a sentence per element. One
array contains a list of original words, and the other a list of words after
modification over a period of time. Is this so?

Yes.

You do not indicate if you remove html tags which may contain words which
match your original words. You also do not indicate if others are adding new
html tags which contain different markup words or the same markup words
as original. Are those to be considered?

None of your articles indicate if new words added could be duplicates of
the original words. Does this happen?

Other parameters of the nature listed, are also missing.

Bottom line is your articles are gibberish.

Work towards writing articles which are clear, concise and coherent.

My whole meaning is, I want to compare 2 html documents in words (not
lines).
"diff" command does not solve my problem, because it is line based.
So I try to write a perl program.
The algorithm is: from 2 files -> 2 arrays. Compare 2 arrays. The result
of comparison shows what words are added by my colleagues.

I posted to forum so that some smart guys out there, teach me the
shortest way from a file to an array of words.

My job is web page layout, because my colleagues do not know html. I also
want to know what words are added in, and in which parts. I am
responsible for content too (in addition to layout of the web pages).
 
usenet

ngoc said:
My whole meaning is, I want to compare 2 html documents in words (not lines).

I posted to forum so that some smart guys out there, teach me the
shortest way from a file to an array of words.

Well, that's easy (push @array, split ' ', $_ while <FH>;). But I
don't think that's what you really want to know.

Your definition of "word" is unclear in the broader context of your
requirements. Do you mean printable characters separated by whitespace
(which is what I assumed in my code above)? Think about it a bit.
Consider this bit of HTML:

< a href = 'www.google.com' >
Google
< /a >
<a href='www.yahoo.com'>Yahoo</a>

The link to Google is ten "words", whereas the link to Yahoo is only
two "words." Yet they both represent the same sort of HTML data
structure with the same level of "complexity."

Suppose your original file looks like the example above, but some
dimwit changes it to look like this:

<a href='www.google.com' >Google</a>
<a href= "askjeeves.com" > The best search engine ever!!!!!!
</a>
<a href =
'www.yahoo.com' >
Yahoo < /a >

Exactly WHAT do you want your comparison script to print? (notice that
the Google link is now three "words" instead of ten, even though it is
functionally identical, and, of course, a number of "words" are added
regarding AskJeeves, and the Yahoo link is now several "words" even
though it is functionally identical).

What would your script print in this case???
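One possible answer, sketched under the assumption that only the visible text matters and the markup can be ignored (the original poster never said so), is to drop the tags before splitting. The sub name visible_words is hypothetical, and the regex is a crude stand-in for a real HTML parser: it assumes no '>' inside attribute values, no comments, and no script blocks, and it does not decode entities.

```perl
use strict;
use warnings;

# Hypothetical helper: crude tag stripper followed by a whitespace split.
# Assumption: every tag is '<', any characters except '>', then '>'.
sub visible_words {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>/ /g;   # replace each tag with a space
    return split ' ', $text;
}
```

For the two samples above this gives (Google, Yahoo) for the original and (Google, The, best, search, engine, ever!!!!!!, Yahoo) for the modified version, so the reformatted Google and Yahoo links no longer count as changes. A changed href, such as the new askjeeves.com address, would go unnoticed, which may or may not be acceptable.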
 
