ngoc said:
My whole meaning is, I want to compare 2 html documents in words (not lines).
I posted to forum so that some smart guys out there, teach me the
shortest way from a file to an array of words.
Well, that's easy (push @array, split ' ', $_ while <FH>;). But I
don't think that's what you really want to know.
Your definition of "word" is unclear in the broader context of your
requirements. Do you mean printable characters separated by whitespace
(which is what I assumed in my code above)? Think about it a bit.
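For reference, here is a minimal sketch of the "printable characters separated by whitespace" reading, spelled out a bit more than the one-liner. The helper name words_from_fh and the sample text are made up for illustration; the in-memory filehandle just stands in for your real file.

```perl
use strict;
use warnings;

# Hypothetical helper: return every whitespace-separated "word"
# read from an already-open filehandle.
sub words_from_fh {
    my ($fh) = @_;
    my @words;
    while (my $line = <$fh>) {
        # split ' ' splits on runs of any whitespace (spaces, tabs,
        # newlines) and ignores leading whitespace on the line.
        push @words, split ' ', $line;
    }
    return @words;
}

# Sample input (an assumption); an in-memory filehandle stands in for a file.
my $text = "  The quick\tbrown fox\njumps over\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @words = words_from_fh($fh);
close $fh;

print scalar(@words), " words: @words\n";
```

Whether this is the right notion of "word" for HTML is a separate question.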
Consider this bit of HTML:
< a href = 'www.google.com' >
Google
< /a >
<a href='www.yahoo.com'>Yahoo</a>
The link to Google is ten "words", whereas the link to Yahoo is only
two "words." Yet they both represent the same sort of HTML data
structure with the same level of "complexity."
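To make those counts concrete, here is a small sketch that counts whitespace-separated "words" in the two links. The helper name count_words is hypothetical, and the strings assume each URL sits on the same line as its opening quote.

```perl
use strict;
use warnings;

# Hypothetical helper: count whitespace-separated "words" in a string.
sub count_words {
    my ($text) = @_;
    my @words = split ' ', $text;   # split on runs of whitespace
    return scalar @words;
}

# The two links from the example, typed in as strings (assumed layout).
my $google = "< a href = 'www.google.com' >\nGoogle\n< /a >\n";
my $yahoo  = "<a href='www.yahoo.com'>Yahoo</a>\n";

my $google_count = count_words($google);
my $yahoo_count  = count_words($yahoo);

print "Google link: $google_count words\n";   # 10
print "Yahoo link:  $yahoo_count words\n";    # 2
```

The same kind of markup yields very different counts depending purely on where the author happened to put spaces.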
Suppose your original file looks like the example above, but some
dimwit changes it to look like this:
<a href='www.google.com' >Google</a>
<a href= "askjeeves.com" > The best search engine ever!!!!!!
</a>
<a href =
'www.yahoo.com' >
Yahoo < /a >
Exactly WHAT do you want your comparison script to print? Notice that
the Google link is now only a few "words" instead of ten, even though it
is functionally identical; a number of "words" have been added for
AskJeeves; and the Yahoo link is now several "words" instead of two,
even though it, too, is functionally identical.
What would your script print in this case?