loop in loop takes too much time

F

FMAS

I am comparing 2 lists of words and want to output the words in list 1
which are not available in list 2. Both lists have a different number
of entries.

Basically the script below works, but if the lists are large it takes
ages. I tested it on 2 lists of approx 15,000 entries each and after
25 min I had processed only 316 entries!

Here the script:


open(WORDLIST1,'C:\temp\a.txt') || die("cannot open file1!\n");
@list1 = <WORDLIST1>;
open(WORDLIST2,'C:\temp\b.txt') || die("cannot open file2!\n");
@list2 = <WORDLIST2>;

# loop in loop

foreach $list1 (@list1) {
foreach $list2 (@list2) {
chomp $list1;
chomp $list2;

last if ($list1 =~ m/$list2/i) ; # if match found look for next $list1
$lastentry = $list2[$#list2]; # in order to print entry only once when
no match found
if ($list2 =~ m/$lastentry/i) {
print "$list1\n";
}}}

Any suggestions?

I have heard that it would be much faster to build a search tree, but
I don't know how to do it (probably too complicated for a newbee like
me).

Thanks for your help!

Francois
 
I

Iain

(e-mail address removed) (FMAS) wrote in @posting.google.com:
I am comparing 2 lists of words and want to output the words in list 1
which are not available in list 2. Both lists have a different number
of entries.

Basically the script below works, but if the lists are large it takes
ages. I tested it on 2 lists of approx 15,000 entries each and after
25 min I had processed only 316 entries!

If you're merely trying to find words in file1 that aren't in file2 I
wouldn't bother with arrays at all. Hashes are usually much more
convenient for this kind of thing. Your other big problem was using a
regular expression match construct to see if the two words are
identical. Don't do that! If you simply want to test two strings for
equality use the 'eq' string comparison operator.


#!/usr/bin/perl

use strict;
use warnings;

# Firstly get a unique list of all words found in file2 as the set of
# keys of a hash
my %file2words;
open( F2WORDS, '<', 'c:\temp\b.txt' )
or die "open failed: $!";
while ( <F2WORDS> ) {
chomp;
$file2words{$_}++;
# the above line has the useful side-effect of counting the number
# of occurrences of each word
}
close( F2WORDS ) or die "close failed: $!";

# Now we can go through file1 and check whether each word in turn
# has been previously seen in file2. Similar technique:
open( F1WORDS, '<', 'c:\temp\a.txt' )
or die "open failed: $!";
while( <F1WORDS> ) {
chomp;
next if $file2words{$_};
print "$_\n";
}
close( F1WORDS ) or die "close failed: $!";

__END__

Note that as long as the very last line in your file has a \n at the end
of it, you can get rid of the two "chomp" lines and change the print
statement to simply 'print;' -- and once you've done that you can
further chop this script down by using various common Perl idioms:


#!/usr/bin/perl

use strict;
use warnings;

open( F2WORDS, '<', 'c:\temp\b.txt' )
or die "open failed: $!";
my %file2words = map { $_, 1 } <F2WORDS>;
close( F2WORDS ) or die "close failed: $!";

open( F1WORDS, '<', 'c:\temp\a.txt' )
or die "open failed: $!";
$file2words{$_} or print while <F1WORDS>;
close( F1WORDS ) or die "close failed: $!";

__END__


The moral of all this? Basically, you have to start doing a bit of
lateral thinking with Perl. Instead of the naive algorithm (two arrays
and nested loops), there's usually a more subtle and usually simpler way
to do things. Good luck!
 
A

Anno Siegel

Iain said:
(e-mail address removed) (FMAS) wrote in @posting.google.com:


If you're merely trying to find words in file1 that aren't in file2 I
wouldn't bother with arrays at all.

That's what OP said, but the code shows the matches are supposed to
be case-insensitive.
Hashes are usually much more
convenient for this kind of thing. Your other big problem was using a
regular expression match construct to see if the two words are
identical. Don't do that! If you simply want to test two strings for
equality use the 'eq' string comparison operator.

A hash can still be used, but some case normalization must be done:

my ( $f1, $f2);
open $f1, $_ or die "Can't read $_: $!" for '/tmp/a';
open $f2, $_ or die "Can't read $_: $!" for '/tmp/b';
my %file2words = map { uc() => 1 } <$f2>;
$file2words{ uc()} or print while <$f1>;

However, the regex solution should not be dismissed out of hand. The
regex engine can be surprisingly efficient. You would build a combined
regex out of the second file, as in

our $re = join '|', map quotemeta, sort { length $b <=> length $a } <$f2>;

The sorting is necessary in case one of the words is the beginning of
another. We want to try the longer word first.

In this case, for the small set of data I used, the hash method is indeed
faster by a factor of 2, but different data, or a different machine, might
give different results.

The efficiency of the regex is still remarkable, considering that it
checks all alternatives (words from file 2) each time, while the hash
computes one hash key and tries one access. The reason is the same
as always in low-level optimization in Perl. You can do a *lot* of
stuff when you can make the Perl core (or an Inline'd or otherwise
loaded routine) do it in one step.

The more looping and logic you can squeeze into one Perl opcode (as
the parlance is), the better, efficiency-wise. The regex engine
incorporates much looping and logic in one call. It is often still
efficient when a faster algorithm (that needs a few opcodes to
implement) is available.

Anno
 
J

Jürgen Exner

FMAS said:
I am comparing 2 lists of words and want to output the words in list 1
which are not available in list 2. [...]

This Question is Asked Frequently, please see "perldoc -q instersection

Some more comments on your code:

Please add
use strict;
use warnings;
You should never develop a program without them.
open(WORDLIST1,'C:\temp\a.txt') || die("cannot open file1!\n");
You may want to add the reason why the open failed: ....open file1 because
$! !\n");
You may want to use the lower precedence operator "or" instead of "||"
because then there is no ambiguity as to evaluation order.
@list1 = <WORDLIST1>;
There is absolutely no need to read list 1 into memory. You can process this
file easier line by line.
open(WORDLIST2,'C:\temp\b.txt') || die("cannot open file2!\n");
@list2 = <WORDLIST2>;

If you would use a hash for list2 instead of an array, then you could take
advantage of the (almost) O(1) lookup time in a hash.
# loop in loop

foreach $list1 (@list1) {
foreach $list2 (@list2) {
chomp $list1;
chomp $list2;

While in general chomp() is good idea in this specific case it isn't needed.
Normally not such a big issue, but you are even chomping every line many,
many times over. If at all use chomp) _once_ on each array _outside_ of the
loop (chomp() takes a list, too, no need to loop through each element one by
one).
last if ($list1 =~ m/$list2/i) ; # if match found look for next $list1
Ok, this will work. But do you really need a RegExp? From your verbal
description I would exepect that actually you are looking for equality, not
for an RE match?
$lastentry = $list2[$#list2]; # in order to print entry only once when
no match found
if ($list2 =~ m/$lastentry/i) {
It took me a long time to understand what you are doing with this
$lastentry.
The more canonical way would be to use a binary flag $found, which you set
to "0" just before starting the inner loop, if you find a match then set it
to "1" inside the inner loop, and then do the print() just after ending the
the inner loop iff the flag is true (i.e. "1").
print "$list1\n";
}}}

Any suggestions?

I have heard that it would be much faster to build a search tree, but

That is true, but it would be even faster to use a hash.
What about:

use warnings; use strict;
my %list2;
open(WORDLIST2,'C:\temp\b.txt') or
die("cannot open 'C:\temp\b.txt' because $! !\n");
while (<WORDLIST2>) {
$list2{$_}=1; #or any other value
}
open(WORDLIST1,'C:\temp\a.txt') or
die("cannot open file1 because $! !\n");
while (<WORDLIST1>){
print unless exists($list2{$_});
}

jue
 
F

FMAS

Thanks Lain!

It works amazingly fast (time cut from hours to one second!).
I see that I have a lot of homework to do...

rgds

Francois
 
B

Ben Morrow

Quoth (e-mail address removed)-berlin.de (Anno Siegel):
That's what OP said, but the code shows the matches are supposed to
be case-insensitive.

A hash can still be used, but some case normalization must be done:

my ( $f1, $f2);
open $f1, $_ or die "Can't read $_: $!" for '/tmp/a';
open $f2, $_ or die "Can't read $_: $!" for '/tmp/b';
my %file2words = map { uc() => 1 } <$f2>;

Just a minor nit: in a Unicode environment it is better to use lc for
case-folding than uc, as Unicode has three cases: upper, lower and
title, with mappings upper <-> lower and title <-> lower.

Ben
 
A

Anno Siegel

Ben Morrow said:
Quoth (e-mail address removed)-berlin.de (Anno Siegel):
[...]
A hash can still be used, but some case normalization must be done:

my ( $f1, $f2);
open $f1, $_ or die "Can't read $_: $!" for '/tmp/a';
open $f2, $_ or die "Can't read $_: $!" for '/tmp/b';
my %file2words = map { uc() => 1 } <$f2>;

Just a minor nit: in a Unicode environment it is better to use lc for
case-folding than uc, as Unicode has three cases: upper, lower and
title, with mappings upper <-> lower and title <-> lower.

Ah, thanks for that (I knew about title case, but never drew this
conclusion). It is good to have a general default for the uc/lc
choice, when there is no other preference.

As I look at that line again, I notice that the map block
"{ lc() => 1 }" looks exactly like a one-pair hashref. Prefixed
with a "+", or in a different context, it is one. Would

map { ( lc() => 1) } <$f2>;

be clearer? I don't consider unnecessary parentheses often, but
this may be a case. What does the style pol^Wpanel say?

Anno
 
B

Ben Morrow

Quoth (e-mail address removed)-berlin.de (Anno Siegel):
As I look at that line again, I notice that the map block
"{ lc() => 1 }" looks exactly like a one-pair hashref. Prefixed
with a "+", or in a different context, it is one. Would

map { ( lc() => 1) } <$f2>;

be clearer? I don't consider unnecessary parentheses often, but
this may be a case. What does the style pol^Wpanel say?

I would probably omit the block altogether, and write

my %file2words = map +( (lc) => 1 ), <$f2>;

or forget the 1 and just use

my %file2words;
@file2words{ map lc, <$f2> } = ();

The first is a case for not using =>:

my %file2words = map +(lc, 1), <$f2>;

Hmm, I think I like that best, modulo my standard irritation at the need
for the '+' (one of my projects for when-I-get-round-tuit is writing a
pragma to make whitespace between function-name and '(' significant (or
maybe even get rid of 'function(args)' altogether, in favour of
'(function args)' where necessary), along with allowing methods to be
called as list operators).

Ben
 
A

Anno Siegel

Ben Morrow said:
Quoth (e-mail address removed)-berlin.de (Anno Siegel):

I would probably omit the block altogether, and write

my %file2words = map +( (lc) => 1 ), <$f2>;

or forget the 1 and just use

my %file2words;
@file2words{ map lc, <$f2> } = ();

The first is a case for not using =>:

my %file2words = map +(lc, 1), <$f2>;

Hmm, I think I like that best, modulo my standard irritation at the need
for the '+'

It's certainly short, and also clear modulo that irritation. "+"
spoils readability, just because it is so rarely needed. Pity about
the "=>", pointing from key to value in the prospective hash.
(one of my projects for when-I-get-round-tuit is writing a
pragma to make whitespace between function-name and '(' significant (or

Good thing Abigail isn't looking...
maybe even get rid of 'function(args)' altogether, in favour of
'(function args)' where necessary), along with allowing methods to be
called as list operators).

Projects involving the parser awe me.

I have a certain dislike for the (function args) notation, but that
may be entirely subjective. I realize how it meshes list operator
notation with the function of () as (sub-)expression delimiters.

Anno
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,968
Messages
2,570,154
Members
46,702
Latest member
LukasConde

Latest Threads

Top