String comparison optimization

I

ironmanda

I have some code that attempts to match a set A of strings against a
set B of strings,
where A ~ 30,000 and B ~ 2,500. A string in A can match any number of
strings in B,
so there is no early exit criteria. The code is roughly:

foreach my $a (keys %A) {
$a=lc($a);
# Repeat the string 3 times so as to allow up to 3!=6 permutations
# of $b to match
$a.=qq( $a $a);
foreach my $b (keys %B) {
# $b is lowercase already
# Allow string fragments of $b to match anywhere in $a - use .*
# in place of spaces
$b=~s/\s+/\.\*/g;
if ($a=~/$b/) {
&doSomething($a, $b);
}
}
}

Is there a faster way to do this?

David
 
U

Uri Guttman

i> I have some code that attempts to match a set A of strings against a
i> set B of strings,
i> where A ~ 30,000 and B ~ 2,500. A string in A can match any number of
i> strings in B,
i> so there is no early exit criteria. The code is roughly:

i> foreach my $a (keys %A) {
i> $a=lc($a);

don't use $a and $b for regular vars (even in examples). they are
reserved for sort. in general single letter var names are bad (with the
exception of $i, $j, etc.). plenty of past threads in here cover that.

i> # Repeat the string 3 times so as to allow up to 3!=6 permutations
i> # of $b to match
i> $a.=qq( $a $a);
i> foreach my $b (keys %B) {

why get the keys each time? they don't change for each outer loop. get
the keys into an array outside the loops and loop over that.

i> # $b is lowercase already
i> # Allow string fragments of $b to match anywhere in $a - use .*
i> # in place of spaces
i> $b=~s/\s+/\.\*/g;

you do this each time in the %A loop when it should be done one
time. move it to the outside of both loops.

also since this will be recompiled into a regex each time, you should
make the array consist of qr// expressions.

i> if ($a=~/$b/) {
i> &doSomething($a, $b);

don't use & to call functions. plenty of past threads in here cover that.

uri
 
B

Ben Morrow

Quoth (e-mail address removed):
I have some code that attempts to match a set A of strings against a
set B of strings,
where A ~ 30,000 and B ~ 2,500. A string in A can match any number of
strings in B,
so there is no early exit criteria. The code is roughly:

foreach my $a (keys %A) {

The variables $a and $b are magic (they are used by sort, so you don't
need to predeclare them under 'strict'). It's better to avoid them.
$a=lc($a);
# Repeat the string 3 times so as to allow up to 3!=6 permutations
# of $b to match
$a.=qq( $a $a);
foreach my $b (keys %B) {
# $b is lowercase already
# Allow string fragments of $b to match anywhere in $a - use .*
# in place of spaces
$b=~s/\s+/\.\*/g;

.. and * are not magic in double-quoted strings (the RHS of s/// is a
double-quoted string.
if ($a=~/$b/) {

This will (or may, at any rate) recompile the regex every time. You can
probably make it faster by precompiling it:

$b = qr/$b/;
if ($a =~ $b) {

Are the strings in $b regular expressions, or are they literal strings?
That is, if you have $a = 'aaa' and $b = 'a*', do you want a match or
not? If the latter, then you need to use \Q:

$b = qr/\Q$b/;

You *may* also be able to make it faster by using index, or by splitting
$a and $b into their consitiuent parts and using a hash to match them.
You would need to specify more exactly the sorts of things you are
matching for us to help you with this.
&doSomething($a, $b);

Don't call subs with & unless you know what it does.

Ben
 
J

jgraber

I have some code that attempts to match a set A of strings against a
set B of strings,
where A ~ 30,000 and B ~ 2,500. A string in A can match any number of
strings in B,
so there is no early exit criteria. The code is roughly:

foreach my $a (keys %A) {
$a=lc($a);
# Repeat the string 3 times so as to allow up to 3!=6 permutations
# of $b to match
$a.=qq( $a $a);
foreach my $b (keys %B) {
# $b is lowercase already
# Allow string fragments of $b to match anywhere in $a - use .*
# in place of spaces
$b=~s/\s+/\.\*/g;
if ($a=~/$b/) {
&doSomething($a, $b);
}
}
}

Is there a faster way to do this?

David

You've provided a code example. Thats good.
But its not complete enough to run, because
there is no data provided, ( you could use __DATA__ )
so we have no examples of what your data looks like.

Without examples of your data, it seems likely that your code
fragment does not well represent your real code,
and/or does not represent your true intent.
An example is worth 1000 words of description.

But otherwise, these general comments may help.
1 use of variable names $a and $b is discouraged except inside sort routines.
so pick some other variable name, even in short example code.

2 refactor your loop, or build a new %Bprime hash to search on.
As is, the same operation $b=~s/\s+/\.\*/g;
is applied to the first key of %B ~30,000 times.
The new %Bprime could be precompiled regexps. using qr()

3 There may already be a module to to do hashkey by regexp lookup.
If this is genetics related, there are quite likely modules
for it already.

4 Since you will be doing ~3000 pattern matches on $a,
you could try studying it first. perldoc -f study

5 The biggest time issue is likely the .* in $b
since it seems like they will cause lots of backtracking.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,202
Messages
2,571,055
Members
47,658
Latest member
jaguar32

Latest Threads

Top