String comparison optimization

ironmanda · Nov 29, 2006

I have some code that attempts to match a set A of strings against a
set B of strings,
where A ~ 30,000 and B ~ 2,500. A string in A can match any number of
strings in B,
so there is no early exit criteria. The code is roughly:

foreach my $a (keys %A) {
$a=lc($a);
# Repeat the string 3 times so as to allow up to 3!=6 permutations
# of $b to match
$a.=qq( $a $a);
foreach my $b (keys %B) {
# $b is lowercase already
# Allow string fragments of $b to match anywhere in $a - use .*
# in place of spaces
$b=~s/\s+/\.\*/g;
if ($a=~/$b/) {
&doSomething($a, $b);
}
}
}

Is there a faster way to do this?

David

Uri Guttman · Nov 29, 2006

i> I have some code that attempts to match a set A of strings against a
i> set B of strings,
i> where A ~ 30,000 and B ~ 2,500. A string in A can match any number of
i> strings in B,
i> so there is no early exit criteria. The code is roughly:

i> foreach my $a (keys %A) {
i> $a=lc($a);

don't use $a and $b for regular vars (even in examples). they are
reserved for sort. in general single letter var names are bad (with the
exception of $i, $j, etc.). plenty of past threads in here cover that.

i> # Repeat the string 3 times so as to allow up to 3!=6 permutations
i> # of $b to match
i> $a.=qq( $a $a);
i> foreach my $b (keys %B) {

why get the keys each time? they don't change for each outer loop. get
the keys into an array outside the loops and loop over that.

i> # $b is lowercase already
i> # Allow string fragments of $b to match anywhere in $a - use .*
i> # in place of spaces
i> $b=~s/\s+/\.\*/g;

you do this each time in the %A loop when it should be done one
time. move it to the outside of both loops.

also since this will be recompiled into a regex each time, you should
make the array consist of qr// expressions.

i> if ($a=~/$b/) {
i> &doSomething($a, $b);

don't use & to call functions. plenty of past threads in here cover that.

uri

Ben Morrow · Nov 29, 2006

Quoth (e-mail address removed):

I have some code that attempts to match a set A of strings against a
set B of strings,
where A ~ 30,000 and B ~ 2,500. A string in A can match any number of
strings in B,
so there is no early exit criteria. The code is roughly:

foreach my $a (keys %A) {

The variables $a and $b are magic (they are used by sort, so you don't
need to predeclare them under 'strict'). It's better to avoid them.

$a=lc($a);
# Repeat the string 3 times so as to allow up to 3!=6 permutations
# of $b to match
$a.=qq( $a $a);
foreach my $b (keys %B) {
# $b is lowercase already
# Allow string fragments of $b to match anywhere in $a - use .*
# in place of spaces
$b=~s/\s+/\.\*/g;

.. and * are not magic in double-quoted strings (the RHS of s/// is a
double-quoted string.

if ($a=~/$b/) {

This will (or may, at any rate) recompile the regex every time. You can
probably make it faster by precompiling it:

$b = qr/$b/;
if ($a =~ $b) {

Are the strings in $b regular expressions, or are they literal strings?
That is, if you have $a = 'aaa' and $b = 'a*', do you want a match or
not? If the latter, then you need to use \Q:

$b = qr/\Q$b/;

You *may* also be able to make it faster by using index, or by splitting
$a and $b into their consitiuent parts and using a hash to match them.
You would need to specify more exactly the sorts of things you are
matching for us to help you with this.

&doSomething($a, $b);

Don't call subs with & unless you know what it does.

Ben

jgraber · Nov 29, 2006

I have some code that attempts to match a set A of strings against a
set B of strings,
where A ~ 30,000 and B ~ 2,500. A string in A can match any number of
strings in B,
so there is no early exit criteria. The code is roughly:

foreach my $a (keys %A) {
$a=lc($a);
# Repeat the string 3 times so as to allow up to 3!=6 permutations
# of $b to match
$a.=qq( $a $a);
foreach my $b (keys %B) {
# $b is lowercase already
# Allow string fragments of $b to match anywhere in $a - use .*
# in place of spaces
$b=~s/\s+/\.\*/g;
if ($a=~/$b/) {
&doSomething($a, $b);
}
}
}

Is there a faster way to do this?

David

You've provided a code example. Thats good.
But its not complete enough to run, because
there is no data provided, ( you could use __DATA__ )
so we have no examples of what your data looks like.

Without examples of your data, it seems likely that your code
fragment does not well represent your real code,
and/or does not represent your true intent.
An example is worth 1000 words of description.

But otherwise, these general comments may help.
1 use of variable names $a and $b is discouraged except inside sort routines.
so pick some other variable name, even in short example code.

2 refactor your loop, or build a new %Bprime hash to search on.
As is, the same operation $b=~s/\s+/\.\*/g;
is applied to the first key of %B ~30,000 times.
The new %Bprime could be precompiled regexps. using qr()

3 There may already be a module to to do hashkey by regexp lookup.
If this is genetics related, there are quite likely modules
for it already.

4 Since you will be doing ~3000 pattern matches on $a,
you could try studying it first. perldoc -f study

5 The biggest time issue is likely the .* in $b
since it seems like they will cause lots of backtracking.

Joinging the list to string	1	Dec 2, 2024
Picture Comparison Code Not Working Properly	1	Jul 24, 2021
Image overlay and comparison code error.	2	Jul 1, 2021
Noob question about mathematical addition vs. "string addition" in C#	1	Mar 6, 2022
Seeking co-founders for my company.	3	Sep 8, 2024
Output confusion	2	Mar 9, 2023
Issue with textbox script?	0	Sep 5, 2022
Please Help me to Write these C programs, I am fully confused to solve these Programs. Thanks alot.	1	May 30, 2022

String comparison optimization

ironmanda

Uri Guttman

Ben Morrow

jgraber

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads