Converting a string to multiple search patterns

T

Tore Aursand

I think 11. A D should come before 10. B D, because all else being
equal, A comes before B.

Hmm. This is meant for a search engine, and personally I think it makes
more sense to score based on how close the words are to each other. In
addition, words "to the left" should score higher than words on the "right
side".

Maybe I'm totally wrong, but I think most search engines out there works
like this; they always try to match "early" words before "later" words.

Or am I wrong?
 
A

Anno Siegel

Well, my regex solution agrees with you, and while the insides of the
regex are a bit ugly to look at, the algorithm is far simpler than it
seems.

#!/usr/bin/perl -l

my $rx;

{
use re 'eval';
my @kw = qw( A B C D E );
$rx = qr{
(?{ local ($s, $f) = (0, 1) })
^ \s*
@{[map qq{ (?:
\Q$_\E (?{ \$s += \$f <<= 1 }) |
(?{ \$f = 1 })
) \\s*
}, @kw ]}
$
(?{ $s })
}x;
}

while (<{A,}{B,}{C,}{D,}{E,}>) {
chomp;
print "$^R\t$_" if /$rx/ and $^R;
}

Run that code through '| sort -n', and you'll get:

2 A
2 B
2 C
2 D
2 E
4 AC
4 AD
4 AE
4 BD
4 BE
4 CE
6 AB
6 ACE
6 BC
6 CD
6 DE
8 ABD
8 ABE
8 ACD
8 ADE
8 BCE
8 BDE
[...]


which I think is consistent with the rules.

Consistent, yes, and the fact that it sorts all combinations of n before
those of n+1 is significant.

However, the order of matches with equal score is undefined. So the
score by itself is insufficient to define the required sequence.

Anno
 
B

Brad Baxter

Brad Baxter said:
On Tue, 8 Jun 2004, David K. Wall wrote:
[snip code]

(BTW, I don't like the way my posted code (mis)handles the empty set,
but never mind)
I think I disagree with this. While it agrees with the OP's
stated specs, I'm not sure it agrees with the spirit of the specs.
:)
Heh.

Of course, my interpretation may simply be wrong.

Or mine. It's Tore's problem, let him worry about it. :)

Indeed. He has certainly gotten some food for thought.

But all else isn't equal. The "distance" between A and D is greater
than the "distance" between B and D. I'm not sure how to express
this clearly other than in code, but the way I understood Tore was
this: the combinations are grouped

first by the number of terms/elements in a combination,
then by the "range" of the combination,
then by the order of the original set.

-- but maybe I was reading too much into the choice of 'A'..'D' for
the example?

Now I'm not so sure you're wrong. I hadn't thought in terms of
"closeness", only in terms of adjacent or closest-to-first. I'll grant
that closeness is probably more meaningful.

That's a good point -- I certainly won't argue against it. What Would
Google Do? :)

I assumed that was the issue. Perhaps my thinking is colored toward
three-of-a-kind beats two-pair. :) That analogy is highly likely
dubious, and I, too, wonder what research indicates.

Randal mentions "edit distance" which seems to lean toward closeness.

My mind is open.

Regards,

Brad
 
B

Brad Baxter

I think 11. A D should come before 10. B D, because all else being equal,
A comes before B. In addition, when your code expands the terms 'A'..'E',
you get:

A B C D E
A B C D
B C D E
A B C E
A B D E
A C D E
A B C
B C D
C D E

While 'A B D E' has more terms, I think 'A B C', 'B C D', and 'C D E'
should outrank it, because they have more adjacent terms in a row.

Well, my regex solution agrees with you, and while the insides of the
regex are a bit ugly to look at, the algorithm is far simpler than it
seems.

#!/usr/bin/perl -l

my $rx;

{
use re 'eval';
my @kw = qw( A B C D E );
$rx = qr{
(?{ local ($s, $f) = (0, 1) })
^ \s*
@{[map qq{ (?:
\Q$_\E (?{ \$s += \$f <<= 1 }) |
(?{ \$f = 1 })
) \\s*
}, @kw ]}
$
(?{ $s })
}x;
}

while (<{A,}{B,}{C,}{D,}{E,}>) {
chomp;
print "$^R\t$_" if /$rx/ and $^R;
}

Run that code through '| sort -n', and you'll get:

2 A
2 B
2 C
2 D
2 E
4 AC
4 AD
4 AE
4 BD
4 BE
4 CE
6 AB
6 ACE
6 BC
6 CD
6 DE
8 ABD
8 ABE
8 ACD
8 ADE
8 BCE
8 BDE
12 ABDE
14 ABC
14 BCD
14 CDE
16 ABCE
16 ACDE
30 ABCD
30 BCDE
62 ABCDE

which I think is consistent with the rules.

I'm on that page, too, but I'm not sure the rules are well defined. I'm
also not sure the rules for an English phrase are applicable to, say, a
Russian phrase.

Curiouser and curiouser.

--Brad
 
B

Brad Baxter

Brad> I think 11. A D should come before 10. B D, because all else being equal,
Brad> A comes before B. In addition, when your code expands the terms 'A'..'E',
Brad> you get:

Brad> A B C D E
Brad> A B C D
Brad> B C D E
Brad> A B C E
Brad> A B D E
Brad> A C D E
Brad> A B C
Brad> B C D
Brad> C D E
Brad> ...

Brad> While 'A B D E' has more terms, I think 'A B C', 'B C D', and 'C D E'
Brad> should outrank it, because they have more adjacent terms in a row.

It appears as though you are sorting on "edit distance", which is a
well-defined term, and even has a module, String::Approx, to compute
it.

use strict;
use String::Approx 'adist';
my @strings = glob "{,A}{,B}{,C}{,D}{,E}"; # lazy. :)
shift @strings; # leave out empty string
my @dists = map { abs adist("ABCDE", $_) } @strings;
my @sorted = sort {
$dists[$a] <=> $dists[$b] or $strings[$a] cmp $strings[$b]
} 0..$#strings;
printf "%5s %d\n", $strings[$_], $dists[$_] for @sorted;

==>

ABCDE 0
ABCD 1
ABCE 1
ABDE 1
ACDE 1
BCDE 1
ABC 2
ABD 2
ABE 2
ACD 2
ACE 2
ADE 2
BCD 2
BCE 2
BDE 2
CDE 2
AB 3
AC 3
AD 3
AE 3
BC 3
BD 3
BE 3
CD 3
CE 3
DE 3
A 4
B 4
C 4
D 4
E 4

Nice.

Brad
 
T

Tore Aursand

Indeed. He has certainly gotten some food for thought.

No shit, Sherlock?! :) I thought I had the way I wanted the patterns all
laid out, but you've really made me unsure; all the suggestions and
comments I get in this group is really helpful.
 
A

Anno Siegel

Tore Aursand said:
It appears as though you are sorting on "edit distance", which is a
well-defined term, and even has a module, String::Approx, to compute
it.
[...]

Hmm. I've already had a look at String::Approx, but I don't think it will
help me solve this.

Remember that 'A B C D' really are four _words_;

my $query = 'A B C D'; # What the user wants to search for
my @words = split( /\s+/, $query );

Is there really no module which lets you do something like the following?

my @words = ( whatever );
foreach ( @Documents ) {
my $score = search( $_->text(), \@words );
}

Or something? I want it! :)

There's Text::Scan, advertising "Fast search for very large numbers of
keys in a body of text".

There must be stacks of literature about text searches, but if I had
to tackle the problem from scratch I'd first extract all matches and
their positions and take it from there.

my $rex = join '|', map "\\b$_\\b", @words;
my @matches;
push @matches, [ $1, $-[ 0]] while $text =~ /($rex)/g;

Now @matches contains all information about the distribution of
the words in the text. The rest is fun with combinatorics :)

Anno
 
A

Anno Siegel

Anno Siegel said:
[...]
Run that code through '| sort -n', and you'll get:

2 A
2 B
2 C
2 D
2 E
4 AC
4 AD
4 AE
4 BD
4 BE
4 CE
6 AB
6 ACE
6 BC
6 CD
6 DE
8 ABD
8 ABE
8 ACD
8 ADE
8 BCE
8 BDE
[...]


which I think is consistent with the rules.

Consistent, yes, and the fact that it sorts all combinations of n before
those of n+1 is significant.

....or would be, if it were a fact. But "ACE" and "AB" both have scores
of 6, so it isn't. (It is true for all combinations of "A" .. "D".)

I'm afraid, the proposed scoring system is a remarkably close match
to the intended sequence, but it doesn't produce it for larger n.

Anno
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,156
Messages
2,570,878
Members
47,408
Latest member
AlenaRay88

Latest Threads

Top