Converting a string to multiple search patterns

Tore Aursand · Jun 8, 2004

I think 11. A D should come before 10. B D, because all else being
equal, A comes before B.

Hmm. This is meant for a search engine, and personally I think it makes
more sense to score based on how close the words are to each other. In
addition, words "to the left" should score higher than words on the "right
side".

Maybe I'm totally wrong, but I think most search engines out there works
like this; they always try to match "early" words before "later" words.

Or am I wrong?

Anno Siegel · Jun 8, 2004

Well, my regex solution agrees with you, and while the insides of the
regex are a bit ugly to look at, the algorithm is far simpler than it
seems.

#!/usr/bin/perl -l

my $rx;

{
use re 'eval';
my @kw = qw( A B C D E );
$rx = qr{
(?{ local ($s, $f) = (0, 1) })
^ \s*
@{[map qq{ (?:
\Q$_\E (?{ \$s += \$f <<= 1 }) |
(?{ \$f = 1 })
) \\s*
}, @kw ]}
$
(?{ $s })
}x;
}

while (<{A,}{B,}{C,}{D,}{E,}>) {
chomp;
print "$^R\t$_" if /$rx/ and $^R;
}

Run that code through '| sort -n', and you'll get:

2 A
2 B
2 C
2 D
2 E
4 AC
4 AD
4 AE
4 BD
4 BE
4 CE
6 AB
6 ACE
6 BC
6 CD
6 DE
8 ABD
8 ABE
8 ACD
8 ADE
8 BCE
8 BDE
[...]

which I think is consistent with the rules.

Consistent, yes, and the fact that it sorts all combinations of n before
those of n+1 is significant.

However, the order of matches with equal score is undefined. So the
score by itself is insufficient to define the required sequence.

Anno

Brad Baxter · Jun 9, 2004

Brad Baxter said:
Brad Baxter said:

On Tue, 8 Jun 2004, David K. Wall wrote:

Click to expand...

[snip code]

(BTW, I don't like the way my posted code (mis)handles the empty set,
but never mind)

I think I disagree with this. While it agrees with the OP's
stated specs, I'm not sure it agrees with the spirit of the specs.

Heh.

Of course, my interpretation may simply be wrong.

Click to expand...

Or mine. It's Tore's problem, let him worry about it.

Indeed. He has certainly gotten some food for thought.

But all else isn't equal. The "distance" between A and D is greater
than the "distance" between B and D. I'm not sure how to express
this clearly other than in code, but the way I understood Tore was
this: the combinations are grouped

first by the number of terms/elements in a combination,
then by the "range" of the combination,
then by the order of the original set.

-- but maybe I was reading too much into the choice of 'A'..'D' for
the example?

Now I'm not so sure you're wrong. I hadn't thought in terms of
"closeness", only in terms of adjacent or closest-to-first. I'll grant
that closeness is probably more meaningful.

That's a good point -- I certainly won't argue against it. What Would
Google Do?

I assumed that was the issue. Perhaps my thinking is colored toward
three-of-a-kind beats two-pair.

That analogy is highly likely
dubious, and I, too, wonder what research indicates.

Randal mentions "edit distance" which seems to lean toward closeness.

My mind is open.

Regards,

Brad

Brad Baxter · Jun 9, 2004

I think 11. A D should come before 10. B D, because all else being equal,
A comes before B. In addition, when your code expands the terms 'A'..'E',
you get:

A B C D E
A B C D
B C D E
A B C E
A B D E
A C D E
A B C
B C D
C D E

While 'A B D E' has more terms, I think 'A B C', 'B C D', and 'C D E'
should outrank it, because they have more adjacent terms in a row.

Click to expand...

Well, my regex solution agrees with you, and while the insides of the
regex are a bit ugly to look at, the algorithm is far simpler than it
seems.

#!/usr/bin/perl -l

my $rx;

{
use re 'eval';
my @kw = qw( A B C D E );
$rx = qr{
(?{ local ($s, $f) = (0, 1) })
^ \s*
@{[map qq{ (?:
\Q$_\E (?{ \$s += \$f <<= 1 }) |
(?{ \$f = 1 })
) \\s*
}, @kw ]}
$
(?{ $s })
}x;
}

while (<{A,}{B,}{C,}{D,}{E,}>) {
chomp;
print "$^R\t$_" if /$rx/ and $^R;
}

Run that code through '| sort -n', and you'll get:

2 A
2 B
2 C
2 D
2 E
4 AC
4 AD
4 AE
4 BD
4 BE
4 CE
6 AB
6 ACE
6 BC
6 CD
6 DE
8 ABD
8 ABE
8 ACD
8 ADE
8 BCE
8 BDE
12 ABDE
14 ABC
14 BCD
14 CDE
16 ABCE
16 ACDE
30 ABCD
30 BCDE
62 ABCDE

which I think is consistent with the rules.

I'm on that page, too, but I'm not sure the rules are well defined. I'm
also not sure the rules for an English phrase are applicable to, say, a
Russian phrase.

Curiouser and curiouser.

--Brad

Brad Baxter · Jun 9, 2004

Brad> I think 11. A D should come before 10. B D, because all else being equal,
Brad> A comes before B. In addition, when your code expands the terms 'A'..'E',
Brad> you get:

Brad> A B C D E
Brad> A B C D
Brad> B C D E
Brad> A B C E
Brad> A B D E
Brad> A C D E
Brad> A B C
Brad> B C D
Brad> C D E
Brad> ...

Brad> While 'A B D E' has more terms, I think 'A B C', 'B C D', and 'C D E'
Brad> should outrank it, because they have more adjacent terms in a row.

It appears as though you are sorting on "edit distance", which is a
well-defined term, and even has a module, String::Approx, to compute
it.

use strict;
use String::Approx 'adist';
my @strings = glob "{,A}{,B}{,C}{,D}{,E}"; # lazy.
shift @strings; # leave out empty string
my @dists = map { abs adist("ABCDE", $_) } @strings;
my @sorted = sort {
$dists[$a] <=> $dists[$b] or $strings[$a] cmp $strings[$b]
} 0..$#strings;
printf "%5s %d\n", $strings[$_], $dists[$_] for @sorted;

==>

ABCDE 0
ABCD 1
ABCE 1
ABDE 1
ACDE 1
BCDE 1
ABC 2
ABD 2
ABE 2
ACD 2
ACE 2
ADE 2
BCD 2
BCE 2
BDE 2
CDE 2
AB 3
AC 3
AD 3
AE 3
BC 3
BD 3
BE 3
CD 3
CE 3
DE 3
A 4
B 4
C 4
D 4
E 4

Nice.

Brad

Tore Aursand · Jun 9, 2004

Indeed. He has certainly gotten some food for thought.

No shit, Sherlock?!

I thought I had the way I wanted the patterns all
laid out, but you've really made me unsure; all the suggestions and
comments I get in this group is really helpful.

Anno Siegel · Jun 9, 2004

Tore Aursand said:
It appears as though you are sorting on "edit distance", which is a
well-defined term, and even has a module, String::Approx, to compute
it.
[...]

Click to expand...

Hmm. I've already had a look at String::Approx, but I don't think it will
help me solve this.

Remember that 'A B C D' really are four _words_;

my $query = 'A B C D'; # What the user wants to search for
my @words = split( /\s+/, $query );

Is there really no module which lets you do something like the following?

my @words = ( whatever );
foreach ( @Documents ) {
my $score = search( $_->text(), \@words );
}

Or something? I want it!

There's Text::Scan, advertising "Fast search for very large numbers of
keys in a body of text".

There must be stacks of literature about text searches, but if I had
to tackle the problem from scratch I'd first extract all matches and
their positions and take it from there.

my $rex = join '|', map "\\b$_\\b", @words;
my @matches;
push @matches, [ $1, $-[ 0]] while $text =~ /($rex)/g;

Now @matches contains all information about the distribution of
the words in the text. The rest is fun with combinatorics

Anno

Anno Siegel · Jun 9, 2004

Anno Siegel said:
[...]

Run that code through '| sort -n', and you'll get:

2 A
2 B
2 C
2 D
2 E
4 AC
4 AD
4 AE
4 BD
4 BE
4 CE
6 AB
6 ACE
6 BC
6 CD
6 DE
8 ABD
8 ABE
8 ACD
8 ADE
8 BCE
8 BDE
[...]

which I think is consistent with the rules.

Click to expand...

Consistent, yes, and the fact that it sorts all combinations of n before
those of n+1 is significant.

....or would be, if it were a fact. But "ACE" and "AB" both have scores
of 6, so it isn't. (It is true for all combinations of "A" .. "D".)

I'm afraid, the proposed scoring system is a remarkably close match
to the intended sequence, but it doesn't produce it for larger n.

Anno

Reverse search for a website	2	Apr 24, 2024
Search Results with Pagination	1	Oct 25, 2024
Converting an Array to a String in JavaScript	7	Sep 22, 2023
How to read a file as binary or hex "string" so that I can do regex search?	3	Dec 19, 2024
Joinging the list to string	1	Dec 2, 2024
Creating a regex to get multiple values and print	0	Jan 10, 2021
How to specify a search query when I want to create an application in C++ with the option to open files?	0	Nov 3, 2023
Python3 string slicing	2	Dec 11, 2023

Converting a string to multiple search patterns

Tore Aursand

Anno Siegel

Brad Baxter

Brad Baxter

Brad Baxter

Tore Aursand

Anno Siegel

Anno Siegel

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads