[regex] grep for chars in any order

V

viki

How can I build regex that matches all characters of the string $STR
in any order with .* added between any two characters: ?
And without generating all N! transpositions (where N is length of
$STR) ?
Example.
For $STR "abc", I want to match equivalent to:
/(a.*b.*c)|(a.*c.*b)|(b.*a.*c)|(b.*c.*a)|(c.*a.*b)|(c.*b.*a)/

Generating all transpositions is not feasible for larger legths of
$STR.
/[abc].*[abc].*[abc]/ is easy and fast but gives false positives.
What is good solution ?

Thanks
vkm
 
B

Ben Bullock

How can I build regex that matches all characters of the string $STR
in any order with .* added between any two characters: ?
And without generating all N! transpositions (where N is length of
$STR) ?
Example.
For $STR "abc", I want to match equivalent to:
/(a.*b.*c)|(a.*c.*b)|(b.*a.*c)|(b.*c.*a)|(c.*a.*b)|(c.*b.*a)/

Generating all transpositions is not feasible for larger legths of
$STR.
/[abc].*[abc].*[abc]/ is easy and fast but gives false positives.
What is good solution ?

I don't think a single regular expression can do that, because there is
some logic involved which doesn't fit the regular expression mentality -
you have to work out what the first character matched was, then change
the second character to match depending on that, and so on.

I would use the easy and fast method and then remove the false positives
by checking that the matched string contained all n characters, perhaps
using s or tr n times and dropping out of the loop if one of my
substitutions failed.

The following seems to work, although I haven't tested it extensively:

#!/usr/local/bin/perl
use warnings;
use strict;

sub matches
{
my ($s, $STR) = @_;
my %chars = map {$_ => 1} split ('', $STR);
my @chars = sort keys %chars;
my $anychar = join '', @chars;
my $matchany = join '.*',map "[$anychar]", @chars; # there's a better
way
if ($s =~ /$matchany/) {
my $copy = $s;
for my $c (@chars) {
return unless $copy =~ s/$c//g;
}
return 1;
}
return;
}

print "OK\n" if (matches('naninuneno','aeiou'));
print "OK\n" if (matches('naninunene','aeiou'));
 
B

Ben Bullock

print 'matched' if $STR =~ /a(?=.*b)(?=.*c)|b(?=.*a)(?=.*c)|c(?=.*a)
(?=.*b)/

Great! That is better than the solution I posted. But I have an
improvement:

/(?=.*a)(?=.*b)(?=.*c)/

without any actual matching string also works, reducing the regex length
from O(n^2) to O(n), where n is the number of characters.
So you don't have to create all the combinations but you do need all the
permutations (if I have my terminology correct)

You mean that you need all the combinations of initial characters, but
not all the permutations (which would be O(n!)).
 
N

nolo contendere

How can I build regex that matches all characters of the string $STR
in any order with  .* added between any two characters: ?
And without generating all N! transpositions (where N is length of
$STR) ?
Example.
For $STR "abc", I want to match equivalent to:
/(a.*b.*c)|(a.*c.*b)|(b.*a.*c)|(b.*c.*a)|(c.*a.*b)|(c.*b.*a)/

Generating all transpositions is not feasible for larger legths of
$STR.
/[abc].*[abc].*[abc]/ is easy and fast but gives false positives.
What is good solution ?

use a math module to get permutations:
http://search.cpan.org/~allenday/Math-Combinatorics-0.09/lib/Math/Combinatorics.pm

then from those, build your regexes.
 
P

Paul Lalli

How can I build regex that matches all characters of the string $STR
in any order with  .* added between any two characters: ?
And without generating all N! transpositions (where N is length of
$STR) ?
Example.
For $STR "abc", I want to match equivalent to:
/(a.*b.*c)|(a.*c.*b)|(b.*a.*c)|(b.*c.*a)|(c.*a.*b)|(c.*b.*a)/

Generating all transpositions is not feasible for larger legths of
$STR.
/[abc].*[abc].*[abc]/ is easy and fast but gives false positives.
What is good solution ?

#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw/all/;

$STR = "whatever";

if (all { $STR =~ /$_/ } qw/a b c/) {
print "Matched all of a, b, c\n";
}

__END__

Paul Lalli
 
J

Jürgen Exner

Maybe I am missing something, but isn't that the same as the text begins
and ends with a character from $str and all the other characters of $str
are included somewhere in the text?
It should be fairly easy to find an algorithm to check for that. You
just need to be careful about how to handle duplicate characters in $STR
and/or the text.

jue
 
N

nolo contendere

Maybe I am missing something, but isn't that the same as the text begins
and ends with a character from $str and all the other characters of $str
are included somewhere in the text?
It should be fairly easy to find an algorithm to check for that. You
just need to be careful about how to handle duplicate characters in $STR
and/or the text.

Those are both great points. Perhaps the OP could further refine the
requirements, or state the larger goal.
 
N

nolo contendere

At said:
 How can I build regex that matches all characters of the string $STR
 in any order with  .* added between any two characters: ?
 And without generating all N! transpositions (where N is length of
 $STR) ?
 Example.
 For $STR "abc", I want to match equivalent to:
 /(a.*b.*c)|(a.*c.*b)|(b.*a.*c)|(b.*c.*a)|(c.*a.*b)|(c.*b.*a)/
 Generating all transpositions is not feasible for larger legths of
 $STR.
 /[abc].*[abc].*[abc]/ is easy and fast but gives false positives.
 What is good solution ?

If you're not stuck on creating one regular expression:

    sub has_all_chars {
        my ($string, $chars) = @_;
        my $matched = 1;
        foreach my $char (split //, $chars) {
            if (index($string, $char) == -1) {
                $matched = 0;
                last;
            }
        }
        matched;

what is this? ^^^
    }
    has_all_chars("foobar","rb"); # ==> 1
    has_all_chars("foobar","abc"); # ==> 0

This is pretty much what Paul suggested--this amounts to about the
same thing as List::MoreUtils's all() function.

sub all (&@) {
my $f = shift;
return if ! @_;
for (@_) {
return 0 if ! $f->();
}
return 1;
}
 
S

smallpond

At said:
How can I build regex that matches all characters of the string $STR
in any order with .* added between any two characters: ?
And without generating all N! transpositions (where N is length of
$STR) ?
Example.
For $STR "abc", I want to match equivalent to:
/(a.*b.*c)|(a.*c.*b)|(b.*a.*c)|(b.*c.*a)|(c.*a.*b)|(c.*b.*a)/
Generating all transpositions is not feasible for larger legths of
$STR.
/[abc].*[abc].*[abc]/ is easy and fast but gives false positives.
What is good solution ?

If you're not stuck on creating one regular expression:

sub has_all_chars {
my ($string, $chars) = @_;
my $matched = 1;
foreach my $char (split //, $chars) {
if (index($string, $char) == -1) {
$matched = 0;
last;
}
}
matched;
}
has_all_chars("foobar","rb"); # ==> 1
has_all_chars("foobar","abc"); # ==> 0


This is a really clever solution. The only thing I would
do differently is to use chop instead of split. Why create
a list unless you need the list?

while (my $char = chop $chars) {

--S
 
M

Mario D'Alessio

viki said:
How can I build regex that matches all characters of the string $STR
in any order with .* added between any two characters: ?
And without generating all N! transpositions (where N is length of
$STR) ?
Example.
For $STR "abc", I want to match equivalent to:
/(a.*b.*c)|(a.*c.*b)|(b.*a.*c)|(b.*c.*a)|(c.*a.*b)|(c.*b.*a)/

Generating all transpositions is not feasible for larger legths of
$STR.
/[abc].*[abc].*[abc]/ is easy and fast but gives false positives.
What is good solution ?

Thanks
vkm

The way I see the solution, you can have any of the $STR characters,
followed by .*, followed by another of any of the $STR characters:

/[$STR].*[$STR]/

Or am I missing something?

Mario
 
M

Mario D'Alessio

Ignore my post. I realize my mistake. I missed the
part about the regex matching ALL of the characters.

Mario

Mario D'Alessio said:
viki said:
How can I build regex that matches all characters of the string $STR
in any order with .* added between any two characters: ?
And without generating all N! transpositions (where N is length of
$STR) ?
Example.
For $STR "abc", I want to match equivalent to:
/(a.*b.*c)|(a.*c.*b)|(b.*a.*c)|(b.*c.*a)|(c.*a.*b)|(c.*b.*a)/

Generating all transpositions is not feasible for larger legths of
$STR.
/[abc].*[abc].*[abc]/ is easy and fast but gives false positives.
What is good solution ?

Thanks
vkm

The way I see the solution, you can have any of the $STR characters,
followed by .*, followed by another of any of the $STR characters:

/[$STR].*[$STR]/

Or am I missing something?

Mario
 
B

Bart Lateur

Glenn said:
Nice. I wonder (without bothering to benchmark it) if anchoring the
expression would be an optimization:

/^(?=.*a)(?=.*b)(?=.*c)/

It would, in case it doesn't match. The latter will only try a match at
the start of the string, the former will try again at every character
position, which is is dead stupid, of course.

Be aware of the possibility of the string containing newlines.

/^(?=.*a)(?=.*b)(?=.*c)/s
 
J

jl_post

How can I build regex that matches all characters of the string $STR
in any order with .* added between any two characters: ?
And without generating all N! transpositions (where N is length of
$STR) ?
Example.
For $STR "abc", I want to match equivalent to:
/(a.*b.*c)|(a.*c.*b)|(b.*a.*c)|(b.*c.*a)|(c.*a.*b)|(c.*b.*a)/


Dear Viki,

If you don't mind using several regular expressions (one for each
letter), you can easily write:

/a/ and /b/ and /c/

You can even put it in a Perl grep() statement (which I presume is
what you intend to use it for) like this:

my @firstList = ('cab', 'back', 'cat', 'crab', 'dog', 'baby');
my @secondList = grep { /a/ and /b/ and /c/ } @firstList;

In this way, @secondList would contain 'cab', 'back', and 'crab',
but not 'baby' (which would have been a false positive in your
previous example).

Of course, this approach uses one regular expression for each
letter that you're looking for (instead of just one last regular
expression), but depending on how you're writing your code, that may
be acceptable.

I hope this helps, Viki.

-- Jean-Luc
 
J

John W. Krahn

viki said:
How can I build regex that matches all characters of the string $STR
in any order with .* added between any two characters: ?
And without generating all N! transpositions (where N is length of
$STR) ?
Example.
For $STR "abc", I want to match equivalent to:
/(a.*b.*c)|(a.*c.*b)|(b.*a.*c)|(b.*c.*a)|(c.*a.*b)|(c.*b.*a)/

Generating all transpositions is not feasible for larger legths of
$STR.
/[abc].*[abc].*[abc]/ is easy and fast but gives false positives.
What is good solution ?

I haven't tested this but this may do what you want:

( Assuming the data you are searching is in $data )

$data =~ s/[^\Q$STR\E]+//g;
print "matched!\n" if join( '', sort split //, $data ) eq join( '', sort
split //, $STR );



John
 
B

Ben Bullock

( Assuming the data you are searching is in $data )

$data =~ s/[^\Q$STR\E]+//g;
print "matched!\n" if join( '', sort split //, $data ) eq join( '', sort
split //, $STR );

This fails (gives a false negative) if $data = "abcabc" and $STR = "ab",
because the result of the first "join" is "aabb" and the second "join" is
"ab". You need to do some kind of unique sort.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,997
Messages
2,570,239
Members
46,827
Latest member
DMUK_Beginner

Latest Threads

Top