matching '?' in a string ending with digits


ReMo...

#!/usr/bin/perl

use strict;
use warnings;

my @arr = ('third1000', 'third1000', 'third?1000', '1000third?', 'third{}1000');
for my $item (@arr) {
    my $targ = $item;
    print "$targ and $item ";
    print "do not " if ($item !~ /$targ/);
    print "match\n"
}

The output is:
third1000 and third1000 match
third1000 and third1000 match
third?1000 and third?1000 do not match << I don't understand this
1000third? and 1000third? match
third{}1000 and third{}1000 match

In the above, the nondigits represent arbitrary text that digits are
added to for a multi-array sort in a module I'm making, because there
may be otherwise-identical text items.

/\Q...\E/ seems to make it go away, but then two characters ('$' and '@')
would apparently need to be accounted for.

So my question is, what other characters will fail to match in a string
ending with digits? I assume there are more clues in perlre and perlops,
but I can't find them. I've got to be missing something really elementary
here.
 

Jens Thoms Toerring

ReMo... said:
So my question is, what other characters will fail to match in a string
ending with digits?

It's not about the digits in the end, it's about the presence
of characters that have special meanings in a regexp. Take

'third?1000'

As a regexp it says "match anything that contains the 4
chars 'thir', optionally followed by a 'd', and then by '1000'."
That's obviously something that doesn't describe the string
itself, which contains a question mark.

And the list of strings that you will get problems with can
easily be extended. Take, for example

'third()1000'
'thir\d1000'
'thir{2,}d1000'
'third*1000'

And it gets worse: try

'th(ird1000'

which will end in a complaint about an unmatched '(' in a
regular expression.
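
A minimal sketch of how such a compile failure could be caught at run time, assuming the pattern text arrives in a variable. The helper name compiles_as_regex is hypothetical and not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $text compiles as a pattern, 0 otherwise.
sub compiles_as_regex {
    my ($text) = @_;
    # qr// dies on an invalid pattern; eval traps that and yields undef.
    my $re = eval { qr/$text/ };
    return defined $re ? 1 : 0;
}

for my $text ('third1000', 'th(ird1000', 'third*1000') {
    printf "%-12s %s\n", $text,
        compiles_as_regex($text) ? 'compiles' : 'does not compile';
}
```

Note that a string which compiles is not necessarily a pattern that matches itself; 'third*1000' compiles but means something other than its literal text.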

Using '/\Q$targ\E/' will help, since everything enclosed
between '\Q' and '\E' loses its special meaning. I don't see
what problems you foresee with '$' and '@', but then I don't
understand the explanation of what you're planning to do with
all this.
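
A short sketch of the \Q...\E approach applied to the loop from the question, with one extra string added that would otherwise fail to compile. The helper name matches_literally is an assumption made for illustration.

```perl
use strict;
use warnings;

my @arr = ('third1000', 'third?1000', '1000third?', 'third{}1000', 'th(ird1000');

# Hypothetical helper: true if $item contains the text of $targ literally.
sub matches_literally {
    my ($item, $targ) = @_;
    # \Q...\E backslash-escapes every metacharacter in the interpolated value.
    return $item =~ /\Q$targ\E/ ? 1 : 0;
}

for my $item (@arr) {
    my $targ = $item;
    print "$targ and $item ",
        (matches_literally($item, $targ) ? '' : 'do not '), "match\n";
}
```

With the escaping in place every string should match itself, including the ones containing '?' and '('.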

Regards, Jens
 

Peter Makholm

ReMo... said:
So my question is, what other characters will fail to match in a string
ending with digits? I assume there are more clues in perlre and perlops,
but I can't find them. I've got to be missing something really elementary
here.

Digits are not your problem. The problem is characters that have special
meaning in regexps, and this includes '?'. The regexp 'third?1000' matches
either 'third1000' or 'thir1000', neither of which is a substring of
'third?1000'.

The regexp '1000third?' matches '1000third' and '1000thir', both of
which are substrings of '1000third?', and so you get a match.
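
To see which part of each string the pattern actually picks up, a small sketch along these lines may help. The helper matched_part is hypothetical; it simply wraps the pattern in a capture group.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text that $pattern matched inside
# $string, or undef if there was no match.
sub matched_part {
    my ($string, $pattern) = @_;
    return $string =~ /($pattern)/ ? $1 : undef;
}

for my $text ('third?1000', '1000third?') {
    my $hit = matched_part($text, $text);
    print "$text used as a pattern against itself: ",
        (defined $hit ? "matched '$hit'" : 'no match'), "\n";
}
```

In the second case the match succeeds without ever consuming the trailing '?' of the string, which is why the original test reported a match there.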

//Makholm
 

sln

print "do not " if ($item !~ /$targ/);
third?1000 and third?1000 do not match << I don't understand this

You're using the wrong operator.
Change the conditional to
if ($item ne $targ)

Otherwise it's just dumb to do something you know nothing about.
The !~ is a regular expression operator. Read any document on regular
expressions before actually trying them.

-sln
 

sln

Using '/\Q$targ\E/' will help, since everything enclosed
between '\Q' and '\E' loses its special meaning. I don't see
what problems you foresee with '$' and '@', but then I don't
understand the explanation of what you're planning to do with
all this.

This is awesome, but shouldn't you recommend that he
first look up regular expressions on Wikipedia to find
out what they are?

-sln
 

Jürgen Exner

ReMo... said:
my $targ = $item;
print "$targ and $item ";
print "do not " if ($item !~ /$targ/);
print "match\n" [...]
So my question is, what other characters will fail to match in a string

Almost all that are special in REs. The only(?) exception being '.',
which of course will still match itself, too.

ending with digits?

This part of the question is a red herring.

I assume there are more clues in perlre and perlops,
but I can't find them. I've got to be missing something really elementary
here.

The most elementary is: don't use REs unless you need RE behaviour. If
you simply want to check if one string is part of another string then
just use a plain "index()".
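
A minimal sketch of the index() approach, assuming the goal is only to know whether one string occurs inside another. The helper name contains is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $needle occurs anywhere in $haystack.
# index() returns -1 when the substring is absent, and it treats every
# character literally, so no escaping is needed.
sub contains {
    my ($haystack, $needle) = @_;
    return index($haystack, $needle) >= 0 ? 1 : 0;
}

for my $item ('third?1000', 'th(ird1000', 'third*1000') {
    my $targ = $item;
    print "$targ and $item ", (contains($item, $targ) ? '' : 'do not '), "match\n";
}
```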

jue
 

ReMo...

Using '/\Q$targ\E/' will help, since everything enclosed
between '\Q' and '\E' loses its special meaning. I don't see
what problems you foresee with '$' and '@', but then I don't
understand the explanation of what you're planning to do with
all this.

I knew it had to be that simple, but since '{' worked I assumed
without much reflection that a match should then work like a
comparison. Thank you!

For reference, a test for the sub that the above represents is
something like:

$outee = modelsort (
['3third','2second','1first','1first'],
['alpha','beta','gamma','delta'],
['apple','banana','cherry','donut']
);
$complex1out = [
['1first','1first','2second','3third'],
['gamma','delta','beta','alpha'],
['cherry','donut','banana','apple']
];
is_deeply ($outee, $complex1out, "modelsort: duplicate sort items");
....

perlre goes on to say: 'You cannot include a literal "$" or "@"
within a "\Q" sequence...' But of course a variable isn't a literal,
which explains why I couldn't make a match on those characters
fail. So that should work just dandy for what I'm doing.
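
A small sketch of that last point, assuming the '$' and '@' characters live in the data rather than in the pattern source. The sample strings and the helper name literal_match are made up for illustration.

```perl
use strict;
use warnings;

# Single quotes keep '$' and '@' literal in the data itself.
my @items = ('cost$100', 'user@host1000', 'a$b@c?1000');

# Hypothetical helper: true if $item contains the text of $targ literally.
sub literal_match {
    my ($item, $targ) = @_;
    # The value of $targ is escaped by \Q, not re-scanned for variables.
    return $item =~ /\Q$targ\E/ ? 1 : 0;
}

for my $item (@items) {
    my $targ = $item;
    print "$targ and $item ",
        (literal_match($item, $targ) ? '' : 'do not '), "match\n";
}
```

The restriction quoted from perlre concerns '$' and '@' typed directly between \Q and \E in the source, where Perl would try to interpolate them.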
 

ReMo...

Thank you. I really and truly thought (without thinking) that
since '{' matched, the other quote operators would, also, and that
a solution was elsewhere than somehow accounting for RE metacharacters.
 

ReMo...

The most elementary is: don't use REs unless you need RE behaviour. If
you simply want to check if one string is part of another string then
just use a plain "index()".

jue

Exactly the relevant principle here, because the only reason I'd like
to stick with an RE is that I have the vaguest of ideas that in
the future I might want to use a different sort strategy. But in
fact, I don't actually need to use an RE there right now.

Thanks!
 

C.DeRykus

So my question is, what other characters will fail to match in a string
ending with digits?  I assume there are more clues in perlre and perlops,
but I can't find them.  I've got to be missing something really elementary
here.


See perldoc perlretut for a quick intro about metacharacters.
Various metacharacters will cause the regex to fail as mentioned.

One problem is that there must be a literal '?' in
the regex in order to match the '?' in the string
being matched. Since '?' is a regex metacharacter
with special meaning to the regex compilation and
not a literal '?', the match would fail.

The 're' pragma can be helpful in seeing what
happens:

perl -Mre=debug -wle "print 'not' if 'third?1000' !~ /third?1000/"

Compiling REx "third?1000"
Final program:
1: EXACT <thir> (3)
3: CURLY {0,1} (7)
5: EXACT <d> (0)
7: EXACT <1000> (9)
9: END (0)
anchored "thir" at 0 floating "1000" at 4..5 (checking floating) minlen 8
Guessing start of match in sv for REx "third?1000" against "third?1000"
Found floating substr "1000" at offset 6...
Contradicts anchored substr "thir", giving up...
Match rejected by optimizer

As it turns out though, the debug output looks to me as
if the match is rejected before the regex engine proper
even runs: the optimizer works out that "1000" must begin
4 or 5 characters after the start of "thir" in any matching
text, which cannot be reconciled with its position at
offset 6 in the string being matched.
 

ccc31807

In the above, the nondigits represent arbitrary text that digits are
added to for a multi-array sort in a module I'm making, because there
may be otherwise-identical text items.

There's a difference between testing for a match, and testing for
equality. If you want to test for equality, use 'eq' for strings and
'==' for numerical values. REs are great, but they aren't the
universal tool to solve every problem.

If you want to test for the equality of two strings, do that -- don't
try to match them. If you want to test whether a string contains the
exact copy of a substring, use the appropriate functions, like
index().
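
The difference between equality and matching can be sketched as follows. The helper name same_text and the sample strings are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: plain string equality, no character is special.
sub same_text {
    my ($left, $right) = @_;
    return $left eq $right ? 1 : 0;
}

# 'eq' compares the strings character for character.
print "equal\n" if same_text('third?1000', 'third?1000');

# An unanchored match also accepts longer strings that merely contain
# the pattern, so a match is weaker than equality.
print "match, but not equal\n"
    if 'xthird1000x' =~ /third1000/ and !same_text('xthird1000x', 'third1000');

# Numeric comparison uses '==' and compares values, not spelling.
print "numerically equal\n" if '1000' == 1000.0;
```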

Obviously, what you do in your script depends on what you want done in
your logic. One thing you might want to consider is substituting
something for every non-word character, or everything that doesn't
match [0-9a-zA-Z]. In my job, I have a problem with extraneous
apostrophes (using CSV, which uses REs indirectly) and have learned to
replace the apostrophes like this:

while (<INPUT>)
{
    next unless /\w/;
    chomp;
    s/'/\\'/g;    # prefix each apostrophe with a backslash (assumed intent)
    # continue processing
}

CC
 

ReMo...

See perldoc perlretut for a quick intro about metacharacters.
Various metacharacters will cause the regex to fail as mentioned.

I wish I'd done that before.

Using debugging would have definitely pointed me in the right
direction.

It starts giving an exception one character previous to the
metacharacter... I think it's checking "d". Then 4..5 may refer
to the boundary between "d" and "?".
 

ReMo...

Obviously, what you do in your script depends on what you want done in
your logic. One thing you might want to consider is substituting
something for every non-word character, or everything that doesn't
match [0-9a-zA-Z]. In my job, I have a problem with extraneous
apostrophes (using CSV, which uses REs indirectly) and have learned to
replace the apostrophes like this:

while (<INPUT>)
{
    next unless /\w/;
    chomp;
    s/'/\\'/g;    # prefix each apostrophe with a backslash (assumed intent)
    # continue processing
}

CC

Something like that was my backup plan, tho I would have used
3-character strings.
 

Jim Gibson

ReMo... said:
Using debugging would have definitely pointed me in the right
direction.

It starts giving an exception one character previous to the
metacharacter... I think it's checking "d". Then 4..5 may refer
to the boundary between "d" and "?".

The '?' modifies the character preceding it and means "zero or one", so
yes, it is looking at the 'd'. The regular expression 'third?1000' can
match either 'third1000' or 'thir1000' but does not match the string
'third?1000' because the '?' in the string is not matched by anything
in the regular expression (I hope that is clear).
 

sln

Using debugging would have definitely pointed me in the right
direction.

It starts giving an exception one character previous to the
metacharacter... I think it's checking "d". Then 4..5 may refer
to the boundary between "d" and "?".

There are ways to physically visualize regular expressions such
that it is a lot easier to read. The easier it is to read, the better.

Think of a regular expression as a 2-dimensional object,
with literals being one dimension X and metacharacters being the other
dimension Y.

--------
This third?1000
is really this:

third 1000
     ?

where a space is left as a placeholder where ? goes.

Quantifier metachars like +*? affect the thing to their immediate left.
Here, ? affects only the character 'd'; it says match 'd'
once or not at all.
So, third1000 or thir1000 will match.

-----
This third\?1000
is really this:

third?1000

where ? is now literally a ? not a metacharacter.
It will only match third?1000.

------
This thi(rd)?1000
is really this:

thi rd  1000
   (  )?

where the quantifier ? has the same meaning but affects
the group of characters enclosed by the parentheses ( ).
In this case the parentheses are grouping metachars.

------
After you get the hang of it, you can structure a
regular expression into a pseudo dimensioned object so
that the quantifiers and other metachars are distinguishable
from the literal text.

/           # regex delimiter

^           # metachar, beginning of string
(           # metachar, start of grouping 1
  third     # literal text 'third'
){2}        # metachar, end of grouping 1, {range}, match group 1 exactly 2 times

(           # metachar, start of grouping 2
  1000      # literal text '1000'
){3}        # metachar, end of grouping 2, match group 2 exactly 3 times
$           # metachar, end of string

/x          # regex delimiter and x modifier (ignore literal whitespace in expression)

This will only match 'thirdthird100010001000'.
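
As a rough check of the structured pattern above, here is a sketch that compiles an equivalent /x pattern with qr// and tries it on a few strings. The variable name $re and the helper matches_structured are hypothetical.

```perl
use strict;
use warnings;

# Same structure as above, stored as a compiled pattern.
my $re = qr/
    ^           # beginning of string
    (third){2}  # group 1: literal text third, exactly 2 times
    (1000){3}   # group 2: literal text 1000, exactly 3 times
    $           # end of string
/x;

# Hypothetical helper: true if $text matches the structured pattern.
sub matches_structured {
    my ($text) = @_;
    return $text =~ $re ? 1 : 0;
}

for my $text ('thirdthird100010001000', 'third1000', 'thirdthird10001000') {
    print "$text: ", (matches_structured($text) ? 'match' : 'no match'), "\n";
}
```

One caveat worth knowing: without further modifiers, $ also allows a single trailing newline, so the same text followed by "\n" would match as well.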

When you look at it this way, regular expressions are not confusing at all.
Some herky jerky's jam it all together in a single line to feel superior, just
ignore them.

The first thing you do when trying to decipher one is to convert it into a structure
like the one above. When it's broken down like this, it's easier.

good luck.
-sln
 
