regex newbie


Greg Carlson

I've looked through a number of books and faq's and such and haven't been
able to solve my regex conundrum. I need to find the first match before
another match. For example, with the string 'abcdefgabcdefgfooabcdefg', I
need to match 'foo' and the 'a' previous to but nearest 'foo' (not the one
at the beginning of the string). Also, there's an unknown number of
characters between the 'a' and the 'foo'. Any help would be greatly
appreciated.

Greg Carlson
 

Dave Cardwell

Greg Carlson said:
I've looked through a number of books and faq's and such and haven't been
able to solve my regex conundrum. I need to find the first match before
another match. For example, with the string 'abcdefgabcdefgfooabcdefg', I
need to match 'foo' and the 'a' previous to but nearest 'foo' (not the one
at the beginning of the string). Also, there's an unknown number of
characters between the 'a' and the 'foo'. Any help would be greatly
appreciated.

Greg Carlson

Normally a regular expression tries to gobble up as much as it can; in this
case it will try to match the 'a' furthest away from 'foo'.

To get round this, you can do:
/a[^a]*foo/
which will match an 'a', any number of anything-but-a, then foo.
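
As a quick check, here is a minimal sketch that runs this pattern against the sample string from the question. The variable names are only illustrative.

```perl
use strict;
use warnings;

my $text = 'abcdefgabcdefgfooabcdefg';

# An 'a', then any run of characters that are not 'a', then 'foo'.
# The capture group returns the matched text in list context.
my ($match) = $text =~ /(a[^a]*foo)/;

print defined $match ? "$match\n" : "no match\n";   # prints "abcdefgfoo"
```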

Alternatively you can do:
/a.*?foo/
Here the ? makes the regexp 'not greedy'. That is, it will try to match
across the minimum amount of characters (hence the closest 'a' to 'foo').


Either would work, though I'd wager the second is the better coding
practice.


Regards,
 

Brian McCauley

Greg Carlson said:
Subject: regex newbie

Please put the subject of your post in the Subject of your post. If
in doubt, try this simple test. Imagine you had bothered to do a
search before you posted. Next, imagine you found a thread with your
subject line. Would you have been able to recognise it as the same
subject?

I've looked through a number of books and faq's and such and haven't been
able to solve my regex conundrum. I need to find the first match before
another match. For example, with the string 'abcdefgabcdefgfooabcdefg', I
need to match 'foo' and the 'a' previous to but nearest 'foo' (not the one
at the beginning of the string). Also, there's an unknown number of
characters between the 'a' and the 'foo'. Any help would be greatly
appreciated.

If 'a' really is a single character then see other response.

Otherwise I'd usually use...

/(.*)(a.*foo)/

Note this actually matches both everything before the desired target
and the desired target. Note also this finds the last 'a' before the
_last_ 'foo'.
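
To make the "last 'a' before the last 'foo'" behaviour concrete, here is a small sketch. The test string with numbered 'a's is made up for illustration and is not from the original question.

```perl
use strict;
use warnings;

# Hypothetical input: three 'a's (tagged 1..3) and two 'foo's.
my $text = 'a1-foo-a2-a3-foo';

# The greedy (.*) first swallows everything, then backs off one character
# at a time until (a.*foo) can match. That happens at the last 'a' that
# still has a 'foo' somewhere after it.
my ($before, $target) = $text =~ /(.*)(a.*foo)/
    or die "no match";

print "before: $before\n";   # prints "before: a1-foo-a2-"
print "target: $target\n";   # prints "target: a3-foo"
```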

--
\\ ( )
. _\\__[oo
.__/ \\ /\@
. l___\\
# ll l\\
###LL LL\\
 

Brian McCauley

/a[^a]*foo/
which will match an 'a', any number of anything-but-a, then foo.

That's the normal solution, assuming 'a' really is a single character.

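
If 'a' stands for a longer token, a negated character class no longer applies. One possible substitute, sketched here with the made-up tokens BEGIN and END, is a negative lookahead that refuses to step over another start token. This is an illustration of one technique, not something proposed in the thread.

```perl
use strict;
use warnings;

# Hypothetical multi-character tokens: 'BEGIN' plays the role of 'a',
# 'END' plays the role of 'foo'.
my $text = 'BEGIN x BEGIN y END z END';

# (?:(?!BEGIN).)*? consumes one character at a time, but only where no
# new 'BEGIN' starts, so the match cannot span a second start token.
my ($match) = $text =~ /(BEGIN(?:(?!BEGIN).)*?END)/s;

print defined $match ? "$match\n" : "no match\n";   # prints "BEGIN y END"
```
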
Alternatively you can do:
/a.*?foo/
Here the ? makes the regexp 'not greedy'. That is, it will try to match
across the minimum amount of characters (hence the closest 'a' to 'foo').

Bzzzt! Non-greedy does not trump first-match.
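
A short sketch of why: the regex engine tries the leftmost starting position first, and the non-greedy quantifier only shortens the match from that starting point. The variable names are illustrative.

```perl
use strict;
use warnings;

my $text = 'abcdefgabcdefgfooabcdefg';

# The engine starts at the first 'a' (offset 0). .*? then grows one
# character at a time until 'foo' fits, so the first 'a' wins.
$text =~ /(a.*?foo)/ or die "no match";
my ($matched, $start) = ($1, $-[0]);   # $-[0] is the offset where the match began

print "matched '$matched' starting at offset $start\n";
# prints "matched 'abcdefgabcdefgfoo' starting at offset 0"
```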

 

Greg Carlson

Brian McCauley said:
Please put the subject of your post in the Subject of your post....

Oops. I see your point.

If 'a' really is a single character then see other response.

Otherwise I'd usually use...

/(.*)(a.*foo)/

Note this actually matches both everything before the desired target
and the desired target. Note also this finds the last 'a' before the
_last_ 'foo'.

That makes sense. So how would I find the last 'a' before the _first_ 'foo'?
My latest attempt is:

$tmp = 'abcdefgabcdefgfooabcdefgfoo';
$tmp =~ m/(foo)/ogcs;
[do stuff with $1] # this part works as I'd hoped
$tmp = substr($tmp, 0, pos($tmp));
$tmp =~ m/.*(a).+?$/os;

But that still got the first 'a'. Also, $tmp can be rather large so the
substr is a bit distasteful. Is there any way to search backward from the
current pos or something similar? Thanks again.

Greg Carlson
 

Glenn Jackman

Greg Carlson said:
That makes sense. So how would I find the last 'a' before the _first_ 'foo'?
My latest attempt is:

$tmp = 'abcdefgabcdefgfooabcdefgfoo';

my ($stuff) = $tmp =~ /(a[^a]*foo)/;
 

Glenn Jackman

Greg Carlson said:
That makes sense. So how would I find the last 'a' before the _first_ 'foo'?
My latest attempt is:

$tmp = 'abcdefgabcdefgfooabcdefgfoo';

As Dave Cardwell posted earlier:

my ($stuff) = $tmp =~ /(a[^a]*foo)/;
 

Brian McCauley

I shall assume that since you are still pursuing this approach, in
your real problem 'a' is not a single character.

That makes sense. So how would I find the last 'a' before the _first_ 'foo'?
My latest attempt is:

$tmp = 'abcdefgabcdefgfooabcdefgfoo';
$tmp =~ m/(foo)/ogcs;

Don't put qualifiers on m// that you don't understand. /o and /s have
no effect in the above line, so if you understood them you wouldn't
have used them. :)

[do stuff with $1] # this part works as I'd hoped

Don't ever do stuff with $1 without first checking that the match
succeeded. If you are sure that the match will always succeed then
append "or die" to it. This serves a dual function. Firstly, it acts
as a comment to anyone who reads your program, meaning "I don't think
this match can ever fail". Secondly, if it turns out you were wrong,
Perl will tell you.

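
A minimal sketch of the "or die" habit described above. The second string and the use of eval to show the failure path are assumptions added for illustration.

```perl
use strict;
use warnings;

my $text = 'abcdefgabcdefgfooabcdefgfoo';

# Only touch $1 once the match is known to have succeeded.
$text =~ /(foo)/ or die "no 'foo' in input\n";
my ($found, $end_offset) = ($1, $+[0]);   # $+[0] is the offset just past the match
print "found '$found' ending at offset $end_offset\n";   # prints "found 'foo' ending at offset 17"

# Hypothetical input without 'foo': the die fires instead of silently
# leaving a stale $1 behind. eval is used here only to keep the demo running.
my $other = 'abcdefg';
my $ok = eval { $other =~ /(foo)/ or die "no 'foo' in input\n"; 1 };
my $error = $ok ? '' : $@;
print "second match failed: $error" unless $ok;   # prints "second match failed: no 'foo' in input"
```
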
$tmp = substr($tmp, 0, pos($tmp));
$tmp =~ m/.*(a).+?$/os;
But that still got the first 'a'. Also, $tmp can be rather large so the
substr is a bit distasteful. Is there any way to search backward from the
current pos or something similar?

Yes, this is what \G is for - it anchors a regex at the current
pos()ition.

$_ = 'abcdefgabcde-FIRST-fooabcdefg-SECOND-foo';

# I assume pos()==0 initially
# Set pos() to be the end of first 'foo'
/foo/gc or die "no foo";

# Extract everything from the last 'a' before the current position
# to the current position.
/.*(a.*)\G/ or die "no a before first foo";

print "$1\n";
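
For comparison, the same "search backward from the first foo" idea can be sketched without a regex, using the built-in index and rindex functions. This is a swapped-in alternative, not the \G technique above, and $start and $end are hypothetical names for the two tokens.

```perl
use strict;
use warnings;

my $text  = 'abcdefgabcde-FIRST-fooabcdefg-SECOND-foo';
my $start = 'a';     # hypothetical start token; may be longer than one character
my $end   = 'foo';   # hypothetical end token

# Offset of the first end token, or -1 if there is none.
my $end_at = index($text, $end);
die "no $end\n" if $end_at < 0;

# The start token has to finish at or before $end_at, so it can begin
# no later than this offset.
my $limit = $end_at - length($start);
die "no $start before first $end\n" if $limit < 0;

# rindex searches backward: last occurrence beginning at or before $limit.
my $start_at = rindex($text, $start, $limit);
die "no $start before first $end\n" if $start_at < 0;

my $span = substr($text, $start_at, $end_at + length($end) - $start_at);
print "$span\n";   # prints "abcde-FIRST-foo"
```

Because no copy of the leading part of the string is made, this also sidesteps the substr on a large $tmp that the original attempt needed.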

 

Brian McCauley

That well-known clown Brian McCauley said:
Don't put qualifiers on m// that you don't understand.

Advice he'd do well to follow himself :)
$_ = 'abcdefgabcde-FIRST-fooabcdefg-SECOND-foo';
/foo/gc or die "no foo";
/.*(a.*)\G/ or die "no a before first foo";
print "$1\n";

The /c above does nothing.

$_ = 'abcdefgabcde-FIRST-fooabcdefg-SECOND-foo';
/foo/g or die "no foo";
/.*(a.*)\G/ or die "no a before first foo";
print "$1\n";
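
For background, /c only changes what happens to pos() when a /g match fails, which is why it makes no difference in front of "or die". A small sketch, using a made-up string:

```perl
use strict;
use warnings;

my $text = 'xxfoo';   # hypothetical input

# A successful /g match in scalar context leaves pos() just past the match.
my $hit = $text =~ /foo/g;
my $after_hit = pos($text);          # 5

# A failed /g match without /c resets pos() to undef.
my $miss = $text =~ /bar/g;
my $after_miss = pos($text);         # undef

# With /c, a failed match leaves pos() where it was.
$hit = $text =~ /foo/g;              # matches again from the start
my $miss_c = $text =~ /bar/gc;
my $after_miss_c = pos($text);       # still 5

printf "after hit: %s\n",          $after_hit;
printf "after miss: %s\n",         defined $after_miss ? $after_miss : 'undef';
printf "after miss with /c: %s\n", $after_miss_c;
```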

 

Brian McCauley

Showing a worrying trend towards insanity, Brian McCauley said:
Advice he'd do well to follow himself :)

Yeah, and like don't remove them from other people's code without
thinking either dude!
/foo/g or die "no foo";
/.*(a.*)\G/ or die "no a before first foo";

I suspect in the OP's problem the real target can span newlines so the
OP's use of /s is necessary in the second match.

/.*(a.*)\G/s or die "no a before first foo";
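
A short sketch of what /s changes, assuming a target that spans newlines. The multi-line string is invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical input where newlines sit between the 'a' and the 'foo'.
my $text = "abc\ndef\nfoo";

# Without /s, '.' does not match a newline, so .* cannot reach 'foo'.
my ($without_s) = $text =~ /(a.*foo)/;

# With /s, '.' matches any character, newlines included.
my ($with_s) = $text =~ /(a.*foo)/s;

print "without /s: ", (defined $without_s ? "matched" : "no match"), "\n";   # prints "without /s: no match"
print "with /s: matched ", length($with_s), " characters\n";                 # prints "with /s: matched 11 characters"
```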

 
