"negative" regex matching?

seven.reeds

Hi,

I have a regex question. I have arbitrary text and I want to search
it for a set of terms/substrings. In the simple case of one term
it is easy to find the match(es) and then mark them up with HTML
"span" tags. My issue is with more than one term.

Here is an example to illustrate. If I have the string:

Sarah likes Johnny's cooking

and the single term: "john" then I can match and highlight the match
resulting in:

Sarah likes <span>John</span>ny's cooking

Now what if I have two terms: "Johnny" & "john" -- in that order? I
can easily let myself end up with (in sequence):

<apply Johnny match>
Sarah likes <span>Johnny</span>'s cooking
<apply john match>
Sarah likes <span><span>John</span>ny</span>'s cooking

Ok, so what I want is to be able to search for and mark each term in
the string as long as that term is not already in a "span" clause.

I've done some digging in Friedl's RegEx book but I'm not sure if I
know enough to know what I am looking for?

ideas?
 

sln

Is this what you are trying to do?

rxhtml.pl
-sln

----------------
use strict;
use warnings;

## globs ..

my $string = "
<apply Johnny match>
Sarah likes Johnny's cooking
<apply john match>
Sarah likes Johnny's cooking
";

## code ..

# use terms: Johnny,john
if ( getMatch( $string,'span','Johnny|john')) # modifiers such as (?i) go inside the terms
{ print "Matched:\n'$string'\n\n" }
else
{ print "No match.\n\n" }

# use terms: King,john .. case insensitive
if ( getMatch( $string,'span','(?i)King|john'))
{ print "Matched:\n'$string'\n\n" }
else
{ print "No match.\n\n" }

exit(0);

## subs ..

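sub getMatch {
    # $_[0] aliases the caller's string, so it is modified in place.
    my ($tag,$terms) = @_[1,2];
    # The match must not start right after <$tag>, and no <$tag> or
    # </$tag> may follow the term on the same line.
    # The sub returns the substitution count (false if nothing matched).
    $_[0] =~ s {(?<!<$tag>)(.*)($terms)(?!.*</?$tag>)}
               {$1<$tag>$2</$tag>}g;
}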
__END__

Matched:
'
<apply <span>Johnny</span> match>
Sarah likes <span>Johnny</span>'s cooking
<apply <span>john</span> match>
Sarah likes <span>Johnny</span>'s cooking
'

Matched:
'
<apply <span>Johnny</span> match>
Sarah likes <span>Johnny</span>'s coo<span>king</span>
<apply <span>john</span> match>
Sarah likes <span>Johnny</span>'s coo<span>king</span>
'
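
A side note on the term list: Perl tries the branches of an alternation left to right, so when one term is a prefix of another (and especially with (?i)), the longer one should come first. Here is a minimal sketch of building the alternation from a list, assuming the terms are plain literals rather than regexes. The helper name build_terms is made up for illustration and is not part of rxhtml.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of literal terms into one alternation.
sub build_terms {
    my @terms = @_;
    # Longest first, so "Johnny" is tried before its prefix "john".
    my @sorted = sort { length($b) <=> length($a) } @terms;
    # quotemeta escapes any regex metacharacters in the literal terms.
    return join '|', map { quotemeta } @sorted;
}

my $terms = build_terms('john', 'Johnny');
my $text  = "Sarah likes Johnny's cooking";

# One pass over the text: each stretch is matched at most once,
# so the spans added by this substitution cannot nest.
(my $marked = $text) =~ s{($terms)}{<span>$1</span>}gi;

print "$terms\n";   # Johnny|john
print "$marked\n";  # Sarah likes <span>Johnny</span>'s cooking
```

This only prevents nesting within a single pass. If the text already contains spans, the pattern still has to skip them somehow.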
 

sln

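Earlier I posted a regex built on plain look-ahead/look-behind assertions.
That won't work in general, because look-behind has to be fixed width, so it
cannot check for an opening tag an arbitrary distance before the term.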
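
To make the limitation concrete, here is a small sketch (the input string is made up for illustration). A fixed-width look-behind only inspects the few characters directly in front of the match, so a term that sits deeper inside an existing span slips through:

```perl
use strict;
use warnings;

# Made-up input: the first "John" is inside a span, but not right after "<span>".
my $text = '<span>King John</span> met John';

# (?<!<span>) only checks the six characters immediately before the match.
(my $marked = $text) =~ s{(?<!<span>)(John)}{<span>$1</span>}g;

print "$marked\n";
# <span>King <span>John</span></span> met <span>John</span>
```

The first "John" gets wrapped even though it is already inside a span, which is exactly the nesting the original question wants to avoid.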

So this, friend, is a bulletproof way to do what you want.
Finally, a use for the new 5.10 regex recursion, which allows
for nested tags.

I've thoroughly tested this code, within the constraints of parsing
markup with a regex (i.e. the markup must be valid). That is the
compromise you are making for speed.

The regex will go along happily matching tags (in a nested fashion),
or, the terms you specify.
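
As a stripped-down sketch of the nested-tag part only, assuming well-formed markup and with the tag name hard-coded to span (the sample string is made up), recursion into group 1 matches a balanced block like this:

```perl
use strict;
use warnings;
use 5.010;   # (?1) recursion and possessive quantifiers need Perl 5.10+

my $nested = qr{
    (                                # group 1: one balanced span block
        <span\b[^>]*>                # opening tag, with or without attributes
        (?:
            (?: (?! </?span\b ) . )++    # text that starts no span tag
          |
            (?1)                         # or a nested block: recurse into group 1
        )*
        </span>                      # matching closing tag
    )
}xs;

my $text  = 'a <span>x <span>y</span> z</span> b';
my $block = $text =~ $nested ? $1 : '';

print "$block\n";   # <span>x <span>y</span> z</span>
```

The outer block is matched as one unit, inner span included, so nothing inside it is ever offered to the term branch.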

If any terms are inside of the tags (even nested ones), they are consumed
without any substitution (i.e. they are left alone). The only things
left to substitute are the terms outside of any tag.

The regex is an alternation: it matches either a block of (possibly nested)
tags or one of the terms. The tag blocks never have to be substituted back
for themselves (i.e. via their capture group), because the new '\K' drops
everything matched before it from the text that gets replaced.
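
The same skip-or-capture idea in its simplest form, as a sketch that assumes spans are not nested (so no recursion is needed) and uses a made-up sample string:

```perl
use strict;
use warnings;
use 5.010;   # \K needs Perl 5.10+

my $text  = 'Johnny and <span>Johnny</span> and john';
my $count = 0;

$text =~ s{
    <span\b[^>]*> .*? </span> \K    # branch 1: consume a whole span, then
                                    # \K drops it from the text to replace
  |
    (Johnny|john)                   # branch 2: a term outside any span
}{
    defined $1 ? do { $count++; "<span>$1</span>" } : ''
}xsge;

print "$count\n";   # 2
print "$text\n";    # <span>Johnny</span> and <span>Johnny</span> and <span>john</span>
```

When branch 1 matches, what remains after \K is an empty string, so replacing it with '' leaves the existing span untouched.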

Read about the new extended patterns in 'perlre' (perldoc perlre).

Besides plain tags, the tag-with-attributes form is handled as well:
<$tag></$tag> or <$tag attrib></$tag>.

Good luck!
-sln

-------------------
Output:
String =
'
<apply john Johnny match>
Sarah likes Johnny's cooking
<apply john match>
Sarah likes Johnny's cooking
<span id="medium_rectangle" class="_fwph">
Because Johnny does good cooking
</span>
King John
'

Terms =

Johnny|john - replaced 5
'
<apply <span>john</span> <span>Johnny</span> match>
Sarah likes <span>Johnny</span>'s cooking
<apply <span>john</span> match>
Sarah likes <span>Johnny</span>'s cooking
<span id="medium_rectangle" class="_fwph">
Because Johnny does good cooking
</span>
King John
'

(?i)King|john - replaced 4
'
<apply <span>john</span> <span>Johnny</span> match>
Sarah likes <span>Johnny</span>'s coo<span>king</span>
<apply <span>john</span> match>
Sarah likes <span>Johnny</span>'s coo<span>king</span>
<span id="medium_rectangle" class="_fwph">
Because Johnny does good cooking
</span>
<span>King</span> <span>John</span>
'
---------------------------------

use strict;
use warnings;
require 5.010_000;

## globs ..

my ($string, $result) =
qq{
<apply john Johnny match>
Sarah likes Johnny's cooking
<apply john match>
Sarah likes Johnny's cooking
<span id="medium_rectangle" class="_fwph">
Because Johnny does good cooking
</span>
King John
};

## code ..

print "\nString = \n'$string'\n\nTerms =\n";

print "\nJohnny|john - replaced ";
#
$result = getMatch( $string, 'span', 'Johnny|john');
print "$result\n";
print "'$string'\n" if $result;

print "\n(?i)King|john - replaced ";
#
$result = getMatch( $string, 'span', '(?i)King|john'); # case insensitive
print "$result\n";
print "'$string'\n" if $result;

exit(0);


## subs ..

sub getMatch
{
    #* Uses regex recursion, '(?1)', new in 5.10
    #* Start/End tags must have this specific form:
    #* <$tag></$tag> or <$tag attrib></$tag>
    #* --------------------------------------
    my ($tag,$terms) = @_[1,2];
    my $start = "<$tag(?:\\s+|>)"; # allow <tag> or <tag attribute>
    my $end = "</$tag>";

    my $replaced = 0;

    # $_[0] aliases the caller's string, so it is edited in place
    $_[0] =~ s
    { # match ..

        ( # 1
            $start
            (?:
                (?:(?!$start|$end).)++ # no backtracking
              |
                (?1) # recurse group 1
            )*
            $end
        )
        \K # efficient -- don't include tag data in match
      |
        ( # 2
            $terms
        )
    }

    { # replace ..
        $replaced++, "<$tag>".$2."</$tag>" if defined $2
    }xsge;

    return $replaced;
}

__END__
 
