Is there a way to abandon a gsub if you're using a block?

Wes Gamble

I am using the form of gsub that takes a block to determine what to
substitute.

My problem is that I can't quite get the regex working, but I will be
able to detect that it matched incorrectly, once I can inspect the
backreferences in the block.

So: if I determine via code in the substitution block that I don't want
to do the substitution, is there a way to simply abandon the gsub call
on that particular iteration?

Wes
 
Brian Adkins

Wes Gamble said:
I am using the form of gsub that takes a block to determine what to
substitute.

My problem is that I can't quite get the regex working, but I will be
able to detect that it matched incorrectly, once I can inspect the
backreferences in the block.

So: if I determine via code in the substitution block that I don't want
to do the substitution, is there a way to simply abandon the gsub call
on that particular iteration?

Can't you simply return the match as a no-op ?
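
A minimal sketch of that idea, assuming a made-up input string and a made-up rule (double a value only when it is numeric). The method name double_numeric_values is hypothetical. The block copies the match data first, then returns the whole match unchanged whenever it decides not to substitute:

```ruby
# Hypothetical helper: doubles numeric values in "key=value" pairs and
# leaves every other pair untouched.
def double_numeric_values(text)
  text.gsub(/(\w+)=(\S+)/) do
    # Copy what we need from the match data before doing anything else.
    whole, key, value = $~[0], $~[1], $~[2]

    if value.match?(/\A\d+\z/)   # String#match? does not overwrite $~
      "#{key}=#{value.to_i * 2}" # substitute
    else
      whole                      # no-op: replace the match with itself
    end
  end
end

puts double_numeric_values("a=1 b=x c=3")  # prints "a=2 b=x c=6"
```

gsub keeps scanning after a no-op iteration, so later matches are still processed.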
 
David A. Black

Hi --

I am using the form of gsub that takes a block to determine what to
substitute.

My problem is that I can't quite get the regex working, but I will be
able to detect that it matched incorrectly, once I can inspect the
backreferences in the block.

So: if I determine via code in the substitution block that I don't want
to do the substitution, is there a way to simply abandon the gsub call
=> "zzzzez"


David

--
David A. Black / Ruby Power and Light, LLC
Ruby/Rails consulting & training: http://www.rubypal.com
Now available: The Well-Grounded Rubyist (http://manning.com/black2)
"Ruby 1.9: What You Need To Know" Envycasts with David A. Black
http://www.envycasts.com
 
Robert Klemme

2009/6/25 Wes Gamble said:
I am using the form of gsub that takes a block to determine what to
substitute.

My problem is that I can't quite get the regex working, but I will be

If you provide more detail about the input and the text that you want
to match we might be able to help fix the regular expression. IMHO
that approach is superior to simply returning the match effectively
replacing it with itself (which does work of course).

Kind regards

robert
 
Wes Gamble

Robert said:
If you provide more detail about the input and the text that you want
to match we might be able to help fix the regular expression. IMHO
that approach is superior to simply returning the match effectively
replacing it with itself (which does work of course).

self.html.gsub!(/<a\s+?[^>]*?href=(['"])  # <a up to and including href=' or href="
                 (?!mailto:)(.*?)         # Contents of any non-mailto: href attribute
                 \1.*?>                   # End of href attribute (same quote) + arbitrary text to end of opening <a> tag
                 (.*?)                    # Contents of <a> - the "link display"
                 <\\?\/a>/mix) {          # Closing </a> tag, allowing for optional \, e.g. </a> or <\/a>

So, this regex is attempting to pull out the contents of an href in a
<a> tag, as well as the content enclosed by the <a> tag.

The problem comes when it encounters a particularly nefarious kind of
HTML which looks like this:

<a href="x"><div>....<a href="x"><img src="y"></a>....</div>

and there is no closing </a> for the first anchor. What I want to pull
is the _valid_ <a> tag "on the inside", but what I get is the first <a>
tag up to the closing </a> tag, which is not correct. The problem is
that the first <a> tag just shouldn't be there at all.

So I need to modify my regex to not match if there is a <a> tag inside
of another one. I tried for about 30 minutes yesterday using a (?!)
assertion, but couldn't quite get it.

Thanks,
Wes
 
Robert Klemme

Robert said:
If you provide more detail about the input and the text that you want
to match we might be able to help fix the regular expression. IMHO
that approach is superior to simply returning the match effectively
replacing it with itself (which does work of course).

self.html.gsub!(/<a\s+?[^>]*?href=(['"])  # <a up to and including href=' or href="
                 (?!mailto:)(.*?)         # Contents of any non-mailto: href attribute
                 \1.*?>                   # End of href attribute (same quote) + arbitrary text to end of opening <a> tag
                 (.*?)                    # Contents of <a> - the "link display"
                 <\\?\/a>/mix) {          # Closing </a> tag, allowing for optional \, e.g. </a> or <\/a>

So, this regex is attempting to pull out the contents of an href in a
<a> tag, as well as the content enclosed by the <a> tag.

The problem comes when it encounters a particularly nefarious kind of
HTML which looks like this:

<a href="x"><div>....<a href="x"><img src="y"></a>....</div>

and there is no closing </a> for the first anchor. What I want to pull
is the _valid_ <a> tag "on the inside", but what I get is the first <a>
tag up to the closing </a> tag, which is not correct. The problem is
that the first <a> tag just shouldn't be there at all.

Another way to put it is that you want to match <a>...</a> without any
intermediate <a>.

Wes Gamble said:
So I need to modify my regex to not match if there is a <a> tag inside
of another one. I tried for about 30 minutes yesterday using a (?!)
assertion, but couldn't quite get it.

So the basic pattern here is that you want to match a combination A...B
without any A in between.

We try with a simple example:

irb(main):005:0> s = '....A;;A+++B'
=> "....A;;A+++B"
irb(main):006:0> s.scan %r{A(?:.(?!A))+B}
=> ["A+++B"]

Now with HTML like string:

irb(main):008:0> t = s.gsub(/A/, '<a href="foo">').gsub(/B/, '</a>')
=> "....<a href=\"foo\">;;<a href=\"foo\">+++</a>"
irb(main):017:0> t.scan %r{<a(?:\s+\w+=["'][^"']*["'])*>(?:.(?!<a))*?</a>}i
=> ["<a href=\"foo\">+++</a>"]

A bit more readable

irb(main):024:0> t.scan %r{
irb(main):025:0/ <a(?:\s+\w+=["'][^"']*["'])*> # opening tag
irb(main):026:0/ (?:.(?!<a))*? # between <a> and </a>
irb(main):027:0/ </a> # closing tag
irb(main):028:0/ }mix
=> ["<a href=\"foo\">+++</a>"]

The trick is to have a negative lookahead assertion on *each* character
between the beginning and ending sequence, which prevents a match if the
opening sequence appears anywhere in between.
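
As a rough sketch, the same idea could be applied to the gsub from earlier in the thread. This is not the pattern from the posts above but a variation with two assumptions of its own. First, the lookahead sits before the dot, (?:(?!<a\b).), so a nested <a that directly follows the opening tag is rejected as well. Second, the href value is matched with [^"'>]* so it cannot stretch across tags. The names ANCHOR and rewrite_links and the replacement format are hypothetical.

```ruby
# Sketch only: ANCHOR and rewrite_links are hypothetical names.
ANCHOR = %r{
  <a\b[^>]*?\shref=(['"])(?!mailto:)([^"'>]*)\1[^>]*>  # opening tag; href value in group 2
  ((?:(?!<a\b).)*?)                                    # body; no character may begin another <a
  <\\?/a>                                              # closing tag, optionally with a backslash
}mix

# Replaces each innermost anchor with "link text (href)".
def rewrite_links(html)
  html.gsub(ANCHOR) do
    href, text = $~[2], $~[3]
    "#{text} (#{href})"
  end
end

html = '<a href="x"><div>....<a href="y"><img src="z"></a>....</div>'
puts rewrite_links(html)
# prints: <a href="x"><div>....<img src="z"> (y)....</div>
```

The unclosed outer <a href="x"> is left alone because its body would have to run across the inner <a, which the lookahead forbids. Only the inner, well-formed anchor reaches the block.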

Kind regards

robert
 
Wes Gamble

So the way to read this:

(?:.(?!<a))*?

would be

"match on any character as long as it isn't followed by a '<a'"

Why do you need the positive lookahead assertion though - to ensure that
the characters aren't consumed in case of a bad match?

Wes
 
Robert Klemme

So the way to read this:

(?:.(?!<a))*?

would be

"match on any character as long as it isn't followed by a '<a'"

Exactly.

Why do you need the positive lookahead assertion though - to ensure that
the characters aren't consumed in case of a bad match?

What positive lookahead?

Kind regards

robert
 
Wes Gamble

My mistake - ?: doesn't generate backreferences, I thought it was a
positive lookahead.
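
For comparison, a small sketch of the three constructs side by side (the sample string and variable names are made up for illustration):

```ruby
s = "foobar"

# (?:...) groups without capturing: "bar" is consumed but not stored as a group.
noncapturing = s.match(/foo(?:bar)/)

# (?=...) is a positive lookahead: "bar" must follow but is not consumed.
positive = s.match(/foo(?=bar)/)

# (?!...) is a negative lookahead: the match fails because "bar" does follow.
negative = s.match(/foo(?!bar)/)

p noncapturing[0]        # "foobar"
p noncapturing.captures  # []
p positive[0]            # "foo"
p negative               # nil
```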
 
