Is there a way to abandon a gsub if you're using a block?

Wes Gamble · Jun 25, 2009

I am using the form of gsub that takes a block to determine what to
substitute.

My problem is that I can't quite get the regex working, but I will be
able to detect that it matched incorrectly, once I can inspect the
backreferences in the block.

So: if I determine via code in the substitution block that I don't want
to do the substitution, is there a way to simply abandon the gsub call
on that particular iteration?

Wes

Brian Adkins · Jun 25, 2009

Wes Gamble said:
I am using the form of gsub that takes a block to determine what to
substitute.

My problem is that I can't quite get the regex working, but I will be
able to detect that it matched incorrectly, once I can inspect the
backreferences in the block.

So: if I determine via code in the substitution block that I don't want
to do the substitution, is there a way to simply abandon the gsub call
on that particular iteration?

Can't you simply return the match itself from the block, making that iteration a no-op?
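A minimal sketch of the no-op approach suggested here: when the block decides the match is unwanted, it returns the matched text itself, so gsub puts back exactly what it found. The sample string and the skip rule are made up for illustration.

```ruby
text = "cat bat rat"

# Upcase three-letter words ending in "at", but leave "bat" alone.
result = text.gsub(/\b(\w)at\b/) do |match|
  if $1 == "b"
    match               # return the match unchanged: a no-op for this iteration
  else
    "#{$1.upcase}AT"    # otherwise build the real replacement
  end
end

puts result
```

The block parameter (`match` here) holds the same text as `$&`, so either can be returned to skip a substitution.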

David A. Black · Jun 25, 2009

Hi --

I am using the form of gsub that takes a block to determine what to
substitute.

My problem is that I can't quite get the regex working, but I will be
able to detect that it matched incorrectly, once I can inspect the
backreferences in the block.

So: if I determine via code in the substitution block that I don't want
to do the substitution, is there a way to simply abandon the gsub call
=> "zzzzez"

David


Robert Klemme · Jun 26, 2009

2009/6/25 Wes Gamble said:
I am using the form of gsub that takes a block to determine what to
substitute.

My problem is that I can't quite get the regex working, but I will be

If you provide more detail about the input and the text that you want
to match we might be able to help fix the regular expression. IMHO
that approach is superior to simply returning the match effectively
replacing it with itself (which does work of course).

Kind regards

robert

Wes Gamble · Jun 26, 2009

Robert said:
If you provide more detail about the input and the text that you want
to match we might be able to help fix the regular expression. IMHO
that approach is superior to simply returning the match effectively
replacing it with itself (which does work of course).

self.html.gsub!(/<a\s+?[^>]*?href=(['"])  # <a up to and including href=' or href="
                 (?!mailto:)(.*?)          # Contents of any non-mailto: href attribute
                 \1.*?>                    # End of href attribute (same quote) + arbitrary text to end of opening <a> tag
                 (.*?)                     # Contents of <a> - the "link display"
                 <\\?\/a>/mix) {           # Closing </a> tag, allowing for optional \, e.g. </a> or <\/a>

So, this regex is attempting to pull out the contents of an href in a
<a> tag, as well as the content enclosed by the <a> tag.

The problem comes when it encounters a particularly nefarious kind of
HTML which looks like this:

<a href="x"><div>....<a href="x"><img src="y"></a>....</div>

and there is no closing </a> for the first anchor. What I want to pull
is the _valid_ <a> tag "on the inside", but what I get is the first <a>
tag up to the closing </a> tag, which is not correct. The problem is
that the first <a> tag just shouldn't be there at all.
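The over-match described above can be reproduced with a much smaller pattern. This is only a sketch: `naive` is a simplified stand-in for the regex in the post (an assumption, not the original), applied to the sample HTML from the post.

```ruby
html = '<a href="x"><div>....<a href="x"><img src="y"></a>....</div>'

# Simplified stand-in: opening <a ...>, lazy body, closing </a>.
naive = %r{<a\s[^>]*>(.*?)</a>}mi

m = html.match(naive)
puts m[0]   # whole match: begins at the first, unclosed <a>
puts m[1]   # captured body: includes the inner opening <a> tag
```

Making the body lazy does not help, because the regex engine always prefers the leftmost starting position, and the unclosed outer tag comes first.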

So I need to modify my regex to not match if there is a <a> tag inside
of another one. I tried for about 30 minutes yesterday using a (?!)
assertion, but couldn't quite get it.

Thanks,
Wes

Robert Klemme · Jun 26, 2009

Wes Gamble said:

self.html.gsub!(/<a\s+?[^>]*?href=(['"])  # <a up to and including href=' or href="
                 (?!mailto:)(.*?)          # Contents of any non-mailto: href attribute
                 \1.*?>                    # End of href attribute (same quote) + arbitrary text to end of opening <a> tag
                 (.*?)                     # Contents of <a> - the "link display"
                 <\\?\/a>/mix) {           # Closing </a> tag, allowing for optional \, e.g. </a> or <\/a>

So, this regex is attempting to pull out the contents of an href in a
<a> tag, as well as the content enclosed by the <a> tag.

The problem comes when it encounters a particularly nefarious kind of
HTML which looks like this:

<a href="x"><div>....<a href="x"><img src="y"></a>....</div>

and there is no closing </a> for the first anchor. What I want to pull
is the _valid_ <a> tag "on the inside", but what I get is the first <a>
tag up to the closing </a> tag, which is not correct. The problem is
that the first <a> tag just shouldn't be there at all.

Another way to put it is that you want to match <a>...</a> without any
intermediate <a>.

Wes Gamble said:
So I need to modify my regex to not match if there is a <a> tag inside
of another one. I tried for about 30 minutes yesterday using a (?!)
assertion, but couldn't quite get it.

So the basic pattern here is that you want to match a combination A...B
without any A in between.

We try with a simple example:

irb(main):005:0> s = '....A;;A+++B'
=> "....A;;A+++B"
irb(main):006:0> s.scan %r{A(?:.(?!A))+B}
=> ["A+++B"]

Now with HTML like string:

irb(main):008:0> t = s.gsub(/A/, '<a href="foo">').gsub(/B/, '</a>')
=> "....<a href=\"foo\">;;<a href=\"foo\">+++</a>"
irb(main):017:0> t.scan %r{<a(?:\s+\w+=["'][^"']*["'])*>(?:.(?!<a))*?</a>}i
=> ["<a href=\"foo\">+++</a>"]

A bit more readable

irb(main):024:0> t.scan %r{
irb(main):025:0/ <a(?:\s+\w+=["'][^"']*["'])*> # opening tag
irb(main):026:0/ (?:.(?!<a))*? # between <a> and </a>
irb(main):027:0/ </a> # closing tag
irb(main):028:0/ }mix
=> ["<a href=\"foo\">+++</a>"]

The trick is to have a negative lookahead assertion on *each* character
between the beginning and ending sequence, which prevents a match if the
opening sequence appears anywhere in between.
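One edge case may be worth checking. With `(?:.(?!A))` the lookahead runs after each consumed character, so the character directly following the opening sequence is never tested itself. A common variant puts the lookahead first, `(?:(?!A).)`. The sketch below compares the two and then applies the lookahead-first form inside a gsub block. The names `after`, `before`, `link`, and `found`, and the `[link]` placeholder, are hypothetical and chosen only for illustration.

```ruby
after  = /A(?:.(?!A))*B/    # lookahead after the dot, as in the thread
before = /A(?:(?!A).)*B/    # lookahead before the dot (variant)

s = '....A;;A+++B'
after_hit  = s[after]       # both find the innermost A...B here
before_hit = s[before]

edge = 'AA+B'               # a second A directly after the first
after_edge  = edge[after]   # the second A slips through unchecked
before_edge = edge[before]  # the match starts at the second A instead

# Applied to the sample HTML from the thread. The href value uses [^"']*
# rather than .*? so that backtracking cannot stretch it across tags.
html = '<a href="x"><div>....<a href="x"><img src="y"></a>....</div>'
link = %r{<a\s[^>]*?href=(["'])([^"']*)\1[^>]*>((?:(?!<a\b).)*?)</a>}mi

found = []
rewritten = html.gsub(link) do
  found << [$2, $3]         # href value and link contents
  "[link]"                  # placeholder replacement
end

puts after_edge, before_edge, rewritten
```

Under these assumptions only the inner, properly closed anchor is replaced, and the stray outer `<a>` is left in place for separate handling.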

Kind regards

robert

Wes Gamble · Jun 26, 2009

Robert,

Many thanks,

Wes

Wes Gamble · Jun 26, 2009

So the way to read this:

(?:.(?!<a))*?

would be

"match on any character as long as it isn't followed by a '<a'"

Why do you need the positive lookahead assertion though - to ensure that
the characters aren't consumed in case of a bad match?

Wes

Robert Klemme · Jun 27, 2009

Wes Gamble said:
So the way to read this:

(?:.(?!<a))*?

would be

"match on any character as long as it isn't followed by a '<a'"

Exactly.

Wes Gamble said:
Why do you need the positive lookahead assertion though - to ensure that
the characters aren't consumed in case of a bad match?

What positive lookahead?

Kind regards

robert

Wes Gamble · Jun 27, 2009

My mistake - (?:...) is a non-capturing group (it doesn't generate
backreferences); I thought it was a positive lookahead.
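For reference, a short sketch of the three constructs that were mixed up here. The sample strings are made up for illustration.

```ruby
# (?:...) groups without capturing: it consumes text but creates no backreference.
m = "foobar".match(/(?:foo)(bar)/)
captures = m.captures                 # only the (bar) group is captured

# (?=...) positive lookahead: "bar" must follow, but is not consumed.
pos_hit  = "foobar"[/foo(?=bar)/]
pos_miss = "foobaz"[/foo(?=bar)/]

# (?!...) negative lookahead: "bar" must NOT follow.
neg_hit  = "foobaz"[/foo(?!bar)/]
neg_miss = "foobar"[/foo(?!bar)/]

p captures, pos_hit, pos_miss, neg_hit, neg_miss
```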
