Using gsub to remove embedded newlines in HTML file

Wes Gamble · Aug 2, 2006

I have an HTML file that is in a string.

I want to use gsub! to recursively remove any embedded newlines and
whitespace between two known delimiters.

Given a string that includes this kind of string:

~^LNK:http://slashdot.org/login.pl?op=newuserform~
Create a new account
^~

I want to replace the above with:

~^LNK:http://slashdot.org/login.pl?op=newuserform~Create a new account^~

(stripping out the newlines and whitespace)

Having trouble writing the regex for this.

I think I want something like:

/~\^LNK:.*?([\s\r\n])+.*?\^~/

that I could use in:

str.gsub!(/~\^LNK:.*?([\s\r\n])+.*?\^~/, '')

to replace all of the whitespace and any newline characters with
empty strings.

But I don't think this will work because I really need to loop _within_
each substring of my large HTML string. The thing about gsub is that it
will substitute the entire matched string.
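
To make the concern concrete, here is a minimal sketch of what a plain gsub with an empty replacement would do. The variable names and the "before"/"after" text around the link are made up for illustration, and the /m flag is an addition so that "." can match the newlines; without it the pattern would not span lines at all.

```ruby
# Sample input: one delimited link embedded in some surrounding text.
html = "before ~^LNK:http://slashdot.org/login.pl?op=newuserform~\nCreate a new account\n^~ after"

# The pattern from the post, with /m added so that "." also matches newlines.
pattern = /~\^LNK:.*?([\s\r\n])+.*?\^~/m

# With a string replacement, gsub swaps out everything the pattern matched,
# not just the whitespace captured by the group.
result = html.gsub(pattern, '')

puts result.inspect  # => "before  after"
```

The whole link disappears, which is why the whitespace has to be removed inside each match rather than by replacing the match itself.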

Do I need to scan out the ~^LNK.*?^~, operate on those and then put them
back into the larger string?

I'm not sure I'm asking this very well, so I apologize if that's the
case.

Thanks,
Wes

Wes Gamble · Aug 3, 2006

Something like:

@html.scan(/~\^LNK:.*?\^~/mi).each do |link_line|
  new_link_line = link_line.gsub(/[\s\r\n]/, '')
  @html.gsub!(/#{link_line}/mi, new_link_line)
end

Wes Gamble · Aug 3, 2006

This seems to work well:

@html.scan(/~\^LNK:.*?\^~/mi).each do |link_line|
  new_link_line = link_line.gsub(/[\t\r\n]/, '')
  @html.gsub!(/#{Regexp.escape(link_line)}/mi, new_link_line) if link_line != new_link_line
end

I wonder if I could have done this with one @html.gsub!() command, but
this is much more understandable to me anyway so I'll stick with this.
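
For anyone wondering why Regexp.escape matters here, the following is a small sketch using a made-up wrapper string (the `<p>` tags and variable names are illustrative only). Interpolating the matched text straight into a regex lets characters such as "^", "." and "?" act as regex syntax, so the pattern no longer matches the text it came from.

```ruby
link_line = "~^LNK:http://slashdot.org/login.pl?op=newuserform~\nCreate a new account\n^~"
html = "<p>#{link_line}</p>"

# Unescaped: "^" is a start-of-line anchor, "." is a wildcard and "?" is a
# quantifier, so the interpolated pattern fails to match its own source text.
unescaped_match = html =~ /#{link_line}/m

# Escaped: every metacharacter is matched literally.
escaped_match = html =~ /#{Regexp.escape(link_line)}/m

puts unescaped_match.inspect  # => nil
puts escaped_match.inspect    # => 3
```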

Thanks,
Wes

Carlos · Aug 3, 2006

You can use a block with gsub:

@html.gsub!(/~\^LNK:.*?\^~/mi) { |s| s.gsub(/\s/, '') }

or something like that.
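
A self-contained sketch of the block form follows, under a couple of assumptions: the closing delimiter is `^~` as in the original question, and the goal is the exact output shown there, with the spaces inside "Create a new account" kept. Stripping every `\s` would remove those spaces too, so this version removes only line breaks together with any whitespace touching them. The sample `html` string and variable names are illustrative.

```ruby
html = "<p>\n~^LNK:http://slashdot.org/login.pl?op=newuserform~\nCreate a new account\n^~\n</p>\n"

# The block receives each delimited link; its return value replaces the match.
cleaned = html.gsub(/~\^LNK:.*?\^~/m) do |link|
  # Remove each run of line breaks plus the whitespace around it,
  # leaving single spaces between words untouched.
  link.gsub(/\s*[\r\n]+\s*/, '')
end

puts cleaned
```

Text outside the delimiters, including its newlines, is left alone because the inner gsub only ever sees the matched link.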

Good luck.

Wes Gamble · Aug 3, 2006

Thanks. That is the _Ruby_ way to do it, and that's what I wanted to
know.

I've used blocks with gsub but I keep forgetting that I can put anything
in there - so far I've only used backrefs to pull out pieces of the
matching regex to rearrange things.
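
For comparison, the backref style of block mentioned above might look like this. It is a hypothetical example with made-up names, not code from this thread.

```ruby
names = "Doe, Jane; Roe, Richard"

# $1 and $2 hold the capture groups for the current match inside the block.
swapped = names.gsub(/(\w+), (\w+)/) { "#{$2} #{$1}" }

puts swapped  # => Jane Doe; Richard Roe
```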

Wes
