Crazy gsub/regex scheme - can this be done better?

Wes Gamble · Aug 11, 2006

All,

I have a method (that I believe to be working) that will take arbitrary
HTML and quote all of the non-quoted attributes (so href=junk would
become href="junk").

The method is below. As you can see it's a gsub within a gsub, where
the first gsub regex basically identifies any tag that has at least one
unquoted attribute, and then the inner gsub fixes ALL of the unquoted
attributes.

QUESTION: Is there a way to do this with one gsub, or is this scheme
really the only valid way to handle it?

Thanks,
Wes

#Make sure that every tag attribute is contained within either single or double quotes.
#The initial regex is to find at least one "bad" attribute value pair
#The "inner" regex is to actually fix ALL of the "bad" attribute value pairs
private
def ensure_quoted_attributes
  @html.gsub!(/<(?!!)[a-zA-Z0-9]+\s+        #Non-comment tag name, followed by whitespace
    (?:[a-zA-Z0-9]+?=(['"])(.*?)\1\s*)*?    #Any number of valid attribute-value pairs (attribute="value"), not-greedy
    [a-zA-Z0-9]+?=[^"'\s>]+\s*?             #An unquoted attribute-value pair (attribute=value)
    .*?>                                    #Rest of tag
    /mix) { |s|
    s.gsub(/(\s+[a-zA-Z0-9]+?=)([^"'\s>]+)(\s*?)/) { |sub_s| "#{$1}\"#{$2}\"#{$3}" }
  }
end

Eero Saynatkari · Aug 11, 2006

Wes said:
All,

I have a method (that I believe to be working) that will take arbitrary
HTML and quote all of the non-quoted attributes (so href=junk would
become href="junk").

You might want to just look into using Tidy, Hpricot,
or something similar that fixes broken HTML into compliant
XHTML. They would probably do this for you.

Wes Gamble · Aug 11, 2006

The problem with those kinds of parsers (I'm using Rubyful Soup to some
degree) is that they try to "fix" the HTML for you and sometimes cause
it to be rendered incorrectly compared to the original "incorrect"
implementation.

WG

Wes Gamble · Aug 12, 2006

[ INSANE COMMENT: I just want to say that the black magic that is
regexes is so powerful and alluring that I can't resist it and at the
same time so repulsive that I never want to do it again. ]

Update - my original scheme would fail when there was an attribute like

content="text/html; charset=UTF-8"

because the latter half would be seen as needing to be charset="UTF-8".
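
To make that failure concrete, here is a minimal sketch that runs only the inner substitution from the original method against a single tag. The names NAIVE_INNER and naive_quote and the sample tag are made up for illustration; only the regex itself comes from the earlier post.

```ruby
# Inner substitution from the original method, applied on its own.
# NAIVE_INNER and naive_quote are illustrative names, not part of the thread's code.
NAIVE_INNER = /(\s+[a-zA-Z0-9]+?=)([^"'\s>]+)(\s*?)/

def naive_quote(tag)
  # Wrap every bare-looking value in double quotes.
  tag.gsub(NAIVE_INNER) { "#{$1}\"#{$2}\"#{$3}" }
end

tag = '<meta name=generator content="text/html; charset=UTF-8">'
puts naive_quote(tag)
# => <meta name="generator" content="text/html; charset="UTF-8"">
```

The regex has no idea that " charset=UTF-8" sits inside an already-quoted value, so it treats it as a separate unquoted attribute.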

Thus, I became intimate with negative zero-width lookahead.

Here's what I believe to be a more correct solution (I apologize for the
formatting but I wanted to leave the comments in here).

Wes

#Make sure that every tag attribute is contained within either single or double quotes.
#The initial regex is to find at least one "bad" attribute value pair
#The "inner" regex is to actually fix ALL of the "bad" attribute value pairs
private
def ensure_quoted_attributes
  @html.gsub!(/<(?!!)[a-zA-Z0-9]+\s+      #Non-comment tag name, followed by whitespace
    (?:[a-zA-Z-]+?=(['"]).*?\1\s*)*?      #Any number of valid attribute-value pairs (attribute="value"), not-greedy
    [a-zA-Z-]+?=[^"'\s>]+\s*?             #An unquoted attribute-value pair (attribute=value)
    .*?>                                  #Rest of tag
    /mix) { |s|                           #For each tag gotten from the first regex, globally substitute into it based on...
    s.gsub(/(\s+[a-zA-Z-]+=)              #Attribute name
      (?!(['"])[^'"]*?\2[\s>])            #If the value looks like "stuff", then don't match, it's fine
      (?![^'"]*?['"][\s>])                #If the value looks like stuff", then don't match, it must be the tail end of another attribute-value pair
      ([^'"\s>]+)                         #Get the no-whitespace, no-'>', no quote text
      /mix) { |sub_s| "#{$1}\"#{$3}\"" }  #Substitute attribute name="attribute value"
  }
end
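
For comparison, one alternative to the lookaheads is to let the inner regex consume already-quoted values through an alternation, so that text inside them is never examined again. This is only a sketch, not the method from the posts above: TAG, ATTRIBUTE and quote_attributes are hypothetical names, it still uses two nested gsubs, and it assumes no attribute value contains a literal '>'.

```ruby
# Hypothetical sketch: quote bare attribute values, leaving quoted ones alone.
# Simplification: a tag is '<', a letter, then anything up to the next '>'.
TAG = /<[a-zA-Z][^>]*>/

ATTRIBUTE = /
  (\s+[a-zA-Z][a-zA-Z0-9-]*\s*=\s*)  # group 1: attribute name and equals sign
  (?:
    ("[^"]*"|'[^']*')                # group 2: already-quoted value, kept verbatim
    |
    ([^"'\s>]+)                      # group 3: bare value that needs quotes
  )
/x

def quote_attributes(html)
  html.gsub(TAG) do |tag|
    # Group 2 is set only when the value was already quoted.
    tag.gsub(ATTRIBUTE) { $2 ? "#{$1}#{$2}" : "#{$1}\"#{$3}\"" }
  end
end

puts quote_attributes('<meta name=generator content="text/html; charset=UTF-8">')
# => <meta name="generator" content="text/html; charset=UTF-8">
puts quote_attributes("<a href=junk class='x'>link</a>")
# => <a href="junk" class='x'>link</a>
```

Because the quoted branch swallows the whole of content="text/html; charset=UTF-8" in one match, the scan resumes after the closing quote and never sees charset= as a separate attribute.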
