Crazy gsub/regex scheme - can this be done better?

Wes Gamble

All,

I have a method (that I believe to be working) that will take arbitrary
HTML and quote all of the non-quoted attributes (so href=junk would
become href="junk").

The method is below. As you can see, it's a gsub within a gsub: the
outer gsub's regex identifies any tag that has at least one unquoted
attribute, and then the inner gsub fixes ALL of the unquoted
attributes in that tag.

QUESTION: Is there a way to do this with one gsub, or is this scheme
really the only valid way to handle it?

Thanks,
Wes

#Make sure that every tag attribute is contained within either single or double quotes.
#The initial regex is to find at least one "bad" attribute value pair
#The "inner" regex is to actually fix ALL of the "bad" attribute value pairs
private
def ensure_quoted_attributes
  @html.gsub!(/<(?!!)[a-zA-Z0-9]+\s+        #Non-comment tag name, followed by whitespace
    (?:[a-zA-Z0-9]+?=(['"])(.*?)\1\s*)*?    #Any number of valid attribute-value pairs (attribute="value"), not-greedy
    [a-zA-Z0-9]+?=[^"'\s>]+\s*?             #An unquoted attribute-value pair (attribute=value)
    .*?>                                    #Rest of tag
    /mix) { |s|
    s.gsub(/(\s+[a-zA-Z0-9]+?=)([^"'\s>]+)(\s*?)/) { |sub_s| "#{$1}\"#{$2}\"#{$3}" }
  }
end
 
Eero Saynatkari

Wes said:
All,

I have a method (that I believe to be working) that will take arbitrary
HTML and quote all of the non-quoted attributes (so href=junk would
become href="junk").

You might want to just look into using Tidy, Hpricot,
or something similar that turns broken HTML into compliant
XHTML. They would probably do this for you.
 
Wes Gamble

The problem with those kinds of parsers (I'm using Rubyful Soup to some
degree) is that they try to "fix" the HTML for you, and the result
sometimes renders incorrectly compared to the original "incorrect"
markup.

WG
 
Wes Gamble

[ INSANE COMMENT: I just want to say that the black magic that is
regexes is so powerful and alluring that I can't resist it and at the
same time so repulsive that I never want to do it again. :) :) :) ]

Update - my original scheme would fail when there was an attribute like

content="text/html; charset=UTF-8"

because the latter half would be seen as needing to be charset="UTF-8".
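
The failure is easy to reproduce with the inner substitution from the first version alone. The sketch below lifts that regex into a standalone helper (the names NAIVE_ATTR and naive_quote, and the sample tag, are made up for illustration) and runs it on a tag whose quoted value itself contains charset=UTF-8:

```ruby
# Inner regex from the first version of the method, unchanged.
NAIVE_ATTR = /(\s+[a-zA-Z0-9]+?=)([^"'\s>]+)(\s*?)/

# Hypothetical helper: wraps every "name=value" it finds in double quotes,
# without checking whether the match sits inside an already quoted value.
def naive_quote(tag)
  tag.gsub(NAIVE_ATTR) { "#{$1}\"#{$2}\"#{$3}" }
end

tag = '<meta name=foo content="text/html; charset=UTF-8">'
puts naive_quote(tag)
# prints: <meta name="foo" content="text/html; charset="UTF-8"">
```

The name=foo pair is fixed as intended, but the " charset=UTF-8" fragment inside the content value also looks like an unquoted pair, so it picks up a second set of quotes.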

Thus, I became intimate with negative zero-width lookahead.

Here's what I believe to be a more correct solution (I apologize for the
formatting but I wanted to leave the comments in here).

Wes

#Make sure that every tag attribute is contained within either single or double quotes.
#The initial regex is to find at least one "bad" attribute value pair
#The "inner" regex is to actually fix ALL of the "bad" attribute value pairs
private
def ensure_quoted_attributes
  @html.gsub!(/<(?!!)[a-zA-Z0-9]+\s+      #Non-comment tag name, followed by whitespace
    (?:[a-zA-Z-]+?=(['"]).*?\1\s*)*?      #Any number of valid attribute-value pairs (attribute="value"), not-greedy
    [a-zA-Z-]+?=[^"'\s>]+\s*?             #An unquoted attribute-value pair (attribute=value)
    .*?>                                  #Rest of tag
    /mix) { |s|                           #For each tag gotten from the first regex, globally substitute into it based on...
    s.gsub(/(\s+[a-zA-Z-]+=)              #Attribute name
      (?!(['"])[^'"]*?\2[\s>])            #If the value looks like "stuff", then don't match, it's fine
      (?![^'"]*?['"][\s>])                #If the value looks like stuff", then don't match, it must be the tail end of another attribute-value pair
      ([^'"\s>]+)                         #Get the no-whitespace, no-'>', no quote text
      /mix) { |sub_s| "#{$1}\"#{$3}\"" }  #Substitute attribute name="attribute value"
  }
end
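
For anyone who wants to try the revised scheme outside the class, here is a minimal sketch that applies the same two regexes to a string argument instead of @html. The constant names, the function name quote_unquoted_attributes, and the sample inputs are assumptions made for the example; the patterns themselves are the ones from the method above.

```ruby
# Outer pattern: a non-comment tag that holds at least one unquoted attribute.
TAG_WITH_UNQUOTED = /
  <(?!!)[a-zA-Z0-9]+\s+              # non-comment tag name, then whitespace
  (?:[a-zA-Z-]+?=(['"]).*?\1\s*)*?   # any number of quoted pairs, non-greedy
  [a-zA-Z-]+?=[^"'\s>]+\s*?          # one unquoted pair
  .*?>                               # rest of the tag
/mix

# Inner pattern: one attribute whose value is neither quoted nor the tail
# end of another attribute's quoted value.
UNQUOTED_ATTR = /
  (\s+[a-zA-Z-]+=)                   # attribute name and equals sign
  (?!(['"])[^'"]*?\2[\s>])           # value is already quoted: skip it
  (?![^'"]*?['"][\s>])               # tail of a quoted value: skip it
  ([^'"\s>]+)                        # the bare value to wrap in quotes
/mix

# Hypothetical standalone version: returns a new string, input is not mutated.
def quote_unquoted_attributes(html)
  html.gsub(TAG_WITH_UNQUOTED) do |tag|
    tag.gsub(UNQUOTED_ATTR) { "#{$1}\"#{$3}\"" }
  end
end

puts quote_unquoted_attributes('<a href=junk>')
# prints: <a href="junk">
puts quote_unquoted_attributes('<meta name=foo content="text/html; charset=UTF-8">')
# prints: <meta name="foo" content="text/html; charset=UTF-8">
```

The second call is the case that tripped up the first version: the charset=UTF-8 fragment is followed by a closing quote and then ">", so the second negative lookahead rejects it and the content value is left alone.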
 
