C. Titus Brown
Hi all,
while playing with PBP/mechanize/ClientForm, I ran into a problem with
the way htmllib.HTMLParser was handling encoded tag attributes.
Specifically, the following HTML was not being handled correctly:
<option value="Small (6&quot;)">Small (6)</option>
The 'value' attr was being given the escaped value, not the
correct unescaped value, 'Small (6")'.
It turns out that sgmllib.SGMLParser (on which htmllib.HTMLParser is
based) does not unescape tag attributes. However, HTMLParser.HTMLParser
(the newer, more XHTML-friendly class) does do so.
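To make the difference concrete, here is a minimal sketch of the unescaping behavior. It assumes a Python 3 interpreter, where the HTMLParser module lives on as html.parser and sgmllib no longer exists, so it only illustrates the HTMLParser side of the comparison. AttrCollector is a hypothetical name used purely for illustration.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.attrs_seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs; the parser has
        # already unescaped entity references in each value.
        self.attrs_seen.append((tag, attrs))


collector = AttrCollector()
collector.feed('<option value="Small (6&quot;)">Small (6)</option>')
collector.close()

# The value is reported unescaped: Small (6")
print(collector.attrs_seen)
```

With an unpatched sgmllib.SGMLParser, the same attribute would be reported with the entity reference still in place.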
My proposed fix is to change sgmllib to unescape tag attribute values in
the same way that HTMLParser.HTMLParser does. A context diff against
sgmllib.py from Python 2.4 is at the bottom of this message.
I'm posting to this newsgroup before submitting the patch because I'm
not too familiar with these classes and I want to make sure this
behavior is correct.
One question I had was this: as you can see from the code below, a
simple string.replace is done to replace encoded strings with their
unencoded translations. Should handle_entityref be used instead, as
with standard HTML text?
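As a rough illustration of what the string.replace approach does and does not cover, here is a standalone sketch of the same replacement chain, written as a plain function rather than a method. The unescape_amp_first variant is hypothetical and exists only to show why the ampersand replacement has to come last.

```python
def unescape(s):
    """Replace the five basic entity references, ampersand last."""
    if '&' not in s:
        return s
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    s = s.replace("&apos;", "'")
    s = s.replace("&quot;", '"')
    s = s.replace("&amp;", "&")  # must be last
    return s


def unescape_amp_first(s):
    """Hypothetical wrong ordering: decodes the ampersand too early."""
    s = s.replace("&amp;", "&")
    s = s.replace("&lt;", "<")
    return s


print(unescape('Small (6&quot;)'))     # Small (6")
print(unescape('&amp;lt;'))            # &lt;  (decoded exactly once)
print(unescape_amp_first('&amp;lt;'))  # <     (decoded twice, wrong)
print(unescape('&#34;'))               # &#34; (numeric references untouched)
```

The last line shows the main limitation: numeric character references and any other named entities pass through unchanged, which is where routing attribute values through the entity-handling machinery would behave differently.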
Another question: should this fix, if appropriate, be back-ported to
older versions of Python? (I doubt sgmllib has changed much, so it
should be pretty simple to do.)
thanks for any advice,
--titus
*** /u/t/software/Python-2.4/Lib/sgmllib.py 2004-09-08 18:49:58.000000000 -0700
--- sgmllib.py 2004-12-16 23:30:51.000000000 -0800
***************
*** 272,277 ****
--- 272,278 ----
              elif attrvalue[:1] == '\'' == attrvalue[-1:] or \
                   attrvalue[:1] == '"' == attrvalue[-1:]:
                  attrvalue = attrvalue[1:-1]
+                 attrvalue = self.unescape(attrvalue)
              attrs.append((attrname.lower(), attrvalue))
              k = match.end(0)
          if rawdata[j] == '>':
***************
*** 414,419 ****
--- 415,432 ----
      def unknown_charref(self, ref): pass
      def unknown_entityref(self, ref): pass
+     # Internal -- helper to remove special character quoting
+     def unescape(self, s):
+         if '&' not in s:
+             return s
+         s = s.replace("&lt;", "<")
+         s = s.replace("&gt;", ">")
+         s = s.replace("&apos;", "'")
+         s = s.replace("&quot;", '"')
+         s = s.replace("&amp;", "&") # Must be last
+
+         return s
+
  class TestSGMLParser(SGMLParser):