sgmllib problem & proposed fix.

C. Titus Brown

Hi all,

while playing with PBP/mechanize/ClientForm, I ran into a problem with
the way htmllib.HTMLParser was handling encoded tag attributes.

Specifically, the following HTML was not being handled correctly:

<option value="Small (6&quot;)">Small (6)</option>

The 'value' attribute was being given the still-escaped value,
'Small (6&quot;)', not the correct unescaped value, 'Small (6")'.

It turns out that sgmllib.SGMLParser (on which htmllib.HTMLParser is
based) does not unescape tag attributes. However, HTMLParser.HTMLParser
(the newer, more XHTML-friendly class) does do so.
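
To illustrate the behavior of the newer parser, here is a minimal sketch. It assumes Python 3, where the HTMLParser module was renamed html.parser (sgmllib and htmllib are not available there); OptionCollector is a hypothetical helper name, not part of any library.

```python
from html.parser import HTMLParser  # Python 3 name of the Python 2 HTMLParser module


class OptionCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every <option> tag."""

    def __init__(self):
        super().__init__()
        self.option_attrs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; values arrive already unescaped
        if tag == 'option':
            self.option_attrs.append(attrs)


collector = OptionCollector()
collector.feed('<option value="Small (6&quot;)">Small (6)</option>')
collector.close()
print(collector.option_attrs)
```

The 'value' attribute comes back as 'Small (6")', which is the result I would like sgmllib to produce as well.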

My proposed fix is to change sgmllib to unescape tag attributes in the same way
that HTMLParser.HTMLParser does. A context diff to sgmllib.py from
Python 2.4 is at the bottom of this message.
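
As a rough standalone illustration of the replacement order used in the patch, the sketch below copies the helper into a free function. The names unescape_attr and unescape_amp_first are hypothetical and exist only for this comparison; they are not part of sgmllib.

```python
def unescape_attr(s):
    """Hypothetical standalone copy of the unescape helper from the patch."""
    if '&' not in s:
        return s
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    s = s.replace("&apos;", "'")
    s = s.replace("&quot;", '"')
    s = s.replace("&amp;", "&")  # must be last
    return s


def unescape_amp_first(s):
    """Deliberately wrong ordering, for comparison only."""
    s = s.replace("&amp;", "&")
    s = s.replace("&lt;", "<")
    return s


print(unescape_attr('Small (6&quot;)'))
print(unescape_attr('&amp;lt;'))       # stays an escaped "&lt;"
print(unescape_amp_first('&amp;lt;'))  # wrongly decoded twice
```

Replacing "&amp;" first would turn the literal text "&amp;lt;" into "&lt;" and then into "<", decoding it twice; doing it last avoids that.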

I'm posting to this newsgroup before submitting the patch because I'm
not too familiar with these classes and I want to make sure this
behavior is correct.

One question I had was this: as you can see from the code below, a
simple string.replace is done to replace encoded strings with their
unencoded translations. Should handle_entityref be used instead, as
with standard HTML text?
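
For comparison, here is a sketch of what a table-driven alternative might look like. It is not how handle_entityref itself works (that method passes the decoded text to handle_data); it only reuses the idea of looking names up in an entitydefs-style table. ENTITYDEFS, ENTITY_RE and unescape_with_table are hypothetical names chosen for this example.

```python
import re

# Same five entities as sgmllib's default entitydefs table
ENTITYDEFS = {'lt': '<', 'gt': '>', 'amp': '&', 'quot': '"', 'apos': "'"}

# Matches named references such as &quot; (numeric references are not handled here)
ENTITY_RE = re.compile(r'&([A-Za-z][A-Za-z0-9]*);')


def unescape_with_table(s, table=ENTITYDEFS):
    """Replace known named entities in a single pass; leave unknown ones alone."""
    def replace(match):
        name = match.group(1)
        return table.get(name, match.group(0))
    return ENTITY_RE.sub(replace, s)


print(unescape_with_table('Small (6&quot;)'))
print(unescape_with_table('&amp;lt; &nbsp;'))
```

Because the substitution is a single pass, the order of replacements no longer matters, and a subclass that overrides the table would have its extra entities honored in attributes too.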

Another question: should this fix, if appropriate, be back-ported to
older versions of Python? (I doubt sgmllib has changed much, so it
should be pretty simple to do.)

thanks for any advice,
--titus

*** /u/t/software/Python-2.4/Lib/sgmllib.py 2004-09-08 18:49:58.000000000 -0700
--- sgmllib.py 2004-12-16 23:30:51.000000000 -0800
***************
*** 272,277 ****
--- 272,278 ----
              elif attrvalue[:1] == '\'' == attrvalue[-1:] or \
                   attrvalue[:1] == '"' == attrvalue[-1:]:
                  attrvalue = attrvalue[1:-1]
+                 attrvalue = self.unescape(attrvalue)
              attrs.append((attrname.lower(), attrvalue))
              k = match.end(0)
          if rawdata[j] == '>':
***************
*** 414,419 ****
--- 415,432 ----
      def unknown_charref(self, ref): pass
      def unknown_entityref(self, ref): pass

+     # Internal -- helper to remove special character quoting
+     def unescape(self, s):
+         if '&' not in s:
+             return s
+         s = s.replace("&lt;", "<")
+         s = s.replace("&gt;", ">")
+         s = s.replace("&apos;", "'")
+         s = s.replace("&quot;", '"')
+         s = s.replace("&amp;", "&") # Must be last
+
+         return s
+

  class TestSGMLParser(SGMLParser):
 
C. Titus Brown

Whoops! Forgot an executable example ;).

Attached, and also available at

http://issola.caltech.edu/~t/transfer/test-enc.py
http://issola.caltech.edu/~t/transfer/test-enc.html

Run 'python test-enc.py test-enc.html' and note that
htmllib.HTMLParser-based parsers give different output than
HTMLParser.HTMLParser-based parsers.

cheers,
--titus

#!/usr/bin/env python2.4
import htmllib
import HTMLParser
import formatter

### a simple mix-in to demonstrate the problem.

class MixinTest:
    def start_option(self, attrs):
        print '==> OPTION starting', attrs

    # Definition of entities -- derived classes may override
    entitydefs = \
        {'lt': '<', 'gt': '>', 'amp': '&', 'quot': '"', 'apos': '\''}

    def handle_entityref(self, name):
        print '==> HANDLING ENTITY', name
        table = self.entitydefs
        if name in table:
            self.handle_data(table[name])
        else:
            self.unknown_entityref(name)
            return

####

class htmllib_Parser(MixinTest, htmllib.HTMLParser):
    def __init__(self):
        htmllib.HTMLParser.__init__(self, formatter.NullFormatter())

class nonhtmllib_Parser(MixinTest, HTMLParser.HTMLParser):
    def handle_starttag(self, name, attrs):
        "Redirect OPTION tag ==> MixinTest.start_option"
        if name == 'option':
            self.start_option(attrs)

    pass

###

import sys
data = open(sys.argv[1]).read()

print 'PARSING with htmllib.HTMLParser'

htmllib_p = htmllib_Parser()
htmllib_p.feed(data)

print '\nPARSING with HTMLParser.HTMLParser'

nonhtmllib_p = nonhtmllib_Parser()
nonhtmllib_p.feed(data)
 
