sgmllib problem & proposed fix.

C. Titus Brown · Dec 17, 2004

Hi all,

while playing with PBP/mechanize/ClientForm, I ran into a problem with
the way htmllib.HTMLParser was handling encoded tag attributes.

Specifically, the following HTML was not being handled correctly:

<option value="Small (6&quot;)">Small (6)</option>

The 'value' attr was being given the escaped value, not the
correct unescaped value, 'Small (6")'.

It turns out that sgmllib.SGMLParser (on which htmllib.HTMLParser is
based) does not unescape tag attributes. However, HTMLParser.HTMLParser
(the newer, more XHTML-friendly class) does do so.
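
To make the expected behaviour concrete, here is a minimal sketch. It assumes Python 3, where HTMLParser.HTMLParser lives on as html.parser.HTMLParser (htmllib and sgmllib were removed in 3.0, so the failing side cannot be shown there). The OptionCollector class is a hypothetical name used only for illustration.

```python
from html.parser import HTMLParser


class OptionCollector(HTMLParser):
    """Hypothetical helper: records the attrs of every <option> start tag."""

    def __init__(self):
        super().__init__()
        self.option_attrs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'option':
            self.option_attrs.append(attrs)


parser = OptionCollector()
parser.feed('<option value="Small (6&quot;)">Small (6)</option>')
parser.close()

# The attribute value arrives already unescaped.
print(parser.option_attrs)
```

The 'value' attribute comes back as 'Small (6")', which is what the sgmllib-based parser should arguably produce as well.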

My proposed fix is to change sgmllib to unescape tag attribute values in
the same way that HTMLParser.HTMLParser does. A context diff to sgmllib.py
from Python 2.4 is at the bottom of this message.

I'm posting to this newsgroup before submitting the patch because I'm
not too familiar with these classes and I want to make sure this
behavior is correct.

One question I had was this: as you can see from the code below, a
simple string.replace is done to replace encoded strings with their
unencoded translations. Should handle_entityref be used instead, as
with standard HTML text?
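
To illustrate the trade-off behind this question, here is a rough sketch (not part of the patch). unescape_replace mirrors the chained replace approach. unescape_amp_first is a deliberately wrong ordering that shows why the ampersand replacement has to come last. unescape_table is a hypothetical single-pass alternative driven by an entity table, which is one way a table lookup in the spirit of handle_entityref could work. All three function names and the ENTITYDEFS constant are made up for this example.

```python
import re

# Entity table covering the same five names the patch handles.
ENTITYDEFS = {'lt': '<', 'gt': '>', 'amp': '&', 'quot': '"', 'apos': "'"}


def unescape_replace(s):
    """Chained replace; '&amp;' is handled last on purpose."""
    if '&' not in s:
        return s
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    s = s.replace("&apos;", "'")
    s = s.replace("&quot;", '"')
    s = s.replace("&amp;", "&")  # must be last
    return s


def unescape_amp_first(s):
    """Wrong ordering, kept only to demonstrate double unescaping."""
    s = s.replace("&amp;", "&")
    s = s.replace("&lt;", "<")
    return s


_ENTITY_RE = re.compile(r'&([a-zA-Z][a-zA-Z0-9]*);')


def unescape_table(s, table=ENTITYDEFS):
    """Hypothetical single-pass variant: each entity is looked up once."""
    def repl(match):
        # Unknown entity names are left exactly as written.
        return table.get(match.group(1), match.group(0))
    return _ENTITY_RE.sub(repl, s)


print(unescape_replace('Small (6&quot;)'))
print(unescape_replace('&amp;lt;'))
print(unescape_amp_first('&amp;lt;'))
print(unescape_table('&amp;lt; and &nbsp;'))
```

With the ampersand replaced first, the input '&amp;lt;' is unescaped twice and collapses to '<'. The single-pass version never rescans its own output, so ordering stops mattering, and the table can be extended without touching the code.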

Another question: should this fix, if appropriate, be back-ported to
older versions of Python? (I doubt sgmllib has changed much, so it
should be pretty simple to do.)

thanks for any advice,
--titus

*** /u/t/software/Python-2.4/Lib/sgmllib.py 2004-09-08 18:49:58.000000000 -0700
--- sgmllib.py 2004-12-16 23:30:51.000000000 -0800
***************
*** 272,277 ****
--- 272,278 ----
              elif attrvalue[:1] == '\'' == attrvalue[-1:] or \
                   attrvalue[:1] == '"' == attrvalue[-1:]:
                  attrvalue = attrvalue[1:-1]
+                 attrvalue = self.unescape(attrvalue)
              attrs.append((attrname.lower(), attrvalue))
              k = match.end(0)
          if rawdata[j] == '>':
***************
*** 414,419 ****
--- 415,432 ----
      def unknown_charref(self, ref): pass
      def unknown_entityref(self, ref): pass

+     # Internal -- helper to remove special character quoting
+     def unescape(self, s):
+         if '&' not in s:
+             return s
+         s = s.replace("&lt;", "<")
+         s = s.replace("&gt;", ">")
+         s = s.replace("&apos;", "'")
+         s = s.replace("&quot;", '"')
+         s = s.replace("&amp;", "&") # Must be last
+
+         return s
+

  class TestSGMLParser(SGMLParser):

C. Titus Brown · Dec 17, 2004

Whoops! Forgot an executable example.

Attached, and also available at

http://issola.caltech.edu/~t/transfer/test-enc.py
http://issola.caltech.edu/~t/transfer/test-enc.html

Run 'python test-enc.py test-enc.html' and note that
htmllib.HTMLParser-based parsers give different output than
HTMLParser.HTMLParser-based parsers.

cheers,
--titus

#!/usr/bin/env python2.4
import htmllib
import HTMLParser
import formatter

### a simple mix-in to demonstrate the problem.

class MixinTest:
    def start_option(self, attrs):
        print '==> OPTION starting', attrs

    # Definition of entities -- derived classes may override
    entitydefs = \
        {'lt': '<', 'gt': '>', 'amp': '&', 'quot': '"', 'apos': '\''}

    def handle_entityref(self, name):
        print '==> HANDLING ENTITY', name
        table = self.entitydefs
        if name in table:
            self.handle_data(table[name])
        else:
            self.unknown_entityref(name)
            return

####

class htmllib_Parser(MixinTest, htmllib.HTMLParser):
    def __init__(self):
        htmllib.HTMLParser.__init__(self, formatter.NullFormatter())

class nonhtmllib_Parser(MixinTest, HTMLParser.HTMLParser):
    def handle_starttag(self, name, attrs):
        "Redirect OPTION tag ==> MixinTest.start_option"
        if name == 'option':
            self.start_option(attrs)

        pass

###

import sys
data = open(sys.argv[1]).read()

print 'PARSING with htmllib.HTMLParser'

htmllib_p = htmllib_Parser()
htmllib_p.feed(data)

print '\nPARSING with HTMLParser.HTMLParser'

nonhtmllib_p = nonhtmllib_Parser()
nonhtmllib_p.feed(data)
