Diez said:
Why do you use BeautifulSoup on that? It's generated content, and I
suppose it is well-formed, most probably even xml. So use a standard
parser here, better yet somthing like lxml/elementtree
Diez
Once upon a time I have written for my own purposes some code on this
subject, so maybe it can be used as a starter (tested a bit, but
consider its status as a kind of alpha release):
<code>
from urllib import urlopen
from sgmllib import SGMLParser
class mySGMLParserClassProvidingListOf_HREFs(SGMLParser):
# provides only HREFs <a href="someURL"> for links to another pages skipping
# references to:
# - internal links on same page : "#..."
# - email adresses : "mailto:..."
# and skipping part with appended internal link info, so that e.g.:
# - "LinkSpec#internalLinkID" will be listed as "LinkSpec" only
# ---
# reset() overwrites an empty function available in SGMLParser class
def reset(self):
SGMLParser.reset(self)
self.A_HREFs = []
#: def reset(self)
# start_a() overwrites an empty function available in SGMLParser class
# from which this class is derived. start_a() will be called each
time the
# SGMLParser detects an <a ...> tag within the feed(ed) HTML document:
def start_a(self, tagAttributes_asListOfNameValuePairs):
for attrName, attrValue in tagAttributes_asListOfNameValuePairs:
if attrName=='href':
if attrValue[0] != '#' and attrValue[:7] !='mailto:':
if attrValue.find('#') >= 0:
attrValue = attrValue[:attrValue.find('#')]
#: if
self.A_HREFs.append(attrValue)
#: if
#: if
#: for
#: def start_a(self, attributes_NamesAndValues_AsListOfTuples)
#: class mySGMLParserClassProvidingListOf_HREFs(SGMLParser)
#
------------------------------------------------------------------------------
# ---
# Execution block:
fileLikeObjFrom_urlopen = urlopen('
www.google.com') # set URL
mySGMLParserClassObj_withListOfHREFs =
mySGMLParserClassProvidingListOf_HREFs()
mySGMLParserClassObj_withListOfHREFs.feed(fileLikeObjFrom_urlopen.read())
mySGMLParserClassObj_withListOfHREFs.close()
fileLikeObjFrom_urlopen.close()
for href in mySGMLParserClassObj_withListOfHREFs.A_HREFs:
print href
#: for
</code>
Claudio Grondi