urllib2.urlopen(url) pulling something other than HTML

dogatemycomputer

I am reading "Python for Dummies" and found the following example of a
web crawler that I thought was interesting. The first time I keyed
the program and executed it I didn't understand it well enough to
debug it so I just skipped it. A few days later I realized that it
failed after a few seconds and I wanted to know if it was a
shortcoming of Python, a mistype on my part or just an inherent
problem with the script so I retyped it and started trying to figure
out what went wrong.

Please keep in mind I am very new to coding so I have tried RTFM
without much success. I have a basic understanding of what the
application is doing but I want to understand WHY it is doing it or
what the rationale is for doing it. Not necessarily how it does it..
In any case here is the gist of the app.

1 - a new spider is created
2 - it takes a single argument, which is a web address (http://www.google.com)
3 - the spider pulls a copy of the page source
4 - the spider parses it for links and if the link is on the same
domain and has not already been parsed then it appends the link to the
list of pages to be parsed
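
The four steps above can be sketched without any network access. This is only an illustration in Python 3 syntax, not the book's code: the `site` dictionary and the `crawl` function are hypothetical stand-ins for real pages and for the Spider class shown later in the thread.

```python
from urllib.parse import urljoin

def crawl(start_url, links_on_page):
    """Return every URL reachable from start_url that stays under it.

    links_on_page is a hypothetical callable: URL -> list of hrefs.
    """
    seen = {start_url}            # pages already queued or parsed
    to_process = [start_url]      # pages still waiting to be parsed
    while to_process:
        url = to_process.pop()
        for href in links_on_page(url):
            link = urljoin(url, href)   # resolve relative links
            if link not in seen and link.startswith(start_url):
                seen.add(link)
                to_process.append(link)
    return seen

# A made-up three-page site standing in for real HTTP responses.
site = {
    "http://example.com/": ["a.html", "http://other.example.org/"],
    "http://example.com/a.html": ["b.html", "/"],
    "http://example.com/b.html": [],
}
found = crawl("http://example.com/", lambda url: site.get(url, []))
print(sorted(found))
```

The `seen` set is what keeps the loop from revisiting a page, and the `startswith` test is the same-site check from step 4.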

Being new I have a couple of questions that I am hoping someone can
answer with some degree of detail.

----------------------------------------------------------
f = formatter.AbstractFormatter(formatter.DumbWriter(StringIO()))
parser = htmllib.HTMLParser(f)
parser.feed(html)
parser.close()
return parser.anchorlist
----------------------------------------------------------

I get the idea that we're allocating some memory that looks like a
file so formatter.dumbwriter can manipulate it. The results are
passed to formatter.abstractformatter which does something else to the
HTML code. The results are then passed to "f" which is then passed to
htmllib.HTMLParser so it can parse the html for links. I guess I
don't understand with any great detail as to why this is happening.
I know someone is going to say that I should RTFM so here is the gist
of the documentation:

formatter.DumbWriter = "This class is suitable for reflowing a
sequence of paragraphs."
formatter.AbstractFormatter = "The standard formatter. This
implementation has demonstrated wide applicability to many writers,
and may be used directly in most circumstances. It has been used to
implement a full-featured World Wide Web browser." <-- huh?

So.. What is dumbwriter and abstractformatter doing with this HTML and
why does it need to be done before parser.feed() gets a hold of it?

The last question is.. I can't find any documentation to explain
where the "anchorlist" attribute came from? Here is the only
reference to this attribute that I can find anywhere in the Python
documentation.

----------------------
anchor_bgn( href, name, type)
This method is called at the start of an anchor region. The
arguments correspond to the attributes of the <A> tag with the same
names. The default implementation maintains a list of hyperlinks
(defined by the HREF attribute for <A> tags) within the document. The
list of hyperlinks is available as the data attribute anchorlist.
----------------------

So .. How does an average developer figure out that parser returns a
list of hyperlinks in an attribute called anchorlist? Is this
something that you just "figure out" or is there some book I should be
reading that documents all of the attributes for a particular
method? It just seems a bit obscure and certainly not something I
would have figured out on my own. Does this make me a poor developer
who should find another hobby? I just need to know if there is
something wrong with me or if this is a reasonable question to ask.

The last question I have is about debugging. The spider is capable
of parsing links until it reaches:

"html = get_page(http://www.google.com/jobs/fortune)" which returns
the contents of a pdf document, assigns the pdf contents to html which
is later passed to parser.feed(html) which crashes.

I'm smart enough to know that whenever you take in some input that you
should do some basic type checking to make sure that whatever you are
trying to manipulate (especially if it originates from outside of your
application) won't cause your application to crash. If you're
expecting an ASCII character then make sure you're not getting an
object or string of text.

How would an experienced python developer check the contents of "html"
to make sure it's not something else other than a blob of HTML code? I
should note an obvious catch-22.. How do I check the HTML in such
a way that the check itself doesn't possibly crash the app? I thought
about:

try:
    parser.feed(html)
except parser.HTMLParseError:
    parser.close()


..... but I'm not sure if that is right or not? The app still crashes
so obviously I'm doing something wrong.
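
A possible reason the try/except attempt still crashes, offered as a guess rather than a diagnosis: in Python 2, HTMLParseError is defined in the htmllib module, not on the parser instance, so parser.HTMLParseError does not exist. Python evaluates the expression after `except` only once an exception is already propagating, and the failed attribute lookup then raises a second error. The sketch below reproduces that behaviour in Python 3 with a hypothetical `FakeParser` class; with the real module the clause would presumably read `except htmllib.HTMLParseError:`.

```python
class FakeParser:
    """Hypothetical stand-in for a parser whose feed() fails on bad input."""
    def feed(self, data):
        raise ValueError("cannot parse: %r" % data[:10])

parser = FakeParser()
try:
    try:
        parser.feed("%PDF-1.4 not html")
    except parser.ParseError:      # no such attribute on the instance
        print("never reached")
except AttributeError as exc:
    # The lookup of parser.ParseError failed while handling the ValueError.
    outcome = "AttributeError: %s" % exc
    print(outcome)
```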


Here is the full app for your review.

Thank you for any help you can provide! I greatly appreciate it!


#!/usr/bin/python

#these modules do most of the work
import sys
import urllib2
import urlparse
import htmllib, formatter
from cStringIO import StringIO

def log_stdout(msg):
    """Print msg to the screen."""
    print msg

def get_page(url, log):
    """Retrieve URL and return contents, log errors."""
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError:
        log("Error retrieving: " + url)
        return ''
    body = page.read()
    page.close()
    return body

def find_links(html):
    """return a list of links in HTML"""
    #We're using the parser just to get the hrefs
    f = formatter.AbstractFormatter(formatter.DumbWriter(StringIO()))
    parser = htmllib.HTMLParser(f)
    parser.feed(html)
    parser.close()
    return parser.anchorlist

class Spider:
    """
    The heart of this program, finds all links within a web site.

    run() contains the main loop.
    process_page() retrieves each page and finds the links.
    """

    def __init__(self, startURL, log=None):
        #this method sets initial values
        self.URLs = set() #create a set
        self.URLs.add(startURL) #add the start url to the set
        self.include = startURL
        self._links_to_process = [startURL]
        if log is None:
            #use log_stdout function if no log provided
            self.log = log_stdout
        else:
            self.log = log

    def run(self):
        #process list of URLs one at a time
        while self._links_to_process:
            url = self._links_to_process.pop()
            self.log("Retrieving: " + url)
            self.process_page(url)

    def url_in_site(self, link):
        #checks whether the link starts with the base URL
        return link.startswith(self.include)

    def process_page(self, url):
        #retrieves page and finds links in it
        html = get_page(url, self.log)
        for link in find_links(html):
            #handle relative links
            link = urlparse.urljoin(url,link)
            self.log("Checking: " + link)
            #make sure this is a new URL within current site
            if link not in self.URLs and self.url_in_site(link):
                self.URLs.add(link)
                self._links_to_process.append(link)

if __name__ == '__main__':
    #this code runs when script is started from command line
    startURL = sys.argv[1]
    spider = Spider(startURL)
    spider.run()
    for URL in sorted(spider.URLs):
        print URL
 
John J. Lee

----------------------------------------------------------
f = formatter.AbstractFormatter(formatter.DumbWriter(StringIO()))
parser = htmllib.HTMLParser(f)
parser.feed(html)
parser.close()
return parser.anchorlist

Don't worry too much about memory. The "StringIO()" probably only
really allocates the memory needed for the "bookkeeping" that StringIO
does for its own internal purposes, not the memory needed to actually
store the HTML. Later, when you use the object, Python will
dynamically (== at run time) allocate the necessary memory for the
HTML, when the .write() method is called on the StringIO instance.
Python handles the memory allocation for you -- though of course the
code you write affects how much memory gets used.
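
A small sketch of that point, using Python 3's `io.StringIO` in place of the thread's `cStringIO` (an assumption made so the example runs on a current interpreter): the buffer starts out empty and only grows when something writes to it.

```python
from io import StringIO

buf = StringIO()                 # nothing stored yet
before = buf.getvalue()          # contents right after creation
buf.write("<html><body>hello</body></html>")
after = buf.getvalue()           # contents after one write() call
print(repr(before), len(after))
```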

Note:

- The StringIO is where the *output* HTML goes.

- The formatter.DumbWriter likely doesn't do anything with the
StringIO() at the time it's passed (it hasn't even seen your HTML
yet, so how could it?). Instead, it just squirrels away the
StringIO() for later use.

The results are passed to formatter.abstractformatter which does
something else to the HTML code.

Again, nothing much happens right away on the "f = ..." line. The
formatter.AbstractFormatter just keeps the writer so it can use it
to write out formatted text later on.

The results are then passed to "f" which is then passed to

The results are not "passed" to f. Instead, the results are given a
name, "f". You can give a single object as many names as you like.
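
To illustrate the naming point with a minimal sketch (Python 3 syntax; the variable names are arbitrary): assignment binds a second name to the same object rather than copying or "passing" anything.

```python
from io import StringIO

buf = StringIO()
f = buf                      # a second name for the same object, not a copy
f.write("written via f")
same_object = f is buf       # both names refer to one StringIO instance
print(same_object, buf.getvalue())
```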

htmllib.HTMLParser so it can parse the html for links. I guess I

htmllib.HTMLParser wants the formatter so it can format output
(e.g. you might want to write out the same page with some of the links
removed). It doesn't need the formatter to parse the HTML.
HTMLParser itself is responsible for the parsing -- as the name
implies.

don't understand with any great detail as to why this is happening.
I know someone is going to say that I should RTFM so here is the gist
of the documentation:

formatter.DumbWriter = "This class is suitable for reflowing a
sequence of paragraphs."
formatter.AbstractFormatter = "The standard formatter. This
implementation has demonstrated wide applicability to many writers,
and may be used directly in most circumstances. It has been used to
implement a full-featured World Wide Web browser." <-- huh?

The web browser in question was called "Grail". Grail has been
resting for some time now. By today's standards, "full-featured" is a
bit of a stretch.

But I wouldn't worry too much about what they're trying to say there
yet (it has to do with the way the formatter.AbstractFormatter class
is structured, not what it actually does "out of the box").

So.. What is dumbwriter and abstractformatter doing with this HTML and
why does it need to be done before parser.feed() gets a hold of it?

The "heavy lifting" only really actually starts happening when you
call parser.feed(). Before that, you're just setting the stage.

The last question is.. I can't find any documentation to explain
where the "anchorlist" attribute came from? Here is the only
reference to this attribute that I can find anywhere in the Python
documentation.

----------------------
anchor_bgn( href, name, type)
This method is called at the start of an anchor region. The
arguments correspond to the attributes of the <A> tag with the same
names. The default implementation maintains a list of hyperlinks
(defined by the HREF attribute for <A> tags) within the document. The
list of hyperlinks is available as the data attribute anchorlist.
----------------------

That is indeed the (only) documentation for .anchorlist . What more
were you expecting to see?

So .. How does an average developer figure out that parser returns a
list of hyperlinks in an attribute called anchorlist? Is this

They keep the Library Reference under their pillow :)

And strictly it doesn't *return* a list of links. And that's
certainly not HTMLParser's main function in life. It merely makes
such a list available as a convenience. In fact, many people instead
use module sgmllib, which provides no such convenience, but otherwise
does the same parsing work as module htmllib.

something that you just "figure out" or is there some book I should be
reading that documents all of the attributes for a particular
method? It just seems a bit obscure and certainly not something I
would have figured out on my own. Does this make me a poor developer
who should find another hobby? I just need to know if there is
something wrong with me or if this is a reasonable question to ask.

But you *did* figure it out. How else is it that you come to be
explaining it to us?

Keep in mind that *nobody* knows all of the standard library. I've
been writing Python code full time for years, and I often bump into
whole standard library modules whose existence I'd forgotten about, or
was never really aware of in the first place. The more you know about
what it can do, the more convenience you'll get out of it, is all.

The last question I have is about debugging. The spider is capable
of parsing links until it reaches:

"html = get_page(http://www.google.com/jobs/fortune)" which returns
the contents of a pdf document, assigns the pdf contents to html which
is later passed to parser.feed(html) which crashes. [...]
How would an experienced python developer check the contents of "html"
to make sure it's not something else other than a blob of HTML code? I
should note an obvious catch-22.. How do I check the HTML in such
a way that the check itself doesn't possibly crash the app? I thought
about:

try:
    parser.feed(html)
except parser.HTMLParseError:
    parser.close()


.... but I'm not sure if that is right or not? The app still crashes
so obviously I'm doing something wrong.

That kind of idea is often the best way. In this case, though, you
probably want to do an up-front check by looking at the HTTP
Content-Type header (Google for it), something like this:

response = urllib2.urlopen(url)
html = response.read()
if response.info()["Content-Type"] == "text/html":
    parse(html)
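
One caveat with comparing the raw header to "text/html": servers often append parameters such as a charset, so an exact string comparison can reject real HTML. A hedged sketch in Python 3 (the helper name `is_html` is made up for illustration) that compares only the media type:

```python
from email.message import Message

def is_html(content_type_header):
    """True if a Content-Type header value names the text/html media type."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    # get_content_type() drops parameters and lowercases the type
    return msg.get_content_type() == "text/html"

print(is_html("text/html; charset=UTF-8"))
print(is_html("application/pdf"))
```

In Python 3 the object returned by response.info() is itself a Message subclass, so response.info().get_content_type() should give the same answer directly.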


John
 
Gabriel Genellina

----------------------------------------------------------
f = formatter.AbstractFormatter(formatter.DumbWriter(StringIO()))
parser = htmllib.HTMLParser(f)
parser.feed(html)
parser.close()
return parser.anchorlist
----------------------------------------------------------

The htmllib.HTMLParser class is hard to use. I would replace those
lines with:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        if tag=="a":
            href = dict(attrs).get("href")
            if href:
                self.anchorlist.append(href)

parser = MyHTMLParser()
parser.feed(htmltext)
print parser.anchorlist

The anchorlist attribute, defined by myself here, is a list containing
all href attributes found in the page.
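
For readers on Python 3, where the module was renamed to `html.parser`, the same idea might look like the sketch below. The class name `LinkCollector` and the sample markup are illustrative only.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""
    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lowercased
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.anchorlist.append(href)

parser = LinkCollector()
parser.feed('<p><a href="/jobs">Jobs</a> <a name="top">no href</a> '
            '<A HREF="about.html">About</A></p>')
parser.close()
print(parser.anchorlist)
```

Nothing is collected until feed() is called; constructing the parser only sets up the empty list.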

dogatemycomputer said:
I get the idea that we're allocating some memory that looks like a
file so formatter.dumbwriter can manipulate it. The results are
passed to formatter.abstractformatter which does something else to the
HTML code. The results are then passed to "f" which is then passed to
htmllib.HTMLParser so it can parse the html for links. I guess I
don't understand with any great detail as to why this is happening.
I know someone is going to say that I should RTFM so here is the gist
of the documentation:

Don't even try to understand it - it's a mess. Use the HTMLParser
module instead.

The last question is.. I can't find any documentation to explain
where the "anchorlist" attribute came from? Here is the only
reference to this attribute that I can find anywhere in the Python
documentation.

And that's all you will find.

So .. How does an average developer figure out that parser returns a
list of hyperlinks in an attribute called anchorlist? Is this

Usually, those attributes are hyperlinked and you can find them in the
documentation index. Not for this one :(

something that you just "figure out" or is there some book I should be
reading that documents all of the attributes for a particular
method? It just seems a bit obscure and certainly not something I
would have figured out on my own. Does this make me a poor developer
who should find another hobby? I just need to know if there is
something wrong with me or if this is a reasonable question to ask.

It's a very reasonable question. The attribute should be documented
properly. But the class itself is a bit old; I never use it
anymore.

The last question I have is about debugging. The spider is capable
of parsing links until it reaches:

"html = get_page(http://www.google.com/jobs/fortune)" which returns
the contents of a pdf document, assigns the pdf contents to html which
is later passed to parser.feed(html) which crashes.

You can verify the Content-Type header before processing. Quoting the
get_page method:

def get_page(url, log):
    """Retrieve URL and return contents, log errors."""
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError:
        log("Error retrieving: " + url)
        return ''
    body = page.read()
    page.close()
    return body

The urlopen method returns a file-like object, which has an additional
info() method holding the response headers. You can get the Content-Type
using page.info().gettype(), which should be text/html (or
application/xhtml+xml for XHTML).
For any other type, just return '' as you do for any error.
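
Put together, a Content-Type guard inside get_page could look roughly like this. It is a Python 3 sketch, not the book's code: `opener` is a hypothetical parameter added so the function can be exercised without a network connection, and `FakeResponse` exists only for the demonstration. get_content_type() is the Python 3 counterpart of gettype().

```python
import urllib.request
import urllib.error
from email.message import Message

HTML_TYPES = ("text/html", "application/xhtml+xml")

def get_page(url, log, opener=urllib.request.urlopen):
    """Return the body of url as bytes, or b'' on error or non-HTML content."""
    try:
        page = opener(url)
    except urllib.error.URLError:
        log("Error retrieving: " + url)
        return b''
    try:
        if page.info().get_content_type() not in HTML_TYPES:
            log("Skipping non-HTML: " + url)
            return b''
        return page.read()       # read() returns bytes in Python 3
    finally:
        page.close()

class FakeResponse:
    """Stand-in for the object urlopen returns (demo only)."""
    def __init__(self, content_type, body):
        self._headers = Message()
        self._headers["Content-Type"] = content_type
        self._body = body
    def info(self):
        return self._headers
    def read(self):
        return self._body
    def close(self):
        pass

messages = []
pdf = get_page("http://example.com/x", messages.append,
               opener=lambda u: FakeResponse("application/pdf", b"%PDF"))
page = get_page("http://example.com/", messages.append,
                opener=lambda u: FakeResponse("text/html; charset=utf-8",
                                              b"<html></html>"))
print(repr(pdf), repr(page), messages)
```

Checking the header before calling read() also avoids downloading a large PDF only to throw it away.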
 
dogatemycomputer

Those responses were both very helpful. John's additional type
checking is straightforward and easy to implement. I will also
rewrite the application a second time using the class Gabriel
offered. Both of these suggestions will help gain some insight into
how Python works.

"Don't even try to understand it - it's a mess. Use the HTMLParser
module instead."

I personally think the application itself "feels" more complicated
than it needs to be but it's possible that is just my inexperience. I'm
going to do some reading about the HTMLParser module. I'm sure I
could make this spider a bit more functional in the process.

Thank you again for all of your help!!
 
Stefan Behnel

I personally think the application itself "feels" more complicated
than it needs to be but it's possible that is just my inexperience. I'm
going to do some reading about the HTMLParser module. I'm sure I
could make this spider a bit more functional in the process.

That's because you are using the standard library to parse HTML. While
HTMLParser can do what you want it to, it's rather hard to use, especially for
new users.

If you want to give lxml.html a try, a web spider would be something like this:

import lxml.html as H

def crawl(url, page_dict, depth=2, link_type="a"):
    html = H.parse(url).getroot()
    html.make_links_absolute()

    page_dict = (link_type, html) for e...odespeak.net/svn/lxml/trunk

Have fun,
Stefan
 
John J. Lee

Gabriel Genellina said:
Don't even try to understand it - it's a mess. Use the HTMLParser
module instead.
[...]

Module sgmllib (and therefore module htmllib also) is more tolerant of
bad HTML than module HTMLParser.


John
 
Gabriel Genellina

[...]
Don't even try to understand it - it's a mess. Use the HTMLParser
module instead.

[...]

Module sgmllib (and therefore module htmllib also) is more tolerant of
bad HTML than module HTMLParser.

I had the impression it was the opposite; anyway, neither of them can
handle really bad html.
I just don't *like* htmllib.HTMLParser - but that's only a matter of
taste.
 
Stefan Behnel

Gabriel said:
[...]
Don't even try to understand it - it's a mess. Use the HTMLParser
module instead.
[...]

Module sgmllib (and therefore module htmllib also) is more tolerant of
bad HTML than module HTMLParser.

I had the impression it was the opposite; anyway, neither of them can
handle really bad html.
I just don't *like* htmllib.HTMLParser - but that's only a matter of
taste.

lxml.html handles bad HTML and it's a powerful tool that is very easy to use.
And if one day you have to deal with really, *really* broken tag soup, it also
comes with BeautifulSoup parser integration.

Stefan
 
