HTML page into a string

Tempo

In my last post I received some advice to use urllib.read() to get a
whole HTML page as a string, which would then let me use
BeautifulSoup to do what I want with the string. But when I was
researching the 'urllib' module I couldn't find anything about a
'.read()' function in it. Is that the right module for getting an HTML
page into a string? Or am I completely missing something here? I'll take
the latter as the more likely of the two cases. Thanks for any and all help.
 
Steve Holden

I think you've misunderstood. There is no urllib.read(). You call
urllib.urlopen() with a URL as an argument. The object that this call
returns is file-like, in so far as you can call its read() method to get
the content of the web page as a string. Printing that string gives
output like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html;charset=ISO-8859-1">
<meta name="generator" content="Adobe GoLive 6">
<meta http-equiv="DESCRIPTION" content="Holden Web provides
architectural design of databases and information systems, with
full-service implementation and support">
...
</tr>
</tbody>
</table>
</div>

You will find there are lots of other things you can do with that
file-like object too, but reading it is the important one as far as
using BeautifulSoup goes.
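
As a rough sketch of the same idea in Python 3, where urlopen() lives in urllib.request and read() returns bytes that have to be decoded into a string: the helper name fetch_html, the fallback encoding, and the sample data: URL are all made up for illustration. The data: URL only keeps the example offline; an http:// URL is passed in the same way.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_html(url, fallback_encoding="utf-8"):
    """Return the document at `url` as a str (hypothetical helper)."""
    # urlopen() returns a file-like response object; read() yields bytes.
    with urlopen(url) as response:
        raw = response.read()
        # Use the charset declared in the Content-Type header if there is
        # one, otherwise fall back to an assumed default.
        charset = response.headers.get_content_charset() or fallback_encoding
    return raw.decode(charset, errors="replace")


# Assumption: a tiny inline page stands in for a real web site.
sample_url = "data:text/html;charset=utf-8," + quote(
    "<html><body><p>Hello</p></body></html>"
)
html = fetch_html(sample_url)
print(html)
```

The string in html is what you would then hand to a parser such as BeautifulSoup.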

regards
Steve
 
Jason Earl

Here's a short example of how this all works:

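#!/usr/bin/env python

import urllib2
from BeautifulSoup import BeautifulSoup

# urlopen() returns a file-like response object for the URL
response = urllib2.urlopen('http://www.cnn.com')
# BeautifulSoup reads the page straight from the file-like object
soup = BeautifulSoup(response)
print soup.prettify()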

It's not a particularly useful example, unless, of course, you wish to
prettify cnn's html, but it should get you to the point where
BeautifulSoup's documentation starts to make sense.
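
For a rough idea of the kind of processing that becomes possible once the page is a string, here is a minimal Python 3 sketch. It uses the standard library's html.parser as a stand-in and is not BeautifulSoup's own API. The class name LinkCollector and the sample page string are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Assumption: this string stands in for what read() returned.
page = '<html><body><a href="/news">News</a> <a href="/sport">Sport</a></body></html>'

collector = LinkCollector()
collector.feed(page)
collector.close()
print(collector.links)
```

BeautifulSoup wraps this sort of tag-by-tag work in a friendlier interface, which is why it was suggested in the first place.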

Jason
 
Tempo

Perfect. Thanks a bunch for clearing that all up for me. You have
saved me some long, lost hours.
 
