HTML page into a string

Tempo · Feb 8, 2006

In my last post I received some advice to use urllib.read() to get a
whole HTML page as a string, which will then allow me to use
BeautifulSoup to do what I want with the string. But when I was
researching the 'urllib' module I couldn't find anything about a
'.read()' function in it. Is that the right module to get an HTML page
into a string? Or am I completely missing something here? I'll take
this as the more likely of the two cases. Thanks for any and all help.

Steve Holden · Feb 8, 2006

I think you've misunderstood. There is no urllib.read(). You call
urllib.urlopen() with a URL as an argument. The object that this call
returns is file-like, in so far as you can call its read() method to get
the content of the web page as a string. The result looks like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html;charset=ISO-8859-1">
<meta name="generator" content="Adobe GoLive 6">
<meta http-equiv="DESCRIPTION" content="Holden Web provides
architectural design of databases and information systems, with
full-service implementation and support">
...
</tr>
</tbody>
</table>
</div>

You will find there are lots of other things you can do with that
file-like object too, but reading it is the important one as far as
using BeautifulSoup goes.
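
The urlopen()-then-read() pattern described above can be sketched as follows. This is only an illustration under a few assumptions: it is written for Python 3, where the function lives at urllib.request.urlopen() rather than urllib.urlopen(); the helper name page_to_string and the DEMO_URL value are made up for this example; and a data: URL stands in for a real site so the snippet runs without network access.

```python
# Python 3 sketch; in the Python 2 of this thread the call is urllib.urlopen().
from urllib.request import urlopen


def page_to_string(url, encoding="utf-8"):
    """Open url, read the whole response, and return it as a str.

    page_to_string is a hypothetical helper name, not part of urllib.
    """
    response = urlopen(url)        # file-like object
    try:
        raw = response.read()      # the whole page, as bytes in Python 3
    finally:
        response.close()
    # The encoding is an assumption; real pages may declare a different one.
    return raw.decode(encoding, errors="replace")


# Hypothetical stand-in for a real page, so no network access is needed.
DEMO_URL = "data:text/html,<html><body><p>hello</p></body></html>"

html = page_to_string(DEMO_URL)
print(html)
```

The resulting string is what you would then hand to BeautifulSoup. With a real site you would pass an http:// or https:// URL instead of DEMO_URL.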

regards
Steve

Jason Earl · Feb 8, 2006

Here's a short example of how this all works:

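#!/usr/bin/env python

import urllib2
from BeautifulSoup import BeautifulSoup

# urlopen() returns a file-like response object
response = urllib2.urlopen('http://www.cnn.com')
# BeautifulSoup accepts the file-like object and reads it for you
soup = BeautifulSoup(response)
print soup.prettify()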

It's not a particularly useful example, unless, of course, you wish to
prettify CNN's HTML, but it should get you to the point where
BeautifulSoup's documentation starts to make sense.

Jason

Tempo · Feb 8, 2006

Perfect. Thanks a bunch for clearing that all up for me. You have
saved me some long hours that would otherwise have been lost.
