parsing tables with beautiful soup?

C

cjl

I am learning python and beautiful soup, and I'm stuck.

A web page has a table that contains data I would like to scrape. The
table has a unique class, so I can use:

soup.find("table", {"class": "class_name"})

This isolates the table. So far, so good. Next, this table has a
certain number of rows (I won't know ahead of time how many), and each
row has a set number of cells (which will be constant).

I couldn't find example code on how to loop through the contents of
the rows and cells of a table using beautiful soup. I'm guessing I
need an outer loop for the rows and an inner loop for the cells, but I
don't know how to iterate over the tags that I want. The beautiful
soup documentation is a little beyond me at this point.

Can anyone point me in the right direction?

thanks again,
cjl
 
C

cjl

This works:

for row in soup.find("table",{"class": "class_name"}):
for cell in row:
print cell.contents[0]

Is there a better way to do this?

-cjl
 
D

Duncan Booth

cjl said:
This works:

for row in soup.find("table",{"class": "class_name"}):
for cell in row:
print cell.contents[0]

Is there a better way to do this?

It may work for the page you are testing against, but it wouldn't work if
your page contained valid HTML. You are assuming that the TR elements are
direct children of the TABLE, but HTML requires that the TR elements appear
inside THEAD, TBODY or TFOOT elements, so if anyone ever corrects the html
your code will break.

Something like this (untested) ought to work and be reasonably robust:

table = soup.find("table",{"class": "class_name"})
for row in table.findAll("tr"):
for cell in row.findAll("td"):
print cell.findAll(text=True)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,995
Messages
2,570,236
Members
46,822
Latest member
israfaceZa

Latest Threads

Top