Parsing HTML with BeautifulSoup

Johann Spies · Dec 10, 2009

I am trying to get CSV output from an HTML file.

With this code I had a little success:
=========================
from BeautifulSoup import BeautifulSoup

f = open("configuration.html", "r")
g = open("configuration.csv", "w")
soup = BeautifulSoup(f)
for table in soup.findAll('table'):
    rows = table.findAll('tr')

    # Header row: write the first text node of each <th>.
    for th in rows[0].findAll('th'):
        g.write(th.find(text=True))
        g.write(',')
    g.write("\n")

    # Data rows: write the first text node of each <td>, spaces removed.
    for tr in rows[1:]:
        for td in tr.findAll('td'):
            try:
                g.write(td.find(text=True).replace(' ', ''))
            except AttributeError:
                # find() returned None: the cell has no text node.
                g.write('')
            g.write(",")
        g.write("\n")

f.close()
g.close()
===============================

producing output like this:

RULE,SOURCE,DESTINATION,SERVICES,ACTION,TRACK,TIME,INSTALL ON,COMMENTS,
1,,,,drop,Log,Any,,,
2,All Users@Any,,Any,clientencrypt,Log,Any,,,
3,Any,Any,,drop,None,Any,,,
4,,,,drop,None,Any,,,
....

It left out all the non-plaintext parts of each <td></td> cell.

I then tried renderContents() instead and got something like this (one
line broken into many for the sake of this email):

1,<img src=icons/group.png> <a href=#OBJ_sunetint>
Rainwall_Cluster</A> <BR>,
<img src=icons/udp.png> <a href=#SVC_IKE >IKE</a><br>,
<img src=icons/drop.png> drop,
<img src=icons/log.png> Log ,
Rainwall_Cluster</A> <BR> ,

How do I get BeautifulSoup to render (taking the above line as an
example)

sunetint for <img src=icons/group.png> <a
href=#OBJ_sunetint>sunetint</A><BR>

and still return the text of the <td>s that contain only plain text?

I have experimented a little with regular expressions, but so far
could not find a solution.

Regards
Johann
--
Johann Spies Telefoon: 021-808 4599
Informasietegnologie, Universiteit van Stellenbosch

"Lo, children are an heritage of the LORD: and the
fruit of the womb is his reward." Psalms 127:3
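
One possible way to get the output asked for above is to collect every text fragment inside a cell rather than only the first one. The sketch below is not BeautifulSoup: it swaps in Python 3's standard-library html.parser and csv modules so that it runs without extra packages. The names TableTextParser, table_to_csv and SAMPLE are hypothetical, and SAMPLE is only a guess at the table structure, built around the cell fragments quoted in the post. It also assumes that tables are not nested.

```python
import csv
import io
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row.

    Hypothetical helper; assumes tables are not nested.
    """

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()          # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th"):
            self._close_cell()         # tolerate a missing </td>
            if self._row is None:
                self._row = []
            self._cell = []
        # Any other tag (<img>, <a>, <br>, ...) is skipped; its text is kept.

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()

    def handle_data(self, data):
        # Keep text only while inside a cell, at any nesting depth.
        if self._cell is not None:
            self._cell.append(data)

    def _close_cell(self):
        if self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Assumed sample; the cell contents are taken from the post.
SAMPLE = """
<table>
<tr><th>RULE</th><th>SOURCE</th><th>SERVICES</th><th>ACTION</th></tr>
<tr>
<td>1</td>
<td><img src=icons/group.png> <a href=#OBJ_sunetint>sunetint</A><BR></td>
<td><img src=icons/udp.png> <a href=#SVC_IKE >IKE</a><br></td>
<td><img src=icons/drop.png> drop</td>
</tr>
</table>
"""


def table_to_csv(html):
    """Return the table cells found in html as CSV text."""
    parser = TableTextParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


print(table_to_csv(SAMPLE))
```

For the sample this prints the header line followed by 1,sunetint,IKE,drop. Writing through csv.writer also avoids the trailing comma on each line and quotes any cell that itself contains a comma.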
