Johann Spies
I am trying to get CSV output from an HTML file.
With this code I had a little success:
=========================
from BeautifulSoup import BeautifulSoup
from string import replace, join
import re

f = open("configuration.html","r")
g = open("configuration.csv",'w')
soup = BeautifulSoup(f)
t = soup.findAll('table')
for table in t:
    rows = table.findAll('tr')
    for th in rows[0]:
        t = th.find(text=True)
        g.write(t)
        g.write(',')
    # print(','.join(t))
    for tr in rows:
        cols = tr.findAll('td')
        for td in cols:
            try:
                t = td.find(text=True).replace(' ','')
                g.write(t)
            except:
                g.write ('')
            g.write(",")
        g.write("\n")
===============================
producing output like this:
RULE,SOURCE,DESTINATION,SERVICES,ACTION,TRACK,TIME,INSTALL ON,COMMENTS,
1,,,,drop,Log,Any,,,
2,All Users@Any,,Any,clientencrypt,Log,Any,,,
3,Any,Any,,drop,None,Any,,,
4,,,,drop,None,Any,,,
....
It left out all the non-plaintext parts of <td></td>.
I then tried using
t.renderContents and got something like this (one line broken into
many for the sake of this email):
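1,<img src=icons/group.png> <a href=#OBJ_sunetint>
<img src=icons/udp.png> <a href=#SVC_IKE >IKE</a><br>,Rainwall_Cluster</A> <BR>,
<img src=icons/drop.png> drop,
<img src=icons/log.png> Log ,
Rainwall_Cluster</A> <BR> ,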
How do I get BeautifulSoup to render (taking the above line as an
example)
sunetint for <img src=icons/group.png> <a
href=#OBJ_sunetint>sunetint</A><BR>
and still provide the text parts of the <td>'s as plain text?
I have experimented a little with regular expressions, but so far
could not find a solution.
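A minimal sketch of the kind of extraction I am after, assuming Python 3
and using only the standard library's html.parser in place of
BeautifulSoup. The names TableText and table_to_csv and the sample markup
are made up for illustration. The likely reason the loop above loses
"sunetint" is that find(text=True) returns only the first text node of a
cell, which here is the non-breaking space in front of the link;
collecting every text node of the cell avoids that. In BeautifulSoup 3
the same idea would presumably be ''.join(td.findAll(text=True)).

```python
import csv
import io
from html.parser import HTMLParser


class TableText(HTMLParser):
    """Collect the full text of each <td>/<th>, including text inside <a> etc."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # &nbsp; arrives as "\xa0"
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the current cell, or None

    def _end_cell(self):
        if self._cell is not None:
            # str.split() treats "\xa0" as whitespace, so stray &nbsp; vanish.
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()          # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th"):
            self._end_cell()         # tolerate a missing </td>
            if self._row is None:
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "tr":
            self._end_row()
        elif tag in ("td", "th"):
            self._end_cell()

    def handle_data(self, data):
        if self._cell is not None:   # ignore text outside table cells
            self._cell.append(data)

    def close(self):
        super().close()
        self._end_row()              # flush a row left open at end of input


def table_to_csv(html):
    """Return the table cells found in html as CSV text."""
    parser = TableText()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Hypothetical sample, modelled on the rendered cells shown above.
sample = """
<table>
<tr><th>RULE</th><th>SOURCE</th><th>ACTION</th></tr>
<tr><td>1</td>
<td><img src=icons/group.png>&nbsp;<a href=#OBJ_sunetint>sunetint</A><BR></td>
<td><img src=icons/drop.png>&nbsp;drop</td></tr>
</table>
"""

print(table_to_csv(sample))
```

Writing through the csv module instead of g.write(",") also means that a
cell which itself contains a comma gets quoted rather than splitting the
row.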
Regards
Johann
--
Johann Spies Telefoon: 021-808 4599
Informasietegnologie, Universiteit van Stellenbosch
"Lo, children are an heritage of the LORD: and the
fruit of the womb is his reward." Psalms 127:3