RiGGa said:
Hi,
I want to parse a web page in Python and have it write certain values out to
a MySQL database. I really don't know where to start with parsing the HTML
code (I can work out the database part). I have had a look at htmllib
but I need more info. Can anyone point me in the right direction? A
tutorial or something would be great.
Many thanks
RiGga
RiGGa -
The following program is included in the examples shipped with pyparsing.
This uses a slightly different technique than working with a complete HTML
parser - instead, it scans the input HTML for an expected pattern, and
extracts it (and several named subfields). You can accomplish this same
behavior using regular expressions, but you might find pyparsing a bit
easier to read.
This program uses urllib to capture the HTML from NIST's time server web
site, then scans the HTML for NTP servers. The expected pattern is:
<td>ip-address</td><td>arbitrary text giving server location</td>
For example:
<td>132.163.4.101</td>
<td>NIST, Boulder, Colorado</td>
(pyparsing ignores whitespace, so the line breaks and tabs are not a
concern. If you convert this to regular expressions, you need to add \s*
wherever whitespace may appear.)
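For comparison, here is a minimal sketch of the regular-expression route mentioned above, written for Python 3 and using only the standard library. The sample rows reuse two servers from the output listing; the surrounding <tr> markup, the names SAMPLE_HTML, TIME_SERVER_RE and find_time_servers, and the exact regex are illustrative assumptions, not part of the pyparsing example.

```python
import re

# Two rows in the <td>ip</td><td>location</td> layout described above.
# The <tr> wrapper and indentation are assumed, to show whitespace handling.
SAMPLE_HTML = """
<tr>
    <td>132.163.4.101</td>
    <td>NIST, Boulder, Colorado</td>
</tr>
<tr>
    <td>128.138.140.44</td>
    <td>University of Colorado, Boulder</td>
</tr>
"""

TIME_SERVER_RE = re.compile(
    r"<td>\s*(?P<ipAddr>\d+\.\d+\.\d+\.\d+)\s*</td>"  # first cell: dotted address
    r"\s*"                                             # line breaks/tabs between cells
    r"<td>\s*(?P<loc>[^<]+?)\s*</td>"                  # second cell: location text
)


def find_time_servers(html):
    """Return {ip address: location} for each matching pair of cells."""
    return {m.group("ipAddr"): m.group("loc")
            for m in TIME_SERVER_RE.finditer(html)}


for ip, loc in find_time_servers(SAMPLE_HTML).items():
    print(ip, "-", loc)
```

Every \s* in the pattern is whitespace handling that pyparsing does implicitly, which is most of the readability difference.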
The output from running the pyparsing program is:
129.6.15.28 - NIST, Gaithersburg, Maryland
129.6.15.29 - NIST, Gaithersburg, Maryland
132.163.4.101 - NIST, Boulder, Colorado
132.163.4.102 - NIST, Boulder, Colorado
132.163.4.103 - NIST, Boulder, Colorado
128.138.140.44 - University of Colorado, Boulder
192.43.244.18 - NCAR, Boulder, Colorado
131.107.1.10 - Microsoft, Redmond, Washington
69.25.96.13 - Symmetricom, San Jose, California
216.200.93.8 - Abovenet, Virginia
208.184.49.9 - Abovenet, New York City
207.126.98.204 - Abovenet, San Jose, California
207.200.81.113 - TrueTime, AOL facility, Sunnyvale, California
64.236.96.53 - TrueTime, AOL facility, Virginia
Download pyparsing at
http://pyparsing.sourceforge.net .
-- Paul
# getNTPservers.py
#
# Demonstration of the parsing module, implementing a HTML page scanner,
# to extract a list of NTP time servers from the NIST web site.
#
# Copyright 2004, by Paul McGuire
#
from pyparsing import Word, Combine, Suppress, CharsNotIn, nums
import urllib
integer = Word(nums)
ipAddress = Combine( integer + "." + integer + "." + integer + "." + integer )
tdStart = Suppress("<td>")
tdEnd = Suppress("</td>")
timeServerPattern = tdStart + ipAddress.setResultsName("ipAddr") + tdEnd + \
                    tdStart + CharsNotIn("<").setResultsName("loc") + tdEnd
# get list of time servers
nistTimeServerURL = "http://www.boulder.nist.gov/timefreq/service/time-servers.html"
serverListPage = urllib.urlopen( nistTimeServerURL )
serverListHTML = serverListPage.read()
serverListPage.close()
addrs = {}
for srvr,startloc,endloc in timeServerPattern.scanString( serverListHTML ):
    print srvr.ipAddr, "-", srvr.loc
    addrs[srvr.ipAddr] = srvr.loc
    # or do this:
    #~ addr,loc = srvr
    #~ print addr, "-", loc
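
The listing above is Python 2 (print statement, urllib.urlopen). A rough Python 3 sketch of the same scan follows, assuming pyparsing is installed and still accepts the camelCase method names used in the listing. To keep it runnable offline it scans a small inline sample instead of downloading the NIST page; sampleHTML and scan_servers are names made up for this illustration.

```python
from pyparsing import Word, Combine, Suppress, CharsNotIn, nums

# Same grammar as the original listing.
integer = Word(nums)
ipAddress = Combine(integer + "." + integer + "." + integer + "." + integer)
tdStart = Suppress("<td>")
tdEnd = Suppress("</td>")
timeServerPattern = (
    tdStart + ipAddress.setResultsName("ipAddr") + tdEnd
    + tdStart + CharsNotIn("<").setResultsName("loc") + tdEnd
)

# Assumed stand-in for the downloaded page: two rows from the output above.
# To fetch a live page instead (not run here; utf-8 is an assumption):
#   from urllib.request import urlopen
#   with urlopen(url) as page:
#       html = page.read().decode("utf-8", errors="replace")
sampleHTML = """
<td>132.163.4.101</td>
<td>NIST, Boulder, Colorado</td>
<td>192.43.244.18</td>
<td>NCAR, Boulder, Colorado</td>
"""


def scan_servers(html):
    """Collect {ip address: location} from every match found in html."""
    addrs = {}
    for srvr, startloc, endloc in timeServerPattern.scanString(html):
        addrs[srvr.ipAddr] = srvr.loc
    return addrs


for ipAddr, loc in scan_servers(sampleHTML).items():
    print(ipAddr, "-", loc)
```

In Python 3, read() on a urlopen response returns bytes, so the page has to be decoded to str before it is handed to scanString.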