Help with parsing web page

RiGGa · Jun 14, 2004

Hi,

I want to parse a web page in Python and have it write certain values out to
a mysql database. I really dont know where to start with parsing the html
code ( I can work out the database part ). I have had a look at htmllib
but I need more info. Can anyone point me in the right direction , a
tutorial or something would be great.

Many thanks

RiGga

RiGGa · Jun 15, 2004

RiGGa said:
Hi,

I want to parse a web page in Python and have it write certain values out
to
a mysql database. I really dont know where to start with parsing the html
code ( I can work out the database part ). I have had a look at htmllib
but I need more info. Can anyone point me in the right direction , a
tutorial or something would be great.

Many thanks

RiGga

Anyone?, I have found out I can use sgmllib but find the documentation is
not that clear, if anyone knows of a tutorial or howto it would be
appreciated.

thanks

R

Miki Tebeka · Jun 15, 2004

Hello RiGGa,

Anyone?, I have found out I can use sgmllib but find the documentation is
not that clear, if anyone knows of a tutorial or howto it would be
appreciated.

I'm not an expert but this is how I work:

You make a subclass of HTMLParser and override the callback functions.
Usually I use only start_<TAB> end_<TAB> and handle_data.
Since you don't know *when* each callback function is called you need to
keep an internal state. It can be a simple variable or a stack if you
want to deal with nested tags.

A short example:
#!/usr/bin/env python

from htmllib import HTMLParser
from formatter import NullFormatter

class TitleParser(HTMLParser):
def __init__(self):
HTMLParser.__init__(self, NullFormatter())
self.state = ""
self.data = ""

def start_title(self, attrs):
self.state = "title"
self.data = ""

def end_title(self):
print "Title:", self.data.strip()

def handle_data(self, data):
if self.state:
self.data += data

if __name__ == "__main__":
from sys import argv

parser = TitleParser()
parser.feed(open(argv[1]).read())

HTH.

RiGGa · Jun 15, 2004

Miki said:
Hello RiGGa,

Anyone?, I have found out I can use sgmllib but find the documentation is
not that clear, if anyone knows of a tutorial or howto it would be
appreciated.

Click to expand...

I'm not an expert but this is how I work:

You make a subclass of HTMLParser and override the callback functions.
Usually I use only start_<TAB> end_<TAB> and handle_data.
Since you don't know *when* each callback function is called you need to
keep an internal state. It can be a simple variable or a stack if you
want to deal with nested tags.

A short example:
#!/usr/bin/env python

from htmllib import HTMLParser
from formatter import NullFormatter

class TitleParser(HTMLParser):
def __init__(self):
HTMLParser.__init__(self, NullFormatter())
self.state = ""
self.data = ""

def start_title(self, attrs):
self.state = "title"
self.data = ""

def end_title(self):
print "Title:", self.data.strip()

def handle_data(self, data):
if self.state:
self.data += data

if __name__ == "__main__":
from sys import argv

parser = TitleParser()
parser.feed(open(argv[1]).read())

HTH.

Thanks for taking the time to help its appreciated, I am new to Python so a
little confused with what you have posted however I will go through it
again and se if it makes more sense.

Many thanks

Rigga

Thomas Guettler · Jun 15, 2004

Am Mon, 14 Jun 2004 17:48:33 +0100 schrieb RiGGa:

Hi,

I want to parse a web page in Python and have it write certain values out to
a mysql database. I really dont know where to start with parsing the html
code ( I can work out the database part ). I have had a look at htmllib
but I need more info. Can anyone point me in the right direction , a
tutorial or something would be great.

Hi,

Since HTML can be broken in several ways, I would
pipe the HTML thru tidy first. You can use the "-asxml"
option, and then parse the xml.

http://tidy.sourceforge.net/

Thomas

Paul McGuire · Jun 15, 2004

RiGGa said:
Hi,

I want to parse a web page in Python and have it write certain values out to
a mysql database. I really dont know where to start with parsing the html
code ( I can work out the database part ). I have had a look at htmllib
but I need more info. Can anyone point me in the right direction , a
tutorial or something would be great.

Many thanks

RiGga

RiGGa -

The following program is included in the examples shipped with pyparsing.
This uses a slightly different technique than working with a complete HTML
parser - instead, it scans the input HTML for an expected pattern, and
extracts it (and several named subfields). You can accomplish this same
behavior using regular expressions, but you might find pyparsing a bit
easier to read.

This program uses urllib to capture the HTML from NIST's time server web
site, then scans the HTML for NTP servers. The expected pattern is:

<td>ip-address</td><td>arbitrary text giving server location</td>

For example:
<td>132.163.4.101</td>
<td>NIST, Boulder, Colorado</td>

(pyparsing ignores whitespace, so the line breaks and tabs are not a
concern. If you convert to regexp's, you need to add re fields for the
whitespace.)

The output from running this program gives:
129.6.15.28 - NIST, Gaithersburg, Maryland
129.6.15.29 - NIST, Gaithersburg, Maryland
132.163.4.101 - NIST, Boulder, Colorado
132.163.4.102 - NIST, Boulder, Colorado
132.163.4.103 - NIST, Boulder, Colorado
128.138.140.44 - University of Colorado, Boulder
192.43.244.18 - NCAR, Boulder, Colorado
131.107.1.10 - Microsoft, Redmond, Washington
69.25.96.13 - Symmetricom, San Jose, California
216.200.93.8 - Abovenet, Virginia
208.184.49.9 - Abovenet, New York City
207.126.98.204 - Abovenet, San Jose, California
207.200.81.113 - TrueTime, AOL facility, Sunnyvale, California
64.236.96.53 - TrueTime, AOL facility, Virginia

Download pyparsing at http://pyparsing.sourceforge.net .

-- Paul

# getNTPservers.py
#
# Demonstration of the parsing module, implementing a HTML page scanner,
# to extract a list of NTP time servers from the NIST web site.
#
# Copyright 2004, by Paul McGuire
#
from pyparsing import Word, Combine, Suppress, CharsNotIn, nums
import urllib

integer = Word(nums)
ipAddress = Combine( integer + "." + integer + "." + integer + "." +
integer )
tdStart = Suppress("<td>")
tdEnd = Suppress("</td>")
timeServerPattern = tdStart + ipAddress.setResultsName("ipAddr") + tdEnd +
\
tdStart + CharsNotIn("<").setResultsName("loc") + tdEnd

# get list of time servers
nistTimeServerURL =
"http://www.boulder.nist.gov/timefreq/service/time-servers.html"
serverListPage = urllib.urlopen( nistTimeServerURL )
serverListHTML = serverListPage.read()
serverListPage.close()

addrs = {}
for srvr,startloc,endloc in timeServerPattern.scanString( serverListHTML ):
print srvr.ipAddr, "-", srvr.loc
addrs[srvr.ipAddr] = srvr.loc
# or do this:
#~ addr,loc = srvr
#~ print addr, "-", loc

wes weston · Jun 15, 2004

RiGGa said:
Hi,

I want to parse a web page in Python and have it write certain values out to
a mysql database. I really dont know where to start with parsing the html
code ( I can work out the database part ). I have had a look at htmllib
but I need more info. Can anyone point me in the right direction , a
tutorial or something would be great.

Many thanks

RiGga

RiGga,
If you want something, hopefully, not too simple. Frequently, you can
strip out the html and the resulting list will have a label followed by
the piece of data you want to save.
Do you need mysql code?
wes

def RemoveLessThanGreaterThanSectionsTokenize( s ):
state = 0
str = ""
list = []
for ch in s:
#grabbing good chars state
if state == 0: # s always starts with '<'
if ch == '<':
state = 1
if len(str) > 0:
list.append(str)
str = ""
else:
str += ch
#dumping bad chars state
elif state == 1: # looking for '>'
if ch == '>':
state = 0
return list

RiGGa · Jun 16, 2004

RiGGa said:
Hi,

I want to parse a web page in Python and have it write certain values out
to
a mysql database. I really dont know where to start with parsing the html
code ( I can work out the database part ). I have had a look at htmllib
but I need more info. Can anyone point me in the right direction , a
tutorial or something would be great.

Many thanks

RiGga

Many thanks for all your help, I will go away and digest it.

R

RiGGa · Jun 19, 2004

RiGGa said:
Miki said:

Hello RiGGa,

Anyone?, I have found out I can use sgmllib but find the documentation
is not that clear, if anyone knows of a tutorial or howto it would be
appreciated.

Click to expand...

I'm not an expert but this is how I work:

You make a subclass of HTMLParser and override the callback functions.
Usually I use only start_<TAB> end_<TAB> and handle_data.
Since you don't know *when* each callback function is called you need to
keep an internal state. It can be a simple variable or a stack if you
want to deal with nested tags.

A short example:
#!/usr/bin/env python

from htmllib import HTMLParser
from formatter import NullFormatter

class TitleParser(HTMLParser):
def __init__(self):
HTMLParser.__init__(self, NullFormatter())
self.state = ""
self.data = ""

def start_title(self, attrs):
self.state = "title"
self.data = ""

def end_title(self):
print "Title:", self.data.strip()

def handle_data(self, data):
if self.state:
self.data += data

if __name__ == "__main__":
from sys import argv

parser = TitleParser()
parser.feed(open(argv[1]).read())

HTH.

Click to expand...

Thanks for taking the time to help its appreciated, I am new to Python so
a little confused with what you have posted however I will go through it
again and se if it makes more sense.

Many thanks

Rigga

Said I would be back

How do I get the current position (offset) which I am at in the file?

I have tried getpos() and variations thereof and keep getting syntax
errors...

Thanks

R

David Fisher · Jun 19, 2004

[...]

How do I get the current position (offset) which I am at in the file?

I have tried getpos() and variations thereof and keep getting syntax
errors...

Thanks

R

http://www.python.org/doc/current/lib/bltin-file-objects.html

Web Page Parsing/Downloading	1	Nov 22, 2013
help with link parsing?	3	Dec 20, 2010
Troubles with Fullpage / please help	0	Dec 14, 2023
Help with my responsive home page	2	Dec 14, 2022
Help me, going live with a quiz	0	Jun 27, 2023
Sending data from web page to Raspberry Pi	0	Nov 26, 2022
Help with an algorythm	5	Aug 30, 2024
Help with Github???	2	Jan 6, 2024

Help with parsing web page

RiGGa

RiGGa

Miki Tebeka

RiGGa

Thomas Guettler

Paul McGuire

wes weston

RiGGa

RiGGa

David Fisher

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads