HTML data extraction?

Dave Kuhlman

I recently read an article by Jon Udell about extracting data from
Web pages as a poor person's Web services. So, I have a question:

Is there any Python support for finding and extracting information
from HTML documents?

I'd like something that would do things like the following:

- return the data which is inside a <b> tag which is inside a
<li> tag.

- return the data which is inside a <a> tag that has attribute
href="http://www.python.org".

- Etc.

It would be a sort of structured grep for HTML.

I've found the HTMLParser and htmllib modules in the Python
standard library, but I'm wondering if there is anything at a
higher level.

Web searches did not turn up anything interesting.

Thanks for help.

Dave
 
djw

I don't know if there is anything at a higher level (I guess a Google
session would tell you that), but doing what you describe with the
HTMLParser module is very straightforward. All you have to do is keep
some state flags in the derived HTMLParser class that indicate the
found/not-found state of what you are looking for and have that control
the collection of data between the flags.

Starting with the example in the docs, with some (untested) additions:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    def __init__(self):
        HTMLParser.__init__(self)
        # State flags: are we currently inside a <b> / <li> element?
        self.in_bold_tag = False
        self.in_list_tag = False
        # Text collected while inside both a <b> and an <li>
        self.data_in_bold_list = ''

    def handle_starttag(self, tag, attrs):
        print "Encountered the beginning of a %s tag" % tag
        if tag == 'b':
            self.in_bold_tag = True
        if tag == 'li':
            self.in_list_tag = True

    def handle_endtag(self, tag):
        print "Encountered the end of a %s tag" % tag
        if tag == 'b':
            self.in_bold_tag = False
        if tag == 'li':
            self.in_list_tag = False

    def handle_data(self, data):
        if self.in_bold_tag and self.in_list_tag:
            self.data_in_bold_list += data

This is just an outline, but you get the idea...
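
For what it's worth, the same flag technique also covers the second request (text inside an <a> with a given href). Below is a minimal sketch, assuming Python 3, where the module is spelled html.parser. The names LinkTextParser, extract_link_texts and the sample markup are made up for illustration, and it does not try to handle nested or unclosed <a> tags:

```python
from html.parser import HTMLParser  # Python 3 spelling of the HTMLParser module


class LinkTextParser(HTMLParser):
    """Collect the text of every <a> whose href equals target_href."""

    def __init__(self, target_href):
        super().__init__()
        self.target_href = target_href
        self.in_target_link = False  # state flag: inside a matching <a>?
        self.current_text = []       # text pieces of the currently open link
        self.link_texts = []         # one finished string per matching link

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == 'a' and dict(attrs).get('href') == self.target_href:
            self.in_target_link = True
            self.current_text = []

    def handle_endtag(self, tag):
        if tag == 'a' and self.in_target_link:
            self.in_target_link = False
            self.link_texts.append(''.join(self.current_text))

    def handle_data(self, data):
        if self.in_target_link:
            self.current_text.append(data)


def extract_link_texts(html, href):
    """Hypothetical helper: feed a whole document and return the matches."""
    parser = LinkTextParser(href)
    parser.feed(html)
    parser.close()
    return parser.link_texts


# Made-up sample input
sample = ('<ul><li><a href="http://www.python.org">Python <b>home</b></a></li>'
          '<li><a href="http://example.com">Other</a></li></ul>')
print(extract_link_texts(sample, 'http://www.python.org'))
```

Text inside child tags of the link (the <b> here) is collected too, since the flag stays set until the closing </a>.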

-Don
 
John J. Lee

[Sorry if this got posted twice, not sure what I did...]

Dave Kuhlman said:
> I'd like something that would do things like the following:
>
> - return the data which is inside a <b> tag which is inside a
>   <li> tag.
>
> - return the data which is inside a <a> tag that has attribute
>   href="http://www.python.org".
>
> - Etc.
>
> It would be a sort of structured grep for HTML.

1. http://wwwsearch.sf.net/bits/pullparser.py

It's a port of Perl's HTML::TokeParser.

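import pullparser

# f is an open file-like object containing the HTML
p = pullparser.PullParser(f)
p.get_tag("li")  # skip ahead to the enclosing <li> first
p.get_tag("b")   # then to the <b> inside it
print p.get_text()


p = pullparser.PullParser(f)
for tag in p:
    tag = p.get_tag("a")
    if dict(tag.attrs).get("href") == "http://www.python.org":
        print p.get_text()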

I'll release a beta version in a day or so with a couple of minor
changes (including that .get_text() will no longer raise
NoMoreTagsError) and a proper tarball package.


2. Run your data through mxTidy or uTidylib to get XHTML, then query it
with XPath from PyXML.

http://www.zvon.org/xxl/XPathTutorial/General/examples.html

In fact, tidying HTML is sometimes necessary even if you don't need
XHTML or a tree-based API.
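
A rough sketch of what option 2 looks like once the markup is well-formed. It uses xml.etree.ElementTree from the standard library as a stand-in for PyXML's XPath (ElementTree only supports a small XPath subset). The xhtml string is a made-up sample that is assumed to be already tidied and free of namespace declarations. Real tidy output usually declares the XHTML namespace, which the paths would then have to include:

```python
import xml.etree.ElementTree as ET

# Assumption: this is what comes out of the tidy step (well-formed, no xmlns).
xhtml = """<html><body>
<ul>
  <li><b>first</b> item</li>
  <li>plain item</li>
  <li><b>second</b> item</li>
</ul>
<p><a href="http://www.python.org">Python</a> and
<a href="http://example.com">elsewhere</a></p>
</body></html>"""

root = ET.fromstring(xhtml)

# Every <b> that is a direct child of an <li>, anywhere in the tree
bold_in_list = [b.text for b in root.findall('.//li/b')]

# Every <a> whose href attribute has exactly this value
python_links = [a.text
                for a in root.findall(".//a[@href='http://www.python.org']")]

print(bold_in_list)
print(python_links)
```

Each query is a one-line path expression, which is about as close to a "structured grep" as it gets, but it only works if the input parses as XML in the first place.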


3. microdom

http://www.xml.com/pub/a/2003/10/15/microdom.html

Haven't used it myself.


John
 
