HTML data extraction?

Dave Kuhlman · Dec 22, 2003

I recently read an article by Jon Udell about extracting data from
Web pages as a poor person's Web services. So, I have a question:

Is there any Python support for finding and extracting information
from HTML documents?

I'd like something that would do things like the following:

- return the data which is inside a <b> tag which is inside a
<li> tag.

- return the data which is inside a <a> tag that has attribute
href="http://www.python.org".

- Etc.

It would be a sort of structured grep for HTML.

I've found the HTMLParser and htmllib modules in the Python
standard library, but I'm wondering if there is anything at a
higher level.

Web searches did not turn up anything interesting.

Thanks for any help.

Dave

djw · Dec 22, 2003

I don't know if there is anything at a higher level (I guess a Google
session would tell you that), but doing what you describe with the
HTMLParser module is very straightforward. All you have to do is keep
some state flags in the derived HTMLParser class that indicate the
found/not-found state of what you are looking for and have that control
the collection of data between the flags.

Starting with the example in the docs and making some (untested) additions:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    def __init__(self):
        HTMLParser.__init__(self)
        # State flags: are we currently inside a <b> / <li> element?
        self.in_bold_tag = False
        self.in_list_tag = False
        # Text collected while inside both a <b> and an <li>.
        self.data_in_bold_list = ''

    def handle_starttag(self, tag, attrs):
        print "Encountered the beginning of a %s tag" % tag
        if tag == 'b': self.in_bold_tag = True
        if tag == 'li': self.in_list_tag = True

    def handle_endtag(self, tag):
        print "Encountered the end of a %s tag" % tag
        if tag == 'b': self.in_bold_tag = False
        if tag == 'li': self.in_list_tag = False

    def handle_data(self, data):
        # Only keep text seen while both flags are set.
        if self.in_bold_tag and self.in_list_tag:
            self.data_in_bold_list += data

This is just an outline, but you get the idea...

-Don
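
For anyone who wants to try the outline above, here is a minimal sketch of the same state-flag idea, assuming a Python 3 interpreter (where the module is named html.parser rather than HTMLParser). The class name BoldInListParser and the helper extract_bold_in_list are made up for this illustration, and the flags assume every <li> and <b> is explicitly closed.

```python
from html.parser import HTMLParser


class BoldInListParser(HTMLParser):
    """Collect text that appears inside a <b> that is inside an <li>.

    Hypothetical name; this is the flag-based outline from the post,
    written for Python 3.
    """

    def __init__(self):
        super().__init__()
        self.in_bold_tag = False
        self.in_list_tag = False
        self.results = []  # one entry per chunk of matching text

    def handle_starttag(self, tag, attrs):
        if tag == 'b':
            self.in_bold_tag = True
        if tag == 'li':
            self.in_list_tag = True

    def handle_endtag(self, tag):
        if tag == 'b':
            self.in_bold_tag = False
        if tag == 'li':
            self.in_list_tag = False

    def handle_data(self, data):
        # Keep the text only while both flags are set.
        if self.in_bold_tag and self.in_list_tag:
            self.results.append(data)


def extract_bold_in_list(html):
    """Illustrative helper: parse a string and return the collected text."""
    parser = BoldInListParser()
    parser.feed(html)
    parser.close()
    return parser.results


if __name__ == '__main__':
    sample = ('<ul><li>one <b>two</b></li><li><b>three</b> four</li></ul>'
              '<b>five</b>')
    print(extract_bold_in_list(sample))
```

Simple booleans are enough for flat lists. Nested lists, or pages that leave out the closing </li>, would need depth counters or a tidying pass first.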

John J. Lee · Dec 22, 2003

[Sorry if this got posted twice, not sure what I did...]

Dave Kuhlman said:
I'd like something that would do things like the following:

- return the data which is inside a <b> tag which is inside a
<li> tag.

- return the data which is inside a <a> tag that has attribute
href="http://www.python.org".

- Etc.

It would be a sort of structured grep for HTML.

1. http://wwwsearch.sf.net/bits/pullparser.py

It's a port of Perl's HTML::TokeParser.

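import pullparser

# f is an open file object containing the HTML

# text of the first <b> that follows an <li>
p = pullparser.PullParser(f)
p.get_tag("li")
p.get_tag("b")
print p.get_text()

# text of each <a> whose href is http://www.python.org
p = pullparser.PullParser(f)
for tag in p:
    tag = p.get_tag("a")
    if dict(tag.attrs).get("href") == "http://www.python.org":
        print p.get_text()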

I'll release a beta version in a day or so with a couple of minor
changes (including that .get_text() will no longer raise
NoMoreTagsError) and a proper tarball package.
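
If pullparser is not to hand, the second query (text of an <a> with a given href) can also be sketched with the standard library alone. This is only an illustration, assuming Python 3's html.parser; the names LinkTextParser and link_text are invented here and are not part of pullparser or any other library.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect the text of every <a> whose href equals target_href.

    Hypothetical helper class, not taken from any library.
    """

    def __init__(self, target_href):
        super().__init__()
        self.target_href = target_href
        self.capturing = False
        self.chunks = []   # text pieces of the link being captured
        self.texts = []    # finished link texts

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, as in pullparser's tag.attrs
        if tag == 'a' and dict(attrs).get('href') == self.target_href:
            self.capturing = True
            self.chunks = []

    def handle_data(self, data):
        if self.capturing:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self.capturing:
            self.texts.append(''.join(self.chunks))
            self.capturing = False


def link_text(html, href):
    """Illustrative helper: return the text of all links pointing at href."""
    parser = LinkTextParser(href)
    parser.feed(html)
    parser.close()
    return parser.texts


if __name__ == '__main__':
    page = ('<p><a href="http://www.python.org">Python <b>home</b></a> and '
            '<a href="http://example.com">other</a></p>')
    print(link_text(page, 'http://www.python.org'))
```

Text inside nested inline tags (the <b> in the sample) is included, because capturing only stops at the closing </a>.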

2. Stuff your data through mxTidy or uTidylib to get XHTML, then query
it with XPath from PyXML.

http://www.zvon.org/xxl/XPathTutorial/General/examples.html

In fact, tidying HTML is sometimes necessary even if you don't need
XHTML or a tree-based API.
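
To give a feel for the XPath route, here is a rough sketch. It assumes the page has already been tidied into well-formed XHTML, and it swaps in Python 3's xml.etree.ElementTree (which supports only a limited XPath subset) in place of PyXML. The function names and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Tidied XHTML normally carries this default namespace.
XHTML_NS = {'x': 'http://www.w3.org/1999/xhtml'}


def xpath_bold_in_list(xhtml):
    """Text of every <b> that is a direct child of an <li>."""
    root = ET.fromstring(xhtml)
    return [''.join(b.itertext())
            for b in root.findall('.//x:li/x:b', XHTML_NS)]


def xpath_link_text(xhtml, href):
    """Text of every <a> whose href attribute equals href.

    Assumes href contains no single quote, since it is pasted
    straight into the path expression.
    """
    root = ET.fromstring(xhtml)
    path = ".//x:a[@href='%s']" % href
    return [''.join(a.itertext()) for a in root.findall(path, XHTML_NS)]


if __name__ == '__main__':
    doc = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
    <ul><li>one <b>two</b></li></ul>
    <p><a href="http://www.python.org">Python</a> <b>loose</b></p>
    </body></html>"""
    print(xpath_bold_in_list(doc))
    print(xpath_link_text(doc, 'http://www.python.org'))
```

The path .//x:li/x:b only matches a <b> directly under an <li>; matching a <b> at any depth below the <li> would need .//x:li//x:b instead.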

3. microdom

http://www.xml.com/pub/a/2003/10/15/microdom.html

Haven't used it myself.

John
