python fast HTML data extraction library

Filip

Hello,

Some time ago I was searching for a library that would simplify mass
data scraping/extraction from web pages. A Python XPath implementation
seemed like the way to go. The problem was that most of the HTML on
the net doesn't conform to XML standards, not even the XHTML pages
(including those advertised as valid XHTML).

I tried to fix that with BeautifulSoup plus regexp filtering of
particular cases I encountered. That was slow, and after running my
data scraper for some time a lot of new problems (exceptions from the
XPath parser) kept showing up. Not to mention that BeautifulSoup
stripped almost all of the content from some heavily broken pages
(a 50+ KiB page reduced to a few hundred bytes). Character encoding
conversion was hell too: even UTF-8 pages had some non-standard
characters causing issues.


Cutting to the chase: that's when I decided to take the matter into
my own hands. I hacked together a solution sporting a completely new
approach overnight. It's called htxpath, a small, lightweight (and
dependency-free) Python library which lets you extract specific
tag(s) from an HTML document using a path string whose syntax is very
similar to XPath (but more convenient in some cases). It did a very
good job for me.

My library, rather than parsing the whole input into a tree, processes
it like a char stream with regular expressions.
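
(For the curious, here is a minimal sketch of that general idea; this
is only an illustration of the technique, not htxpath's actual code.)

import re

# Scan the document as one flat character stream, yielding each opening
# tag's name, raw attribute string and offset, without building a tree.
# Closing tags are simply skipped.
tag_re = re.compile(r'<\s*([a-zA-Z][a-zA-Z0-9]*)([^>]*?)/?>')

def iter_tags(html):
    for m in tag_re.finditer(html):
        yield m.group(1).lower(), m.group(2), m.start()

for name, attrs, pos in iter_tags('<p class="x">hi<br /></p>'):
    print((name, attrs, pos))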

I decided to share it with everyone, so here it is: http://code.google.com/p/htxpath/
I am aware that it is not beautifully coded, as my experience with
Python is rather brief, but I am curious whether it will be useful to
anyone (it is also my first potentially [real-world ;)] useful project
gone public). If it is, I promise to continue developing it. It's
probably full of bugs, but I can't catch them all by myself.

regards,
Filip Sobalski
 
Paul McGuire

> My library, rather than parsing the whole input into a tree, processes
> it like a char stream with regular expressions.

Filip -

In general, parsing HTML with re's is fraught with easily-overlooked
deviations from the norm. But since you have stepped up to the task,
here are some comments on your re's:

# You should use raw string literals throughout, as in:
# blah_re = re.compile(r'sljdflsflds')
# (note the leading r before the string literal). Raw string literals
# really help keep your re expressions clean, so that you don't ever
# have to double up any '\' characters.
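
For example, to match one literal backslash:

# Without a raw literal you must write four backslashes...
backslash_re = re.compile('\\\\')
# ...but a raw string literal needs only two:
backslash_re = re.compile(r'\\')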

# Attributes might be enclosed in single quotes, or not enclosed in
# any quotes at all.
attr_re = re.compile('([\da-z]+?)\s*=\s*\"(.*?)\"',
                     re.DOTALL | re.UNICODE | re.IGNORECASE)
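
One alternation can cover double quotes, single quotes and bare values
(a sketch, not a drop-in replacement; the value ends up in group 2, 3
or 4 of each match):

attr_re = re.compile(r'([\da-z]+?)\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))',
                     re.UNICODE | re.IGNORECASE)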

# Needs re.IGNORECASE, and can have tag attributes, such as <BR CLEAR="ALL">
line_break_re = re.compile('<br\/?>', re.UNICODE)
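
Something like this would accept both (a sketch; attribute values
containing '>' would still need more care):

line_break_re = re.compile(r'<br(?:\s[^>]*)?/?>',
                           re.UNICODE | re.IGNORECASE)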

# What about HTML entities defined using hex syntax, such as &#xA0;?
amp_re = re.compile('\&(?![a-z]+?\;)', re.UNICODE | re.IGNORECASE)
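
A lookahead that also admits hex character references might look like
this (a sketch):

amp_re = re.compile(r'&(?!(?:[a-z]+|#x[0-9a-f]+);)',
                    re.UNICODE | re.IGNORECASE)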

How would you extract data from a table? For instance, how would you
extract the data entries from the table at this URL:
http://tf.nist.gov/tf-cgi/servers.cgi ? This would be a good example
snippet for your module documentation.
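
For a rough idea of what that scan might look like with plain re's (a
sketch only; the real page needs care with nested tags and attributes):

import re

row_re  = re.compile(r'<tr[^>]*>(.*?)</tr>', re.DOTALL | re.IGNORECASE)
cell_re = re.compile(r'<td[^>]*>(.*?)</td>', re.DOTALL | re.IGNORECASE)
tag_re  = re.compile(r'<[^>]+>')

def table_rows(html):
    # Yield each row as a list of tag-stripped, whitespace-trimmed cells.
    for row in row_re.finditer(html):
        yield [tag_re.sub('', c.group(1)).strip()
               for c in cell_re.finditer(row.group(1))]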

Try extracting all of the <a href=...>sldjlsfjd</a> links from
yahoo.com, and see how much of what you expect actually gets matched.
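
Even a generous pattern (again, just a sketch) will show how quickly
the quoting variants multiply on a real page:

import re

href_re = re.compile(
    r'<a\s[^>]*?href\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))',
    re.IGNORECASE | re.DOTALL)

def links(html):
    # Exactly one of the three groups is non-empty per match.
    return [g1 or g2 or g3 for g1, g2, g3 in href_re.findall(html)]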

Good luck!

-- Paul
 
Aahz

> I tried to fix that with BeautifulSoup plus regexp filtering of
> particular cases I encountered. That was slow, and after running my
> data scraper for some time a lot of new problems (exceptions from the
> XPath parser) kept showing up. Not to mention that BeautifulSoup
> stripped almost all of the content from some heavily broken pages
> (a 50+ KiB page reduced to a few hundred bytes). Character encoding
> conversion was hell too: even UTF-8 pages had some non-standard
> characters causing issues.

Have you tried lxml?
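
It parses even badly broken markup and still gives you real XPath; a
minimal sketch:

import lxml.html

doc = lxml.html.fromstring('<p>broken <a href="/x">link</p>')
for href in doc.xpath('//a/@href'):
    print(href)
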
--
Aahz ([email protected]) <*> http://www.pythoncraft.com/

"At Resolver we've found it useful to short-circuit any doubt and just
refer to comments in code as 'lies'. :)"
--Michael Foord paraphrases Christian Muirhead on python-dev, 2009-03-22
 
John Machin

On Jul 22, 5:43 pm, Filip <[email protected]> wrote:
> # Needs re.IGNORECASE, and can have tag attributes, such as <BR CLEAR="ALL">
> line_break_re = re.compile('<br\/?>', re.UNICODE)

Just in case somebody actually uses valid XHTML :) it might be a good
idea to allow for whitespace before the slash, as in <br />.

> # What about HTML entities defined using hex syntax, such as &#xA0;?
> amp_re = re.compile('\&(?![a-z]+?\;)', re.UNICODE | re.IGNORECASE)

What about the decimal syntax ones? E.g. not only &nbsp; and &#xA0;
but also &#160;.

Also, entity names can contain digits, e.g. &sup1; and &frac34;.
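
A lookahead admitting all three forms might look like this (a sketch;
it leaves any well-formed character reference alone and matches only
bare ampersands):

import re

# Named entities may contain digits after the first letter; numeric
# references may be decimal (&#160;) or hex (&#xA0;).
amp_re = re.compile(r'&(?!(?:[a-z][a-z0-9]*|#[0-9]+|#x[0-9a-f]+);)',
                    re.UNICODE | re.IGNORECASE)

# And, for the XHTML point above, allow optional whitespace before the
# slash:
line_break_re = re.compile(r'<br\s*/?>', re.UNICODE | re.IGNORECASE)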
 
Filip

> # You should use raw string literals throughout, as in:
> # blah_re = re.compile(r'sljdflsflds')
> # (note the leading r before the string literal). Raw string literals
> # really help keep your re expressions clean, so that you don't ever
> # have to double up any '\' characters.

Thanks, I didn't know about that; I've updated my code.

> # Attributes might be enclosed in single quotes, or not enclosed in
> # any quotes at all.
> attr_re = re.compile('([\da-z]+?)\s*=\s*\"(.*?)\"',
>                      re.DOTALL | re.UNICODE | re.IGNORECASE)

Of course, you mean the attribute's *value* can be enclosed in single
or double quotes?
To be honest, I haven't seen the single-quote variant in HTML lately,
but I checked and it is indeed in the specs, and it can even be quite
useful (one learns something every day).
Thank you for pointing that out; I updated the code accordingly (I
just realized that the condition-check REs need an update too :/).

As far as the lack of value quoting is concerned, I am not so sure I
need this: it would significantly obfuscate my REs, and the practice
is rather deprecated, considered unsafe, and I've only seen it on very
old websites.
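
The updated pattern is roughly this (simplified here, not the exact
committed code; the value ends up in group 2 or 3):

attr_re = re.compile(r'([\da-z]+?)\s*=\s*(?:"([^"]*)"|\'([^\']*)\')',
                     re.DOTALL | re.UNICODE | re.IGNORECASE)
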
> How would you extract data from a table? For instance, how would you
> extract the data entries from the table at this URL:
> http://tf.nist.gov/tf-cgi/servers.cgi ? This would be a good example
> snippet for your module documentation.

This really seems like a nice example; I'll surely explain it in my
docs (examples are sorely needed there ;)).

> Try extracting all of the <a href=...>sldjlsfjd</a> links from
> yahoo.com, and see how much of what you expect actually gets matched.

The library was used in my humble production environment, processing a
few hundred thousand pages and spitting out about 10,000 SQL records,
so it does work quite well on a simple task like extracting all links.
However, I can't really say that the task introduced enough diversity
(there were only 9 different page templates) to call the library
'tested'...

> On Jul 22, 5:43 pm, Filip <[email protected]> wrote:
> > # Needs re.IGNORECASE, and can have tag attributes, such as <BR CLEAR="ALL">
> > line_break_re = re.compile('<br\/?>', re.UNICODE)
>
> Just in case somebody actually uses valid XHTML :) it might be a good
> idea to allow for whitespace before the slash, as in <br />.
>
> > # What about HTML entities defined using hex syntax, such as &#xA0;?
> > amp_re = re.compile('\&(?![a-z]+?\;)', re.UNICODE | re.IGNORECASE)
>
> What about the decimal syntax ones? E.g. not only &nbsp; and &#xA0;
> but also &#160;.
>
> Also, entity names can contain digits, e.g. &sup1; and &frac34;.

Thanks for pointing this out; I fixed that. Although it has very
little impact on how the library performs its main task (I'd still
like to see some comments on that ;)).
 
