Any equivalent to Ruby's 'hpricot' html/xpath/css selector package?

Kenneth McDonald

Ruby has a package called 'hpricot' which can perform limited xpath
queries, and CSS selector queries. However, what makes it really
useful is that it does a good job of handling the "broken" html that
is so commonly found on the web. Does Python have anything similar,
i.e. something that will not only do XPath queries, but will do so on
imperfect HTML? (A good HTML neatener would also be fine, of course,
as I could then pass the result to a Python XPath package.)

And, what are people's favorite Python XPath solutions?

Thanks,
Ken McDonald
 
Bruno Desthuilliers

Kenneth McDonald wrote:
Ruby has a package called 'hpricot' which can perform limited xpath
queries,

ElementTree? (It's in the stdlib now.)

and CSS selector queries.

PyQuery?
http://pypi.python.org/pypi/pyquery

However, what makes it really useful
is that it does a good job of handling the "broken" html that is so
commonly found on the web.

BeautifulSoup?
http://pypi.python.org/pypi/BeautifulSoup/3.0.7a

Possibly with ElementSoup?
http://pypi.python.org/pypi/ElementSoup/rev452
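
For anyone wondering what the ElementTree suggestion looks like in practice, here is a minimal sketch using only the standard library. The sample markup and variable names are made up for illustration. ElementTree supports only a subset of XPath and expects well-formed input, so it does not cover the "broken" HTML case without help from something like BeautifulSoup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not taken from the thread).
DOC = """<html><body>
<div class="post"><a href="http://example.com/a">first</a></div>
<div class="other"><a href="http://example.com/b">second</a></div>
</body></html>"""

root = ET.fromstring(DOC)

# Limited XPath: descendant search plus an attribute-equality predicate.
links = [a.get("href") for a in root.findall(".//div[@class='post']/a")]
print(links)

# Malformed markup is rejected rather than repaired.
try:
    ET.fromstring("<p>unclosed <b>tag</p>")
    rejected = False
except ET.ParseError:
    rejected = True
print("broken markup rejected:", rejected)
```

The second half shows why a separate cleanup step is needed before ElementTree can be used on real-world pages.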
 
Mark Thomas

Kenneth McDonald wrote:
Ruby has a package called 'hpricot' which can perform limited xpath
queries, and CSS selector queries. However, what makes it really
useful is that it does a good job of handling the "broken" html that
is so commonly found on the web. Does Python have anything similar,
i.e. something that will not only do XPath queries, but will do so on
imperfect HTML?

Hpricot is a fine package but I prefer Nokogiri (see
http://www.rubyinside.com/nokogiri-ruby-html-parser-and-xml-parser-1288.html)
because it is based on libxml2 and therefore is faster, conforms to
the full XPath 1.0 spec, works on imperfect HTML, and exposes the
Hpricot API.

In python, the equivalent is lxml (http://codespeak.net/lxml/), which
is similarly based on libxml2, very fast, XPath-1.0 conformant, and
exposes the now-standard ElementTree API.

The main difference is that lxml doesn't have CSS selector syntax, but
IMHO that's a gimmick when you have a full XPath 1.0 engine at your
disposal.

-- Mark.
 
Stefan Behnel

Kenneth said:
Ruby has a package called 'hpricot' which can perform limited xpath
queries, and CSS selector queries. However, what makes it really useful
is that it does a good job of handling the "broken" html that is so
commonly found on the web. Does Python have anything similar, i.e.
something that will not only do XPath queries, but will do so on
imperfect HTML?

lxml.html is your friend.

http://codespeak.net/lxml/lxmlhtml.html

Stefan
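
As a rough illustration of the lxml.html suggestion, the sketch below runs an XPath query against deliberately malformed markup. It assumes the third-party lxml package is installed; the sample HTML, URLs and variable names are invented for the example.

```python
import lxml.html  # third-party; assumes lxml is installed (e.g. pip install lxml)

# Hypothetical broken HTML: unclosed <p>, <div>, <a>, <body> and <html> tags.
BROKEN = """<html><body>
<div class="post"><a href="http://example.com/a">first</a>
<p>unclosed paragraph
<div class="post"><a href="http://example.com/b">second
"""

# The HTML parser recovers from the errors and builds a normal element tree.
doc = lxml.html.document_fromstring(BROKEN)

# Full XPath 1.0 is available on the repaired tree.
hrefs = doc.xpath("//div[@class='post']//a/@href")
texts = [a.text_content().strip() for a in doc.xpath("//a")]
print(hrefs)
print(texts)
```

lxml also provides an lxml.cssselect module that translates CSS selectors into XPath expressions. Depending on the lxml version, it may need the separate cssselect package, so it is left out of the sketch above.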
 
