[ANN] hpricot 0.7

_why · Mar 17, 2009

Please enjoy a succulent, new Hpricot. A bit faster, some Ruby 1.9
support, and assorted fixes.

gem install hpricot --source http://code.whytheluckystiff.net

It should show up at Rubyforge in a bit.

I'm sure you're wondering what's the reason for Hpricot updates, in
the face of heated competition from the Nokogiri and LibXML
libraries. Remember that Hpricot has no dependencies and is smaller
than either of those libs. Hpricot uses its own Ragel-based
parser, so you have the freedom to hack the parser itself, the code
is dwarven by comparison.

Best of all, Hpricot has run on JRuby in the past. And I am in the
process of merging some IronRuby code[1] and porting 0.7 to
JRuby. This means your code will run on a variety of Ruby platforms
without alteration. That alone makes it worthwhile, wouldn't you
agree?

Clearly, the benchmarks you see on Ruby Inside are skewed to favor
Nokogiri. They parse XML through Hpricot without using Hpricot.XML(),
which is not only wrong, but puts XML through needless HTML cleanup
operations. I am sure that Hpricot 0.7 still fares slower on large
documents. However, for instance, try testing a large amount of
small documents (a much more common scenario) with this latest
version.
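
A minimal sketch of the comparison suggested above: time the same parser over many small documents and over one large one. The `make_doc` and `time_parse` helpers and the stand-in parser are hypothetical, introduced only so the sketch runs without extra gems. With Hpricot installed, the stand-in would presumably be swapped for a lambda calling `Hpricot.XML`.

```ruby
require 'benchmark'

# Hypothetical helper: build an XML document containing `n` <item> elements.
def make_doc(n)
  "<items>" + (1..n).map { |i| "<item id=\"#{i}\">value #{i}</item>" }.join + "</items>"
end

# Run `parser` (any object responding to #call) over every document in `docs`
# and return the elapsed wall-clock time in seconds.
def time_parse(parser, docs)
  Benchmark.realtime { docs.each { |doc| parser.call(doc) } }
end

# Stand-in parser (an assumption, not a real XML parser): it only counts tags.
# With Hpricot installed, replace it with: ->(s) { Hpricot.XML(s) }
stand_in = ->(s) { s.scan(/<item\b/).size }

small_docs = Array.new(1_000) { make_doc(10) } # many small documents
large_docs = [make_doc(10_000)]                # one large document

puts format("1000 small docs: %.4fs", time_parse(stand_in, small_docs))
puts format("1 large doc:     %.4fs", time_parse(stand_in, large_docs))
```

Both runs cover the same total number of elements, so any difference in timing reflects per-document overhead rather than raw throughput.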

You have to question a benchmark that is entirely based on two XML
documents. What about HTML fix ups? What about various platforms
and CPUs? Why not treat Hpricot fairly and use it properly in the
benchmarks? It reeks of something.

_why

[1] http://github.com/nrk/ironruby-hpricot/tree/master

matt neuburg · Mar 17, 2009

_why said:
I'm sure you're wondering what's the reason for Hpricot updates, in
the face of heated competition from the Nokogiri and LibXML
libraries. Remember that Hpricot has no dependencies and is smaller
than either of those libs. Hpricot uses its own Ragel-based
parser, so you have the freedom to hack the parser itself, the code
is dwarven by comparison.

Also, isn't Hpricot more accepting of skanky HTML? m.

Phlip · Mar 17, 2009

Firstly, major props, and keep up the good work...

_why said:
You have to question a benchmark that is entirely based on two XML
documents. What about HTML fix ups? What about various platforms
and CPUs? Why not treat Hpricot fairly and use it properly in the
benchmarks? It reeks of something.

Here's what I use N (Nokogiri) for:

//form[
  ./descendant::fieldset[
    ./descendant::legend and
    ./descendant::li[
      ./descendant::label and
      ./descendant::input ]
  ]
]

I generate that from some N::HTML::Builder code, form{ fieldset { etc } }, which
turns into a DOM containing <form><fieldset> etc </fieldset></form>. The goal is
an assertion like this:

assert_xhtml do
  h2 'Sales'
  select! :size => SaleController::LIST_SIZE do
    option names[1]
    option names[0]
  end
end

The point is to match an example HTML to a target HTML. I first tried it by
walking that object model myself, recursing thru all DOM children to find the
ones that match. However, as the recursion got more complex, I was "adding
epicycles" to the code.

I backed off and rewrote, by first converting all the example HTML into one
jiy-normous XPath, shown above. I have to do it like this because the example
HTML could contain _anything_, and I need the query to run fast and absolutely
stable. My assert_xhtml should not fail if the target code has the correct HTML
subset - or vice versa. I can't do that anywhere except LibXML, and I need to
keep that easy to install.
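
As a rough illustration of the conversion described above (a sketch, not the actual assert_xhtml code), a nested example structure can be folded into a single XPath with one predicate per child. The `[tag, children]` node shape and the `to_xpath` name are assumptions made for this example.

```ruby
# Hypothetical node shape: [tag_name, [child_nodes...]].
# The root becomes "//tag"; every nested node becomes "./descendant::tag".
# Children are joined with "and" inside the parent's predicate.
def to_xpath(node, root: true)
  name, children = node
  step = root ? "//#{name}" : "./descendant::#{name}"
  return step if children.empty?

  predicates = children.map { |child| to_xpath(child, root: false) }.join(" and ")
  "#{step}[#{predicates}]"
end

example = ["form", [
  ["fieldset", [
    ["legend", []],
    ["li", [["label", []], ["input", []]]]
  ]]
]]

puts to_xpath(example)
```

Because each level uses the descendant axis, the query would still match when the target HTML has extra wrapper elements between a parent and its expected children.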

And, in the grand scheme of things, I don't think _you_ have room to complain
about your libraries' adoption rates!

Phlip · Mar 17, 2009

matt said:
Also, isn't Hpricot more accepting of skanky HTML? m.

Yeah, and

A> that can sometimes slow it down!

B> we don't have any in my shop...

tidy -asxhtml -i -wrap 130 -m file.html

Ryan Davis · Mar 17, 2009

matt said:
Also, isn't Hpricot more accepting of skanky HTML?

no. I've had a bug open for years on hpricot because it couldn't deal
with the relatively simple forms on the trackers on rubyforge.org.
nokogiri dealt with it perfectly, and since mechanize migrated from
hpricot to nokogiri I've had fewer issues overall.

I should reemphasize... YEARS. Even the bug tracker has since
disappeared. This is where nokogiri really shines IMBO(*).

*) in my _biased_ opinion. I work/hang out with aaron patterson on a
regular basis. That said, he fixes bugs I (and others--I watch) report
in a timely manner.

Ryan Davis · Mar 17, 2009

_why said:
You have to question a benchmark that is entirely based on two XML
documents. What about HTML fix ups? What about various platforms
and CPUs? Why not treat Hpricot fairly and use it properly in the
benchmarks? It reeks of something.

You _do_ have to question it (as you should question all benchmarks,
really)... But that question should come in the form of a bug report,
or a patch. To do otherwise... reeks of something.

Daniele Alessandri · Mar 17, 2009

_why said:
Best of all, Hpricot has run on JRuby in the past. And I am in the
process of merging some IronRuby code[1] and porting 0.7 to

It seems like my port of Hpricot to IronRuby did not go unnoticed
despite having kept quiet about it so far.

By the way, porting 0.7 to IronRuby is on my radar. I am not sure
how long this will take (I have been pretty busy lately), but
staying up to date with the latest version of Hpricot is
indeed something I want to achieve.

PS: thanks for this new release of Hpricot.

John Barnette · Mar 17, 2009

_why said:
You have to question a benchmark that is entirely based on two XML
documents. What about HTML fix ups? What about various platforms
and CPUs? Why not treat Hpricot fairly and use it properly in the
benchmarks? It reeks of something.

Don't be an ass. Code (and benchmark results) speak much louder than
snark. Aaron has put the current benchmarks up on GitHub[1], and I'm
sure he'll welcome any patches, additions, or corrections.

~ j.

[1] http://github.com/tenderlove/xml_truth

Aaron Patterson · Mar 17, 2009

_why said:
Please enjoy a succulent, new Hpricot. A bit faster, some Ruby 1.9
support, and assorted fixes.

gem install hpricot --source http://code.whytheluckystiff.net

It should show up at Rubyforge in a bit.

I'm sure you're wondering what's the reason for Hpricot updates, in
the face of heated competition from the Nokogiri and LibXML
libraries. Remember that Hpricot has no dependencies and is smaller
than either of those libs. Hpricot uses its own Ragel-based
parser, so you have the freedom to hack the parser itself, the code
is dwarven by comparison.

Best of all, Hpricot has run on JRuby in the past. And I am in the
process of merging some IronRuby code[1] and porting 0.7 to
JRuby. This means your code will run on a variety of Ruby platforms
without alteration. That alone makes it worthwhile, wouldn't you
agree?

Clearly, the benchmarks you see on Ruby Inside are skewed to favor
Nokogiri. They parse XML through Hpricot without using Hpricot.XML(),
which is not only wrong, but puts XML through needless HTML cleanup
operations. I am sure that Hpricot 0.7 still fares slower on large
documents. However, for instance, try testing a large amount of
small documents (a much more common scenario) with this latest
version.

Thank you for pointing out my mistakes. The repository[1] is public in
order to keep myself honest. Patches are welcome.

_why said:
You have to question a benchmark that is entirely based on two XML
documents. What about HTML fix ups? What about various platforms
and CPUs? Why not treat Hpricot fairly and use it properly in the
benchmarks? It reeks of something.

HTML fix ups will be tested as well. So will CSS searches, XPath
searches, memory usage, and many other things. As I said[2], these benchmarks
are not complete. If you're worried about being treated fairly, fork my
repository and write tests.

[1] https://github.com/tenderlove/xml_truth/tree
[2] http://www.rubyinside.com/ruby-xml-performance-benchmarks-1641.html#comment-38293

_why · Mar 17, 2009

Aaron said:
HTML fix ups will be tested as well. So will CSS searches, XPath
searches, memory usage, and many other things. As I said[2], these benchmarks
are not complete. If you're worried about being treated fairly, fork my
repository and write tests.

No no, don't be silly, I'd much rather complain and be a sore
loser. I insist.

Look, I think I'd just rather see the benchmarks kept up by a
third party who has nothing to gain and can show a more nuanced
view of the scene. I really wish I could drop Hpricot (as
RubyfulSoup did,) but I think it has its strengths.

Let me ask you this. You're neck and neck with libxml-ruby. The
bulk of your time is spent in the exact same HTML parser as
libxml-ruby. Why the hyperfocus on benchmarks and declaring
yourselves winners? You're never going to be too far off from
their speed. So, I mean, it strikes me as adversarial and needless,
if your library quality and bug fixing are of the sort that Ryan
Davis has just touted.

_why

Charles Oliver Nutter · Mar 18, 2009

_why said:
Clearly, the benchmarks you see on Ruby Inside are skewed to favor
Nokogiri. They parse XML through Hpricot without using Hpricot.XML(),
which is not only wrong, but puts XML through needless HTML cleanup
operations. I am sure that Hpricot 0.7 still fares slower on large
documents. However, for instance, try testing a large amount of
small documents (a much more common scenario) with this latest
version.

You have to question a benchmark that is entirely based on two XML
documents. What about HTML fix ups? What about various platforms
and CPUs? Why not treat Hpricot fairly and use it properly in the
benchmarks? It reeks of something.

Welcome to my personal hell.

- Charlie

Sean O'Halpin · Mar 18, 2009

[snip the yak]

We're missing you man. Forget the fruit. Just hang out with us mortals
here a little.

All the best,
Sean

John Wells · Mar 18, 2009

Charles said:
Welcome to my personal hell.

Ironic that Peter just posted a positive note about new JRuby benchmarks ;-)

http://rubyflow.com/items/1913

Charles Oliver Nutter · Mar 19, 2009

John said:
Ironic that Peter just posted a positive note about new JRuby benchmarks ;-)

http://rubyflow.com/items/1913

I don't mind benchmarks as much as the constant cat-and-mouse game we
have to play. Ultimately most of the microbenchmarks published are
meaningless, but we have to spend a lot of time flexing that muscle to
remain a contender. It's tiring.

- Charlie

David Villa · Mar 19, 2009

_why said:
Please enjoy a succulent, new Hpricot. A bit faster, some Ruby 1.9
support, and assorted fixes.

gem install hpricot --source http://code.whytheluckystiff.net

It should show up at Rubyforge in a bit.

.....

I am trying to install this gem:

powerbook-g4-15-de-villa:/opt/local/bin villa$ sudo gem install hpricot --source http://code.whytheluckystiff.net
Building native extensions. This could take a while...
Successfully installed hpricot-0.7
1 gem installed
Installing ri documentation for hpricot-0.7...
Installing RDoc documentation for hpricot-0.7...
powerbook-g4-15-de-villa:/opt/local/bin villa$ irb
irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'hpricot'
LoadError: Failed to load /usr/local/lib/ruby/gems/1.8/gems/hpricot-0.7/lib/hpricot_scan.bundle
        from /usr/local/lib/ruby/gems/1.8/gems/hpricot-0.7/lib/hpricot_scan.bundle
        from /usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:27:in `require'
        from /usr/local/lib/ruby/gems/1.8/gems/hpricot-0.7/lib/hpricot.rb:20
        from /usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:32:in `gem_original_require'
        from /usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:32:in `require'
        from (irb):2

Does anybody know where the error is?

This is a PowerBook G4 running Tiger.

thanks

Marc Heiler · Mar 20, 2009

Ryan said:
and since mechanize migrated from hpricot
to nokogiri I've had fewer issues overall.

I have had issues after mechanize migrated to nokogiri. In fact I am
using the older mechanize without the dependency on nokogiri, until I am
able to install nokogiri without a problem.

(PS: For the record, I never use rubygems and never will for various
reasons, most importantly because I do not need and do not want automatic
dependency handling without me controlling it, so a part of this issue
is surely my own doing. But fact remains that the older mechanize at the
moment works like a charm for me, whereas the newer mechanize does not
work because I can not install nokogiri easily. Just for the record, the
error with nokogiri "rake" is:

"3) Failure:
test_exslt(TestXsltTransforms) [./test/test_xslt_transforms.rb:76]:
<"2009-03-20"> expected to be =~
</\d{4}-\d\d-\d\d[-|+]\d\d:\d\d/>.

348 tests, 939 assertions, 3 failures, 0 errors
rake aborted!"
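
The assertion in that failure can be reproduced in isolation. This is a small sketch using only the pattern and value shown in the output above. The constant and method names, and the `-07:00` offset, are made up for illustration.

```ruby
# Pattern copied from the failing assertion: a date followed by a UTC offset.
# Note that the character class [-|+] also accepts a literal "|".
DATE_WITH_OFFSET = /\d{4}-\d\d-\d\d[-|+]\d\d:\d\d/

# True when `value` contains a date with a trailing offset such as -07:00.
def matches_offset_date?(value)
  !(value =~ DATE_WITH_OFFSET).nil?
end

puts matches_offset_date?("2009-03-20")       # the value from the failure
puts matches_offset_date?("2009-03-20-07:00") # same date with a made-up offset
```

The value in the failure is a bare date, so it cannot satisfy a pattern that requires an offset after the day.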

Trying to use setup.rb on mechanize and nokogiri installs it of course
but as expected a later error emerges:

"lib/ruby/site_ruby/1.8/nokogiri.rb:6:in `require': no such file to load
-- nokogiri/native (LoadError)"

So for me the situation is reversed: right now I have fewer problems
with hpricot than with nokogiri/mechanize.

trans · Mar 20, 2009

Since Mechanize can use either Nokogiri or Hpricot as a backend, it
seems like a good idea if neither were an actual dependency.

Either that or fork the project, how about Wechanize ;-)

But the first option seems the better course, I imagine other backends
could be added eventually too, eg. libxml-ruby.
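
One way the first option might look, as a sketch only: try each candidate backend in order and use the first one that loads. The `first_available` helper and the candidate list are hypothetical and do not describe how Mechanize actually selects a parser.

```ruby
# Return the name of the first library in `candidates` that can be required,
# or nil when none of them is installed.
def first_available(candidates)
  candidates.find do |lib|
    begin
      require lib
      true
    rescue LoadError
      false # not installed; try the next candidate
    end
  end
end

# Hypothetical preference order; none of these is a hard dependency.
backend = first_available(%w[nokogiri hpricot rexml/document])
puts backend ? "using #{backend}" : "no parser backend installed"
```

With this approach the gem would declare no parser dependency at all, and users would install whichever backend they prefer.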

T.

Jörg W Mittag · Mar 20, 2009

trans said:
Since Mechanize can use either Nokogiri or Hpricot as a backend, it
seems like a good idea if neither were an actual dependency.

Actually, IMO they should both be alternative dependencies. Which, of
course, RubyGems doesn't support. But since Marc doesn't use RubyGems,
it should work fine.

jwm

Phlip · Mar 20, 2009

Marc said:
"3) Failure:
test_exslt(TestXsltTransforms) [./test/test_xslt_transforms.rb:76]:
<"2009-03-20"> expected to be =~
</\d{4}-\d\d-\d\d[-|+]\d\d:\d\d/>.

Add it to the do-list!:

http://nokogiri.lighthouseapp.com/projects/19607-nokogiri/tickets/

Eric Hodel · Mar 20, 2009

Marc said:
(PS: For the record, I never use rubygems and never will for various
reasons, most importantly because I do not need and do not want automatic
dependency handling without me controlling it

$ gem help install
Usage: gem install GEMNAME [GEMNAME ...] [options] -- --build-flags [options]

[...]
  Install/Update Options:
[...]
        --ignore-dependencies        Do not install any required dependent gems