Hpricot Relative Path

Josh Cheek · Mar 12, 2010

I'm trying to write a script that pulls an image out of a yfrog page.

So this is what I have:

require 'rubygems'
require 'hpricot'
require 'open-uri'

url = 'http://yfrog.com/03gssacj'
doc = Hpricot(open(url))

(doc%"#main_image").attributes['src'] # => "/img3/7036/gssac.jpg"

The problem is that the path is relative.
I've done a little googling, queried my ruby and rails ML archives, glanced
at hpricot code, and looked through the method lists for open-uri and
hpricot.
So far, I don't see anything that looks very useful.
Is there a way to have it give me the absolute path so that I can reference
the picture later?

The only thing I've found that works so far involves string manipulation,
which seems like a brittle workaround to replace something that probably
exists if I could just find it.

url = 'http://yfrog.com/03gssacj'
page = open(url)
base = page.base_uri.to_s[ /(?:http:\/\/)?[^\/]*\// ] # => "http://img3.yfrog.com/"
relative = (Hpricot(page)%"#main_image").attributes['src'] # => "/img3/7036/gssac.jpg"
absolute = URI.join( base , relative )
absolute.to_s # => "http://img3.yfrog.com/img3/7036/gssac.jpg"

Anyone know of a better solution?
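
One possible simplification, sketched with the values quoted above rather than a live fetch: `URI.join` resolves a root-relative path against a full base URL by itself, so the regex that trims the base down to its host may be unnecessary. The base URL below is a hypothetical stand-in for whatever `page.base_uri` returned after the redirect; the post does not show the real value.

```ruby
require 'uri'

# Hypothetical stand-in for page.base_uri.to_s; the real value is not shown in the post.
base     = 'http://img3.yfrog.com/some/page/'
# The src attribute value reported in the post.
relative = '/img3/7036/gssac.jpg'

# A reference starting with "/" replaces the whole path of the base,
# so nothing needs to be stripped from the base first.
absolute = URI.join(base, relative)
absolute.to_s # => "http://img3.yfrog.com/img3/7036/gssac.jpg"
```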

Ben Bleything · Mar 12, 2010

> The problem is that the path is relative.
> I've done a little googling, queried my ruby and rails ML archives, glanced
> at hpricot code, and looked through the method lists for open-uri and
> hpricot.
> So far, I don't see anything that looks very useful.
> Is there a way to have it give me the absolute path so that I can reference
> the picture later?

Hpricot is just telling you what's in the HTML. Munging the
document's contents is your responsibility, not the parser's.

> The only thing I've found that works so far involves string manipulation,
> which seems like a brittle workaround to replace something that probably
> exists if I could just find it.

Look into the URI library.

require 'uri'

uri = URI.parse( "http://yfrog.com/03gssacj" )
uri.path = # your hpricot magic to get the image path goes here

Ben
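
Filled in with the `src` value from the original post, the suggestion above might look like the following sketch. The path string is hard-coded in place of the Hpricot lookup so that the snippet runs without network access.

```ruby
require 'uri'

uri = URI.parse('http://yfrog.com/03gssacj')
# Hard-coded stand-in for (doc % "#main_image").attributes['src'].
uri.path = '/img3/7036/gssac.jpg'
uri.to_s # => "http://yfrog.com/img3/7036/gssac.jpg"
```

The result keeps the host of the URL that was parsed (yfrog.com here). The original post shows the page being served from img3.yfrog.com, so starting from `page.base_uri` rather than the requested URL would keep the host the page actually came from.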

Josh Cheek · Mar 12, 2010

> Hpricot is just telling you what's in the HTML. Munging the
> document's contents is your responsibility, not the parser's.
>
> Look into the URI library.
>
> require 'uri'
>
> uri = URI.parse( "http://yfrog.com/03gssacj" )
> uri.path = # your hpricot magic to get the image path goes here
>
> Ben

Thanks, this is what I am using now:

page = open url
image_path = URI.parse page.base_uri.to_s.sub( %r(/$) , '' )
image_path.path = (Hpricot(page)%"#main_image").attributes['src']
image_path.to_s

It still seems a little excessive, but it's a lot better than what I had
before.
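
If the remaining ceremony is a concern, one option is `URI#merge`, which takes the `src` value as-is. Since `page.base_uri` is already a `URI` object, this would avoid the `to_s`/`parse` round trip, and it also copes with `src` values that are document-relative or already absolute. A minimal sketch: `absolutize` is a hypothetical helper name, and the base URL is a made-up stand-in for `page.base_uri`.

```ruby
require 'uri'

# Hypothetical helper: resolve an attribute value against the page's base URI.
def absolutize(base_uri, src)
  base_uri.merge(src).to_s
end

# Made-up stand-in for page.base_uri (open-uri returns a URI object there).
base = URI.parse('http://img3.yfrog.com/some/page/')

absolutize(base, '/img3/7036/gssac.jpg')     # => "http://img3.yfrog.com/img3/7036/gssac.jpg"
absolutize(base, 'thumb.jpg')                # => "http://img3.yfrog.com/some/page/thumb.jpg"
absolutize(base, 'http://example.com/a.png') # => "http://example.com/a.png"
```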
