Hpricot Relative Path

Josh Cheek

I'm trying to write a script that pulls an image out of a yfrog page.

This is what I have so far:

require 'rubygems'
require 'hpricot'
require 'open-uri'

url = 'http://yfrog.com/03gssacj'
doc = Hpricot(open(url))

(doc%"#main_image").attributes['src'] # => "/img3/7036/gssac.jpg"

The problem is that the path is relative.
I've done a little googling, queried my ruby and rails ML archives, glanced
at hpricot code, and looked through the method lists for open-uri and
hpricot.
So far, I don't see anything that looks very useful.
Is there a way to have it give me the absolute path so that I can reference
the picture later?

The only thing I've found that works so far involves string manipulation,
which seems like a brittle workaround to replace something that probably
exists if I could just find it.

url = 'http://yfrog.com/03gssacj'
page = open(url)
base = page.base_uri.to_s[ /(?:http:\/\/)?[^\/]*\// ] # => "http://img3.yfrog.com/"
relative = (Hpricot(page)%"#main_image").attributes['src'] # => "/img3/7036/gssac.jpg"
absolute = URI.join( base , relative )
absolute.to_s # => "http://img3.yfrog.com/img3/7036/gssac.jpg"

Anyone know of a better solution?
 
Ben Bleything

> The problem is that the path is relative.
> I've done a little googling, queried my ruby and rails ML archives, glanced
> at hpricot code, and looked through the method lists for open-uri and
> hpricot.
> So far, I don't see anything that looks very useful.
> Is there a way to have it give me the absolute path so that I can reference
> the picture later?

Hpricot is just telling you what's in the HTML. Munging the
document's contents is your responsibility, not the parser's :)

> The only thing I've found that works so far involves string manipulation,
> which seems like a brittle workaround to replace something that probably
> exists if I could just find it.

Look into the URI library.

require 'uri'

uri = URI.parse( "http://yfrog.com/03gssacj" )
uri.path = # your hpricot magic to get the image path goes here

Ben
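
To make the suggestion above concrete, here is a minimal sketch of the `path=` approach. The helper name `with_image_path` and the hard-coded `src` value are assumptions for illustration; in the real script the `src` string would come from Hpricot.

```ruby
require 'uri'

# Hypothetical helper: parse the page URL and swap in the image's path.
# Only the path component changes; scheme and host stay as parsed.
def with_image_path(page_url, src)
  uri = URI.parse(page_url)
  uri.path = src
  uri.to_s
end

# The src value is hard-coded here as a stand-in for what Hpricot returns.
puts with_image_path('http://yfrog.com/03gssacj', '/img3/7036/gssac.jpg')
# prints: http://yfrog.com/img3/7036/gssac.jpg
```

Note that the host is whatever was in the URL that was parsed. If the page is served from a different host after a redirect, parsing the URL the page actually came from would be the safer starting point.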
 
Josh Cheek

> Hpricot is just telling you what's in the HTML. Munging the
> document's contents is your responsibility, not the parser's :)
>
> Look into the URI library.
>
> require 'uri'
>
> uri = URI.parse( "http://yfrog.com/03gssacj" )
> uri.path = # your hpricot magic to get the image path goes here
>
> Ben

Thanks, this is what I am using now:

page = open url
image_path = URI.parse page.base_uri.to_s.sub( %r(/$) , '' )
image_path.path = (Hpricot(page)%"#main_image").attributes['src']
image_path.to_s

It still seems a little excessive, but it's a lot better than what I had
before.
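
For anyone finding this later, a possibly shorter variant is to hand the page's base URI straight to `URI.join`, which resolves a relative reference against a full URL without any trimming. This is only a sketch: the helper name `absolute_image_url` is hypothetical, and the base URL in the example is an assumed stand-in for whatever `page.base_uri` returns.

```ruby
require 'uri'

# Hypothetical helper: resolve an <img> src against the URL the page was
# served from. URI.join handles root-relative, relative, and absolute srcs.
def absolute_image_url(base_uri, src)
  URI.join(base_uri.to_s, src).to_s
end

# With open-uri and Hpricot it would be used roughly like this:
#   page = open(url)
#   src  = (Hpricot(page) % "#main_image").attributes['src']
#   absolute_image_url(page.base_uri, src)

# Assumed base URL, standing in for page.base_uri:
puts absolute_image_url('http://img3.yfrog.com/03gssacj', '/img3/7036/gssac.jpg')
# prints: http://img3.yfrog.com/img3/7036/gssac.jpg
```

Unlike assigning to `path`, this also copes with a src that is already absolute or that is relative to the page's directory.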
 
