Hpricot & mechanize fail to parse page after redirect

Ehud Rosenberg

Hi everyone,
My quest with mechanize/Hpricot continues :)
Something extremely strange happened today: some simple working code
broke, and I can't figure out why.

I am trying to access a thepiratebay.org search page, which redirects
to a relative URL like this:
original link:
http://thepiratebay.org/s/?page=0&orderby=3&q=football+manager+2008&searchTitle=on

redirects to:
/search/football manager 2008/0/3/0

This all worked fine until yesterday. The page was redirected,
and mechanize even handled the cookie that was sent back from the site.
But today I am getting this strange error from Hpricot:
"URI::InvalidURIError: bad URI(is not URI?): /search/football manager 2008/0/3/0"
Mechanize gives a different error, but I'm sure it stems from
Hpricot's problem with getting the page.

I have tested this on two different machines, and it fails on both.
Can someone please give it a go and see if they can figure it out?
I would be very very thankful :)

Thanks,
Ehud

PS - I am using Hpricot 0.6, and the redirected page is parsed correctly
when accessed directly.
 
Rob Biedenharn

If the redirect is a 302 with a Location: header that contains just the
relative path:
"/search/football manager 2008/0/3/0"

it's probably similar to the issue I had using HTTPClient. The
relevant bit of code from HTTPClient is:
def default_redirect_uri_callback(uri, res)
  newuri = URI.parse(res.header['location'][0])
  unless newuri.is_a?(URI::HTTP)
    newuri = URI.join(uri, newuri)
    STDERR.puts(
      "could be a relative URI in location header which is not recommended")
    STDERR.puts(
      "'The field value consists of a single absolute URI' in HTTP spec")
  end
  puts "Redirect to: #{newuri}" if $DEBUG
  newuri
end

Note the line: URI.join(uri, newuri) which takes the (presumed)
relative newuri and interprets it with respect to the original uri.
(Note also that I've recently sent the author of httpclient a patch
that fixed this line.)
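
As a rough illustration of the failure and one possible workaround, here is a minimal sketch using only Ruby's standard uri library. The helper name resolve_redirect is hypothetical (it is not part of mechanize, Hpricot, or HTTPClient), and it assumes that unescaped spaces are the only illegal characters in the Location value:

```ruby
require 'uri'

# Hypothetical helper, not part of any library mentioned in this thread:
# resolve a Location header value against the URL that produced the redirect.
def resolve_redirect(base_url, location)
  # Assumption: spaces are the only characters that need escaping here.
  escaped = location.gsub(' ', '%20')
  # URI.join interprets the (possibly relative) location relative to base_url.
  URI.join(base_url, escaped)
end

base = 'http://thepiratebay.org/s/?page=0&orderby=3&q=football+manager+2008&searchTitle=on'
location = '/search/football manager 2008/0/3/0'

# Parsing the raw header value fails because of the unescaped spaces.
begin
  URI.parse(location)
rescue URI::InvalidURIError => e
  puts "URI.parse failed: #{e.message}"
end

# After escaping, the relative path resolves against the original URL.
puts resolve_redirect(base, location)
```

With the values above, the last line should print http://thepiratebay.org/search/football%20manager%202008/0/3/0. A real client would probably need to escape more than spaces, so treat this only as a starting point.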

-Rob

Rob Biedenharn http://agileconsultingllc.com
(e-mail address removed)
 
Ehud Rosenberg

That is probably the case when using Hpricot, but mechanize handles
this: it has a method that takes a relative URL redirect and creates a
fully qualified one.
Also it worked for me yesterday with the exact same code (I know that
sounds crazy! :)

Thanks for the quick and thorough reply, Rob!
 
