Question: Downloading files with open(-uri)?

Mariano Kamp · Dec 23, 2006

Hi,

I could use a quick hand here.

I want to watch the RailsConf 2006 videos and want to download
them with a script.

Unfortunately, open("http://xx") never returns. Any idea what I
am doing wrong here?

I tested it with a URL that returns plain HTML and that worked
fine. See the first line, ibm.com.

require 'open-uri'

urls = %w{
  http://ibm.com
  http://downloads.scribemedia.net/rails2006/03_martin_fowler_full.m4v
  http://downloads.scribemedia.net/rails2006/02_dave_thomas_full.m4v
  http://downloads.scribemedia.net/rails2006/01_dh_hansson.m4v
  http://downloads.scribemedia.net/rails2006/04_paul_graham_full.m4v
  http://downloads.scribemedia.net/rails2006/06_railsCorePanel_full.m4v
  http://downloads.scribemedia.net/rails2006/07_why_lucky_stiff.m4v
}
BUFFER_SIZE = 1_024*1_024*1

urls.each do |url|
  puts "downloading #{url}"
  open(url) do |input|
    puts "opened connection."
    output = open(url.split(/\//).last, "wb")
    while (buffer = input.read(BUFFER_SIZE))
      print "."
      $stdout.flush
      output.write(buffer)
    end
    output.close
  end
  puts "done."
end
puts "All downloads done."

Cheers,
Mariano

William James · Dec 23, 2006

There's nothing wrong with your program; I tested it by
downloading a picture. If you have a dial-up connection, maybe
the transfer is progressing very slowly.

Mariano Kamp · Dec 23, 2006

William said:
There's nothing wrong with your program; I tested it by
downloading a picture. If you have a dial-up connection, maybe
the transfer is progressing very slowly.

Hey Bill,

hmm, not sure. If I change the BUFFER_SIZE to 1KB I still don't
see anything and the "puts 'opened connection'" should at least be
visible, shouldn't it?

Anyway, I have a 6 Mbit/s downstream, so even a 1 MB buffer
shouldn't be a problem.

I also suspected that the server is checking for deep links and
would evaluate the referer in the process, but when I enter one of
the URLs directly into my browser it works.

Very strange.

Cheers,
Mariano

Edwin Fine · Dec 23, 2006

William said:
There's nothing wrong with your program; I tested it by
downloading a picture. If you have a dial-up connection, maybe
the transfer is progressing very slowly.

Actually, I think the site is slow or overloaded. The movies are 250MB -
500MB in size, and the download speed I am getting is around 52
KBytes/second (and I have a broadband connection). This code works
better at showing progress:

require 'open-uri'

urls = %w{
  http://ibm.com
  http://downloads.scribemedia.net/rails2006/03_martin_fowler_full.m4v
  http://downloads.scribemedia.net/rails2006/02_dave_thomas_full.m4v
  http://downloads.scribemedia.net/rails2006/01_dh_hansson.m4v
  http://downloads.scribemedia.net/rails2006/04_paul_graham_full.m4v
  http://downloads.scribemedia.net/rails2006/06_railsCorePanel_full.m4v
  http://downloads.scribemedia.net/rails2006/07_why_lucky_stiff.m4v
}

BUFFER_SIZE = 8 * 1_024

urls.each do |url|
  puts "downloading #{url}"
  out_file = url.split(/\//).last
  puts "Writing to #{out_file}"

  # content_length_proc is called once with the Content-Length value;
  # progress_proc is called repeatedly with the number of bytes read so far.
  open(url, "r",
       :content_length_proc => lambda { |content_length|
         puts "Content length: #{content_length} bytes" },
       :progress_proc => lambda { |size|
         printf("Read %010d bytes\r", size.to_i) }) do |input|
    open(out_file, "wb") do |output|
      while (buffer = input.read(BUFFER_SIZE))
        output.write(buffer)
      end
    end
  end
  puts "\ndone."
end
puts "All downloads done."
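
For a sense of scale, the figures quoted above (files of 250 to 500 MB at roughly 52 KBytes/second) imply well over an hour per file. A rough sketch of that arithmetic follows. The helper names eta_seconds and format_duration are made up for illustration, and MB and KBytes are assumed to mean 1024-based units.

```ruby
# Estimated transfer time in whole seconds for a given size and rate.
def eta_seconds(total_bytes, bytes_per_second)
  (total_bytes.to_f / bytes_per_second).round
end

# Formats a number of seconds as e.g. "1h22m03s".
def format_duration(seconds)
  hours, rest = seconds.divmod(3600)
  minutes, secs = rest.divmod(60)
  format("%dh%02dm%02ds", hours, minutes, secs)
end

rate = 52 * 1024 # assumed: 52 KBytes/second, 1024-based

[250, 500].each do |megabytes|
  size = megabytes * 1024 * 1024
  puts "#{megabytes} MB: #{format_duration(eta_seconds(size, rate))}"
end
# 250 MB: 1h22m03s
# 500 MB: 2h44m06s
```

With transfer times like these, a script that prints nothing until the body has been fully read will look hung for a long time.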

Robert Klemme · Dec 23, 2006

Mariano said:
Hey Bill,

hmm, not sure. If I change the BUFFER_SIZE to 1KB I still don't see
anything and the "puts 'opened connection'" should at least be visible,
shouldn't it?

Anyways I have a 6 MBit/s downstream so even a 1MB buffer shouldn't be
a problem.

I also suspected that the server is checking for deep links and would
evaluate the referer in the process, but when I enter one of the urls
directly into my browser it works.

Very strange.

I observe the same behavior that you see. I have no knowledge of
open-uri internals, but here's my theory: the page is probably loaded
completely before open returns. This would explain why you see the dots
from ibm.com in one go. I would test the same with net/http and see
whether there is any difference. Make sure to use the stream form.

Kind regards

robert

Mariano Kamp · Dec 23, 2006

Ross said:
If you have libcurl and are willing to install an extension, the
recently released Curb 0.1 makes this as easy as:

Thanks for the tip Ross.

I tried gem install curb ;-) but that didn't work. And as the other
version is already downloading the files and I just wanted this
program to do this single job I will try out curb the next time ;-)

You've implemented it in C, so you probably can't answer my question
about how you dealt with the buffer size, can you?

Cheers,
Mariano

Robert Klemme · Dec 23, 2006

Robert said:
I observe the same behavior that you see. I have no knowledge of
open-uri internals, but here's my theory: the page is probably loaded
completely before open returns. This would explain why you see the dots
from ibm.com in one go. I would test the same with net/http and see
whether there is any difference. Make sure to use the stream form.

Try this (note, this will not follow redirects):

robert

require 'net/http'
require 'uri'

urls = %w{
  http://ibm.com
  http://downloads.scribemedia.net/rails2006/03_martin_fowler_full.m4v
  http://downloads.scribemedia.net/rails2006/02_dave_thomas_full.m4v
}

$stdout.sync = true

urls.each do |url|
  puts "downloading #{url}"

  Net::HTTP.get_response(URI.parse(url)) do |res|
    puts "opened connection."
    target = url.split(/\//).last
    puts "writing to #{target}"

    File.open(target, "wb") do |output|
      # next line will read in chunks but not provide option for dots...
      # res.read_body(output)
      res.read_body do |chunk|
        output.write(chunk)
        print "."
      end
    end
  end

  puts "done."
end

puts "All downloads done."
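
Since the script above does not follow redirects, here is one possible way to add that while keeping the streaming read. This is only a sketch under assumptions: stream_download and its redirect_limit parameter are hypothetical names, not part of net/http, and the use_ssl: keyword assumes a reasonably modern Ruby rather than the 1.8 series discussed in this thread.

```ruby
require 'net/http'
require 'uri'

# Streams the body at url into output (any object that responds to #write),
# following at most redirect_limit redirects. If a block is given, it is
# called with the size of each chunk so the caller can report progress.
def stream_download(url, output, redirect_limit = 5, &progress)
  raise ArgumentError, "too many redirects" if redirect_limit < 0

  uri = URI.parse(url)
  redirect_to = nil

  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    # The block form yields the response before the body has been read.
    http.request_get(uri.request_uri) do |res|
      case res
      when Net::HTTPSuccess
        res.read_body do |chunk|
          output.write(chunk)
          progress.call(chunk.bytesize) if progress
        end
      when Net::HTTPRedirection
        # Location may be relative, so resolve it against the current URL.
        redirect_to = URI.join(url, res["location"]).to_s
      else
        res.value # raises an exception for non-2xx responses
      end
    end
  end

  if redirect_to
    stream_download(redirect_to, output, redirect_limit - 1, &progress)
  else
    output
  end
end

# Possible usage (not executed here):
#   File.open("video.m4v", "wb") do |f|
#     stream_download("http://example.com/video.m4v", f) { print "." }
#   end
```

The redirect is followed only after the first connection has been closed, which keeps the sketch simple at the cost of one new connection per hop.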

Edwin Fine · Dec 23, 2006

Mariano said:
Wow. Cool. How did you know about the content_length and progress
hooks? I don't see them in the docs.

Anyway ... That looks nice, but I still don't see the progress on the
console, other than for ibm.com. Do you?

I can see that I am downloading at 50 KBytes/s using a network traffic
monitor, but not on the console. And if I read this right, it should
yield a progress update roughly every kilobyte, right?

This is what I see after ... say ... 5 minutes after launching the
program.

downloading http://ibm.com
Writing to ibm.com
Content
length: 25348 bytes
Read 0000000822 bytes Read 0000001158 bytes Read 0000002182 bytes
Read 0000002518 bytes Read 0000003542 bytes Read 0000003878 bytes
Read 0000004902 bytes Read 0000005238 bytes Read 0000006262 bytes
Read 0000006598 bytes Read 0000007622 bytes Read 0000007958 bytes
Read 0000008982 bytes Read 0000009318 bytes Read 0000010342 bytes
Read 0000011366 bytes Read 0000012390 bytes Read 0000013398 bytes
Read 0000014422 bytes Read 0000014758 bytes Read 0000015782 bytes
Read 0000016118 bytes Read 0000017142 bytes Read 0000017478 bytes
Read 0000018502 bytes Read 0000018838 bytes Read 0000019862 bytes
Read 0000020198 bytes Read 0000021222 bytes Read 0000021558 bytes
Read 0000022582 bytes Read 0000022918 bytes Read 0000023942 bytes
Read 0000024278 bytes Read 0000025302 bytes Read 0000025348 bytes
done.
downloading http://downloads.scribemedia.net/rails2006/03_martin_fowler_full.m4v
Writing to 03_martin_fowler_full.m4v
Content
length: 413031533 bytes

Cheers,
Mariano

It's documented here:
http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/

This is what I am seeing:
downloading http://ibm.com
Writing to ibm.com
Content length: 25348 bytes
Read 0000025348 bytes
done.
downloading http://downloads.scribemedia.net/rails2006/03_martin_fowler_full.m4v
Writing to 03_martin_fowler_full.m4v
Content length: 413031533 bytes
Read 0131826472 bytes

It seems to update around every second, based on informal observation. I
don't know why your output looks different; did you redirect or tee it
to a file? I'm using an old 'C' trick of printing a CR (\r) after each
update, which should keep the output on the same line and just overwrite
what was there before.
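
The carriage-return technique described here can be isolated into a small formatter. The sketch below is illustrative only: progress_line is a hypothetical helper, and the percentage display is an addition that the original script does not have. The commented usage assumes URI.open, which is needed on Ruby 3.0 and later because Kernel#open no longer opens URLs there.

```ruby
# Builds one progress line ending in "\r" so that, on a terminal that
# honors carriage returns, the next line overwrites it in place.
def progress_line(bytes_read, total_bytes = nil)
  if total_bytes && total_bytes > 0
    percent = 100.0 * bytes_read / total_bytes
    format("Read %010d of %d bytes (%5.1f%%)\r", bytes_read, total_bytes, percent)
  else
    format("Read %010d bytes\r", bytes_read)
  end
end

# Possible wiring into the open-uri hooks (not executed here):
#   require 'open-uri'
#   total = nil
#   URI.open(url,
#            content_length_proc: lambda { |len| total = len },
#            progress_proc: lambda { |n| $stdout.print progress_line(n, total) }) do |input|
#     # copy input to a file here
#   end
```

If the output is redirected to a file or viewed somewhere that does not interpret "\r", the updates will not overwrite each other.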

I'm running this using Ruby 1.8.5 on Ubuntu Edgy x86_64. Perhaps your OS
is different and has some other behavior.

I tried everything I could think of to disable or bypass buffering,
including $stdout.sync = true, using $stderr, calling $stdout.flush,
using syswrite, and so on, to get the output to appear periodically,
without success. I think the output is buffered at the OS level, or
something like that, so that even calling flush won't always work. The
only thing that works for me is the progress hook.

Mariano Kamp · Dec 23, 2006

Edwin said:
It's documented here:
http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/

Grmpfh. I looked there, but probably too properly.

downloading http://ibm.com [..]
Read 0131826472 bytes

Thanks for trying that out.

Well, it seems that open had already read all the bytes before it
yielded to my block. Changing the implementation the way Robert
suggested fixed that.

So it was not really a problem with the buffering, as I suspected,
but with improper use of the API.

Cheers,
Mariano
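
One way to see this behavior directly: open-uri hands the block an already-downloaded copy of the body (a StringIO for small responses, a Tempfile for larger ones) rather than the live connection, so the block only starts once the transfer has finished. The snippet below is a minimal sketch assuming Ruby 2.5 or later, where URI.open is available; buffered_body_info is a hypothetical helper name.

```ruby
require 'open-uri'

# Returns the class of the object open-uri yields, plus the body size.
# By the time the block runs, the whole body has been fetched and buffered.
def buffered_body_info(url)
  URI.open(url) do |io|
    [io.class, io.size]
  end
end

# Possible usage (not executed here):
#   kind, size = buffered_body_info("http://example.com/")
#   puts "open-uri yielded a #{kind} holding #{size} bytes"
```

This is why reading from the yielded object in small chunks gives no progress feedback during the download itself; the progress hooks or the Net::HTTP block form are needed for that.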

Eric Hodel · Dec 23, 2006

Mariano said:
Wow. Cool. How did you know about the content_length and progress
hooks? I don't see them in the docs.

ri OpenURI::OpenRead#open
