Downloading Large Files -- Feedback?


mwt

This code works fine to download files from the web and write them to
the local drive:

import urllib
f = urllib.urlopen("http://www.python.org/blah/blah.zip")
g = f.read()
file = open("blah.zip", "wb")
file.write(g)
file.close()

The process is pretty opaque, however. This downloads and writes the
file with no feedback whatsoever. You don't see how many bytes you've
downloaded already, etc. Especially the "g = f.read()" step just sits
there while downloading a large file, presenting a pregnant, blinking
cursor.

So my question is, what is a good way to go about coding this kind of
basic feedback? Also, since my testing has only *worked* with this
code, I'm curious if it will throw a visible error if something goes
wrong with the download.

Thanks for any pointers. I'm busily Googling away.
 

Paul Rubin

mwt said:
f = urllib.urlopen("http://www.python.org/blah/blah.zip")
g = f.read() # ...
So my question is, what is a good way to go about coding this kind of
basic feedback? Also, since my testing has only *worked* with this
code, I'm curious if it will throw a visible error if something goes
wrong with the download.

One obvious type of failure is running out of memory if the file is
too large. Python can be fairly hosed (VM thrashing etc.) by the time
that happens. Normally you shouldn't read a potentially big file of
unknown size all in one gulp like that. You'd instead say something
like

while True:
    block = f.read(4096)  # read a 4k block from the file
    if len(block) == 0:
        break  # end of file
    # do something with the block

Your "do something with..." could involve updating a status display
or something, saying how much has been read so far.
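
For instance, a minimal sketch of that loop, writing each block straight
to disk and printing a running byte count (the URL and output filename
here are just placeholders):

import urllib

f = urllib.urlopen("http://www.python.org/blah/blah.zip")
out = open("blah.zip", "wb")
bytes_read = 0
while True:
    block = f.read(4096)       # read a 4k block
    if len(block) == 0:
        break                  # end of file
    out.write(block)           # write the block as it arrives
    bytes_read += len(block)
    print "%d bytes downloaded so far" % bytes_read
out.close()
f.close()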
 

mwt

Pardon my ignorance here, but could you give me an example of what
would constitute a file that is unreasonably or dangerously large? I'm
running Python on an Ubuntu box with about a gig of RAM.

Also, do you know of any online examples of the kind of robust,
real-world code you're describing?

Thanks.
 

Alex Martelli

mwt said:
The process is pretty opaque, however. This downloads and writes the
file with no feedback whatsoever. You don't see how many bytes you've
downloaded already, etc. Especially the "g = f.read()" step just sits
there while downloading a large file, presenting a pregnant, blinking
cursor.

So my question is, what is a good way to go about coding this kind of
basic feedback? Also, since my testing has only *worked* with this

You may use urlretrieve instead of urlopen: urlretrieve accepts an
optional argument named reporthook, and calls it once in a while ("zero
or more times"...;-) with three arguments: block_count (number of blocks
downloaded so far), block_size (size of each block in bytes), and
file_size (total size of the file in bytes if known, otherwise -1). The
reporthook function (or other callable) may display a progress bar or
whatever you like best.

urlretrieve saves what it's downloading to a disk file (you may specify a
filename, or let it pick an appropriate temporary filename) and returns
two things, the filename where it's downloaded the data and a
mimetools.Message instance whose headers have metadata (such as content
type information).
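
For instance, a rough sketch (the URL and local filename are just
placeholders, and the hook here only prints a crude percentage):

import urllib

def report(block_count, block_size, total_size):
    # called "zero or more times" while the download proceeds
    if total_size > 0:
        percent = min(100, block_count * block_size * 100 / total_size)
        print "%d%% downloaded" % percent
    else:
        print "%d blocks downloaded (total size unknown)" % block_count

filename, headers = urllib.urlretrieve(
    "http://www.python.org/blah/blah.zip", "blah.zip", reporthook=report)
print "saved to:", filename   # headers is a mimetools.Message with the metadata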

If that doesn't fit your needs well, you may study the sources of
urllib.py in your Python's library source directory, to see exactly what
it's doing and code your own modified version.


Alex
 

Steven D'Aprano

mwt said:
Pardon my ignorance here, but could you give me an example of what
would constitute a file that is unreasonably or dangerously large? I'm
running Python on an Ubuntu box with about a gig of RAM.

1GB of RAM plus (say) 2GB of virtual memory = 3GB in total.

Your OS and other running processes might be using
(say) 1GB. So 2GB might be the absolute limit.

Of course your mileage will vary, and in practice your
machine will probably start slowing down long before
that limit.

Also, do you know of any online examples of the kind of robust,
real-world code you're describing?

It isn't written in Python, but get your hands on wget. It
is probably already on your Linux distro, but if not,
check it out here:

http://www.gnu.org/software/wget/wget.html
 

mwt

Thanks for the explanation. That is exactly what I'm looking for. In a
way, it's kind of neat that urlopen just *does* it, no questions asked,
but I'd like to just know the basics, which is what it sounds like
urlretrieve covers. Excellent. Now, let's see what I can whip up with
that.

-- just bought "cookbook" and "nutshell" moments ago btw....
 

Alex Martelli

mwt said:
Thanks for the explanation. That is exactly what I'm looking for. In a
way, it's kind of neat that urlopen just *does* it, no questions asked,
but I'd like to just know the basics, which is what it sounds like
urlretrieve covers. Excellent. Now, let's see what I can whip up with
that.

Yes, I entirely understand your mindset, because mine is so similar: I
prefer using higher-level "just works" abstractions, BUT also want to
understand what's going on "below"... "just in case"!-)
-- just bought "cookbook" and "nutshell" moments ago btw....

Nice coincidence, and thanks!-)


Alex
 

mwt

So, I just put this little chunk to the test, which does give you
feedback about what's going on with a file download. Interesting that
with urlretrieve, you don't do all the file opening and closing stuff.

Works fine:

------------------
import urllib

def download_file(filename, URL):
    f = urllib.urlretrieve(URL, filename, reporthook=my_report_hook)

def my_report_hook(block_count, block_size, total_size):
    total_kb = total_size / 1024
    print "%d kb of %d kb downloaded" % (block_count * (block_size / 1024), total_kb)

if __name__ == "__main__":
    download_file("test_zip.zip", "http://blah.com/blah.zip")
 

Alex Martelli

mwt said:
import urllib

def download_file(filename, URL):
    f = urllib.urlretrieve(URL, filename, reporthook=my_report_hook)

If you wanted to DO anything with the results, you'd probably want to
assign to
f, m = ...
not just f. This way, f is the filename, m a message object useful for
metadata (e.g., content type).
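
Something like this, as a sketch:

def download_file(filename, URL):
    f, m = urllib.urlretrieve(URL, filename, reporthook=my_report_hook)
    print "saved as:", f
    print "content type:", m.gettype()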

Otherwise looks fine.


Alex
 

Fuzzyman

mwt said:
This code works fine to download files from the web and write them to
the local drive:

import urllib
f = urllib.urlopen("http://www.python.org/blah/blah.zip")
g = f.read()
file = open("blah.zip", "wb")
file.write(g)
file.close()

The process is pretty opaque, however. This downloads and writes the
file with no feedback whatsoever. You don't see how many bytes you've
downloaded already, etc. Especially the "g = f.read()" step just sits
there while downloading a large file, presenting a pregnant, blinking
cursor.

So my question is, what is a good way to go about coding this kind of
basic feedback? Also, since my testing has only *worked* with this
code, I'm curious if it will throw a visible error if something goes
wrong with the download.

By the way, you can achieve what you want with urllib2. You may also
want to check out the pycurl library, which is a Python interface to a
very good C library called curl.

With urllib2 you don't *have* to read the whole thing in one go -

import urllib2
f = urllib2.urlopen("http://www.python.org/blah/blah.zip")
g = ''
while True:
    a = f.read(1024*10)
    if not a:
        break
    print 'Read another 10k'
    g += a

file = open("blah.zip", "wb")
file.write(g)
file.close()
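
If you'd rather not hold the whole file in memory, the same loop can
write each block straight to disk, and the Content-Length header (when
the server sends one) gives you a total to report against. A sketch,
with the same placeholder URL:

import urllib2
f = urllib2.urlopen("http://www.python.org/blah/blah.zip")
total = f.info().getheader('Content-Length')    # may be None
out = open("blah.zip", "wb")
read_so_far = 0
while True:
    a = f.read(1024*10)
    if not a:
        break
    out.write(a)                # write each block instead of building g up in memory
    read_so_far += len(a)
    if total:
        print 'Read %d of %s bytes' % (read_so_far, total)
    else:
        print 'Read %d bytes' % read_so_far
out.close()
f.close()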

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
 
