Downloading binary files - Python3

Anders Eriksson

Hello,

I have made a short program that, given a URL, will download all files
referenced on that page.

It works, but I'm thinking it could use some optimization since it's very
slow.

I create a list of tuples where each tuple consists of the URL of the file
and the path where I want to save it, e.g. (http://somewhere.com/foo.mp3,
c:\Music\foo.mp3)

The downloading part (which is the part I need help with) looks like this:
def GetFiles():
    """do the actual copying of files"""
    for url,path in hreflist:
        print(url,end=" ")
        srcdata = urlopen(url).read()
        dstfile = open(path,mode='wb')
        dstfile.write(srcdata)
        dstfile.close()
        print("Done!")

hreflist is the list of tuples.

At the moment the output of print(url,end=" ") does not appear before the
actual download; instead it appears at the same time as print("Done!").
I would like it to appear before the download starts, as I intended.

Is downloading a binary file using: srcdata = urlopen(url).read()
the best way? Is there some other way that would speed up the downloading?

// Anders
 
Matteo

srcdata = urlopen(url).read()
dstfile = open(path,mode='wb')
dstfile.write(srcdata)
dstfile.close()
print("Done!")

Have you tried reading all files first, then saving each one in the
appropriate directory? It might work if you have enough memory, i.e.
if the files you are downloading are small, and I assume they are.
Otherwise it would be almost useless to optimize the code, since the
most time-consuming part would always be the download. Anyway, I would
try and time it, or timeit. ;)

Anyway, opening a network connection does take some time, independent
of the size of the files you are downloading and of the kind of code
requesting it; you can't do much about that. If you had Linux you
could probably get better results with wget, but that's another story
altogether.
 
Peter Otten

Anders said:
Hello,

I have made a short program that given an url will download all referenced
files on that url.

It works, but I'm thinking it could use some optimization since it's very
slow.

I create a list of tuples where each tuple consist of the url to the file
and the path to where I want to save it. E.g
(http://somewhere.com/foo.mp3, c:\Music\foo.mp3)

The downloading part (which is the part I need help with) looks like this:
def GetFiles():

Consider passing 'hreflist' explicitly. Global variables make your script
harder to manage in the long run.

    """do the actual copying of files"""
    for url,path in hreflist:
        print(url,end=" ")

You can force Python to write out its internal buffer by calling

sys.stdout.flush()
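As a minimal sketch of the same idea: on Python 3.3 and later, print() accepts a flush argument, which has the same effect as calling sys.stdout.flush() right after the print. The function name report_progress and its out parameter are made up for illustration, and the download itself is left as a placeholder comment.

```python
import sys

def report_progress(urls, out=sys.stdout):
    """Print each URL before its (placeholder) download, then 'Done!'."""
    for url in urls:
        # flush=True writes the buffered text immediately, so the URL
        # is visible while the download is still running.
        print(url, end=" ", flush=True, file=out)
        # ... the actual download of url would happen here ...
        print("Done!", file=out)
```

Without the flush, output that does not end in a newline can sit in the buffer until the next newline is written, which matches the behaviour described in the question.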

You may also take a look at the logging package.

        srcdata = urlopen(url).read()

For large files you would read the source in chunks:

src = urlopen(url)
with open(path, mode="wb") as dstfile:
    while True:
        chunk = src.read(2**20)
        if not chunk:
            break
        dstfile.write(chunk)

Instead of writing this loop yourself you can use

shutil.copyfileobj(src, dstfile)

or even

urllib.request.urlretrieve(url, path)

which also takes care of opening the file.

        dstfile = open(path,mode='wb')
        dstfile.write(srcdata)
        dstfile.close()
        print("Done!")

hreflist is the list of tuples.

at the moment the print(url,end=" ") will not be printed before the actual
download, instead it will be printed at the same time as print("Done!").
This I would like to have the way I intended.

Is downloading a binary file using: srcdata = urlopen(url).read()
the best way? Is there some other way that would speed up the downloading?


The above method may not be faster (the operation is I/O-bound), but it is
able to handle large files gracefully.
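Put together, the chunked copy might look like the sketch below. The function name download and the chunk_size parameter are assumptions for illustration, and using the urlopen() result in a with statement assumes a reasonably current Python 3.

```python
import shutil
from urllib.request import urlopen

def download(url, path, chunk_size=2**20):
    """Copy url to path in chunks instead of reading it all into memory."""
    # Both the response and the file are closed when the with block
    # exits, even if the copy fails part way through.
    with urlopen(url) as src, open(path, mode="wb") as dstfile:
        # copyfileobj repeatedly reads up to chunk_size bytes and writes them.
        shutil.copyfileobj(src, dstfile, chunk_size)
```

Memory use stays bounded by chunk_size regardless of how large the file is.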

Peter
 
Stefan Behnel

Anders said:
I have made a short program that given an url will download all referenced
files on that url.

It works, but I'm thinking it could use some optimization since it's very
slow.

What's slow about it? Is downloading each file slow, is it the overhead of
connecting to the server before the download, or is it more the feeling
that the overall process could use your bandwidth better?

I create a list of tuples where each tuple consist of the url to the file
and the path to where I want to save it. E.g (http://somewhere.com/foo.mp3,
c:\Music\foo.mp3)

The downloading part (which is the part I need help with) looks like this:
def GetFiles():
    """do the actual copying of files"""
    for url,path in hreflist:
        print(url,end=" ")
        srcdata = urlopen(url).read()
        dstfile = open(path,mode='wb')
        dstfile.write(srcdata)
        dstfile.close()
        print("Done!")

hreflist is the list of tuples.

at the moment the print(url,end=" ") will not be printed before the actual
download, instead it will be printed at the same time as print("Done!").
This I would like to have the way I intended.

Is downloading a binary file using: srcdata = urlopen(url).read()
the best way? Is there some other way that would speed up the downloading?

Yes. Instead of running the downloads in a sequential loop, put the code
for downloading one file into a function and start one thread per file,
each of which runs that function (see the threading module). That way, each
thread can happily sit and wait for data coming from its server, without
preventing other threads from receiving data from their server at the same
time. That should get your bandwidth usage up.

You may have to take care that you do not run too many threads against the
same server (which may get upset and block your requests, depending on the
site), or that you limit the number of threads when you download a large
number of files. Running too many threads can slow things down again. But
you'll see that when you try.
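One possible sketch of this approach follows. Note that it uses concurrent.futures.ThreadPoolExecutor rather than creating threading.Thread objects by hand, because the pool caps the number of simultaneous threads for you. The names get_file and get_files and the default of 4 workers are assumptions, not something from the original program.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve

def get_file(item):
    """Download one (url, path) pair and return the url when finished."""
    url, path = item
    urlretrieve(url, path)
    return url

def get_files(hreflist, max_workers=4):
    """Download all (url, path) pairs using at most max_workers threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() runs get_file concurrently and yields results in input
        # order; an exception raised in a worker is re-raised here.
        return list(pool.map(get_file, hreflist))
```

Keeping max_workers small is one way to avoid hammering a single server; the best value depends on the site and the connection, so it is worth experimenting.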

Stefan
 
MRAB

Matteo said:
Have you tried reading all files first, then saving each one on the
appropriate directory? It might work if you have enough memory, i.e.
if the files you are downloading are small, and I assume they are,
otherwise it would be almost useless to optimize the code, since the
most time consuming part would always be the download. Anyway, I would
try and time it, or timeit. ;)

Anyway, opening a network connection does take some time, independent
of the size of the files you are downloading and of the kind of code
requesting it, you can't do much about that. If you had linux you
could probably get better results with wget, but that's another story
altogether.

If your net connection is working at its maximum then there's nothing
you can do to speed up the downloads.

If it's the response time that's the problem then you could put the
tuples into a queue and run a number of threads, each one repeatedly
getting a tuple from the queue and downloading, until the queue is
empty.
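A rough sketch of the queue-plus-workers idea, assuming the same list of (url, path) tuples as in the question. The names worker and get_files, the default of 4 threads, and the decision to collect failures in a list instead of stopping are illustrative choices.

```python
import queue
import threading
from urllib.request import urlretrieve

def worker(jobs, errors):
    """Take (url, path) tuples from jobs until the queue is empty."""
    while True:
        try:
            url, path = jobs.get_nowait()
        except queue.Empty:
            return  # nothing left to do, so this thread ends
        try:
            urlretrieve(url, path)
        except OSError as exc:
            # URLError and HTTPError are OSError subclasses; record the
            # failure and carry on with the next tuple.
            errors.append((url, exc))

def get_files(hreflist, num_threads=4):
    """Download every (url, path) pair; return a list of (url, error)."""
    jobs = queue.Queue()
    for item in hreflist:
        jobs.put(item)
    errors = []
    threads = [threading.Thread(target=worker, args=(jobs, errors))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every worker has drained the queue
    return errors
```

Because the queue is filled before the threads start, an empty queue reliably means the work is finished, so no sentinel values are needed.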
 
