urllib2 performance on windows, usb connection

dq

I've googled this pretty extensively and can't find anyone who's had the
same problem, so here it is:

I wrote a console program in python to download podcasts, so speed is an
issue. I have 1.6 M down. The key bit of downloading code is this:

import urllib2
source = urllib2.urlopen( url )
target = open( filename, 'wb' )
target.write( source.read() )
target.close()
source.close()

This runs great on Ubuntu. I get DL speeds of about 1.5 Mb/s on the
SATA HD or on a usb-connected iPod, but if I run the same program on
Windows (with a 2 GHz core 2 duo, 7200 rpm sata drive---better hardware
specs than the Ubuntu box), it maxes out at about 500 kb/s. Worse, if I
DL directly to my iPod in disk mode, I'm lucky if I even hit 100 kb/s.

So does anyone know what the deal is with this? Why is the same code so
much slower on Windows? Hope someone can tell me before a holy war
erupts :)

--danny
 
Martin v. Löwis

> So does anyone know what the deal is with this? Why is the same code so
> much slower on Windows? Hope someone can tell me before a holy war
> erupts :)

Only the holy war can give an answer here. It certainly has *nothing* to
do with Python; Python calls the operating system functions to read from
the network and write to the disk almost directly. So it must be the
operating system itself that slows it down.

To investigate further, you might drop the write operation, and measure
only source.read(). If that is slower, then, for some reason, the
network speed is bad on Windows. Maybe you have the network interfaces
misconfigured? Maybe you are using wireless on Windows, but cable on
Linux? Maybe you have some network filtering software running on
Windows? Maybe it's just that Windows sucks?-)

If the network read speed is fine, but writing slows down, I ask the
same questions. Perhaps you have some virus scanner installed that
filters all write operations? Maybe Windows sucks?
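
A minimal sketch of that comparison, not taken from the original program: it times a read on its own, then a read followed by a write. It is written for Python 3, and io.BytesIO stands in for the urlopen() response so it runs without a network. The function names and the 1 MB payload are assumptions for illustration only.

```python
import io
import os
import tempfile
import time


def time_read_only(source):
    """Read everything from source; return (seconds, byte_count)."""
    start = time.perf_counter()
    data = source.read()
    return time.perf_counter() - start, len(data)


def time_read_and_write(source, path):
    """Read everything from source and write it to path; return (seconds, byte_count)."""
    start = time.perf_counter()
    data = source.read()
    with open(path, "wb") as target:
        target.write(data)
        target.flush()
        os.fsync(target.fileno())  # ask the OS to push the data to the device
    return time.perf_counter() - start, len(data)


# Stand-in payload; in a real test `source` would be the urlopen() result
# and `tmp_path` would point at the drive being measured.
payload = b"x" * (1024 * 1024)

read_secs, read_bytes = time_read_only(io.BytesIO(payload))

fd, tmp_path = tempfile.mkstemp()
os.close(fd)
try:
    write_secs, written_bytes = time_read_and_write(io.BytesIO(payload), tmp_path)
    size_on_disk = os.path.getsize(tmp_path)
finally:
    os.remove(tmp_path)

print("read only:      %.4f s for %d bytes" % (read_secs, read_bytes))
print("read and write: %.4f s for %d bytes" % (write_secs, written_bytes))
```

If the two timings are close, the network is the likely bottleneck; if the second is much larger, look at the write path.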

Regards,
Martin
 
dq

Thanks for the ideas, Martin. I ran a couple of experiments to find the
culprit, by downloading the same 20 MB file from the same fast server.
I compared:

1. DL to HD vs USB iPod.
2. AV on-access protection on vs. off
3. "source.read()" only vs. "file.write( source.read() )"

The culprit is definitely the write speed on the iPod. That is,
everything runs plenty fast (~1 MB/s down) as long as I'm not writing
directly to the iPod. This is kind of odd, because if I copy the file
over from the HD to the iPod using windows (drag-n-drop), it takes about
a second or two, so about 10 MB/s.

So the problem is definitely partially Windows, but it also seems that
Python's file.write() function is not without blame. It's the
combination of Windows, iPod and Python's data stream that is slowing me
down.

I'm not really sure what I can do about this. I'll experiment a little
more and see if there's any way around this bottleneck. If anyone has
run into a problem like this, I'd love to hear about it...

thanks again,
--danny
 
MRAB

You could try copying the file to the iPod using the command line, or
copying data from disk to iPod in, say, C, anything but Python. This
would allow you to identify whether Python itself has anything to do
with it.
 
dq

MRAB said:
> You could try copying the file to the iPod using the command line, or
> copying data from disk to iPod in, say, C, anything but Python. This
> would allow you to identify whether Python itself has anything to do
> with it.

Well, I think I've partially identified the problem. target.write(
source.read() ) runs perfectly fast, copies 20 megs in about a second,
from HD to iPod. However, if I run the same code in a while loop, using
a certain block size, say target.write( source.read(4096) ), it takes
forever (or at least I'm still timing it while I write this post).

The mismatch seems to be between urllib2's block size and the write
speed of the iPod. I might try to tweak this a little in the code and
see if it has any effect.

Oh, there we go: 20 megs in 135.8 seconds. Yeah... I might want to
try to improve that...
 
MRAB

How long does it take to transfer 4KB? If it can transfer 1MB/s then I'd
say that 4KB is too small. Generally speaking, the higher the data rate,
the larger the blocks you should be transferring at a time, IMHO.

You could write a script to test the transfer speed with different block
sizes.
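
A rough sketch of such a script, under some assumptions: it is Python 3, the data comes from an in-memory io.BytesIO rather than a real download, and the names copy_in_blocks and benchmark are made up for this example. To measure a particular device, pass a directory on that device instead of the system temp directory.

```python
import io
import os
import tempfile
import time


def copy_in_blocks(source, target, blocksize):
    """Copy source to target in blocksize chunks; return bytes copied."""
    copied = 0
    while True:
        data = source.read(blocksize)
        if not data:  # an empty read means end of stream
            break
        target.write(data)
        copied += len(data)
    return copied


def benchmark(payload, directory, blocksizes):
    """Return a list of (blocksize, seconds, bytes_copied) tuples."""
    results = []
    for blocksize in blocksizes:
        fd, path = tempfile.mkstemp(dir=directory)
        os.close(fd)
        try:
            start = time.perf_counter()
            with open(path, "wb") as target:
                copied = copy_in_blocks(io.BytesIO(payload), target, blocksize)
                target.flush()
                os.fsync(target.fileno())  # include the time to reach the device
            elapsed = time.perf_counter() - start
        finally:
            os.remove(path)
        results.append((blocksize, elapsed, copied))
    return results


payload = b"\0" * (2 * 1024 * 1024)  # 2 MB of stand-in data
results = benchmark(payload, tempfile.gettempdir(), [2 ** n for n in range(12, 18)])
for blocksize, elapsed, copied in results:
    print("%7d bytes/block: %.4f s" % (blocksize, elapsed))
```

Timings on a local disk with a small payload will be noisy, so repeat the run a few times and use a file closer to the real download size before drawing conclusions.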
 
dq

After some tweaking of the block size, I managed to get the DL speed up
to about 900 kb/s. It's still not quite Ubuntu, but it's a good order
of magnitude better. The new DL code is pretty much this:

"""
blocksize = 2 ** 16 # plus or minus a power of 2
source = urllib2.urlopen( 'url://string' )
target = open( pathname, 'wb')
fullsize = float( source.info()['Content-Length'] )
DLd = 0
while DLd < fullsize:
    DLd = DLd + blocksize
    # optional: write some DL progress info
    # somewhere, e.g. stdout
target.close()
source.close()
"""
 
dq

MRAB said:
> How long does it take to transfer 4KB? If it can transfer 1MB/s then I'd
> say that 4KB is too small. Generally speaking, the higher the data rate,
> the larger the blocks you should be transferring at a time, IMHO.
>
> You could write a script to test the transfer speed with different block
> sizes.

Thanks MRAB, 32 or 64 KB seems to be quickest, but I'll do a more
scientific test soon and see what turns up.
 
MRAB

dq said:
> After some tweaking of the block size, I managed to get the DL speed
> up to about 900 kb/s. It's still not quite Ubuntu, but it's a good
> order of magnitude better. The new DL code is pretty much this:
>
> """
> blocksize = 2 ** 16 # plus or minus a power of 2
> source = urllib2.urlopen( 'url://string' )
> target = open( pathname, 'wb')
> fullsize = float( source.info()['Content-Length'] )
> DLd = 0
> while DLd < fullsize:
>     DLd = DLd + blocksize
>     # optional: write some DL progress info
>     # somewhere, e.g. stdout
> target.close()
> source.close()
> """
>
I'd like to suggest that the block size you add to 'DLd' be the actual
size of the returned block, just in case the read() doesn't return all
you asked for (it might not be guaranteed, and the chances are that the
final block will be shorter, unless 'fullsize' happens to be a multiple
of 'blocksize').

If less is returned by read() then the while-loop might finish before
all the data has been downloaded, and if you just add 'blocksize' each
time it might end up > 'fullsize', ie apparently >100% downloaded!
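
To make both failure modes concrete, here is a small sketch (Python 3, not from the original program). It assumes the loop reads one block per iteration, and it uses a hypothetical ShortReader class to imitate a stream that hands back fewer bytes than requested. The sizes 1000, 300 and 100 are arbitrary.

```python
import io


class ShortReader(object):
    """Hypothetical stand-in for a stream that returns at most
    `limit` bytes per read(), however many are requested."""

    def __init__(self, payload, limit):
        self._stream = io.BytesIO(payload)
        self._limit = limit

    def read(self, size):
        return self._stream.read(min(size, self._limit))


def download_counters(source, fullsize, blocksize):
    """Run the blocksize-counting loop; return (assumed_count, actual_count)."""
    assumed = 0  # what the loop tracks: blocksize added per iteration
    actual = 0   # what read() really returned
    while assumed < fullsize:
        data = source.read(blocksize)
        assumed += blocksize
        actual += len(data)
    return assumed, actual


payload = b"a" * 1000

# Case 1: full reads, but 1000 is not a multiple of 300,
# so the counter ends up above fullsize.
assumed, actual = download_counters(io.BytesIO(payload), len(payload), 300)
print(assumed, actual)  # 1200 1000

# Case 2: every read returns only 100 bytes, so the loop
# stops after 4 reads with most of the data still unread.
assumed_short, actual_short = download_counters(
    ShortReader(payload, 100), len(payload), 300)
print(assumed_short, actual_short)  # 1200 400
```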
 
dq

MRAB said:
> I'd like to suggest that the block size you add to 'DLd' be the
> actual size of the returned block, just in case the read() doesn't
> return all you asked for (it might not be guaranteed, and the chances
> are that the final block will be shorter, unless 'fullsize' happens
> to be a multiple of 'blocksize').
>
> If less is returned by read() then the while-loop might finish before
> all the data has been downloaded, and if you just add 'blocksize'
> each time it might end up > 'fullsize', ie apparently >100%
> downloaded!

Interesting. I'll check to see if any of the downloaded files end
prematurely :)

btw, I forgot the most important line of the code!

"""
blocksize = 2 ** 16 # plus or minus a power of 2
source = urllib2.urlopen( 'url://string' )
target = open( pathname, 'wb')
fullsize = float( source.info()['Content-Length'] )
DLd = 0
while DLd < fullsize:
    # +++
    target.write( source.read( blocksize ) ) # +++
    # +++
    DLd = DLd + blocksize
    # optional: write some DL progress info
    # somewhere, e.g. stdout
target.close()
source.close()
"""

Using that, I'm not quite sure where I can grab onto the value of how
much was actually read from the block. I suppose I could use an
intermediate variable, read the data into it, measure the size, and then
write it to the file stream, but I'm not sure it would be worth the
overhead. Or is there some other magic I should know about?

If I start to get that problem, at least I'll know where to look...
 
MRAB

dq said:
> Using that, I'm not quite sure where I can grab onto the value of how
> much was actually read from the block. I suppose I could use an
> intermediate variable, read the data into it, measure the size, and then
> write it to the file stream, but I'm not sure it would be worth the
> overhead. Or is there some other magic I should know about?
>
> If I start to get that problem, at least I'll know where to look...

It's just:

data = source.read(blocksize)
target.write(data)
DLd = DLd + len(data)

The overhead is tiny because you're not copying the data.
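
Put together as a whole loop, that might look like the sketch below. It is Python 3 and uses io.BytesIO objects in place of the urlopen() response and the output file, so the names copy_stream and progress and the sample data are assumptions, not part of the original program.

```python
import io


def copy_stream(source, target, blocksize=2 ** 16, progress=None):
    """Copy source to target block by block; return the number of bytes copied."""
    copied = 0
    while True:
        data = source.read(blocksize)
        if not data:  # an empty result signals the end of the stream
            break
        target.write(data)
        copied += len(data)  # count what was actually read
        if progress is not None:
            progress(copied)  # e.g. print DL progress to stdout
    return copied


source = io.BytesIO(b"podcast" * 30000)  # 210000 bytes of stand-in data
target = io.BytesIO()
seen = []
total = copy_stream(source, target, progress=seen.append)
print(total)  # 210000
print(seen)   # [65536, 131072, 196608, 210000]
```

Stopping on an empty read rather than on a byte count also means the loop does not depend on the Content-Length header being present or accurate.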

If 'x' refers to a 1MB bytestring and you do "y = x" or "foo(x)", you're
not actually copying that bytestring; you're just making 'y' also refer
to it or passing the reference to it into 'foo'. It's a bit like passing
pointers around, but without the nasty bits! :)
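
A quick way to check this for yourself (a small illustration, with 'foo' as a placeholder function):

```python
x = b"\0" * (1024 * 1024)  # a 1 MB bytestring


def foo(data):
    # `data` is the very object the caller passed in, not a copy
    return data


y = x
print(y is x)       # True: both names refer to the same object
print(foo(x) is x)  # True: the function received a reference
```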
 
dq

MRAB said:
> It's just:
>
> data = source.read(blocksize)
> target.write(data)
> DLd = DLd + len(data)
>
> The overhead is tiny because you're not copying the data.
>
> If 'x' refers to a 1MB bytestring and you do "y = x" or "foo(x)", you're
> not actually copying that bytestring; you're just making 'y' also refer
> to it or passing the reference to it into 'foo'. It's a bit like passing
> pointers around, but without the nasty bits! :)

Yeah, that's about what I was thinking, although not quite as
succinctly. Thanks for the help!
 
