Moving a large number of files, 1,750,000+

  • Thread starter Sebastian Newstream

Sebastian Newstream

Hello fellow Rubyists!

I'm trying to impress my boss and co-workers with Ruby so we
hopefully can start to use it at work more often. I was given
the task of moving a *large* repository of images from one
source to another. The repository consists of around 1,750,000
images and requires around 350GB of space.
I thought this would be no match for Ruby!
And while it proved no match for Ruby, it was quite a match for me. =)

I have attached the source code with this post.
Please be gentle on me, I'm quite new to Ruby. =D

So far I have run tests on my local machine, and it took around 47s to
copy 4,211 items. *calculating* At this speed it would take around
13 hours to copy the whole repository. That's a lot of time.
If I present this to my co-workers I know they will instantly blame Ruby
for it, even though I am the one to blame.

My question is this: How do I speed up my application?
I reused my filehandler and skipped the printing to the console,
but it is still taking time.

Also, if anyone has previous experience handling this many files,
any tips are welcome. I'm quite worried that the array containing
the paths to all the files will flood the stack.
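(A note for later readers: a Ruby Array lives on the heap, not the stack, so 1.75 million path strings won't overflow anything, though they will cost a fair amount of memory. One way to avoid holding them at all is to stream the tree with Find.find. A minimal sketch, independent of the attached script, using throwaway temp directories rather than the real repository:)

```ruby
require 'find'
require 'fileutils'
require 'tmpdir'

# Throwaway source/target trees for illustration.
source = Dir.mktmpdir
target = Dir.mktmpdir
FileUtils.mkdir_p(File.join(source, 'album'))
File.write(File.join(source, 'album', 'a.jpg'), 'data')

copied = 0
# Find.find yields paths one at a time as it walks the tree,
# so the full list of 1.75 million files is never held in memory.
Find.find(source) do |path|
  next if File.directory?(path)
  rel  = path[(source.length + 1)..-1]   # path relative to source
  dest = File.join(target, rel)
  FileUtils.mkdir_p(File.dirname(dest))
  FileUtils.cp(path, dest)
  copied += 1
end
puts copied   # => 1
```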

Thanks in advance and my regards.
//Sebastian

Attachments:
http://www.ruby-forum.com/attachment/2908/eXtremeCop.rb
 

Robert Klemme

Sebastian Newstream said:
Hello fellow Rubyists!

I'm trying to impress my boss and co-workers with Ruby so we
hopefully can start to use it in work more often. I was given
the task with moving a *large* repository of images from one
source to the next. The repository consists of around 1.750.000
images and requires around 350GB of space.
My question is this: How do I speed up my application?
I reused my filehandler and skipped the printing to the console,
but it is still taking time.

Also if any one has any previous experience of handling this many files
any kind of tips are welcome. I'm quite worried that the array
containing
the path to all the files will flood the stack.

Sorry to disappoint you, but this amount of copying won't be really fast
regardless of programming language. You do not mention what a "source"
is in your case, what operating systems are involved, and what transport
medium you intend to use (local, network). If you need to transfer
over a network, in my experience tar with a pipe works pretty well.
But no matter what you do, the slowest link will determine your
throughput: you cannot go faster than network speed or the speed at
which your "sources" can read or write.

Here's the tar variant; since you copy images I assume the data is
already compressed and does not need compression (on your favorite Unix
shell prompt):

$> ( cd "$source" && tar cf - . ) | ssh user@target "cd '$target' && tar xf -"

If you can physically move the source disk to the target host and then
do a local copy with cp -a, that's probably the fastest you can go -
unless the physical move takes ages (e.g. to the moon or other remote locations).

Kind regards

robert
 

Randy Kramer

Robert Klemme said:
Sorry to disappoint you but this amount of copying won't be really fast
regardless of programming language. You do not mention what a "source"
in your case is, what operating systems are involved and what transport
media you are intending to use (local, network). If you need to
transport using a network in my experience tar with a pipe works pretty
well. But no matter what you do, the slowest link will determine your
throughput: you cannot go faster than network speed or the speed that
your "sources" can read or write.

Here's the tar variant, since you copy images I assume data is
compressed and does not need compression (on your favorite Unix shell
prompt):

$> ( cd "$source" && tar cf - . ) | ssh user@target "cd '$target' && tar xf -"

If you can physically move the source disk to the target host and then
do a local copy with cp -a that's probably the fastest you can go -
unless the physical takes ages (e.g. to the moon or other remote
locations).

I agree with Robert, but before I saw his response I did some
calculations. Assuming all the images are the same size (about 200
KB), moving 4,211 of them in 47 seconds is a data rate close to 18
MB/s -- that's faster than 100 Mbit/s Ethernet, not counting any
overhead due to collisions.
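(For reference, Randy's back-of-the-envelope numbers check out in a few lines of Ruby; the 200 KB average file size is his assumption:)

```ruby
files    = 4_211
avg_size = 200 * 1024          # assumed 200 KB per image
seconds  = 47.0

# Throughput of the local test run in MB/s.
mb_per_sec = files * avg_size / seconds / (1024 * 1024)
puts mb_per_sec.round(1)       # => 17.5

# 100 Mbit/s Ethernet tops out at about 12.5 MB/s before overhead,
# so the local-disk test is faster than the network will ever be.
puts 100 / 8.0                 # => 12.5
```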

That's pretty fast for most channels. Are you moving data from one disk
to another on the same computer? Or over a high speed connection
between two computers? What is the raw hardware speed of the
interconnect?

I wouldn't be too worried about the 13 hours, you've got a lot of data
to move.

Randy Kramer
 

Randy Kramer

I wouldn't be too worried about the 13 hours, you've got a lot of data
to move.

PS: I wish I had added: Since all you're doing is copying files, do it
from the CLI (as Robert suggested) -- no need to involve any programming
language, which is just added overhead. Then let us know how many hours
it takes that way, for comparison.

Randy Kramer
 

Sebastian Newstream

First of all, thanks for your quick answer!
I was a bit tired when I asked the question so I'm sorry
for the lacking information.

Robert said:
Sorry to disappoint you but this amount of copying won't be really fast
regardless of programming language. You do not mention what a "source"
in your case is, what operating systems are involved and what transport
media you are intending to use (local, network). If you need to
transport using a network in my experience tar with a pipe works pretty
well. But no matter what you do, the slowest link will determine your
throughput: you cannot go faster than network speed or the speed that
your "sources" can read or write.

The target system I will use is a virtual Windows 2003 server with a
mounted network drive. Unfortunately I have no access to any of the
hardware, but I know there is at least a 100Mbit Ethernet connection
between the server and the mounted disk.
Here's the tar variant, since you copy images I assume data is
compressed and does not need compression (on your favorite Unix shell
prompt):

$> ( cd "$source" && tar cf - . ) | ssh user@target "cd '$target' && tar xf -"

Thanks for your tips, but it's a Windows system.
If you can physically move the source disk to the target host and then
do a local copy with cp -a that's probably the fastest you can go -
unless the physical takes ages (e.g. to the moon or other remote
locations).

Since our company outsourced the hardware maintenance, the moon or across
the street makes no difference. =(
Kind regards

robert

What I meant to ask was: in what way can I change my source code to be
more efficient?
Thanks a lot for your time.
//Sebastian
 

Sebastian Newstream

Thank you as well, Mr. Kramer! I will try to clarify...

Randy said:

I agree with Robert, but before I saw his response I did some
calculations. Assuming all the images are the same size (about 200
KB), moving 4,211 of them in 47 seconds is a data rate close to 18
MB/sec.--that's faster than a 100 mb/sec Ethernet, not counting any
overhead due to collisions.

That's pretty fast for most channels. Are you moving data from one disk
to another on the same computer? Or over a high speed connection
between two computers? What is the raw hardware speed of the
interconnect?

I know it is a very rough estimate, and the tests I performed were on
my MacBook Pro, from one folder to another. Of course when I run this
live, the environment will be very different. I just wanted to estimate
a minimum time for the copy.
 

Sebastian Newstream

You're probably right; I will start the job on a Friday evening and let
it take its time.
PS: I wish I had added: Since all you're doing is copying files, do it
from the CLI (as Robert suggested)--no need to involve any programming
language which is just added overhead. Then let us know how many hours
it takes that way, for comparison.

Randy Kramer

You're probably right about this as well, but I can't back out of the
Ruby corner now. I've already talked Ruby up too much; if I change my
method now it will make Ruby look really bad. =(

This is what I succeeded with:
* I removed all of the console prints for each file. (This lowered the
time by about 20s! I had no idea that output was so demanding.)
* I kept the filehandle open for writing to the process.log.
* I also removed any unnecessary lines of code in the critical part of my
application.
This lowered the time to around 17s. I will now try to run the test in
the right environment.
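(The shape of those changes might look like this -- a sketch with a throwaway temp directory, not the attached script: the log handle is opened once for the whole run, and nothing is printed per file.)

```ruby
require 'fileutils'
require 'tmpdir'

dir    = Dir.mktmpdir
target = File.join(dir, 'out')
FileUtils.mkdir_p(target)
files = 3.times.map do |i|
  path = File.join(dir, "img#{i}.jpg")
  File.write(path, 'x')
  path
end

log_path = File.join(dir, 'process.log')
# One handle for the entire run; per-file console output is gone,
# and writes to the log stay buffered until the block closes it.
File.open(log_path, 'w') do |log|
  files.each do |f|
    FileUtils.cp(f, target)
    log.puts(f)
  end
end

puts File.readlines(log_path).size   # => 3
```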

Of course I will post the results here for you guys to see.
Thanks again for your time.
//Sebastian
 

Saji N. Hameed

Sebastian Newstream said:
Your probably right, I will start the job on a friday evening and let it
take it's time.


Your probably right about this as well, but I can't backout of the Ruby
corner now. I already opened my mouth about Ruby to much now, if I
change my method now it will make Ruby look realy bad. =(

This is what I succeded with:
* I removed all of the console prints for each file. (This lowered the
time with about 20s! I had no idea that output was so demanding.).
* I kept the filehandle open for writing to the process.log.
* I also removed any line of unessesary code in the critical part of my
application.
This lowered the time to around 17s. I will now try to run the test on
the right environment.

Of course I will post the results here for your guys to se.
Thanks again for your time.
//Sebastian


This may be a naive suggestion, but it may be worthwhile to see if there
is a benefit in parallelizing the process using threads (splitting the
transfer jobs among multiple threads?).
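(A sketch of what that could look like: a fixed pool of workers pulling paths from a shared Queue. Temp directories stand in for the real source and target; whether extra threads actually help depends on whether the link is already saturated.)

```ruby
require 'fileutils'
require 'tmpdir'

src = Dir.mktmpdir
dst = Dir.mktmpdir
paths = 8.times.map do |i|
  p = File.join(src, "img#{i}.jpg")
  File.write(p, 'x')
  p
end

queue = Queue.new
paths.each { |path| queue << path }

# Four workers pop paths until the queue runs dry; the non-blocking
# pop raises ThreadError when empty, which ends the worker.
workers = 4.times.map do
  Thread.new do
    loop do
      path = begin
        queue.pop(true)
      rescue ThreadError
        break
      end
      FileUtils.cp(path, dst)
    end
  end
end
workers.each(&:join)

puts Dir[File.join(dst, '*')].size   # => 8
```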

saji
--
Saji N. Hameed

APEC Climate Center +82 51 668 7470
National Pension Corporation Busan Building 12F
Yeonsan 2-dong, Yeonje-gu, BUSAN 611705 (e-mail address removed)
KOREA
 

Jano Svitok

Saji N. Hameed said:
This may be a naive suggestion. It may be worthwhile to see if there
is a benefit in parallelize the process using Threads (split the transfer
jobs among multiple threads ??? ... )

I guess Ara Howard's threadify
(http://codeforpeople.com/lib/ruby/threadify/) might be handy.

The usefulness of more threads depends on network saturation - measure
your network/disk throughput using a plain system copy (maybe several
parallel ones), then measure what your script does.
I'm afraid that if you're going over Ethernet, one thread will be enough.

I'd also suggest using File.directory? to test whether a file is a
directory, instead of searching for '.'
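(In other words, something along these lines, with a throwaway directory for illustration:)

```ruby
require 'fileutils'
require 'tmpdir'

root = Dir.mktmpdir
FileUtils.mkdir(File.join(root, 'subdir'))
File.write(File.join(root, 'photo.jpg'), 'x')
File.write(File.join(root, 'my.folder.jpg'), 'x')  # a name a '.'-scan could misjudge

entries = Dir.entries(root) - %w[. ..]
# File.directory? asks the filesystem directly instead of guessing
# from the presence of a '.' in the name.
files = entries.reject { |e| File.directory?(File.join(root, e)) }
puts files.sort   # prints the two file names, one per line
```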

Jano
 

Robert Klemme

2008/11/10 Sebastian Newstream said:
The target system I will use is a virtual Windows 2003 server with a
mounted network drive. Unfortunatly I have no access to any of the
hardware.
But I know there is at least a 100Mbit Ethernet connection between the
server and the mounted disk.


Thanks for your tips, but it's a Windows system.

The command above works in a Cygwin shell. Alternatively you can use
XCOPY or the Windows shell (Explorer) directly.
Since our company outsourced the hardware maintenance the moon or across
the street makes no difference. =(
:)

What I meant to ask was: in what way can I change my source code to be
more efficient?

And the answer is and was: don't bother too much because your transfer
is IO bound regardless of programming language or tool used.

Cheers

robert
 

Sebastian Newstream

Robert said:
The command above works on a cygwin shell. Alternatively you can use
XCOPY or directly use the Windows Shell (Explorer).

Ok, great tip. I will keep it as a backup plan.
The thing is, I need logging of all files being transferred so I know if
something is missing.
And the answer is and was: don't bother too much because your transfer
is IO bound regardless of programming language or tool used.

OK! I will listen to your tips.
Thanks for all your input Robert.
Best regards
//Sebastian
 

Sebastian Newstream

Jano said:
I guess Ara Howard's threadify
(http://codeforpeople.com/lib/ruby/threadify/) might be handy.

The usefulness of more threads depends on network saturation - measure
your network/disk throughput
using plain system copy (maybe several parallel ones), then measure
what your script does.
I'm afraid if you're going over ethernet, one thread would be enough.

That's what I thought too. Thanks for confirming this.
I'd also suggest using File.directory? for testing if the file is
directory, instead of searching for '.'

I will definitely do this.
Thanks for your input Jano.
Best regards
//Sebastian
 

Sebastian Newstream

Saji said:
This may be a naive suggestion. It may be worthwhile to see if there
is a benefit in parallelize the process using Threads (split the
transfer
jobs among multiple threads ??? ... )

saji
--
Saji N. Hameed

APEC Climate Center +82 51 668 7470
National Pension Corporation Busan Building 12F
Yeonsan 2-dong, Yeonje-gu, BUSAN 611705 (e-mail address removed)
KOREA

Thanks for your input. If the time won't go down any more I will
definitely try this.
Best regards
//Sebastian
 

Robert Klemme

2008/11/10 Sebastian Newstream said:
Ok, great tip. I will keep it as a backup plan.
The thing is, I need logging of all files being transferred so I know if
something is missing.

xcopy writes all filenames to the console. You can easily redirect
this to a file.

xcopy from to /e /i > log

For tar, just add the letter "v" to output file names, e.g.,

( cd "$source" && tar cvf - . 2>copied_files ) | ssh user@target "cd '$target' && tar xf -"

Cheers

robert
 

Siep Korteling

Robert said:
xcopy writes all filenames to the console. You can easily redirect
this to a file.

xcopy from to /e /i > log

For tar just add letter "v" for output of file names, e.g.,

( cd "$source" && tar cvf - . 2>copied_files ) | ssh user@target "cd '$target' && tar xf -"

Cheers

robert

xcopy has (or used to have) problems with long pathnames. Microsoft
provides "robocopy" (standard in Vista, downloadable for others). It has
features like tolerance for network outages and the ability to copy
ACLs on NTFS. http://en.wikipedia.org/wiki/Robocopy

I've no experience with xxcopy (an improved xcopy). It looks good too.

hth,

Siep


 
