Devesh Agrawal
Hi Folks,
I am using Ruby to analyse a huge amount (around 60 GB) of my networking
experiment data. Let me briefly describe my technique: I have to read
around 40 files (around 1.5 GB each) named f1, f2, ... Each file fi
contains traceroutes to lots of destinations at different times, i.e. a
file is basically a list of traceroutes launched from a given source
(source = filename) at different times. I want to get a structure like
the following: (list of all traceroutes from *all* sources at time 1),
(list of all traceroutes from *all* sources at time 2), and so on.
For this I am using roughly the following pseudocode:
outputfile = File.open("output.dat", "wb")
files = filenames.map { |name| File.open(name) }  # open all files f1..fn
h = Hash.new { |hash, t| hash[t] = [] }           # H: time => list of P
until files.all? { |f| f.eof? }
  files.each do |f|
    next if f.eof?
    line = f.readline
    p = parse(line)    # parse the line and get a structure P out of it
    h[p.time] << p     # put P into the hashtable: H[P.time] << P
    if h.size > k      # H has more than k keys, i.e. it has become very large
      h.keys.sort.each do |t|
        outputfile << Marshal.dump(h[t])
        h.delete(t)
      end
    end
  end
end
files.each { |f| f.close }
outputfile.close
(By the way, I can't use an array instead of a hashtable H, as the
P.time values read across all the files needn't be the same.)
This is performing miserably slowly. I have the following questions:
i. How fast is f.readline? I want to use the maximum buffering possible
for the largest speed gains. In Ruby, how do I set the buffer size? I
looked through io.c, and it seems that readline essentially uses getc
(stopping when it gets a newline). How can I set the buffer size for the
underlying libc FILE*? Oh, by the way, each line is approx 200-400 bytes.
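To make (i) concrete, here is the kind of replacement for readline I am
considering: read in big chunks and split into lines myself, so the
buffer size is under my control. This is just a sketch; CHUNK_SIZE and
each_line_chunked are names I made up:

CHUNK_SIZE = 1 << 20   # 1 MB per read; just a guess at a good size

# Yield complete lines from io, reading CHUNK_SIZE bytes at a time and
# carrying any trailing partial line over into the next chunk.
def each_line_chunked(io)
  leftover = ""
  while (chunk = io.read(CHUNK_SIZE))
    chunk = leftover + chunk
    lines = chunk.split("\n", -1)
    leftover = lines.pop            # last piece may be an incomplete line
    lines.each { |line| yield line }
  end
  yield leftover unless leftover.empty?
end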
ii. Marshal.dump is also very slow. Is there an alternative? YAML is
even worse.
iii. Is it bad to have around 40-50 files open at the same time?
iv. The program does use a lot of memory, but not that much: it uses
around 30-40 percent of a 1 GB RAM machine. So I think paging in/out is
not a problem.
v. Would coding the readline part in C using RubyInline offer me speed
advantages?
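Something like this minimal sketch is what I have in mind for (v). It
assumes the RubyInline gem; count_newlines is just a made-up stand-in to
show the pattern, not my real parsing code:

require 'rubygems'
require 'inline'

class FastIO
  inline do |builder|
    builder.include '<stdio.h>'
    # RubyInline compiles this C once and exposes it as an ordinary
    # Ruby method; char* and long arguments convert automatically.
    builder.c <<-'EOC'
      long count_newlines(char *path) {
        FILE *fp = fopen(path, "r");
        long n = 0;
        int c;
        if (!fp) return -1;
        while ((c = getc(fp)) != EOF)
          if (c == '\n') n++;
        fclose(fp);
        return n;
      }
    EOC
  end
end

puts FastIO.new.count_newlines("f1")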
vi. I am thinking of trying the following to reduce the time it takes;
I would very much welcome your comments:
a. Remove Marshal.dump [I don't need to strictly serialize objects, only
dump the data and read it back] and replace it with some string form
which is more compact. Actually, is it possible to have something like
fixed-length structures as in C? For example, I would want P to be like
this: struct P { char foo[100]; int a[100]; }. That way I think the I/O
would be faster, as I could just dump a fixed number of bytes to a file.
(See the pack/unpack sketch after this list.)
b. Try to reduce the memory consumption further by reducing k, so that
the program doesn't page in/out.
c. Can someone point me to good sample code for reading a file line by
line in C and then putting it into a Ruby hashtable?
d. How much of the slowness is due to the fact that it is Ruby and not C?
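For (vi.a), here is the kind of fixed-length record handling I mean, as
a sketch using Array#pack and String#unpack. FORMAT, RECORD_SIZE,
dump_record and read_record are names I made up, and the layout just
mirrors the hypothetical struct above:

# Hypothetical record mirroring struct P { char foo[100]; int a[100]; }
FORMAT      = "a100l100"       # 100 raw bytes, then 100 32-bit signed ints
RECORD_SIZE = 100 + 100 * 4    # 500 bytes per record on disk

def dump_record(io, foo, a)
  io << [foo, *a].pack(FORMAT) # pads/truncates foo to exactly 100 bytes
end

def read_record(io)
  raw = io.read(RECORD_SIZE)
  return nil if raw.nil?       # clean EOF
  fields = raw.unpack(FORMAT)  # => [foo, int, int, ..., int]
  [fields[0], fields[1..-1]]
end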
To give you an idea of how slow this actually is: just reading all the
files line by line takes around 8-9 hrs, whereas the above easily takes
5-6 days!! And I am quite unable to run a profiler on my code, as it is
just too slow.
I would be very grateful for your comments, and particularly if you have
any suggestions/experience on doing this in a fast way.
--Devesh Agrawal