Janus Bor
Hello everyone,
I'm pretty new to Ruby and programming in general. Here's my problem:
I'm writing a program that will automatically download protein sequences
from a server and write them into the corresponding file. Every single
sequence has a unique id and I have to eliminate duplicates. However, as
the number of sequences might exceed 50 000, I can't simply save all
sequences in a hash (with their id as key) and then write them to disk
after downloading has finished. So my idea is to write every sequence to
the corresponding file immediately, but first I have to check if it has
been processed already.
I could save all processed ids in an array and then check whether the array
includes my current id:
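sequences = []
downloaded_ids.each do |id|        # placeholder for "some kind of loop magic"
  unless sequences.include?(id)    # only handle ids that were not seen before
    process_file(id)               # placeholder: write the sequence to its file
    sequences << id
  end
end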
But I suspect that sequences.include?(id) would iterate over the whole
array until it finds a match. As this array might have up to 50 000
entries and I will have to do this check for every sequence, this
would probably be very inefficient.
I could also save all processed ids as keys of a hash, although I don't
have any use for the values:
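sequences = {}
downloaded_ids.each do |id|        # placeholder for "some kind of loop magic"
  unless sequences[id]             # nil (falsy) for ids that were not seen before
    process_file(id)               # placeholder: write the sequence to its file
    sequences[id] = true
  end
end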
Would this method be more efficient? Is there a more elegant way? Also,
can Ruby handle arrays/hashes of this size?
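One alternative that might fit the "keys without values" case is Set from Ruby's standard library. Below is a minimal sketch, not a finished solution. It assumes the ids are strings. The names downloaded_ids, seen and processed are made up for illustration, and appending to processed stands in for writing the sequence to its file:

```ruby
require 'set'

# Hypothetical stand-in for the ids coming back from the server (with duplicates).
downloaded_ids = %w[P12345 Q67890 P12345 A11111 Q67890]

seen = Set.new   # ids that have already been processed
processed = []   # stand-in for "write the sequence to its file"

downloaded_ids.each do |id|
  # Set#add? returns nil when the id is already in the set, so duplicates are skipped.
  next unless seen.add?(id)
  processed << id
end
```

Set is hash-based, so each membership check should take roughly constant time on average instead of scanning every stored id the way Array#include? does.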
Thanks in advance!