Implementing a simple and efficient index system

Janus Bor

Hello everyone,

I'm pretty new to Ruby and programming in general. Here's my problem:

I'm writing a program that will automatically download protein sequences
from a server and write them into the corresponding file. Every single
sequence has a unique id and I have to eliminate duplicates. However, as
the number of sequences might exceed 50 000, I can't simply save all
sequences in a hash (with their id as key) and then write them to disk
after downloading has finished. So my idea is to write every sequence to
the corresponding file immediately, but first I have to check if it has
been processed already.

I could save all processed id's in an array and then check if the array
includes my current id:

sequences = []
# some kind of loop magic
  unless sequences.include?(id)
    # process file
    sequences << id
  end
end

But I suspect that sequences.include?(id) would iterate over the whole
array until it finds a match. As this array might have up to 50 000
entries and I will have to do this check for every sequence, this
would probably be very inefficient.

I could also save all processed id's as keys of a hash, however I don't
have any use for a value:

sequences = {}
# some kind of loop magic
  unless sequences[id]
    # process file
    sequences[id] = true
  end
end

Would this method be more efficient? Is there a more elegant way? Also,
can Ruby handle arrays/hashes of this size?

Thanks in advance!
 
phlip

Janus said:
I'm pretty new to Ruby and programming in general. Here's my problem:

I'm writing a program that will automatically download protein sequences
from a server and write them into the corresponding file. Every single
sequence has a unique id and I have to eliminate duplicates. However, as
the number of sequences might exceed 50 000, I can't simply save all
sequences in a hash (with their id as key)

How do you know that? Did you try it, as an experiment?
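
A quick experiment along those lines would settle it, e.g. (a throwaway
sketch with entirely synthetic data; the ~1 KB sequence size is a guess
based on typical protein entries):

# Build 50,000 fake 1 KB sequences in a hash and see how Ruby copes.
sequences = {}
50_000.times do |id|
  sequences[id] = 'A' * 1024   # stand-in for a ~1 KB protein sequence
end
puts sequences.size   # => 50000
# Then check the interpreter's memory footprint from outside, e.g. on
# Unix: ps -o rss= -p <pid of the ruby process>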
 
Joel VanderWerf

Janus said:
I could also save all processed id's as keys of a hash, however I don't
have any use for a value:

sequences = {}
# some kind of loop magic
  unless sequences[id]
    # process file
    sequences[id] = true
  end
end

Would this method be more efficient? Is there a more elegant way? Also,
can Ruby handle arrays/hashes of this size?

It's not so bad to use true as a hash value. But if it bothers you,
there is the Set class, which is really a hash underneath, but the
interface is set-membership rather than associative lookup:

require 'set'

s = Set.new

s << 123
s << 456

p s.include?(456) # ==> true
p s.include?(789) # ==> false
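
Applied to the loop you sketched, it might look roughly like this (just
a sketch: download_ids and fetch_sequence are placeholders for whatever
your actual download code does, and the file layout is made up):

require 'set'

seen = Set.new
download_ids.each do |id|            # your "loop magic" goes here
  next if seen.include?(id)          # constant-time duplicate check
  sequence = fetch_sequence(id)      # however you fetch one sequence
  File.open("sequences/#{id}.txt", 'w') { |f| f.write(sequence) }
  seen << id
end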
 
Joel VanderWerf

Janus said:
Hello everyone,

I'm pretty new to Ruby and programming in general. Here's my problem:

I'm writing a program that will automatically download protein sequences
from a server and write them into the corresponding file. Every single
sequence has a unique id and I have to eliminate duplicates. However, as
the number of sequences might exceed 50 000, I can't simply save all
sequences in a hash (with their id as key) and then write them to disk
after downloading has finished. So my idea is to write every sequence to
the corresponding file immediately, but first I have to check if it has
been processed already.

Can you simply use the id as a filename, and check for file existence
before writing? If your file system doesn't handle huge dirs well, then
split the id into several terms. But I'd try the hash or set approach
first, to avoid all the system calls.
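
Roughly like this, say (a sketch; splitting on the first two characters
of the id is just one way to keep any single directory small, and id and
sequence are assumed to come from your download loop):

require 'fileutils'

def path_for(id)
  File.join('sequences', id.to_s[0, 2], id.to_s)
end

path = path_for(id)
unless File.exist?(path)             # existence check replaces the Set
  FileUtils.mkdir_p(File.dirname(path))
  File.open(path, 'w') { |f| f.write(sequence) }
end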
 
Janus Bor

phlip said:
How do you know that? Did you try it, as an experiment?

No, I didn't try it and it might actually work: Every sequence has a
size of ~1 KB, so 50 000 sequences would probably be around 50 MB. But
getting all this data will take hours, so I need to implement a system
that will not lose all data if the program is terminated abnormally.

Joel said:
It's not so bad to use true as a hash value. But if it bothers you,
there is the Set class, which is really a hash underneath, but the
interface is set-membership rather than associative lookup:

require 'set'

s = Set.new

s << 123
s << 456

p s.include?(456) # ==> true
p s.include?(789) # ==> false

Thanks, that's exactly what I was looking for! I didn't know set
basically works like a hash without a key...
 
Igal Koshevoy

Janus said:
No, I didn't try it and it might actually work: Every sequence has a
size of ~1 KB, so 50 000 sequences would probably be around 50 MB. But
getting all this data will take hours, so I need to implement a system
that will not lose all data if the program is terminated abnormally.

Here are some simple alternatives for persisting and retrieving your
data, in the order I'd recommend them based on what you've described so far:

1. PStore standard library: Put your objects into a magical hash that's
automatically persisted to a file. Probably the quickest and easiest
solution (see the sketch after this list). See
http://www.ruby-doc.org/stdlib/libdoc/pstore/rdoc/classes/PStore.html

2. Lightweight SQL database: Maybe store sequences in SQLite as BLOBs.
Probably the best long-term solution, but will require you to work
harder to transform data to and from storage. See
http://sqlite-ruby.rubyforge.org/

3. Marshal core module: Dump objects to and from strings, and then
files. Useful if you need something more than PStore, but still want to
persist objects directly. See http://ruby-doc.org/core/classes/Marshal.html
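
For instance, the PStore route might look roughly like this (a sketch;
id and sequence are assumed to come from the download loop, and the
transaction buys crash safety at the cost of rewriting the whole file on
each commit):

require 'pstore'

store = PStore.new('sequences.pstore')
store.transaction do
  unless store[id]           # already saved by an earlier run?
    store[id] = sequence     # persisted when the transaction commits
  end
end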

Best of luck.

-igal
 
Robert Blum

Janus said:
No, I didn't try it and it might actually work: Every sequence has a
size of ~1 KB, so 50 000 sequences would probably be around 50 MB. But
getting all this data will take hours, so I need to implement a system
that will not lose all data if the program is terminated abnormally.

Try it with random data first. That way, you know the behavior under
load without paying the acquisition time.
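
For example (a throwaway sketch; the sizes are guesses based on the
numbers you posted):

require 'benchmark'

# Build 50,000 random ~1 KB "sequences" to exercise whatever storage
# scheme you settle on, without paying hours of download time.
fake = {}
seconds = Benchmark.realtime do
  50_000.times do |i|
    fake["id#{i}"] = Array.new(1024) { 'ACDEFGHIKLMNPQRSTVWY'[rand(20), 1] }.join
  end
end
puts "built #{fake.size} entries in #{seconds} seconds"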


- Robert
 
Joel VanderWerf

Igal said:
Here are some simple alternatives for persisting and retrieving your
data in the order I'd recommend them based on what you've described so far:

1. PStore standard library: Put your objects into a magical hash, that's
automatically persisted to a file. Probably the quickest and easiest
solution. See
http://www.ruby-doc.org/stdlib/libdoc/pstore/rdoc/classes/PStore.html

PStore writes the whole file at once, not incrementally. Not really what
OP is looking for, IMO.

2. Lightweight SQL database: Maybe store sequences in SQLite as BLOBs.
Probably the best long-term solution, but will require you to work
harder to transform data to and from storage. See
http://sqlite-ruby.rubyforge.org/

Not clear that would be better than files. Maybe so, if the individual
strings are short. Would be interesting to get some benchmarks on this
question.

3. Marshal core module: Dump objects to and from strings, and then
files. Useful if you need something more than PStore, but still want to
persist objects directly. See http://ruby-doc.org/core/classes/Marshal.html

PStore uses Marshal, so it's odd to say that Marshal is more than PStore.

If you're looking for a way to manage marshalled (or string or yaml...)
data in multiple files, using file paths as db keys, look no further than:

http://raa.ruby-lang.org/project/fsdb/

I think the Set/Hash + many files option is best here, though.
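
On the benchmark question above, a quick harness might look something
like this (a sketch using the sqlite3 gem; the table layout and scratch
paths are made up, and the numbers will vary a lot by filesystem and
disk):

require 'benchmark'
require 'fileutils'
require 'sqlite3'

data = 'A' * 1024   # fake ~1 KB sequence
db = SQLite3::Database.new('bench.db')
db.execute('create table if not exists seqs (id text primary key, seq blob)')
FileUtils.mkdir_p('bench_files')

Benchmark.bm(8) do |x|
  x.report('sqlite:') do
    db.transaction do
      1000.times { |i| db.execute('insert into seqs values (?, ?)', [i.to_s, data]) }
    end
  end
  x.report('files:') do
    1000.times { |i| File.open("bench_files/#{i}", 'w') { |f| f.write(data) } }
  end
end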
 
Igal Koshevoy

Joel said:
PStore writes the whole file at once, not incrementally. Not really
what OP is looking for, IMO.

It takes ~2s for my machine to read or write the 50 MB PStore file. This
isn't a big deal if the original poster (OP) doesn't mind keeping the
program running to process multiple sequences at once.

Not clear that would be better than files. Maybe so, if the individual
strings are short. Would be interesting to get some benchmarks on this
question.

Files would probably be faster, but with such a small dataset, we're
probably talking about less than a second of difference for processing
the full dataset. I like using SQLite for stuff like this because it
provides a standard, out-of-the-box solution for working with
persistence, incremental processing, structured data, queries, and the
ability to easily add more fields to a record.

PStore uses Marshal, so it's odd to say that Marshal is more than PStore.

Working directly with Marshal allows greater flexibility than using the
PStore wrapper, for example, if they decided to write a filesystem
database class. :)
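
For example, a crash-tolerant append-only log is easy with bare Marshal
(a sketch; Marshal.dump can write straight to an open file, and
Marshal.load reads back one object per call):

# During download: append one marshalled [id, sequence] pair per entry.
File.open('sequences.marshal', 'ab') do |f|
  Marshal.dump([id, sequence], f)
end

# After a crash: rebuild the set of already-processed ids.
seen = {}
File.open('sequences.marshal', 'rb') do |f|
  seen[Marshal.load(f).first] = true until f.eof?
end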
If you're looking for a way to manage marshalled (or string or
yaml...) data in multiple files, using file paths as db keys, look no
further than: http://raa.ruby-lang.org/project/fsdb/

Cool project, thanks for writing it. Sounds useful.

-igal
 
Joel VanderWerf

Igal said:
It takes ~2s for my machine to read or write the 50 MB PStore file. This
isn't a big deal if the original poster (OP) doesn't mind keeping the
program running to process multiple sequences at once.

I got the impression that Mr. O. P. was trying to avoid waiting until
the end of the download to write the file (maybe in case the network
went down halfway through).

Working directly with Marshal allows greater flexibility than using the
PStore wrapper, for example, if they decided to write a filesystem
database class. :)

Less is more ;)
 
ara.t.howard

Hello everyone,

I'm pretty new to Ruby and programming in general. Here's my problem:

I'm writing a program that will automatically download protein sequences
from a server and write them into the corresponding file. Every single
sequence has a unique id and I have to eliminate duplicates. However, as
the number of sequences might exceed 50 000, I can't simply save all
sequences in a hash (with their id as key) and then write them to disk
after downloading has finished. So my idea is to write every sequence to
the corresponding file immediately, but first I have to check if it has
been processed already.

I could save all processed id's in an array and then check if the array
includes my current id:

sequences = []
# some kind of loop magic
  unless sequences.include?(id)
    # process file
    sequences << id
  end
end

But I suspect that sequences.include?(id) would iterate over the whole
array until it finds a match. As this array might have up to 50 000
entries and I will have to do this check for every sequence, this
would probably be very inefficient.

I could also save all processed id's as keys of a hash, however I don't
have any use for a value:

sequences = {}
# some kind of loop magic
  unless sequences[id]
    # process file
    sequences[id] = true
  end
end

Would this method be more efficient? Is there a more elegant way? Also,
can Ruby handle arrays/hashes of this size?

Thanks in advance!


the simplest and most robust method is probably going to be to use
sqlite to store the id of each sequence. this will help you in the
case of a program crash and as you develop. for example:


cfp:~ > ruby a.rb


cfp:~ > sqlite3 .proteins.db 'select * from proteins'
42|ABC123


cfp:~ > ruby a.rb
a.rb:27:in `[]=': 42 (IndexError)
from /opt/local/lib/ruby/gems/1.8/gems/amalgalite-0.2.1/lib/amalgalite/database.rb:477:in `transaction'
from a.rb:24:in `[]='
from a.rb:6


cfp:~ > sqlite3 .proteins.db 'select * from proteins'
42|ABC123





cfp:~ > cat a.rb

db = ProteinDatabase.new

id, sequence = 42, 'ABC123'

db[id] = sequence


BEGIN {

  require 'rubygems'
  require 'amalgalite'

  class ProteinDatabase
    SCHEMA = <<-SQL
      create table proteins(
        id integer primary key,
        sequence blob
      );
    SQL

    def []= id, sequence
      @db.transaction {
        query = 'select id from proteins where id=$id'
        rows = @db.execute(query, '$id' => id)
        raise IndexError, id.to_s if rows and rows[0] and rows[0][0]
        blob = blob_for( sequence )
        insert = 'insert into proteins values ($id, $sequence)'
        @db.execute(insert, '$id' => id, '$sequence' => blob)
      }
    end

    private

    def initialize path = default_path
      @path = path
      setup!
    end

    def setup!
      @db = Amalgalite::Database.new @path
      unless @db.schema.tables['proteins']
        @db.execute SCHEMA
        @db = Amalgalite::Database.new @path
      end
      @sequence_column = @db.schema.tables['proteins'].columns['sequence']
    end

    def blob_for string
      Amalgalite::Blob.new(
        :string => string,
        :column => @sequence_column
      )
    end

    def default_path
      File.join( home, '.proteins.db' )
    end

    def home
      home =
        catch :home do
          ["HOME", "USERPROFILE"].each do |key|
            throw(:home, ENV[key]) if ENV[key]
          end

          if ENV["HOMEDRIVE"] and ENV["HOMEPATH"]
            throw(:home, "#{ ENV['HOMEDRIVE'] }:#{ ENV['HOMEPATH'] }")
          end

          File.expand_path("~") rescue (File::ALT_SEPARATOR ? "C:/" : "/")
        end

      File.expand_path home
    end
  end

}



a @ http://codeforpeople.com/
 
Dave Bass

Robert said:
make that "without a value".

For sets in Perl I've used hashes with an arbitrary value of 1, or
undef. In Ruby I guess that would be values of true or nil. Any better
suggestions, apart from using Set of course?
 
Robert Dober

Dave said:
For sets in Perl I've used hashes with an arbitrary value of 1, or
undef. In Ruby I guess that would be values of true or nil. Any better
suggestions, apart from using Set of course?

true might be a better choice than nil ;)

{}[42] --> nil

R.
 
Joel VanderWerf

Robert said:
For sets in Perl I've used hashes with an arbitrary value of 1, or
undef. In Ruby I guess that would be values of true or nil. Any better
suggestions, apart from using Set of course?

true might be a better choice than nil ;)

{}[42] --> nil

R.

Tsk. Don't you know that "true or nil" evaluates to "true"? :p

(Srsly, I think he meant true for membership and nil otherwise.)
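
The wrinkle with nil as the value is that a missing key also comes back
as nil, so the membership test has to be key? rather than []:

h = { 42 => nil }
h[42]        # => nil  (present, but looks absent)
h[99]        # => nil  (genuinely absent)
h.key?(42)   # => true
h.key?(99)   # => false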
 
Rick DeNatale

Robert said:
make that "without a value".

Which makes an interesting contrast between Ruby and Smalltalk.

In Smalltalk-80, Set is the more "fundamental" class; the implementation uses
hashing to ensure that duplicates are eliminated and to speed up the test of
whether or not a Set contains a given element.

Smalltalk's equivalent to Hash, the Dictionary class, is implemented (via
inheritance) as a Set of association objects, where an association
represents a key-value pair, and where two associations are equal if the
keys are equal, and the hash of the association is the hash of the key.

Ruby, on the other hand, implements Set as a Hash whose values are
unimportant, and does this by delegating to a hash rather than via
inheritance.
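
You can see the delegation in the standard library's set.rb; stripped to
its essentials, the idea is roughly (a simplified excerpt, not the full
source):

class Set
  def initialize
    @hash = {}            # every Set carries a private Hash
  end

  def add(o)
    @hash[o] = true       # membership = hash key with a dummy value
    self
  end
  alias << add

  def include?(o)
    @hash.include?(o)     # O(1) average-case lookup via the Hash
  end
end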
 
ilpuccio.febo

Hello everyone,

I'm pretty new to Ruby and programming in general. Here's my problem:

I'm writing a program that will automatically download protein sequences
from a server and write them into the corresponding file. Every single
sequence has a unique id and I have to eliminate duplicates. However, as
the number of sequences might exceed 50 000, I can't simply save all
sequences in a hash (with their id as key) and then write them to disk
after downloading has finished. So my idea is to write every sequence to
the corresponding file immediately, but first I have to check if it has
been processed already.

I could save all processed id's in an array and then check if the array
includes my current id:

sequences = []
# some kind of loop magic
  unless sequences.include?(id)
    # process file
    sequences << id
  end
end

But I suspect that sequences.include?(id) would iterate over the whole
array until it finds a match. As this array might have up to 50 000
entries and I will have to do this check for every sequence, this
would probably be very inefficient.

I could also save all processed id's as keys of a hash, however I don't
have any use for a value:

sequences = {}
# some kind of loop magic
  unless sequences[id]
    # process file
    sequences[id] = true
  end
end

Would this method be more efficient? Is there a more elegant way? Also,
can Ruby handle arrays/hashes of this size?

Thanks in advance!

BioRuby+BioSQL ?
You can fetch a sequence from servers and dump it directly into the
database. You can choose MySQL, PostgreSQL, or SQLite.

ok it's not well coded but works:
server = Bio::Fetch.new('http://www.ebi.ac.uk/cgi-bin/dbfetch')

# db (the Bio::SQL biodatabase record) is assumed to be set up earlier.
ARGV.flags.accession.split.each do |accession|
  puts accession
  if Bio::SQL.exists_accession(accession)
    puts "Entry #{accession} already exists!"
  else
    entry_str = server.fetch('embl', accession, 'raw', 'embl')

    if entry_str == "No entries found. \n"
      $stderr.puts "Error: no entry #{accession} found. #{entry_str}"
    else
      puts "Downloaded!"
      puts "Loading..."
      puts "Converting EMBL obj..."
      entry = Bio::EMBL.new(entry_str)
      puts "Converting Biosequence obj..."
      biosequence = entry.to_biosequence
      puts "Saving Biosequence into Bio::SQL::Sequence database"
      result = Bio::SQL::Sequence.new(:biosequence => biosequence,
        :biodatabase_id => db.id) unless Bio::SQL.exists_accession(biosequence.primary_accession)
      puts entry.entry_id
      if result.nil?
        pp "The sequence is already present in the biosql database"
      else
        pp "Stored."
      end
    end # not found on web
  end # bioentry exists
end # list accession

PS: I need to write docs about BioSQL and Ruby, sorry my fault.
 
ilpuccio.febo

Hello everyone,

I'm pretty new to Ruby and programming in general. Here's my problem:

I'm writing a program that will automatically download protein sequences
from a server and write them into the corresponding file. Every single
sequence has a unique id and I have to eliminate duplicates. However, as
the number of sequences might exceed 50 000, I can't simply save all
sequences in a hash (with their id as key) and then write them to disk
after downloading has finished. So my idea is to write every sequence to
the corresponding file immediately, but first I have to check if it has
been processed already.

You can use BioRuby+BioSQL, fetching data from a remote server and
storing it directly into the db.
 
