Jürgen Exner
[Subject: Hash array with variable size?]
So what is it? A hash or an array?
And yes, Perl hashes as well as arrays are of variable size. Where is the
problem with that?
> I have a flatfile containing many rows (maybe up to 10 million) like the
> following lines, with each cell separated by the delimiter "\t":
>
> 2794438 dnaA-1 chromosomal replication initiator protein DnaA 2794971 dnaN DNA polymerase III subunit beta
> 2794438 dnaA-1 chromosomal replication initiator protein DnaA 2794972 gyrB DNA gyrase subunit B
> 2794438 dnaA-1 chromosomal replication initiator protein DnaA 2794973 gyrA DNA gyrase subunit A
>
> The first cell of each row is to be looked up, i.e. 2794438 in the above
> example, and to print something like:
>
> dnaA-1 dnaN
> dnaA-1 gyrB
> dnaA-1 gyrA
> ....

So, it is what is commonly called a key, is it?

> how to make this lookup process more efficient?
More efficient than what? Show us your code, then we can try to optimize
it.
As far as I can tell from your description it is a simple linear scan
through the file, and most of the time should be spent reading the file
line by line:

    while (my $line = <$F>) {
        chomp $line;                                # strip the trailing newline
        my ($key, @others) = split /\t/, $line;
        if ($key eq $wanted) {
            print_from_data_whatever_you_want(@others);
        }
    }
> I don't know whether it is due to array size difference for each key,
> i.e. 2794438 here, it takes a long time (actually never finish) for a
> query file of about 100,000 rows.

What do you mean by "array size difference for each key"?
10 million lines of maybe 100 characters each (based on your sample data
above) means at least 1 GB of data in theory, and in memory probably
several times that amount. Add (but this is just a guess) poorly designed
or poorly programmed code that copies data structures repeatedly, and
yes, I can see how this could easily lead to swapping and a thrashing
system, even with several GB of RAM.
> Any better implementation suggestions are highly welcomed.
But you didn't show us any implementation. How could we possibly suggest
improvements to something that we have never seen?
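Still, if the slow part is matching a query file of about 100,000 keys
against the 10-million-line file, one standard low-RAM pattern is to hold
only the query keys in a hash and stream the big file once. This is just
a sketch, not your code: the file names "query.txt" and "data.txt", the
six-column layout, and the output format are my assumptions based on your
sample data.

    use strict;
    use warnings;

    # Load only the ~100,000 query keys; small compared to the data file.
    my %wanted;
    open my $Q, '<', 'query.txt' or die "query.txt: $!";
    while (my $line = <$Q>) {
        chomp $line;
        my ($key) = split /\t/, $line;
        $wanted{$key} = 1;
    }
    close $Q;

    # One sequential pass over the big file; no row is kept in memory.
    open my $F, '<', 'data.txt' or die "data.txt: $!";
    while (my $line = <$F>) {
        chomp $line;
        my ($key, $name1, $name2) = (split /\t/, $line)[0, 1, 4];
        print "$name1\t$name2\n" if exists $wanted{$key};
    }
    close $F;

This runs in time proportional to the number of rows and needs RAM only
for the 100,000 keys, not for the 10 million data rows.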
In general:
Use a database. Databases are designed to handle large amounts of data
and to provide fast queries.
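For example, with DBI and DBD::SQLite you can import the flatfile once,
index the key column, and then every lookup is an indexed query instead
of a scan over 10 million lines. Again only a sketch; the database,
table, and column names are made up:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=genes.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });

    # One-time import; columns follow the six-cell sample layout.
    $dbh->do('CREATE TABLE IF NOT EXISTS genes
              (id1 INTEGER, name1 TEXT, desc1 TEXT,
               id2 INTEGER, name2 TEXT, desc2 TEXT)');
    my $ins = $dbh->prepare('INSERT INTO genes VALUES (?,?,?,?,?,?)');
    open my $F, '<', 'data.txt' or die "data.txt: $!";
    while (my $line = <$F>) {
        chomp $line;
        $ins->execute(split /\t/, $line);
    }
    close $F;
    $dbh->do('CREATE INDEX IF NOT EXISTS genes_id1 ON genes (id1)');
    $dbh->commit;

    # Indexed lookup instead of a full file scan.
    my $rows = $dbh->selectall_arrayref(
        'SELECT name1, name2 FROM genes WHERE id1 = ?', undef, 2794438);
    print "$_->[0]\t$_->[1]\n" for @$rows;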
Use a disk-based algorithm instead of a RAM-based one. Optimize your
code for RAM footprint instead of for programming convenience or speed
of operations.
You may also benefit from reviewing old algorithms that were developed
decades ago specifically for problems where RAM size was the limiting
factor.
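The classic example is a sort/merge join: sort both files on the key
column (the system sort does this out of core, e.g.
"sort -k1,1 data.txt > data.sorted"), after which a single sequential
pass needs just one line from each file in memory at a time. A sketch,
assuming the query file carries the key in its first column and the data
file has the six-column layout above:

    use strict;
    use warnings;

    # Both inputs must already be sorted lexically on the key column.
    open my $D, '<', 'data.sorted'  or die "data.sorted: $!";
    open my $Q, '<', 'query.sorted' or die "query.sorted: $!";

    my $q = <$Q>; chomp $q if defined $q;
    my $d = <$D>; chomp $d if defined $d;
    while (defined $q and defined $d) {
        my ($qkey) = split /\t/, $q;
        my ($dkey, $name1, $name2) = (split /\t/, $d)[0, 1, 4];
        if ($dkey lt $qkey) {
            $d = <$D>; chomp $d if defined $d;   # data is behind: advance data
        }
        elsif ($dkey gt $qkey) {
            $q = <$Q>; chomp $q if defined $q;   # query is behind: advance query
        }
        else {
            print "$name1\t$name2\n";            # match: emit and advance data
            $d = <$D>; chomp $d if defined $d;
        }
    }

The keys are compared as strings (lt/gt) so that the comparison agrees
with sort's lexical order; with a numeric sort (sort -n) you would use
numeric comparisons instead.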
jue