Hash array with variable size?

Jürgen Exner

[Subject: Hash array with variable size?]

So what is it? A hash or an array?
And yes, Perl hashes, like arrays, are of variable size. Where is the
problem with that?
I have a flatfile containing many rows (maybe up to 10 million) like the
following lines, with each cell separated by the delimiter "\t",

2794438 dnaA-1 chromosomal replication initiator protein DnaA 2794971 dnaN
DNA polymerase III subunit beta
2794438 dnaA-1 chromosomal replication initiator protein DnaA 2794972 gyrB
DNA gyrase subunit B
2794438 dnaA-1 chromosomal replication initiator protein DnaA 2794973 gyrA
DNA gyrase subunit A

the first cell of each row is the one to look up, i.e. 2794438 in the
above example, and to print something like:

dnaA-1 dnaN
dnaA-1 gyrB
dnaA-1 gyrA
....

So, it is what is commonly called a key, is it?

How can I make this lookup process more efficient?

More efficient than what? Show us your code, then we can try to optimize
it.
As far as I can tell from your description it is a simple linear scan
through the file and the most time spent should be in reading the file
line by line:

while (my $line = <$F>) {
    my ($key, @others) = split /\t/, $line;
    if ($key eq $wanted) {
        print_from_data_whatever_you_want(@others);
    }
}
I don't know whether it is due to array size difference for each key,
i.e. 2794438 here, but it takes a long time (actually it never finishes)
for a query file of about 100,000 rows.

What do you mean by "array size difference for each key"?

10 million lines, each maybe 100 characters (based on your sample data
above), means at least 1GB of data in theory. In reality it is probably
several times that amount. Add (and this is just a guess) poorly
designed or poorly programmed code in which data structures are copied
repeatedly, and yes, I can see how this could easily lead to swapping
and a thrashing system, even with several GB of RAM.
Any better implementation suggestions are most welcome.

But you didn't show us any implementation. How could we possibly suggest
improvements to something that we have never seen?

In general:
Use a database. Databases are designed to handle large amounts of data
and to provide fast queries (see the sketch below).
Use a disk-based algorithm instead of a RAM-based algorithm. Optimize
your code for RAM size instead of for programming convenience or speed
of operations.
You may also benefit from reviewing old algorithms that were developed
decades ago specifically for problems where RAM size was the restricting
factor.
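
For the database route, a rough sketch of the one-time import using
DBD::SQLite (the file, database, and column names are all made up):

use strict;
use warnings;
use DBI;

# One-time import of the flat file into SQLite. Sketch only:
# 'genes.db', 'data.txt' and the column names are assumptions.
my $dbh = DBI->connect('dbi:SQLite:dbname=genes.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do('CREATE TABLE IF NOT EXISTS relations
            (id1 INTEGER, name1 TEXT, desc1 TEXT,
             id2 INTEGER, name2 TEXT, desc2 TEXT)');
my $ins = $dbh->prepare('INSERT INTO relations VALUES (?,?,?,?,?,?)');
open my $F, '<', 'data.txt' or die "data.txt: $!";
while (my $line = <$F>) {
    chomp $line;
    $ins->execute(split /\t/, $line, 6);   # assumes six tab-separated cells
}
close $F;
$dbh->do('CREATE INDEX IF NOT EXISTS idx_id1 ON relations (id1)');
$dbh->commit;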

jue
 
ccc31807

2794438 dnaA-1  chromosomal replication initiator protein DnaA  2794971 dnaN
DNA polymerase III subunit beta
2794438 dnaA-1  chromosomal replication initiator protein DnaA  2794972 gyrB
DNA gyrase subunit B
2794438 dnaA-1  chromosomal replication initiator protein DnaA  2794973 gyrA
DNA gyrase subunit A

the first cell of each row is the one to look up, i.e. 2794438 in the
above example, and to print something like:

dnaA-1 dnaN
dnaA-1 gyrB
dnaA-1 gyrA

You seem to have two tasks: (1) process and normalize the data, and
(2) analyze the results.

(1) It strikes me that you should use line-oriented logic. You want
the first element (a seven-digit number), the second element, and the
element between a seven-digit number and the string 'DNA'. Deconstruct
the string using an RE, and use a hash of hashes, like this:

while (<DATA>) {
    my ($key1, $key2, $value) = ... deconstruction RE ...;
    $datahash{$key1}{$key2} = $value;
}
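
For instance, a concrete version using split instead of an RE (a sketch
only, assuming each line has exactly the six tab-separated cells shown
in the sample data):

my %datahash;
while (my $line = <DATA>) {
    chomp $line;
    # cells: id1, name1, desc1, id2, name2, desc2
    my ($id1, $name1, undef, undef, $name2) = split /\t/, $line;
    $datahash{$id1}{$name1} = $name2;
}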

(2) Having your data in a hash of hashes will greatly ease your
analysis. Use Data::Dumper to print your data structure to see what it
looks like.

If the size of the input file is a problem, no programming solution
will solve it. You will need either to reduce the amount of your input
data, or increase the size of your memory or storage or both.

CC.
 
ela

I have a flatfile containing many rows (maybe up to 10 million) like the
following lines, with each cell separated by the delimiter "\t",

2794438 dnaA-1 chromosomal replication initiator protein DnaA 2794971 dnaN
DNA polymerase III subunit beta
2794438 dnaA-1 chromosomal replication initiator protein DnaA 2794972 gyrB
DNA gyrase subunit B
2794438 dnaA-1 chromosomal replication initiator protein DnaA 2794973 gyrA
DNA gyrase subunit A

the first cell of each row is the one to look up, i.e. 2794438 in the
above example, and to print something like:

dnaA-1 dnaN
dnaA-1 gyrB
dnaA-1 gyrA
.....

How can I make this lookup process more efficient? I don't know whether
it is due to the array size difference for each key (i.e. 2794438 here),
but it takes a long time (actually it never finishes) for a query file
of about 100,000 rows. Any better implementation suggestions are most
welcome.
 
Jürgen Exner

[Someone screwed up the quotation levels badly]
ela said:
You seem to have two tasks: (1) process and normalize the data, and
(2) analyze the results.

There is no normalization needed.
(1) It strikes me that you should use line-oriented logic. You want
the first element (a seven-digit number), the second element, and the
element between a seven-digit number and the string 'DNA'. Deconstruct
the string using an RE, and use a hash of hashes, like this:

while (<DATA>) {
    my ($key1, $key2, $value) = ... deconstruction RE ...;
    $datahash{$key1}{$key2} = $value;
}

(2) Having your data in a hash of hashes will greatly ease your
analysis. Use Data::Dumper to print your data structure to see what it
looks like.

If the size of the input file is a problem, no programming solution
will solve it. You will need either to reduce the amount of your input
data, or increase the size of your memory or storage or both.

That is only true if the data doesn't fit on the hard drive. Otherwise
there are well-established algorithms for handling data sets that are
too large to be loaded into RAM completely. This was a major research
and development area some decades ago, when RAM was expensive and "640K
ought to be enough for anybody" (falsely attributed to Bill Gates). And
it still is today, e.g. in the context of database systems.
Thanks to both Jürgen Exner & CC. In fact my description was poor and CC
seems to be closer to what I have to do. Please let me redefine my problem:

Given:
1) Query data (about 100,000 rows) in the following format:
2794438
20156
2794972
1234

I read this as "You have 100,000 individual, independent queries"
2) Database data (about 10 million rows) in the following format (delimited
by tab):
2794438 dnaA-1 chromosomal replication initiator protein DnaA 2794971 dnaN
DNA polymerase III subunit beta
2794438 dnaA-1 chromosomal replication initiator protein DnaA 2794972 gyrB
DNA gyrase subunit B
2794438 dnaA-1 chromosomal replication initiator protein DnaA 2794973 gyrA
DNA gyrase subunit A

Assuming a single scan through the 10M rows takes only 5 seconds (just
loading the 1GB of data from the HD probably takes longer), then for
100,000 queries we are already talking about 500,000 seconds, or about
6 days of runtime.
In other words: a naive double loop doesn't work.
So I'm using 2 for loops to retrieve all the relations (Jürgen Exner's
advice on database usage is good but then I have to spend another month to
learn :-( ),

In a relational DB it's two tables with a simple join on the keys.
Exactly what databases are best at.
I readily notice that if I can do something like what CC suggested, e.g.

if exists a_hash[2794438][20156]

then it is much better (faster) than traversing the array for 2794438
(e.g. by foreach) and checking whether the 2nd key 20156 exists. It
seems that during the hash-building process, I should use a hash of
hashes, e.g.

my %hoh;

$hoh{$key1}{$key2} = 1;

then later I can use the two for loops and the check works fine.

As I mentioned in my first reply: do the calculations first!

Are you planning on putting 10M key/value pairs in RAM, either in a
single hash or in a HoH? Then it will be several GB in size; how large
depends upon the minimum size of the scalars used for keys and values.
Do you have that much RAM? If you don't and the OS begins swapping, then
you are toast.

Other ideas (how well they work depends on your distribution of data):

Split your big file with those 10M lines into smaller files, one for
each key, i.e. all lines with a key of "2794438" go into the same file.
Then the lookup for each key becomes trivial. If the number of files
becomes too large, then devise a system to store them in sub-directories.
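
A rough sketch of that split (the file and directory names are made up,
and the bykey/ directory must already exist; keeping one open handle per
key only works while the number of distinct keys stays below the OS
limit on open files):

my %fh;
open my $D, '<', 'data.txt' or die "data.txt: $!";
while (my $line = <$D>) {
    my ($key) = split /\t/, $line, 2;   # first cell is the key
    unless ($fh{$key}) {
        open $fh{$key}, '>>', "bykey/$key.txt"
            or die "bykey/$key.txt: $!";
    }
    print { $fh{$key} } $line;
}
close $D;
close $_ for values %fh;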

Sort your data before you use it. Then you can do a one-time linear pass
over both sets in parallel to gather all matches. Sorting the 100k query
table is definitely not a problem. As for sorting the 10M table, you will
have to see whether standard Unix sort can handle it. If not, then you can
always cut it into smaller chunks and merge those in a second pass.
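
A sketch of that parallel pass (it assumes queries.sorted holds one key
per line and data.sorted is the big file, both sorted as plain text so
that string comparison matches the sort order):

open my $Q, '<', 'queries.sorted' or die "queries.sorted: $!";
open my $D, '<', 'data.sorted'    or die "data.sorted: $!";
my $q = <$Q>;
my $d = <$D>;
while (defined $q and defined $d) {
    chomp(my $qkey = $q);
    my ($dkey) = split /\t/, $d, 2;
    if    ($qkey lt $dkey) { $q = <$Q>; }           # query key too small
    elsif ($qkey gt $dkey) { $d = <$D>; }           # data key too small
    else                   { print $d; $d = <$D>; } # match, keep the query
}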

Reverse the logic of your algorithm. Currently you are doing

Load Datafile into %Database
foreach $query in <QueryFile>
    check if $query is found in %Database

Reverse this into

Load queryfile into %queryhash   # with only 100k items this
                                 # will fit nicely into RAM
foreach $dataitem in <DataFile>
    check if a query exists for this data item

This way you never load more than 1 line at a time from your mega-file
into RAM.
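
In real code, roughly (a sketch; the file names are made up):

my %queryhash;
open my $Q, '<', 'queryfile.txt' or die "queryfile.txt: $!";
while (my $id = <$Q>) {
    chomp $id;
    $queryhash{$id} = 1;        # 100k keys fit easily into RAM
}
close $Q;

open my $D, '<', 'datafile.txt' or die "datafile.txt: $!";
while (my $line = <$D>) {
    my ($id) = split /\t/, $line, 2;        # first cell is the key
    print $line if exists $queryhash{$id};  # one line in RAM at a time
}
close $D;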

jue
 
ccc31807

Thanks to both Jürgen Exner & CC. In fact my description was poor and CC
seems to be closer to what I have to do. Please let me redefine my problem:

Given:
1) Query data (about 100,000 rows) in the following format:
2) Database data (about 10 million rows) in the following format (delimited
by tab):

If you have your data within a database (10M rows) and you want to
query that data, all you really need is SQL. The Perl DBI will return
a hash ref in exactly the format you want. If you want a key and
subkey it will return a reference to a hash with the key, subkey, and
value.
So I'm using 2 for loops to retrieve all the relations (Jürgen Exner's
advice on database usage is good but then I have to spend another month to
learn :-( ),

SELECT COL1, COL2, COLN FROM TABLE WHERE COL1 = 'value to query by';

What I probably would do is create a new table with your query data in
it, and then do a left/right join on the new table. That is what
relational databases do -- you need to learn to use the tool rather
than reinvent the wheel. Perl does some things very nicely, but so do
databases. Use the tools you have properly, the hammer to drive nails
and the wrench to fasten bolts, not the hammer to fasten bolts and the
wrench to drive nails.
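
For example, roughly (a sketch only; it assumes the flat file has been
loaded into a table 'relations' and the 100k query IDs into a table
'queries', both names invented):

use DBI;
my $dbh = DBI->connect('dbi:SQLite:dbname=genes.db', '', '',
                       { RaiseError => 1 });
# join the query table against the data table; the database
# does the matching for you
my $rows = $dbh->selectall_arrayref(
    'SELECT d.name1, d.name2
       FROM relations d JOIN queries q ON d.id1 = q.id');
print "$_->[0]\t$_->[1]\n" for @$rows;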

You are much, much better off taking a month to learn a database tool
that you can then use forever, and besides, if it takes more than a day
to learn how to create a table and do a join, there's something wrong.

CC.
 
ela

2794438 dnaA-1 chromosomal replication initiator protein DnaA 2794971 dnaN
DNA polymerase III subunit beta
2794438 dnaA-1 chromosomal replication initiator protein DnaA 2794972 gyrB
DNA gyrase subunit B
2794438 dnaA-1 chromosomal replication initiator protein DnaA 2794973 gyrA
DNA gyrase subunit A

the first cell of each row is the one to look up, i.e. 2794438 in the
above example, and to print something like:

dnaA-1 dnaN
dnaA-1 gyrB
dnaA-1 gyrA

You seem to have two tasks: (1) process and normalize the data, and
(2) analyze the results.

(1) It strikes me that you should use line-oriented logic. You want
the first element (a seven-digit number), the second element, and the
element between a seven-digit number and the string 'DNA'. Deconstruct
the string using an RE, and use a hash of hashes, like this:

while (<DATA>) {
    my ($key1, $key2, $value) = ... deconstruction RE ...;
    $datahash{$key1}{$key2} = $value;
}

(2) Having your data in a hash of hashes will greatly ease your
analysis. Use Data::Dumper to print your data structure to see what it
looks like.

If the size of the input file is a problem, no programming solution
will solve it. You will need either to reduce the amount of your input
data, or increase the size of your memory or storage or both.

CC.


Thanks to both Jürgen Exner & CC. In fact my description was poor and CC
seems to be closer to what I have to do. Please let me redefine my problem:

Given:
1) Query data (about 100,000 rows) in the following format:
2794438
20156
2794972
1234
....

2) Database data (about 10 million rows) in the following format (delimited
by tab):
2794438 dnaA-1 chromosomal replication initiator protein DnaA 2794971 dnaN
DNA polymerase III subunit beta
2794438 dnaA-1 chromosomal replication initiator protein DnaA 2794972 gyrB
DNA gyrase subunit B
2794438 dnaA-1 chromosomal replication initiator protein DnaA 2794973 gyrA
DNA gyrase subunit A

So I'm using 2 for loops to retrieve all the relations (Jürgen Exner's
advice on database usage is good but then I have to spend another month to
learn :-( ),

I readily notice that if I can do something like what CC suggested, e.g.

if exists a_hash[2794438][20156]

then it is much better (faster) than traversing the array for 2794438
(e.g. by foreach) and checking whether the 2nd key 20156 exists. It
seems that during the hash-building process, I should use a hash of
hashes, e.g.

my %hoh;

$hoh{$key1}{$key2} = 1;

then later I can use the two for loops and the check works fine.
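
i.e. a check along these lines (a sketch; note that testing the outer
key first keeps exists() from autovivifying $hoh{$key1} when that key
is absent):

if (exists $hoh{$key1} and exists $hoh{$key1}{$key2}) {
    print "$key1\t$key2\n";
}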

Thanks again, friends from all over the world!
 
