simple indexing in Perl?

ela

I'm new to database programming and previously just learned to use loops to
look up and enrich information, using the code below. However, when the
tables are large, I find this process is very slow. Then somebody told me I
can build a database for one of the files in real time, so there is no need
to read the file from the beginning to the end again and again. However, Perl
DBI has a lot of sophisticated functions, and in fact my tables are only
large but nothing special, linked by an ID. Is there any simple way to
achieve the same purpose? I just wish the ID could be indexed so that
every time I access a record it goes through memory and not through I/O...


#!/usr/bin/perl

my ($listfile, $format, $accfile, $infofile) = @ARGV;
print '($listfile, $accfile, $infofile)'; <STDIN>;

print "Working on $listfile...\n";
$outname = $listfile . "_" . $infofile . ".xls";

open (OFP, ">$outname");

open(FP, $listfile);
$line = <FP>;
chomp $line;

if ($format ne "") {
    @fields = split(/\t/, $line);
    for ($i=0; $i<@fields; $i++) {
        ############## check fields ###############################
        if ( $fields[$i] =~ /accession/) {
            $acci = $i;
        }
    }
}

print OFP "$line\tgene info\n";

$nl = '\n';

while (<FP>) {
    $line = $_;
    if ($line eq "\n") {
        print OFP $line;
        next;
    }
    chomp $line;

    if ($format eq "") {
        @cells = split (/:/, $line);
        $tag = $cells[0];
    } else {
        @cells = split (/\t/, $line);
        $tag = $cells[$acci];
    }

    open(AFP, $accfile);

    while (<AFP>) {
        @cells = split (/\t/, $_);
        if ($cells[5] =~ /$tag/) {
            $des = $cells[1];
            last;
        }
    }
    close AFP;

    if ($found == 0) {
        print OFP "$line\tNo gene info available\n";
    }
}
close FP;
close FP;
 
Jens Thoms Toerring

ela said:
I'm new to database programming and just previously learnt to use loops to
look up and enrich information using the following codes. However, when the
tables are large,

Which tables? Do you mean 'files'?
I find this process is very slow. Then, somebody told me I
can build a database for one of the file real time and so no need to read
the file from the beginning till the end again and again. However, perl DBI
has a lot of sophisticated functions there and in fact my tables are only
large but nothing special, linked by an ID. Is there any simple way to
achieve the same purpose? I just wish the ID can be indexed and then
everytime I access the record through memory and not through I/O...
#!/usr/bin/perl

Please, please use

use strict;
use warnings;

It will tell you about a lot of potential problems.
my ($listfile, $format, $accfile, $infofile) = @ARGV;
print '($listfile, $accfile, $infofile)'; <STDIN>;

What's that at the end of the line good for?
print "Working on $listfile...\n";
$outname = $listfile . "_" . $infofile . ".xls";
open (OFP, ">$outname");

Better use the three-argument form of open and normal (lexical)
variables for file handles; this isn't Perl 4 anymore...

open my $ofp, '>', $outname
    or die "Can't open $outname for writing\n";

Also checking that opening a file succeeded shouldn't be left
out without very good reasons...
open(FP, $listfile);
$line = <FP>;
chomp $line;
if ($format ne "") {
@fields = split(/\t/, $line);
for ($i=0; $i<@fields; $i++) {
############## check fields ###############################
if ( $fields[$i] =~ /accession/) {

Are you aware that this will also match e.g. 'disaccession_123'?
$acci = $i;
}
}
}
print OFP "$line\tgene info\n";
$nl = '\n';
while (<FP>) {
$line = $_;

Why don't you read directly into '$line' instead of making an
additional copy?
if ($line eq "\n") {
print OFP $line;
next;
}
chomp $line;
if ($format eq "") {
@cells = split (/:/, $line);
$tag = $cells[0];
} else {
@cells = split (/\t/, $line);
$tag = $cells[$acci];
}
open(AFP, $accfile);
while (<AFP>) {
@cells = split (/\t/, $_);
if ($cells[5] =~ /$tag/) {
$des = $cells[1];
last;
}
}
close AFP;
if ($found == 0) {
print OFP "$line\tNo gene info available\n";
}

Huh? '$found' is nowhere else used in your program. With
'use warnings' you would have gotten a warning that you
use the value of an uninitialized variable...
}
close FP;

Probably the most time-consuming part of your program is that for
each line of the file with the name '$listfile' you read in at
least a certain portion of '$accfile', again and again. To get
around that you don't need a database, you just have to read it
in only once and store the relevant information, e.g. in a hash.
If you do something like

open my $afp, '<', $accfile
    or die "Can't open $accfile for reading\n";

my %ahash;
while ( my $line = <$afp> ) {
    my @cells = split /\t/, $line;
    $ahash{ $cells[ 5 ] } = $cells[ 1 ];
}
close $afp;

somewhere at the beginning, then you would have all the information
you use from the '$accfile' file in the %ahash hash and there
would be no need to read that file again and again:

while ( my $line = <$fp> ) {
    if ( $line eq "\n" ) {
        print $ofp "\n";
        next;
    }
    chomp $line;

    if ( $format eq "" ) {
        @cells = split /:/, $line;
        $tag = $cells[ 0 ];
    } else {
        @cells = split /\t/, $line;
        $tag = $cells[ $acci ];
    }

    $des = $ahash{ $tag } if exists $ahash{ $tag };
}

close $fp;

Putting things in a database won't do too much good here
since, unless you have an in-memory database, the database
will also put the information on the disk and has to
retrieve it from there (but for sure a lot faster than
re-reading a file for a bit of information lots of times;-)
The only case I can think of where using a database may be
beneficial here is when the '$accfile' is extremely large
and the '%ahash' would use up all the memory you have. In
that case putting things in a database (on disk then, of
course) that allows relatively fast lookup of the value for
a key (i.e. what you have in the '$tag' variable) might be
a reasonable alternative.
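For illustration, here is a minimal sketch of that alternative,
assuming the DB_File module (Berkeley DB bindings) is installed;
the database file name 'acc.db' and the sample values are made up:

use strict;
use warnings;
use Fcntl;
use DB_File;

my $accfile = 'acc.tsv';        # stands in for your real $accfile
my $tag     = 'NM_000546';      # made-up accession, stands in for $tag

# tie the hash to a DBM file on disk; lookups then go through
# Berkeley DB's own index instead of an in-memory Perl hash
tie my %ahash, 'DB_File', 'acc.db', O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "Can't tie acc.db: $!\n";

# build the on-disk index once
open my $afp, '<', $accfile
    or die "Can't open $accfile for reading\n";
while ( my $line = <$afp> ) {
    chomp $line;
    my @cells = split /\t/, $line;
    $ahash{ $cells[ 5 ] } = $cells[ 1 ];
}
close $afp;

# later (even in a different run of the program) a single key
# can be fetched without re-reading the whole file
print "$ahash{ $tag }\n" if exists $ahash{ $tag };

untie %ahash;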
Regards, Jens
 
wolf

ela said:
I'm new to database programming and just previously learnt to use loops to
look up and enrich information using the following codes. However, when the
tables are large, I find this process is very slow. Then, somebody told me I
can build a database for one of the file real time and so no need to read
the file from the beginning till the end again and again. However, perl DBI
has a lot of sophisticated functions there and in fact my tables are only
large but nothing special, linked by an ID. Is there any simple way to
achieve the same purpose? I just wish the ID can be indexed and then
everytime I access the record through memory and not through I/O...


#!/usr/bin/perl

my ($listfile, $format, $accfile, $infofile) = @ARGV;
print '($listfile, $accfile, $infofile)'; <STDIN>;

print "Working on $listfile...\n";
$outname = $listfile . "_" . $infofile . ".xls";

open (OFP, ">$outname");

open(FP, $listfile);
$line = <FP>;
chomp $line;

if ($format ne "") {
@fields = split(/\t/, $line);
for ($i=0; $i<@fields; $i++) {
############## check fields ###############################
if ( $fields[$i] =~ /accession/) {
$acci = $i;
}
}
}

print OFP "$line\tgene info\n";

$nl = '\n';

while (<FP>) {
$line = $_;
if ($line eq "\n") {
print OFP $line;
next;
}
chomp $line;

if ($format eq "") {
@cells = split (/:/, $line);
$tag = $cells[0];
} else {
@cells = split (/\t/, $line);
$tag = $cells[$acci];
}

open(AFP, $accfile);

while (<AFP>) {
@cells = split (/\t/, $_);
if ($cells[5] =~ /$tag/) {
$des = $cells[1];
last;
}
}
close AFP;

if ($found == 0) {
print OFP "$line\tNo gene info available\n";
}
}
close FP;

Hi ela,

without going too deeply into your code, let's just say that you should
always start your Perl scripts with

#!/usr/bin/perl
use warnings;
use strict;

and if you can't make it run with these restrictions, there is something
seriously flaky about the approach you are pursuing.

Apart from the Perl aspect, there are some serious information issues
you need to address.

From what I can gather of your description, you are reading in a file
that contains some kind of gene information, and you want to index that
information so that retrieval is much faster, rather than iterating
SEQUENTIALLY over the whole file (or series of files) every time you
need an answer.

Is my assumption thus far right?


But to assess that, some real-life info on what you are actually
trying to do is needed :p

How big are the files - that is, how big will that index be?

What is the actual index going to be, etc.?

Only after that part becomes clear is a solution possible. And you need
to communicate that.


cheers, wolf
 
Jürgen Exner

ela said:
I'm new to database programming and just previously learnt to use loops to
look up and enrich information using the following codes. However, when the
tables are large, I find this process is very slow. Then, somebody told me I
can build a database for one of the file real time and so no need to read
the file from the beginning till the end again and again.

What I gathered from your code without going into details is that for
each line of FP you are opening, reading through, and closing AFP.

I/O operations are by far the slowest operations and there is a trivial
solution that will probably speed up your program dramatically: instead
of reading AFP again and again and again just read it into an array once
at the beginning of your program and then loop over that array instead
of over the file.
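A minimal sketch of that idea, reusing the variable names from the
original code ($accfile and $tag are assumed to be set as before):

# read $accfile into memory once, near the top of the program
open my $afp, '<', $accfile or die "Can't open $accfile: $!\n";
my @acc_lines = <$afp>;
close $afp;

# then, inside the main loop, scan the in-memory copy instead
my $des;
for my $acc_line (@acc_lines) {
    my @cells = split /\t/, $acc_line;
    if ( $cells[5] =~ /$tag/ ) {
        $des = $cells[1];
        last;
    }
}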

Only if AFP is too large for that (several GB) might you need to
look for a better algorithmic solution. That requires knowledge and
experience, and a database may or may not help, depending upon what you
are actually trying to achieve.

jue
 
ccc31807

I'm new to database programming and just previously learnt to use loops to
look up and enrich information using the following codes. However, when the
tables are large, I find this process is very slow. Then, somebody told me I
can build a database for one of the file real time and so no need to read
the file from the beginning till the end again and again. However, perl DBI
has a lot of sophisticated functions there and in fact my tables are only
large but nothing special, linked by an ID. Is there any simple way to
achieve the same purpose? I just wish the ID can be indexed and then
everytime I access the record through memory and not through I/O...

You have input, which you want to process and turn into output.

Your input consists of data contained in some kind of file. This is
exactly the kind of task that Perl excels at.

You have two choices: (1) you can use a database to store and query
your data, or (2) you can use your computer's memory to store and
query your data.

If you have a large amount of permanent data that you need to add to,
delete from, and change, your best strategy is to use a database. Read
your data file into your database. Most databases have external
commands (i.e., not SQL) for doing that, so it should be
straightforward and easy -- note that you do not use Perl for this,
and probably shouldn't.

If you have a small to moderate amount of data, whether permanent or
temporary, that you don't need to add to, delete from, or modify, your
best strategy is to use your computer's memory to store and query your
data. Simply open the file, read each line, destructure each line into
a key and value, and stuff it into a hash.

For example, suppose your data looks like this:
12345,George,Washington,First
23456,John,Adams,Second
34567,Thomas,Jefferson,Third
45678,James,Madison,Fourth

You can do this:
my %pres;
open PRES, '<', 'data.csv' or die "$!";
while (<PRES>) {
    chomp;
    my ($id, $first, $last, $place) = split /,/;
    $pres{$place} = "$id, $first, $last";
}
close PRES;

If you need a multilevel data structure, see the documentation, starting
maybe with lists of lists; a small sketch follows below.
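For instance, a minimal hash-of-arrays sketch using the same made-up
data.csv as above; the point is to store a reference to the row
(assigning an array to a hash value directly would only store its
element count):

my %pres;
open my $fh, '<', 'data.csv' or die "$!";
while (my $line = <$fh>) {
    chomp $line;
    my @fields = split /,/, $line;    # ($id, $first, $last, $place)
    $pres{$fields[3]} = \@fields;     # store a reference to the whole row
}
close $fh;

# dereference later to get the row back as a list
my @row = @{ $pres{'First'} };
print "$row[1] $row[2]\n";            # prints "George Washington"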

CC.
 
Jens Thoms Toerring

Pausing the program until something is typed on STDIN.

Oh, I see. I was a bit confused about why one would wait for input
in that situation, when one is complaining that the program
is taking so long ;-)
Regards, Jens
 
Xho Jingleheimerschmidt

ela said:
I'm new to database programming and just previously learnt to use loops to
look up and enrich information using the following codes. However, when the
tables are large,

How large?
I find this process is very slow. Then, somebody told me I
can build a database for one of the file real time and so no need to read
the file from the beginning till the end again and again.

Not sure what you mean by "real time" here.
However, perl DBI
has a lot of sophisticated functions there and in fact my tables are only
large but nothing special, linked by an ID.

Data is data. It doesn't need to be "something special" in order to be
put into a database. Databases themselves are nothing special, just
specialized tools to do a specialized job.
Is there any simple way to
achieve the same purpose? I just wish the ID can be indexed and then
everytime I access the record through memory and not through I/O...

You can read the data into a hash, depending on just how large it is,
and exactly how it needs to be matched.
open (OFP, ">$outname");

open(FP, $listfile);

You should check that your open commands succeed.
print OFP "$line\tgene info\n";

$nl = '\n';

This is never used, and I don't see what one would use it for.
while (<FP>) { ....


open(AFP, $accfile);

Again, you should check that the open succeeds.
while (<AFP>) {
@cells = split (/\t/, $_);
if ($cells[5] =~ /$tag/) {
$des = $cells[1];
last;
}
}
close AFP;

This would actually be quite hard to optimize if the match really needs
to be as written, $cells[5] =~ /$tag/. Are you sure it wouldn't still
be correct (or even be more correct) to test $cells[5] eq $tag, or at
least $cells[5] =~ /^\Q$tag/ ?
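For example, with made-up values, the three tests behave quite differently:

my $cell = 'NM_000546.6';    # hypothetical accession column value
my $tag  = 'NM_000546';

print "regex match\n"  if $cell =~ /$tag/;      # true; matches the tag
                                                #   anywhere inside the cell
print "exact match\n"  if $cell eq $tag;        # false for this pair; this is
                                                #   what a hash lookup would do
print "prefix match\n" if $cell =~ /^\Q$tag/;   # true; anchored at the start,
                                                #   \Q treats $tag literally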


if ($found == 0) {
print OFP "$line\tNo gene info available\n";
}
}

In your code, $found never gets set to anything, or changed.

Xho
 
ela

After testing the different approaches, Jens Thoms Toerring's works best,
and therefore I modified the code accordingly. Now I just don't know why the
array content cannot be retrieved and only a number ("1") is returned. Can
anyone tell me the reason? In fact I could simply store $line instead of
@cells, but what I finally want to achieve is to print out only a few of
the cells instead of all of them.


my %ahash;
while ( my $line = <$afp> ) {
    my @cells = split /\t/, $line;
    $ahash{ $cells[ 5 ] } = $cells[ 1 ];
}
close $afp;

open my $ifp, '<', $infofile or die "Can't open $infofile for reading\n";

my %ihash;
while ( my $line = <$ifp> ) {
    my @cells = split /\t/, $line;
    $ihash{ $cells[ 1 ] } = @cells;
}
close $ifp;

while ( my $line = <$fp> ) {
    if ( $line eq "\n" ) {
        print $ofp "\n";
        next;
    }
    chomp $line;

    if ( $format eq "" ) {
        @cells = split /:/, $line;
        $tag = $cells[ 0 ];
    } else {
        @cells = split /\t/, $line;
        $tag = $cells[ $acci ];
    }

    $gid = $ahash{ $tag } if exists $ahash{ $tag };
    @gene_info = $ihash{$gid};
    print $ofp "$line\t@gene_info";
}

close $fp;
 
sln

After testing different approaches, Jens Thoms Toerring's works better and
therefore I modified the codes accordingly. Now I just don't know why the
array content cannot be retrieved but only a number "1" is returned. Can
anyone tell me the reason? In fact I can simply pass $line instead of @cells
but what I finally want to achieve is to only print out several cells
instead of all.


my %ahash;
while ( my $line = <$afp> ) {
my @cells = split /\t/, $line;
$ahash{ $cells[ 5 ] } = $cells[ 1 ];
}
close $afp;

open my $ifp, '<', $infofile or die "Can't open $infofile for reading\n";

my %ihash;
while ( my $line = <$ifp> ) {
my @cells = split /\t/, $line;
$ihash{ $cells[ 1 ] } = @cells;
}
close $ifp;

while ( my $line = <$fp> ) {
if ( $line eq "\n" ) {
print $ofp "\n";
next;
}
chomp $line;

if ( $format eq "" ) {
@cells = split /:/, $line;
$tag = $cells[ 0 ];
} else {
@cells = split /\t/, $line;
$tag = $cells[ $acci ];
}

$gid = $ahash{ $tag } if exists $ahash{ $tag };
@gene_info = $ihash{$gid};
print $ofp "$line\t@gene_info";
}

close $fp;

I'm puzzled why you would tackle this in Perl when I'm guessing
the equivalent SQL statement would be hard for you to write.

This is really a simple SQL join of 3 tables on a key field,
and you're trying to do it in Perl instead, etc...

You're looking for speed, but you can't normalize the task.
You make the big mistake of gathering everything into memory,
thereby hogging memory with useless information, then
compounding that error with one-time use. Although I'm not
sure about the one-time use, unless it's interactive, but
I didn't look too hard for that in the code.

It doesn't appear that you have multiple lines of gene data
per key; however, that data could be massive.
There is no need to keep all the data in memory.
You could, in effect, keep a key => file position
hash via tell(), then retrieve the data later with a
seek.
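A rough sketch of that idea (reusing $infofile and $gid from the
posted code; the column layout is assumed to be the same):

# build an index of key => byte offset; only positions stay in memory
open my $ifp, '<', $infofile or die "Can't open $infofile: $!\n";
my %offset;
while (1) {
    my $pos  = tell $ifp;            # position before reading the line
    my $line = <$ifp>;
    last unless defined $line;
    my @cells = split /\t/, $line;
    $offset{ $cells[1] } = $pos;
}

# later: seek straight to the record instead of holding the whole file
my @gene_info;
if ( exists $offset{$gid} ) {
    seek $ifp, $offset{$gid}, 0;     # 0 means absolute position (SEEK_SET)
    my $line = <$ifp>;
    chomp $line;
    @gene_info = split /\t/, $line;
}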

Applying a pseudo-analysis to your content-less code:
it is storing data beyond its use. It's like formal
symbolic logic: write the equation, then solve it;
it's called reverse-engineering.

This is the bottom line equation of your work:

------------------
@Gene-Info Array = @{ I-Hash{ A-Hash{ fp0 } } } if A-Hash{ fp0 } exists
------------------

From inner to outer: when constructing the A-Hash, there is no
need to add a key to the I-Hash if it does not exist in the A-Hash.
If you had written the SQL for this, you would have picked that up.
And since the I-Hash contains all the mega gene data, you just
ruptured your memory's brain.

Start over, write pseudo-code, re-check your work via logic analysis
from the inner to outer context. This will save you countless hours
of headache.

-sln
 
