Hashed to death... help needed.

T

Teq

First of all, hello!
I've been reading the group for quite a while now, but this is the first
time I really need help. This is my second attempt at posting to this
particular server, my first post has gone missing, hope it won't double it.

To summarize things: I'm building a parser which has to consolidate data
based on variables contained in an array.
The source file contains a set of tab-separated-values, and those are
parsed out into an array which contains
pdbID | resNum | resID | secstructID, these are then consolidated into a
file which should contain:
pdbID | startRes | endRes | secstructID

source array with the data for consolidation:
1b6g 1 M \N
1b6g 2 V \N
1b6g 3 N \N
1b6g 4 N H
1b6g 5 N H
1b6g 6 N \N
3hba 7 W H
2cdg 8 N H
2cdg 9 V \N
2cdg 10 M \N
2cdg 11 A B
2cdg 12 M \N

expected result after consolidation, should be:
1b6g 1 3 \N
1b6g 4 5 H
1b6g 6 6 \N
3hba 7 7 H
2cdg 5 6 H
2cdg 7 7 \N
2cdg 8 8 H
2cdg 9 10 H
2cdg 11 11 B
2cdg 12 12 \N

As you can see each pdbID is assigned a secStructuID in a sequential manner
and any interruptions are to be considered as points from which the
assignment starts new.
Each pdbID can thus have multiple occurences of for example \N in different
places of the sequence and they are differentiated by the startRes and
endRes values.
All is wonderful and I have a working code which consolidates the data,
unfortunately it doesn't recognize the occurence of the new secstructID
automatically as the end of the previous one rather it finds the last
possible in the whole sequence for one pdbID and considers that as the end.

and so my result is incorrectly displayed as:
1b6g 4 5 H
1b6g 1 6 \N ---- error here - this should be in fact two separate
"entities" because 4 and 5 do not belong to \N
3hba 7 7 H
2cdg 8 8 H
2cdg 9 12 \N ---- same here (7 and 8 should break this into two)
2cdg 11 11 B

And here's my code:

---------------------------------------------
#!/usr/bin/perl -w
use strict;
use warnings;

# --------------------------------------------------------------
# This script uses the residue.txt file generated by
# resTabmakerBatch.pl and creates a new file called
# SecStructList.txt
# each protein is described by secondary structures with a
# pdbID, 2ry structureID (char or \N'), startResidue, endResidue
# Input: residue.txt (this file is the output of resTabmakerBatch.pl)
# Output: secStructList.txt
# usage: secStructList.txt to populate the SecStructure entity
# --------------------------------------------------------------

#Read arguments, print error message if insufficient
if ($#ARGV<0)
{
die("\n\nUsage: sstruct.pl [residue_table_file.txt]\n\n");
}

my $filename = $ARGV[0];

#if either file not found return error message
if (! -e "$filename")
{
die("\n\nresidue file $filename does not exist!\n\n");
}

# Read residue.txt file, extracting the data of interest - only
# pdb id, resNum, resID, secondaryStructID

#First read file, storing each line in an array 'dssplines' splitting the
data
open (MYFILE,"$filename") or die ("\nERROR: Can't open $filename\n");
my @dssplines= split(/\r/, <MYFILE>);
my $arraySize=@dssplines;
close(MYFILE);
#read one line from the originally loaded array dssplines at a time and loop
#over it splitting the values using the tabs
my @dsspdata;
my $dsspdataSize=@dssplines;

my $n=0;
for (my $i=0; $i < $arraySize; $i++)
{
#each line from the array goes into a new dsspline variable
my $dsspline = $dssplines[$i];
for (my $j = 0; $j <=4; $j++)
{
#each time values inside are separated using the tabs
my ($pdbID, $resNo, $resID, $phi, $psi, $chi1, $chi2, $secStruct,
$activesite) = split(/\t/, $dsspline);
# now each value of interest is stored into a new array @dsspdata
$dsspdata[$n][0] = $pdbID;
$dsspdata[$n][1] = $resNo;
$dsspdata[$n][2] = $resID;
$dsspdata[$n][3] = $secStruct;
}
$n++;
}

#my @dsspdata array is now perfect to reformat into a hash analyzing the
value correlation

#initialize the hash and counter
my %dane;
my $k=0;
#loop around the dsspdata array
for (my $i=0; $i < $dsspdataSize; $i++)
{
#split each cell in a row into variables for the hash
for (my $k = 0; $k <=4; $k++)
{
my $pdb = $dsspdata[$i][0];
my $residueNum = $dsspdata[$i][1];
my $secStructure = $dsspdata[$i][3];
push @{ $dane{$pdb}->{$secStructure} }, $residueNum;
}
$k++;
}

#now for each pdbID using the hash keys
foreach my $pdbID ( keys %dane )
{
#check the secondary structure id with pdbID as a key (only if the pdbID
is the same will the values be stored)
foreach my $secID ( keys %{ $dane{$pdbID} } )
{
#finally create an array of residue numbers
my @resnums = ( $dane{$pdbID}->{$secID}->[0],
$dane{$pdbID}->{$secID}->[-1] );
#create a new file with the secondary structures list
open (SStruc, ">>secStructList.txt") || die "Can't open file: $!";
#append each line to the new file with tab separated data
print SStruc ("$pdbID \t @resnums \t $secID\n");
}
}
close(SStruc);

If anyone has an idea how to deal with this I would be very grateful for
any suggestions.

Cheers,
Matt
 
T

Teq

Jim Gibson said:
Have you taken the time to read the guidelines for posting to this
group? [...]
That is as far as I wished to analyze your program. If you fix up the
peculiarities mentioned and re-post it (albeit a shorter version that
uses <DATA>), I or someone else can probably help you.

Admittedly I deserved that :(
Thanks for the tips, I've redesigned the program (loosing about 50 lines of
code, sweet ;)), but this time I'll post only the part that is of
consequence ;)

----------------------------------------------
#!/usr/bin/perl -w
use strict;
use warnings;

my %dane;
while (<DATA>) {
my ( $x, $y, undef, $z ) = split;

push @{ $dane{$x}->{$z} }, $y;
}

foreach my $pdbID ( keys %dane ) {
foreach my $secID ( keys %{ $dane{$pdbID} } ) {
my @resnums = ( $dane{$pdbID}->{$secID}->[0],
$dane{$pdbID}->{$secID}->[-1] );

print "$pdbID @resnums $secID\n";
}
}

__DATA__
1b6g 1 M \N
1b6g 2 V \N
1b6g 3 N \N
1b6g 4 N H
1b6g 5 N H
1b6g 6 N \N
3hba 7 W H
2cdg 8 N H
2cdg 9 V \N
2cdg 10 M \N
2cdg 11 A B
2cdg 12 M \N

-------------------------------------

the output is
3hba 7 7 H
1b6g 4 5 H
1b6g 1 6 \N
2cdg 8 8 H
2cdg 9 12 \N
2cdg 11 11 B

and I don't know how to get that sucker to identify when secStructID has
changed and finish one entity, and only store sequential groups of secondary
structures for the same pdbID.

the output should be:
1b6g 1 3 \N
1b6g 4 5 H
1b6g 6 6 \N
3hba 7 8 H
2cdg 9 10 \N
2cdg 11 A B
2cdg 12 M \N

And no, it's not homework ;) I'm trying to redo something that I have
written in Java some time ago, at the same time I'm learning perl, hence all
the weird loops in original code, instead of the perl-typical code. All I
know of hashes I have read in the last 7 hours, so please have mercy ;)
Before I started seeking help, it was 250 lines of code (mostly
java-style), doing less than the above. I will appreciate any hints as, I am
stuck.

Cheers,
Matt
 
X

xhoster

Teq said:
----------------------------------------------
#!/usr/bin/perl -w
use strict;
use warnings;

my %dane;
while (<DATA>) {
my ( $x, $y, undef, $z ) = split;
push @{ $dane{$x}->{$z} }, $y;
}

foreach my $pdbID ( keys %dane ) {
foreach my $secID ( keys %{ $dane{$pdbID} } ) {
my @resnums = ( $dane{$pdbID}->{$secID}->[0],
$dane{$pdbID}->{$secID}->[-1] );

print "$pdbID @resnums $secID\n";
}
}

__DATA__
1b6g 1 M \N
1b6g 2 V \N
1b6g 3 N \N
1b6g 4 N H
1b6g 5 N H
1b6g 6 N \N
3hba 7 W H
2cdg 8 N H
2cdg 9 V \N
2cdg 10 M \N
2cdg 11 A B
2cdg 12 M \N

-------------------------------------

the output is
3hba 7 7 H
1b6g 4 5 H
1b6g 1 6 \N
2cdg 8 8 H
2cdg 9 12 \N
2cdg 11 11 B

and I don't know how to get that sucker to identify when secStructID has
changed and finish one entity,

It is not clear that a hash is the right way to go about this. It seems
like you want to address the data in the same sequence that it is
presented, rather than storing everything in a mostly orderless data
structure. If so, then you need to record the PDBid and seqID from the
last time through the loop and check to see if it changed on this time
through the loop. That kind of code is conceptually simple but alas it is
tedious and ugly (in any language, as far as I know). Out of curiosity,
how did you do it in Java?

This is what I came up with:

my ( $x, $y, undef, $z ) = split ' ', <DATA>;
my ($last_x,$last_z,$min,$max)=($x,$z,$y,$y);

while (<DATA>) {
my ( $x, $y, undef, $z ) = split;
if ($x ne $last_x or $z ne $last_z) {
print "$last_x $min $max $last_z\n";
($last_x,$last_z,$min,$max)=($x,$z,$y,$y);
};
$max=$y;
}
print "$last_x $min $max $last_z\n";

It is ugly because the print has to happen in two places (in the loop
and after the loop) and the assignment has to happen in two places
(before the loop and in the loop), but as I say this is a very common
piece of ugliness.

and only store sequential groups of
secondary structures for the same pdbID.

the output should be:
1b6g 1 3 \N
1b6g 4 5 H
1b6g 6 6 \N
3hba 7 8 H

Is this really what you want? 7 was in 3hba and 8 was in 2cdg, so you
didn't detect the transition from 3hba to 2cdg, contrary to what you say
above about "for the same pdbID".
2cdg 9 10 \N
2cdg 11 A B
2cdg 12 M \N

And no, it's not homework ;) I'm trying to redo something that I have
written in Java some time ago, at the same time I'm learning perl, hence
all the weird loops in original code, instead of the perl-typical code.
All I know of hashes I have read in the last 7 hours, so please have
mercy ;) Before I started seeking help, it was 250 lines of code (mostly
java-style), doing less than the above. I will appreciate any hints as, I
am stuck.

Going back to your hash-based code: In stead of taking the first and last
element of the array @{$dane{$pdbID}->{$secID}}, like you currently do,
you could detect gaps in the values of this array and take the first and
last elements of each non-contiguous chunk. You could probably even do
this in a single grep or map statement, but I think doing so would be
overly clever.

Xho
 
T

Teq

It is not clear that a hash is the right way to go about this. It seems
like you want to address the data in the same sequence that it is
presented, rather than storing everything in a mostly orderless data
structure. If so, then you need to record the PDBid and seqID from the
last time through the loop and check to see if it changed on this time
through the loop. That kind of code is conceptually simple but alas it is
tedious and ugly (in any language, as far as I know). Out of curiosity,
how did you do it in Java?

In Java the source was very much different because it was a raw file, these
here are already pre-processed by other scripts.
For each pdb file the data would be parsed and I would iterate through my
original parser-output arrays extracting data into variables, which then I
used to create AAObjects (containing resNum, resID, secStructID), which in
turn were then moved into aaObjectList - an array of objects.
Then I used these objects to create secondary structure objects and
secondary structure array of objects.
I would iterate around the initial aaObjectList checking for secStructID
(e.g. E,G,H) and each time I would discover a new one I'd set start
position, each time I'd encounter a different one I'd set end position (-1)
and pass the secStructID along with the length of the object start-to-end,
resID list from start position to the end position into a new object
SSObject.
Obviously as you said - the data was in proper order, unlike in case of
hashes, of which I was not aware :/
Java for me is more straightforward, but it's overkill for some tasks (my
original dssp parser was with all the bling ;) and it was something around 3
thousand lines), hence me working perl.
Still, it took me 3 days to understand that I can't really work with objects
in perl like I do in Java, and another 2 to realize that java-like loops of
30 lines+ can be compressed into 1 line of code (for this I thank Jim which
was kind enough to point it out ;))
This is what I came up with:
[code cut]
Thank you very much for the code, I will work it out tomorrow, it's almost
3am here, I'm going to bed ;)
Is this really what you want? 7 was in 3hba and 8 was in 2cdg, so you
didn't detect the transition from 3hba to 2cdg, contrary to what you say
above about "for the same pdbID".

You are absolutely right, my bad - these should have been separated, that's
what happens when I sit at my computer after midnight ;(.
Going back to your hash-based code: In stead of taking the first and last
element of the array @{$dane{$pdbID}->{$secID}}, like you currently do,
you could detect gaps in the values of this array and take the first and
last elements of each non-contiguous chunk.

The idea with using a hash here is mostly because I want to learn it, and
because it was adviced by another usenet user :) not because it is a good
practice (also currently I am such a newbie that I don't really know what is
good practice in the first place) - I have managed to conquer this problem
with a set of nested if, elsif's but it was UGLY and a couple HUNDRED lines
longer than a perl script should be, I will follow your tips. Thank you !

Cheers,
Matt
 
T

Teq

Jim Gibson said:
You are discarding your third field, yet it shows up in your output
below (last two lines). Either save it here or don't expect it to be
output.

hmm, I'll double check but I'm almost sure it's not in the output, I am
considering using it, but it'll have to wait until I have finished the
database schema.
foreach my $pdbID ( keys %dane ) {
foreach my $secID ( keys %{ $dane{$pdbID} } ) {
my @resnums = ( $dane{$pdbID}->{$secID}->[0],
$dane{$pdbID}->{$secID}->[-1] );
3hba 7 W H
2cdg 8 N H

The above two lines have the same value in field 4, but different
values in field 1. Your desired output ignores the different value in
field 1 of the second of the two records. Is that correct?

no no, that was a stupid mistake of mine, as Xho has pointed out in the
earlier message, the output should actually differentiate between these two
3hba 7 7 H
2cdg 8 8 H
You need to compare each record read with the previous record and see
if the relevant fields have changed. For this, a hash is not the best
data type. An array would be better.
You want the output in the same order as the input. But plain hashes do
not preserve the order in which elements are inserted (unless you use a
special "tied" hash -- see perldoc -q "hash remember")

I completely disregarded the fact that the data in the hash is unordered
assuming that the identification by pdbID and sequence of values in resNum
would be sufficient, I have to think about this.
Here is a program that gives your desired output, except for the
discarded element in the original record. I leave that as an exercise
for the reader.

Thank you very much for all your trouble. Although the code by Xho above
seems far simpler I will definitely try to work out the idea with hashes,
best time to learn it is now :).

Regards,
Matt
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,989
Messages
2,570,207
Members
46,782
Latest member
ThomasGex

Latest Threads

Top