First of all, hello!
I've been reading the group for quite a while now, but this is the first
time I really need help. This is my second attempt at posting to this
particular server; my first post has gone missing, so I hope this one
doesn't show up twice.
To summarize: I'm building a parser which has to consolidate data
based on values contained in an array.
The source file contains tab-separated values, which are parsed into an
array containing:
pdbID | resNum | resID | secstructID
These are then consolidated into a file which should contain:
pdbID | startRes | endRes | secstructID
Source array with the data for consolidation:
1b6g 1 M \N
1b6g 2 V \N
1b6g 3 N \N
1b6g 4 N H
1b6g 5 N H
1b6g 6 N \N
3hba 7 W H
2cdg 8 N H
2cdg 9 V \N
2cdg 10 M \N
2cdg 11 A B
2cdg 12 M \N
Expected result after consolidation:
1b6g 1 3 \N
1b6g 4 5 H
1b6g 6 6 \N
3hba 7 7 H
2cdg 8 8 H
2cdg 9 10 \N
2cdg 11 11 B
2cdg 12 12 \N
As you can see, each pdbID is assigned a secstructID sequentially, and
any interruption is a point at which the assignment starts anew.
Each pdbID can thus have multiple occurrences of, for example, \N in
different places of the sequence, and they are differentiated by their
startRes and endRes values.
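To make the rule above concrete, here is a rough sketch of the grouping I'm after, separate from my actual script further down. It assumes the rows are already in file order as [pdbID, resNum, resID, secstructID]; the subroutine name consolidate_runs is made up purely for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of my script): collapse consecutive rows
# that share both pdbID and secstructID into
# [pdbID, startRes, endRes, secstructID].
sub consolidate_runs {
    my @rows = @_;    # each row: [pdbID, resNum, resID, secstructID]
    my @runs;
    for my $row (@rows) {
        my ($pdb, $res, undef, $sec) = @$row;
        my $last = $runs[-1];    # undef while @runs is still empty
        if ($last && $last->[0] eq $pdb && $last->[3] eq $sec) {
            $last->[2] = $res;                       # same run: extend endRes
        }
        else {
            push @runs, [$pdb, $res, $res, $sec];    # interruption: new run
        }
    }
    return @runs;
}

# First seven rows of the sample data above
my @rows = (
    ['1b6g', 1, 'M', '\N'], ['1b6g', 2, 'V', '\N'], ['1b6g', 3, 'N', '\N'],
    ['1b6g', 4, 'N', 'H'],  ['1b6g', 5, 'N', 'H'],  ['1b6g', 6, 'N', '\N'],
    ['3hba', 7, 'W', 'H'],
);

print join("\t", @$_), "\n" for consolidate_runs(@rows);
```

The point of the sketch is only that each row is compared with the run immediately before it, so a second stretch of \N for the same pdbID becomes its own entry instead of being merged with the first.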
I have working code which consolidates the data. Unfortunately, it
doesn't treat the occurrence of a new secstructID as the end of the
previous run; instead it takes the last occurrence of that secstructID
in the whole sequence for one pdbID and uses that as the end.
So my result incorrectly comes out as:
1b6g 4 5 H
1b6g 1 6 \N ---- error here - this should be in fact two separate
"entities" because 4 and 5 do not belong to \N
3hba 7 7 H
2cdg 8 8 H
2cdg 9 12 \N ---- same here (11 belongs to B and should break this into two)
2cdg 11 11 B
And here's my code:
---------------------------------------------
#!/usr/bin/perl -w
use strict;
use warnings;
# --------------------------------------------------------------
# This script uses the residue.txt file generated by
# resTabmakerBatch.pl and creates a new file called
# secStructList.txt
# Each protein is described by secondary structures with a
# pdbID, secondary structure ID (char or \N), startResidue, endResidue
# Input: residue.txt (this file is the output of resTabmakerBatch.pl)
# Output: secStructList.txt
# Usage: secStructList.txt is used to populate the SecStructure entity
# --------------------------------------------------------------
#Read arguments, print error message if insufficient
if ($#ARGV<0)
{
die("\n\nUsage: sstruct.pl [residue_table_file.txt]\n\n");
}
my $filename = $ARGV[0];
#if the file is not found return error message
if (! -e "$filename")
{
die("\n\nresidue file $filename does not exist!\n\n");
}
# Read residue.txt file, extracting the data of interest - only
# pdb id, resNum, resID, secondaryStructID
#First read file, storing each line in an array 'dssplines' splitting the
#data
open (MYFILE,"$filename") or die ("\nERROR: Can't open $filename\n");
my @dssplines= split(/\r/, <MYFILE>);
my $arraySize=@dssplines;
close(MYFILE);
#read one line from the originally loaded array dssplines at a time and loop
#over it splitting the values using the tabs
my @dsspdata;
my $dsspdataSize=@dssplines;
my $n=0;
for (my $i=0; $i < $arraySize; $i++)
{
#each line from the array goes into a new dsspline variable
my $dsspline = $dssplines[$i];
for (my $j = 0; $j <=4; $j++)
{
#each time values inside are separated using the tabs
my ($pdbID, $resNo, $resID, $phi, $psi, $chi1, $chi2, $secStruct,
$activesite) = split(/\t/, $dsspline);
# now each value of interest is stored into a new array @dsspdata
$dsspdata[$n][0] = $pdbID;
$dsspdata[$n][1] = $resNo;
$dsspdata[$n][2] = $resID;
$dsspdata[$n][3] = $secStruct;
}
$n++;
}
#my @dsspdata array is now perfect to reformat into a hash analyzing the
#value correlation
#initialize the hash and counter
my %dane;
my $k=0;
#loop around the dsspdata array
for (my $i=0; $i < $dsspdataSize; $i++)
{
#split each cell in a row into variables for the hash
for (my $k = 0; $k <=4; $k++)
{
my $pdb = $dsspdata[$i][0];
my $residueNum = $dsspdata[$i][1];
my $secStructure = $dsspdata[$i][3];
push @{ $dane{$pdb}->{$secStructure} }, $residueNum;
}
$k++;
}
#now for each pdbID using the hash keys
foreach my $pdbID ( keys %dane )
{
#check the secondary structure id with pdbID as a key (only if the pdbID
#is the same will the values be stored)
foreach my $secID ( keys %{ $dane{$pdbID} } )
{
#finally create an array of residue numbers
my @resnums = ( $dane{$pdbID}->{$secID}->[0],
$dane{$pdbID}->{$secID}->[-1] );
#create a new file with the secondary structures list
open (SStruc, ">>secStructList.txt") || die "Can't open file: $!";
#append each line to the new file with tab separated data
print SStruc ("$pdbID \t @resnums \t $secID\n");
}
}
close(SStruc);
If anyone has an idea how to deal with this I would be very grateful for
any suggestions.
Cheers,
Matt