how to read file from sub-directories and do an average?

Ross

I have the following directories under a dir called raw data-1, and under
each subdir, say, 4601-4.SMP, there is a single file under there. indeed
that single file has a fixed format and i'm going to extract numerical
values there to write to a new file along with two from 4601-4B.SMP and
4601-4C.SMP. Since a user does not follow nomenclature strictly, sometimes
he names a dir, say, 4601-4A.SMP instead of 4601-4.SMP, how could i achieve
extracting 3 files and write to a single file? Finally i would find the
average of the numerical values obtained from the 3 files.

admin/home/admin> ls -al raw\ data-1/
total 96
drwxr-xr-x 23 admin admin 4096 Aug 20 03:59 .
drwxr-xr-x 9 admin admin 4096 Aug 20 03:59 ..
drwxr-xr-x 2 admin admin 4096 Aug 20 03:59 4594-3.SMP
drwxr-xr-x 2 admin admin 4096 Aug 20 03:59 4594-3B.SMP
drwxr-xr-x 2 admin admin 4096 Aug 20 03:59 4594-3C.SMP
drwxr-xr-x 2 admin admin 4096 Aug 20 03:59 4601-4.SMP
drwxr-xr-x 2 admin admin 4096 Aug 20 03:59 4601-4B.SMP
drwxr-xr-x 2 admin admin 4096 Aug 20 03:59 4601-4C.SMP
drwxr-xr-x 2 admin admin 4096 Aug 20 03:59 4605-5.SMP
drwxr-xr-x 2 admin admin 4096 Aug 20 03:59 4605-5B.SMP
drwxr-xr-x 2 admin admin 4096 Aug 20 03:59 4605-5C.SMP
drwxr-xr-x 2 admin admin 4096 Aug 20 03:59 4612-2.SMP
drwxr-xr-x 2 admin admin 4096 Aug 20 03:59 4612-2B.SMP
drwxr-xr-x 2 admin admin 4096 Aug 20 03:59 4612-2C.SMP
drwxr-xr-x 2 admin admin 4096 Aug 20 03:59 4614-6.SMP
drwxr-xr-x 2 admin admin 4096 Aug 20 03:59 4614-6B.SMP
drwxr-xr-x 2 admin admin 4096 Aug 20 03:59 4614-6C.SMP
drwxr-xr-x 2 admin admin 4096 Aug 20 03:59 4618-1.SMP
drwxr-xr-x 2 admin admin 4096 Aug 20 03:59 4618-1B.SMP
drwxr-xr-x 2 admin admin 4096 Aug 20 03:59 4618-1C.SMP
drwxr-xr-x 2 admin admin 4096 Aug 20 03:59 4620-1.SMP
drwxr-xr-x 2 admin admin 4096 Aug 20 03:59 4620-1B.SMP
drwxr-xr-x 2 admin admin 4096 Aug 20 03:59 4620-1C.SMP
 
Paul Lalli

Ross said:
I have the following directories under a dir called raw data-1, and under
each subdir, say, 4601-4.SMP, there is a single file under there. indeed
that single file has a fixed format and i'm going to extract numerical
values there to write to a new file along with two from 4601-4B.SMP and
4601-4C.SMP. Since a user does not follow nomenclature strictly, sometimes
he names a dir, say, 4601-4A.SMP instead of 4601-4.SMP, how could i achieve
extracting 3 files and write to a single file? Finally i would find the
average of the numerical values obtained from the 3 files.

What have you tried so far? What part of this do you need help with?

opening and reading a directory?
perldoc -f opendir
perldoc -f readdir

opening and reading a file?
perldoc -f open
perldoc perlopentut
perldoc perlop ("I/O Operators")

Saving the values in a variable?
perldoc perldata

Counting, summing, and averaging a value?
perldoc perlop

Make an attempt - preferably following the posting guidelines of this
group - and if it doesn't do what you want, feel free to ask this group
for help.

Paul Lalli
 
Ross

the problems are:

1) my codes are clumsy,
opendir(WORKINGDIR, $ARGV[0]) || die ("unable to open dir named $ARGV[0]");

while ($inputdirname = readdir(WORKINGDIR)) {
chdir ($ARGV[0]);

opendir(SUBDIR, $inputdirname) || die ("unable to open dir named
$inputdirname");

$inputfilename = readdir(RUBDIR);
chdir ($inputdirname);

open(IN, $inputfilename) || die "Could not open $inputfilename\n";

...
}

2)don't know how to write a file into columns, i separately wrote a script
which can process the input file into the following example tab-delimited
format,

"4594-3A"
        Concentration (mg/mL)   Normalized Concentration
ASP     15.9789                 8.873
THR     5.6596                  3.143
SER     27.4199                 15.226
GLU     23.0988                 12.826
PRO     7.0019                  3.888
GLY     10.2960                 5.717
ALA     33.3880                 18.540
CYS     1.9538                  1.085
VAL     6.6856                  3.713
MET     1.9792                  1.099
ILE     3.4556                  1.919
LEU     5.4778                  3.041
TYR     2.1671                  1.204
PHE     2.5160                  1.397
HIS     2.2561                  1.253
LYS     21.2256                 11.786
ARG     3.9567                  2.197
TOTALS  180.0864                100.000


and the final averaged file should be in this way, as you can see, the data
are added in a column fashion.

       4594-3A   4594-3B   4594-3C   average            average (to 2 d.p.)
ASP    15.045    15.082    14.836    14.98767    ASP    14.99
THR    2.626     2.577     2.595     2.599333    THR    2.60
SER    19.276    19.543    19.499    19.43933    SER    19.44
GLU    19.343    19.81     19.651    19.60133    GLU    19.60
PRO    2.801     2.866     2.881     2.849333    PRO    2.85
GLY    5.031     5.149     5.074     5.084667    GLY    5.08
ALA    18.615    18.616    18.533    18.588      ALA    18.59
CYS    0         0         0         0           CYS    0.00
VAL    2.823     2.722     2.742     2.762333    VAL    2.76
MET    1.05      0.785     0.836     0.890333    MET    0.89
ILE    1.66      1.642     1.627     1.643       ILE    1.64
LEU    2.8       2.534     2.571     2.635       LEU    2.64
TYR    1.352     1.153     1.175     1.226667    TYR    1.23
PHE    1.298     1.067     1.105     1.156667    PHE    1.16
HIS    1.03      0.98      0.989     0.999667    HIS    1.00
LYS    0.862     0.808     0.83      0.833333    LYS    0.83
ARG    1.739     1.608     1.656     1.667667    ARG    1.67
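
For reference, the column-wise merge and average described above could be sketched roughly as follows. This is only an illustration under assumptions: each per-sample file is taken to be tab-delimited with the residue name in the first field and the normalized concentration in the third, and the sub names read_normalized and average_rows are made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the lines of one tab-delimited sample file into
# a hash of residue name => normalized concentration (third field, assumed).
sub read_normalized {
    my @lines = @_;
    my %value;
    for my $line (@lines) {
        chomp $line;
        my ($name, $conc, $norm) = split /\t/, $line;
        # Keep only three-letter data rows; the title, header and TOTALS
        # lines are skipped.
        next unless defined $norm && $name =~ /^[A-Z]{3}$/;
        $value{$name} = $norm;
    }
    return \%value;
}

# Hypothetical helper: one output row per residue, with one column per
# sample followed by the mean rounded to two decimal places.
sub average_rows {
    my ($order, $samples) = @_;    # array ref of names, array ref of hash refs
    my @rows;
    for my $name (@$order) {
        my @vals = map { $_->{$name} // 0 } @$samples;
        my $sum = 0;
        $sum += $_ for @vals;
        push @rows, join "\t", $name, @vals, sprintf('%.2f', $sum / @vals);
    }
    return @rows;
}

# Parsing demo, using two rows from the example file above.
my $parsed = read_normalized(
    "\"4594-3A\"\n",
    "ASP\t15.9789\t8.873\n",
    "TOTALS\t180.0864\t100.000\n",
);

# Averaging demo, using the ASP and THR values from the averaged table.
my @rows = average_rows(
    [qw(ASP THR)],
    [
        { ASP => 15.045, THR => 2.626 },
        { ASP => 15.082, THR => 2.577 },
        { ASP => 14.836, THR => 2.595 },
    ],
);
print "$_\n" for @rows;
```

Keeping the residue order in a separate list means the output rows come out in the same order as the input file rather than in hash order.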
 
xhoster

Ross said:
I have the following directories under a dir called raw data-1, and under
each subdir, say, 4601-4.SMP, there is a single file under there. indeed
that single file has a fixed format and i'm going to extract numerical
values there to write to a new file along with two from 4601-4B.SMP and
4601-4C.SMP. Since a user does not follow nomenclature strictly,
sometimes he names a dir, say, 4601-4A.SMP instead of 4601-4.SMP, how
could i achieve extracting 3 files and write to a single file?

IMHO, you can't. Either a user does what he is supposed to, or he doesn't.
If the user doesn't, then all you can do is guess. What if your user
decides to not follow nomenclature strictly by sticking the A before the 2nd
number segment rather than after it? What if he fails to follow
nomenclature strictly by mis-spelling 4601-4A.SMP as 4603-4A.SMP? Either
it is allowed to add a letter after the 2nd number segment, in which case
the user *is* following nomenclature strictly, or we are just playing a
guessing game.

Anyway, I've handled similar situations something like:

my %set;
## Find the set of all groups to be averaged over
my @files = glob "*SMP";
foreach (@files) {
    /^(\d+-\d+)/ or die "invalid format $_";
    $set{$1} = ();    # key on the captured group prefix, not the whole name
}

foreach my $group (keys %set) {
    foreach my $member ( grep /\Q$group\E\D/, @files ) {
        ## open $member and do whatever you do
    }
    ## do whatever output you need for the group
}

The \D in the regex makes sure that 4594-32.SMP is not considered part
of the set "4594-3".

Xho
 
xhoster

Ross said:
I have the following directories under a dir called raw data-1, and under
each subdir, say, 4601-4.SMP, there is a single file under there. indeed
that single file has a fixed format and i'm going to extract numerical
values there to write to a new file along with two from 4601-4B.SMP and
4601-4C.SMP. Since a user does not follow nomenclature strictly,
sometimes he names a dir, say, 4601-4A.SMP instead of 4601-4.SMP, how
could i achieve extracting 3 files and write to a single file?

IMHO, you can't. Either a user does what he is supposed to, or he doesn't.
If the user doesn't, then all you can do is guess. What if your user
decides to not follow nomenclature strictly by sticking the A before the 2nd
number segment rather than after it? What if he fails to follow
nomenclature strictly by mis-spelling 4601-4A.SMP as 4603-4A.SMP? Either
it is allowed to add a letter after the 2nd number segment, in which case
the user *is* following nomenclature strictly, or we are just playing a
guessing game.

Anyway, I've handled similar situations something like:

my %set;
## Find the set of all groups to be averaged over
my @files = glob "*SMP";
foreach (@files) {
    /^(\d+-\d+)/ or die "invalid format $_";
    $set{$1} = ();    # key on the captured group prefix, not the whole name
}

foreach my $group (keys %set) {
    foreach my $member ( grep /^\Q$group\E\D/, @files ) {
        ## open $member and do whatever you do
    }
    ## do whatever output you need for the group
}

The \D in the regex makes sure that 4594-32.SMP is not considered part
of the group "4594-3". (And the ^ makes sure that 594-3.SMP isn't
considered part of the group 4594-3)
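
A self-contained variant of the same grouping idea, as a sketch only: it assumes the one permitted deviation is a single optional letter before ".SMP", and the sub name group_by_prefix is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: map each "NNNN-N" prefix to the directory names that
# belong to it, so 4601-4.SMP, 4601-4A.SMP and 4601-4B.SMP share one group.
sub group_by_prefix {
    my @names = @_;
    my %group;
    for my $name (@names) {
        # Digits, a hyphen, digits, an optional single letter, then ".SMP".
        my ($prefix) = $name =~ /^(\d+-\d+)[A-Za-z]?\.SMP$/
            or die "invalid format: $name";
        push @{ $group{$prefix} }, $name;
    }
    return \%group;
}

my $groups = group_by_prefix(
    qw(4594-3.SMP 4594-3B.SMP 4594-3C.SMP 4594-32.SMP 4601-4A.SMP)
);

for my $prefix (sort keys %$groups) {
    print "$prefix: @{ $groups->{$prefix} }\n";
}
```

Because the whole name is matched in one anchored regex, 4594-32.SMP ends up in its own group "4594-32" instead of being folded into "4594-3", and anything that does not fit the pattern dies early rather than being guessed at.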

Xho
 
Brian McCauley

Tad said:
Ross said:
chdir ($ARGV[0]);


You should probably ensure that it is an existing directory
before you try to open it:

die "'$inputdirname' is not a directory" unless -d $inputdirname;

I disagree. It's better to just try the chdir() and die if it fails.
There are numerous reasons why this is better, to do with race conditions
and permissions.

Tad said:
Use copy/paste or your editor's "import" function rather than
attempting to type in your code. If you make a typo you will get
followups about your typos instead of about the question you are
trying to get answered.

Fair point.

Tad said:
It is profoundly rude of you to bother thousands of people with
such silliness.

Are you being rude on purpose? It sure looks like it.

Tad, I think such strong statements are a little over the top for a
first offence. (If this wasn't a first offence then they are justified).
 
Ross

If I don't sort the dir names, there is no guarantee that the average is
taken over the right files. But if I do sort them, using something like:

$curdir = `pwd`;
chop $curdir;

$curdir = $curdir . "/$ARGV[0]";


@inputdirname = readdir(WORKINGDIR);

foreach $inputdirname (sort @inputdirname) {

print $curdir.$inputdirname; <STDIN>;
opendir(SUBDIR, $curdir.$inputdirname) || die ("unable to open dir
named $inputdirname $!");

<processing>

chdir ($curdir);

<counting>
}

then . and .. are taken into account as well.
 
Tad McClellan

Brian McCauley said:
[code with typos]
Fair point.


Tad, I think such strong statements are a little over the top for a
first offence. (If this wasn't a first offence then they are justified).


I pointed out the futility of "paraphrased" code a month ago:

Message-Id: <[email protected]>

I mentioned providing attributions twice before this third
(unattributed) followup from the OP.

Another poster pointed this OP to the posting guidelines over a month ago.

And I get the feeling that clp.misc is tried before any other resource
(rather than after all other resources.)


I wouldn't go to assuming "on purpose" on a first offence either.
 
Tad McClellan

Ross said:
$curdir = `pwd`;
chop $curdir;


You should not use chop() to remove newlines.

You should use chomp() to remove newlines.

@inputdirname = readdir(WORKINGDIR);
foreach $inputdirname (sort @inputdirname) {


There is no need for a temporary array:

foreach $inputdirname (sort readdir WORKINGDIR) {

Or, since you should have "use strict" turned on by now:

foreach my $inputdirname (sort readdir WORKINGDIR) {
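
To make the chop/chomp difference concrete, here is a small sketch. The string "rawdat" is just sample input, and using getcwd from the core Cwd module in place of the `pwd` backticks is a suggestion of this sketch, not something from the original code.

```perl
use strict;
use warnings;
use Cwd qw(getcwd);

# chop removes the last character whatever it is; chomp removes only a
# trailing newline (strictly, the current value of $/).
my $with_newline    = "rawdat\n";
my $without_newline = "rawdat";

chomp(my $chomped_a = $with_newline);       # newline removed
chomp(my $chomped_b = $without_newline);    # nothing to remove, unchanged
chop(my $chopped    = $without_newline);    # last real character is lost

print "$chomped_a $chomped_b $chopped\n";

# getcwd returns the current directory with no trailing newline to strip.
my $cwd = getcwd();
print "$cwd\n";
```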
 
Tad McClellan

Ross said:
$curdir = $curdir . "/$ARGV[0]";
opendir(SUBDIR, $curdir.$inputdirname) || die ("unable to open dir
                      ^^^ no slash character?
named $inputdirname $!");


Don't you need a directory separator character between $curdir
and $inputdirname?

Your diagnostic message is misleading, you should have the same name
there as used in the opendir():

opendir(SUBDIR, "$curdir/$inputdirname") or
die "unable to open dir named '$curdir/$inputdirname' $!";

chdir ($curdir);


You should check the return value to ensure that you actually
got what you asked for:

chdir $curdir or die "could not change to '$curdir' $!";
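
Putting those two points together, one way to avoid both the missing separator and the chdir calls is to build full paths with File::Spec. The following is a sketch under assumptions: the sub name files_in and the file name data.txt are invented for the demo, and it runs against a throwaway temporary directory rather than the real data.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: return the full paths of the plain files in $dir,
# sorted by name. Paths are built with catfile, so no chdir is needed and
# the separator cannot be forgotten.
sub files_in {
    my ($dir) = @_;
    opendir my $dh, $dir or die "unable to open dir '$dir': $!";
    my @names = sort readdir($dh);
    closedir $dh;
    # "." and ".." are directories, so the -f test drops them as well.
    return grep { -f $_ } map { File::Spec->catfile($dir, $_) } @names;
}

# Demo on a throwaway directory (names below are for illustration only).
my $top = tempdir(CLEANUP => 1);
my $sub = File::Spec->catdir($top, '4601-4.SMP');
mkdir $sub or die "could not create '$sub': $!";

my $file = File::Spec->catfile($sub, 'data.txt');
open my $out, '>', $file or die "could not write '$file': $!";
print {$out} "ASP\t15.9789\t8.873\n";
close $out or die "could not close '$file': $!";

my @found = files_in($sub);
print "$_\n" for @found;
```

The error messages name exactly the path that was passed to opendir or open, which is the point made above about misleading diagnostics.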
 
Brian McCauley

Tad said:
I wouldn't go to assuming "on purpose" on a first offence either.

OK, I'm satisfied that your criticism of the OP was indeed justified.
However, we should avoid giving newcomers the impression that the
Perl community is an unfriendly and unforgiving place.

Could I humbly suggest it would have been better to have said...

This has been explained to you before. Are you being rude on purpose?
It sure looks like it.
 
Ross

Thanks Brian and sorry Tad. Still i can't quite get what you are talking
about. Besides the typo, it seems there are some etiquettes of posting here,
where should i find them? I once encountered this situation before in
another newsgroup, is that every newsgroup having their rules so a newcomer
had better check them up first? if so, where are they?
 
Ross

Tad McClellan said:
You should not use chop() to remove newlines.

You should use chomp() to remove newlines.

Thanks for letting me know there is a better function (better in the sense
that it does what I want).
 
Scott Bryce

Ross said:
it seems there are some etiquettes of posting here, where should i
find them?

I haven't been following this thread, so I don't know if this has been
explained to you.

There are general rules for posting to newsgroups. Some newsgroups are
more lax than others about the rules. In some newsgroups, particularly
the busier technical newsgroups like this one, you will be expected to
play by the rules. It makes it easier for the people here to help you.

Also, many newsgroups, this one included, have posting guidelines that
you will be expected to follow. Tad posts them to this group about twice
a week. You can also find them here:

http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html

This newsgroup gets a lot of traffic. There are some very knowledgeable
people here who devote a lot of time to helping others. People who don't
follow the guidelines take up more of other people's time, and will
eventually get less and less help.

Ross said:
I once encountered this situation before in another newsgroup, is
that every newsgroup having their rules so a newcomer had better
check them up first? if so, where are they?

Every newsgroup is different. It is a good idea to read a couple of
weeks' worth of posts before you post to a newsgroup for the first time.
That will help you get a feel for what is expected. As you are reading,
look for a link to posting guidelines or a FAQ.
 
Ross

Is there any built-in function/parameters in Perl not to take . and .. into
account when opening all the subdirectories?

When i run the code:

the error appears:

unable to open dir named /home/sunlab/AAA/Reb/rawdat/4601-4.SMP No such file
or directory at <the absolute path for this perl>/SMP2XLSAVG2.pl line 42

<the absolute path for this perl> is replaced by me.

indeed when ls -al rawdat

drwxr-xr-x 2 sunlab 4096 Aug 21 12:25 4601-4.SMP/


I've tried both the with and without slash at the end versions.
 
Ross

Scott Bryce said:
I haven't been following this thread, so I don't know if this has been
explained to you.

There are general rules for posting to newsgroups. Some newsgroups are
more lax than others about the rules. In some newsgroups, particularly
the busier technical newsgroups like this one, you will be expected to
play by the rules. It makes it easier for the people here to help you.

Also, many newsgroups, this one included, have posting guidelines that
you will be expected to follow. Tad posts them to this group about twice
a week. You can also find them here:

http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html

This newsgroup gets a lot of traffic. There are some very knowledgeable
people here who devote a lot of time to helping others. People who don't
follow the guidelines take up more of other people's time, and will
eventually get less and less help.


Every newsgroup is different. It is a good idea to read a couple of weeks'
worth of posts before you post to a newsgroup for the first time. That
will help you get a feel for what is expected. As you are reading, look
for a link to posting guidelines or a FAQ.

oh, i traced past messages and find out a term "attribution", now i
understand what it means and thanks for directing me to the link
 
Tad McClellan

Ross said:
Is there any built-in function/parameters in Perl not to take . and .. into
account when opening all the subdirectories?


No, but there _is_ a way to avoid processing them. :)

while ( my $item = readdir DIR ) {

next if $item eq '.' or $item eq '..';
# next if $item =~ /^\./; # skip ALL items that start with dot

# process non-dot files here
}
 
Anno Siegel

Brian McCauley said:
Tad said:
Ross said:
chdir ($ARGV[0]);


You should probably ensure that it is an existing directory
before you try to open it:

die "'$inputdirname' is not a directory" unless -d $inputdirname;

I disagree. It's better to just try the chdir() and die if it fails.
There are numerous reasons why this is better, to do with race conditions
and permissions.

Apart from that, $inputdirname is coming directly out of a readdir().
The possibility that chdir($inputdirname) fails because $inputdirname
isn't a directory is remote.

Anno
 
Anno Siegel

Tad McClellan said:
No, but there _is_ a way to avoid processing them. :)

while ( my $item = readdir DIR ) {

next if $item eq '.' or $item eq '..';
# next if $item =~ /^\./; # skip ALL items that start with dot

# process non-dot files here
}

File::Spec can even do that portably (untested):

use File::Spec::Functions qw( no_upwards );
for my $item ( no_upwards( readdir DIR ) ) {
# no "." and ".." here
}
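
A runnable sketch of the same idea, using the class-method form File::Spec->no_upwards, which needs no import. The directory names are borrowed from the listing at the top of the thread and created in a throwaway temporary directory purely for the demo.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Throwaway directory holding two sample subdirectories.
my $top = tempdir(CLEANUP => 1);
for my $name (qw(4601-4.SMP 4601-4B.SMP)) {
    my $path = File::Spec->catdir($top, $name);
    mkdir $path or die "could not create '$path': $!";
}

opendir my $dh, $top or die "unable to open dir '$top': $!";
# no_upwards drops the current-directory and parent-directory entries
# and passes everything else through unchanged.
my @entries = File::Spec->no_upwards(readdir($dh));
closedir $dh;

@entries = sort @entries;
print "$_\n" for @entries;
```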

Anno
 
