Problem with split function

Sherm Pendley

Prasanna said:
I am writing code in which I have to search using a pattern like
[^V]VV[^V] in a sequence and then I have to see how many times this
pattern occurs...

Use the FAQ, Luke. From "perldoc -q count":

How can I count the number of occurrences of a substring within a string?
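
A minimal sketch of how that FAQ answer applies to the pattern in question. The sample sequence and variable names are made up for illustration; real input would come from the poster's file.

```perl
use strict;
use warnings;

# Hypothetical sample sequence, not taken from the original post.
my $sequence = 'AVVGAVVVAVVA';
my $pattern  = qr/[^V]VV[^V]/;

# Idiom 1: bump a counter once per successful match.
my $count = 0;
$count++ while $sequence =~ /$pattern/g;

# Idiom 2: force list context and count the matches returned.
my $count2 = () = $sequence =~ /$pattern/g;

print "loop: $count, list: $count2\n";
```

Both idioms count matches directly, so there is no need to split the string and count the pieces.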

sherm--
 
Prasanna

Hi

Just about started learning Perl... I am stuck with a problem

I am writing code in which I have to search using a pattern like
[^V]VV[^V] in a sequence and then I have to see how many times this
pattern occurs...

I thought I'd use split for the purpose

So I used it like split /$pattern/, $sequence

It gives me a split loop error...

I then tried initializing $pattern='HAHHAJHYRWUYRI' etc
then ran the program , this runs clean...

Actually I am generating pattern by concatenating all the lines of
file...

I am enclosing the entire code...
#! /usr/bin/perl

use warnings;
$proteinseq="arah1";
init_array();
check_sequence() ;

sub init_array
{
my ($count) =0;

open (PROTEINSEQ,"$proteinseq") || die ("Can't open the proteinseq
:$! ");
while ( defined ( $proteinseq[$count] = <PROTEINSEQ> ) )
{
chomp( $proteinseq[$count]);
$proteinseq[$count] =~ s/ //g;
$count++;
}
}

sub check_sequence
{
@aa_list = ('V');
$count =6;
$pat ='';

foreach $i ( @proteinseq)
{
if (defined $i ) { $pat= $pat . $i; }
}

#$pat =~ s/ //g;
$pat =~ s/\n$//g;
print "\n Protein : $proteinseq ";
print "\n$pat\n";

print "\n Amino Acid TOT II III IV V VI ";
print "\n===========================================================";

foreach $aaname ( @aa_list)
{
print "\n $aaname \t\t";
$match_count =0;

for ($j =1; $j <= $count; $j++)
{
if ( $j != 1 ) {
$check = '[^' . $aaname . ']';
$seq = $check;
if ( $j != 1 ) {
$check = '[^' . $aaname . ']';
$seq = $check;
}
$seq = $seq . ($aaname x $j) ;
if ( $j != 1 ) { $seq = $seq . '[^' . $aaname . ']' ; }

if ( defined $seq )
{
print "\n seq is $seq\n";

@junk = split /$seq/, $pat;
$match_count =0;
foreach $y (@junk) { $match_count++; }

if ( $match_count > 0 ) { $match_count--; }
print "$match_count \t";
$seq = $check;
}
}

print "\n";
}
}

SOS anybody...

thanks for your time
Prasanna
 
A. Sinan Unur

Just about started learning Perl... I am stuck with a problem

This, then, is the right time to read the posting guidelines for this
group.
I am writing code in which I have to search using a pattern like

Please be precise.
[^V]VV[^V] in a sequence

That can mean a number of things.
and then I have to see how many times this
pattern occurs...

I thought I'd use split for the purpose

So I used it like split /$pattern/, $sequence

It gives me a split loop error...

Always copy and paste exact error messages. I have no idea what a 'split
loop error' is.
I then tried initializing $pattern='HAHHAJHYRWUYRI' etc
then ran the program , this runs clean...

Actually I am generating pattern by concatenating all the lines of
file...

I am sure you meant something other than what you said here.
I am enclosing the entire code...

Please avoid that as much as possible. Post the *smallest* program that
still runs and exhibits the problem with which you are seeking help.
Again, the posting guidelines show you how to do that.

Incidentally, trying to run your code results in:

D:\Home> prasanna.pl
Missing right curly or square bracket at D:\Home\prasanna.pl line 75, at
end of line
syntax error at D:\Home\prasanna.pl line 75, at EOF
Execution of D:\Home\prasanna.pl aborted due to compilation errors.

Please post code that runs.
#! /usr/bin/perl

use warnings;

use strict;

missing.
$proteinseq="arah1";

You use this string as a filename later on.
init_array();

What array are you initializing here? Looking down, I see that you are
magically creating a global @proteinseq array in init_array. That makes
program flow very hard to figure out, especially when you come back to
look at it in a few months.
check_sequence() ;

sub init_array
{
my ($count) =0;

You don't use count for anything. What's the point?
open (PROTEINSEQ,"$proteinseq") || die ("Can't open the proteinseq
:$! ");

You don't need to quote the name of the file in the open call; see

perldoc -q always

while ( defined ( $proteinseq[$count] = <PROTEINSEQ> ) )
{
chomp( $proteinseq[$count]);
$proteinseq[$count] =~ s/ //g;
$count++;
}
}

It seems like the purpose of this routine is to remove all spaces and
newline characters from the input file. I will assume that is indeed the
objective. If you are doing that, I do not see the point of returning
the contents of the file as an array of lines. See the code after my
comments for an alternative.
sub check_sequence
{
@aa_list = ('V');
$count =6;
$pat ='';

I cannot figure out what this sub is supposed to do. Based on your
description, I am going to assume that you want to count how many times
the literal string '[^V]VV[^V]' occurs in the input file.

#! /usr/bin/perl

use strict;
use warnings;

my $seq = '[^V]VV[^V]';
my $contents;

while(<DATA>) {
chomp;
s/ //g;
$contents .= $_;
}

my $count = 0;
++$count while $contents =~ /\Q$seq\E/g;

print "$seq occurs $count times\n";

__DATA__
[^V]VV[^V] [^V] V V
[^V] xcf [^V] V
V
[^V] t
a
d
f
ggertg
[^V] V V
[^V]

D:\Home> prasanna.pl
[^V]VV[^V] occurs 4 times
 
Prasanna

Thanks... perldoc did the trick... I didn't know that something like
perldoc existed.. :)
 
Sherm Pendley

Prasanna said:
Thanks... perldoc did the trick... I didn't know that something like
perldoc existed.. :)

This might be a good time to point out the Posting Guidelines that are
posted here every so often. They aren't just a "miss manners" guide,
although they do talk a bit about etiquette. They also have a number of
helpful tips, links and other stuff you might have been missing.

sherm--
 
Big and Blue

A. Sinan Unur said:
Prasanna said:
[^V]VV[^V] in a sequence

That can mean a number of things.

Although, in the context of protein sequences, you really can assume
this is a regex to find all non-overlapping instances of "VV".
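
Read as a regex, the pattern has one subtlety worth a sketch: [^V] consumes the neighbouring residue, so two VV runs separated by a single residue share it and only the first is counted. The sequence below is made up, and the lookaround form is one possible alternative, not something proposed in the thread.

```perl
use strict;
use warnings;

# Made-up sequence: the first two VV runs are separated by a single residue.
my $sequence = 'AVVAVVAVVVG';

# [^V] consumes the neighbouring residue, so the second VV is skipped.
my $consuming = () = $sequence =~ /[^V]VV[^V]/g;

# Lookarounds check the neighbours without consuming them.
my $lookaround = () = $sequence =~ /(?<!V)VV(?!V)/g;

# Tally every run of V by its length in a single pass.
my %runs;
$runs{ length $1 }++ while $sequence =~ /(V+)/g;

print "consuming: $consuming, lookaround: $lookaround\n";
print "run length $_: $runs{$_}\n" for sort { $a <=> $b } keys %runs;
```

Note that the lookaround form also matches a VV run at the very start or end of the string, which [^V]VV[^V] does not, so the two are not interchangeable in every case.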

And, given that the code assumes there is only one sequence per file, this:

while(<DATA>) {
chomp;
s/ //g;
$contents .= $_;
}

can be replaced with:

($contents = join("", <DATA>)) =~ s/\s+//g;

which is more obviously(?) "everything with the whitespace removed".
 
A. Sinan Unur

A. Sinan Unur said:
Prasanna said:
[^V]VV[^V] in a sequence

That can mean a number of things.

Although, in the context of protein sequences, you really can
assume this is a regex to find all non-overlapping instances of
"VV".

Well, I do not know anything about protein sequences.
And, given that the code assumes there is only one sequence per
file, this:

while(<DATA>) {
chomp;
s/ //g;
$contents .= $_;
}

can be replaced with:

($contents = join("", <DATA>)) =~ s/\s+//g;

which is more obviously(?) "everything with the whitespace removed".

I had assumed the former would be more efficient than first reading the
whole file in (although $contents eventually does hold something close to
the full file).

Well, it turns out assumptions are dangerous, as my informal testing on
Windows and FreeBSD indicates the join version to be faster.

On the other hand, when using an 80 MB input file, my version ran to
completion on FreeBSD (although it did take about 38 seconds), whereas the
join version was killed by the OS.

For reference, here are the scripts I used:

D:\Home\asu1\UseNet\clpmisc> cat t1.pl
#! /usr/bin/perl

use strict;
use warnings;

open my $f, '<', 'proteinseq.data'
or die $!;

my $contents;
($contents = join("", <$f>)) =~ s/\s+//g;

__END__

versus

D:\Home\asu1\UseNet\clpmisc> cat t2.pl
#! /usr/bin/perl

use strict;
use warnings;

open my $f, '<', 'proteinseq.data'
or die $!;

my $contents;

while(<$f>) {
chomp;
s/ //g;
$contents .= $_;
}

__END__

I generated input files that contained 100_000 and 1_000_000 copies of the
DATA section of the script I posted before.
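
For anyone wanting to repeat the timing comparison, here is a rough sketch using the core Benchmark module. It is not the method used above: it reads from an in-memory filehandle over made-up data rather than from a file on disk, so it says nothing about memory pressure or swapping, and the sub names are hypothetical.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up stand-in for the input file, held in memory.
my $raw = "AVV GAV\nVVA VVA\n" x 10_000;

# Read all lines at once, join them, then strip whitespace.
sub slurp_join {
    open my $fh, '<', \$raw or die "Cannot open in-memory handle: $!";
    ( my $contents = join '', <$fh> ) =~ s/\s+//g;
    return $contents;
}

# Read line by line, stripping and appending as we go.
sub line_loop {
    open my $fh, '<', \$raw or die "Cannot open in-memory handle: $!";
    my $contents = '';
    while ( my $line = <$fh> ) {
        chomp $line;
        $line =~ s/ //g;
        $contents .= $line;
    }
    return $contents;
}

# Run each sub for at least one CPU second and print a comparison table.
cmpthese( -1, { join => \&slurp_join, loop => \&line_loop } );
```

Results will vary with the machine, the perl version and the input size, so treat the table as indicative only.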
 
Fabian Pilkowski

* A. Sinan Unur said:
I had assumed the former would be more efficient than first reading the
whole file in (although $contents eventually does hold something close to
the full file).

I had thought about such an optimization too. But what is Perl doing
here? I don't know for sure, but I assume something like: read in the
whole file and concatenate its lines afterwards. If Perl is doing so,
you have to hold the whole file in memory twice (as the argument list to
join() and as its result). Then Sinan's loop is more efficient here,
especially for larger files (and AFAIK protein sequences are mostly large).

Nevertheless we could optimize it since we don't need to read the file
line-by-line. Also, y///d should be faster than s///g for deleting.

{
local $/ = \8192;
y/ \n//d, $contents .= $_ while <DATA>;
}

And for relatively small files (<8kB) its behavior should be similar to
reading in slurp-mode (neither join nor concat is needed).
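
A self-contained sketch of the same fixed-size-record idea, for experimenting without a data file. The in-memory filehandle, the made-up input and the tiny 8-byte record size are assumptions chosen so the chunking is visible on a short string.

```perl
use strict;
use warnings;

# Stand-in for the input file: an in-memory filehandle over a made-up string.
my $raw = "AVV GAV\nVVA VVA\n" x 3;
open my $fh, '<', \$raw or die "Cannot open in-memory handle: $!";

my $contents = '';
{
    # A reference to an integer makes readline return fixed-size
    # records (here up to 8 bytes) instead of lines.
    local $/ = \8;
    while ( my $chunk = <$fh> ) {
        $chunk =~ tr/ \n//d;    # delete spaces and newlines in place
        $contents .= $chunk;
    }
}
close $fh;

print length($contents), " residues: $contents\n";
```

Because whitespace is deleted per chunk, it does not matter where the record boundaries fall relative to the line breaks.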
Well, it turns out assumptions are dangerous, as my informal testing on
Windows and FreeBSD indicates the join version to be faster.

On the other hand, when using an 80 MB input file, my version ran to
completion on FreeBSD (although it did take about 38 seconds), whereas the
join version was killed by the OS.

Perhaps you want to benchmark this again ;-)

regards,
fabian
 
A. Sinan Unur

* A. Sinan Unur said:

Perhaps you want to benchmark this again ;-)

Hmmm ... I did:

From top:

load averages: 0.01, 0.01, 0.00 up 113+08:07:57 20:45:56
36 processes: 1 running, 35 sleeping
CPU states: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
Mem: 5216K Active, 55M Inact, 77M Wired, 20K Cache, 35M Buf, 110M Free
Swap: 256M Total, 65M Used, 191M Free, 25% Inuse

asu1@recex:~/ttt > perl -v

This is perl, v5.8.6 built for i386-freebsd-64int

asu1@recex:~/ttt > cat t1.pl
#! /usr/bin/perl

use strict;
use warnings;

open my $f, '<', 'proteinseq.data'
or die $!;

my $contents;
($contents = join("", <$f>)) =~ s/\s+//g;

__END__

asu1@recex:~/ttt > time perl t1.pl
Killed

real 0m16.495s
user 0m6.148s
sys 0m2.171s

It reaches 149 MB memory consumption, then is killed. It is probably
because I am running with 256 MB physical memory + 256 MB swap.

Sinan
 
Fabian Pilkowski

Ok, now I have done this on my WindowsXP box with ActivePerl 5.8.6
installed. Your version takes 19 minutes and 32 seconds, wow! My system
is swapping around all the time -- not very efficient.

Btw, my version is not more efficient here. I think due to swapping the
advantage is not noticeable, at least on my system.
asu1@recex:~/ttt > time perl t1.pl
Killed

real 0m16.495s
user 0m6.148s
sys 0m2.171s

It reaches 149 MB memory consumption, then is killed. It is probably
because I am running with 256 MB physical memory + 256 MB swap.

I am running with 512 MB of each. But that seems to be not enough for
this task -- after around 12 min it uses 312 MB physical and 450 MB
swap. Oddly enough it has read in the whole 80 MB of the input file in
this time but cannot complete (it swaps around over and over until it is
killed by the OS).

thanks for that,
fabian
 
Anno Siegel

[...]
can be replaced with:

($contents = join("", <DATA>)) =~ s/\s+//g;

which is more obviously(?) "everything with the whitespace removed".

If speed matters,

($contents = join("", <DATA>)) =~ tr/ \r\n\t//d; # or similar

The difference is significant for long strings.

Anno
 
John W. Krahn

Anno said:
[...]
can be replaced with:

($contents = join("", <DATA>)) =~ s/\s+//g;

which is more obviously(?) "everything with the whitespace removed".

If speed matters,

($contents = join("", <DATA>)) =~ tr/ \r\n\t//d; # or similar

The difference is significant for long strings.

The \s character class also includes "\f" so using tr/ \r\n\t\f//d would be
the equivalent of using s/\s+//g. :)
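
A quick way to check that equivalence on a sample string; the input is made up and contains each of the five characters listed. One caveat: on perl 5.18 and later \s also matches the vertical tab, so the two forms can differ if the data contains one.

```perl
use strict;
use warnings;

# Made-up input containing space, tab, carriage return, newline and form feed.
my $raw = "AV V\tGA\r\nVV\fA\n";

# Copy the string, then strip whitespace from the copy in each style.
( my $by_regex = $raw ) =~ s/\s+//g;
( my $by_tr    = $raw ) =~ tr/ \r\n\t\f//d;

print $by_regex eq $by_tr ? "same result: $by_regex\n" : "results differ\n";
```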


John
 
Anno Siegel

John W. Krahn said:
Anno said:
[...]
can be replaced with:

($contents = join("", <DATA>)) =~ s/\s+//g;

which is more obviously(?) "everything with the whitespace removed".

If speed matters,

($contents = join("", <DATA>)) =~ tr/ \r\n\t//d; # or similar

The difference is significant for long strings.

The \s character class also includes "\f" so using tr/ \r\n\t\f//d would be
the equivalent of using s/\s+//g. :)

Thanks. I was too lazy to look it up, hence my hedging comment.

Anno
 
John W. Krahn

Anno said:
John W. Krahn said:
Anno said:
[...]

can be replaced with:

($contents = join("", <DATA>)) =~ s/\s+//g;

which is more obviously(?) "everything with the whitespace removed".

If speed matters,

($contents = join("", <DATA>)) =~ tr/ \r\n\t//d; # or similar

The difference is significant for long strings.

The \s character class also includes "\f" so using tr/ \r\n\t\f//d would be
the equivalent of using s/\s+//g. :)

Thanks. I was too lazy to look it up, hence my hedging comment.

You're welcome. Now who says we can't be civil here? ;>)


John
 
