Problem with split function

Sherm Pendley · Jun 17, 2005

Prasanna said:
I am writing code in which I have to search using a pattern like
[^V]VV[^V] in a sequence and then I have to see how many times this
pattern occurs...

Use the FAQ, Luke. From "perldoc -q count":

How can I count the number of occurrences of a substring within a string?
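
For illustration, the counting idiom that FAQ entry points to might be sketched as below. This is a sketch under assumptions, not the FAQ text itself: the helper name count_matches and the sample sequence are made up here. The lookaround variant is one possible way to count runs of exactly two V's, since the plain pattern consumes the flanking non-V characters and so cannot count two runs that share one.

```perl
use strict;
use warnings;

# Count non-overlapping matches of a regex in a string.
# Assigning the list of matches to an empty list in scalar
# context yields the number of matches.
sub count_matches {
    my ($string, $regex) = @_;
    my $count = () = $string =~ /$regex/g;
    return $count;
}

my $sequence = 'AVVAVVA';    # made-up sample, not real protein data

# The pattern from the question: the middle 'A' is consumed by the
# first match, so the second run of VV is not counted.
print count_matches($sequence, qr/[^V]VV[^V]/), "\n";       # prints 1

# Lookarounds check the neighbours without consuming them.
print count_matches($sequence, qr/(?<!V)VV(?!V)/), "\n";    # prints 2
```

Note that the lookaround form also matches a VV run at the very start or end of the string, which the original pattern does not.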

sherm--

Prasanna · Jun 17, 2005

Hi

Just about started learning Perl... am stuck with a problem

I am writing code in which I have to search using a pattern like
[^V]VV[^V] in a sequence and then I have to see how many times this
pattern occurs...

I thought I'd use split for the purpose

So I used it like split /$pattern/, $sequence

It gives me a split loop error...

I then tried initializing $pattern='HAHHAJHYRWUYRI' etc
then ran the program , this runs clean...

Actually I am generating pattern by concatenating all the lines of
file...

I am enclosing the entire code...
#! /usr/bin/perl

use warnings;
$proteinseq="arah1";
init_array();
check_sequence() ;

sub init_array
{
my ($count) =0;

open (PROTEINSEQ,"$proteinseq") || die ("Can't open the proteinseq
:$! ");
while ( defined ( $proteinseq[$count] = <PROTEINSEQ> ) )
{
chomp( $proteinseq[$count]);
$proteinseq[$count] =~ s/ //g;
$count++;
}
}

sub check_sequence
{
@aa_list = ('V');
$count =6;
$pat ='';

foreach $i ( @proteinseq)
{
if (defined $i ) { $pat= $pat . $i; }
}

#$pat =~ s/ //g;
$pat =~ s/\n$//g;
print "\n Protein : $proteinseq ";
print "\n$pat\n";

print "\n Amino Acid TOT II III IV V VI ";
print "\n===========================================================";

foreach $aaname ( @aa_list)
{
print "\n $aaname \t\t";
$match_count =0;

for ($j =1; $j <= $count; $j++)
{
if ( $j != 1 ) {
$check = '[^' . $aaname . ']';
$seq = $check;
if ( $j != 1 ) {
$check = '[^' . $aaname . ']';
$seq = $check;
}
$seq = $seq . ($aaname x $j) ;
if ( $j != 1 ) { $seq = $seq . '[^' . $aaname . ']' ; }

if ( defined $seq )
{
print "\n seq is $seq\n";

@junk = split /$seq/, $pat;
$match_count =0;
foreach $y (@junk) { $match_count++; }

if ( $match_count > 0 ) { $match_count--; }
print "$match_count \t";
$seq = $check;
}
}

print "\n";
}
}

SOS anybody...

thanks for your time
Prasanna

A. Sinan Unur · Jun 17, 2005

Just about started learning Perl... am stuck with a problem

This, then, is the right time to read the posting guidelines for this
group.

I am writing code in which I have to search using a pattern like

Please be precise.

[^V]VV[^V] in a sequence

That can mean a number of things.

and then I have to see how many times this
pattern occurs...

I thought I'd use split for the purpose

So I used it like split /$pattern/, $sequence

It gives me a split loop error...

Always copy and paste exact error messages. I have no idea what a 'split
loop error' is.

I then tried initializing $pattern='HAHHAJHYRWUYRI' etc
then ran the program , this runs clean...

Actually I am generating pattern by concatenating all the lines of
file...

I am sure you meant something other than what you said here.

I am enclosing the entire code...

Please avoid that as much as possible. Post the *smallest* program that
still runs and exhibits the problem with which you are seeking help.
Again, the posting guidelines show you how to do that.

Incidentally, trying to run your code results in:

D:\Home> prasanna.pl
Missing right curly or square bracket at D:\Home\prasanna.pl line 75, at
end of line
syntax error at D:\Home\prasanna.pl line 75, at EOF
Execution of D:\Home\prasanna.pl aborted due to compilation errors.

Please post code that runs.

#! /usr/bin/perl

use warnings;

use strict;

missing.

$proteinseq="arah1";

You use this string as a filename later on.

init_array();

What array are you initializing here? Looking down, I see that you are
magically creating a global @proteinseq array in init_array. That makes
program flow very hard to figure out, especially when you come back to
look at it in a few months.

check_sequence() ;

sub init_array
{
my ($count) =0;

You don't use count for anything. What's the point?

open (PROTEINSEQ,"$proteinseq") || die ("Can't open the proteinseq
:$! ");

You don't need to quote the name of the file in the open call; see

perldoc -q always
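
As a rough sketch of what that advice can look like in practice, assuming a three-argument open with a lexical filehandle (the sub name read_lines is hypothetical and not part of the posted program):

```perl
use strict;
use warnings;

# Read all lines of a file and return them without trailing newlines.
# The filename variable is passed to open() as-is; wrapping it in
# double quotes would add nothing.
sub read_lines {
    my ($filename) = @_;
    open my $fh, '<', $filename
        or die "Can't open '$filename': $!";
    chomp(my @lines = <$fh>);
    close $fh;
    return @lines;
}
```

With that in place, something like my @proteinseq = read_lines('arah1'); would make it visible at the call site where the array comes from.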

while ( defined ( $proteinseq[$count] = <PROTEINSEQ> ) )
{
chomp( $proteinseq[$count]);
$proteinseq[$count] =~ s/ //g;
$count++;
}
}

It seems like the purpose of this routine is to remove all spaces and
newline characters from the input file. I will assume that is indeed the
objective. If you are doing that, I do not see the point of returning
the contents of the file as an array of lines. See the code after my
comments for an alternative.

sub check_sequence
{
@aa_list = ('V');
$count =6;
$pat ='';

I cannot figure out what this sub is supposed to do. Based on your
description, I am going to assume that you want to count how many times
the literal string '[^V]VV[^V]' occurs in the input file.

#! /usr/bin/perl

use strict;
use warnings;

my $seq = '[^V]VV[^V]';
my $contents;

while (<DATA>) {
    chomp;
    s/ //g;
    $contents .= $_;
}

my $count = 0;
++$count while $contents =~ /\Q$seq\E/g;

print "$seq occurs $count times\n";

__DATA__
[^V]VV[^V] [^V] V V
[^V] xcf [^V] V
V
[^V] t
a
d
f
ggertg
[^V] V V
[^V]

D:\Home> prasanna.pl
[^V]VV[^V] occurs 4 times

Prasanna · Jun 17, 2005

Thanks... perldoc did the trick... I didn't know that something like
perldoc existed.

Sherm Pendley · Jun 17, 2005

Prasanna said:
Thanks... perldoc did the trick... I didn't know that something like
perldoc existed.

This might be a good time to point out the Posting Guidelines that are
posted here every so often. They aren't just a "miss manners" guide,
although they do talk a bit about etiquette. They also have a number of
helpful tips, links and other stuff you might have been missing.

sherm--

Big and Blue · Jun 17, 2005

A. Sinan Unur said:
Prasanna said:

[^V]VV[^V] in a sequence

That can mean a number of things.

Although, in the context of protein sequences, you really can assume
this is a regex to find all non-overlapping instances of "VV".

And, given that the code assumes there is only one sequence per file, this:

while (<DATA>) {
    chomp;
    s/ //g;
    $contents .= $_;
}

can be replaced with:

($contents = join("", <DATA>)) =~ s/\s+//g;

which is more obviously(?) "everything with the whitespace removed".

A. Sinan Unur · Jun 17, 2005

Big and Blue said:
A. Sinan Unur said:

Prasanna said:

[^V]VV[^V] in a sequence

That can mean a number of things.

Although, in the context of protein sequences, you really can
assume this is a regex to find all non-overlapping instances of
"VV".

Well, I do not know anything about protein sequences.

And, given that the code assumes there is only one sequence per
file, this:

while (<DATA>) {
    chomp;
    s/ //g;
    $contents .= $_;
}

can be replaced with:

($contents = join("", <DATA>)) =~ s/\s+//g;

which is more obviously(?) "everything with the whitespace removed".

I had assumed the former would be more efficient than first reading the
whole file in (although $contents eventually does hold something close to
the full file).

Well, it turns out assumptions are dangerous, as my informal testing on
Windows and FreeBSD indicates the join version to be faster.

On the other hand, when using an 80 MB input file, my version ran to
completion on FreeBSD (although it did take about 38 seconds), whereas the
join version was killed by the OS.

For reference, here are the scripts I used:

D:\Home\asu1\UseNet\clpmisc> cat t1.pl
#! /usr/bin/perl

use strict;
use warnings;

open my $f, '<', 'proteinseq.data'
    or die $!;

my $contents;
($contents = join("", <$f>)) =~ s/\s+//g;

__END__

versus

D:\Home\asu1\UseNet\clpmisc> cat t2.pl
#! /usr/bin/perl

use strict;
use warnings;

open my $f, '<', 'proteinseq.data'
    or die $!;

my $contents;

while (<$f>) {
    chomp;
    s/ //g;
    $contents .= $_;
}

__END__

I generated input files that contained 100_000 and 1_000_000 copies of the
DATA section of the script I posted before.
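
A timing comparison along these lines could also be done inside one script with the core Benchmark module. The sketch below is not what was run above: it reads from an in-memory filehandle instead of proteinseq.data, and the sub names and sample data are made up, so whatever it prints says nothing about memory behaviour on large files.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample input; real sequence files would be much larger.
my $data = "[^V] V V\nggertg\n" x 10_000;

# Read everything at once, then strip all whitespace.
sub strip_join {
    my ($text) = @_;
    open my $fh, '<', \$text or die $!;    # in-memory filehandle
    my $contents;
    ($contents = join('', <$fh>)) =~ s/\s+//g;
    return $contents;
}

# Read line by line, stripping newlines and spaces as we go.
sub strip_loop {
    my ($text) = @_;
    open my $fh, '<', \$text or die $!;
    my $contents = '';
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/ //g;
        $contents .= $line;
    }
    return $contents;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    join => sub { strip_join($data) },
    loop => sub { strip_loop($data) },
});
```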

Fabian Pilkowski · Jun 18, 2005

* A. Sinan Unur said:
I had assumed the former would be more efficient than first reading the
whole file in (although $contents eventually does hold something close to
the full file).

I had thought about such an optimization too. But what is Perl doing
here? I don't know for sure but I assume something like: read in the
whole file and concatenate its lines afterwards. If Perl is doing so,
you've to hold the whole file twice in memory (as param to join() and
as its result). Then Sinan's loop is more efficient here, especially
for larger files (and AFAIK protein sequences are mostly large).

Nevertheless we could optimize it since we don't need to read the file
line-by-line. Also, y///d should be faster than s///g for deleting.

{
    local $/ = \8192;
    y/ \n//d, $contents .= $_ while <DATA>;
}

And for relatively small files (<8kB) its behavior should be similar to
reading in slurp-mode (neither join nor concat is needed).
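
Wrapped in a sub so it can be tried on its own, the fixed-size-record idea might look like the sketch below. The sub name strip_chunked, the small chunk size, and the in-memory test handle are assumptions made for the example; tr is the same operator as y.

```perl
use strict;
use warnings;

# Read a filehandle in fixed-size records and delete spaces and
# newlines from each record before appending it to the result.
sub strip_chunked {
    my ($fh, $chunk_size) = @_;
    my $size = int $chunk_size;
    local $/ = \$size;          # reference to an integer: record reads
    my $contents = '';
    while (my $chunk = <$fh>) {
        $chunk =~ tr/ \n//d;
        $contents .= $chunk;
    }
    return $contents;
}

my $data = "MK VV\nLA V\n";     # made-up sample input
open my $fh, '<', \$data or die $!;
print strip_chunked($fh, 4), "\n";    # prints MKVVLAV
```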

Well, it turns out assumptions are dangerous, as my informal testing on
Windows and FreeBSD indicates the join version to be faster.

On the other hand, when using an 80 MB input file, my version ran to
completion on FreeBSD (although it did take about 38 seconds), whereas the
join version was killed by the OS.

Perhaps you want to benchmark this again ;-)

regards,
fabian

A. Sinan Unur · Jun 18, 2005

* A. Sinan Unur wrote:

Perhaps you want to benchmark this again ;-)

Hmmm ... I did:

From top:

load averages: 0.01, 0.01, 0.00 up 113+08:07:57 20:45:56
36 processes: 1 running, 35 sleeping
CPU states: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
Mem: 5216K Active, 55M Inact, 77M Wired, 20K Cache, 35M Buf, 110M Free
Swap: 256M Total, 65M Used, 191M Free, 25% Inuse

asu1@recex:~/ttt > perl -v

This is perl, v5.8.6 built for i386-freebsd-64int

asu1@recex:~/ttt > cat t1.pl
#! /usr/bin/perl

use strict;
use warnings;

open my $f, '<', 'proteinseq.data'
    or die $!;

my $contents;
($contents = join("", <$f>)) =~ s/\s+//g;

__END__

asu1@recex:~/ttt > time perl t1.pl
Killed

real 0m16.495s
user 0m6.148s
sys 0m2.171s

It reaches 149 MB memory consumption, then is killed. It is probably
because I am running with 256 MB physical memory + 256 MB swap.

Sinan

Fabian Pilkowski · Jun 18, 2005

Ok, now I have done this on my WindowsXP box with ActivePerl 5.8.6
installed. Your version takes 19 minutes and 32 seconds, wow! My system
is swapping around all the time -- not very efficient.

Btw, my version is not more efficient here. I think due to swapping the
advantage is not noticeable, at least on my system.

asu1@recex:~/ttt > time perl t1.pl
Killed

real 0m16.495s
user 0m6.148s
sys 0m2.171s

It reaches 149 MB memory consumption, then is killed. It is probably
because I am running with 256 MB physical memory + 256 MB swap.

I am running with 512 MB of each. But that seems to be not enough for
this task -- after around 12 min it uses 312 MB physical and 450 MB
swap. Oddly enough it has read in the whole 80 MB of the input file in
this time but cannot complete (it swaps around over and over until it is
killed by the OS).

thanks for that,
fabian

Anno Siegel · Jun 18, 2005

[...]

can be replaced with:

($contents = join("", <DATA>)) =~ s/\s+//g;

which is more obviously(?) "everything with the whitespace removed".

If speed matters,

($contents = join("", <DATA>)) =~ tr/ \r\n\t//d; # or similar

The difference is significant for long strings.

Anno

John W. Krahn · Jun 18, 2005

Anno said:
[...]

can be replaced with:

($contents = join("", <DATA>)) =~ s/\s+//g;

which is more obviously(?) "everything with the whitespace removed".

If speed matters,

($contents = join("", <DATA>)) =~ tr/ \r\n\t//d; # or similar

The difference is significant for long strings.

The \s character class also includes "\f" so using tr/ \r\n\t\f//d would be
the equivalent of using s/\s+//g.
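
A small check of that equivalence, as a sketch only: the sub names and sample string are made up, and the claim is limited to these five ASCII whitespace characters. On perl 5.18 and later \s also matches a vertical tab, so the two are not strictly identical there.

```perl
use strict;
use warnings;

# Remove whitespace with a regex substitution.
sub strip_regex {
    my ($s) = @_;
    $s =~ s/\s+//g;
    return $s;
}

# Remove the same five characters with transliteration.
sub strip_tr {
    my ($s) = @_;
    $s =~ tr/ \r\n\t\f//d;
    return $s;
}

my $sample = "MK V\tV\r\nL\fA\n";    # made-up input with mixed whitespace
print strip_regex($sample) eq strip_tr($sample) ? "same\n" : "different\n";    # prints same
```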

John

Anno Siegel · Jun 18, 2005

John W. Krahn said:
Anno said:

[...]

can be replaced with:

($contents = join("", <DATA>)) =~ s/\s+//g;

which is more obviously(?) "everything with the whitespace removed".

If speed matters,

($contents = join("", <DATA>)) =~ tr/ \r\n\t//d; # or similar

The difference is significant for long strings.

The \s character class also includes "\f" so using tr/ \r\n\t\f//d would be
the equivalent of using s/\s+//g.

Thanks. I was too lazy to look it up, hence my hedging comment.

Anno

John W. Krahn · Jun 19, 2005

Anno said:
John W. Krahn said:

Anno said:

[...]

can be replaced with:

($contents = join("", <DATA>)) =~ s/\s+//g;

which is more obviously(?) "everything with the whitespace removed".

If speed matters,

($contents = join("", <DATA>)) =~ tr/ \r\n\t//d; # or similar

The difference is significant for long strings.

The \s character class also includes "\f" so using tr/ \r\n\t\f//d would be
the equivalent of using s/\s+//g.

Thanks. I was too lazy to look it up, hence my hedging comment.

You're welcome. Now who says we can't be civil here? ;>)

John
