Bryan
Hi,
I have a large DNA sequence (about 200000 bases) in one file, and
another file with 23 smallish subsequences (about 100 bases each) that I
want to match against the large sequence.
The large sequence file is in FASTA format, and I read it in without a
problem.
The subsequence file is a table, which I read in using the Data::Table
module:
my $subseqs= Data::Table::fromTSV("subseq.txt");
Then I loop through the $subseqs table and do a pattern match of each
subsequence against my sequence like this:
if ($sequence =~ m/$subseq/g) {
# Match!
}
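One thing that may be worth ruling out with a loop like this is the /g modifier. In scalar context, a successful m//g leaves pos($sequence) set, and the next m//g on the same string starts searching from that position, so a subsequence that lies earlier in the sequence than the previous hit can fail to match. The sketch below uses made-up data (the short sequence and the two subsequences are assumptions, not the real files) to compare the three approaches:

```perl
use strict;
use warnings;

# Made-up data for illustration: "CCCC" lies before "GGGG" in the sequence.
my $sequence = "AAAACCCCGGGGTTTT";
my @subseqs  = ("GGGG", "CCCC");

my (@with_g, @without_g, @with_index);

for my $subseq (@subseqs) {
    # Scalar-context /g: a successful match leaves pos($sequence) set,
    # so the next /g search on $sequence starts from that offset.
    push @with_g, $subseq if $sequence =~ m/$subseq/g;
}

for my $subseq (@subseqs) {
    # Without /g every search starts at the beginning of the string.
    push @without_g, $subseq if $sequence =~ m/$subseq/;
}

for my $subseq (@subseqs) {
    # index() does a plain substring search; no regex state is involved.
    push @with_index, $subseq if index($sequence, $subseq) >= 0;
}

printf "with /g:    %d of %d\n", scalar(@with_g),     scalar(@subseqs);
printf "without /g: %d of %d\n", scalar(@without_g),  scalar(@subseqs);
printf "index():    %d of %d\n", scalar(@with_index), scalar(@subseqs);
```

With this toy data the /g loop reports 1 of 2, while the other two report 2 of 2. If the real data behaves the same way, dropping /g or switching to index() would be the simpler test for "does this subsequence occur anywhere".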
Here's where the problem starts... Only 18 out of 23 subsequences match. But ALL
of them match if I search for the subsequence in any text editor (like vi)!
I have verified that everything is uppercased, and double-checked that
all 23 subsequences are indeed present in the main sequence. A lot of
testing and debugging shows that if I copy and paste any sequence or
subsequence, the matches are okay. So I'm thinking there are some hidden
characters messing things up that I missed.
But in vi, I use :set list to show all control characters. Nothing
unusual is there.
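A minimal sketch of how the hidden-character theory could be checked from inside Perl rather than in an editor: scan each string for anything that is not an expected base letter and print its offset and character code. The helper name report_unexpected, the allowed alphabet ACGTN, and the sample string are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: list every character that is not A, C, G, T or N,
# together with its zero-based offset and its character code.
sub report_unexpected {
    my ($label, $string) = @_;
    my @found;
    while ($string =~ /([^ACGTN])/g) {
        # pos() points just past the single character that matched.
        push @found, sprintf("%s: offset %d, code 0x%02X",
                             $label, pos($string) - 1, ord($1));
    }
    return @found;
}

# Made-up sample containing a carriage return and a trailing space.
my @problems = report_unexpected("subseq", "ACGT\rACGT ");
print "$_\n" for @problems;
```

For the sample string this prints one line for the carriage return (0x0D) at offset 4 and one for the space (0x20) at offset 9. Running the same check over $sequence and over each value taken from the table should show whether anything unexpected survived the read.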
I read in the sequence file like this:
my $seq;
open (INFILE, "< $ARGV[0]") or die "Cannot open $ARGV[0] for read\n\n";
my @data = <INFILE>;
close INFILE;
foreach my $line (@data) {
    # Strip off newlines
    chomp $line;
    # do some checks for other lines
    $seq .= uc($line);
}
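For comparison, here is a sketch of a slightly more defensive read. chomp removes only the trailing newline (the value of $/), so a carriage return from a file with Windows line endings, or a stray trailing space, would stay inside the concatenated sequence. The helper name read_fasta_sequence is hypothetical, the sketch assumes a single-record FASTA file, and it reads from an in-memory string so that it runs on its own:

```perl
use strict;
use warnings;

# Hypothetical helper: concatenate the sequence lines of a single-record
# FASTA file read from an already opened filehandle.
sub read_fasta_sequence {
    my ($fh) = @_;
    my $seq = '';
    while (my $line = <$fh>) {
        next if $line =~ /^>/;   # skip the FASTA header line
        $line =~ s/\s+//g;       # drop \n, \r, spaces and tabs
        $seq .= uc $line;
    }
    return $seq;
}

# Made-up record with CRLF line endings and a trailing space.
my $fasta = ">demo record\r\nacgt\r\nTTGA \r\n";
open(my $fh, '<', \$fasta) or die "Cannot open in-memory FASTA: $!";
my $seq = read_fasta_sequence($fh);
close $fh;

print "$seq\n";   # prints ACGTTTGA
```

With a real file, the in-memory open would be replaced by a three-argument open on $ARGV[0].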
Does anyone see anything wrong with this, or with my pattern match, that may
explain the unexplainable?
Thanks,
Bryan