Marc said:
I assume there is a particular reason why you are looking for sequence
features using regexp? Because it seems somewhat inefficient - which of
course got you here in the first place. There is no way that you can use
distance measures of sequences to cluster them, for example (like Blast
+ MCL)?
First, I cannot use any of the local alignment programs to do the
clustering, since my sequences are known to undergo a lot of
recombination. Secondly, I question the use of BLAST's BLOSUM matrices,
which have been derived from unbiased genomes. I am working with a
pathogen that is famous for the high A/T content of its genome.
Then why not use adjusted matrices? Yes, we can, but the high rate of
recombination in the sequences renders that approach unusable. MCL is
good for grouping proteins, but it also depends on pairwise local
alignments, and may be applicable when looking at unsupervised
clustering or grouping. Secondly, there is the issue of choosing a
relevant BLAST E-value cut-off as well as the inflation parameter to use
with MCL. (Actually, I have already tried that approach with different
I values, and we are analysing the results.)
Also, databases
like ProDom do basically just that - looking for particular sequence
features in protein sequences. Sure, they focus on domains, but
depending of the nature of your regexps, their tools may be applicable
regardless.
Their tools may be applicable, but we already know the motifs that we
are searching for, and they are not among the ProDom or Swiss-Prot
domains.
Then there is the new CS-BLAST, which uses scoring matrices
- which may perhaps be derivable from your regexps, dunno.
Again, BLAST, or any other alignment method, is not of much use for
highly recombining sequences. Instead, we are using alignment-free
approaches to infer relationships between the sequences.
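For readers unfamiliar with alignment-free comparison, here is a minimal Ruby sketch of one common flavour of it: comparing k-mer count profiles. It is only an illustration, not necessarily the method used in this work. The helper names kmer_profile and profile_distance, the choice of k = 3, and the Euclidean distance are all assumptions made for the example.

```ruby
# Count overlapping k-mers in a sequence.
# (Hypothetical helper; k = 3 is an arbitrary choice for illustration.)
def kmer_profile(seq, k = 3)
  counts = Hash.new(0)
  seq.upcase.each_char.each_cons(k) { |kmer| counts[kmer.join] += 1 }
  counts
end

# Euclidean distance between two k-mer count profiles.
# Missing k-mers count as 0 because the profiles default to 0.
def profile_distance(a, b)
  keys = a.keys | b.keys
  Math.sqrt(keys.sum { |key| (a[key] - b[key])**2 })
end

p1 = kmer_profile("ATATATGCAT")
p2 = kmer_profile("ATATTTGCAA")
puts profile_distance(p1, p2)
```

Because no alignment is computed, a recombination breakpoint only disturbs the few k-mers that span it, rather than the whole comparison.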
The exact patterns that we have are known (from our experimental
evidence) to be associated with pathogenicity of a particular type, and
that is why I am looking for exact matches. Otherwise I could just have
aligned the patterns and generated an HMM profile, which I could then
use to search for the groups in a more generic way. Actually, that is
the next step, but I have opted to try the simple approaches first.
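To make "exact matches" concrete, here is a minimal sketch of the simple approach in Ruby. The motif strings and sequence IDs are placeholders invented for the example (the real motifs are not given in this thread), and exact_motif_hits is a hypothetical helper name.

```ruby
# Report which motifs occur verbatim in which sequences.
# sequences: Hash of id => sequence string
# motifs:    Array of exact motif strings
def exact_motif_hits(sequences, motifs)
  sequences.each_with_object({}) do |(id, seq), hits|
    found = motifs.select { |motif| seq.include?(motif) }
    hits[id] = found unless found.empty?
  end
end

# Placeholder data, for illustration only.
sequences = { "seq1" => "MKDIYKKLLA", "seq2" => "MMMMGGGG" }
motifs    = %w[DIYKK LYADIG]

exact_motif_hits(sequences, motifs).each do |id, found|
  puts "#{id}: #{found.join(', ')}"
end
```

String#include? does a plain substring search, so no regexp engine (and hence no backtracking) is involved when the motifs are literal strings.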
ruby may really not be the ideal approach
True, Ruby may not be suited to some jobs and applications, but then
the same goes for C++, Java, PHP, etc. The most important thing, I
think, is the approach and not the choice of programming language.
Secondly, I don't like the idea of black boxes that I don't understand.
Please feel free to contact me at georgkam address hosted at Google's
email domain. We don't want to hijack this excellent mailing list, which
is meant for Ruby, with molecular biology discussions. I have re-posted
this thread to the BioRuby list and we can discuss it there.
For now, Robert's suggestion seems to have produced what I wanted: a
non-redundant motif that is free from backtracking.
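Robert's actual code is not reproduced in this message. As a rough sketch of the general idea, under my own assumptions, one way to get a non-redundant pattern that does not backtrack is to remove duplicate motifs, join them with Regexp.union (which escapes literal strings), and wrap the alternation in an atomic group. The helper name motif_regexp and the sample motifs are hypothetical.

```ruby
# Build one regexp from a list of literal motifs.
# - uniq removes redundant (duplicate) motifs
# - sorting longest-first makes the longest motif win at a given position
# - (?>...) is an atomic group: once an alternative has matched, the
#   engine will not backtrack into the group to try another one
def motif_regexp(motifs)
  unique = motifs.uniq.sort_by { |motif| -motif.length }
  Regexp.new("(?>#{Regexp.union(unique).source})")
end

# Placeholder motifs, for illustration only.
re = motif_regexp(%w[DIY DIYKK DIYKK LYA])
puts re.source
puts "MKDIYKKLLYA".scan(re).inspect
```

For purely literal motifs the atomic group mostly guards against wasted work; it matters more once the alternatives contain quantifiers.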
Thank you.