Perl bioinformatics


ccc31807

I'm not changing jobs, but I've been contacted about some contract
opportunities that (reportedly) are difficult but seem easy enough to
me, manipulating genome files to produce various kinds of reports,
graphs, etc. I have zero experience in this, so I'm just wondering ...

1. What are the career opportunities in bioinformatics using Perl?

2. Looking for books, I found the following:
a. Beginning Perl for Bioinformatics by James Tisdall
b. Mastering Perl for Bioinformatics by James D. Tisdall
c. Building Bioinformatics Solutions: with Perl, R and MySQL by
Conrad Bessant**
d. Perl Programming for Biologists by D. Curtis Jamison
e. Genomic Perl: From Bioinformatics Basics to Working Code by Rex A.
Dwyer

Looking at the tables of contents, reviews, and reader comments, I
believe that c. is probably the best value, but it's real hard to tell
without buying and reading the book. Anybody have any experiences
using any of these books? I'd like to conserve both time and money by
starting with the 'best' book.

Thanks, CC.
 

Jürgen Exner

ccc31807 said:
I'm not changing jobs, but I've been contacted about some contract
opportunities that (reportedly) are difficult but seem easy enough to
me, manipulating genome files to produce various kinds of reports,
graphs, etc. I have zero experience in this, so I'm just wondering ...

The usual problem is the huge volume of data that needs processing.
Therefore typically the standard algorithms don't work any more and you
need a really strong background in data processing.
Perl is not necessarily the best choice here. Perl's powerful features
make it easy to write code that seems to do the job, but it won't scale
from the small test samples to the huge actual data set where you really
need special methods and optimizations.

A little while ago there was someone posting questions here regularly
about how to deal with genome sequences. I don't know if he is still
around, but maybe you can check the archives and contact him.

jue
 

Bradley K. Sherman

ccc31807 said:
Looking at the tables of contents, reviews, and reader comments, I
believe that c. is probably the best value, but it's real hard to tell
without buying and reading the book. Anybody have any experiences
using any of these books? I'd like to conserve both time and money by
starting with the 'best' book.

The 'best' book is the one that engages you. It's hard to
predict.

For $22.95 you can get access to *all* the O'Reilly books
<http://my.safaribooksonline.com/>
including several on bioinformatics. There's a free trial!

You might want to check the used book stores for a textbook like
_The Molecular Biology of the Gene_, so that you can pick up some
biology.

--bks
 

Bradley K. Sherman

Jürgen Exner said:
...
The usual problem is the huge volume of data that needs processing.
Therefore typically the standard algorithms don't work any more and you
need a really strong background in data processing.
Perl is not necessarily the best choice here. Perl's powerful features
make it easy to write code that seems to do the job, but it won't scale
from the small test samples to the huge actual data set where you really
need special methods and optimizations.
...

This is not really fair. Most of bioinformatics is data wrangling
and Perl is exactly the right choice for that.

See, e.g.
<http://www.foo.be/docs/tpj/issues/vol1_2/tpj0102-0001.html>

--bks
 

ccc31807

Bradley K. Sherman said:
This is not really fair. Most of bioinformatics is data wrangling
and Perl is exactly the right choice for that.

In my day job, I deal with data files on the order of several hundred
thousand records. The scripts I write to produce reports from these
data files sometimes take a second (or several seconds) to run. The
data file I have for the bioinformatics project is much larger, but is
a lot simpler (it's a dotplot file).

Sometimes, data files can be so huge that the script just breaks.
Sometimes, the script just runs longer than you might expect.
Obviously, the longer time really isn't a problem ... there's no
difference between a script that runs in microseconds and one that
runs in minutes (say, between 60 and 120) ... as long as the script
runs to completion.

I'm sympathetic to jue's observation about the scaling problem, but
after having looked at the data, the fact that it's genomic or
biological is totally irrelevant. It's really the amount of data
rather than the kind of data that seems to be significant.

You seem to have a handle on what's going on. Is using Perl for
bioinformatics totally off the wall, or a reasonable option for data
mangling?

CC
 

Uri Guttman

JE> The usual problem is the huge volume of data that needs processing.
JE> Therefore typically the standard algorithms don't work any more and you
JE> need a really strong background in data processing.
JE> Perl is not necessarily the best choice here. Perl's powerful features
JE> make it easy to write code that seems to do the job, but it won't scale
JE> from the small test samples to the huge actual data set where you really
JE> need special methods and optimizations.

JE> A little while ago there was someone posting questions here regularly
JE> about how to deal with genome sequences. I don't know if he is still
JE> around, but maybe you can check the archives and contact him.

i will disagree on this. first off, perl is major in the biotech world
for several reasons. one it is the best at text processing and most
large genetic files are just plain text formats. secondly, there is a
large package called bioperl (with its own mailing list and community)
that does tons of standard things on those files and more. finally, if
you look back a bit, there is a great article called 'how perl saved the
human genome project'. when that project was initially running it was
distributed over many labs worldwide. and they created many new
incompatible file formats for the data. the author of cgi.pm (who is
really an MD and genetic researcher) designed perl modules to convert
those formats to a common set of core formats so they could easily
exchange data. so perl has a strong tie to the biotech industry that is
not likely to be broken for a long while.

as for jobs, i don't see many leads in that industry but they are
usually looking for direct experience in it (hard to get from the
outside) and/or higher degrees in related fields because you would be
working in such an environment where you need it.

so if the OP can learn enough from books and practice to get a job in
the field, i say go for it. there may be other hurdles to jump but i
can't predict what they will be.

uri
perlhunter.com (so i know something about the perl job market)
 

Bradley K. Sherman

ccc31807 said:
...
You seem to have a handle on what's going on. Is using Perl for
bioinformatics totally off the wall, or a reasonable option for data
mangling?

I think that Perl is the primary language for bioinformatics.
I can't back that up with numbers but I have been working in
bioinformatics since 1992. Some of the younger bioinformaticians
might want to make a case for Python, but I'm skeptical.

My philosophy is to use Perl until it becomes necessary to
write something in C. It rarely becomes necessary.

Learning databases and statistics is also of great importance.

--bks
 

Jochen Lehmeier

ccc31807 said:
You seem to have a handle on what's going on. Is using Perl for
bioinformatics totally off the wall, or a reasonable option for data
mangling?

I have no idea about bioinformatics, but Perl is easy enough that you
should be able to get a book, jot down a quick & dirty test script and
just sic it on your biggest and meanest data set.

Then you get a quick handle on how long basic stuff takes. If it works
fast enough, fine; if not, feel free to ask here. And if you find that
it's just not the right tool, then you won't have lost much.

IMO, the deal breaker will be if you have to handle data in an O(n^2)
fashion (or worse), i.e. where one would really use some very special
index structure, especially if the whole data set does not fit into RAM.
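
To make the O(n^2) point concrete, here is a minimal sketch with made-up data: finding the IDs that two lists have in common. Comparing every ID against every other ID is quadratic; building a hash index over one list first needs only one pass over each list. The subroutine name shared_ids and the gene IDs are hypothetical, chosen only for illustration, and the sketch assumes the index itself fits in RAM.

```perl
use strict;
use warnings;

# Hypothetical helper: return the IDs in @$query that also occur in @$reference.
# The hash lookup replaces the inner loop of a naive pairwise comparison.
sub shared_ids {
    my ($reference, $query) = @_;
    my %seen;
    $seen{$_} = 1 for @$reference;        # one pass to build the index
    return grep { $seen{$_} } @$query;    # one pass to probe it
}

my @reference = qw(geneA geneB geneC geneD);   # made-up IDs
my @query     = qw(geneC geneX geneA);
my @shared    = shared_ids(\@reference, \@query);
print "@shared\n";   # prints "geneC geneA"
```

If the index does not fit in memory, the same idea would need an on-disk structure (a sorted file or a database table, for instance), which is where the "very special index structure" comes in.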

Good luck!
 

Keith Bradnam

ccc31807 said:
I'm not changing jobs, but I've been contacted about some contract
opportunities that (reportedly) are difficult but seem easy enough to
me, manipulating genome files to produce various kinds of reports,
graphs, etc. I have zero experience in this, so I'm just wondering ...

1. What are the career opportunities in bioinformatics using Perl?

2. Looking for books, I found the following:
 a. Beginning Perl for Bioinformatics by James Tisdall
 b. Mastering Perl for Bioinformatics by James D. Tisdall
 c. Building Bioinformatics Solutions: with Perl, R and MySQL by
Conrad Bessant**
 d. Perl Programming for Biologists by D. Curtis Jamison
 e. Genomic Perl: From Bioinformatics Basics to Working Code by Rex A.
Dwyer

Looking at the tables of contents, reviews, and reader comments, I
believe that c. is probably the best value, but it's real hard to tell
without buying and reading the book. Anybody have any experiences
using any of these books? I'd like to conserve both time and money by
starting with the 'best' book.

Thanks, CC.

I co-teach a Unix & Perl course at UC Davis that is aimed at teaching
graduate students the basics of Perl in a biological
context. We have specifically tried to assume no prior knowledge of
programming as many people who take our course are new to this.

We have made our course materials (data & documentation) freely
available to anyone else who is interested:

http://korflab.ucdavis.edu/Unix_and_Perl/index.html

There is a corresponding Google Group for discussion of issues arising
from the course. We also make regular updates to the documentation.
Hope this might be of use to you.

Keith
 

Xho Jingleheimerschmidt

Jürgen Exner said:
The usual problem is the huge volume of data that needs processing.
Therefore typically the standard algorithms don't work any more and you
need a really strong background in data processing.

Isn't that exactly Perl's strength?

Jürgen Exner said:
Perl is not necessarily the best choice here. Perl's powerful features
make it easy to write code that seems to do the job, but it won't scale
from the small test samples to the huge actual data set where you really
need special methods and optimizations.

If you think about scalability as you write the code, Perl will not
present any special scalability issues versus other languages. If you
do not think about scalability, no language choice will protect you.
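
One concrete form of "thinking about scalability" is to read a large file a line at a time and not load it whole. The following is a minimal sketch, assuming FASTA-formatted input (header lines begin with '>'); the subroutine name sequence_lengths and the sample records are made up, and an in-memory filehandle stands in for a real file. It keeps only one line plus a small [id, length] entry per record, so memory use follows the number of records, not the number of bases.

```perl
use strict;
use warnings;

# Hypothetical helper: stream a FASTA filehandle and return [id, length] pairs.
# Sequence text is counted and discarded, never accumulated.
sub sequence_lengths {
    my ($fh) = @_;
    my @report;
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>(\S+)/) {
            push @report, [$1, 0];             # a new record starts here
        }
        elsif (@report) {
            $line =~ s/\s+//g;                 # drop stray whitespace
            $report[-1][1] += length $line;    # add this line's bases
        }
    }
    return @report;
}

# Made-up sample data, read through an in-memory filehandle.
my $fasta = ">seq1 example\nACGTAC\nGT\n>seq2\nGGGCC\n";
open my $fh, '<', \$fasta or die "cannot open in-memory file: $!";
my @report = sequence_lengths($fh);
close $fh;
print "$_->[0]\t$_->[1]\n" for @report;
```

For a real file, the open call would take a path in place of the scalar reference; the loop itself would not change.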

I certainly would not implement a heavy duty multiple alignment
algorithm directly in Perl, but I might well implement (and have
implemented) things like that in Inline::C or just link pre-existing C code in via
XS, using Perl to handle the book-keeping, memory management, IPC,
pre-processing and parsing, post-processing, packing, unpacking, etc.

Based on the description of "produce various kinds of reports", I
wouldn't think they expect this to cover Smith-Waterman type of things
anyway, but only the kind of reports that are very similar to what you
would find in non-bioinformatics type work.

Xho
 

Dr.Ruud

Keith said:
I co-teach a Unix & Perl course at UC Davis that is aimed at teaching
graduate students the basics of Perl in a biological
context. We have specifically tried to assume no prior knowledge of
programming as many people who take our course are new to this.

We have made our course materials (data & documentation) freely
available to anyone else who is interested:

http://korflab.ucdavis.edu/Unix_and_Perl/index.html

There is a corresponding Google Group for discussion of issues arising
from the course. We also make regular updates to the documentation.
Hope this might be of use to you.

I Like It.
 
