DNA String Compression For Storing in Data Structure


Gundala Viswanath

Hi all,

I am new to C/C++. I am wondering whether there is any
existing implementation that compresses a string like the one below
into a shorter format (e.g. base 64).

AAAAAAAAAAAAGTCGCGCCGCCGCGGGGAGGAA

The reason I want to do this is that there are ~10 million such
tags that I want to process into a matrix. Therefore I need
to compress each string to make it easier to handle.

For example the implementation in R will give this:
seq2id("AAAAAAAAAAAAGTCGCGCCGCCGCGGGGAGGAA")
[1] "IAAAAtmWWaooA

The R code can be viewed here: http://dpaste.com/110009/

But I am not sure how to implement this in C/C++.
Thanks in advance.


- GV
 

Gert-Jan de Vos

> I am new in C/C++. I am wondering if there is any
> existing implementation to compress such string in
> shorter format (e.g. 64 base).
>
> AAAAAAAAAAAAGTCGCGCCGCCGCGGGGAGGAA

I am no expert in DNA, but I understand there are only 4 possible
symbols: A, C, G, T. In that case 2 bits are enough to encode each
symbol. This would make a 2-bit encoded sequence 4 times smaller than
the equivalent char string (assuming 8-bit chars). A fixed 2 bits per
symbol also makes it quite easy to index a sequence at random positions
and to insert or extract symbols. I suggest you write a class that uses
a vector<unsigned> to store the encoded symbol bits and give it a
vector-like interface to index individual symbols as plain chars.
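
A minimal sketch of such a class might look like the following. The name PackedDna and its member functions are made up for illustration, the mapping A=0, C=1, G=2, T=3 is an arbitrary choice, and std::uint32_t is used in place of plain unsigned so that each word is known to hold exactly 16 symbols. It assumes the input contains only upper-case A, C, G and T (no N or other ambiguity codes).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: stores a DNA sequence at 2 bits per base,
// 16 bases per 32-bit word, with a small vector-like interface.
class PackedDna {
public:
    PackedDna() = default;

    explicit PackedDna(const std::string& seq) {
        words_.reserve((seq.size() + kPerWord - 1) / kPerWord);
        for (char c : seq) push_back(c);
    }

    // Append one base; throws std::invalid_argument for anything but A, C, G, T.
    void push_back(char base) {
        const std::uint32_t code = encode(base);
        if (size_ % kPerWord == 0) words_.push_back(0);  // start a new word
        words_.back() |= code << shift_of(size_);
        ++size_;
    }

    // Read the base at position i as a plain char.
    char operator[](std::size_t i) const {
        assert(i < size_);
        return "ACGT"[(words_[i / kPerWord] >> shift_of(i)) & 3u];
    }

    // Overwrite the base at position i.
    void set(std::size_t i, char base) {
        assert(i < size_);
        const std::uint32_t code = encode(base);
        std::uint32_t& w = words_[i / kPerWord];
        w = (w & ~(std::uint32_t{3} << shift_of(i))) | (code << shift_of(i));
    }

    std::size_t size() const { return size_; }              // number of bases
    std::size_t word_count() const { return words_.size(); } // 32-bit words used

    // Decode the whole sequence back into an ordinary string.
    std::string str() const {
        std::string out;
        out.reserve(size_);
        for (std::size_t i = 0; i < size_; ++i) out.push_back((*this)[i]);
        return out;
    }

private:
    static constexpr std::size_t kPerWord = 16;  // 32 bits / 2 bits per base

    // Bit offset of base i inside its word.
    static unsigned shift_of(std::size_t i) {
        return 2u * static_cast<unsigned>(i % kPerWord);
    }

    static std::uint32_t encode(char base) {
        switch (base) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default:  throw std::invalid_argument("base must be A, C, G or T");
        }
    }

    std::vector<std::uint32_t> words_;
    std::size_t size_ = 0;
};
```

With this sketch, PackedDna tag("ACGT") followed by tag[2] yields 'G', and tag.str() returns the original string. Note that a std::vector per tag carries its own bookkeeping and heap overhead, so for ~10 million short tags of identical length a fixed-size array of words per tag may be the more compact choice.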
 
