M
mstagliamonte
Dear All,
Here I am, with another newbie question. I am trying to extract some lines from a fasta (text) file which match the headers in another file. i.e:
Fasta file:
headers file:
header1|info1:info2_info3
header2|info1:info2_info3
I want to create a third file, similar to the first one, but only containing headers and text of what is listed in the second file. Also, I want to print out how many headers were actually found from the second file to match the first.
I have done a script which seems to work, but with a couple of 'side effects'
Here is my script:
-------------------------------------------------------------------
import re
class Extractor():
def __init__(self,headers_file, fasta_file,output_file):
with open(headers_file,'r') as inp0:
counter0=0
container=''
inp0_bis=inp0.read().split('\n')
for x in inp0_bis:
container+=x.replace(':','_').replace('|','_')
with open(fasta_file,'r') as inp1:
inp1_bis=inp1.read().split('>')
for i in inp1_bis:
i_bis= i.split('\n')
match = re.search(i_bis[0].replace(':','_').replace('|','_'),container)
if match:
counter0+=1
with open(output_file,'at') as out0:
out0.write('>'+i)
print '{} sequences were found'.format(counter0)
-------------------------------------------------------------------
Side effects:
1) The very first header is written as >>header1 rather than >header1
2) the number of sequences found is 1 more than the ones actually found!
Have you got any thoughts about causes/solutions?
Thanks for your time!
P.S.: I think I have removed the double posting... not sure...
Max
Here I am, with another newbie question. I am trying to extract some lines from a fasta (text) file which match the headers in another file. i.e:
Fasta file:
general textheader1|info1:info2_info3 general text
header2|info1:info2_info3
headers file:
header1|info1:info2_info3
header2|info1:info2_info3
I want to create a third file, similar to the first one, but only containing headers and text of what is listed in the second file. Also, I want to print out how many headers were actually found from the second file to match the first.
I have done a script which seems to work, but with a couple of 'side effects'
Here is my script:
-------------------------------------------------------------------
import re
class Extractor():
def __init__(self,headers_file, fasta_file,output_file):
with open(headers_file,'r') as inp0:
counter0=0
container=''
inp0_bis=inp0.read().split('\n')
for x in inp0_bis:
container+=x.replace(':','_').replace('|','_')
with open(fasta_file,'r') as inp1:
inp1_bis=inp1.read().split('>')
for i in inp1_bis:
i_bis= i.split('\n')
match = re.search(i_bis[0].replace(':','_').replace('|','_'),container)
if match:
counter0+=1
with open(output_file,'at') as out0:
out0.write('>'+i)
print '{} sequences were found'.format(counter0)
-------------------------------------------------------------------
Side effects:
1) The very first header is written as >>header1 rather than >header1
2) the number of sequences found is 1 more than the ones actually found!
Have you got any thoughts about causes/solutions?
Thanks for your time!
P.S.: I think I have removed the double posting... not sure...
Max