Issue values dictionary

claire morandin · Jun 5, 2013

I have two text files with a bunch of transcript names and their corresponding lengths. They look like this:
ERCC.txt
ERCC-00002 1061
ERCC-00003 1023
ERCC-00004 523
ERCC-00009 984
ERCC-00012 994
ERCC-00013 808
ERCC-00014 1957
ERCC-00016 844
ERCC-00017 1136
ERCC-00019 644
blast.txt
ERCC-00002 1058
ERCC-00003 1017
ERCC-00004 519
ERCC-00009 977
ERCC-00019 638
ERCC-00022 746
ERCC-00024 134
ERCC-00024 126
ERCC-00024 98
ERCC-00025 445

I want to compare the transcript lengths and check whether the length in blast.txt is at least 90% of the length in ERCC.txt for the corresponding transcript name (I hope I am clear!).
So I wrote the following script:
ercctranscript_size = {}
for line in open('ERCC.txt'):
    columns = line.strip().split()
    transcript = columns[0]
    size = columns[1]
    ercctranscript_size[transcript] = int(size)

unknown_transcript = open('Not_sequenced_ERCC_transcript.txt', 'w')
blast_file = open('blast.txt')
out_file = open('out.txt', 'w')

blast_transcript = {}
blast_file.readline()
for line in blast_file:
    blasttranscript = columns[0].strip()
    blastsize = columns[1].strip()
    blast_transcript[blasttranscript] = int(blastsize)

blastsize = blast_transcript[blasttranscript]
size = ercctranscript_size[transcript]
print size
if transcript not in blast_transcript:
    unknown_transcript.write('{0}\n'.format(transcript))
else:
    size = ercctranscript_size[transcript]
    if blastsize >= 0.9*size:
        print >> out_file, transcript, True
    else:
        print >> out_file, transcript, False

But I have a problem storing all the lengths in the variable size, as it always comes back with the last entry.
Could anyone explain to me what I am doing wrong and how I should set the values for each dictionary? I am really new to Python and this is my first script.

Thanks for your help everybody!

alex23 · Jun 5, 2013

claire morandin said:
But I have a problem storing all the lengths in the variable size, as it always comes back with the last entry.
Could anyone explain to me what I am doing wrong and how I should set the values for each dictionary?

Your code has two for loops, one that reads ERCC.txt into a dict, and
one that reads blast.txt into a dict. The first assigns to
`transcript`, the second to `blasttranscript`. When the loops are
finished, you're using the _last_ value set for both `transcript` and
`blasttranscript`. So, really, you want _three_ loops: two to load the
files into dicts, then another to compare the two of them. If the
transcripts in blast.txt are guaranteed to be a subset of ERCC.txt,
then you could get away with two loops:

# convenience function for splitting lines into values
def get_transcript_and_size(line):
    columns = line.strip().split()
    return columns[0].strip(), int(columns[1].strip())

# read in blast_file
blast_transcripts = {}
with open('transcript_blast.txt') as blast_file:
    # this is a context manager, it'll close the file when it's finished
    for line in blast_file:
        blasttranscript, blastsize = get_transcript_and_size(line)
        blast_transcripts[blasttranscript] = blastsize

# read in ERCC and compare to blast
with open('transcript_ERCC.txt') as ercc_file, \
        open('Not_sequenced_ERCC_transcript.txt', 'w') as unknown_transcript, \
        open('transcript_out.txt', 'w') as out_file:
    # this is called a _nested_ context manager, and requires 2.7+ or 3.1+
    for line in ercc_file:
        ercctranscript, erccsize = get_transcript_and_size(line)
        if ercctranscript not in blast_transcripts:
            print >> unknown_transcript, ercctranscript
        else:
            is_ninety_percent = blast_transcripts[ercctranscript] >= 0.9*erccsize
            print >> out_file, ercctranscript, is_ninety_percent
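For the general case, where blast.txt may also contain names that are not in ERCC.txt, the three-loop approach described above could look roughly like the sketch below. The helper names `load_sizes` and `compare_sizes` are made up for illustration, the functions take any iterable of lines (an open file works too), and the value 400 is an invented length included only to show a failing comparison. It avoids the `print >>` syntax so it should run on both Python 2.7 and Python 3.

```python
def load_sizes(lines):
    """Map each transcript name to its length (on duplicates, the last entry wins)."""
    sizes = {}
    for line in lines:
        columns = line.split()
        if len(columns) < 2:
            continue  # skip blank or malformed lines
        sizes[columns[0]] = int(columns[1])
    return sizes


def compare_sizes(ercc_sizes, blast_sizes, threshold=0.9):
    """Return (results, missing): a pass/fail flag per shared name, plus unseen names."""
    results = {}
    missing = []
    for transcript, ercc_size in sorted(ercc_sizes.items()):
        if transcript not in blast_sizes:
            missing.append(transcript)
        else:
            results[transcript] = blast_sizes[transcript] >= threshold * ercc_size
    return results, missing


# Loops 1 and 2: load each data set into its own dict.
ercc = load_sizes(["ERCC-00002 1061", "ERCC-00004 523", "ERCC-00012 994"])
blast = load_sizes(["ERCC-00002 1058", "ERCC-00004 400", "ERCC-00022 746"])

# Loop 3: compare the two dicts, one transcript at a time.
results, missing = compare_sizes(ercc, blast)
```

Because the comparison happens inside its own loop, every transcript is checked, not just the last one read from each file.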

I've cleaned up your code a bit, using more similar naming schemes and
the same open/write procedures for all file access. Generally, any
time you're repeating code, you should stick it into a function and
use that instead, like the `get_transcript_and_size` func. If the
columns in your two files are separated by tabs, or always by the same
number of spaces, you can simplify this even further by using the csv
module: http://docs.python.org/2/library/csv.html
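As a rough idea of what the csv version might look like, here is a minimal sketch. It assumes the columns are separated by a single tab (pass `delimiter=" "` for a single space); the function name `read_sizes` is made up for this example, and `csv.reader` accepts any iterable of lines, so an open file can be passed in place of the list.

```python
import csv


def read_sizes(lines, delimiter="\t"):
    """Build a {transcript: length} dict from delimited lines."""
    sizes = {}
    for row in csv.reader(lines, delimiter=delimiter):
        if not row:
            continue  # csv.reader yields an empty list for blank lines
        sizes[row[0]] = int(row[1])
    return sizes


# Stand-in for an open file: two tab-separated lines.
sample = ["ERCC-00002\t1061\n", "ERCC-00003\t1023\n"]
ercc_sizes = read_sizes(sample)
```

Note that csv splits on exactly one delimiter character, so it only fits files where the separator is consistent; for columns padded with a varying number of spaces, `line.split()` is the safer choice.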

Hope this helps.

claire morandin · Jun 5, 2013

@alex23 I can't thank you enough. This really helped me so much, not only in fixing my issue but also in understanding where my original error was.

Thanks a lot

Peter Otten · Jun 5, 2013

alex23 said:
def get_transcript_and_size(line):
    columns = line.strip().split()
    return columns[0].strip(), int(columns[1].strip())

You can remove all strip() methods here as split() already strips off any
whitespace from the columns.
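A quick way to check this behaviour is the small sketch below. The sample line is made up, with extra spaces and a tab added on purpose.

```python
# A line with leading, trailing, and mixed internal whitespace.
line = "  ERCC-00002 \t 1061  \n"

# split() with no arguments splits on runs of whitespace and
# drops leading and trailing whitespace, so no strip() is needed.
columns = line.split()
size = int(columns[1])

# Passing an explicit separator behaves differently: empty strings
# appear wherever separators are adjacent or at the ends.
explicit = "a  b ".split(" ")
```

Here `columns` is `['ERCC-00002', '1061']`, while `explicit` is `['a', '', 'b', '']`, which is why the no-argument form is the convenient one for this kind of file.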

Not really important, but the nitpicker in me keeps nagging

alex23 · Jun 5, 2013

Peter Otten said:
You can remove all strip() methods here as split() already strips off any
whitespace from the columns.

Not really important, but the nitpicker in me keeps nagging

Thanks, I really should have checked, but I just pushed the OP's code into
a function. I didn't want to startle them with completely different
code.

As I mentioned, I would've used the csv module for this anyway, which
is why I never remember the split/strip behaviour.

Nitpickery can be a virtue in this field
