deduping

dirknbr · Jun 21, 2010

Hi

I have 2 files (done and outf), and I want to choose the unique elements
from the 2nd column in outf which are not in done. This code works but
is not efficient; can you think of a quicker way? The a=1 is obviously
just a dummy statement. I put it this way around because I think
'in' is quicker than 'not in' - is that true?

done_={}
for line in done:
    done_[line.strip()]=0

print len(done_)

universe={}
for line in outf:
    if line.split(',')[1].strip() in universe.keys():
        a=1
    else:
        if line.split(',')[1].strip() in done_.keys():
            a=1
        else:
            universe[line.split(',')[1].strip()]=0

Dirk

Thomas Lehmann · Jun 21, 2010

universe={}

for line in outf:
    if line.split(',')[1].strip() in universe.keys():
        a=1
    else:
        if line.split(',')[1].strip() in done_.keys():
            a=1
        else:
            universe[line.split(',')[1].strip()]=0

I can't say much without seeing the data being processed, but I
can say this: "line.split(',')[1].strip()" may be evaluated up to
three times per line, so I would compute it only once. I would
write it like this:

for line in outf:
    key = line.split(',')[1].strip()
    if not (key in universe.keys()):
        if not (key in done_.keys()):
            universe[key] = 0

Peter Otten · Jun 21, 2010

Instead of

if key in some_dict.keys():
    #...

which converts the keys in the dictionary to a list and then performs an
O(N) lookup on that list you should use

if key in some_dict:
    #...

which doesn't build a list and looks up the key in constant time.
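
To illustrate the difference, here is a small sketch with made-up data (the names as_list and missing, and the 100,000 keys, are only for this example). It times a membership test against a list copy of the keys, which is what some_dict.keys() returns in Python 2, and against the dictionary itself. Note that in Python 3 keys() returns a view rather than a list, so the penalty described above applies to Python 2.

```python
import timeit

# Hypothetical data: 100,000 string keys, all mapped to 0 as in the original post.
some_dict = dict((str(i), 0) for i in range(100000))
as_list = list(some_dict)   # what some_dict.keys() builds in Python 2
missing = "not-a-key"       # worst case for the list: every element is compared

# Both forms give the same answer; only the cost differs.
assert ("42" in as_list) == ("42" in some_dict)
assert (missing in as_list) == (missing in some_dict)

# Time 100 failed lookups against each container.
list_time = timeit.timeit(lambda: missing in as_list, number=100)
dict_time = timeit.timeit(lambda: missing in some_dict, number=100)
print("list scan: %.6f s, dict lookup: %.6f s" % (list_time, dict_time))
```

The exact numbers depend on the machine, so none are quoted here; the list scan should grow with the number of keys while the dict lookup should stay roughly flat.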

Peter

python · Jun 21, 2010

Use a set instead of a dictionary for done keys?

Malcolm

Dave Angel · Jun 21, 2010

Where you have a=1, one would normally use the "pass" statement. But
you're wrong that 'not in' is less efficient than 'in'. If there's a
difference, it's probably negligible, and almost certainly less than the
extra else clause you're forcing here.

When doing an 'in', do *not* use the keys() method, as you're replacing
a fast lookup with a slow one, not to mention the time it takes to build
the keys() list each time.

In both these cases, you can use a set, rather than a dict. And there's
no need to test whether the item is already in the set, just put it in
again.

Changing all that, you'll wind up with something like (untested)

done_set = set()
universe = set()
for line in done:
    done_set.add(line.strip())
for line in outf:
    item = line.split(',')[1].strip()
    if item not in done_set:
        universe.add(item)
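
The same approach can be tried out without real files. The sketch below is only an illustration: the function name unseen_items and the sample data are made up, and skipping lines that have no second column is an assumption, since the thread does not say what the files look like.

```python
import io

def unseen_items(done, outf):
    """Return the set of second-column values in outf that are not listed in done."""
    done_set = set(line.strip() for line in done)
    universe = set()
    for line in outf:
        fields = line.split(',')
        if len(fields) < 2:
            continue  # assumption: ignore lines without a second column
        item = fields[1].strip()
        if item not in done_set:
            universe.add(item)  # adding a duplicate to a set is harmless
    return universe

# Made-up sample data standing in for the two open files.
done = io.StringIO(u"apple\nbanana\n")
outf = io.StringIO(u"1, apple, x\n2, cherry, y\n3, cherry, z\n4, damson, w\n")
result = unseen_items(done, outf)
print(sorted(result))  # ['cherry', 'damson']
```

With real data, the two StringIO objects would be replaced by the open file objects.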

DaveA

Paul Rubin · Jun 21, 2010

dirknbr said:
done_={}
for line in done:
    done_[line.strip()]=0
...

Maybe you mean:

done_ = set(line.strip() for line in done)
outf_ = set(line.split(',')[1].strip() for line in outf)
universe = outf_ - done_ # set difference: items in outf that are not in done
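
For reference, a quick sketch of the two set operators on made-up data (only_in_outf and in_both are names chosen for this example). The - operator keeps what is only in the left-hand set, which matches the original question of finding items in outf that are not in done, while & keeps what is in both sets.

```python
# Made-up values standing in for the contents of the two files.
done_ = set(["apple", "banana"])
outf_ = set(["apple", "cherry", "damson"])

only_in_outf = outf_ - done_   # set difference: in outf_ but not in done_
in_both = outf_ & done_        # set intersection: present in both sets

print(sorted(only_in_outf))    # ['cherry', 'damson']
print(sorted(in_both))         # ['apple']
```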
