Finding empty columns. Is there a faster way?

nn · Apr 21, 2011

time head -1000000 myfile >/dev/null

real 0m4.57s
user 0m3.81s
sys 0m0.74s

time ./repnullsalt.py '|' myfile
0 1 Null columns:
11, 20, 21, 22, 23, 24, 25, 26, 27, 30, 31, 33, 45, 50, 68

real 1m28.94s
user 1m28.11s
sys 0m0.72s

import sys

def main():
    with open(sys.argv[2], 'rb') as inf:
        # Limit is in millions of rows; argv values are strings, so convert.
        limit = int(sys.argv[3]) if len(sys.argv) > 3 else 1
        dlm = sys.argv[1].encode('latin1')
        # One flag per column: True once an empty value has been seen.
        nulls = [x == b'' for x in next(inf)[:-1].split(dlm)]
        enum = enumerate
        split = bytes.split
        out = sys.stdout
        prn = print
        for j, r in enum(inf):
            if j % 1000000 == 0:
                prn(j // 1000000, end=' ')
                out.flush()
                if j // 1000000 >= limit:
                    break
            for i, cur in enum(split(r[:-1], dlm)):
                nulls[i] |= cur == b''
    print('Null columns:')
    print(', '.join(str(i+1) for i, val in enumerate(nulls) if val))

if not (len(sys.argv) > 2):
    sys.exit("Usage: " + sys.argv[0] +
             " <delimiter> <filename> <limit>")

main()

Jon Clements · Apr 21, 2011

What's with aliasing enumerate and print? With heavy disk I/O I can
hardly see name lookups being any problem at all. And why the time
stats with /dev/null?

I'd probably go for something like:

import csv

with open('somefile', newline='') as fin:
    nulls = set()
    for row in csv.reader(fin, delimiter='|'):
        nulls.update(idx for idx, val in enumerate(row, start=1) if not val)
print('nulls =', sorted(nulls))

hth
Jon

nn · Apr 22, 2011

Thanks, Jon.
Aliasing is a common method to avoid extra name lookups. The time stats
for head give the pure I/O time. So of the roughly 89 seconds the Python
program takes, only about 5 seconds are due to I/O, which leaves quite a
bit of overhead.
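
For anyone who wants to check how much the aliasing actually buys, here is a rough timeit sketch. The sample row and the function names are made up for illustration, and the numbers will vary by machine and Python version, so treat it as a way to measure rather than as a result.

```python
import timeit

# Hypothetical sample row: five pipe-delimited fields, one of them empty.
ROW = b"a|b||d|e\n"
DLM = b"|"

def global_lookups(n):
    """Inner loop resolves enumerate, ROW and DLM by global/builtin lookup."""
    total = 0
    for _ in range(n):
        for i, cur in enumerate(ROW[:-1].split(DLM)):
            total += cur == b""
    return total

def local_aliases(n):
    """Same work, but the names are bound to locals first."""
    enum = enumerate
    split = bytes.split
    row, dlm = ROW, DLM
    total = 0
    for _ in range(n):
        for i, cur in enum(split(row[:-1], dlm)):
            total += cur == b""
    return total

if __name__ == "__main__":
    for fn in (global_lookups, local_aliases):
        secs = timeit.timeit(lambda: fn(100_000), number=5)
        print(f"{fn.__name__}: {secs:.3f}s")
```

Both functions do the same splitting and comparing per row, so any difference in the timings comes from the name lookups alone.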

I ended up with this. It's not super fast, so I probably won't be running
it against all 350 million rows of my file, but it's faster than before:

time head -1000000 myfile |./repnulls.py
nulls = [11, 20, 21, 22, 23, 24, 25, 26, 27, 30, 31, 33, 45, 50, 68]

real 0m49.95s
user 0m53.13s
sys 0m2.21s

import sys

def main():
    fin = sys.stdin.buffer
    dlm = sys.argv[1].encode('latin1') if len(sys.argv) > 1 else b'|'
    nulls = set()
    nulls.update(i for row in fin
                 for i, val in enumerate(row[:-1].split(dlm), start=1)
                 if not val)
    print('nulls =', sorted(nulls))

main()
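
One idea that might be worth trying, as an untested sketch: a row can only contain an empty field if it starts or ends with the delimiter or has two delimiters back to back, and those checks are plain substring tests that avoid the split entirely. The function name columns_with_empty is hypothetical, and it strips line endings with rstrip instead of row[:-1], which is an assumption about the file. It only helps if a good share of the rows have no empty fields at all; if every row has at least one, it will be slightly slower than the version above.

```python
def columns_with_empty(lines, dlm=b"|"):
    """Return sorted 1-based indexes of columns that are empty in any line.

    `lines` is any iterable of bytes lines, e.g. sys.stdin.buffer.
    """
    found = set()
    double = dlm + dlm
    for line in lines:
        row = line.rstrip(b"\r\n")
        # Fast path: skip the split when the row cannot hold an empty field.
        # A blank row is not skipped, so it still counts as an empty column 1.
        if (row and double not in row
                and not row.startswith(dlm)
                and not row.endswith(dlm)):
            continue
        found.update(i for i, val in enumerate(row.split(dlm), start=1)
                     if not val)
    return sorted(found)


if __name__ == "__main__":
    sample = [b"a|b|c\n", b"a||c\n", b"|b|c\n", b"a|b|\n"]
    print("nulls =", columns_with_empty(sample))
```

To use it on the real file, something like columns_with_empty(sys.stdin.buffer) after importing sys should work the same way as the script above, but it needs timing against the actual data before drawing any conclusion.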
