Simple py script to calc folder sizes

Caleb Hattingh

Hi everyone

[Short version: I put some code below: what changes can make it run
faster?]

Unless you have a nice tool handy, calculating many folder sizes for
clearing disk space can be a click-fest nightmare. Looking around, I
found Baobab (gui tool); the "du" linux/unix command-line tool; the
extremely impressive tkdu: http://unpythonic.net/jeff/tkdu/ ; a python
script I didn't really understand at
http://vsbabu.org/webdev/zopedev/foldersize.html (are these "folder
objects" zope thingies?); there are also tools that can add a
"foldersize" column into Explorer on Windows
(foldersize.sourceforge.net, for example); the superb freeCommander
file-manager (win32) has the functionality built in, and so on.

"du" is closest to what I was looking for, but is not immediately
cross-platform: I know I can probably get it through Cygwin, and there
is probably a win32 binary or clone around somewhere, but I thought a
simple python solution would be great. Maybe there already is one, but
I couldn't find it with a modest amount of searching.

Anyway, I made one that will produce a list of only the folders in the
current folder, along with their sizes. I am posting it for two
reasons: it might be useful for someone else, and I want to know if it
can be made faster (but in a cross-platform way); maybe you spot
something in the code that is obviously sub-optimal.

# Python script to list sizes of folders in current folder

import os, os.path

rootfolders = os.listdir('.')
rootfolders = [i for i in rootfolders if os.path.isdir(i)]

class counter:
    def __init__(self,rootfolder):
        self.count = 0
        self.rootfolder = rootfolder
    def inc(self,num):
        self.count = self.count + num
    def __str__(self):
        if self.count<1024.:
            unit = ' bytes'
            scaler = 1.
        elif self.count<1024.*1024.:
            unit = ' KB'
            scaler = 1/1024.
        elif self.count<1024.*1024.*1024.:
            unit = ' MB'
            scaler = 1/1024./1024.
        else:
            unit = ' GB'
            scaler = 1/1024./1024./1024.
        return '%-20s - %8.2f%s'%(self.rootfolder,self.count*scaler,unit)

def visitfun(cntObj,dirname,names):
    for i in names:
        fullname = os.path.join(dirname,i)
        if os.path.isfile(fullname):
            cntObj.inc( os.path.getsize(fullname) )
    return None

foldersizeobjects = []
for i in rootfolders:
    cntObj = counter(i)
    os.path.walk(i,visitfun,cntObj)
    foldersizeobjects.append(cntObj)

def cmpfunc(a,b):
    if a.count > b.count:
        return 1
    elif a.count == b.count:
        return 0
    else:
        return -1

foldersizeobjects.sort(cmpfunc)

tot=0
for foldersize in foldersizeobjects:
    tot=tot+foldersize.count
    print foldersize
print 'Total: %.2f MB'%(tot/1024./1024.)

# End

regards
Caleb
 
Ben Cartwright

Caleb said:
Unless you have a nice tool handy, calculating many folder sizes for
clearing disk space can be a click-fest nightmare. Looking around, I
found Baobab (gui tool); the "du" linux/unix command-line tool; the
extremely impressive tkdu: http://unpythonic.net/jeff/tkdu/ ; a python
script I didn't really understand at
http://vsbabu.org/webdev/zopedev/foldersize.html (are these "folder
objects" zope thingies?); there are also tools that can add a
"foldersize" column into Explorer on Windows
(foldersize.sourceforge.net, for example); the superb freeCommander
file-manager (win32) has the functionality built in, and so on.

You also might want to take a look at KDirStat
(http://kdirstat.sourceforge.net/) and its win32 counterpart,
WinDirStat (http://windirstat.sourceforge.net/).

Caleb said:
"du" is closest to what I was looking for, but is not immediately
cross-platform: I know I can probably get it through Cygwin, and there
is probably a win32 binary or clone around somewhere

Try http://unxutils.sourceforge.net/ ... much quicker to set up than
Cygwin.

A pure Python port of du (and other unix utilities) would be cool,
though.

--Ben
 
John Zenger

Caleb said:
Hi everyone

[Short version: I put some code below: what changes can make it run
faster?]

On my slow notebook, your code takes about 1.5 seconds to do my
C:\Python24 dir. With a few changes my code does it in about 1 second.

Here is my code:

import os, os.path, math

def foldersize(fdir):
    """Returns the size of all data in folder fdir in bytes"""
    root, dirs, files = os.walk(fdir).next()
    files = [os.path.join(root, x) for x in files]
    dirs = [os.path.join(root, x) for x in dirs]
    return sum(map(os.path.getsize, files)) + sum(map(foldersize, dirs))

suffixes = ['bytes','kb','mb','gb','tb']
def prettier(bytesize):
    """Convert a number in bytes to a string in MB, GB, etc"""
    # What power of 1024 is less than or equal to bytesize?
    exponent = int(math.log(bytesize, 1024))
    if exponent > 4:
        return "%d bytes" % bytesize
    return "%8.2f %s" % (bytesize / 1024.0 ** exponent, suffixes[exponent])

rootfolders = [i for i in os.listdir('.') if os.path.isdir(i)]
results = [ (foldersize(folder), folder) for folder in rootfolders ]

for size, folder in sorted(results):
    print "%s\t%s" % (folder, prettier(size))

print
print "Total:\t%s" % prettier(sum ( size for size, folder in results ))

# End

The biggest change I made was to use os.walk rather than os.path.walk.
os.walk is newer, and a bit easier to understand; it takes just a single
directory path as an argument, and returns a nice generator object that
you can use in a for loop to walk the entire tree. I use it in a
somewhat unconventional way here. Look at the docs for a more
conventional application.
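
For comparison, the more conventional use of os.walk mentioned above might look roughly like the sketch below. It is written for Python 3 (an assumption; the thread's code is Python 2), the name tree_size is made up for illustration, and it does no error handling for unreadable directories or broken links.

```python
import os

def tree_size(top):
    """Return the total size in bytes of the files found under top."""
    total = 0
    # os.walk yields one (dirpath, dirnames, filenames) triple per directory,
    # descending into subdirectories on its own, so no recursion is needed.
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total
```

Calling tree_size('.') would then give the byte total for the current folder and everything below it.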

The "map(os.path.getsize, files)" code should run a bit faster than a
for loop, because map only has to look up the getsize function once.

I use log in the "prettier" function rather than your chain of ifs. The
chain of ifs might actually be faster. But I spent so long studying
math in school that I like to use it whenever I get a chance.

Some other comments on your code:
def cmpfunc(a,b):
    if a.count > b.count:
        return 1
    elif a.count == b.count:
        return 0
    else:
        return -1

This could be just "return a.count - b.count". Cmp does not require -1
or +1, just a positive, negative, or zero.

foldersizeobjects.sort(cmpfunc)

You could also use the key parameter; it is usually faster than a cmp
function. As you can see, I used tuples instead; tuples compare element
by element, so the sort orders by the first element (the size) and only
looks at the folder name to break ties. Of course, sorting is not a
serious bottleneck in either program.
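
A small sketch of the two sorting approaches described here, in Python 3 (an assumption; Python 3 dropped the cmp argument to sort, so only key is shown). The Counter class and the sample numbers are hypothetical stand-ins for the counter objects in the original script.

```python
from operator import attrgetter

class Counter:
    """Hypothetical stand-in for the thread's counter class."""
    def __init__(self, rootfolder, count=0):
        self.rootfolder = rootfolder
        self.count = count

folders = [Counter('docs', 300), Counter('src', 100), Counter('build', 200)]

# key= computes one sort key per item, instead of calling a
# comparison function once per pair of items being compared.
by_size = sorted(folders, key=attrgetter('count'))
names = [c.rootfolder for c in by_size]

# Tuples compare element by element, so (size, name) pairs sort by size
# first and fall back to the name only when two sizes are equal.
pairs = sorted([(300, 'docs'), (100, 'src'), (100, 'build')])
```

Either form avoids writing a comparison function at all; the tuple form is handy when the size and the name are the only things being carried around.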

tot=0
for foldersize in foldersizeobjects:
    tot=tot+foldersize.count
    print foldersize

"tot +=" is cooler than tot = tot + . And perhaps a bit faster.
 
Caleb Hattingh

Thanks John

I will use your code :) 30% improvement is not insignificant, and
that's what I was looking for.

I find the log function a little harder to read, but I guess that is a
limitation of me, not your code.

Caleb
 
Caleb Hattingh

Hi John

Your code works on some folders but not others. For example, it works
on my /usr/lib/python2.4 (the example you gave), but on other folders
it terminates early with StopIteration exception on the
os.walk().next() step.

I haven't really looked at this closely enough yet, but it looks as
though there may be an issue with permissions (and not having enough)
on subfolders within a tree.

I don't want you to work too hard on what is my problem, but are there
any ideas that jump out at you?

Regards
Caleb
 
Ben Cartwright

Caleb said:
Your code works on some folders but not others. For example, it works
on my /usr/lib/python2.4 (the example you gave), but on other folders
it terminates early with StopIteration exception on the
os.walk().next() step.

I haven't really looked at this closely enough yet, but it looks as
though there may be an issue with permissions (and not having enough)
on subfolders within a tree.

You're quite correct. Here's a version of John's code that handles
such cases:

import warnings
def foldersize(fdir):
    """Returns the size of all data in folder fdir in bytes"""
    try:
        root, dirs, files = os.walk(fdir).next()
    except StopIteration:
        warnings.warn("Could not access " + fdir)
        return 0
    files = [os.path.join(root, x) for x in files]
    dirs = [os.path.join(root, x) for x in dirs]
    return sum(map(os.path.getsize, files)) + sum(map(foldersize, dirs))
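
As a possible alternative to catching StopIteration, os.walk accepts an onerror callback that is called with the OSError raised when a directory cannot be listed. The sketch below assumes Python 3 and is not the code from the thread; the helper name report is made up, and the extra try/except around getsize is an added precaution for files that cannot be read.

```python
import os
import warnings

def foldersize(fdir):
    """Return the total size in bytes of the files under fdir,
    warning about (and skipping) anything that cannot be read."""
    def report(error):
        # os.walk calls this with the OSError raised while listing a directory.
        warnings.warn("Could not access %s" % error.filename)

    total = 0
    for root, dirs, files in os.walk(fdir, onerror=report):
        for name in files:
            fullname = os.path.join(root, name)
            try:
                total += os.path.getsize(fullname)
            except OSError:
                # For example a broken symlink, or a file removed mid-walk.
                warnings.warn("Could not stat %s" % fullname)
    return total
```

With this shape the walk carries on past an unreadable subfolder and simply leaves its contents out of the total.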

There's also another bug in the prettier() function that barfs on empty
directories, as it's taking the log of 0. The fix:

exponent = int(math.log(max(1, bytesize), 1024))
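
Another way to sidestep the log-of-zero problem is to avoid log altogether and divide in a loop. This is a swapped-in technique rather than the code from the thread: a Python 3 sketch that reuses the prettier name and suffix list from above, and that, unlike the original, reports anything beyond the last suffix in tb instead of falling back to bytes.

```python
SUFFIXES = ['bytes', 'kb', 'mb', 'gb', 'tb']

def prettier(bytesize):
    """Format a non-negative byte count using the largest fitting unit."""
    exponent = 0
    value = float(bytesize)
    # Divide by 1024 until the value fits the current unit
    # or there are no larger suffixes left.
    while value >= 1024.0 and exponent < len(SUFFIXES) - 1:
        value /= 1024.0
        exponent += 1
    return "%8.2f %s" % (value, SUFFIXES[exponent])
```

An empty directory (size 0) never enters the loop, so it is formatted as a plain byte count without any special case.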

--Ben
 
