emulating du with os.walk


Kirk Job-Sluder

Hrm, I'm a bit stumped on this.

I want to write a script that lists a directory hierarchy and gives me a
sorted list showing cumulative directory size. The example code for
os.walk gets me half-way there, but I can't quite figure out how to do
the hierarchical sum. Here is the output I'm getting:

/home/kirk/.gconf/apps/ggv/layout consumes 228 bytes in 1 non-directory
files
/home/kirk/.gconf/apps/ggv consumes 0 bytes in 1 non-directory files
/home/kirk/.gconf/apps consumes 0 bytes in 1 non-directory files

However, what I want is:

/home/kirk/.gconf/apps/ggv/layout consumes 228 bytes in 1 non-directory
files
/home/kirk/.gconf/apps/ggv consumes 228 bytes in 1 non-directory files
/home/kirk/.gconf/apps consumes 228 bytes in 1 non-directory files

There should be an easy way to get around this, or perhaps I'm better
off just parsing the output of du.
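
For reference, the loop producing that listing is more or less the
os.walk example from the docs, pointed at my home directory:

import os
from os.path import join, getsize

for root, dirs, files in os.walk('/home/kirk/.gconf'):
    size = sum([getsize(join(root, name)) for name in files])
    print("%s consumes %d bytes in %d non-directory files"
          % (root, size, len(files)))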
 

Dan Perl

First of all, I don't know how much you already know about os.walk, but it
can traverse trees either top-down or bottom-up (it has an argument
'topdown'). The default is topdown=True. What you probably need in your
case is a bottom-up traversal (so pass topdown=False).

Then you have to keep track of all the directories (I can suggest a data
structure if you want) and add the du values of all the children directories
plus the sizes of all the files to determine the du value of a parent
directory.

Without seeing your code, I'm guessing you are not doing one of these
things.
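
Something along these lines should do it (an untested sketch, no error
handling; the dict maps each directory to its cumulative size):

import os
from os.path import join, getsize

def du(top):
    sizes = {}
    # bottom-up: a directory's children are visited before the directory
    for root, dirs, files in os.walk(top, topdown=False):
        total = sum([getsize(join(root, name)) for name in files])
        # children's totals are already in the dict; .get() covers
        # subdirectories that were never visited (e.g. symlinks)
        total += sum([sizes.get(join(root, d), 0) for d in dirs])
        sizes[root] = total
    return sizes

for path, size in sorted(du('/home/kirk/.gconf').items()):
    print("%s consumes %d bytes" % (path, size))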

Dan
 

Martin v. Löwis

Kirk said:
There should be an easy way to get around this, or perhaps I'm better
off just parsing the output of du.

I suggest that you don't use os.path.walk, but write a recursive
function yourself. You should find that the entire problem can
be solved in 12 lines of Python code.
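
For instance, a rough sketch of the idea (no error handling, and it does
not recurse into symlinked directories):

import os
from os.path import join, getsize, isdir, islink

def du(path):
    total = 0
    for name in os.listdir(path):
        full = join(path, name)
        if isdir(full) and not islink(full):
            total += du(full)    # recurse; subdirectories report themselves
        else:
            total += getsize(full)
    print("%s consumes %d bytes" % (path, total))
    return total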

Regards,
Martin
 

Kirk Job-Sluder

I suggest that you don't use os.path.walk, but write a recursive
function yourself. You should find that the entire problem can
be solved in 12 lines of Python code.

Yeah, I finally solved it with a recursive function. Took me 16 lines,
including the bookkeeping.
 

Gerrit

Martin v. Löwis said:
I suggest that you don't use os.path.walk, but write a recursive
function yourself. You should find that the entire problem can
be solved in 12 lines of Python code.

There are some nasty little problems which make it difficult.

First, what do you do with hardlinks? Suppose directory a/a, a/b and a/c
all contain the same 100 MiB file. Directory a/ only has 100 MiB, but a
naive script will report 300 MiB.

Most of the time, you'll want to stay in one filesystem.

You don't want to get stuck in recursive symlinks. If a/b is a symlink
to a/, you quickly get into an infinite loop.

Directories have a size too.

What do we do with files we can't read?

In /proc, even stranger subtleties exist which I don't understand -
ENOENT although listed by listdir() and that sort of thing.
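
A stripped-down sketch of the bookkeeping this entails (far less than
the full script mentioned below, and certainly not identical to it):

import os
import stat
from os.path import join

def du(top):
    seen = set()                       # (st_dev, st_ino) already counted
    top_dev = os.lstat(top).st_dev
    total = os.lstat(top).st_size      # directories have a size too
    for root, dirs, files in os.walk(top):
        keep = []
        for name in dirs + files:
            try:
                st = os.lstat(join(root, name))
            except OSError:            # unreadable, or ENOENT in /proc
                continue
            if stat.S_ISLNK(st.st_mode) or st.st_dev != top_dev:
                continue               # skip symlinks, stay on one filesystem
            if (st.st_dev, st.st_ino) not in seen:
                seen.add((st.st_dev, st.st_ino))
                total += st.st_size    # hard links counted only once
            if name in dirs:
                keep.append(name)
        dirs[:] = keep                 # don't descend into what was skipped
    return total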

Together with more options, human-readable file sizes and documentation,
it took me ~200 LOC at
http://topjaklont.student.utwente.nl/creaties/dkus.py

Note that du doesn't solve these problems either.

yours,
Gerrit.

--
Weather in Twenthe, Netherlands 28/09 08:55:
15.0°C mist overcast wind 4.0 m/s SW (57 m above NAP)
--
In the councils of government, we must guard against the acquisition of
unwarranted influence, whether sought or unsought, by the
military-industrial complex. The potential for the disastrous rise of
misplaced power exists and will persist.
-Dwight David Eisenhower, January 17, 1961
 

Kirk Job-Sluder

There are some nasty little problems which make it difficult.

First, what do you do with hardlinks? Suppose directory a/a, a/b and a/c
all contain the same 100 MiB file. Directory a/ only has 100 MiB, but a
naive script will report 300 MiB.

Well, that is a good question. The primary goal of this script is to
construct lists of files that can be passed to cpio in order to make
multiple volumes of a certain size. (In my case, efficiently pack
CD-ROM or CD-RW disks.) The other goal is to minimize splitting of
directory hierarchies between volumes where possible. So for example,
given a list of directories:

foo 500M
bar 400M
baz 100M
rab 200M

the script should construct file lists for two volumes:
volume1: foo baz
volume2: bar rab

(Of course, the actual volumes will be larger than 600M to allow for
compression.)
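
The packing itself will probably be something like a first-fit-decreasing
pass, roughly:

def pack(dirsizes, capacity):
    # greedy first-fit decreasing: place the largest directories first
    volumes = []                        # each entry: [bytes used, [names]]
    for name, size in sorted(dirsizes, key=lambda p: p[1], reverse=True):
        for vol in volumes:
            if vol[0] + size <= capacity:
                vol[0] += size
                vol[1].append(name)
                break
        else:                           # no existing volume has room
            volumes.append([size, [name]])
    return [names for used, names in volumes]

M = 1024 * 1024
print(pack([('foo', 500*M), ('bar', 400*M), ('baz', 100*M), ('rab', 200*M)],
           600*M))
# -> [['foo', 'baz'], ['bar', 'rab']]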

Since each volume should be independent of other volumes, it makes sense
to treat hard links as regular files: even though foo/a.txt and
bar/b.txt point to the same file, a full copy of each is required on its
own volume.

Most of the time, you'll want to stay in one filesystem.

You don't want to get stuck in recursive symlinks. If a/b is a symlink
to a/, you quickly get into an infinite loop.

Good point. I should check for that.

Directories have a size too.

What do we do with files we can't read?

At the moment, throw an error and move on.

In /proc, even stranger subtleties exist which I don't understand -
ENOENT although listed by listdir() and that sort of thing.

Together with more options, human-readable file sizes and documentation,
it took me ~200 LOC at
http://topjaklont.student.utwente.nl/creaties/dkus.py

Thanks!

Note that du doesn't solve these problems either.

True, but I'm willing to sacrifice some precision for the sake of getting
it done. Getting volume sizes in the ballpark is good enough.
 
