how to remove oldest files up to a limit efficiently


linuxnow

I need to maintain a filesystem where I'll keep only the most recently
used (MRU) files; the least recently used (LRU) ones have to be removed
to leave space for newer ones. The filesystem in question is a clustered
fs (glusterfs) which is very slow on "find" operations. To add
complexity, there are more than 10^6 files in 2 levels: 16³ dirs with an
equally distributed number of files inside.

My first idea was to os.walk the filesystem, find the oldest files, and
remove them until I reach the threshold. But finding them proves to be
too slow.

My second thought was to run find -atime several times to remove the
oldest files, then repeat the process with a more recent atime until the
threshold is reached. Again, this needs several walks through the fs.

Then I thought about tmpwatch, but, like find, it needs a date to start
removing from.

The ideal way would be to keep a list of files sorted by atime, probably
in a cache, something like updatedb.
This list could also be built from just the diratime of the first level
of dirs: seek them in order and so on. But it still seems expensive to
get this first level of dirs sorted.

Any suggestions on how to do this efficiently?
 

Dan Stromberg


os.walk once.

Build a list of all files in memory.

Sort them by whatever time you prefer - you can get times from os.stat.

Then figure out how many you need to delete from one end of your list,
and delete them.

If the filesystem is especially slow (or the directories especially
large), you might cluster the files to delete into groups by the
directories they're contained in, and cd to those directories prior to
removing them.
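
Something along these lines, for example (an untested sketch of that
single walk; MOUNT_POINT and N_TO_DELETE are placeholders, not anything
from your setup):

    import os

    MOUNT_POINT = "/mnt/gluster"   # hypothetical mount point
    N_TO_DELETE = 100              # however many files must go

    entries = []
    for dirpath, dirnames, filenames in os.walk(MOUNT_POINT):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            entries.append((st.st_atime, path))  # or st_mtime, as preferred

    entries.sort()  # oldest atime first

    for _, path in entries[:N_TO_DELETE]:
        try:
            os.remove(path)
        except OSError:
            pass  # already gone, or a permissions problem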
 

Terry Reedy

Dan said:
os.walk once.

Build a list of all files in memory.

Sort them by whatever time you prefer - you can get times from os.stat.

Since you do not need all 10**6 files sorted, you might also try the
heapq module. The entries into the heap would be (time, fileid).
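
For example, heapq.nsmallest keeps only a bounded heap internally, so
you never sort or even hold the full listing (an untested sketch;
oldest_files is a made-up helper name):

    import heapq
    import os

    def oldest_files(root, n):
        # Generate (atime, path) pairs for every file under root.
        def walk_times():
            for dirpath, dirnames, filenames in os.walk(root):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    try:
                        atime = os.stat(path).st_atime
                    except OSError:
                        continue  # vanished or unreadable; skip
                    yield (atime, path)
        # nsmallest keeps a heap of at most n entries, so the cost is
        # O(total * log n) rather than sorting all 10**6 entries.
        return heapq.nsmallest(n, walk_times())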
 

linuxnow

os.walk once.

Build a list of all files in memory.

I was thinking of reusing updatedb, but it does not record atime.
Reimplementing it seems overkill just to remove a few files regularly.
Keeping such a list would help a lot: old files would always be up to
date in it, and the daily run (the one that rebuilds the db) would only
add new files, which, in this case, are not interesting.

Sort them by whatever time you prefer - you can get times from os.stat.

Then figure out how many you need to delete from one end of your list,
and delete them.

If the filesystem is especially slow (or the directories especially
large), you might cluster the files to delete into groups by the
directories they're contained in, and cd to those directories prior to
removing them.

There are 4096 dirs with an equally distributed number of files inside.
I'd probably play tricks with diratime, then search inside the dirs in
order and remove files until the threshold is reached. Sorting
everything seems too expensive; in the end this will run often, and each
run should only need to remove a few tens or hundreds of files.
 

linuxnow

Since you do not need all 10**6 files sorted, you might also try the
heapq module.  The entries into the heap would be (time, fileid)

I'll look into it: sorting the dirs by atime and pushing the files
inside onto the heap until I can remove enough of them would probably
work very efficiently.
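
Maybe something like this rough sketch (untested; remove_oldest is a
made-up name, and it trusts the diratime ordering as a heuristic, so
only the stalest dirs ever get scanned):

    import heapq
    import os

    def remove_oldest(root, n_to_remove):
        # Sort the first-level dirs by their own atime; dirs whose
        # contents were read longest ago come first.
        dirs = []
        for name in os.listdir(root):
            path = os.path.join(root, name)
            if os.path.isdir(path):
                dirs.append((os.stat(path).st_atime, path))
        dirs.sort()

        # Collect candidates from the stalest dirs until there are
        # enough of them to satisfy the threshold.
        heap = []
        for _, dirpath in dirs:
            if len(heap) >= n_to_remove:
                break
            for name in os.listdir(dirpath):
                path = os.path.join(dirpath, name)
                try:
                    heapq.heappush(heap, (os.stat(path).st_atime, path))
                except OSError:
                    continue  # vanished or unreadable; skip

        # Pop and unlink the oldest candidates.
        for _ in range(min(n_to_remove, len(heap))):
            _, path = heapq.heappop(heap)
            try:
                os.remove(path)
            except OSError:
                pass  # already gone, or a permissions problem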

Thanks
Pau
 
