how to remove oldest files up to a limit efficiently


linuxnow

I need to maintain a filesystem where I'll keep only the most recently
used (MRU) files; the least recently used (LRU) ones have to be removed
to leave space for newer ones. The filesystem in question is a clustered
fs (glusterfs) which is very slow on "find" operations. To add
complexity, there are more than 10^6 files in 2 levels: 16³ dirs with an
equally distributed number of files inside.

My first idea was to os.walk the filesystem, find the oldest files, and
remove them until I reach the threshold. But finding them proves to be
too slow.

My second thought was to run find -atime several times to remove the
oldest files, then repeat the process with a more recent atime until the
threshold is reached. Again, this needs several walks through the fs.

Then I thought about tmpwatch, but, like find, it needs a date to start
removing from.

The ideal way would be to keep a list of files sorted by atime, probably
in a cache, something like updatedb.
This list could also be built from just the diratime of the first level
of dirs: seek them in order and so on. But it still seems expensive to
get this first level of dirs sorted.

Any suggestions on how to do this efficiently?
 

Dan Stromberg


os.walk once.

Build a list of all files in memory.

Sort them by whatever time you prefer - you can get times from os.stat.

Then figure out how many you need to delete from one end of your list,
and delete them.

If the filesystem is especially slow (or the directories especially
large), you might cluster the files to delete into groups by the
directories they're contained in, and cd to those directories prior to
removing them.
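
Something along these lines, for example (an untested sketch of that
single walk; MOUNT_POINT and N_TO_DELETE are placeholders, not anything
from your setup):

    import os

    MOUNT_POINT = "/mnt/gluster"   # hypothetical mount point
    N_TO_DELETE = 100              # however many files must go

    entries = []
    for dirpath, dirnames, filenames in os.walk(MOUNT_POINT):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            entries.append((st.st_atime, path))  # or st_mtime, as preferred

    entries.sort()  # oldest atime first

    for _, path in entries[:N_TO_DELETE]:
        try:
            os.remove(path)
        except OSError:
            pass  # already gone, or a permissions problem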
 

Terry Reedy

Dan said:
os.walk once.

Build a list of all files in memory.

Sort them by whatever time you prefer - you can get times from os.stat.

Since you do not need all 10**6 files sorted, you might also try the
heapq module. The entries into the heap would be (time, fileid).
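
For example, heapq.nsmallest keeps only a bounded heap internally, so
you never sort or even hold the full listing (an untested sketch;
oldest_files is a made-up helper name):

    import heapq
    import os

    def oldest_files(root, n):
        # Generate (atime, path) pairs for every file under root.
        def walk_times():
            for dirpath, dirnames, filenames in os.walk(root):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    try:
                        atime = os.stat(path).st_atime
                    except OSError:
                        continue  # vanished or unreadable; skip
                    yield (atime, path)
        # nsmallest keeps a heap of at most n entries, so the cost is
        # O(total * log n) rather than sorting all 10**6 entries.
        return heapq.nsmallest(n, walk_times())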
 

linuxnow

os.walk once.

Build a list of all files in memory.

I was thinking of reusing updatedb, but it does not record atime.
Reimplementing it seems overkill just to remove a few files regularly.
Keeping such a list would help a lot: old files would always be up to
date in it, and the daily run (the one that rebuilds the db) would only
add new files, which, in this case, are not interesting.

Sort them by whatever time you prefer - you can get times from os.stat.

Then figure out how many you need to delete from one end of your list,
and delete them.

If the filesystem is especially slow (or the directories especially
large), you might cluster the files to delete into groups by the
directories they're contained in, and cd to those directories prior to
removing them.

There are 4096 dirs with an equally distributed number of files inside.
I'd probably play tricks with diratime, then search inside the dirs in
order and remove files until the threshold is reached. Sorting
everything seems too expensive; in the end this will run often, and each
run should only need to remove a few tens or hundreds of files.
 

linuxnow

Since you do not need all 10**6 files sorted, you might also try the
heapq module.  The entries into the heap would be (time, fileid)

I'll look into it: sorting the dirs by atime and pushing the files
inside onto the heap until I can remove enough of them would probably
work very efficiently.
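
Maybe something like this rough sketch (untested; remove_oldest is a
made-up name, and it trusts the diratime ordering as a heuristic, so
only the stalest dirs ever get scanned):

    import heapq
    import os

    def remove_oldest(root, n_to_remove):
        # Sort the first-level dirs by their own atime; dirs whose
        # contents were read longest ago come first.
        dirs = []
        for name in os.listdir(root):
            path = os.path.join(root, name)
            if os.path.isdir(path):
                dirs.append((os.stat(path).st_atime, path))
        dirs.sort()

        # Collect candidates from the stalest dirs until there are
        # enough of them to satisfy the threshold.
        heap = []
        for _, dirpath in dirs:
            if len(heap) >= n_to_remove:
                break
            for name in os.listdir(dirpath):
                path = os.path.join(dirpath, name)
                try:
                    heapq.heappush(heap, (os.stat(path).st_atime, path))
                except OSError:
                    continue  # vanished or unreadable; skip

        # Pop and unlink the oldest candidates.
        for _ in range(min(n_to_remove, len(heap))):
            _, path = heapq.heappop(heap)
            try:
                os.remove(path)
            except OSError:
                pass  # already gone, or a permissions problem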

Thanks
Pau
 
