Twice this fall I’ve worked on code that takes a group of files and ensures that the total size of the files are less than a given size. The operation is pretty simple: identify all the files and their size (recursively, or not but accounting for the size of directories,) sort them, and and delete files from the front or back of the populated list. When you’ve reached the desired size.
If you have a cache and you’re constantly adding content to it, eventually you will either need an infinite amount of storage or you’ll have to delete something.
But what to delete? And how?
Presumably you use some items in the cache more often than others, and some files that change very often while others change very rarely, and in many cases, use and change frequency are orthogonal.
For the cases that I’ve worked on, the first case, frequency of use, is the property that we’re interested in. If we haven’t used a file in a while relative to the other files, the chances are its safe to delete.
The problem is that access time (atime
) is that while most file
systems have a concept of atime
, most of them don’t update it.
Which makes sense: if every time you read a file you have to update the
metadata, then every read operation becomes a write operations, and
everything becomes slow.
Relative access time or, relatime
, helps some. Here atime
is
updated, but only if you’re writing to the file or if it’s been more
than 24 hours since your last update. The problem, of course, is that if
cache are write-once-read-many and operates with a time granularity of
less than a day, then relatime
is often just creation time. That’s no
good.
The approach I’ve been taking is to use the last modification time,
(mtime
), and to intentionally update mtime
(e.g. using touch
or a
similar operation,) after cache access. It’s slightly less elegant than
it could be, but it works really well and requires very little overhead.
Armed with these decisions all you need is a thing that crawls a file
system, collects objects and stores their size and time, so we know how
large the cache is, and can maintain an ordered list of file objects by
mtime
. The ordered lists of files should be a heap, but the truth is
that you build and sort the structure once, and then just remove the
“lowest” (oldest) items until the cache is the right size and then
throwing it all away, so you’re not really doing many heap-ish
operations.
Therefore, I present lru. Earlier this summer I wrote a less generic implementation of the same principal, and was elbows deep into another project when I realized I needed another cache pruning tool. Sensing a trend, I decided to put a little more time into the project and built it out as a library that other people can use, though frankly I’m mostly concerned about my future self.
The package has two types, a Cache type that incorporates the core functionality and FileObject which represents items in the cache.
Operation is simple. You can construct and add items to the cache
manually, or you can use
DirectoryContents
or
TreeContents
which build caches from a starting file system point.
DirectoryContents
looks at the contents of a single directory
(skipping sub-directories optionally) and returns a Cache
object with
those contents. If you do not skip directories, each directory has, in
the cache the total size of its contents.
TreeContents
recurses through the tree and ignores directories, and
returns a Cache
object with all of those elements. TreeContents
does
not clean up empty directories.
Once you have a Cache
object, use its
Prune method
with the maximum size of the cache (in bytes), any objects to exclude,
and an optional dry-run flag, to prune the cache down until it’s less
than or equal to the max size.
Done.
I’m not planning any substantive changes to the library at this time as it meets most of my needs but there are some obvious features:
- a daemon mode where the cache object can “watch” a file system (using ionotify or similar) and add items to or update existing items in the cache. Potentially using fsnotify.
- an option to delete empty directories encountered during pruning.
- options to use other time data from the file system when possible, potentially using the times library.
With luck, I can go a little while longer without doing this again. With
a little more luck, you’ll find lru
useful.