Twice this fall I've worked on code that takes a group of files and ensures that the total size of the files are less than a given size. The operation is pretty simple: identify all the files and their size (recursively, or not but accounting for the size of directories,) sort them, and and delete files from the front or back of the populated list. When you've reached the desired size.
If you have a cache and you're constantly adding content to it, eventually you will either need an infinite amount of storage or you'll have to delete something.
But what to delete? And how?
Presumably you use some items in the cache more often than others, and some files that change very often while others change very rarely, and in many cases, use and change frequency are orthogonal.
For the cases that I've worked on, the first case, frequency of use, is the property that we're interested in. If we haven't used a file in a while relative to the other files, the chances are its safe to delete.
The problem is that access time (atime) is that while most file systems have a concept of atime, most of them don't update it. Which makes sense: if every time you read a file you have to update the metadata, then every read operation becomes a write operations, and everything becomes slow.
Relative access time or, relatime, helps some. Here atime is updated, but only if you're writing to the file or if it's been more than 24 hours since your last update. The problem, of course, is that if cache are write-once-read-many and operates with a time granularity of less than a day, then relatime is often just creation time. That's no good.
The approach I've been taking is to use the last modification time, (mtime), and to intentionally update mtime (e.g. using touch or a similar operation,) after cache access. It's slightly less elegant than it could be, but it works really well and requires very little overhead.
Armed with these decisions all you need is a thing that crawls a file system, collects objects and stores their size and time, so we know how large the cache is, and can maintain an ordered list of file objects by mtime. The ordered lists of files should be a heap, but the truth is that you build and sort the structure once, and then just remove the "lowest" (oldest) items until the cache is the right size and then throwing it all away, so you're not really doing many heap-ish operations.
Therefore, I present lru. Earlier this summer I wrote a less generic implementation of the same principal, and was elbows deep into another project when I realized I needed another cache pruning tool. Sensing a trend, I decided to put a little more time into the project and built it out as a library that other people can use, though frankly I'm mostly concerned about my future self.
The package has two types, a Cache type that incorporates the core functionality and FileObject which represents items in the cache.
Operation is simple. You can construct and add items to the cache manually, or you can use DirectoryContents or TreeContents which build caches from a starting file system point. DirectoryContents looks at the contents of a single directory (skipping sub-directories optionally) and returns a Cache object with those contents. If you do not skip directories, each directory has, in the cache the total size of its contents.
TreeContents recurses through the tree and ignores directories, and returns a Cache object with all of those elements. TreeContents does not clean up empty directories.
Once you have a Cache object, use its Prune method with the maximum size of the cache (in bytes), any objects to exclude, and an optional dry-run flag, to prune the cache down until it's less than or equal to the max size.
Done.
I'm not planning any substantive changes to the library at this time as it meets most of my needs but there are some obvious features:
- a daemon mode where the cache object can "watch" a file system (using ionotify or similar) and add items to or update existing items in the cache. Potentially using fsnotify.
- an option to delete empty directories encountered during pruning.
- options to use other time data from the file system when possible, potentially using the times library.
With luck, I can go a little while longer without doing this again. With a little more luck, you'll find lru useful.