The Future of File Organization and Security

A while ago I had a conversation with a (now former) coworker about the future of shared file systems, unstructured organization and management, and access control. What follows is a collection of notes and thoughts on the subject that have stuck with me.

Let's start with some general assumptions, premises, ideas:

  • File system hierarchies are dead or dying. To have a useful file system hierarchy the following qualities are essential:

  • Every piece of data needs to belong in one location and only one location.

  • Every container (e.g. directory or folder) needs to hold at least two objects.

  • Hierarchy depth ought to be minimized. Every system can use two levels. After the second level, each additional level should only be added when a very large number of additional objects are added to the system. If you have 3 functional levels and fewer than 1000 objects, you might be in trouble.

    As you might imagine, this is very difficult to achieve, and the difficulty is compounded by the huge number of legacy systems, and the fact that "good enough is good enough," particularly given that file organization is secondary to most people's core work.

    While there are right ways to build hierarchical structure for file system data, less structure is better than more structure, and I think that groups will tend toward less over time.

  • Access control is a lost cause. Legacy data and legacy practices will keep complex ACL-based systems for access control in place for a long time, but I think it's pretty clear that for any sort of complex system, access control isn't an effective paradigm. In some ways, access control is the last really good use of file system hierarchies. Which is to say, by now the main use of strong containers (as opposed to tags) is access control.

    I don't think that "enterprise content management"-style tools are there yet. I suspect that the eventual solution to "how do I control access to content" will either be based on a cryptographic key system that controls access and file integrity, or come from a class of application, a la ECMS, with some sort of more advanced, abstracted file system interface that's actually usable.

I'm betting on encryption.

  • Tagging and search are the ways forward. In many cases, the location of a file in the hierarchy helps determine the contents of that file. If there are no hierarchies then you need something more useful and more flexible to provide this level of insight.

  • Great search is a necessity. Luckily it's also easy. Apache Solr/Lucene, Xapian, and, hell, Google Search Appliances make great search really easy.

  • Some sort of tagging system. In general, only administrators should be able to create tags, and I think a single tag per object (i.e. categories) versus multiple tags per object should be configurable on a collection-by-collection basis.

    Tag systems would be great for creating virtualized file system interfaces, obviating the need for user-facing links, and leveraging existing usage patterns and interfaces. It's theoretically possible to hang access control off of tag systems but that's significantly more complicated.

    One of the biggest challenges with tag systems is avoiding recapitulating the problems with hierarchical organization.
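
As a rough illustration of the tagging ideas above, here is a minimal Python sketch. Everything in it (the TagCollection class, its method names, the is_admin flag) is hypothetical and invented for this example; it only shows administrator-created tags, a per-collection switch between categories and multiple tags, and a directory-like view built from tags rather than links.

```python
from collections import defaultdict


class TagCollection:
    """Hypothetical collection of objects organized by administrator-defined tags."""

    def __init__(self, single_tag: bool = False):
        self.single_tag = single_tag          # True: one tag per object (categories)
        self.allowed_tags = set()             # vocabulary managed by administrators
        self.tags_by_object = defaultdict(set)

    def create_tag(self, tag: str, is_admin: bool) -> None:
        # Only administrators may extend the vocabulary.
        if not is_admin:
            raise PermissionError("only administrators may create tags")
        self.allowed_tags.add(tag)

    def tag_object(self, obj: str, tag: str) -> None:
        if tag not in self.allowed_tags:
            raise ValueError(f"unknown tag: {tag}")
        if self.single_tag:
            self.tags_by_object[obj] = {tag}  # replace: behaves like a category
        else:
            self.tags_by_object[obj].add(tag)

    def virtual_listing(self) -> dict:
        """Return {tag: sorted object names}, a directory-like view without links."""
        view = defaultdict(list)
        for obj, tags in self.tags_by_object.items():
            for tag in tags:
                view[tag].append(obj)
        return {tag: sorted(objs) for tag, objs in view.items()}
```

An object carrying two tags simply shows up under both "directories" in the listing, which is the job that user-facing links do in a conventional hierarchy.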

The most difficult (and most interesting!) problem in this space is probably the access control problems. The organizational practices will vary a lot and there aren't right and wrong answers. This isn't true in the access control space.

Using public key infrastructure to encrypt data may be an effective access control method. It's hard to replicate contemporary access control in encryption schemes. Replicating these schemes may not be desirable either. Here are some ideas:

  • By default all files will be encrypted such that only the creator can read them. All data can then be "world readable," as far as the storage medium and underlying file systems are concerned.

  • The creator can choose to re-encrypt objects such that other users and groups of users can access the data. For organizations this might mean a tightly controlled central certificate authority-based system. For the public internet, this will either mean a lot of duplicated encrypted data, or a lot of key chains.

  • We'll need to give up on using public keys as a method of identity testing and verification. Key signing is cool, but at present it's complex, difficult to administer, and presents a significant barrier to entry. Keys need to be revocable, particularly group keys within organizations.

    For the public internet, some sort of social capital or network analysis based certification system will probably emerge to supplement strict web-of-trust based identity testing.

  • If all data is sufficiently encrypted, VPNs become obsolete, at least as methods for securing file repositories. Network security is less of a concern when content is actually secure. The processing overhead of encryption isn't a serious concern on contemporary hardware.
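
One way to picture the "key chains" option is to encrypt each object once with its own random content key, and then store a separately wrapped copy of that key for every permitted reader. The Python sketch below only models that bookkeeping, and it rests on loud assumptions: the cipher is a toy keystream built from hashlib that is NOT secure, per-user shared secrets stand in for real public/private key pairs, and all names (SealedObject, seal, share, open_sealed) are hypothetical. In a real system the content key would be wrapped with each reader's public key using a vetted cryptography library.

```python
import hashlib
import secrets
from dataclasses import dataclass, field


def _toy_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher (SHA-256 in counter mode). NOT secure; illustration only."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        block = hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        stream.extend(block)
        counter += 1
    # XOR is symmetric, so the same call encrypts and decrypts.
    return bytes(a ^ b for a, b in zip(data, stream))


@dataclass
class SealedObject:
    nonce: bytes
    ciphertext: bytes                                 # safe to store "world readable"
    wrapped_keys: dict = field(default_factory=dict)  # reader -> (nonce, wrapped key)


def _wrap(obj: SealedObject, content_key: bytes, reader: str, user_keys: dict) -> None:
    # Assumption: user_keys[reader] stands in for the reader's public key.
    wrap_nonce = secrets.token_bytes(16)
    obj.wrapped_keys[reader] = (wrap_nonce, _toy_xor(user_keys[reader], wrap_nonce, content_key))


def _unwrap(obj: SealedObject, reader: str, user_keys: dict) -> bytes:
    if reader not in obj.wrapped_keys:
        raise PermissionError(f"{reader} holds no key for this object")
    wrap_nonce, wrapped = obj.wrapped_keys[reader]
    return _toy_xor(user_keys[reader], wrap_nonce, wrapped)


def seal(plaintext: bytes, creator: str, user_keys: dict) -> SealedObject:
    """Encrypt so that, by default, only the creator can read the object."""
    content_key = secrets.token_bytes(32)
    nonce = secrets.token_bytes(16)
    obj = SealedObject(nonce, _toy_xor(content_key, nonce, plaintext))
    _wrap(obj, content_key, creator, user_keys)
    return obj


def share(obj: SealedObject, owner: str, new_reader: str, user_keys: dict) -> None:
    """Grant access by wrapping the content key again; the data is not duplicated."""
    _wrap(obj, _unwrap(obj, owner, user_keys), new_reader, user_keys)


def open_sealed(obj: SealedObject, reader: str, user_keys: dict) -> bytes:
    return _toy_xor(_unwrap(obj, reader, user_keys), obj.nonce, obj.ciphertext)
```

The point of the design is that sharing adds a small wrapped key per reader rather than a second encrypted copy of the data.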


Issue Tracking and the Health of Open Source Software

I read something recently that suggested that the health of an open source project and its community could be largely assessed by reviewing the status of the bug tracker. I'm still trying to track down the citation for this remark. The claim, basically, is that vital, active projects have regularly updated bugs that are clearly described, and that their bugs are easy to search and easy to submit.

I'm not sure that free software communities and projects can be so easily assessed or that conventional project management practices are the only meaningful way to judge a project's health. While we're at it, I don't know that it's terribly useful to focus too much attention or importance on project management. Having said that, the emergence of organizational structure is incredibly fascinating, and could probably tolerate more investigation.

As a starting point, I'd like to offer two conjectures:

  • First, that transparent issue tracking is a reasonably effective means of "customer service," or user support. If the bug tracker contains answers to questions that people encounter during use, it provides a productive way to resolve issues with the software and helps with self-service support. Obviously some users and groups of users are better at this than others.
  • Second, issue tracking is perhaps the best way to do bottom-up project/product management and planning in the open, particularly since these kinds of projects lack formal procedures and designated roles to do this kind of organizational work.

While the overriding goal of personal task management is to break things into the smallest manageable work units, the overriding goal of issue tracking systems is to track the most intellectually discrete issues within a single project through the development process. Thus, issue tracking systems have requirements that are either much less important in personal systems or actively counter-intuitive for other uses. They are:

  • Task assignment, so that specific issues can be assigned to different team members. Ideally this means a specific developer can "own" a specific portion of the project and actually be able to work and coordinate efforts on the project.
  • Task prioritization, so that important or crucial issues get attention before "nice to have" items are addressed.
  • Issue comments and additional attached information, to track progress and support information sharing among teams, particularly over long periods of time with asynchronous elements.
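
To make the three requirements concrete, here is a minimal Python sketch of an issue record. The Issue class, the numeric priority scale, and the work_queue helper are all assumptions made for illustration and are not taken from any particular tracker.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Issue:
    title: str
    priority: int = 3                  # assumed scale: 1 = crucial ... 5 = nice to have
    assignee: Optional[str] = None     # the developer who "owns" the issue
    comments: list = field(default_factory=list)

    def comment(self, author: str, text: str) -> None:
        # Comments accumulate over time, supporting asynchronous discussion.
        self.comments.append((author, text))


def work_queue(issues: list, developer: str) -> list:
    """Issues assigned to one developer, most crucial first."""
    mine = [issue for issue in issues if issue.assignee == developer]
    return sorted(mine, key=lambda issue: issue.priority)
```

None of these fields matter much in a personal task list, where there is one assignee and little discussion, which is part of why the two kinds of tools diverge.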

While it's nice to be able to integrate tasks and notes (this is really the core of org-mode's strength) issue tracking systems need to be able to accommodate error output and discussion from a team on the best solution, as well as discussion about the ideal solution.

The truth is that a lot of projects don't do a very good job of using issue tracking systems, despite how necessary and important bug trackers are. The prefabricated systems can be frustrating and difficult to use, and most of the minimalist systems [1] are hard to use in groups. [2] The first person to write a fully featured, lightweight, and easy to use issue tracking system will be incredibly successful. Feel free to submit a patch to this post if you're aware of a viable system along these lines.

[1] I'm thinking here of using ikiwiki or org-mode to track issues, but ditz suffers from the same core problem.
[2] Basically, they either sacrifice structure or concurrency features or both. Less structured systems either rely on a group of people to capture the same sort of information in a regular way (unlikely) or capture less information; neither option is tenable. Without concurrency (because they store things in single flat files) people can't use them to manage collaboration, which makes them awkward personal task tracking systems.

Create Better Task Items

I was paging through a list of things that I made for myself during a call I was in a few weeks ago, and was utterly dismayed by how useless the items were on the list. I wasn't sure what needed to be done, I couldn't remember what things meant, and I was left with the sinking suspicion that I had forgotten something crucial. I write this, in part, as a lesson to my past self on how to write good task list items.

Hopefully you'll find it useful.

Task items must be actionable. You need to be able to read the subject or summary and know: what the project is, what kind of work it is, what needs to be done, and what the very next thing you need to do is.

Tasks cannot be open ended. It's really tempting to write tasks in the form of "work on a project" or "make progress on email backlog," but don't. How do you know if you've done the task? Is all progress the same? Is the actual work activity plainly obvious from an open ended task description?

Tasks need to be concise. I'm a big fan of including some sort of status information and some sort of instruction and context with your tasks, but you need to be able to look at a task list and triage what to do next without thinking very much and without spending more than a few seconds deciphering messages from your past self. Write good summaries.

Try to organize your projects and tasks so that most of your task items are not dependent upon other items. Sometimes dependencies are unavoidable, but I find that if you're clever, you can chop things up into parallel tasks that are easier to work on but that accomplish the same goal. In some cases, long strings of dependent tasks can be just as troublesome as large open tasks, because in the moment they amount to clutter.

Also your feedback and suggestions from your own experience may be of interest to all of us! I look forward to hearing from you!

Saved Searches and Notmuch Organization

I've been toying around with the Notmuch Email Client, a nifty piece of software that provides a very minimalist and powerful email system inspired by the organizational model of Gmail.

Mind you, I don't think I've quite gotten it.

Notmuch says, basically, build searches (e.g. "views") to filter your email so you can process your email in the manner that makes the most sense to you, without needing to worry about organizing and sorting email. It has the structure for "tagging," which makes it easy to mark status for managing your process (e.g. read/unread, reply-needed), and the ability to save searches. And that's about it.

Functionally, tags and saved searches do the work that mailboxes do in terms of the intellectual organization of email. Similarly, the ability to save searches makes it possible to do a good measure of "preprocessing." In the same way that Gmail changes the email paradigm by saying "don't think about organizing your email, just do what you need to do," Notmuch says "do less with your email, don't organize it, and trust that the machine will be able to help you find what you need when the time comes."
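
The model described above can be sketched in a few lines of Python. This is not Notmuch's API or query syntax; the Message class, the saved_searches table, and the example addresses are hypothetical, and the sketch only shows that a "view" is a stored query over tagged messages rather than a place where mail is filed.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: str
    subject: str
    tags: set = field(default_factory=set)


# A saved search is just a named predicate; nothing is moved or filed.
saved_searches = {
    "inbox": lambda m: "unread" in m.tags,
    "to-answer": lambda m: "reply-needed" in m.tags,
    "from-lists": lambda m: m.sender.endswith("@lists.example.org"),
}


def run_search(name: str, messages: list) -> list:
    """Return the subjects of the messages matching a saved search."""
    matches = saved_searches[name]
    return [m.subject for m in messages if matches(m)]
```

A single message can appear in several views at once, which a mailbox-per-message scheme cannot express without copying.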

I've been saying variations of the following for years, but I think on some level it hasn't stuck for me. Given contemporary technology, it doesn't make sense to organize any kind of information that could conceivably be found with search tools. Notmuch proves that this works, and although I've not been able to transfer my personal email over, I'm comfortable asserting that notmuch is a functional approach to email. To be fair, I don't feel like my current email processing and filtering scheme is that broken, so I'm a bad example.

The questions that this raises, which I don't have particularly good answers for, are as follows:

  • Are there good tools for the "don't organize when you can search" crew, for non-email data? And I'm not just talking about search engines themselves (as there are a couple: xapian, namazu), or ungainly desktop GUIs (which aren't without utility), but the proper command-line tools, emacs interfaces, and web based interfaces?
  • Are conventional search tools the most expressive way of specifying what we want to find when filtering or looking for data? Are there effective improvements that can be made?
  • I think there's intellectual value created by organizing and cataloging information "manually," and "punting to search" seems like it removes the opportunity to develop good and productive information architectures (if we may be so bold.) Is there a solution that provides the ease of search without giving up the benefits that librarianism brings to information organization?

The Overhead of Management

Every resource, every person, every project, every machine you have to manage comes with an ongoing cost. This is just as true of servers as it is of people who work on projects that you're in charge of or have some responsibility for, and while servers and teammates present very different kinds of management challenges, working effectively and managing management costs across contexts is (I would propose) similar. Or at least similar enough to merit some synthetic discussion.

There's basically only one approach to managing "systems administration costs," and that's to avoid them as much as possible. This isn't to say that sys admins avoid admining, but rather that we work very hard to ensure that systems don't need administration. We write operating systems that administer themselves, we script procedures to automate most tasks as much as possible (the Perl programming language was developed and popularized for easing the administration of UNIX systems), and we use tools to manage larger systems more effectively.

People, time, and other resources cannot be so easily automated, and I think in response there are two major approaches (if we can create a somewhat false dichotomy for a moment):

On the one hand there's the school of thought that says "admit and assess management costs early, and pay them up front." This is the corporate model in many ways. Have (layers upon layers of) resources dedicated to managing management costs, and then let this "middle management" make sure that things get done in spite of the management burden. On servers this is spending a lot of time choosing tools, configuring the base system, organizing the file system proactively, and constructing a healthy collection of "best practices."

By contrast, the other perspective suggests that management costs should only be paid when absolutely necessary. Make things, get something working and extant, and then if something needs to be managed later, do it then and only as you need. On some level this is the philosophy behind the frequent preference for "working code" over "great ideas" in the open source world. [1] Though I think they phrase it differently, this is the basic approach that many hacker-oriented start ups have taken, and it seems to work for them. On the server, this approach is the "get it working" approach, and these administrators aren't bothered by having to go in every so often to "redo" how things are configured. I think on some level this kind of approach to "management overhead" grows out of the agile world and the avoidance of "premature optimizations."

But like all "somewhat false dichotomies," there are flaws in the above formulation. Mostly the "late management" camp is able to delay management most effectively by anticipating their future needs (either by smarts or by dumb luck) early and planning around that. And the "early management" camp has to delay some management needs or else you'd be drowned in overhead before you started: and besides, the MBA union isn't that strong.

We might even cast the "early management" approach as being "top down," and the "late management" camp as being "bottom up." If, you know, we were into that kind of thing. It's always tempting, particularly in the contemporary moment, to look at the bottom-up approach and say "that's really innovative and awesome, that's better," and view "top-down" organizations as "stodgy and old world," when neither does a very good job of explaining what's going on and there isn't inherent radicalism or stodginess in either organization. But it is interesting. At least mildly.

Thoughts? Onward and Upward!

[1] Alan Cox's "Cathedrals, Bazaars and the Town Council"

File System Metaphors

The file system is dead. Long live the file system.

We live in an interesting time. There are two sets of technologies that aim to accomplish two very different goals. On the one hand we have things like Amazon's S3, Hadoop, NoSQL, and a host of technologies that destroy the file system metaphor as we know it today. The future, if you believe it, lies in storing all data in some sort of distributed key/value store-based system. And then, on the other hand, we have things like "FUSE" that attempt to translate various kinds of interfaces and data systems onto the file system metaphor.

Ok, so the truth is that the opposition between the "let's replace file systems with non-file based data stores" folks and the "let's use the file system as a metaphor for everything" folks is totally contrived. How data is stored and how we interact with data are very different (and not always connected) problems.
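
A small Python sketch can show how loosely storage and interface are coupled. The FlatStore class below is hypothetical: it keeps everything in a flat key/value mapping, with no directories at all, and still answers an ls-style question by treating "/" in keys as a separator. It makes no claim about how S3, Hadoop, or FUSE actually work internally.

```python
class FlatStore:
    """Hypothetical flat key/value store whose keys merely look like paths."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> bytes:
        return self._data[key]

    def listdir(self, prefix: str = "") -> list:
        """Emulate ls: immediate children of a prefix, 'directories' end in '/'."""
        if prefix and not prefix.endswith("/"):
            prefix += "/"
        children = set()
        for key in self._data:
            if not key.startswith(prefix):
                continue
            rest = key[len(prefix):]
            if not rest:
                continue
            # Keep only the first path component below the prefix.
            head, sep, _ = rest.partition("/")
            children.add(head + sep)
        return sorted(children)
```

The hierarchy here exists only in the listing function, so the same stored data could just as easily be presented by tag, by date, or by search result.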

Let's lay down some principles:

  • There are (probably) more tools to interact with, organize, manage, and manipulate files and file system objects than there are for any other data storage system in contemporary technology.

  • Most users of computers have some understanding of file systems and how they work, though clearly the degree of understanding varies a great deal.

  • In nearly every case, only one system can have access to a given file system at a time. Given today's massively parallel computing and the size of computer networks (and the associated latency), this has become a rather substantial limitation.

  • From the average end user's perspective, it's probably the case that file systems provide too much flexibility, and can easily become disorganized.

  • There are all sorts of possible problems regarding consistency, backups, and data corruption that all data storage systems must address, but that present larger problems as file systems need to scale to handle bigger sets of data, more users, and attach to systems that are more geographically disparate.

    Given these presumptions, my personal biases and outlook, and a bit of extrapolation, here's a basic feature set for an "information storage system." These features will transcend the storage engine/interface boundary a bit. You've been warned.

  • Multiple people and systems need to be able to access and edit the same objects concurrently.

  • Existing tools need to be able to work in some capacity. Perhaps using FUSE-like systems. File managers, mv, ls, and cp should just work, etc.

  • There ought to be some sort of off-network capability so that a user can lose a network connection without losing access to his or her data.

  • Search indexing and capabilities should be baked into the lowest levels of the system so that people can easily find information.

  • There ought to be some sort of user facing meta-data system which can affect not just sort order, but also attach to actions, to create notifications, or manipulate the data for easier use.

These sorts of features are of course not new ideas. My sygn project is one example, as is haven, as is this personal information management proposal.

Now all we need to do is figure out some way to build it.

The Things I'm Going To Do Today

Ok, so not really.

This post is mostly about playing a head game with yourself, in an effort to get more organized. But not "head games" in a bad way. On my to do list for the past few weeks I've had something like "write a blog post about todo list item titles," because in light of this post about org-mode it seems like a topic in need of further definition. Basically my goal is to explore the best way to think about what we have to do, to allow us to accomplish what we want to. The GTD system, which so many people are enamored of, presents a few ideas on the topic, and while the GTD way is a good place to start thinking, it's not a good place to stop thinking.

We've all done it. Made a todo list that we didn't end up using for one reason or another. Todo lists need to be useful: they should help us organize our day, and help us keep track of all the things we need to accomplish. In a lot of ways, maintaining focus over our day and keeping track of all of the tasks that nag at us are contradictory goals, so todo lists are failed by design.

The first, and most frequent, issue in my own organization is lists and plans that go too far and list too many "actionable items." This divides your time and actions into too many little pieces, leading to a number of outcomes. The first risk is that you might start to ignore the list entirely because it's too long and complicated, even if that's an illusion caused by the size of the items on your list. Ergo, the total length of the list you "work off of" needs to be manageable and comprehensible.

The second risk is that an overactive todo list is one where you over-plan for yourself, such that your list--while accurate and comprehensible--isn't useful. Beyond simply providing "outboard memory," the best todo lists allow us to structure and make plans for our working time. When working (e.g. writing, at the computer, etc.), I like to have my projects chopped up into pieces that can conceivably get done in the time I have to work on them, but that give me the time and freedom to bury myself in a side project, or follow inspiration or a train of thought to its completion when needed. While effective todo lists help you structure your time, flexibility is still valuable.

There are issues on the other end of the spectrum as well: when lists are too short, and the "actionable items" on a list are too conceptually large, the effectiveness of lists is degraded as well. A reminder to "write a novel," even a specific novel, is less than helpful for helping you accomplish something in the moment. Even a dozen items, on a list where you end up checking something off once every day or two, doesn't help you figure out "Ok, what do I need to work on now." Besides, chances are that if the items are too large and the list is too small, you probably have it memorized anyway.

Right? Other strategies?

Managing Management Costs

Every system that requires your attention and responsibility comes with some sort of "management cost." This includes servers that run websites and email, as well as the notes you take and--in my case--the novels you avoid writing.

This post, and really the last one as well, grows out of my interest and desire to stay organized, to work effectively without spending too much time and energy thinking about organization. Except of course that I write a bunch about this sort of thing on the blog, so maybe I'm a bad example of success. At the end of the day we're all just folk', I guess.

The argument at the present moment revolves around consolidation rather than an approach to design or organization. And the basic premise is: "no matter how complex your organizational problem is, you can probably accomplish what you need to by doing less."

  • Feel like you spend too much time reading email, or have too many email inboxes to check (personal email, work email, special project email, listserv email, facebook email, etc.)? Forward your email into one box and filter the hell out of it so that you only read what you really have to and it's manageable.
  • Feel like you have too many todo lists? Compile them into a single list and use some sort of tag system to organize it.
  • Feel like your notes and documents are scattered in too many places? Combine them and use some sort of search tool to find things when you need them.

And so forth. In the analog information world (i.e. with papers, notebooks, and books) we often take the approach of sorting things into distinct piles of similar sorts of things, and arranging things physically in our worlds to reflect this basic sorting. For instance, "the science fiction books will be on the first three shelves, the 20th century philosophy on the next three, college textbooks on the next, and [...]" These habits, combined with unfortunate conventions like referring to hierarchical organizational units of a file system (e.g. directories) as "folders," encourage us to translate these real-world conventions to our digital existences. This is undoubtedly a bad idea.

The more data you pile together in one place, even dissimilar data, the more powerful it becomes. Say you have a PDF collection of articles on the anthropology of death and dying, post-colonial literature, and linguistics hanging out in different directories of your file system, and you begin to do research for a story you want to write set in the 1930s in India. Where do you look? What if there are relevant articles in all three folders? What if you have a dozen or two dozen folders? What if you have a number of hierarchical organizational trees, and you store your notes, the actual text of what you're working on, and your reference materials separately with parallel hierarchies? [1] Quite suddenly you're over-organized and disorganized all at the same time.
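
To illustrate why one searchable pile beats several sorted ones, here is a minimal inverted-index sketch in Python. It assumes the text of each document has already been extracted into a string (getting text out of PDFs is out of scope), and the function names and sample data are hypothetical.

```python
import re
from collections import defaultdict


def build_index(documents: dict) -> dict:
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index: dict, query: str) -> list:
    """Return the documents containing every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    # Intersect the posting sets; a missing word yields an empty result.
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)
```

A query like "india" returns matches regardless of which folder an article was filed in, so the filing decision stops mattering at lookup time.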

The more "system" you have the more difficult it is to manage. The key to success, or part of it at any rate, is being minimalist about your organization. Recognize that adding responsibilities, projects, directories, lists, email accounts, and so forth all come with a cost. And sometimes, being a little less organized means that you're able to get more done, if that makes sense.

If your experiences reflect this (or run contrary to this logic,) I'd be very interested in hearing about how you have solved, and have continued to solve the issue.

[1] This kind of system actually makes a lot of sense in the paper world, but is borderline absurd in digital systems.