Denormalize Access Control

Access control is both immensely useful and incredibly broken.

Access control, or the ability to constrain access to data and programs in a shared system, is the only way that we, as users of shared systems, can maintain our identities, personal security, and privacy. Shared systems--databases, file servers, social networking sites, virtualized computing systems, vendor accounts, control panels, management tools, and so forth--all need robust, flexible, granular, and scalable access control tools.

Contemporary access control tools--access control lists (ACLs) and access control groups--and indeed the entire conceptual framework for managing access to data and resources, don't work. From a theoretical perspective, ACLs, which express a relationship between users or groups of users and data or resources, represent a parsimonious solution to the "access control problem": if properly deployed, only those with access grants will have access to a given resource.

In practice, these kinds of relationships do not work. Typically the relationships between data and users are rich and complex, and different users need to be able to do different things with different resources. Some users need "read only" access, others need partial read access, some need read and write access but only to a subset of a resource. While ACL systems can impose these kinds of restrictions, the access control abstraction doesn't match the data abstraction or the real-world relationships that it supposedly reflects.

Compounding this problem are two important factors:

  1. Access control needs change over time in response to social and cultural shifts among the users and providers of these resources.
  2. There are too many pieces of information or resources in any potential shared system to allocate access on a per-object or per-resource basis, and the volume of objects and resources is only increasing.

Often many objects or resources have the same or similar access control patterns, which leads to the "group" abstraction. Groups make it possible to describe a specific access control pattern that applies to a number of objects, and to connect this pattern with specific resources.
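
To make the conventional abstraction concrete before critiquing it, here is a minimal sketch in Python; the names and tables are invented for illustration. Permissions attach to groups, groups contain users, and answering "can this user do this to this resource?" means chasing the user-to-group and resource-to-ACL relationships:

GROUPS = {
    "editors": {"alice", "bob"},
    "admins": {"alice"},
}

ACLS = {
    # resource -> list of (group, permission) grants
    "/reports/q3.doc": [("editors", "read"), ("admins", "write")],
}

def can(user, permission, resource):
    # Answering the question means chasing two relations:
    # user -> groups, and resource -> ACL entries.
    for group, granted in ACLS.get(resource, []):
        if granted == permission and user in GROUPS.get(group, set()):
            return True
    return False

print(can("bob", "read", "/reports/q3.doc"))   # True
print(can("bob", "write", "/reports/q3.doc"))  # False

Even in this toy, nesting (groups containing groups, resources inside folders) would force the look-up to recurse, which is where the trouble described below begins.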

Conceptual deficiencies:

  • There's a volume problem. Access control data represents a many-to-many-to-many relationship: there are many different users and (nested) groups, many different kinds of access that systems can grant, and many different (nested) resources. Without nesting, the sheer volume of grants would be unmanageably complex; with nesting, the relationships between resources and between groups and users also become significant, and reasoning about who can actually access what becomes practically impossible.

  • ACLs and group-based access control don't account for the fact that access needs constantly evolve, and current systems don't contain support for ongoing maintenance. (We need background processes that go through and validate access control consistency.) All access control grants must also have some capacity for automatic expiration.

  • Access control requirements and possibilities shift as data becomes more or less structured, and as data use patterns change. The conceptual framework that works well for access control over data stored in a relational database doesn't work as well when the data in question is a word processing document, an email folder, or a spreadsheet.

    The fewer people need access to a single piece of data, the simpler the access control system can be. While this seems self-evident, it also means that access control systems are difficult to test at the scale of the really large, complex systems in which they're actually used.

  • Group-based access control systems, in effect, normalize data about access control in an effort to speed up access decisions. While this performance is welcome, in most cases granting access via groups leads to an overly liberal distribution of access rights. At once, it's too difficult to understand "who has access to what" and too easy to add people to groups that give them more access than they need.

So the solution:

  1. Denormalize all access control data,
  2. don't grant access to groups, and
  3. forbid inheritance.

This runs totally counter to the state of the art. In most ways, normalized access control data, with role- or group-based access control and complex inheritance, is the gold standard. Why would denormalization work?

  • If you have a piece of data, you can always determine who has access to it, without needing to do another look-up.

  • If you can deactivate credentials, then a background process can go through and remove access without causing a large security problem. (For partial removals, you would freeze an account, let the background process modify access control, and then unfreeze the account. There's a sketch of this after the list.)

    The downside is that, in a large system, it may take a rather long time for changes to access grants to propagate. Locking user accounts keeps the system secure and viable, but doesn't make the process any quicker.

    As an added bonus, these processes could probably be independent and wouldn't require any sort of shared state or lock, which means many such operations could run in parallel, and they could stop and restart at will.

  • The prohibition on inheritance should be fuzzy. Some sort of "bucket-based" access control should be possible when there's a lot of data with the same access control rules and the same users.

    Once things get more complex, buckets are the wrong metaphor, and you should use granular controls everywhere.
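
Here is that sketch: a minimal illustration of the idea in Python, where the data layout and names are invented for the purpose--a sketch of the approach, not a production design. Every object carries its own complete grant list, every grant expires, and a background sweep freezes an account, rewrites grants object by object, and unfreezes it:

import time

documents = {
    "/reports/q3.doc": {
        "body": "...",
        "grants": {
            # user -> (permission, expiration timestamp): no groups,
            # no inheritance, and every grant eventually lapses.
            "alice": ("write", time.time() + 86400),
            "bob": ("read", time.time() + 3600),
        },
    },
}

frozen = set()  # accounts locked while a sweep rewrites their grants

def can(user, permission, path):
    # One local check: the object itself knows who may touch it.
    if user in frozen:
        return False
    grant = documents.get(path, {}).get("grants", {}).get(user)
    if grant is None:
        return False
    granted, expires = grant
    return granted == permission and time.time() < expires

def revoke_sweep(user):
    # Freeze the account, strip (or rewrite) its grants object by
    # object, then unfreeze. No central table, no shared lock.
    frozen.add(user)
    for doc in documents.values():
        doc["grants"].pop(user, None)
    frozen.discard(user)

print(can("bob", "read", "/reports/q3.doc"))  # True
revoke_sweep("bob")
print(can("bob", "read", "/reports/q3.doc"))  # False

Because each object's grant list is self-contained, one sweep never blocks another: sweeps over disjoint sets of objects can run in parallel and restart at will, exactly as described above.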

Problems/Conclusion:

  • Denormalization might fix the problems with ACLs and permissions systems, but it doesn't fix the problems with distributed identity management.

    As a counterpoint, this seems like a cryptography management problem.

  • Storing access control information with the data means that it's difficult to take a user's credentials and return a list of everything those credentials have access to.

    In truth, centralized ACL systems are subject to this flaw as well.

  • A huge part of the problem with centralized ACLs derives from nesting, and from the fact that we tend to model and organize data in tree-like structures that often run counter to the organization of access control rights. As a result, access control tools end up feeling arbitrary.

The Structured and Unstructured Data Challenge

The Debate

Computer programmers want data to be as structured as possible: if you don't give users a lot of room to do unpredictable things, it's easier to write software that does cool things. Users, on the other hand, want (or think that they want) total control over data and the ability to do whatever they want.

The problem is that they don't. Most digital collateral, even content stored in unstructured formats, is pretty structured. While people may want freedom, they don't use it, and in many cases users go through a lot of effort to recreate structure within unstructured forms.

Definitions

Structured data are data stored and represented in tabular form, or as some sort of hierarchical tree, that is easily parsed by computers. By contrast, unstructured data are things like files, where all of the content is organized manually within the file and written to durable storage by hand.

The astute among you will recognize that there's an intermediate category, where largely unstructured data is stored in a database. This happens a lot in content management systems, in mobile device applications, and in a lot of note taking and project management applications. There's also a parallel semi-structured form, where people organize their writing, notes, and content in a regular and structured manner even though the tools they're using don't require it. They'd probably argue that this was "best practice" rather than "semi-structured" data, but it probably counts.
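
As a contrived illustration of these categories (the example data is invented), here is the same piece of information as an unstructured blob, as a structured record, and in a "semi-structured" form whose writer-imposed convention a program could exploit even though nothing enforces it:

# The same information at three levels of structure.

# Unstructured: a free-form blob. A program can store and display it,
# but can't reliably answer "what's due Friday?" without guessing.
unstructured = "Remember to draft the Q3 report for Sam; it's due Friday."

# Structured: tabular/hierarchical, and trivially queryable.
structured = {"task": "draft the Q3 report", "due": "Friday", "for": "Sam"}

# Semi-structured: still plain text, but written to a regular
# convention that software *could* parse, though nothing requires it.
semi_structured = "task: draft the Q3 report | due: Friday | for: Sam"
parsed = dict(part.split(": ", 1) for part in semi_structured.split(" | "))

assert parsed["due"] == "Friday"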

The Impact

The less structured content or data is, the less computer programs are able to do with it, and the harder people have to work to make the data useful to them. So while we as users want freedom, that freedom doesn't get us very far, and we don't really use it even when we have it. Relatedly, I think we could read the crux of the technological shift in Web 2.0 as a move toward more structured forms, and the "mash up" as the celebration of a new "structured data."

The lines around "semi-structured" data are fuzzy. The trick is probably to figure out how to give people just enough freedom that they don't feel encumbered by the requirements of the form, but not so much freedom that software developers are unable to do really smart things behind the scenes. That's going to be difficult to implement, and I think the general theme of this progress is "people can handle more structure than they think, and developers should err on the side of stricture."

Current Directions

Software like org-mode and twiki attempts to leverage structure within unstructured forms, and although the buzz around enterprise content management (ECM) has started to die down, there is a huge collection of software that attempts to impose some sort of order on the chaos of unstructured documents and information. ECM falls short probably because it's not structured enough: it mandates a small amount of structure (categories, some metadata, perhaps validation and workflow), which doesn't provide significant benefit relative to the amount of time it takes to add content to these repositories.

There will be more applications that bridge the structure boundary, and begin to allow users to work with more structured data in a productive sort of way.

On a potentially orthogonal note, I'm working on cooking up a proposal for a LaTeX-based build system for non-technical document production that might demonstrate--at least hypothetically--how much structure can help people do awesome things with technology. I'm calling it "A LaTeX Build System."

I'd love to hear what you think, either about this "structure question," or about the LaTeX build system!

Saved Searches and Notmuch Organization

I've been toying around with the Notmuch Email Client, a nifty piece of software that provides a very minimalist and powerful email system inspired by the organizational model of Gmail.

Mind you, I don't think I've quite gotten it.

Notmuch says, basically: build searches (e.g. "views") to filter your email, so you can process it in the manner that makes the most sense to you, without needing to worry about organizing and sorting it. It has the structure for "tagging," which makes it easy to mark status for managing your process (e.g. read/unread, reply-needed), and the ability to save searches. And that's about it.

Functionally, tags and saved searches do the work that mailboxes do in terms of intellectual organization. Similarly, the ability to save searches makes it possible to do a good measure of "preprocessing." In the same way that Gmail changed the email paradigm by saying "don't think about organizing your email, just do what you need to do," notmuch says "do less with your email: don't organize it, and trust that the machine will be able to help you find what you need when the time comes."
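
The model is simple enough to sketch in a few lines. This toy version is in Python and is explicitly not notmuch's actual interface; it just illustrates the idea that a "mailbox" is nothing more than a named query over tagged messages, evaluated on demand:

# Toy model of the notmuch idea: no folders, just tags plus
# named queries ("saved searches") evaluated when you ask.

messages = [
    {"subject": "meeting notes", "tags": {"inbox", "unread"}},
    {"subject": "patch review", "tags": {"inbox", "reply-needed"}},
    {"subject": "old newsletter", "tags": {"archive"}},
]

saved_searches = {
    "inbox": lambda m: "inbox" in m["tags"],
    "todo": lambda m: "reply-needed" in m["tags"],
}

def view(name):
    # A "mailbox" is just the result of running a saved search.
    return [m["subject"] for m in messages if saved_searches[name](m)]

print(view("todo"))  # ['patch review']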

I've been saying variations of the following for years, but I think on some level it hasn't stuck for me: given contemporary technology, it doesn't make sense to organize any kind of information that could conceivably be found with search tools. Notmuch proves that this works, and although I've not been able to transfer my personal email over, I'm comfortable asserting that notmuch is a functional approach to email. To be fair, I don't feel like my current email processing and filtering scheme is that broken, so I'm a bad example.

The questions that this raises, which I don't have particularly good answers for, are as follows:

  • Are there good tools for the "don't organize when you can search" crew for non-email data? And I'm not just talking about the search engines themselves (there are a couple: xapian, namazu) or ungainly desktop GUIs (which aren't without utility), but proper command-line tools, emacs interfaces, and web-based interfaces.
  • Are conventional search tools the most expressive way of specifying what we want to find when filtering or looking for data? Are there effective improvements that can be made?
  • I think there's intellectual value created by organizing and cataloging information "manually," and "punting to search" seems like it removes the opportunity to develop good and productive information architectures (if we may be so bold). Is there a solution that provides the ease of search without giving up the benefits that librarianship brings to information organization?

Creating Useful Archives

I've done a little tweaking to the archives for dialectical futurism recently, including creating a new archive for science fiction and writing, and, being who I am, this has inspired a little thought regarding the state and use of blog archives.

The latest iteration of this blog has avoided the now-common practice of having large, endless lists of posts organized by publication month or by haphazardly assigned category and tag systems. While these succeed at providing a complete archive of every post written, they don't add any real value to a blog or website. I'm convinced that one feature of successful blogs moving forward will be archives that are curated and that convey additional value beyond the content of the site.

Perhaps blogs as containers for a number of posts will end up being more ephemeral than I'm inclined to think of them, and will therefore not require very much in the way of archives. Perhaps Google's index will be sufficient for most people's uses. Maybe. I remain unconvinced.

Heretofore, I have made archives for tychoish as quasi-boutique pieces: collections of the best posts that address a given topic. This is great from the perspective of thinking about blog posts as a collection of essays, but I've started to think that this may be less useful if we think of blogs as collections of resources that people might want to access beyond their initial ephemeral form.

Right now my archives say "see stuff from the past few months, and several choice topics on which I wrote vaguely connected sequences of posts." The problem with the list of posts from the last few months is that there's not a lot of useful information beyond the title and the date. The problem with the topical archives is that they're not up to date, they're not comprehensive even for recent posts, and there's little "preview" of a given post beyond its title. In the end, I think the likelihood of visiting a topical archive looking for a specific post and not finding it is pretty high.

In addition to editorial collecting, I think archives, guides, or indexes of a given body of information ought to provide some sort of comprehensive method for accessing that information. There has to be some middle ground.

I think the solution involves a lot of hand mangling of content, templates, and posts. I'm fairly certain that my current publication system is not up to the task without a fair amount of mangling and beating. As much as I want to think that this is a problem in search of the right kind of automation, I'm not sure that's really the case. I'm not opposed to editing things by hand, but it would significantly increase the amount of work in making any given post.

There is, I suspect, no easy solution here.

Organize Your Thoughts More Betterly

I've been working with a reader and friend on a project to build a tool for managing information for humanities scholars and others who deal with textual data, and I've been thinking about the problem of information management a bit more seriously. Unlike numerical or more easily categorized data, how to manage a bunch of textual information--either of your own production or a library of your own collection--is far from a solved problem.

The technical limitation--from a pragmatic perspective--is that you need to have an understanding not only of the specific tasks in front of you, but a grasp of the entire collection of information you work with in order to effectively organize, manage, and use the texts as an aggregate.

"But wait," you say. "Google solved this problem a long time ago, you don't need a deterministic information management tool, you need to brute force the problem with enough raw data, some clever algorithms, and search tools," you explain. And on some level you'd be right. The problem is of course, you can't create knowledge with Google.

Google doesn't give us the ability to discover information that's new or powerful. Google works best when we know exactly what we're looking for; the top results are most likely to be the resources that the most people already know and are familiar with. Google is good, useful, and a wonderful tool that more people should probably use, but it cannot lead you into novel territory.

Which brings us back to local information management tools. When you can collect, organize, and manipulate data in your own library, you can draw novel conclusions. When the information is well organized, and you can survey a collection in useful and meaningful ways, you can see holes and collect more, and you can search tactically within subsets of articles. I've been talking for more than a year about the utility of curation in the creation of value on-line, and fundamentally I think the same holds true for personal information collections.

Which brings us back to the ways we organize information, and my firm conclusion that we don't have a really good way of organizing it. Everything that I'm aware of either relies on search, and therefore only allows us to find what we already know we're looking for, or requires us to understand our final conclusions during the preliminary phase of our investigations.

The solution to this problem is thus twofold. First, we need tools that allow us to work with and organize the data for our projects, full stop: wikis and never-ending text files don't really address all of the different ways we need to work with and organize information. Second, we need tools that are tailored to the way researchers who deal in text work with information, from collection and processing to quoting and citation, rather than focusing on the end stage of this process. These tools should allow our conceptual framework for organizing information to evolve as the project evolves.

I'm not sure what that looks like for sure, but I'd like to find out. If you're interested, do help us think about this!

(Also, see this post `regarding the current state of the Cyborg Institute <http://www.cyborginstitute.com/2010/06/a-report-from-the-institute/>`_.)

Strategies for Organizing Wiki Content

I've been trying to figure out wikis for a long time. It always strikes me that the wiki is probably the first truly unique (and successful) textual form of the Internet age. And there's a lot to figure out. The technological innovation of the wiki is actually remarkably straightforward, [1] and the community building aspects of wikis, while difficult, are also reasonably well understood. [2] The piece of the wiki puzzle that I can't nail down in a pithy sentence or two is how to organize information effectively on a wiki.

That's not entirely true.

The issue, I think, is that there are a number of different ways to organize content for a wiki, no one organizational strategy seems to be absolutely perfect, and I've never been able to settle on a way of organizing wiki pages that I am truly happy with. The goals of a good wiki "information architecture" (if I may be so bold) are as follows:

  • Clarity: It should be immediately clear to the readers and writers of a wiki where a page should be located. If there's hierarchy, it needs to fit your subject area perfectly and require minimal effort to grok. You want people to focus on the content rather than the organization, and we don't tend to focus on organizational systems when they're clear.
  • Simplicity: Wikis have a great number of internal links and can be (and are) indexed manually as needed, so as the proprietor of a wiki you probably need to do a lot less "infrastructural work" than you think you do. Less is probably more in this situation.
  • Intuitiveness: Flowing from the above, wikis ought to strive to be intuitive in their organization. Pages should answer the questions that people have, and then provide additional information out from there. One shouldn't have to dig in a wiki to find pages, and if there are categories or some sort of hierarchy, there shouldn't be overlap at the tips of the various trees.

Strategies that flow from this are:

  • In general, write content on a very small number of pages, and expand outward as you have content for those pages (by chopping up existing pages as it makes sense, and using this content to spur the creation of new pages).
  • Use one style of links/hierarchy (wikish and ciwiki fail at this). You don't want people to think: Should this be a camel case link? Should this be a regular one-word link? Should this be a multiple-word link with dash-separated or underscore-separated words? One convention to rule them all.
  • Realize that separate hierarchies of content within a single wiki effectively create separate wikis and sites within a single wiki, and that depending on your software, it can be non-intuitive to link between different hierarchies.
  • As a result: use as little hierarchy and structure as possible. Hierarchy creates places where things can go wrong and where confusion can happen. At some point you'll probably need infrastructure to help make the navigation among pages more intuitive, but that point is always later than you think it's going to be.
  • Avoid reflexivity. This is probably generalizable to the entire Internet, but in general people aren't very interested in how things work and the way you're thinking about your content organization. They're visiting your wiki to learn something or share some information, not to think through the meta crap with you. Focus on that.
  • Have content on all pages, and have relatively few pages which only serve to point visitors at other pages. Your main index page is probably well suited as a traffic intersection without additional content, but in most cases you probably only need a very small number of these pass through pages. In general, make it so your wikis have content everywhere.

... and other helpful suggestions which I have yet to figure out. Any suggestions from wiki maintainers?

[1]There are a number of very simple and lightweight wiki engines, including some that run in only a few lines of Perl. Once we had the tools to build dynamic websites (CGI, circa 1993/1994), the wiki became a trivial implementation.
[2]The general principle of building a successful community-edited wiki is basically to pay attention to the community in the early stages. Your first few contributors are very important, contributions have to be invited and nurtured, and communities don't just happen. In the context of wikis, in addition to supporting the first few contributors, the founders also need to construct a substantive seed of content.

fact files

I wrote a while back about wanting to develop a "fact file," or some way of creating a database of notes and clippings that wouldn't (need to be) project-specific research, but that I would nonetheless like to keep track of. Part of the notion was that I felt like I was gathering lots of information and reading lots of stuff, but didn't really have any good way of retaining this information beyond whatever I just happened to remember.

I should note that this post is very org-mode focused, and I've not subtitled very much. You've been warned.

Ultimately I developed an org-remember template, and I documented that in the post linked to above.

Since then, however, I've changed things a bit, and I wanted to publish that updated template.

(setq org-remember-templates
      ;; Each template: (name selection-char template target-file headline).
      '(("annotations" ?a
         ;; Prompt for a title and tags; record the date, a cite-key for
         ;; the BibTeX database, and a link as properties.
         "* %^{Title} %^g \n  :PROPERTIES:\n  :date: %^t\n  :cite-key: %^{cite-key}\n  :link: %^{link}\n  :END:\n\n %?"
         "~/org/data.org" "Annotations and Notes")
        ("web-clippings" ?w
         ;; %x inserts the X11 clipboard; %? leaves the cursor there.
         "* %^{Title} %^g \n  :PROPERTIES:\n  :date: %^t\n  :link: %^{link}\n  :END:\n\n %x %?"
         "~/org/data.org" "Web Clippings")
        ("fact-file" ?f
         "* %^{Title} %^g \n  :PROPERTIES:\n  :date: %^t\n  :link: %^{link}\n  :END:\n\n %x %?"
         "~/org/data.org" "Fact File")))

What this does reflects something I noticed in the way I was using the original implementation. I was collecting quotes from a variety of Internet sources as well as published sources. Not everything had a cite-key (a key that tracks the information in my BibTeX database), and I found that I also wanted to save copies of blog posts and other snippets that I found useful and interesting, but that still didn't seem to qualify as "fact file" entries.

So now there are three templates:

  • First, annotations of published work, all cross referenced against cite-keys in the bibtex database.
  • Second, web clippings: this is where I put blog posts and other articles which I think will be interesting to revisit and important to archive independently for offline/later reading. Often, if I respond to a blog post on this blog, chances are that post has made it into this section of the file.
  • Third, miscellaneous facts: these are just quotes, in general--interesting facts that I pull from Wikipedia or wherever, but nothing teleological, particularly. It's good to have a place to collect unstructured information, and I've found the collection of information in this section of the file to be quite useful.

General features:

  • Whatever text I select (and therefore add to the X11 clipboard) is automatically inserted into the remember buffer (with the %x escape), and the %? escape places the cursor.
  • I make copious use of tags and tag completion, which makes it easy to use the "sparse tree by tag" functionality in org-mode to display just the headings that are tagged in a certain way, so that I can see related content easily. Tags include both subject and project-related information for super-cool filtering.
  • All "entries" exist on the second level of the file. I'm often sensitive to using too much hierarchy, at the expense of clarity or ease of searching. This seems to be particularly the case in org-mode, given the power of sparse trees for filtering content.

So that's what I'm doing. As always, alternate solutions and feedback are more than welcome.

Pragmatic Library Science

Before I got started down my current career path--that would be the information management/work flow/web strategy/technology and cultural analyst path--I worked in a library.

I suppose I should clarify somewhat, as the image you have in your mind is almost certainly not accurate, both of what my library was like and of the kind of work I did.

I worked in a research library at the big local (private) university, not in the part of the library where students went to get their books, but in the "overflow area" where the special collections, the book preservation unit, and the catalogers all worked. What's more, the unit I worked with had an archival collection of film/media resources from a few documentary film makers/companies, so we didn't really have books either.

Nevertheless, it was probably one of the most instructive experiences I've had. There are things about the way archives work, particularly archives with difficult collections, that no one teaches you in those "how to use the library" and "welcome to the Library of Congress/Dewey Decimal classification systems" lessons you get in grade school and college. The highlights?

  • Physical and intellectual organization - While archives keep track of, and organize, all sorts of information about their collections, the organization of this material "on the shelf" doesn't always reflect this.

    Space is a huge issue in archives, and as long as you have a record of "where" things are, there's a lot of incentive to store things in the way that takes up the least amount of physical space: store photographs separately from oversized maps, separately from file boxes, separately from video cassettes, separately from CDs (and so forth).

  • "Series" and intellectual cataloging - This took me a long time to get my head around, but archivists have a really great way of taking a step back, looking at the largest possible whole, and then creating an ad-hoc organization and categorization of that whole, so as to describe it in maximum detail and make particular things easier to find: letters from a specific time period, pictures from another era.

  • An acceptance that perfection can't be had - Perhaps this is a symptom of working with a collection that had only been archived for several years, or one that had been established with one large gift rather than as a depository for a working collection. In any case, our goal--it seemed--was to take what we had and make it better: more accessible, more clearly described, easier to process later, rather than to make the whole thing absolutely perfect. It's a good way to think about organizational projects.

In fact, a lot of what I did was take files that the film producers had on their computers and make them useful. I copied disks off of old media, I took copies of files and (in many cases, manually) converted them to usable file formats, and I created indexes of digital holdings. Stuff like that. No books were harmed or affected in these projects, and yet I think I was able to make a productive contribution to the project as a whole.

The interesting thing, I think, is that when I'm looking through my own files, or helping other people figure out how to manage all the information--data, really--they have, I find that it all boils down to the same sorts of problems that I worked with in the library: how to balance "work-spaces" with storage spaces; how to separate intellectual and physical organization; how to create usable catalogs and indexes of a collection; how to lay everything down so that you can, without "hunting around," lay your hands on anything in your collection in a few moments; and ultimately how to do all of this without spending very much energy on "upkeep."

Does it make me a dork that I find this all incredibly interesting and exciting?