Beyond SQL and Database Technology

People have been thinking about databases recently. Even I’ve been thinking about databases, and I’m not particularly prone to thinking about databases. It’s fair, given the ongoing drama of the Oracle/Sun deal and even mainstream press coverage of the NoSQL movement. I’d like to take a step back and think a bit more honestly and holistically about the database application, about this “NoSQL” phenomenon, and about the evolving role of relational database management systems in our technology “ecosystems.”

(Seriously folks this is what I think about for fun in my free time.)

I’ve been mulling over the notion that databases, like MySQL and PostgreSQL and Oracle’s RDBMS products, are not particularly “Unix-like.” Sure, they run on Unix systems, and look and feel like Unix applications, but the niche they fill--providing quick access to structured data with a specialized query language--doesn’t jibe with the Unix philosophy: small specialized tools for precise tasks, “plain text” as the lingua franca of system tools, and so forth.

Databases solve a problem. Indeed, they solve a problem in a very functional and workable manner. I don’t want to suggest that the relational database model is somehow broken; however, I would like to suggest that industrial-strength database systems are overused, and have become the go-to solution for storing and interacting with data of any kind, even in cases where they’re not a good fit for the job at hand.

I’m not the first person to suggest this, not by a long shot. The NoSQL “movement” addresses this issue from a couple of different directions. It’s true that NoSQL refers to a collection of practices and approaches for storing data that go above and beyond the type and model of a traditional database system. In the end, NoSQL is about addressing the scaling problem: what happens when we have so much data that it can’t easily fit in one database system, or when a centralized model is unmaintainable for any number of reasons. I think NoSQL is also relevant as we think about storing data that doesn’t easily fit into RDBMSes: I’ve seen a lot of very poorly architected database systems that suffer from a “square peg in round hole” problem.

Indeed, as we try to put all of our data in these RDBMS systems, particularly data that doesn’t fit very well, these databases lose their ability to scale. The complex logic required to pull more complex data back out of a database and reassemble it for use and analysis is computationally expensive and doesn’t scale particularly well.
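To make the “reassembly” cost concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The schema (posts, tags, comments) and the load_post helper are made up for illustration, not taken from any real system. The point is only that one nested record becomes three tables and three queries, plus glue code to stitch the results back together.

```python
import sqlite3

# Hypothetical schema: one blog post with tags and comments, split across
# three tables the way a normalized relational design would store it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts    (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE tags     (post_id INTEGER, tag TEXT);
    CREATE TABLE comments (post_id INTEGER, author TEXT, body TEXT);
    INSERT INTO posts VALUES (1, 'Beyond SQL');
    INSERT INTO tags VALUES (1, 'databases'), (1, 'unix');
    INSERT INTO comments VALUES (1, 'joe', 'What about haven?');
""")


def load_post(conn, post_id):
    """Reassemble one nested document from three separate queries."""
    title = conn.execute(
        "SELECT title FROM posts WHERE id = ?", (post_id,)
    ).fetchone()[0]
    tags = [
        row[0]
        for row in conn.execute(
            "SELECT tag FROM tags WHERE post_id = ? ORDER BY tag", (post_id,)
        )
    ]
    comments = [
        {"author": author, "body": body}
        for author, body in conn.execute(
            "SELECT author, body FROM comments WHERE post_id = ?", (post_id,)
        )
    ]
    return {"title": title, "tags": tags, "comments": comments}


post = load_post(conn, 1)
```

With one post this is trivial; the concern in the paragraph above is what happens when every read of every record needs this kind of stitching.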

But let’s focus for a moment on the scaling question, apart from the data modeling and storage question. The real problem at the core of the scaling question is: we need a way, a thing, that allows multiple systems to access a shared data store in a reliable and consistent manner.

The ongoing work around clustered file systems seems to address this issue from a much different direction, and perhaps a more interesting perspective. Beyond a certain point--and it’s a fuzzy point--database systems basically become file system replacements. So rather than work on making databases more like file systems, the thought is (I assume) let’s make file systems a bit more “database-like.” Admittedly, I don’t know a lot about the ins and outs of clustered file systems, but I think that, in addition to worrying and thinking about the future of current database systems, we also need to think about the future of these very scalable, clustered file systems.

I’m not sure what the next-generation data storage technology really looks like. The NoSQL stuff is a step in the right direction, but I’m not sure it’s a large enough step, as its focus is a bit narrow. To be honest, I’m not incredibly familiar with the work that’s going on in the clustered file system space. Nonetheless, I think it’s important to think not just about the future of the relational database platforms as such, but about the model and the underlying problems that these kinds of data storage methods address, and about other possible ways of addressing the original issues.

File System Databases

Joe has remarked that he finds it ironic that--in this blog--I sing the praises of using Emacs and storing one’s data in plain text files, largely as part of a crusade against databases, while I am also an ardent supporter of his Haven project, which is basically a database project.

While I don’t think this is that contradictory, I do understand how one could make that inference, so I think it might be wise to address this issue explicitly. Let’s first do a little bit of recapping:

  1. Reasons why I don’t like databases:
    • Inflexible for many kinds of data, and require users to adapt to structure, rather than the other way around.
    • Databases require too much overhead, both in operation and in programming, to be worthwhile except in some large-scale edge cases.
    • Databases abstract control over data from the owner/user of the data to systems administrators and programmers, rather than leaving data in a form that everyone can access and manage.
  2. Reasons why I like text files:
    • Everyone and every machine can read text files. They’re a lingua-franca.
    • We have many highly sophisticated options for editing and munging data in plain text files.
    • Plain text files are infinitely flexible, both in structure, and in the kinds of data they can store.
  3. Caveats
    • There are some kinds of data that are best stored in database systems.
    • Structure in plain text files is dependent upon the self control and education of the users, which may be a risky situation.
  4. Reasons why I like Haven:
    • It combines numerous features that I think are really powerful and key to the development of how we use computers: cryptographic security; flexibly structured data; distributed computing/data storage; versioned data stores; collaborative systems; non-hierarchical organization of data; etc.
    • Joe is awesome.
    • It expands and improves on the Project Xanadu idea.
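The points above about text files being easy to munge, and about structure depending on the discipline of the user, can be sketched in a few lines of Python. The “key: value” header convention and the parse_note function here are assumptions made up for this example, not a format I’ve described anywhere else.

```python
def parse_note(text):
    """Split a note into header fields and a free-form body.

    Assumed convention (hypothetical): 'key: value' lines at the top,
    then a blank line, then the body.
    """
    head, _, body = text.partition("\n\n")
    fields = {}
    for line in head.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            # Nothing enforces the convention except the writer's discipline,
            # so a malformed header only surfaces when the file is read.
            raise ValueError(f"malformed header line: {line!r}")
        fields[key.strip().lower()] = value.strip()
    return fields, body


note = (
    "Title: Beyond SQL\n"
    "Tags: databases, unix\n"
    "\n"
    "People have been thinking about databases recently."
)
fields, body = parse_note(note)
```

The same file stays readable in any editor, and the ValueError branch is the caveat in action: the structure is only as good as the person typing it.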

My response to Joe’s question: “How does plain text coexist with Haven, in your mind?”

The answer is pretty simple, really.

At its core, Haven isn’t so much a database as it is a file system. We don’t think “I’ll set up a Haven repository/system for this project,” but rather “Hang on, I can put my data for this into the Haven system.” Haven isn’t a bucket that can be designed to hold anything; it’s a total system that’s meant to hold everything.

And it’s just a low-level system. Joe’s work on Haven is focused on a server application and an API. Everything else is just an application that uses Haven. One such application would (inevitably) be a FUSE driver, which would expose a Haven system as a file system. So your objects in a Haven database would be, basically, plain text files.

Which kind of rocks.

Now, Haven is just a concept right now, but, in general, FUSE is one of those technologies with amazing possibilities, because we have so many amazing tools and mature technologies for manipulating data in file systems. FUSE abstracts the mechanics of file systems, and makes it easy to “think about” data in terms of files, even if it doesn’t make a lot of sense to store said data in files. That’s really quite cool, and powerful for the rest of us.
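Here is a toy sketch of that idea in plain Python. To be clear, this is not a real FUSE driver and has nothing to do with Haven’s actual design: the ObjectStoreFS class and its contents are invented for illustration. A real driver would implement callbacks such as getattr, readdir, and read through a FUSE binding; this only shows the mapping from a flat key/value store to path-style operations.

```python
class ObjectStoreFS:
    """Toy stand-in for a FUSE driver: path-style calls over a key/value store."""

    def __init__(self, objects):
        # objects maps slash-separated keys to text content.
        self.objects = dict(objects)

    def readdir(self, path):
        """List the immediate children of a directory-like prefix."""
        prefix = path.strip("/")
        prefix = prefix + "/" if prefix else ""
        children = set()
        for key in self.objects:
            if key.startswith(prefix):
                # Keep only the next path component below the prefix.
                children.add(key[len(prefix):].split("/", 1)[0])
        return sorted(children)

    def read(self, path):
        """Return the content stored under a file-like path."""
        return self.objects[path.strip("/")]


fs = ObjectStoreFS({
    "notes/databases.txt": "Databases solve a problem.",
    "notes/fuse.txt": "FUSE abstracts the mechanics of file systems.",
    "todo.txt": "write more",
})
```

Once the store answers readdir and read, everything that already knows how to walk directories and read files can, in principle, work with it. That’s the appeal.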

I’ve seen FUSE drivers for Wikipedia, a nonhierarchical file system, HTTP (i.e., the web), Blogger, and structured data like RSS and other XML, all of which are really cool. I’m not sure if any or all of these systems are done, and I’m not sure that any of these creative uses for FUSE are ready for prime time, but I think it’s a step in the right direction, generally.