Wednesday, November 9, 2011

NoSQL - Where's It Going? Where Should It Go?

The NoSQL movement is either saving web platforms or a major nuisance, depending on what kind of developer you happen to be. Either way, the way we store data is shifting. There are the old stand-by Relational DBs like MySQL, Oracle and SQL Server, and then there's all the crazy new wave stores - BigTable, HBase, Cassandra, SimpleDB, etc, all falling under the general category of "NoSQL."

Each of these has their design features and focus, and have less in common than maybe an umbrella like NoSQL should allow. Then again, by defining them by what they are not, I suppose a ham sandwich could also be eligible for the NoSQL category.

In the simplest cases, all of these solutions allow you to get a basic job done: store rows of similar-ish data in a list. After that, things get crazy.

The first general category they differ by is performance. Some favor write performance. Some favor low-latency consistency. Some favor read performance. Some favor availability. It's unfortunate my choice of data store for my entire app impacts these factors. Really these are all strategies I'll need in varying amounts for different tasks my app performs.

The second general category they differ by is how you read and write data. Each generally has its own new-fangled API for accessing it. Some have libraries that let you pretend you're still using SQL, and tend to throw a lot of errors when you do anything interesting, let alone fancy. For example, Amazon's SimpleDB, while indeed simple, cannot handle relating data between 2 tables (which it calls domains). While it has a SQL-like interface that sits on top of its API, most of the SQL you're used to using will throw an error (like JOIN). What a nice cage to build an app inside of.

This failing of SimpleDB and some of the other NoSQL options seems to be from a bit of confusion about what NoSQL means. Although it generally means it's not a Relational Database, that doesn't mean data never has relationships. It simply means that it has no explicitly defined relationships in the database itself - really what people want is to never think about a Foreign Key again in their lives. Put another way, relationships are business logic that belong in code, not the database. Fetching data and how that's performed is the responsibility of the data provider, not the app.

It seems there needs to be a general standard for NoSQL databases that's defined by what it is rather than what it is not, and here's what I see developers really looking for:

  • Named tables
  • Allows you to submit arbitrary data into tables ("schemaless")
  • Does not enforce data relationships (not a relational DB), but...
  • Allows you to join tables in queries
  • Allows sorting and filtering
  • Scalable
That's what people are really looking for in a NoSQL database. It eliminates the upfront cost of schemas, and eliminates a lot of the performance cost of storing all those rows in a scalable way. The one big burden hanging out there is handling joins - but it's still something that can be accomplished with scalability in mind, and it can be done at the data layer so a master in the data provider can service requests for any data of any shape.

I put scalable last because the truth is that a lot of apps being built with NoSQL solutions are just hopeful. They have no need for something more scalable than a typical MySQL or even SQL Server Express instance can provide. But they do want to be done with schema management, and they want to design their app so it can handle the big time if it gets there.

There are some further features I'd like to see ideally, but don't have to be there to fulfill the basics of what these modern DBs ought to be:
  • Ideally: Allows querying with SQL
  • Ideally: Handles indexing of multiple columns, preferably in response to queries
  • Ideally: Lets you specify the strategy for a specific table's storage ("the engine" by today's terms)
  • Ideally: Handles sharding etc strategies for you so you can store all data in a single named table, even if it's broken down into many smaller tablets under the hood
I don't have a great name for where these solutions all seem to be headed. Schemaless Joinable Tables?