NoSQL meet SQL: The Schema-With Strike Back

Thursday, April 15, 2010 - Jerry Sievert

SQL is everywhere. Believe it or not there are legacy relational "schema-with" databases filled with data all over the internet. Chances are even your own office has at least one SQL database lurking in a closet somewhere.

So, how do you leverage your existing "schema-with" databases and still be able to use the power of Map/Reduce? Introducing MR SQL: A Map/Reduce Front-End to SQL.

The idea is pretty simple: run a SQL query and pass the results to map and reduce functions. I decided to use JavaScript as it seems to be the language of choice: it's easy to deal with, incredible mutable, and provides for a dead simple object notation (JSON).

I had a couple of thoughts as to how to do this; I could write a language plugin for PostgreSQL adding a JavaScript interpreter courtesy of SpiderMonkey, or write some quick middleware to proxy the SQL to SpiderMonkey. I took the easy way out and with the help of a couple of cocktails quickly had a working prototype banged out in my favorite glue language.

It was fairly easy to marry perl's DBI to JavaScript, allowing for pretty much any SQL database to be accessed and the results dealt with in a uniform manner. Once you have a query and results, the rest is easy: convert the rows into data that can be dealt with in JavaScript (JSON), and run the provided map and reduce functions on the results.

I kept the interface simple; JSON once again proved its utility: input and output. Pass the driver details, a SQL query, a map function, and a reduce function: the results magically get returned. The SQL query can be as complicated as you like, the database is handling the heavy lifting of getting the data out. Once you have the data you can deal with it as needed. Each row gets fed into the map function as a document, the map function emits the results which are then fed into a reduce function, and the final results are spit out as a JSON object.

Let's look at a couple of examples:

SELECT s.symbol,
       s.name,
       d.trade_day,
       d.open,
       d.close
  FROM stock_symbol s
 INNER JOIN stock_day d ON (d.stock_symbol_id = s.stock_symbol_id)
 WHERE d.trade_day BETWEEN '2009/03/01' AND '2009/03/31'
Stock symbol and name, including open and close prices for each trade day in March 2009

function(doc) {
  var date = new Date(doc.trade_day);
  if (date.getDate() % 2) {
    emit(doc);
  }
}
Filter out all even numbered days

The results are pretty straightforward: a mess of JSON.

Adding a reduce function helps out a little bit, collating the data into a nice data structure:

function (rows) {
  var out = { };
    
  for (var i in rows) {
    if (out[rows[i].symbol]) {
      out[rows[i].symbol][rows[i].trade_day] = [ rows[i].open, rows[i].close ];
    } else {
      out[rows[i].symbol] = { };
      out[rows[i].symbol][rows[i].trade_day] = [ rows[i].open, rows[i].close ];
      out[rows[i].symbol].name = rows[i].name;
    }
  }
  return out;
}
A simple collate function

The idea is simple. We live in a data driven world. We get bits and pieces, tids and bits from so many disparate sources that we can never guarantee consistency: SQL databases, XML feeds, JSON, Twitter, we can query almost anything for data. What we have control of are the filters that we use to deal with the stream. It's not about how we can store our data but how we can deal with it, how we transform it.

As usual, downloads of all code and examples are available:
mrsql.tar.gz