Building a MongoDB Clone in Postgres: Part 1

Monday, June 4, 2012 - Jerry Sievert

I had a crazy thought. Don't all good ideas start with that phrase? Well, this one was suitably crazy: why not build my own version of MongoDB right on top of Postgres? It sounds a little far-fetched, but in all honesty it's pretty simple.

The Postgres community hasn't been sitting around twiddling its thumbs while the NoSQL movement has taken off. There has been continued development all throughout the Postgres ecosystem, including a couple of standouts that caught my eye: integrated JSON support and PLV8. PLV8 brings the V8 Javascript engine into Postgres which makes Javascript a first-class language. Having a JSON type makes it a bit easier to deal with JSON (it validates).

To play along at home, you will need a couple of things:

The lowest level of MongoDB is the collection. Collections can be boiled down to a very straightforward table:

    CREATE TABLE some_collection (
      some_collection_id SERIAL NOT NULL PRIMARY KEY,
      data JSON
    );

Stringified JSON is stored in the data column, which essentially converts a Postgres table into a cheap copy of a document store (for now).

The next step is to automate the creation of collections. Collections can be defined and stored in a collections table:

    CREATE TABLE collection (
      collection_id SERIAL NOT NULL PRIMARY KEY,
      name VARCHAR
    );

    -- make sure the name is unique
    CREATE UNIQUE INDEX idx_collection_constraint ON collection (name);

Once the table has been created, it's easy to write a stored procedure to automate the creation of collections. The idea is to insert into the collection table, create the table itself, and setup a sequence to use.

    CREATE OR REPLACE FUNCTION create_collection(collection varchar) RETURNS
    boolean AS $$
      var plan1 = plv8.prepare('INSERT INTO collection (name) VALUES ($1)', [ 'varchar' ]);
      var plan2 = plv8.prepare('CREATE TABLE col_' + collection +
        ' (col_' + collection + '_id INT NOT NULL PRIMARY KEY, data JSON)');
      var plan3 = plv8.prepare('CREATE SEQUENCE seq_col_' + collection);
    
      var ret;
    
      try {
        plv8.subtransaction(function () {
          plan1.execute([ collection ]);
          plan2.execute([ ]);
          plan3.execute([ ]);
      
          ret = true;
        });
      } catch (err) {
        ret = false;
      }
    
      plan1.free();
      plan2.free();
      plan3.free();
    
      return ret;
    $$ LANGUAGE plv8 IMMUTABLE STRICT;

Once the stored procedure is created, creating a collection becomes simple:

    SELECT create_collection('my_collection');

Since we've solved how collections are stored, the next step is to figure out how to do deep packet inspection the MongoDB way. MongoDB uses dot notation which makes it fairly simple to step through the object:

    CREATE OR REPLACE FUNCTION find_in_obj(data json, key varchar) RETURNS
    VARCHAR AS $$
      var obj = JSON.parse(data);
      var parts = key.split('.');
    
      var part = parts.shift();
      while (part && (obj = obj[part]) !== undefined) {
        part = parts.shift();
      }
    
      // this will either be the value, or undefined
      return obj;
    $$ LANGUAGE plv8 STRICT;

This function returns a VARCHAR, which is not necessarily optimal for all comparisons, but for strings a deep inspection WHERE clause is straightforward:

    SELECT data
      FROM col_my_collection
     WHERE find_in_obj(data, 'some.element') = 'something cool'

Along with string comparisons, MongoDB also does numeric comparisons and exposes an exists keyword. These are variations of the find_in_obj() function:

    CREATE OR REPLACE FUNCTION find_in_obj_int(data json, key varchar) RETURNS
    INT AS $$
      var obj = JSON.parse(data);
      var parts = key.split('.');
    
      var part = parts.shift();
      while (part && (obj = obj[part]) !== undefined) {
        part = parts.shift();
      }
    
      return Number(obj);
    $$ LANGUAGE plv8 STRICT;
    
    CREATE OR REPLACE FUNCTION find_in_obj_exists(data json, key varchar) RETURNS
    BOOLEAN AS $$
      var obj = JSON.parse(data);
      var parts = key.split('.');
    
      var part = parts.shift();
      while (part && (obj = obj[part]) !== undefined) {
        part = parts.shift();
      }
    
      return (obj === undefined ? 'f' : 't');
    $$ LANGUAGE plv8 STRICT;

The next installment will be about querying the data. We will implement a basic find() using everything that has been built so far.