Building a MongoDB Clone in Postgres: Part 2

Friday, June 15, 2012 - Jerry Sievert

In the first installment we dealt with creating collections and deep inspection of the JSON object once it was inserted. In this installment, we will be covering saving the data and building WHERE clauses from MongoDB queries in order to retrieve the data that we've written.

Saving the data to a collection is straightforward. First, we need to inspect the JSON object and look for an _id value. The code is being extremely naive and assuming that if there is an _id this is meant to be an update, otherwise it is meant as an insert. Please note that we are not currently creating an ObjectId, and are using a sequence in its place for the time being:

    CREATE OR REPLACE FUNCTION save(collection varchar, data json) RETURNS
    BOOLEAN AS $$
      var obj = JSON.parse(data);

      var id = obj._id;

      // if there is no id, naively assume an insert
      if (id === undefined) {
        // get the next value from the sequence for the ID
        var seq = plv8.prepare("SELECT nextval('seq_col_" +
            collection + "') AS id");
        var rows = seq.execute([ ]);
      
        id = rows[0].id;
        obj._id = id;

        seq.free();
    
        var insert = plv8.prepare("INSERT INTO col_" + collection +
            "  (col_" + collection + "_id, data) VALUES ($1, $2)",
            [ 'int', 'json']);

        insert.execute([ id, JSON.stringify(obj) ]);
        insert.free();
      } else {
        var update = plv8.prepare("UPDATE col_" + collection +
          " SET data = $1 WHERE col_" + collection + "_id = $2",
         [ 'json', 'int' ]);

        update.execute([ data, id ]);
      }

      return true;
    $$ LANGUAGE plv8 IMMUTABLE STRICT;

At this point, we can build some simple documents for insertion:

    {
      "name": "Jane Doe",
      "address": {
        "street": "123 Fake Street",
        "city": "Portland",
        "state": "OR"
      },
      "age": 33
    }
    
    {
      "name": "Sarah Smith",
      "address": {
        "street": "456 Real Ave",
        "city": "Seattle",
        "state": "WA"
      }
    }
    
    {
      "name": "James Jones",
      "address": {
        "street": "789 Infinity Way",
        "city": "Oakland",
        "state": "CA"
      },
      "age": 23
    }

Let's create a collection and get some data inserted into it:

    work=# SELECT create_collection('data');
     create_collection
    -------------------
     t
    (1 row)
    
    work=# SELECT save('data', '{ our object }');
     save
    ------
     t
    (1 row)

You should be able to see the object by checking the contents of the 'col_data' table.

Now that we have some data, let's query against it. Suppose we want to find everyone that lives in either Oregon or Washington that is older than 30 using a MongoDB style find():

    {
      "$or": [
        {
          "address.state": "OR"
        },
        {
          "address.state": "WA"
        }
      ],
      "age": {
        "$gt": 30
      }
    }

Given that we built some deep packet inspection last time, it is easy to create a query from this that should return Jane Doe:

    SELECT data
      FROM col_data
     WHERE find_in_obj_int(data, 'age') > 30
       AND (
             find_in_obj(data, 'address.state') = 'OR'
           OR
             find_in_obj(data, 'address.state') = 'WA'
           )

I took the approach of writing a recursively-called function to build the WHERE clause. It's pretty long, so instead of including it here, I am pointing to it on GitHub. Once the find() stored procedure has been created, we can query using it. We should see Jane Doe being returned:

    work=# SELECT find('data', '{ "$or": [ { "address.state": "OR" }, { "address.state": "WA" } ], "age": { "$gt": 30 } }');

It works: it's not elegant, but it works. This is a proof of concept, and not nearly as optimal as it could be. That said, I've been asked why I didn't use HSTORE. While you can store nested HSTORE and ARRAY values, it's still not JSON, and not quite as easy to manipulate via PLV8. There would need to be a serialization from HSTORE to JSON any time a request was made to return it in the format that MongoDB would, nor be quite as easy to deal with in Javascript. This is sub-optimal since, after all, we are trying to build a copy of MongoDB on top of Postgres.

The source can be found on GitHub: fork it, play with it, contribute.