Building a MongoDB Clone in Postgres: Part 1
Monday, June 4, 2012I had a crazy thought. Don't all good ideas start with that phrase? Well, this one was suitably crazy: why not build my own version of MongoDB right on top of Postgres? It sounds a little far-fetched, but in all honesty it's pretty simple.
The Postgres community hasn't been sitting around twiddling its thumbs while the NoSQL movement has taken off. There has been continued development all throughout the Postgres ecosystem, including a couple of standouts that caught my eye: integrated JSON support and PLV8. PLV8 brings the V8 Javascript engine into Postgres which makes Javascript a first-class language. Having a JSON type makes it a bit easier to deal with JSON (it validates).
To play along at home, you will need a couple of things:
- Postgres 9.2+ (as of this blog entry, 9.2 is in beta) - http://www.postgresql.org/ftp/source/
- V8 - https://github.com/v8/v8
- PLV8 - http://code.google.com/p/plv8js/wiki/PLV8
The lowest level of MongoDB is the collection. Collections can be boiled down to a very straightforward table:
CREATE TABLE some_collection ( some_collection_id SERIAL NOT NULL PRIMARY KEY, data JSON );
Stringified JSON is stored in the data column, which essentially converts a Postgres table into a cheap copy of a document store (for now).
The next step is to automate the creation of collections. Collections can be defined and stored in a collections table:
CREATE TABLE collection ( collection_id SERIAL NOT NULL PRIMARY KEY, name VARCHAR ); -- make sure the name is unique CREATE UNIQUE INDEX idx_collection_constraint ON collection (name);
Once the table has been created, it's easy to write a stored procedure to automate the creation of collections. The idea is to insert into the collection table, create the table itself, and setup a sequence to use.
CREATE OR REPLACE FUNCTION create_collection(collection varchar) RETURNS boolean AS $$ var plan1 = plv8.prepare('INSERT INTO collection (name) VALUES ($1)', [ 'varchar' ]); var plan2 = plv8.prepare('CREATE TABLE col_' + collection + ' (col_' + collection + '_id INT NOT NULL PRIMARY KEY, data JSON)'); var plan3 = plv8.prepare('CREATE SEQUENCE seq_col_' + collection); var ret; try { plv8.subtransaction(function () { plan1.execute([ collection ]); plan2.execute([ ]); plan3.execute([ ]); ret = true; }); } catch (err) { ret = false; } plan1.free(); plan2.free(); plan3.free(); return ret; $$ LANGUAGE plv8 IMMUTABLE STRICT;
Once the stored procedure is created, creating a collection becomes simple:
SELECT create_collection('my_collection');
Since we've solved how collections are stored, the next step is to figure out how to do deep packet inspection the MongoDB way. MongoDB uses dot notation which makes it fairly simple to step through the object:
CREATE OR REPLACE FUNCTION find_in_obj(data json, key varchar) RETURNS VARCHAR AS $$ var obj = JSON.parse(data); var parts = key.split('.'); var part = parts.shift(); while (part && (obj = obj[part]) !== undefined) { part = parts.shift(); } // this will either be the value, or undefined return obj; $$ LANGUAGE plv8 STRICT;
This function returns a VARCHAR, which is not necessarily optimal for all comparisons, but for strings a deep inspection WHERE clause is straightforward:
SELECT data FROM col_my_collection WHERE find_in_obj(data, 'some.element') = 'something cool'
Along with string comparisons, MongoDB also does numeric comparisons and exposes an exists keyword. These are variations of the find_in_obj() function:
CREATE OR REPLACE FUNCTION find_in_obj_int(data json, key varchar) RETURNS INT AS $$ var obj = JSON.parse(data); var parts = key.split('.'); var part = parts.shift(); while (part && (obj = obj[part]) !== undefined) { part = parts.shift(); } return Number(obj); $$ LANGUAGE plv8 STRICT; CREATE OR REPLACE FUNCTION find_in_obj_exists(data json, key varchar) RETURNS BOOLEAN AS $$ var obj = JSON.parse(data); var parts = key.split('.'); var part = parts.shift(); while (part && (obj = obj[part]) !== undefined) { part = parts.shift(); } return (obj === undefined ? 'f' : 't'); $$ LANGUAGE plv8 STRICT;
The next installment will be about querying the data. We will implement a basic find() using everything that has been built so far.