JSON v JSONB: A Dive into PostgreSQL
Friday, March 22, 2019

JSON, or JavaScript Object Notation, is a lightweight data interchange format that is easy for both computers and humans to deal with. The rules are pretty simple:
- a JSON document is an object
- an object consists of a key and a value
- a key is a string
- a value can be a string, number, boolean, array, object, or null
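As a quick illustration, here's a small document that exercises each of those value types, cast to PostgreSQL's JSON type (the field names are made up for the example):

-- One object containing a string, a number, a boolean, an array, a nested object, and a null
SELECT '{"name":         "widget",
         "price":        9.99,
         "in_stock":     true,
         "tags":         ["new", "sale"],
         "dimensions":   {"w": 3, "h": 5},
         "discontinued": null}'::JSON;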
So, when dealing with data, JSON has become fairly popular: it's the default document storage format for databases like CouchDB and MongoDB (though MongoDB technically stores BSON, a JSON-like binary format), and it's available in PostgreSQL as the JSON and JSONB data types.
So, why two data types? And which should you use? The first answer is pretty straightforward, and dives a little bit into the history of PostgreSQL.
The JSON data type was first introduced in version 9.2, released in September of 2012, and is a thin veneer around the TEXT type: when inserting or updating a JSON-typed field, the document is validated to make sure it is well-formed JSON; beyond that, it is stored as plain text. Early JSON operations were limited to converting to and from arrays, but were expanded fairly quickly to include operators to retrieve elements of a document by key as TEXT.
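A couple of quick standalone examples of both behaviors: the first statement is rejected because the text isn't valid JSON, and the -> and ->> operators pull an element out of a document as JSON or as TEXT respectively.

=# SELECT '{"name": "Ada"'::JSON;                                       -- fails: invalid input syntax for type json
=# SELECT '{"name": "Ada", "langs": ["SQL", "C"]}'::JSON -> 'langs';    -- ["SQL", "C"] (as JSON)
=# SELECT '{"name": "Ada", "langs": ["SQL", "C"]}'::JSON ->> 'name';    -- Ada (as TEXT)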
The JSONB data type was introduced later, in version 9.4, released in December of 2014. It is based on the HSTORE data type, which is a single-dimensional key/value type. JSONB expanded on HSTORE by adding more dimensions via arrays and objects: where HSTORE can only be flat, a JSONB document can be as deep and as complex as you need it to be, embedding objects in arrays inside of objects inside of arrays. In addition, JSONB adds operators that allow for manipulation of JSONB data, such as || for concatenation, and - and #- for deletion. Together, these make JSONB a much more powerful data type than JSON.
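A few quick examples of those operators in action (the documents here are made up for illustration):

=# SELECT '{"a": 1}'::JSONB || '{"b": 2}'::JSONB;           -- || merges objects: {"a": 1, "b": 2}
=# SELECT '{"a": 1, "b": 2}'::JSONB - 'b';                  -- - deletes a top-level key: {"a": 1}
=# SELECT '{"a": {"b": [1, 2, 3]}}'::JSONB #- '{a,b,1}';    -- #- deletes at a path: {"a": {"b": [1, 3]}}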
So, why use JSON when JSONB exists? Isn't it better? Well, the answer to that can be complicated, based on your use cases.
Let's first take a look at storage. Say we have a roughly 30 KB JSON document (28,168 bytes, to be exact): storing this as TEXT is pretty straightforward, but what happens when we store it as JSON? Storing 1,000 copies ends up being 5,660,672 bytes. Let's compare that to JSONB: 6,340,608 bytes. That's just over 1.12 times the amount of storage, in exchange for the ability to do simple direct manipulation and searches using operators.
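If you want to check sizes like these yourself, PostgreSQL's built-in size functions will do it. This is one way to measure, not necessarily how the numbers above were gathered, and it assumes the copies live in tables like test_json and test_jsonb:

-- On-disk size of each stored value (possibly TOAST-compressed), summed across rows
=# SELECT SUM(pg_column_size(doc)) FROM test_json;
=# SELECT SUM(pg_column_size(doc)) FROM test_jsonb;

-- Or the total size of each table, including its TOAST data
=# SELECT pg_total_relation_size('test_json'), pg_total_relation_size('test_jsonb');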
What other tradeoffs are we making? Let's take a look at query times, which we can measure with some simple benchmarks.
For the record, and so you can play along at home, all of the benchmark code, tables, and stored procedures for setting everything up can be found in a GitHub repository. Note that all tests were run on a MacBook Pro running macOS, against PostgreSQL 10.5.
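If you just want the shape of things, here's a rough sketch of the tables involved; the table and column names come from the queries below, and the real definitions, along with helpers like json_obj() that generate the test document, live in the repository:

-- Holds the single ~28 KB source document as plain TEXT
CREATE TABLE source_document (doc TEXT);

-- Target tables that the insert benchmarks fill with 10,000 copies each
CREATE TABLE test_json  (doc JSON);
CREATE TABLE test_jsonb (doc JSONB);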
Let's go ahead and put a single record into a storage table:
=# INSERT INTO source_document (doc) VALUES (json_obj()::TEXT);
And do some benchmarking for insertions:
=# SELECT test_json_insert(10000, doc::JSON) FROM source_document;
On my laptop, that gives me a solid 5787.718 ms (or just over 5 seconds). That's not too bad, given what we're doing: casting a TEXT value as JSON, which is validating the TEXT as a full JSON document, 10,000 times.
Let's compare that to the same thing in JSONB:
=# SELECT test_jsonb_insert(10000, doc::JSONB) FROM source_document;
This runs in 9725.226 ms (or just under 10 seconds). That's not quite as good, at almost twice the time to insert: converting a document to JSONB's binary representation is more work than simply validating text. Whether or not that matters comes down to how we intend to use our data.
Speaking of using our data, let's take a look at how we can retrieve it. First, we'll create a simple helper function in PL/pgSQL to help normalize query times:
CREATE OR REPLACE FUNCTION plbench(query TEXT, n INT) RETURNS VOID AS $$
BEGIN
    -- Run the given query n times, discarding the results
    FOR i IN 1 .. n LOOP
        EXECUTE query;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
This allows us to execute a query a large number of times, so the aggregate run time gives us a fairly accurate picture of the average time a single query takes.
Since we'll be running and discarding the results, but want to make sure that we're doing a realistic test of the amount of time a real query would take to output to some client (either psql or the language driver of your choice), we're going to CAST our results to TEXT. Let's again start with JSON, remembering that we should still have 10,000 rows waiting for us:
=# SELECT plbench('SELECT doc::TEXT FROM test_json', 20);
A nice round 20 (remember, this comes out to 200,000 rows) gives me 11728.708 ms (or just under 12 seconds). Let's go ahead and compare that to JSONB:
=# SELECT plbench('SELECT doc::TEXT FROM test_jsonb', 20);
Another 20 iterations (or 200,000 rows), but not nearly as fast: 43733.030 ms (or just under 44 seconds), roughly a quarter of the speed of straight JSON. That makes sense: JSON is stored as text, so casting it to TEXT costs very little, while JSONB has to be serialized back into text from its binary form.
So far, there has yet to be a compelling reason to use JSONB over JSON, but we've only just started looking at the possibilities. We've inserted, we've done document level querying, but what about direct querying of the data itself?
Let's start out with JSON, again sticking with 20 iterations, since we're dealing with 10,000 documents:
=# SELECT plbench('SELECT doc -> ''_id'' FROM test_json', 20);
This returns in 38927.170 ms (just under 39 seconds). Not great, but how will JSONB fare?
=# SELECT plbench('SELECT doc -> ''_id'' FROM test_jsonb', 20);
7504.617 ms (7.5 seconds)! Huzzah! Now we're starting to see where JSONB can shine. Because the JSON object has been converted into a data structure that is easy to iterate through, finding or modifying elements of the object should be much faster, as we are seeing.
There are a ton of other operations that JSONB gives us that are not available to JSON without either casting or using a stored procedure in a language that understands JSON natively. For these tests, I've chosen PLV8, specifically version 3.0alpha.
First let's look at concatenation, which JSONB supports through the || operator. This time, instead of casting to TEXT, we're just going to benchmark the operation directly, since it can be used in UPDATE or DELETE as well as SELECT. Let's start with JSONB:
=# SELECT plbench('SELECT doc || ''{ "foo": "bar" }''::JSONB FROM test_jsonb', 20);
Again, we're sticking to 20 iterations, since we're dealing with 10,000 documents; this finishes in 41390.995 ms (just over 41 seconds). We can do the same thing with JSON by casting:
=# SELECT plbench('SELECT doc::JSONB || ''{ "foo": "bar" }''::JSONB FROM test_json', 20);
This finishes in 96797.946 ms (just under 97 seconds), taking roughly 2.3 times as long. That's a pretty significant difference.
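Now let's try the same concatenation through a PLV8 stored procedure. The real functions live in the repository; a rough sketch of what plv8_concat_jsonb might look like (an assumption, not necessarily the repository's exact code) is:

CREATE OR REPLACE FUNCTION plv8_concat_jsonb(a JSONB, b JSONB) RETURNS JSONB AS $$
    // PLV8 hands JSON/JSONB arguments to JavaScript as plain objects,
    // so a shallow merge is just copying the right-hand keys across.
    for (var key in b) {
        a[key] = b[key];
    }
    return a;
$$ LANGUAGE plv8;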
=# SELECT plbench('SELECT plv8_concat_jsonb(doc, ''{ "foo": "bar" }''::JSONB) FROM test_jsonb', 20);
This is not fast in comparison. Completing the 20 iterations takes 348327.778 ms (over 5 minutes).
Just to be complete, let's do the same test against JSON using PLV8:
=# SELECT plbench('SELECT plv8_concat_json(doc, ''{ "foo": "bar" }''::JSON) FROM test_json', 20);
Again, not fast, clocking in at 131241.480 ms (2 minutes and 11 seconds). That's much better than the same functionality in PLV8 using JSONB, but still slower than casting JSON to JSONB, or than just using JSONB to begin with. Remember, though, that the JSONB operators only cover the simpler cases, so you may still need to resort to stored procedures to get the most out of both JSON and JSONB.
Now that you know a little more about JSON and JSONB and how they both perform in PostgreSQL, you can make some data-backed decisions about which is more appropriate for your day-to-day usage. Of course, no article like this would be complete without a picture of performance, so here goes. Hopefully it was worth reading until the end to see it.