Postgres JSONb meets MongoDB…

Jim Blackhurst
13 min read · Jun 18, 2022


The author’s garden

Preface

One of the reasons I hear given by developers for not wanting to entertain the idea that MongoDB is a viable choice for their operational database is that PostgreSQL’s JSON support is so good that there is no reason to look elsewhere. I’ve often wondered about this and decided that in order to really build any kind of opinion, I need to roll up my sleeves and get involved.

In this sequence of posts, I’m going to look at the specific JSON-based capabilities of Postgres (which I’m not yet that familiar with) and compare them to MongoDB’s JSON capabilities (which I should know really well, seeing as I work there). I’m hoping to learn a lot about Postgres and how it can be used, ultimately forming my own opinion on the question: is MongoDB needed anymore, now that Postgres supports JSON?

What I will not be doing is looking at Postgres as an RDBMS. Postgres is a fine choice for those who really want to model data using patterns laid down in the 1970s. If you only ever want to model your relationships through foreign keys, and don’t mind remapping all your results into JSON anyway through some brittle ORM, then be my guest, Postgres is undoubtedly the best RDBMS for you.

However, if, like many modern developers, you recognize that building applications and services efficiently requires a storage layer that can store data in the way we want to consume it, then the question of MongoDB vs Postgres’ JSONb support is a very pertinent one.

While spending a lot of time researching these posts and reading through pages of questions and comments about JSONb, I kept seeing one question come up, time and time again, on Postgres message boards and forums: “Why shouldn’t I just store all my data in a single JSONb column?” If you are familiar with MongoDB, this question will raise an eyebrow.

Let us begin.

Part One. Basic Queries.

As I write this, it’s a warm sunny day in England, and I’m sitting at a table in my small suburban garden. We’ll start by building some really quick examples of using JSON data in PG JSONb and then compare the same thing in MongoDB.

For Postgres:
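The original snippet isn’t reproduced in this copy; a minimal sketch of what it likely looked like, assuming a table named `plants` with a single JSONb column and a hypothetical flower document of my own invention:

```sql
-- A serial id plus a single JSONb column to hold the document
CREATE TABLE plants (
    id    serial PRIMARY KEY,
    plant jsonb
);

-- Insert a flower as a JSONb document (hypothetical example data)
INSERT INTO plants (plant)
VALUES ('{"name": "Rose", "flowerColor": "Red"}');
```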

You’d expect the same thing to be easier in MongoDB, and of course, being a native JSON document (object) store, it is.
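That snippet also isn’t reproduced here; in mongosh it would be something along these lines (the flower document is a hypothetical example):

```javascript
// No schema or CREATE TABLE needed; just insert the document.
// The collection is created implicitly on first insert.
db.plants.insertOne({ name: 'Rose', flowerColor: 'Red' })
```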

You don’t need any create table definitions (a schema), and the id field in the Postgres definition is replaced by an auto-generated objectID called _id in MongoDB.

Let’s pull both records back to check they look as intended.

Postgres SELECT
MongoDB find()
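The screenshots aren’t reproduced in this copy, but the two retrieval queries are roughly:

```sql
-- Postgres: return every row, including the JSONb document
SELECT * FROM plants;
```

and in mongosh:

```javascript
// MongoDB: return every document in the collection
db.plants.find()
```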

Finding a document by an attribute is also fairly easy. For MongoDB it’s just:
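The snippet isn’t shown in this copy; a sketch, assuming a hypothetical flower document like `{"name": "Rose"}`:

```javascript
// Match documents by a top-level attribute
db.plants.find({ name: 'Rose' })
```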

In Postgres, it’s still pretty basic stuff, although because SQL wasn’t built with JSON in mind, we have to introduce one of several new operators for working with JSON data. (Side note: JSON was added to the SQL:2016 standard as a set of functions over strings, with a true JSON datatype proposed for a later revision, as discussed here, although I don’t think that later revision actually exists yet.)

In this case, we’re going to use the containment operator @> which asks if the phrase is contained within the JSONb field. This operator is interesting because it seems to only exist in Postgres, so any SQL that you create that uses it can’t be run against MySQL for example. This isn’t a big deal in the context of this blog, but I find it strange that the SQL standard defines how JSON is stored, but each DB has its own way of querying it. Blogs more knowledgeable than this one also advise the use of a GIN index here, as things get pretty slow, pretty quickly.
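The SQL isn’t reproduced in this copy; a sketch of the containment query, again assuming a hypothetical flower document like `{"name": "Rose"}`:

```sql
-- @> asks: does the JSONb on the left contain the JSONb on the right?
SELECT * FROM plants
WHERE plant @> '{"name": "Rose"}';

-- The GIN index those blogs recommend for containment queries
CREATE INDEX idx_plants_plant ON plants USING GIN (plant);
```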

Right, we’re done with flowers; let’s look at some trees. My garden is pretty small, but I managed to fit quite a few in. I like the structure and height they bring. I can describe the trees easily using JSON, and rather than duplicating the information for each tree, I can describe a type of tree and then embed an array of instances of that tree around my garden. For example, I have three apple trees (although only one is a proper big tree; the other two I planted this year against the fence and they are tiny). I have a JSON object which describes what I need to know about apple trees: how big they get, the color of their flowers (blossom), and the fruit they bear.

Embedded in this object is an array of sub-objects that describe the location of each instance of this tree in my garden, how healthy it is, and how big it is. This is what I’m aiming for from my API
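The example document isn’t shown in this copy, but based on the fields used in the queries that follow, the apple tree object looks something like this (the exact values for `maxHeightCM`, `health`, and the fruit attributes are my guesses):

```json
{
  "name": "Apple Tree",
  "flowerColor": "White",
  "maxHeightCM": 1000,
  "fruit": { "color": "Red", "taste": "sweet", "avSizeCM": "7" },
  "instance": [
    { "location": "right fence",      "health": "good", "currentHeightCM": 83  },
    { "location": "right fence",      "health": "good", "currentHeightCM": 65  },
    { "location": "back left corner", "health": "good", "currentHeightCM": 450 }
  ]
}
```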

This type of data modeling is still denormalized, but fans of relational modeling will argue that I could just drop the instances into a separate table and join them back together at query time.

Yes, I could, but why? JOINs are expensive, and ORMs are brittle and need maintenance. I’d prefer to get the data back to my API in the state that my app will consume it, with the minimum amount of work done by the database. (Although I’ll admit that while tying your API to the schema of your DB is never really a good idea, it works for this example.)

So this is what we’ll load into the databases (we’ll drop the previous data first):

Inserting trees into MongoDB
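The insert isn’t reproduced in this copy; in mongosh it is an `insertMany` along these lines (only the first document is spelled out, and the fruit and health values are my guesses):

```javascript
// Drop the earlier flower data, then load the six tree documents
db.plants.drop()
db.plants.insertMany([
  {
    name: 'Apple Tree',
    flowerColor: 'White',
    fruit: { color: 'Red', taste: 'sweet', avSizeCM: '7' },
    instance: [
      { location: 'right fence',      health: 'good', currentHeightCM: 83 },
      { location: 'right fence',      health: 'good', currentHeightCM: 65 },
      { location: 'back left corner', health: 'good', currentHeightCM: 450 }
    ]
  },
  // …the other five tree documents (Wild Cherry, Silver Birch,
  //    Rowan, Olive, Corkscrew Willow) follow the same shape…
])
```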

It’s less easy with Postgres to insert multiple JSONb objects, and StackOverflow thinks* that performance is a problem if you are doing more than a small number of inserts. The consensus is that the best way to do this is to create a temporary table, load the raw text of the JSON into it, and then parse it into the desired table/state as a second step. In our case, we can insert directly because we’ve only got six records. (*To be fair, the SO concern may be more about bulk JSON parsing into non-JSON structures.)
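The direct Postgres load, skipping the temporary-table step since there are only six rows, looks roughly like this (again only the first tree is spelled out, and the fruit and health values are my guesses):

```sql
DROP TABLE IF EXISTS plants;
CREATE TABLE plants (id serial PRIMARY KEY, plant jsonb);

-- One JSONb document per row; the other five trees follow the same shape
INSERT INTO plants (plant) VALUES
('{"name": "Apple Tree", "flowerColor": "White",
   "fruit": {"color": "Red", "taste": "sweet", "avSizeCM": "7"},
   "instance": [
     {"location": "right fence", "health": "good", "currentHeightCM": 83},
     {"location": "right fence", "health": "good", "currentHeightCM": 65},
     {"location": "back left corner", "health": "good", "currentHeightCM": 450}]}');
```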

Let’s do some really simple queries.

  1. How many trees do I have?
  2. Which types of trees have white flowers?
  3. Which trees have sweet fruit
  4. Which is the tallest tree in my garden?
  5. How many trees do I have more than 2m tall?

How many trees do I have?

This could have been a really easy query if I wanted to know how many types of trees I have: that’s just the number of rows. However, I want to know how many trees I have in total, so I need to count all the elements of the instance arrays.

Actually, it is a very simple operation in MongoDB, as it was designed from the ground up for just this kind of work.

db.plants.aggregate([
  { $unwind: { path: '$instance' } },
  { $count: 'TreeCount' }
])

We first unwind the array, which breaks it apart so that each element becomes its own document, and then count all the results.

[ { TreeCount: 9 } ]

For Postgres the process is different…

SELECT SUM(jsonb_array_length(plant -> 'instance')) FROM plants;

Or, it was suggested to me that this would work too…

SELECT COUNT(plant) FROM plants AS s, jsonb_array_elements(s.plant -> 'instance') AS t;

It’s a bit more difficult to explain what’s going on here if you don’t know the Postgres JSONb operators, but the key is the -> operator, which is used to extract a JSON object.
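For readers new to these operators, the distinction worth knowing is that -> returns JSONb while ->> returns plain text, which matters later when we compare and project values:

```sql
SELECT plant -> 'instance' FROM plants;  -- jsonb: still queryable as JSON
SELECT plant ->> 'name'    FROM plants;  -- text: unwrapped from its JSON quotes
```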

Both options give the correct result:

 sum 
-----
   9
(1 row)

Which types of trees have white flowers?

This is another easy one for both platforms. For Postgres we use the @> containment operator, which looks to match the JSON on the right-hand side of the operator within the selection on the left.

SELECT plant['name'] FROM plants WHERE plant @> '{"flowerColor":"White"}';

    plant     
--------------
 "Apple Tree"
 "Rowan"
 "Olive"
(3 rows)

MongoDB’s find() method is used for this query.

db.plants.find({'flowerColor': 'White'}, {_id: 0, name: 1})

[
  { name: 'Apple Tree' },
  { name: 'Rowan' },
  { name: 'Olive' }
]

In the MongoDB query above, I’ve also included a projection argument {_id:0,name:1} that suppresses the _id field from the result set; the inclusion of name:1 means that only the fields I explicitly list (in this case just name) are returned. It’s entirely optional; omit it to return the full documents.

Which trees have sweet fruit?

This is an interesting problem because some of the trees in my garden don’t bear fruit at all, so their JSON objects don’t contain a field for fruit. This isn’t a problem for MongoDB; the query couldn’t be easier…

db.plants.find({'fruit.taste': 'sweet'}, {_id: 0, name: 1})

[
  { name: 'Apple Tree' },
  { name: 'Olive' }
]

Here, we are doing a simple find and MongoDB does the rest: it doesn’t worry that some documents don’t have a fruit object in them, and it’s easy to work with nested objects regardless of how deeply they are nested.

However, it is a different story with Postgres. I’ll admit to not being a Postgres super-star but my SQL isn’t too rusty. I should be able to do this…

First of all, I tried something similar to MongoDB’s query, which I thought would work because it seems logical…

SELECT * FROM plants WHERE plant @> '{"fruit.taste": "sweet"}';

This should work: it’s being asked to return all the rows that contain {"fruit.taste": "sweet"}. Interestingly, it doesn’t error out, but just returns (0 rows).

Ahh, perhaps Postgres doesn’t understand dot notation in the context of JSONb nesting, fair enough.

SELECT * FROM plants WHERE plant @> '{"taste": "sweet"}';

{"taste":"sweet"} must be contained in the JSONb structure somewhere, right? Nope: (0 rows).

OK, so maybe it’s not recursive and needs to be pointed at where in the sub-object the attribute is found. But we can’t do that, because some rows don’t have fruit sub-objects at all!

I then found a blog post by a well-known Postgres support team, which presented various close-but-not-quite solutions. One option they presented was converting the JSONb into text and then back into JSON, which seemed like a really long-winded approach…

select id, t.*
from plants,
     jsonb_to_recordset((plant->>'fruit'::text)::jsonb) as t(color text, taste varchar(100), avSizeCM text)
where taste like 'sweet';

It still didn’t work for me. The other options they presented in the blog post were really complex and, by their own acknowledgement, didn’t have great performance; they used generated columns, expression indexes, and other things I don’t fully understand.

Finally, after asking around in r/postgreSQL, I found my way to a popular Discord channel with many Postgres enthusiasts. I explained the problem and, to their credit, they took the challenge like a red rag to a bull. Some time later, a solution was posted!

with fruit_cte as
(
  select plant->>'name' as name,
         plant->'fruit' as fruit
  from plants
)
select name, fruit->>'taste' as taste
from fruit_cte
where fruit->>'taste' = 'sweet';

    name    | taste
------------+-------
 Apple Tree | sweet
 Olive      | sweet
(2 rows)

(OK, olives are not really sweet, I know, but they aren’t sour either and I’m making this dataset up so don’t pick me up on my fruit-based definitions!)

At last, we got there! It took me two days of experimentation, trying different things, and banging my head against a brick wall before I gave up. Even the aforementioned well-respected, professional support team couldn’t find this solution; it took some incredibly talented, community-orientated people who love solving hard challenges to come up with something that could be done in MongoDB without thinking.

Which is the tallest tree in my garden?

The tallest tree in my garden is not the tree with the highest maxHeightCM, it’s the tree with the highest instance.currentHeightCM.

One way to do this in MongoDB is with the Aggregation Pipeline, and queries like this show off how incredibly easy it is to use. (If you want to see just how amazing it is, check out the definitive text by @thedonester)

db.plants.aggregate([
  { $unwind: { path: "$instance" } },
  { $sort: { "instance.currentHeightCM": -1 } },
  { $limit: 1 },
  { $project: {
      _id: 0,
      name: 1,
      "location": "$instance.location",
      "height": "$instance.currentHeightCM"
  }}
])

The aggregation is built up out of stages; the output of each stage is piped into the input of the next.

The first stage is unwind, which is a very handy tool for working with arrays, as it allows you to break them apart and create an individual doc for each element.

Next we sort, before the limit stage returns just one document.

Finally, I’m running the result set through a project stage to reshape the data. It’s optional, but you probably don’t want the raw document returned to your application if you aren’t going to be using all of it.

You’ll also note that I’ve used the projection stage to rename a couple of the attributes, which is just for readability.

It produces the correct result:

[{ name: 'Wild Cherry', location: 'next to decking', height: 600 }]

There are other ways to do this too, some even simpler, which don’t use the aggregation framework, but this example is a good introduction to the power of using a pipeline for processing.

How do we do this in Postgres? The first job is to list all the trees.

SELECT plant->>'name' AS name, plant->'instance' AS instance FROM plants;

The output is one row per tree: the name as text, and the raw instance array as JSONb.

For the next stage, I’ve borrowed heavily from the sweet-fruit example above. I couldn’t have done this without the support of the helpful Postgres community.

We now need to run a SELECT on that result set, but using a new (to us) function called jsonb_to_recordset(), which breaks apart the array and creates a row for every array element. It’s Postgres’ equivalent of MongoDB’s $unwind; however, with Postgres, you have to explicitly cast the output of jsonb_to_recordset() using an AS clause.

with instance_cte as
(
  SELECT plant->>'name' AS name,
         plant->'instance' AS instance
  FROM plants
)
SELECT *
FROM instance_cte,
     jsonb_to_recordset(instance_cte.instance)
       AS items(health jsonb, location jsonb, currentHeightCM jsonb);

We’re not there yet. The result set returns a whole heap more information than we actually need, so we’re going to work on the projection in the next step, but at least we’ve got a row per instance, rather than a row per tree.

But something fishy is going on: if you look at the far-right column, currentHeightCM is empty, which is a shame, as it’s the one we’re going to need. We’ll have to fix that too.

Let's project out all the stuff we don’t want.

with instance_cte as
(
  SELECT plant->>'name' AS name,
         plant->'instance' AS instance
  FROM plants
)
SELECT instance_cte.name, items.location, items.currentheightcm
FROM instance_cte,
     jsonb_to_recordset(instance_cte.instance)
       AS items(location jsonb, currentHeightCM jsonb);

Now that we are being more explicit about what we want to return, we can see the problem clearly.

       name       |      location      | currentheightcm 
------------------+--------------------+-----------------
 Apple Tree       | "right fence"      |
 Apple Tree       | "right fence"      |
 Apple Tree       | "back left corner" |
 Wild Cherry      | "next to decking"  |
 Silver Birch     | "next to decking"  |
 Rowan            | "back door"        |
 Olive            | "back door"        |
 Corkscrew Willow | "left fence"       |
 Corkscrew Willow | "left fence"       |
(9 rows)

What is wrong with currentHeightCM? It’s the only field in the sub-object that is an int, not text, but that can’t be the problem (I tried casting it as an int and it still didn’t work).

Hours pass.

Why has it converted the column name in the result set to lowercase? NO, WAIT, it’s the only attribute name that I’ve written in lowerCamelCase. That can’t be it. That would be crazy. Why would Postgres require attribute names to be all lowercase? I’d better test it just to be sure.

I created a new table called plants3 and inserted the same data as before, but changed all references to currentHeightCM to currentheightcm then re-ran the exact same query as above.

       name       |      location      | currentheightcm 
------------------+--------------------+-----------------
 Apple Tree       | "right fence"      | 83
 Apple Tree       | "right fence"      | 65
 Apple Tree       | "back left corner" | 450
 Wild Cherry      | "next to decking"  | 600
 Silver Birch     | "next to decking"  | 140
 Rowan            | "back door"        | 400
 Olive            | "back door"        | 450
 Corkscrew Willow | "left fence"       | 400
 Corkscrew Willow | "left fence"       | 420
(9 rows)

This is exactly the problem with approaching this kind of workload using a great idea (JSONb) bolted onto an RDBMS. No matter how much you think the flexibility to model your data is there, there will always be nasty gotchas lurking as hangovers from the rigidity of relational schemas.

I later discovered a bug report on this subject, where the reporter complains that they can’t use camelCase with jsonb_to_recordset(); however, the respondent suggests this is intended behaviour, as…

Unquoted identifiers are lowercased in postgres
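I haven’t verified it against this exact dataset, but the standard workaround for that folding rule is to double-quote the identifier in the AS list, which preserves its case:

```sql
-- Quoting "currentHeightCM" stops Postgres folding it to lowercase,
-- so the record field matches the camelCase key in the JSONb
SELECT items.*
FROM plants,
     jsonb_to_recordset(plant -> 'instance')
       AS items(location jsonb, "currentHeightCM" jsonb);
```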

Let's just get the rest of the query done before we die of frustration.

I need to ORDER BY and then LIMIT to match the MongoDB output.

with instance_cte as
(
  SELECT plant->>'name' AS name,
         plant->'instance' AS instance
  FROM plants3
)
SELECT instance_cte.name, items.location, items.currentheightcm
FROM instance_cte,
     jsonb_to_recordset(instance_cte.instance)
       AS items(location jsonb, currentheightcm jsonb)
ORDER BY items.currentheightcm DESC
LIMIT 1;

and at last…

    name     |     location      | currentheightcm 
-------------+-------------------+-----------------
 Wild Cherry | "next to decking" | 600
(1 row)

I should have cast the height to an int, but I’m pleased to see that Postgres can still sort it even though it’s typed as JSONb. I’ll definitely have to do that for the next query. Also, I’m not sure why location is the only field in double-quotes (presumably because it’s still typed as jsonb, so its string value keeps its JSON quotes, whereas name was extracted as text with ->>).

How many trees do I have more than 2m tall?

The plan here is to build on the previous query and add in a range match. I suspect this will be easy for both platforms.

For MongoDB, the query is simpler than before:

db.plants.aggregate([
  { $unwind: { path: "$instance" } },
  { $match: { "instance.currentHeightCM": { $gt: 200 } } },
  { $count: 'Trees_GT_2m' }
])

and the output is:

[{ Trees_GT_2m: 6 }]

For Postgres, I’m going to start with my previous query, this time casting the height to an int, and drop a WHERE clause and a COUNT in there.

with instance_cte as
(
  SELECT plant->>'name' AS name,
         plant->'instance' AS instance
  FROM plants3
)
SELECT COUNT(*)
FROM instance_cte,
     jsonb_to_recordset(instance_cte.instance)
       AS items(location jsonb, currentheightcm int)
WHERE items.currentheightcm > 200;

And the result…

count 
-------
6
(1 row)

Nothing to add with this one; now that we’ve done all the heavy lifting in the Postgres query, it’s plain sailing. I would still suggest that if you compare the two queries side by side, the MongoDB query looks cleaner, more intuitive, and more human-readable.

Conclusion

In this first part, we’ve looked at querying JSONb in Postgres compared to MongoDB. The support for JSONb (and JSON) operators in Postgres looks great on paper, but using documentation alone to inform your ideas on usability and practicality is dangerous. The reality is that once you step outside a fairly narrow band of easy queries, building anything remotely complex (such as the grouping or aggregations for real-time analytics that Postgres should be able to do) becomes non-intuitive and time-consuming. I quickly banged up against problems that were only solved by some of the very brightest and most puzzle-hungry Postgres power users (with time on their hands). If I were building, running, and maintaining production services, I’d be very worried about this.

I think it’s undeniable that if you are part of the modern development community and need to work with JSON-formatted data as the principal source for your applications and services, you are going to have a much more productive and efficient time using MongoDB rather than Postgres.

In the next part, we’ll ramp up the data volumes and re-run some of the queries to see how MongoDB and Postgres JSONb perform at a larger scale.


Jim Blackhurst

Obsessed by managing data at planetary scale (and music, multicopters, 3D printing, and geeky stuff). Opinions are my own, not my employer’s.