Today I needed a query that could take some nested JSON data, a column ID, and then create a single row for each of these distinct column ID’s to show some summary numbers of different nested JSON values.
I’ve horribly explained that, and so you can imagine just how much fun I had writing the query. SQL is always harder if you can’t quite articulate in a sentence or two exactly what it is that you’re trying to display.
Anyway, words fail me, so hopefully a code based example will clear it up much more definitively.
Setting Up Our Demo Environment
I’m going to use Docker to quickly spin up a throwaway Postgres DB. For this I will use docker-compose
:
version: '3.9'
services:
db:
image: postgres:15.2
ports:
- "5495:5432"
environment:
POSTGRES_DB: app_db
POSTGRES_USER: app_user
POSTGRES_PASSWORD: my_top_secret_password
volumes:
- "./init.sql:/docker-entrypoint-initdb.d/init.sql"
Code language: JavaScript (javascript)
These contents go into a docker-compose.yaml
file.
This docker compose file defines a single service named “db”. The service runs a Postgres database version 15.2 using the official Postgres Docker image.
The service exposes port 5495 on the host machine (your local machine), which is mapped to the default Postgres port 5432 in the container. This allows you to access the database from your PC.
The environment variables should hopefully be self explanatory, but they are defined by the official postgres
docker image, and I am using the explicit config above to override them to my own values.
The volumes
section maps the local file “./init.sql
” to the container directory “/docker-entrypoint-initdb.d/init.sql
“. This allows you to run any SQL scripts in that file when the container is started up, which is very useful for initialising the database schema and loading our demo data.
Here’s the corresponding init.sql
file:
create table some_form_data (
id serial primary key,
category_id varchar(6) not null,
submission_response jsonb
);
insert into some_form_data (category_id, submission_response)
values
('CAT001', '{"age": 35, "has_children": true}'),
('CAT002', '{"age": 42, "has_children": false}'),
('CAT001', '{"age": 25, "has_children": true}'),
('CAT003', '{"age": 30, "has_children": true}'),
('CAT001', '{"age": 50, "has_children": false}'),
('CAT002', '{"age": 28, "has_children": true}'),
('CAT003', '{"age": 21, "has_children": false}'),
('CAT002', '{"age": 39, "has_children": true}'),
('CAT003', '{"age": 29, "has_children": true}'),
('CAT001', '{"age": 47, "has_children": true}');
Code language: JavaScript (javascript)
You can then spin this up with docker-compose up
.
I’m going to use JetBrains DataGrip to connect, and I documented how to do this over on this post. So if you’re unsure, check that out.
Counting and Grouping JSON Data in PostgreSQL Using FILTER
OK, so we have our demo database up and running.
Imagine, if you will, that the some_form_data
contains form submission data, where the category ID is a known quantity, but the submission_response
is built up using a wizard, and so the resulting data is stored off as JSON for … well, simplicity I guess. I don’t know. I inherited the database, I didn’t create it.
What I needed to do was get a report where for each CATEGORY_ID
, I wanted something like this:
Category ID | over_30_with_children | under_30_without_children |
CAT001 | 2 | 0 |
CAT002 | 1 | 1 |
CAT003 | 0 | 1 |
And it took a lot of frustration on my part to figure out how to do this.
I was trying joins and CTEs, and all kinds of stuff.
But after taking a pause and reading the manual a bit further, I stumbled across this bit:
Now, not only did that look like exactly what I was trying to achieve, it massively simplified my “work in progress” / horrible query.
From there I got the query down to this:
select
category_id,
count(*) filter (where
submission_response->>'has_children' = 'true'
and
submission_response->>'age' >= '30'
) as over_30_with_children,
count(*) filter (where
submission_response->>'has_children' = 'false'
and
submission_response->>'age' < '30'
) as under_30_without_children
from some_form_data
group by category_id;
Code language: SQL (Structured Query Language) (sql)
That actually looks messier than it was, because I’ve formatted it above so it doesn’t have a horizontal scroll.
For me it looked like:
select
category_id,
count(*) filter (where submission_response->>'has_children' = 'true' and submission_response->>'age' >= '30') as over_30_with_children,
count(*) filter (where submission_response->>'has_children' = 'false' and submission_response->>'age' < '30') as under_30_without_children
from some_form_data
group by category_id;
Code language: SQL (Structured Query Language) (sql)
Which, considering what I had, is a massive improvement.
The FILTER
clause is used in conjunction with an aggregate function like COUNT
, SUM
or AVG
to selectively include or exclude certain rows from the resulting aggregate value.
The basic syntax of a FILTER
clause looks like this:
<code>COUNT(*) FILTER (WHERE some_condition)
</code>
Code language: SQL (Structured Query Language) (sql)
COUNT(*)
is the aggregate function being used.
some_condition
would be something that resolves to a boolean value, which we can use to determine whether a given row should be included in the count.
Ultimately from the query above I got this:
It might not seem like much, but that’s the thing with SQL. When you finally get the outcome you want, the query often doesn’t look like that much at all. But all the hard work goes into refining the query to get exactly what you want. That’s the hard part. Or at least, it is for me.