Here’s the problem we are addressing today: you have a set of numerical data returned by your Postgres query. Your numbers look alright, but some very high (or very low) values seem to be throwing things out of whack. What you’d like to do is eliminate the outlying values from your result set, and keep the ones nearer the “normal” values. What you need (probably) is Postgres’s standard deviation function.
And fortunately for you (and me), it’s much easier to implement than you might be thinking.
OK, so cards on the table time, I am no maths whizz. Not even close. My knowledge tops out around GCSE-tier and I’m very grateful that computers are largely clever enough to do all the stuff I need with a little bit of Googling to help me out.
But nevertheless, this one stumped me for a while.
Here’s the problem, as I faced it (on the front end, at least):
This data comes from my other site – one I often bang on about on here as a basis for my examples (and frustrations).
The issue here is that I’m scraping data from numerous UK job sites, and some absolute chimps decide to enter their pay information in pennies rather than pounds.
So rather than the pay being listed as £20,000 to £25,000 (or whatever), they put in £200000-250000 or whatever, and it really skews everything big time.
I’ve thought about manually going through and sanitising the data, but that’s like… manual labour, or something. And I’m against that. Better to just rule the values out when doing the queries, I feel.
Introducing Postgres Standard Deviation Function
In order to solve the problem above, I went off reading the Postgres docs and, by sheer chance, stumbled upon the standard deviation – or stddev
function. This triggered some suppressed school boy knowledge, and with a bit of help from Khan Academy I was armed with a probable solution to my data issues.
STDDEV
is an aggregate function in PostgreSQL that calculates the standard deviation of a set of numeric values. The standard deviation is a measure of how spread out the values in a dataset are from the mean (average) value. A low standard deviation indicates that the values are clustered closely around the mean, while a high standard deviation indicates that the values are more widely spread out.
To use STDDEV
in PostgreSQL, you simply specify the column containing the numeric values you want to calculate the standard deviation for in the function. For example, the following query calculates the standard deviation of the pay_amount
column in the jobs
table:
select stddev(pay_amount) from jobs;
Code language: SQL (Structured Query Language) (sql)
The result of this query will be a single value representing the standard deviation of the pay_amount
values in the jobs
table.
STDDEV
can be useful for identifying outliers or unusual values in a dataset, as values that are more than a certain number of standard deviations away from the mean may be considered too high or too low and may be worth investigating further.
For example, you could use STDDEV
in combination with AVG
to filter out values that are more than two standard deviations away from the mean, as shown in some of the previous examples.
Here’s the Khan Academy video I found most useful:
What is Standard Deviation
Standard deviation is a measure of how spread out a set of data is from its mean or average value.
It tells you how much the data deviates from the mean.
A small standard deviation indicates that the data points tend to be close to the mean, while a large standard deviation indicates that the data points are spread out over a wider range of values.
Standard Deviation Examples
Personally, I find it easier to see some examples than read what it’s supposed to be about.
So here are a couple of examples that start as simply as I could, and then the second one is basically the same thing but with more numbers.
Remember, I said I’m not a maths whizz, so bear with me here.
Standard Deviation Example #1
Suppose you have a dataset with only two values: 2 and 4.
The mean of this dataset is (2+4)/2 = 3.
The standard deviation can be calculated as follows:
- Calculate the difference between each value and the mean:
- 2-3
- 4-3
- Square each of the differences:
- -1^2 and 1^2.
- Take the average of the squared differences:
- -1^2 + 1^2 = 2
- Divide that number by the number of values (which was 2):
- 2/2 = 1.
- Take the square root of the average:
- sqrt(1) = 1.
Therefore, the standard deviation of this dataset is 1.
This tells you that the data points are relatively close to the mean of 3.
Standard Deviation Example #2
Let’s say you have a dataset of 5 test scores: 70, 80, 90, 100, and 110.
The mean of this dataset is (70+80+90+100+110)/5 = 90.
The standard deviation can be calculated as follows:
- Calculate the difference between each value and the mean:
- -20
- -10
- 0
- 10
- 20
- Square each difference:
- 400
- 100
- 0
- 100
- 400
- Average the squared differences:
- (400+100+0+100+400)/5 = 200
- Take the square root of the average:
- sqrt(200) = 14.1421356…
Therefore, the standard deviation of this dataset is 14.142.
This tells you that the data points are spread out over a wider range of values compared to the previous example.
Using Standard Deviation In A Postgres Query
OK, so now we have had a refresher on what Standard Deviation is, it’s time to see how we can use this in Postgres.
Fortunately, as above, there’s a stddev
function built right into Postgres that we can make use of in our queries.
Setting Up Our Demo Environment
I’m going to use Docker to quickly spin up a throwaway Postgres DB. For this I will use docker-compose
:
version: '3.9'
services:
db:
image: postgres:15.2
ports:
- "5496:5432"
environment:
POSTGRES_DB: app_db
POSTGRES_USER: app_user
POSTGRES_PASSWORD: my_top_secret_password
volumes:
- "./init.sql:/docker-entrypoint-initdb.d/init.sql"
Code language: YAML (yaml)
These contents go into a docker-compose.yaml
file.
This docker compose file defines a single service named “db”. The service runs a Postgres database version 15.2 using the official Postgres Docker image.
The service exposes port 5496 on the host machine (your local machine), which is mapped to the default Postgres port 5432 in the container. This allows you to access the database from your PC.
The environment variables should hopefully be self explanatory, but they are defined by the official postgres
docker image, and I am using the explicit config above to override them to my own values.
The volumes
section maps the local file “./init.sql
” to the container directory “/docker-entrypoint-initdb.d/init.sql
“. This allows you to run any SQL scripts in that file when the container is started up, which is very useful for initialising the database schema and loading our demo data.
Here’s the corresponding init.sql
file:
CREATE TABLE jobs (
job_id SERIAL PRIMARY KEY,
job_title VARCHAR(100),
pay_amount INTEGER
);
INSERT INTO jobs (job_title, pay_amount)
VALUES
('Accountant', 50000),
('Software Developer', 70000),
('Project Manager', 80000),
('Marketing Coordinator', 45000),
('Graphic Designer', 55000),
('Sales Representative', 90000),
('Human Resources Manager', 75000),
('Web Designer', 6000),
('Customer Service Representative', 40000),
('Data Analyst', 85000),
('Domestic Cleaner', 450000),
('IT Support Specialist', 65000),
('Executive Assistant', 35000),
('Product Manager', 100000),
('Operations Manager', 95000),
('Business Analyst', 72000),
('Content Writer', 48000),
('Quality Assurance Tester', 58000),
('Technical Writer', 62000),
('Social Media Specialist', 42000),
('UI/UX Designer', 68000);
Code language: SQL (Structured Query Language) (sql)
You can then spin this up with docker-compose up
.
I’m going to use JetBrains DataGrip to connect, and I documented how to do this over on this post. So if you’re unsure, check that out.
You should now be able to see the data in your shiny new database table:
There are two deliberate outlier values in this table.
I highlighted them in the SQL code above.
('Web Designer', 6000),
('Domestic Cleaner', 450000),
Code language: JavaScript (javascript)
A comparatively low value, and a very high value.
This is to cover off the two most common issues I see in my scraped data.
Finding The Standard Deviation in Postgres Using stddev
We already saw an example of how to find the standard deviation in a Postgres query earlier on:
select stddev(pay_amount) from jobs;
Code language: SQL (Structured Query Language) (sql)
If we run that against our dataset, we get the expected single result:
Looks great.
But how can we use this in some kind of meaningful way?
Well, previously I was using a rather naive approach, relying on averages, and medians.
Median is a little more involved, and as far as I know, isn’t a standard function provided by Postgres. At least, I had to add my own.
But avg
is available.
Let’s see what we get with that:
The AVG
function calculates the average or mean value of a set of numeric values.
In this query it’s calculating the average of all 21 pay_amount
values in our jobs
table.
The true value comes here when we use the two of these aggregate functions together:
select job_title,pay_amount
from jobs
where pay_amount between
(select avg(pay_amount) - 2 * stddev(pay_amount) from jobs)
and
(select avg(pay_amount) + 2 * stddev(pay_amount) from jobs);
Code language: SQL (Structured Query Language) (sql)
Which gives:
In this query, we calculate the average and standard deviation of the pay_amount column using the AVG
and STDDEV
functions. We then use these values to define a range of acceptable values using the BETWEEN
operator. Any pay amounts that fall outside of this range will be filtered out.
Adjusting the number of standard deviations (in this case, 2) can change the strictness of the filter. A larger number of standard deviations will filter out more values, while a smaller number of standard deviations will filter out fewer values.
What About Low Values?
The above query does a good job of getting rid of the skewed higher value, but the lower value remains.
You can see this in row 8 – web designer – 6000.
Why does this not get removed?
Well, we saw the values of our two functions:
stddev(pay_amount)
: 87558.33…avg(pay_amount)
: 80523.80…
And our lower bounds query uses 2x of the stddev
, so we get:
Even my maths is sufficient to know that -94,592 is less than 6,000.
And therefore 6,000 is within the bounds of our parameters.
So if we go with just 1 standard deviation we still have a negative result:
At this point, that actually might be fine for you.
I guess it depends entirely on your data. But for me, there are some obvious mistakes we could rule out. Much like when a recruiter advertisers a cleaning job paying £450,000 per year, they most likely made a mistake, similarly anyone advertising a job that pays something like £5,000 per year also probably made a mistake.
There is a chance I’m wrong, of course. God knows, it happens enough. But for the sake of simplicity, let’s set a baseline value of the standard deviation value, or £10,000, whichever is the larger.
Or, the greater.
select job_title, pay_amount
from jobs
where pay_amount between
greatest(
(select avg(pay_amount) - stddev(pay_amount) from jobs),
10000
)
and
(select avg(pay_amount) + stddev(pay_amount) from jobs)
order by pay_amount;
Code language: SQL (Structured Query Language) (sql)
The GREATEST()
function is used to select the maximum value between the result of AVG(pay_amount) - 2 * STDDEV(pay_amount)
and the minimum threshold value of £10,000.
This ensures that the lower limit of the pay range is no lower than the threshold value, even if it falls more than the standard deviation below the average pay amount.
Postgres has a number of useful functions like this, one other one I’ve used a bunch is coalesce
. But that’s for another time.
And with a similar change implemented into my own code:
Overall then, a pretty useful feature.
Who knew high school maths lessons could come in handy again, 25 years later?
Wrapping Up
In conclusion, the Postgres standard deviation function is a useful tool for dealing with numerical data that may contain outliers or unusual values.
It allows you to calculate the spread of data from the mean, and can be used to identify values that may be worth investigating further.
And the best part? It’s not as complicated as you might think.
Even if, like me, you’re no maths whizz, with a little bit of Googling and the help of Khan Academy, you can quickly get to grips with the concept and start using it to improve your data analysis.
So don’t be intimidated by the idea of standard deviation – give it a try and see how it can help you make sense of your data.