One of my projects that runs 24x7x365 is a web scraper.
I’ve iterated on the approach several times, and what is currently working is to do the fetch and then save the entire HTTP response in a Postgres database.
From various other projects, I can then query one central database and extract out data that I care about, without having a ton of different web scraper projects dotted about all over the place. It’s a nice separation of concerns, and as I say, after many iterations, this approach is working well.
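For illustration only, here is a minimal sketch of the kind of table I'm talking about. The column names are assumptions for the example, not my exact schema, but the idea is one row per fetch, with the full response stored alongside it:

CREATE SCHEMA IF NOT EXISTS scraper_http_public;

CREATE TABLE IF NOT EXISTS scraper_http_public.http_response (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    url         text NOT NULL,                     -- the URL that was fetched
    status_code integer NOT NULL,                  -- HTTP status of the response
    headers     jsonb,                             -- full response headers
    body        text,                              -- the entire response body
    fetched_at  timestamptz NOT NULL DEFAULT now() -- when the fetch happened
);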
What is working less well, however, is the sheer amount of disk space that this scraper eats up.
It’s a temporary thing, right? The data is scraped at one time, and then at some point in the future, that data is consumed, and then deleted.
Or … not deleted.
Because Postgres doesn't actually give the disk space back to the operating system when you delete something.
Before You Start
Before you do this process, be aware that you will need two things:
- An exclusive lock on the table you need to recover disk space from, for however long the vacuum full takes to complete (which can be a while)
- Free disk space at least equal to the space the table currently consumes
OK, let’s quickly cover both of these.
You will need the exclusive lock because a full Postgres vacuum rewrites the table and all associated indexes, and no new data can be written to the table while that happens.
The full vacuum creates a new, compacted copy of the table, and only once that new copy is complete does it delete the old one. That is also why you need the extra free disk space: both copies exist on disk at the same time until the swap happens.
For production databases, this is probably best done during scheduled downtime or a maintenance window.
Finally, I am not a DBA. This is advice aimed at self-hosters, solo-devs, that sort of thing. Use at your own risk and always take a backup before doing things like this.
How To Monitor Used / Free Disk Space
You need to run the following commands from the host machine, not from inside the Docker container.
The command I use is pretty basic:
df -h
Or, in plain English: show me disk free, in a human readable format.
This belts out a bunch of nerd data like:
Filesystem Size Used Avail Use% Mounted on
tmpfs 6.3G 6.3M 6.3G 1% /run
/dev/md2 873G 492G 337G 60% /
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/md1 989M 185M 753M 20% /boot
/dev/sda1 5.5T 143G 5.1T 3% /mnt/backup
overlay 873G 492G 337G 60% /var/lib/docker/overlay2/sdsada/merged
overlay 873G 492G 337G 60% /var/lib/docker/overlay2/xcbxcb/merged
Fortunately the formatting is a bit nicer in the terminal.
Again, you probably don't want to be sat there re-running this manually. Fortunately, you can use watch to re-run the command automatically, by default every 2 seconds:
watch df -h
For me, it's the 492G figure that I'm very much hoping is reduced fairly significantly by the vacuum process.
How To See The Disk Space That Postgres Is Eating
The next step is to run a query to see how much of your expensive disk space is being eaten by the ever-hungry Postgres DB:
SELECT
table_schema || '.' || table_name AS table_full_name,
pg_size_pretty(pg_total_relation_size(table_schema || '.' || table_name)) AS total_size,
pg_total_relation_size(table_schema || '.' || table_name) / (1024^3) AS size_in_gb
FROM
information_schema.tables
WHERE
table_schema NOT IN ('pg_catalog', 'information_schema')
ORDER BY
pg_total_relation_size(table_schema || '.' || table_name) DESC;
This spits out a nice table. I had to redact a bit of it, but it's the top row that we're interested in anyway.
My scraper job saves data into the scraper_http_public.http_response table.
My consumer job parses the data, extracts what it cares about, saves that off someplace else, and then issues a DELETE query against the consumed row.
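As a rough sketch, that delete is nothing fancy (the id column here is an assumption carried over from the illustrative schema above, not necessarily what your table uses):

-- Illustrative only: remove a response row once the consumer has processed it
DELETE FROM scraper_http_public.http_response
WHERE id = $1; -- the id of the row that has just been consumed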
So far, so good.
The rows are deleted as expected, but Postgres greedily keeps hold of the disk space and, behind the scenes, marks these deleted rows as ‘dead tuples’.
A regular vacuum marks the space held by these dead tuples as reusable within the table, but the only way to actually shrink the table and hand the disk space back to the operating system is to complete a vacuum full on the table.
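For reference, the SQL itself is a one-liner. We'll run it via Docker shortly, but this is what ends up being executed:

VACUUM FULL scraper_http_public.http_response;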
How To See Your Dead Tuple Count
Another query that is pretty useful is as follows:
SELECT
schemaname,
relname AS table_name,
n_dead_tup AS dead_tuples,
n_live_tup AS live_tuples,
last_vacuum,
last_autovacuum,
last_analyze,
last_autoanalyze
FROM
pg_stat_user_tables
WHERE
relname = 'http_response'; -- Replace with your table name
This should kick out some stats like this:
[
{
"schemaname": "scraper_http_public",
"table_name": "http_response",
"dead_tuples": 6509,
"live_tuples": 68746,
"last_vacuum": "2024-12-05 14:13:40.701939 +00:00",
"last_autovacuum": "2024-12-05 13:28:40.718713 +00:00",
"last_analyze": null,
"last_autoanalyze": "2024-12-05 14:06:38.033345 +00:00"
}
]
My output shows:
Dead Tuple Ratio = (dead_tuples / (live_tuples + dead_tuples)) * 100
Dead Tuple Ratio = (6509 / (68746 + 6509)) * 100 ≈ 8.65%
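If you would rather have Postgres work that ratio out for you, a small tweak to the query above does the job (a sketch using the same pg_stat_user_tables view; the NULLIF just guards against dividing by zero on an empty table):

SELECT
    schemaname,
    relname AS table_name,
    n_dead_tup AS dead_tuples,
    n_live_tup AS live_tuples,
    round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_tuple_ratio_pct
FROM
    pg_stat_user_tables
WHERE
    relname = 'http_response'; -- Replace with your table name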
It also shows that I am running an auto-vacuum. However, I recently made a change to process more of the data in a faster manner, and now I want to free up that extra space proactively, rather than wait for the next automated vacuum.
Auto-vacuuming is enabled by default, but if you want it to kick in more aggressively for a particular table, you can tune its per-table thresholds with a query like:
ALTER TABLE scraper_http_public.http_response
SET (autovacuum_vacuum_threshold = 1000, autovacuum_vacuum_scale_factor = 0.05);
Again, change the name for your table.
I would put this in my migrations personally, but you may not do things that way.
How To Start A Postgres Vacuum Using Docker
What I found when running Vacuum commands from DataGrip, my database IDE, is that they would time out / crash in an unusual way.
I had another table a few weeks ago that I found needed to be vacuumed and I set the command running from DataGrip.
It seemed to be taking ages, so I left the computer on overnight expecting it to be done when I returned in the morning.
Nope.
I left it going, and in the end I cancelled it around 30 hours in…
Yeah.
That’s not right.
Vacuuming can take a while – especially if your DB is running on old hardware, spinny disk, an overloaded box, etc.
But it shouldn’t take 30 hours. At least not at the scales I work at.
So rather than running it from the IDE, a better choice, in my opinion, is to execute the command via Docker. That way it runs on the server itself, so you don't need to keep an active connection open from your machine whilst it runs:
docker exec your-docker-db-container-name bash -c "nohup psql -U your_db_username -d your_db_name -c 'VACUUM FULL your_schema.your_table;' > /tmp/vacuum.log 2>&1 &"
The important things here are:
- We use bash to run the command inside the running container
- nohup means the command will keep running in the background even when we aren't actively connected
- psql runs the vacuum full command against your table
- If needed, the output from this command will be inside the container at /tmp/vacuum.log (there's a command for checking it just after this list)
- Any errors will also be in that file (2>&1)
- And the & at the end runs this process in the background
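To read that log file without attaching a shell to the container, something like this should do it (same placeholder container name as above):

docker exec your-docker-db-container-name cat /tmp/vacuum.log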
You should still be able to actively monitor this by running:
docker logs --follow your-docker-db-container-name --since 5m
Where --follow tails the log in real time, and --since 5m means show me the last 5 minutes, so you won't see the entire log output if this container has been running for a while.
How To See If The Vacuum Is Running?
Regular household hoovers make a ton of noise when in use. It’s pretty obvious when one is running.
And heck, maybe if you were sat next to your server somewhere in some German data centre, your server fans might be absolutely maxed out whilst this process is running. Who knows, personally I’m about 2,000 miles away from my server.
Instead, I can keep an eye on things with the following query:
SELECT
pid,
state,
query,
query_start,
now() - query_start AS runtime,
usename AS username
FROM
pg_stat_activity
WHERE
query LIKE 'VACUUM%'
AND state = 'active';
Again, this should kick out some nice output:
[
{
"pid": 140326,
"state": "active",
"query": "VACUUM FULL scraper_http_public.http_response;",
"query_start": "2024-12-05 14:13:39.536802 +00:00",
"runtime": "0 years 0 mons 0 days 0 hours 34 mins 4.58421 secs",
"username": "my_db_username"
  }
]
If you want to get fancy, you can get DataGrip to auto-run this query every so often, so you can monitor its progress.
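On Postgres 12 and newer there is also a built-in progress view, pg_stat_progress_cluster, which covers VACUUM FULL (internally it is handled much like CLUSTER). I'm not relying on it above, but a query along these lines should show roughly how far through the rewrite you are:

SELECT
    pid,
    relid::regclass AS table_name,
    command, -- reports VACUUM FULL (or CLUSTER)
    phase, -- e.g. 'seq scanning heap', 'writing new heap'
    heap_blks_scanned,
    heap_blks_total
FROM
    pg_stat_progress_cluster;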
My Outcome
For me, after all was said and done, I freed up 24G of space. Not mega, but definitely better having it freely available than locked up, of no use to anyone.
To give some idea of the time taken: this is an Intel Core i7-8700 with the SSDs in RAID 5, and the process took about 40 minutes.
(Possible) Alternative Approach
One alternative approach to this – if the vacuum is going to take too long – is to stand up a brand new database server.
Take a dump of your existing database. I assume you do this regularly – if not, please do.
Then create a new database container and restore your backup into that container.
To the best of my knowledge, this results in the same outcome and might be a lot quicker.
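As a rough sketch of what that might look like with Docker (the container, database, and user names below are placeholders in the same style as earlier, and postgres:17 is just an example image tag; adjust everything to your setup):

# Dump the existing database inside the current container, then copy it out to the host (custom format dump)
docker exec your-docker-db-container-name pg_dump -U your_db_username -d your_db_name -F c -f /tmp/your_db_name.dump
docker cp your-docker-db-container-name:/tmp/your_db_name.dump ./your_db_name.dump

# Stand up a brand new Postgres container with a fresh volume, so no bloat is carried over
docker run -d --name fresh-db -e POSTGRES_PASSWORD=change_me -v fresh_db_data:/var/lib/postgresql/data postgres:17

# Copy the dump in and restore it; --create recreates the database from the dump
docker cp ./your_db_name.dump fresh-db:/tmp/your_db_name.dump
docker exec fresh-db pg_restore -U postgres -d postgres --create /tmp/your_db_name.dump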