I’ve recently been forced to migrate from Rancher v1.x to using
docker-compose to manage my production Docker containers.
In many ways it’s actually a blessing in disguise. Rancher was a nice GUI, but under the hood it was a total black box. Also, fairly recently they migrated to v2, and no true transition path was provided – other than “LOL, reinstall?”
Anyway, one thing that Rancher was doing for me – at least, I think – was making sure the log files didn’t eat up all my hard disk space.
This one just completely caught me by surprise, as I have yet to get my monitoring setup back up and running on this particular box:
➜ ssh email@example.com # obv not real
Welcome to chris server
System information as of Wed May 13 12:28:24 CEST 2020
System load: 3.37
Usage of /: 100.0% of 1.77TB
Memory usage: 13%
Swap usage: 0%
Users logged in: 0
=> / is using 100.0% of 1.77TB
There’s a pretty handy command to drill down into exactly what is eating up all your disk space – this isn’t specific to Docker either:
chris@chris-server / # du -h --max-depth=1
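Trimmed down to the only line that really mattered, the output looked something like this:
2.3T    ./var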
The culprit here being /var with its 2.3T used… of a 1.8T file system? Yeah… idk.
Anyway, you can keep drilling down with the disk usage command until you isolate the culprit. But as this is Docker-related, I’ll save you the bother:
chris@chris-server /var/lib/docker/containers # du -h --max-depth=1
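Trimmed again, this time to the worst offender (and as it turned out, there were several more like it):
281G    ./6bfcad1f93a7fffa8f0e2b852a401199faf628f5ed7054ad01606f38c24fc568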
Son of a diddly.
Basically, this wasn’t caused by Docker directly. This was caused by my shonky migration.
The underlying issue here is that some of the Docker containers I run are Workers. They are little Node apps that connect to RabbitMQ, pull a job down, and do something with it.
When the brown stuff hits the twirly thing, they log out a bit of info to help me figure out what went wrong. Fairly standard stuff, I admit.
However, in this new setup, there was no limit to what was getting logged. I guess previously Rancher had enforced some max filesize limits or was helpfully rotating logs periodically.
In this case, the first port of call was to truncate a log. This might not actually be safe, but seeing as it’s my server and it’s not mission critical, I just truncated one of the huge logs (truncating rather than deleting, because the container keeps the file open, so deleting it wouldn’t actually free the space until the container restarted anyway):
/var/lib/docker/containers/6bfcad1f93a7fffa8f0e2b852a401199faf628f5ed7054ad01606f38c24fc568 # ls -la
drwx------ 4 root root 4096 May 9 10:26 .
drwx------ 36 root root 12288 May 9 16:50 ..
-rw-r----- 1 root root 301628608512 May 13 12:44 6bfcad1f93a7fffa8f0e2b852a401199faf628f5ed7054ad01606f38c24fc568-json.log
drwx------ 2 root root 4096 May 2 10:14 checkpoints
-rw------- 1 root root 4247 May 9 10:26 config.v2.json
-rw-r--r-- 1 root root 1586 May 9 10:26 hostconfig.json
-rw-r--r-- 1 root root 34 May 9 10:25 hostname
-rw-r--r-- 1 root root 197 May 9 10:25 hosts
drwx------ 3 root root 4096 May 2 10:14 mounts
-rw-r--r-- 1 root root 38 May 9 10:25 resolv.conf
-rw-r--r-- 1 root root 71 May 9 10:25 resolv.conf.hash
truncate --size 0 6bfcad1f93a7fffa8f0e2b852a401199faf628f5ed7054ad01606f38c24fc568-json.log
That freed up about 270GB. Top lols.
Anyway, I had four of these workers running, so that’s where all my disk space had gone.
Not Out Of The Woods Just Yet
There are two further issues to address at this point, though:
Firstly, I needed to update the Docker image to set the proper path to the RabbitMQ instance. This would stop the log file spam. Incidentally, within the space of truncating and then running a further ls -la, the log was already at 70MB. That’s some aggressive connecting.
This would have been nicer as an environment variable – you shouldn’t need to do a rebuild to fix a parameter. But that’s not really the point here. Please excuse my crappy setup.
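For what it’s worth, the nicer version would be something along these lines in the compose file (the service, image, and variable name here are made up for illustration, not lifted from my real config):

services:
  worker:
    image: my-registry/worker:latest   # placeholder image
    environment:
      - RABBITMQ_HOST=rabbitmq         # hypothetical variable the Node app would read at startup

with the app reading process.env.RABBITMQ_HOST instead of having the hostname baked into the image.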
Secondly, and more importantly, I needed a way to enforce Docker never to misbehave in this way again.
docker-compose has a solution to this problem.
Here’s a small sample from my revised config:
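The service name and image are placeholders; the logging keys are the part that matters:

version: "3.7"   # extension fields (the x- prefix) need compose file format 3.4 or newer

x-logging:
  &default-logging
  driver: json-file
  options:
    max-size: "10m"   # cap the size of each log file
    max-file: "5"     # keep at most five rotated files per container

services:
  worker:
    image: my-registry/worker:latest   # placeholder
    logging: *default-logging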
OK, obviously a bit stripped down, but the gist of it is that I borrowed the config directly from the Docker Compose docs.
The one thing that I had to do was put the x-logging declaration above the services declaration. Not sure why the order matters (presumably because a YAML alias can only reference an anchor defined earlier in the file), but it didn’t seem to want to work until I made this change.
Once done, restarting all the Docker containers in this project (with the revised Docker image for the workers) not only resolved the log spam, but helpfully removed all the old containers – and associated huge log files – as part of the restart process.
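For completeness, the restart itself was nothing clever; something along the lines of:

docker-compose pull     # grab the rebuilt worker image
docker-compose up -d    # recreate any container whose image or config changed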
Another fine disaster averted.