How I Fixed: Failed To Delete Snapshot in VirtualBox

Man alive. I hate stuff like this. VirtualBox is a great piece of software, but it does some wacky things.

Recently I migrated my infrastructure from Virtual Machines to Docker.

I replaced a Digital Ocean VPS with a local VirtualBox VM for running my private GitLab. The primary reason for this is that Docker images take up a chunk of space, and a low-tier DO droplet just doesn’t cut it in terms of disk space.

I had a spare 120gb SSD lying around, so figured: hey, why not use that and 4x my usable disk space for GitLab? Sounds like a good idea, right?

Actually, it took a lot of effort. But in the end, it worked. I decided to use thin provisioning and make the VirtualBox image think it had a 2tb disk, when in reality it was sharing the same SSD with another VirtualBox machine that runs my GitLab CI multi-runner instance.
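As an aside, if you want to create that kind of thin-provisioned disk from the command line, something like this should do it on a recent VirtualBox. The filename here is made up for illustration, not my actual setup:

```shell
# 2tb expressed in MB, which is what --size expects
SIZE_MB=$((2 * 1024 * 1024))   # 2097152

# "Standard" variant = dynamically allocated ("thin"): space on the
# host is only consumed as the guest actually writes data.
# Filename is illustrative only.
if command -v VBoxManage >/dev/null 2>&1; then
  VBoxManage createmedium disk \
    --filename "rancher-node-2.vdi" \
    --size "$SIZE_MB" \
    --variant Standard
fi
```

On older VirtualBox versions the sub-command was `createhd` rather than `createmedium`, but the idea is the same.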

Ok, so the whys-and-wherefores of that setup are for a forthcoming / different post.

What I didn’t expect is for my disk to fill up in less than 2 weeks. I mean, I knew my Docker images took up a chunk of space, but I had purposefully mounted a totally different disk for GitLab backups, and disabled container backups along the way. How could it be that within 2 weeks I had 98% disk utilisation?

Well, it turns out: snapshots.

Or more specifically, one single 90.1gb snapshot:

du -h
84G	./rancher-node-2/Snapshots
88G	./rancher-node-2

What I had done was take a “base snapshot” just after creating the VM, and then promptly forget about said snapshot entirely.
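Incidentally, a quick way to spot forgotten snapshots is to ask VirtualBox directly (VM name as per this post):

```shell
# List every snapshot attached to a VM - a cheap way to catch
# forgotten ones before they eat your disk.
VM="rancher-node-2"
if command -v VBoxManage >/dev/null 2>&1; then
  VBoxManage snapshot "$VM" list --details
fi
```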

2 weeks later, I logged on, tried to hit my GitLab, and got a 503 error:

Fun times.

A bit of digging showed me that both my “rancher-node-2” VM, and the GitLab CI Multi-Runner VM were in a paused state.

A little further digging showed I had 2gb of disk space left. And that’s where I found out about the snapshot.

Ok, so simple solution – delete the snapshot.

Yeah, if only:

So, that’s not enough free disk space to delete a file then? Heh, not quite. Apparently deleting a snapshot also involves merging the snapshot’s differencing image back into its parent disk – so for a while you need room for both. I didn’t dive much deeper into the technicals.

But still, seems daft.

Anyway, the advice I found out there on the ‘net was to have at least as much disk space again in order to do the delete. In other words, if you have a 10gb VM, and a 20gb snapshot, in order to delete the snapshot you’d need a 60gb disk. But of course!
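For what it’s worth, the delete itself is a one-liner from the command line – the snapshot name below is a stand-in for whatever yours is called:

```shell
# Deleting a snapshot triggers the merge described above, which
# temporarily needs space for both the snapshot and its parent.
# "base snapshot" is a placeholder name, not necessarily yours.
VM="rancher-node-2"
SNAP="base snapshot"
if command -v VBoxManage >/dev/null 2>&1; then
  VBoxManage snapshot "$VM" delete "$SNAP"
fi
```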

Sadly, I don’t have another spare 240gb disk lying around. I don’t use large disks anymore as I’ve lost two 2tb disks (old spinny stuff, but still) in recent years, and the data loss is mildly irritating, to put it politely. I stick to smaller disks so if data loss does occur, it isn’t as bad. In theory.

Fortunately, I did have a spare 100gb or so on a different partition. But on the face of it, that doesn’t seem that useful, right?

My Solution

This may seem a little unorthodox but here goes.

To begin with, I tried to simply clone the existing VM. Doing a full clone gives the option to disregard any snapshots.
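From the command line, the equivalent full clone of just the current state looks something like this (the clone name is invented for illustration):

```shell
# A full clone of only the current machine state - snapshots are
# not carried over. Clone name is illustrative.
VM="rancher-node-2"
if command -v VBoxManage >/dev/null 2>&1; then
  VBoxManage clonevm "$VM" \
    --mode machine \
    --name "${VM}-clone" \
    --register
fi
```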

I moved my second VM off the 120gb disk freeing up about 18gb or so.

I tried to clone, it took a very long time, and then it promptly failed:

Don’t be fooled by that timer, it took a lot longer than that.

Anyway, that didn’t work, so I came up with a more geeky plan.

I moved the snapshot file from my 120gb disk. This freed up a huge amount of space:

df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       110G  3.4G  101G   4% /mnt/kingston-120-second

Then, I symlinked the snapshot back into place:

➜  Snapshots ln -s "/media/chris/Data/Virtual Machines/{43895f1b-1b8a-4eab-9d47-40627ccca33f}.vdi" ./{43895f1b-1b8a-4eab-9d47-40627ccca33f}.vdi
➜  Snapshots ls -la
total 12
drwx------ 2 chris chris 4096 Apr 30 20:37 .
drwxrwxr-x 4 chris chris 4096 Apr 30 20:10 ..
lrwxrwxrwx 1 chris chris   77 Apr 30 20:37 {43895f1b-1b8a-4eab-9d47-40627ccca33f}.vdi -> /media/chris/Data/Virtual Machines/{43895f1b-1b8a-4eab-9d47-40627ccca33f}.vdi

Symlinks seem scary. Here’s how I remember the syntax:

It’s just like the copy command.

ln {path to source} {path to become my symlink}

# just like 'cp'

cp {path to copy from} {path to new file}
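If symlinks still feel scary, the whole move-then-link manoeuvre can be rehearsed with throwaway files first – everything below lives in temporary directories, so it’s safe to run anywhere:

```shell
# Rehearse the move-and-symlink trick with dummy files:
# move a "snapshot" to a second disk, then link it back into place.
SMALL=$(mktemp -d)   # stands in for the nearly-full 120gb SSD
BIG=$(mktemp -d)     # stands in for the disk with free space

echo "snapshot data" > "$SMALL/snap.vdi"

mv "$SMALL/snap.vdi" "$BIG/snap.vdi"      # free up the small disk
ln -s "$BIG/snap.vdi" "$SMALL/snap.vdi"   # VirtualBox still finds it

CONTENT=$(cat "$SMALL/snap.vdi")          # reads through the symlink
rm -rf "$SMALL" "$BIG"
```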

I tried to clone the VM at this point, but this again failed with an out of disk space error.

Instead, I then tried to delete the snapshot.

This consumed nearly all the disk space, but finally worked. Hoorah, right? Not quite.

There was still a downside. My .vdi file was now at 97.3gb. I could boot the VM and see that inside the VM I was only using 46gb. Hmm.

What I had to do was to somehow shrink the disk back down to as close to 46gb as I could. This was a little involved, and took a while.

I did the following:

chris@rancher-node-2:~$ sudo dd if=/dev/zero | pv | sudo dd of=/bigemptyfile bs=4096k

dd: error writing '/bigemptyfile': No space left on device
2017+63027194 records in
2017+63027193 records out
2103230164992 bytes (2.1 TB, 1.9 TiB) copied, 5693.08 s, 369 MB/s
1.91TiB 1:34:53 [ 352MiB/s] [           <=>                                                                        ]

chris@rancher-node-2:~$ Connection to closed by remote host.
Connection to closed.

I can’t say this is my own solution – I found it on StackOverflow 🙂

As you can see, this command ran until it failed. It never consumed any disk space on my physical hard disk – which is nice because, as I say, I thin provisioned this disk, so that wouldn’t have worked out so well.

Still, once this process failed, I wasn’t done.

I then ran:

vboxmanage modifyhd rancher-node-2/rancher-node-2.vdi --compact

This took about 10 minutes, but after finishing I was down to a 56gb .vdi file. Good enough.
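If you want to confirm the on-disk size after compacting, VirtualBox can report it directly (the path below is from this post – adjust to your own layout):

```shell
# Check the real on-disk size of the VDI after compacting.
# Path is from this post; change it to match your own layout.
DISK="rancher-node-2/rancher-node-2.vdi"
if command -v VBoxManage >/dev/null 2>&1; then
  VBoxManage showmediuminfo disk "$DISK"
else
  ls -lh "$DISK" 2>/dev/null || true
fi
```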

Finally, remember to delete the bigemptyfile:

rm /bigemptyfile

How I solved “New runner. Has not connected yet” in GitLab CI

I hit on an issue with GitLab CI whereby newly added GitLab CI Runners would register successfully, but never start a build.

The error message was shown on the Projects > Settings page as a brown triangle with the runner’s ID hash next to it.

The runner itself seemed to be active, but it had the message: “New runner. Has not connected yet” next to it.

Frustrated, I re-installed, re-installed again, tried doing a gitlab-ci-multi-runner register AND a sudo gitlab-ci-multi-runner register which only exacerbated the problem.

The Problem

Firstly, note that running:

gitlab-ci-multi-runner register

does something different to:

sudo gitlab-ci-multi-runner register

Both will create a config.toml file – just in different locations, which may not be immediately obvious.

Normally when you run a command without sudo and it doesn’t work, you would expect it to fail and give a helpful error message.

This is a case of GitLab being flexible, but confusing.

gitlab-ci-multi-runner register will create a file in your home directory called config.toml.

sudo gitlab-ci-multi-runner register will create the file: /etc/gitlab-runner/config.toml.
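A quick way to see which of the two files you’ve ended up with is to check both locations – these are the paths described above, though the non-sudo location may vary between runner versions:

```shell
# Report which of the two config.toml files exist. Locations are
# as described in this post; yours may differ by runner version.
for f in "$HOME/config.toml" /etc/gitlab-runner/config.toml; do
  if [ -e "$f" ]; then
    echo "found:  $f"
  else
    echo "absent: $f"
  fi
done
```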

My Solution to “New runner. Has not connected yet”

In my case, the solution was to delete the config.toml in my home directory. That’s the home directory of the user I was ssh’d in as on the GitLab CI Runner instance.

As soon as I did this, the GitLab runner immediately started – no further intervention required.

If yours doesn’t, my next course of action would be to delete both config.toml files, remove any references to that runner from the GitLab CI Runner dashboard, and then re-run the command:

sudo gitlab-ci-multi-runner register