My GitLab Runner Config.toml [Example]

I hit an annoying issue this week, and I’m still not sure of the root cause.

Last week I bumped GitLab from 10.6 to 10.8, and somehow broke my GitLab CI Runner.

Somewhere, I have a backup of the config.toml file I was using. I run my GitLab CI Runner in a Docker container. I only run one, as it’s only for my projects. And one is enough.

Somehow, the Runner borked. Annoyingly, I had no record of the version I was running (never use :latest unless you like uncertainty), and recreating the Runner without the config.toml file has been a pain.

So for my own future reference, here is my current GitLab Runner config.toml file:

user@8818901c05c8:/# cat /etc/gitlab-runner/config.toml

concurrent = 1
check_interval = 0

[[runners]]
  name = "runner-1"
  url = "https://my.gitlab.url"
  token = "{redacted}"
  executor = "docker"
  [runners.docker]
    tls_verify = false
    image = "docker:dind"
    privileged = true
    pull_policy = "if-not-present"
    disable_cache = false
    volumes = ["/var/run/docker/sock:/var/run/docker.sock","/cache"]
    shm_size = 0
  [runners.cache]
    insecure = false

FWIW this isn’t perfect. I’m currently hitting a major issue whereby GitLab CI Pipeline stages containing multiple jobs routinely fail. It’s very frustrating. It’s also not scheduled for a fix until v11, afaik.
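Since the two things that made the rebuild painful were a lost config.toml and not knowing which Runner version was running, here is a minimal sketch of guarding against both. The function names, the default backup directory and the container name are all hypothetical, and the image tag in the comment is only an illustration of pinning (check which tags actually exist before using one).

```shell
#!/bin/sh
# Sketch only: names and default paths below are assumptions, not part of my real setup.

# Copy the runner config to a backup directory, with a timestamp in the file name.
# Prints the path of the backup it created.
backup_runner_config() {
    config_path="${1:-/etc/gitlab-runner/config.toml}"
    backup_dir="${2:-$HOME/gitlab-runner-backups}"   # hypothetical location
    mkdir -p "$backup_dir" || return 1
    backup_path="$backup_dir/config.toml.$(date +%Y%m%d-%H%M%S)"
    cp -p "$config_path" "$backup_path" || return 1
    printf '%s\n' "$backup_path"
}

# Ask the runner binary inside a named container which version it is.
# "$1" is the container name, e.g. a hypothetical "gitlab-runner".
runner_version() {
    docker exec "$1" gitlab-runner --version
}

# Pinning means starting the container from an explicit tag rather than :latest,
# for example (illustrative tag): gitlab/gitlab-runner:v10.8.0
```

Running backup_runner_config before any upgrade, and saving the output of runner_version alongside it, would have given me everything I was missing this time.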

Almost a year on

Wow, it’s almost a year since I last destroyed my personal GitLab.

Back then I was running the omnibus edition. Since then I’ve been rocking sameersbn/docker-gitlab.

Highly recommended. Love me some GitLab.

And yes, last, because I have unfortunately destroyed my GitLab three times so far. Each time is an extreme sad panda situation. I have backups, thankfully, but it still sucks.

How I Fixed: “error authorizing context: authorization token required”

I love me some Dockerised GitLab. I have the full CI thing going on, with a private registry for all my Docker images that are created during the CI process.

It all works real nice.

Until that Saturday night, when suddenly, it doesn’t.

Though it sounds like I’m going off on a tangent, it’s important to this story that you know I recently changed my home broadband ISP.

I host one of my GitLab instances at my house. All my GitLab instances are now Dockerised, managed by Rancher.

I knew that as part of switching ISPs, there might (read: 100% would) be “fun” with firewalls, and ports, and all that jazz.

I thought I’d got everything sorted, and largely, I had.

Except I decided that whilst all this commotion was taking place, I would slightly rejig my infrastructure.

I use LetsEncrypt for SSL. I use the LetsEncrypt certs for this particular GitLab’s private registry.

I had the LetsEncrypt container on one node, and I was accessing the certs via a file share. It seemed pointless, and added complexity (the aforementioned extra firewall rules), which I could remove if I moved the container onto the same box as the GitLab instance.

I made this move.

Things worked, and I felt good.

Then, a week or so later, I made some code changes and pushed.

The build failed almost immediately. Not what I needed on a Saturday night.

In the build logs I could see this:

Error response from daemon: Get https://my.gitlab:5000/v2/: received unexpected HTTP status: 500 Internal Server Error

This happened when the CI process was trying to log in to the private registry.

After a bit of head scratching, I tried from my local machine and sure enough I got the same message.

My Solution

As so many of my problems seem to, it boiled down to permissions.

Rather than copy the certs over from the original node, I let LetsEncrypt generate some new ones. Why not, right?

This process worked.

The GitLab and the Registry containers used a bind mounted volume to access the LetsEncrypt cert inside the container on the path /certs/.

When opening each container, I would be logged in as root.

Root being root, I had full permissions. I checked each file with a cheeky cat and visually confirmed that all looked good.

GitLab doesn’t run as root, however. As the files were owned by root and had 600 permissions, GitLab couldn’t read them:

Completed 500 Internal Server Error in 125ms (ActiveRecord: 7.2ms)
Errno::EACCES (Permission denied @ rb_sysopen - /certs/privkey.pem):
lib/json_web_token/rsa_token.rb:20:in `read'
lib/json_web_token/rsa_token.rb:20:in `key_data'
lib/json_web_token/rsa_token.rb:24:in `key'
lib/json_web_token/rsa_token.rb:28:in `public_key'
lib/json_web_token/rsa_token.rb:33:in `kid'
lib/json_web_token/rsa_token.rb:12:in `encoded'

The user GitLab is running as doesn’t have permission to read the private key.

Some more error output that may help future Googlers:

21/01/2018 21:31:51 time="2018-01-21T21:31:51.048129504Z" level=warning msg="error authorizing context: authorization token required" go.version=go1.7.6"my.gitlab:5000" http.request.method=GET http.request.remoteaddr="" http.request.uri="/v2/" http.request.useragent="docker/17.12.0-ce go/go1.9.2 git-commit/d97c6d6 kernel/4.4.0-109-generic os/linux arch/amd64 UpstreamClient(Docker-Client/17.12.0-ce (linux))" service=registry version=v2.6.2
21/01/2018 21:31:51 10.42.16.142 - - [21/Jan/2018:21:31:51 +0000] "GET /v2/ HTTP/1.1" 401 87 "" "docker/17.12.0-ce go/go1.9.2 git-commit/d97c6d6 kernel/4.4.0-109-generic os/linux arch/amd64 UpstreamClient(Docker-Client/17.12.0-ce (linux))"


Thankfully I hadn’t deleted the old cert, so I went back and saw that I had previously set 0640 on the private key in the old setup.

The certs directory itself was set to 0750, because a directory needs execute permission as well as read before the files inside it can be opened.

In my case this was sufficient to satisfy GitLab.

After making the change on the new node, I could immediately log back in.
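As a rough sketch, the permission change described above looks something like the following. The function name is made up, and the optional group argument is an assumption on my part: 0640 only helps if the user GitLab runs as belongs to the group that owns the files, so check who owns what in your own container first.

```shell
#!/bin/sh
# Sketch only: applies the modes mentioned above (0750 on the directory,
# 0640 on the private key). The function name is hypothetical.
fix_cert_permissions() {
    cert_dir="$1"
    group="${2:-}"   # optional and an assumption: a group the GitLab user belongs to

    if [ -n "$group" ]; then
        chgrp -R "$group" "$cert_dir" || return 1
    fi

    # Directory: owner rwx, group r-x (execute is needed to reach files inside it).
    chmod 0750 "$cert_dir" || return 1
    # Private key: owner rw-, group r--, nothing for anyone else.
    chmod 0640 "$cert_dir/privkey.pem"
}

# Example (run as root on the host, path as used in this post):
#   fix_cert_permissions /certs
```

A quick way to confirm the result is to read the key as the user GitLab actually runs as, rather than as root, which is the check I skipped the first time.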

A Tip To Spot This Sooner

I would strongly recommend that you schedule your project to run a build every ~24 hours, even if nothing has changed.

This will catch weird quirks that aren’t related to your project, but have inadvertently broken your project’s build.

It’s much easier to diagnose problems whilst they are fresh in your mind.
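Pipeline schedules can be set up from the project’s CI / CD settings in the GitLab UI, but as a hedged sketch, the same thing can be scripted against the pipeline schedules API. The function name, the description text, the "master" branch and the 03:00 cron time are all assumptions for illustration, and the token needs API access to the project.

```shell
#!/bin/sh
# Sketch only: creates a daily pipeline schedule via the GitLab API.
# Arguments: GitLab base URL, numeric project ID, personal access token.
create_nightly_schedule() {
    gitlab_url="$1"
    project_id="$2"
    token="$3"

    # "ref=master" and the 03:00 cron expression are assumptions; adjust to taste.
    curl --silent --show-error --fail \
        --request POST \
        --header "PRIVATE-TOKEN: $token" \
        --form "description=Nightly safety build" \
        --form "ref=master" \
        --form "cron=0 3 * * *" \
        "$gitlab_url/api/v4/projects/$project_id/pipeline_schedules"
}

# Example (hypothetical values):
#   create_nightly_schedule "https://my.gitlab.url" 42 "$GITLAB_TOKEN"
```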

Also, y’know, better documentation! This is exactly why I’m now writing this post. So in the future, when I inevitably make a similar mistake, I’ll know where to look first 🙂

How I Fixed: Failed To Delete Snapshot in VirtualBox

Man alive. I hate stuff like this. VirtualBox is a great piece of software, but it does some wacky things.

Recently I migrated my infrastructure from Virtual Machines to Docker.

I replaced a Digital Ocean VPS with a local VirtualBox VM for running my private GitLab. The primary reason for this is that Docker images take up a chunk of space, and a low tier DO droplet just doesn’t cut it in terms of disk space.

I had a spare 120gb SSD lying around so figured: hey, why not use that and 4x my usable disk space for GitLab? Sounds like a good idea, right?

Actually, it took a lot of effort. But in the end, it worked. I decided to use thin provisioning and make the VirtualBox image think it had a 2tb disk, when in reality it was sharing the same SSD with another VirtualBox machine that runs my GitLab CI multi-runner instance.

Ok, so the whys-and-wherefores of that set up are for a forthcoming / different post.

What I didn’t expect was for my disk to fill up in less than 2 weeks. I mean, I knew my Docker images took up a chunk of space, but I had purposefully mounted a totally different disk for GitLab backups, and disabled container backups along the way. How could it be that within 2 weeks I had 98% disk utilisation?

Well, it turns out: snapshots.

Or more specifically, one single 90.1gb snapshot:

du -h
84G	./rancher-node-2/Snapshots
88G	./rancher-node-2
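When the culprit isn’t obvious, a sorted variant of the same du check makes the biggest directories float to the top. This is just a sketch: the function name is made up, and it reports sizes in kilobytes so that a plain numeric sort works everywhere.

```shell
#!/bin/sh
# Sketch only: list the largest directories under a path, biggest first.
# Sizes are in kilobytes (du -k) so that "sort -rn" orders them correctly.
largest_dirs() {
    dir="${1:-.}"
    count="${2:-10}"
    du -k "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Example: the five biggest directories under the current one.
#   largest_dirs . 5
```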

What I had done was take a “base snapshot” just after creating the VM, and then promptly forget about said snapshot entirely.

2 weeks later, I logged on and tried to hit my GitLab, but got a 503 error.

Fun times.

A bit of digging showed me that both my “rancher-node-2” VM, and the GitLab CI Multi-Runner VM were in a paused state.

A little further digging showed I had 2gb of disk space left. And that’s where I found out about the snapshot.

Ok, so simple solution – delete the snapshot.

Yeah, if only.

So, that’s not enough free disk space to delete a file then? Heh, not quite. Apparently deleting a snapshot also involves merging snapshots, or some such – I didn’t dive into the technicals.

But still, seems daft.

Anyway, the advice I found out there on the ‘net was to have at least as much disk space again in order to do the delete. In other words, if you have a 10gb VM, and a 20gb snapshot, in order to delete the snapshot you’d need a 60gb disk. But of course!
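As I read that rule of thumb, it boils down to doubling whatever the VM and its snapshot already occupy. A tiny sketch of the arithmetic, with a made-up function name and whole gigabytes assumed:

```shell
#!/bin/sh
# Sketch only: "as much disk space again" means the disk should be twice
# the size of what the VM plus its snapshot currently use.
required_disk_gb() {
    vm_gb="$1"
    snapshot_gb="$2"
    echo $(( (vm_gb + snapshot_gb) * 2 ))
}

required_disk_gb 10 20   # prints 60
```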

Sadly, I don’t have another spare 240gb disk lying around. I don’t use large disks anymore as I’ve lost two 2tb disks (old spinny stuff, but still) in recent years and the data loss is mildly irritating, to put it politely. I stick to smaller disks so if data loss does occur, it isn’t as bad. In theory.

Fortunately, I did have a spare 100gb or so on a different partition. But on the face of it, that doesn’t seem that useful, right?

My Solution

This may seem a little unorthodox but here goes.

To begin with, I tried to simply clone the existing VM. Doing a full clone gives the option to disregard any snapshots.

I moved my second VM off the 120gb disk freeing up about 18gb or so.

I tried to clone, it took a very long time, and then it promptly failed:

Don’t be fooled by that timer, it took a lot longer than that.

Anyway, that didn’t work, so I came up with a more geeky plan.

I moved the snapshot file from my 120gb disk. This freed up a huge amount of space:

df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       110G  3.4G  101G   4% /mnt/kingston-120-second

Then, I symlinked the snapshot back into place:

➜  Snapshots ln -s "/media/chris/Data/Virtual Machines/{43895f1b-1b8a-4eab-9d47-40627ccca33f}.vdi" ./{43895f1b-1b8a-4eab-9d47-40627ccca33f}.vdi
➜  Snapshots ls -la
total 12
drwx------ 2 chris chris 4096 Apr 30 20:37 .
drwxrwxr-x 4 chris chris 4096 Apr 30 20:10 ..
lrwxrwxrwx 1 chris chris   77 Apr 30 20:37 {43895f1b-1b8a-4eab-9d47-40627ccca33f}.vdi -> /media/chris/Data/Virtual Machines/{43895f1b-1b8a-4eab-9d47-40627ccca33f}.vdi

Symlinks seem scary. Here’s how I remember the syntax:

It’s just like the copy command.

ln -s {path to source} {path to become my symlink}

# just like 'cp'

cp {path to copy from} {path to new file}
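Putting the move and the symlink together, the trick looks roughly like this. It’s a sketch with a made-up function name, and it assumes the target directory is given as an absolute path, otherwise the link would point at the wrong place. I’d also only do this while the VM isn’t running.

```shell
#!/bin/sh
# Sketch only: move a file to another disk, then leave a symlink behind
# at the original path so anything using that path still finds the file.
move_and_symlink() {
    source_path="$1"   # where the file lives now
    target_dir="$2"    # absolute path of a directory on the disk with free space
    target_path="$target_dir/$(basename "$source_path")"

    mv "$source_path" "$target_path" || return 1
    # Same argument order as cp: existing file first, new name second.
    ln -s "$target_path" "$source_path"
}

# Example (hypothetical file name):
#   move_and_symlink ./Snapshots/some-snapshot.vdi "/media/chris/Data/Virtual Machines"
```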

I tried to clone the VM at this point, but this again failed with an out of disk space error.

Instead, I then tried to delete the snapshot.

This consumed nearly all the disk space, but finally worked. Hoorah, right? Not quite.

There was still a downside. My .vdi file was now at 97.3gb. I could boot the VM and see that inside the VM I was only using 46gb. Hmm.

What I had to do was to somehow shrink the disk back down to as close to 46gb as I could. This was a little involved, and took a while.

I did the following:

chris@rancher-node-2:~$ sudo dd if=/dev/zero | pv | sudo dd of=/bigemptyfile bs=4096k

dd: error writing '/bigemptyfile': No space left on device                                                         ]
2017+63027194 records in
2017+63027193 records out
2103230164992 bytes (2.1 TB, 1.9 TiB) copied, 5693.08 s, 369 MB/s
1.91TiB 1:34:53 [ 352MiB/s] [           <=>                                                                        ]

chris@rancher-node-2:~$ Connection to closed by remote host.
Connection to closed.

I can’t say this is my own solution – I found it on StackOverflow 🙂

As you can see, this command ran until it failed. It never consumed any disk space on my physical hard disk, which is nice because, as I say, I thin provisioned this disk, so that wouldn’t have worked out so well.

Still, once this process failed, I wasn’t done.

I then ran:

vboxmanage modifyhd rancher-node-2/rancher-node-2.vdi --compact

This took about 10 minutes, but after finishing I was down to a 56gb .vdi file. Good enough.

Finally, remember to delete the bigemptyfile (it was written with sudo, so it needs sudo to remove):

sudo rm /bigemptyfile

How I solved “New runner. Has not connected yet” in GitLab CI

I hit on an issue with GitLab CI whereby newly added GitLab CI Runners would register successfully, but never start a build.

The error message was shown on the Projects > Settings page as a brown triangle with the runner’s ID hash next to it.

The runner itself seemed to be active, but it had the message: “New runner. Has not connected yet” next to it.

Frustrated, I re-installed, re-installed again, and tried doing a gitlab-ci-multi-runner register AND a sudo gitlab-ci-multi-runner register, which only exacerbated the problem.

The Problem

Firstly, note that running:

gitlab-ci-multi-runner register

does something different to:

sudo gitlab-ci-multi-runner register

Both will create a config.toml file, but in different locations, which may not be immediately obvious.

Normally when you run a command without sudo and it doesn’t work, you would expect it to fail and give a helpful error message.

This is a case of GitLab being flexible, but confusing.

gitlab-ci-multi-runner register will create a file in your home directory called config.toml.

sudo gitlab-ci-multi-runner register will create the file: /etc/gitlab-runner/config.toml.
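A quick way to see whether you have ended up with more than one config is to check each likely location. This is a sketch with a made-up function name. The first path is where my stray file was, the second (~/.gitlab-runner/config.toml) is where the Runner documentation says a non-root config lives on Linux, and the third is the system-wide one.

```shell
#!/bin/sh
# Sketch only: print every runner config file that exists in the usual places.
list_runner_configs() {
    home_dir="${1:-$HOME}"
    system_config="${2:-/etc/gitlab-runner/config.toml}"

    for candidate in \
        "$home_dir/config.toml" \
        "$home_dir/.gitlab-runner/config.toml" \
        "$system_config"
    do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
        fi
    done
}

# Example: more than one line of output means two registrations are competing.
#   list_runner_configs
```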

My Solution to “New runner. Has not connected yet”

In my case, the solution was to delete the config.toml in my home directory. That’s the home directory of the user I was ssh‘d in as on the GitLab CI Runner instance.

As soon as I did this, the GitLab runner immediately started – no further intervention required.

If yours doesn’t, my next course of action would be to delete both config.toml files, remove any references to that runner from the GitLab CI Runner dashboard, and then re-run the command:

sudo gitlab-ci-multi-runner register