I love me some Dockerised GitLab. I have the full CI thing going on, with a private registry for all my Docker images that are created during the CI process.
It all works real nice.
Until that Saturday night, when suddenly, it doesn’t.
Though it sounds like I’m going off on a tangent, it’s important to this story that you know I recently changed my home broadband ISP.
I host one of my GitLab instances at my house. All my GitLab instances are now Dockerised, managed by Rancher.
I knew that as part of switching ISPs, there might (read: 100% would) be “fun” with firewalls, and ports, and all that jazz.
I thought I’d got everything sorted, and largely, I had.
Except I decided that whilst all this commotion was taking place, I would slightly rejig my infrastructure.
I use LetsEncrypt for SSL. I use the LetsEncrypt certs for this particular GitLab’s private registry.
I had the LetsEncrypt container on one node, and I was accessing the certs via a file share. It seemed pointless, and added complexity (the aforementioned extra firewall rules), which I could remove if I moved the container onto the same box as the GitLab instance.
I made this move.
Things worked, and I felt good.
Then, a week or so later, I made some code changes and pushed.
The build failed almost immediately. Not what I needed on a Saturday night.
In the build logs I could see this:
Error response from daemon: Get https://my.gitlab:5000/v2/: received unexpected HTTP status: 500 Internal Server Error
This happened when the CI process was trying to log in to the private registry.
After a bit of head scratching, I tried from my local machine and sure enough I got the same message.
My Solution
As so many of my problems seem to, it boiled down to permissions.
Rather than copy the certs over from the original node, I let LetsEncrypt generate some new ones. Why not, right?
This process worked.
The GitLab and the Registry containers used a bind mounted volume to access the LetsEncrypt cert inside the container on the path /certs/.
Whenever I opened a shell in either container, I was logged in as root.
Root being root, I had full permissions. I checked each file with a cheeky cat and visually confirmed that all looked good.
GitLab doesn’t run as root, however, and the files were owned by root with 600 permissions, so the GitLab logs showed this:
Completed 500 Internal Server Error in 125ms (ActiveRecord: 7.2ms)
Errno::EACCES (Permission denied @ rb_sysopen - /certs/privkey.pem):
lib/json_web_token/rsa_token.rb:20:in `read'
lib/json_web_token/rsa_token.rb:20:in `key_data'
lib/json_web_token/rsa_token.rb:24:in `key'
lib/json_web_token/rsa_token.rb:28:in `public_key'
lib/json_web_token/rsa_token.rb:33:in `kid'
lib/json_web_token/rsa_token.rb:12:in `encoded'
The user GitLab is running as doesn’t have permission to read the private key.
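Since root bypasses the permission bits, a cat as root proves nothing about what a non-root user can read. Here is a minimal Python sketch of a more honest check, one that looks at the mode bits directly. The describe_access helper is my own illustrative name, and the throwaway temp file is only standing in for /certs/privkey.pem.

```python
import os
import stat
import tempfile


def describe_access(path):
    """Return the octal mode and which classes of user may read the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return {
        "mode": oct(mode),
        "owner_read": bool(mode & stat.S_IRUSR),
        "group_read": bool(mode & stat.S_IRGRP),
        "other_read": bool(mode & stat.S_IROTH),
    }


# Demonstrate on a throwaway file standing in for /certs/privkey.pem.
with tempfile.TemporaryDirectory() as tmp:
    key = os.path.join(tmp, "privkey.pem")
    with open(key, "w") as handle:
        handle.write("not a real key\n")

    os.chmod(key, 0o600)  # what the freshly generated key had
    locked_down = describe_access(key)

    os.chmod(key, 0o640)  # what the old, working setup had
    group_readable = describe_access(key)

print(locked_down)
print(group_readable)
```

With 600, only the owner can read the key. With 640, members of the file's group can read it too, which only helps if the user GitLab runs as is actually in that group.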
Some more error output that may help future Googlers:
21/01/2018 21:31:51 time="2018-01-21T21:31:51.048129504Z" level=warning msg="error authorizing context: authorization token required" go.version=go1.7.6 http.request.host="my.gitlab:5000" http.request.id=4d91b482-1c43-465d-9a6e-fab6b823a76c http.request.method=GET http.request.remoteaddr="10.42.18.141:36654" http.request.uri="/v2/" http.request.useragent="docker/17.12.0-ce go/go1.9.2 git-commit/d97c6d6 kernel/4.4.0-109-generic os/linux arch/amd64 UpstreamClient(Docker-Client/17.12.0-ce (linux))" instance.id=24bb0a87-92ce-47fc-b0ca-b9717eabf171 service=registry version=v2.6.2
21/01/2018 21:31:51 10.42.16.142 - - [21/Jan/2018:21:31:51 +0000] "GET /v2/ HTTP/1.1" 401 87 "" "docker/17.12.0-ce go/go1.9.2 git-commit/d97c6d6 kernel/4.4.0-109-generic os/linux arch/amd64 UpstreamClient(Docker-Client/17.12.0-ce (linux))"
Argh.
Thankfully I hadn’t deleted the old cert, so I went back and saw that I had previously set 0640 on the private key in the old setup.
The certs directory was set to 0750, since execute permission is required on the directory as well as read.
In my case this was sufficient to satisfy GitLab.
After making the same change on the new node, I could immediately log back in.
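For future me, the fix can be sketched in a few lines of Python. This is a rough sketch rather than the exact commands I ran: apply_cert_permissions is a made-up helper name, the demo works on a temporary directory instead of the real certs path, and it assumes the group ownership of the files already suits the user GitLab runs as (it only touches the mode bits).

```python
import os
import stat
import tempfile


def apply_cert_permissions(cert_dir, key_name="privkey.pem"):
    """Set 0750 on the cert directory and 0640 on the private key.

    Returns the resulting (directory mode, key mode) as octal strings.
    """
    key_path = os.path.join(cert_dir, key_name)
    os.chmod(cert_dir, 0o750)  # owner rwx, group r-x, others nothing
    os.chmod(key_path, 0o640)  # owner rw-, group r--, others nothing
    return (
        oct(stat.S_IMODE(os.stat(cert_dir).st_mode)),
        oct(stat.S_IMODE(os.stat(key_path).st_mode)),
    )


# Try it out on a temporary directory standing in for the certs folder.
with tempfile.TemporaryDirectory() as tmp:
    key = os.path.join(tmp, "privkey.pem")
    with open(key, "w") as handle:
        handle.write("not a real key\n")
    os.chmod(key, 0o600)  # start from the broken state

    dir_mode, key_mode = apply_cert_permissions(tmp)

print(dir_mode, key_mode)
```

If the group on the files is wrong, changing the mode alone won't help, so check ownership as well.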
A Tip To Spot This Sooner
I would strongly recommend that you schedule your project to run a build every ~24 hours, even if nothing has changed.
This will catch weird quirks that aren’t related to your project, but have inadvertently broken your project’s build.
It’s much easier to diagnose problems whilst they are fresh in your mind.
Also, y’know, better documentation! This is exactly why I’m writing this post: in the future, when I inevitably make a similar mistake, I’ll know where to look first 🙂