An Ad Hoc, Informally-Specified, Bug-Ridden Implementation of GitOps
I work on a platform composed of a group of mostly Java-based services. I learnt Kubernetes early on so we (really just two of us) could migrate the services from an ad hoc VM which was created and updated by hand.
The result of that migration was a Helm chart to deploy all the services as a single unit. Following best practices, rather than using `latest` for the image tags, every deployment specifies unique image tags generated in the CI pipelines. Using immutable tags in this way avoids significant pitfalls, but updating services (e.g. A and B) involved too much manual labour:
- Merge the changes to A.
- Open the relevant pipeline details.
- Wait for the relevant job in the pipeline to complete.
- Open the logs for the job.
- Scroll to the end.
- Select the tag (`a-tag`, say) without any surrounding whitespace in the output, making sure to select the unique tag and not the `latest` tag which was pushed at the same time.
- Copy the tag.
- Repeat this process to get the tag for B (`b-tag`, say).
- Notify my colleagues that I was about to update A and B in environment E to `a-tag` and `b-tag`.
- Run Helm using `helm upgrade --atomic chart-name chart-directory --set a.image.tag=a-tag,b.image.tag=b-tag`.
- Notify my colleagues that I had updated A and B in environment E to `a-tag` and `b-tag`.
There were many opportunities to make mistakes. Ideally, we would do away with all that by implementing the GitOps workflow: a central repository would describe our deployments and new commits would trigger updates. However, I was certain a proper GitOps engine like Argo CD would involve too many new concepts for the entire team to adapt to on top of everything else. After all, only the two of us understood Kubernetes in the first place. Therefore, I rolled up my sleeves and got to work last month on a hacked-together substitute that just about meets our requirements.[1]
The existing process required copying and pasting tags to notify colleagues. Before automating anything, I added all the tag values to the chart’s NOTES.txt template. Now we could copy them as a block from the output of a successful helm upgrade and paste them into Slack.
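A minimal sketch of what those NOTES.txt additions might look like (the value paths mirror the `--set` flags above, but the exact wording and layout here are assumptions):

```
The following image tags are deployed:
  a: {{ .Values.a.image.tag }}
  b: {{ .Values.b.image.tag }}
```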
No more job-hunting
The first and comparatively easiest goal was to obviate the need to pore over CI jobs. I made each repository’s pipeline end by sending a Slack message carrying the new tag. Since most of the jobs use Alpine Linux–derived images, I had to add `apk add --no-cache -q curl` in order to send the request. (Sub-optimal, but not significant.) It took a few attempts to quote the step in YAML such that it would produce a correctly-quoted string with escaped JSON for the `ash` interpreter, even with a multiline string:
```yaml
# other steps here
- apk add --no-cache -q curl
- |
  curl -X POST -H "Content-Type: application/json" --data "{\"text\": \"*Pushed new $APP image to registry:* \`$TAG\`\", \"username\": \"$SLACK_USERNAME\"}" $SLACK_URL
```
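For what it’s worth, the nested escaping can be sidestepped by building the payload with `printf` first. A hedged sketch (the variable values below are examples, not real pipeline variables):

```shell
# Build the JSON payload separately, then send it; this avoids quoting
# quotes inside quotes inside YAML. Example values only.
APP=my-app TAG=abc123 SLACK_USERNAME=ci-bot
payload=$(printf '{"text": "*Pushed new %s image to registry:* `%s`", "username": "%s"}' \
  "$APP" "$TAG" "$SLACK_USERNAME")
echo "$payload"
# curl -X POST -H "Content-Type: application/json" --data "$payload" "$SLACK_URL"
```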
Once this was in place, I could just wait for the Slack announcement and easily copy the new tag from there instead of monitoring the jobs until they were triggered and finished.
No more typing out notifications
I earlier made the list of deployed versions accessible in the chart notes, but we still had to copy and announce them. Fortunately, Helm can run Jobs in response to specific events. I wrote a tiny Bash script to send a notification to Slack in the same way as before, populating details from environment variables. I put this in a subdirectory and defined a ConfigMap like so:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ include "app.fullname" . }}-notifications
  labels:
{{ include "app.labels" . | indent 4 }}
    family: app
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "-6"
    "helm.sh/hook-delete-policy": before-hook-creation
data:
{{- (.Files.Glob "notifications-configmap/*").AsConfig | nindent 2 }}
```
(The `helm.sh` annotations ensure the ConfigMap is created before the Job that uses it.)
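The script itself isn’t shown here; a minimal sketch of what a send-notification.sh along these lines could look like, assuming the environment variables defined later in the hook Job (the message format and the `build_payload` helper are invented for illustration):

```shell
#!/bin/sh
# Sketch of a notification script driven by the hook's environment variables.
# Only one service tag is shown; the real script handles all of them.
build_payload() {
  printf '{"channel": "%s", "username": "%s", "text": "*%s* to *%s*: A is `%s`"}' \
    "$WEBHOOK_CHANNEL" "$WEBHOOK_USERNAME" "$ACTION" "$ENVIRONMENT" "$A_TAG"
}

# Send the message only when a webhook URL is configured.
if [ -n "$WEBHOOK_URL" ]; then
  curl -sS -X POST -H "Content-Type: application/json" \
    --data "$(build_payload)" "$WEBHOOK_URL"
fi
```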
`values.yaml` now needed a few new entries for the hooks to use:
```yaml
curl:
  image: "curlimages/curl"
  tag: "7.80.0"
slack:
  secretRef: ""
  # this ought to be part of a template instead of part of the values,
  # but I did it the wrong way and never fixed it
  hooks:
    pre-install: "Deploying"
    pre-upgrade: "Deploying"
    post-install: "Deployed"
    post-upgrade: "Deployed"
    post-rollback: "Rolled back"
environment: "" # the human-readable label for this environment
```
With those pieces in place, I could add a hook—technically, one each for `pre-install`, `pre-upgrade`, `post-install`, `post-upgrade`, and `post-rollback`—using the official cURL image with a custom entrypoint, mounting the script from the ConfigMap and adding all the required values as environment variables:
```yaml
{{- range $hook, $description := .Values.slack.hooks }}
---
apiVersion: batch/v1
kind: Job
metadata:
  name: "{{ $.Release.Name }}-slack-{{ $hook }}"
  labels:
    app.kubernetes.io/managed-by: {{ $.Release.Service | quote }}
    app.kubernetes.io/instance: {{ $.Release.Name | quote }}
    app.kubernetes.io/version: "{{ $.Chart.AppVersion }}"
    helm.sh/chart: "{{ $.Chart.Name }}-{{ $.Chart.Version }}"
  annotations:
    "helm.sh/hook": {{ $hook }}
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    metadata:
      name: "{{ $.Release.Name }}-slack-{{ $hook }}"
      labels:
        app.kubernetes.io/managed-by: {{ $.Release.Service | quote }}
        app.kubernetes.io/instance: {{ $.Release.Name | quote }}
        helm.sh/chart: "{{ $.Chart.Name }}-{{ $.Chart.Version }}"
    spec:
      restartPolicy: Never
      containers:
        - name: notify-slack
          image: "{{ $.Values.curl.image }}:{{ $.Values.curl.tag }}"
          command: ["/bin/sh"]
          args:
            - /etc/scripts/send-notification.sh
          env:
            - name: WEBHOOK_URL
              valueFrom:
                secretKeyRef:
                  name: "{{ required "A valid Slack Secret is required!" $.Values.slack.secretRef }}"
                  key: webhookUrl
            - name: WEBHOOK_CHANNEL
              valueFrom:
                secretKeyRef:
                  name: "{{ required "A valid Slack Secret is required!" $.Values.slack.secretRef }}"
                  key: webhookChannel
            - name: WEBHOOK_USERNAME
              valueFrom:
                secretKeyRef:
                  name: "{{ required "A valid Slack Secret is required!" $.Values.slack.secretRef }}"
                  key: webhookUsername
            - name: ACTION
              value: "{{ $description }}"
            - name: ENVIRONMENT
              value: "{{ required "A valid environment is required!" $.Values.environment }}"
            - name: A_TAG
              value: "{{ $.Values.a.image.tag }}"
            # (other service tag values go here)
          volumeMounts:
            - name: scripts
              mountPath: /etc/scripts
      volumes:
        - name: scripts
          configMap:
            name: {{ include "app.fullname" $ }}-notifications
{{- end }}
```
I was subsequently rewarded with automatic, informative Slack notifications any time a deployment started, ended, or failed.
No more running Helm locally
Running Helm on each developer’s machine was a terrible way to manage deployments. It forced everyone to memorize or write down opaque incantations. It was also all too easy to deploy to the wrong environment or with the wrong settings, or even with the wrong version of the chart if someone forgot to pull the latest updates first. (We had experienced all three scenarios at various points, almost as frequently as the successful scenarios.)
I recently had the opportunity to try helmfile and was certain it was right for us: it’s a reasonably simple tool with a single purpose that reuses familiar concepts. The configuration I built takes advantage of its solid support for multiple environments. There is a base values file plus one file per environment for more specific configuration.[2] The pipeline has two jobs per environment, one for ‘preparation’ (which runs `helmfile diff`) and one for ‘deployment’ (which runs `helmfile apply`) that requires the first one to succeed. There’s also a linting stage.
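As an illustration, that layout might be sketched like so (the file and environment names here are hypothetical, not our actual configuration):

```yaml
# helmfile.yaml (sketch)
environments:
  staging:
    values:
      - environments/base.yaml
      - environments/staging.yaml
  production:
    values:
      - environments/base.yaml
      - environments/production.yaml

releases:
  - name: chart-name
    chart: ./chart-directory
```

With something like this, `helmfile -e staging diff` previews the changes for one environment and `helmfile -e staging apply` performs them.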
Automatic deployments were never a goal, so each one is triggered manually and requires all the previous jobs to have succeeded. The diff is only advisory, unlike a Terraform plan. It may be outdated by the time the deployment runs or ignored by the developer, but it serves as a sanity check.
From start to finish, integrating helmfile only took about one evening. I did discover a major caveat, though: forcibly aborting the execution of `helmfile apply` (not letting it clean up) can leave the Helm release and therefore the application in a broken state. I spent 15 panicked minutes searching for a solution. What worked was to identify and delete the release secrets (`kubectl delete secret sh.helm.release.v1.chart-name.vrelease-revision`) for the last two releases. Since we had deployed several tiny, consecutive experimental releases to test all this, rolling back was not a problem.
Although it’s always possible to cancel a job manually in the CI interface, I added a flag to all deployment jobs to prevent automatic cancellation. It’s the developer’s responsibility to avoid creating a race condition by running two deployment jobs to the same environment concurrently.
No more changing image tags by hand
At this point, the process had been abbreviated to:
- Merge the changes to A.
- Wait for the pipeline to announce the new image tag on Slack.
- Update the appropriate environment in the deployment repository with the new tag and merge the changes.
- Activate the jobs in succession as required.
Having got this far, I was unwilling to continue updating the tags by hand. I wanted a better way.
The complicating factor was that, since the deployment details were now maintained in Git, updating the tag required adding a Git commit. GitHub Actions makes this easy, as the default `GITHUB_TOKEN` can push to the repository.
Our CI system does not.
I spent a lot of time wrestling with this. In the end, I created an SSH key to use in a job designed to be triggered by other repositories. (Only that key can push directly to trunk in the deployment repository.) Funnily enough, the job runs the base Alpine Linux image, but it starts by installing Bash (along with Git, OpenSSH, and yq, to edit the YAML) to run the script, ash being less likely to be available in a typical development environment than Bash.[3] The final script looks something like this:
```yaml
# get $UPSTREAM_REPOSITORY, $UPSTREAM_REF, $TARGET_ENVIRONMENT,
# $TARGET_IMAGE, and $NEW_TAG from triggering job
# install wget, Git, OpenSSH, and yq
- |
  apk add -q --no-cache wget git openssh && \
  wget -q https://github.com/mikefarah/yq/releases/download/v4.16.1/yq_linux_amd64 -O /usr/bin/yq && \
  chmod +x /usr/bin/yq
# make the SSH key file private so that OpenSSH accepts it
- chmod 600 $SSH_KEY
# work under a temporary directory
- cd /tmp
# turn off key checking and use the provided SSH key
- 'export GIT_SSH_COMMAND="ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i $SSH_KEY"'
# check out the latest commit on the current branch
- "git clone -b $CURRENT_BRANCH --depth 1 git@host.com:repo.git && cd repo"
- cd path/to/environments
# edit just one tag (should have used another multiline string here)
- "yq e -i \".$TARGET_IMAGE.image.tag = \\\"$NEW_TAG\\\"\" ./$TARGET_ENVIRONMENT.yaml"
# set up the author details and commit it with a message pointing at the triggering commit
- 'git config user.email "$USER_EMAIL" && git config user.name "$USER_NAME"'
- 'echo -e "feat($TARGET_ENVIRONMENT): update $TARGET_IMAGE to $NEW_TAG\n\n$UPSTREAM_REPOSITORY@$UPSTREAM_REF" | git commit -aF -'
# push it to the current branch (which will trigger a normal pipeline run)
- git push -u origin HEAD
```
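The awkward yq step could indeed be written with another multiline string, which removes one level of escaping. A sketch of the equivalent step:

```yaml
# the same edit as above, with one level of quoting fewer
- |
  yq e -i ".$TARGET_IMAGE.image.tag = \"$NEW_TAG\"" "./$TARGET_ENVIRONMENT.yaml"
```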
I then set up each application repository to trigger this job for each environment after pushing a new image, sequenced appropriately. That involved access tokens, HTTP requests, and CI-side templating. It took a while because I didn’t quite understand how to use the tokens, but otherwise it was straightforward.
Ultimately, all this glue eliminated the tedium of updating tags. The final process looks like this:
- Merge changes to A.
- Wait for the new image to be announced on Slack.
- Activate a manual job to update the deployment repository (easy with the UI) and wait for the new commit to be announced on Slack.
- Activate the deployment job.
Later improvements
This system works surprisingly well considering that it’s held together by scripts, cURL, and conventions. I’ve refined it a bit in the interim:
- Most of the CI steps have migrated to standalone Bash scripts that can be used by local environments too.
- The trigger template now also passes the Git revision to the triggered job. The ConfigMap tracks both the tags and the revisions being used.
- The scripts pass the existing tags and revisions as chart values with the aid of another tiny script that retrieves them from the cluster. It may be possible to fetch this data in the chart itself using the `lookup` function. I had trouble with that approach during testing, but I’ll try again at some point.
- The Bash notifications job has been superseded by a more robust Python arrangement that compares the existing tags and revisions to the new tags and revisions (all still passed as environment variables). Each tag mentioned in the Slack deployment announcement thus includes a link to the Git revision in the repository. In addition, confirmation of deployments or rollbacks uses a shorter syntax than the pre-deployment message. Finally, since not all services are updated in all deployments, the messages highlight what changed.
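The ‘highlight what changed’ comparison can be sketched in shell terms (our real version is Python, and the `service=tag` pair format here is an invented example, not the actual variable layout):

```shell
# Compare old and new service=tag pairs and collect the services whose
# tags differ. Example data; the real job reads these from the environment.
OLD_TAGS="a=abc b=def"
NEW_TAGS="a=xyz b=def"
changed=""
for pair in $NEW_TAGS; do
  case " $OLD_TAGS " in
    *" $pair "*) ;;                      # identical pair: service unchanged
    *) changed="$changed ${pair%%=*}" ;; # keep just the service name
  esac
done
echo "changed:$changed"   # → changed: a
```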
There are also a few known limitations:
- Updating one environment requires updating those before it first even if they don’t need it, because of the structure of the pipeline. It would be nice to get around this by moving the update logic into a separate service with a Slack interface.
- The notifications depend entirely on propagating state from CI to Helm. While the available scripts handle everything, nothing stops a developer from manually running Helm or helmfile with the wrong values. This would result in confusing or outright incorrect release announcements on Slack, not just once but until the next automated deployment of each service. One way to make this less likely is to add `required` where the values are used in the chart.
- As mentioned already, forcibly stopping helmfile will leave the installation in a non-functional state.
I might address the redundancy in updating different environments. The rest is tolerable.
- Whence the article title (with apologies to Philip Greenspun).↩
- The charts are passed the names of Secrets managed outside Git. We might add something like SOPS at some point but it’s not a priority.↩
- In hindsight, I don’t know why expecting developers to install all the other tools seemed reasonable but not expecting them to install ash.↩