I work on a platform composed of a group of mostly Java-based services. I learnt Kubernetes early on so we (really just two of us) could migrate the services from an ad hoc VM which was created and updated by hand.

The result of that migration was a Helm chart to deploy all the services as a single unit. Following best practices, rather than using latest for the image tags, every deployment specifies unique image tags generated in the CI pipelines. Using immutable tags in this way avoids significant pitfalls, but updating services (e.g. A and B) involved too much manual labour:

  1. Merge the changes to A.
  2. Open the relevant pipeline details.
  3. Wait for the relevant job in the pipeline to complete.
  4. Open the logs for the job.
  5. Scroll to the end.
  6. Select the tag (a-tag, say) without any surrounding whitespace in the output, making sure to select the unique tag and not the latest tag which was pushed at the same time.
  7. Copy the tag.
  8. Repeat this process to get the tag for B (b-tag, say).
  9. Notify my colleagues that I was about to update A and B in environment E to a-tag and b-tag.
  10. Run Helm using helm upgrade --atomic chart-name chart-directory --set a.image.tag=a-tag,b.image.tag=b-tag.
  11. Notify my colleagues that I had updated A and B in environment E to a-tag and b-tag.

There were many opportunities to make mistakes. Ideally, we would do away with all that by adopting a GitOps workflow: a central repository would describe our deployments and new commits would trigger updates. However, I was certain a proper GitOps engine like Argo CD would involve too many new concepts for the entire team to adapt to on top of everything else. After all, only the two of us understood Kubernetes in the first place. Therefore, I rolled up my sleeves and got to work last month on a hacked-together substitute that just about meets our requirements.[1]

The existing process required copying and pasting tags to notify colleagues. Before automating anything, I added all the tag values to the chart’s NOTES.txt template. Now we could copy them as a block from the output of a successful helm upgrade and paste them into Slack.

No more job-hunting

The first and easiest goal was to obviate the need to pore over CI jobs. I made each repository’s pipeline end by sending a Slack message carrying the new tag. Since most of the jobs use Alpine Linux–derived images, I had to add apk add --no-cache -q curl in order to send the request. (Sub-optimal, but not significant.) It took a few attempts to quote the step in YAML so that the ash interpreter would receive a correctly quoted string containing escaped JSON, even with a multiline string:

# other steps here
- apk add --no-cache -q curl
- |
  curl -X POST -H "Content-Type: application/json" --data "{\"text\": \"*Pushed new $APP image to registry:* \`$TAG\`\", \"username\": \"$SLACK_USERNAME\"}" $SLACK_URL

Once this was in place, I could just wait for the Slack announcement and easily copy the new tag from there instead of monitoring the jobs until they were triggered and finished.
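The hand-escaped JSON in the step above is the fragile part. As a minimal sketch (not what the pipeline actually runs), the payload could be built by a small POSIX shell helper instead. The function names json_escape and slack_payload are hypothetical, and the escaping assumes single-line values without control characters, which should hold for image tags:

```shell
#!/bin/sh
# Escape backslashes and double quotes so a value can sit inside a JSON string.
# Assumption: values are single-line and contain no control characters.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the same payload as the CI step, from app name, tag, and username.
slack_payload() {
  app=$(json_escape "$1")
  tag=$(json_escape "$2")
  username=$(json_escape "$3")
  printf '{"text": "*Pushed new %s image to registry:* `%s`", "username": "%s"}' \
    "$app" "$tag" "$username"
}

# Usage in the CI step, with the same variables as before:
#   curl -X POST -H "Content-Type: application/json" \
#     --data "$(slack_payload "$APP" "$TAG" "$SLACK_USERNAME")" "$SLACK_URL"
```

Keeping the JSON inside a single-quoted printf format means the YAML step only has to quote shell variables, not JSON.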

No more typing out notifications

I had already made the list of deployed versions accessible in the chart notes, but we still had to copy and announce them. Fortunately, Helm can run Jobs as hooks at specific points in a release’s lifecycle. I wrote a tiny Bash script to send a notification to Slack in the same way as before, populating details from environment variables. I put this in a subdirectory and defined a ConfigMap like so:

apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ include "app.fullname" . }}-notifications
  labels:
{{ include "app.labels" . | indent 4 }}
    family: app
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "-6"
    "helm.sh/hook-delete-policy": before-hook-creation
data:
  {{- (.Files.Glob "notifications-configmap/*").AsConfig | nindent 2 }}

(The helm.sh annotations, in particular a hook weight lower than the Job’s, ensure the ConfigMap is created before the Job that uses it.)

values.yaml now needed a few new entries for the hooks to use:

curl:
  image: "curlimages/curl"
  tag: "7.80.0"

slack:
  secretRef: ""

  # this ought to be part of a template instead of part of the values,
  # but I did it the wrong way and never fixed it
  hooks:
    pre-install: "Deploying"
    pre-upgrade: "Deploying"
    post-install: "Deployed"
    post-upgrade: "Deployed"
    post-rollback: "Rolled back"

environment: "" # the human-readable label for this environment

With those pieces in place, I could add a hook—technically, one each for pre-install, pre-upgrade, post-install, post-upgrade, and post-rollback—using the official cURL image with a custom entrypoint, mounting the script from the ConfigMap and adding all the required values as environment variables:

{{- range $hook, $description := .Values.slack.hooks }}
---
apiVersion: batch/v1
kind: Job
metadata:
  name: "{{ $.Release.Name }}-slack-{{ $hook }}"
  labels:
    app.kubernetes.io/managed-by: {{ $.Release.Service | quote }}
    app.kubernetes.io/instance: {{ $.Release.Name | quote }}
    app.kubernetes.io/version: "{{ $.Chart.AppVersion }}"
    helm.sh/chart: "{{ $.Chart.Name }}-{{ $.Chart.Version }}"
  annotations:
    "helm.sh/hook": {{ $hook }}
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    metadata:
      name: "{{ $.Release.Name }}-slack-{{ $hook }}"
      labels:
        app.kubernetes.io/managed-by: {{ $.Release.Service | quote }}
        app.kubernetes.io/instance: {{ $.Release.Name | quote }}
        helm.sh/chart: "{{ $.Chart.Name }}-{{ $.Chart.Version }}"
    spec:
      restartPolicy: Never
      containers:
        - name: notify-slack
          image: "{{ $.Values.curl.image }}:{{ $.Values.curl.tag }}"
          command: ["/bin/sh"]
          args:
            - /etc/scripts/send-notification.sh
          env:
            - name: WEBHOOK_URL
              valueFrom:
                secretKeyRef:
                  name: "{{ required "A valid Slack Secret is required!" $.Values.slack.secretRef }}"
                  key: webhookUrl
            - name: WEBHOOK_CHANNEL
              valueFrom:
                secretKeyRef:
                  name: "{{ required "A valid Slack Secret is required!" $.Values.slack.secretRef }}"
                  key: webhookChannel
            - name: WEBHOOK_USERNAME
              valueFrom:
                secretKeyRef:
                  name: "{{ required "A valid Slack Secret is required!" $.Values.slack.secretRef }}"
                  key: webhookUsername
            - name: ACTION
              value: "{{ $description }}"
            - name: ENVIRONMENT
              value: "{{ required "A valid environment is required!" $.Values.environment }}"
            - name: A_TAG
              value: "{{ $.Values.a.image.tag }}"
            # (other service tag values go here)
          volumeMounts:
            - name: scripts
              mountPath: /etc/scripts
      volumes:
        - name: scripts
          configMap:
            name: {{ include "app.fullname" $ }}-notifications
{{- end }}

I was subsequently rewarded with automatic, informative Slack notifications any time a deployment started, ended, or failed.
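The script mounted at /etc/scripts/send-notification.sh is not reproduced here. A minimal sketch of what such a script might look like follows, written for plain /bin/sh because that is how the Job invokes it. The message layout, the helper names, and the use of channel and username overrides in the payload are assumptions; only the environment variable names come from the Job above.

```shell
#!/bin/sh
# Sketch of a send-notification.sh; the message layout is an assumption.

# Escape backslashes and double quotes for use inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Compose the payload from the variables the hook Job provides.
# Other service tags would be appended to the text in the same way as A_TAG.
build_payload() {
  text="*${ACTION}* in *${ENVIRONMENT}*: A \`${A_TAG}\`"
  printf '{"channel": "%s", "username": "%s", "text": "%s"}' \
    "$(json_escape "$WEBHOOK_CHANNEL")" \
    "$(json_escape "$WEBHOOK_USERNAME")" \
    "$(json_escape "$text")"
}

# Post only when a webhook URL is present, so the functions can be tried locally.
if [ -n "${WEBHOOK_URL:-}" ]; then
  curl -fsS -X POST -H "Content-Type: application/json" \
    --data "$(build_payload)" "$WEBHOOK_URL"
fi
```

With curl -f, an HTTP error from Slack makes the script exit non-zero, which in turn fails the hook Job rather than silently dropping the notification.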

No more running Helm locally

Running Helm on each developer’s machine was a terrible way to manage deployments. It forced everyone to memorize or write down opaque incantations. It was also all too easy to deploy to the wrong environment or with the wrong settings, or even with the wrong version of the chart if someone forgot to pull the latest updates first. (We had experienced all three scenarios at various points, almost as frequently as the successful scenarios.)

I recently had the opportunity to try helmfile and was certain it was right for us: it’s a reasonably simple tool with a single purpose that reuses familiar concepts. The configuration I built takes advantage of its solid support for multiple environments. There is a base values file plus one file per environment for more specific configuration.[2] The pipeline has two jobs per environment, one for ‘preparation’ (which runs helmfile diff) and one for ‘deployment’ (which runs helmfile apply) that requires the first one to succeed. There’s also a linting stage.

Automatic deployments were never a goal, so each one is triggered manually and requires all the previous jobs to have succeeded. The diff is only advisory, unlike a Terraform plan. It may be outdated by the time the deployment runs or ignored by the developer, but it serves as a sanity check.
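In shell terms, the jobs boil down to something like the sketch below. The wrapper function names and the HELMFILE override are hypothetical; the environment name passed in is whatever the helmfile configuration defines.

```shell
#!/bin/sh
# Hypothetical wrappers for the CI jobs. HELMFILE can be pointed at a stub
# (for example "echo") to see the commands without touching a cluster.
HELMFILE="${HELMFILE:-helmfile}"

# Linting stage: check the charts for the given environment.
lint_env() {
  "$HELMFILE" --environment "$1" lint
}

# 'Preparation' job: show what would change (advisory only).
prepare() {
  "$HELMFILE" --environment "$1" diff
}

# 'Deployment' job: apply the changes to the environment.
deploy() {
  "$HELMFILE" --environment "$1" apply
}

# Example: prepare staging && deploy staging
```

Passing the environment explicitly in every job is what removes the risk of deploying to whichever cluster a developer's machine happened to point at.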

From start to finish, integrating helmfile only took about one evening. I did discover a major caveat, though: forcibly aborting the execution of helmfile apply (not letting it clean up) can leave the Helm release and therefore the application in a broken state. I spent 15 panicked minutes searching for a solution. What worked was to identify and delete the release secrets (kubectl delete secret sh.helm.release.v1.chart-name.vrelease-revision) for the last two releases. Since we had deployed several tiny, consecutive experimental releases to test all this, rolling back was not a problem.
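For reference, the recovery can be sketched as a pair of shell helpers. The function names and the KUBECTL override are hypothetical, the revisions are whatever helm history reports for the release, and kubectl is assumed to be pointed at the release's namespace:

```shell
#!/bin/sh
# Name of the Secret in which Helm 3 stores one revision of a release.
release_secret_name() {
  printf 'sh.helm.release.v1.%s.v%s' "$1" "$2"
}

# Delete the Secrets for the given revisions of a release.
# KUBECTL can be set to "echo" for a dry run that only prints the commands.
delete_release_revisions() {
  release=$1
  shift
  for revision in "$@"; do
    "${KUBECTL:-kubectl}" delete secret "$(release_secret_name "$release" "$revision")"
  done
}

# Example (hypothetical release name and revisions):
#   delete_release_revisions chart-name 41 42
```

Deleting a revision's Secret makes Helm forget that revision entirely, so this is only safe when, as here, the previous revisions are acceptable states to fall back to.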

Although it’s always possible to cancel a job manually in the CI interface, I added a flag to all deployment jobs to prevent automatic cancellation. It’s the developer’s responsibility to avoid creating a race condition by running two deployment jobs to the same environment concurrently.

No more changing image tags by hand

At this point, the process had been abbreviated to:

  1. Merge the changes to A.
  2. Wait for the pipeline to announce the new image tag on Slack.
  3. Update the appropriate environment in the deployment repository with the new tag and merge the changes.
  4. Activate the jobs in succession as required.

Having got this far, I was unwilling to continue updating the tags by hand. I wanted a better way. The complicating factor was that, since the deployment details were now maintained in Git, updating the tag required adding a Git commit. GitHub Actions makes this easy, as the default GITHUB_TOKEN can push to the repository. Our CI system does not.

I spent a lot of time wrestling with this. In the end, I created an SSH key to use in a job designed to be triggered by other repositories. (Only that key can push directly to trunk in the deployment repository.) Funnily enough, the job runs the base Alpine Linux image, but it starts by installing Bash (along with Git, OpenSSH, and yq, to edit the YAML) to run the script, ash being less likely to be available in a typical development environment than Bash.[3] The final script looks something like this:

# $TARGET_IMAGE and $NEW_TAG come from the triggering job

# install wget, Git, OpenSSH, and yq
- |
  apk add -q --no-cache wget git openssh && \
  wget -q https://github.com/mikefarah/yq/releases/download/v4.16.1/yq_linux_amd64 -O /usr/bin/yq && \
  chmod +x /usr/bin/yq

# make the SSH key file private so that OpenSSH accepts it
- chmod 600 $SSH_KEY

# work under a temporary directory
- cd /tmp

# turn off key checking and use the provided SSH key
- 'export GIT_SSH_COMMAND="ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i $SSH_KEY"'

# check out the latest commit on the current branch
- "git clone -b $CURRENT_BRANCH --depth 1 git@host.com:repo.git && cd repo"

- cd path/to/environments
# edit just one tag (should have used another multiline string here)
- "yq e -i \".$TARGET_IMAGE.image.tag = \\\"$NEW_TAG\\\"\" ./$TARGET_ENVIRONMENT.yaml"

# set up the author details and commit it with a message pointing at the triggering commit
- 'git config user.email "$USER_EMAIL" && git config user.name "$USER_NAME"'
- 'echo -e "feat($TARGET_ENVIRONMENT): update $TARGET_IMAGE to $NEW_TAG\n\n$UPSTREAM_REPOSITORY@$UPSTREAM_REF" | git commit -aF -'

# push it to the current branch (which will trigger a normal pipeline run)
- git push -u origin HEAD

I then set up each application repository to trigger this job for each environment after pushing a new image, sequenced appropriately. That involved access tokens, HTTP requests, and CI-side templating. It took a while because I didn’t quite understand how to use the tokens, but otherwise it was straightforward.

Ultimately, all this glue eliminated the tedium of updating tags. The final process looks like this:

  1. Merge changes to A.
  2. Wait for the new image to be announced on Slack.
  3. Activate a manual job to update the deployment repository (easy with the UI) and wait for the new commit to be announced on Slack.
  4. Activate the deployment job.

Later improvements

This system works surprisingly well considering that it’s held together by scripts, cURL, and conventions. I’ve refined it a bit in the interim:

There are also a few known limitations:

I might address the redundancy in updating different environments. The rest is tolerable.

  1. Whence the article title (with apologies to Philip Greenspun).
  2. The charts are passed the names of Secrets managed outside Git. We might add something like SOPS at some point but it’s not a priority.
  3. In hindsight, I don’t know why expecting developers to install all the other tools seemed reasonable but not expecting them to install ash.