The Death and Rebirth of a Cluster: Introduction
The year was 2021. I was feeling inappropriately smug after having assembled an observability stack in my personal Kubernetes cluster. Then everything abruptly broke, and I could no longer access my beautiful graphs.
I still don’t know what the reason was—most likely resource exhaustion—but the plan was always to rebuild it in the form of Infrastructure-as-Code (IaC), replacing my ad hoc bundle of manifests and instructions. I thought I’d end up with a simple Terraform-based system that deployed the same monitoring stack I had at that point. As a software developer with minimal ops experience, I expected understanding and assembling all the pieces to take a few weeks, or at most a month.
The odyssey that unfolded (erratically) over the next 16 months, however, saw me dive into everything from tricking Terraform into adopting resources, to booting NixOS in Packer, to desperately trying to tame TimescaleDB WALs, to adapting Linkerd dashboards for Grafana. As I write this, my stack includes:
- Terraform to provision the infrastructure, with kind (Kubernetes in Docker) as the testing environment
- Argo CD for a GitOps approach to provisioning the cluster, using Jsonnet as much as possible rather than Helm or Kustomize (there's a small sketch of what that looks like after this list)
- cert-manager
- The Linkerd service mesh
- The Contour ingress controller
- A different, entirely Grafana Labs–based observability stack consisting of Mimir, Tempo, Loki, and Grafana provisioned via Jsonnet, with all data stored in S3-compatible storage
- external-dns for automatically updating DNS records
- Lens to inspect resources in the cluster when I need a GUI
- GitLab Runner (as before) and GitHub Actions Runner Controller
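To give a taste of why I prefer Jsonnet to Helm or Kustomize, here's a minimal, hypothetical sketch (the names and images below are made up for illustration, not taken from my repository): manifests are plain data, so a Deployment becomes a function returning an object, and overrides are ordinary object composition rather than string templating or patch files.

```jsonnet
// A hypothetical Deployment helper; everything here is plain Jsonnet data.
local deployment(name, image, replicas=1) = {
  apiVersion: 'apps/v1',
  kind: 'Deployment',
  metadata: { name: name, labels: { app: name } },
  spec: {
    replicas: replicas,
    selector: { matchLabels: { app: name } },
    template: {
      metadata: { labels: { app: name } },
      spec: { containers: [{ name: name, image: image }] },
    },
  },
};

// Overriding a nested field is object composition, not templating:
deployment('grafana', 'grafana/grafana:10.4.2') + {
  spec+: { replicas: 2 },
}
```

Run through `jsonnet`, that evaluates to a single Deployment with two replicas. No templates, no patches.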
But getting this far took a great deal of exploration, many false starts, and no small amount of questioning my sanity and competence. Some of the things I used along the way were:
- minikube
- The Traefik ingress controller
- Tanka
- SOPS and age (though I still use these elsewhere)
- KubeView
- Octant
- webmentiond and docker-postfix
- Jaeger with Elasticsearch and Kafka
- TimescaleDB with Promscale, Patroni (with Raft or etcd), and pgBackRest
- NixOS
- QEMU
- Packer
- Vagrant
- VirtualBox
- Ansible
- linkerd-disable-injection-mutation-webhook
- Nomad
You can find my code on GitLab at shivjm-www/infrastructure, although the (ever-evolving) current incarnation might differ from what you see in the next few entries. Over the course of this series, I intend to outline what I did and why, not how; while I’ll link to my commits where appropriate, you won’t see much code in the articles themselves. The primary goal is to record my thoughts for myself, but I hope others will find it helpful too. Feel free to subscribe to the feed to stay updated.
Onwards!