The year was 2021. I was feeling inappropriately smug after having assembled an observability stack in my personal Kubernetes cluster. Then everything abruptly broke and I could no longer access my beautiful graphs.

I still don’t know what caused it (most likely resource exhaustion), but I had always planned to rebuild the cluster as Infrastructure-as-Code (IaC), replacing my ad hoc bundle of manifests and instructions. I thought I’d end up with a simple Terraform-based system that deployed the same monitoring stack I had at that point. As a software developer with minimal ops experience, I expected understanding and assembling all the pieces to take a few weeks, or at most a month.

The odyssey that unfolded (erratically) over the next 16 months, however, saw me dive into everything from tricking Terraform into adopting resources, to booting NixOS in Packer, to desperately trying to tame TimescaleDB WALs, to adapting Linkerd dashboards for Grafana. As I write this, my stack includes:

But getting this far took much exploration, many false starts, and no small amount of questioning my sanity and competence. Some of the things I used along the way are:

You can find my code on GitLab at shivjm-www/infrastructure, although the (ever-evolving) current incarnation might differ from what you see in the next few entries. Over the course of this series, I intend to outline what I did and why, not how; while I’ll link to my commits where appropriate, you won’t see much code in the articles themselves. The primary goal is to record my thoughts for myself, but I hope others will find it helpful too. Feel free to subscribe to the feed to stay updated.


Next in series: (#2 in The Death and Rebirth of a Cluster)