I once thought of New Relic, Prometheus, and their ilk as irrelevant in the absence of a team dedicated to instrumenting and monitoring applications. The idea that they could be used at any scale only percolated through my brain after watching Coda Hale’s excellent talk ‘Metrics, Metrics, Everywhere’.[1] By the time I tackled Kubernetes a year and a half ago, I had some inkling of how useful monitoring could be, and people seemed to be talking about ‘observability’ everywhere I looked, often alongside ‘service meshes’.

Come last week, we had a catastrophe on our hands at work. An operation that ought to have finished within an hour was taking days. We had no means of finding out why, since the only debugging tool available was the application logs, which were overflowing with irrelevant details.[2] Seeing the perfect excuse right in front of me, I seized the opportunity to begin ‘adding observability to our stack’.

The assassination of Zipkin

I dove into the world of observability the same way I once dove into Kubernetes. First came local testing. My priority was implementing distributed tracing to determine where the bottlenecks lay, so I added the Spring Cloud Sleuth library to our applications and pointed it at a Zipkin instance running in Docker.
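For anyone following along, the local setup amounted to very little. The exact invocation below is from memory, so treat it as a sketch rather than a recipe:

```shell
# Run Zipkin locally; it serves both the UI and trace ingestion on 9411.
docker run -d -p 9411:9411 openzipkin/zipkin
```

On the application side, a single dependency on spring-cloud-starter-zipkin (which pulls in Sleuth itself) was enough, since, if memory serves, Sleuth reports to http://localhost:9411 by default.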

Having completed the formalities, I, er, clicked on buttons in the applications and checked the corresponding traces. One advantage of Java immediately became apparent: integration. With a single added dependency, I could watch the trace context pass unaltered from one service to the other thanks to the automatic instrumentation of Spring libraries… even through a RabbitMQ queue! I was able to inspect the resultant 3-second trace without incident. However, the Zipkin UI foundered under the weight of a 3-minute trace: it became unresponsive for a full minute before showing timestamps of NaN.

A better solution

In searching for a more usable replacement, I read that Jaeger, a distributed tracing system I’d been hearing about for a couple of years, can understand the Zipkin protocol. I tried its all-in-one image in Docker per the guide and noted with some relief that it effortlessly processed and displayed the aforementioned 3-minute trace.
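The switch was painless because the all-in-one image can impersonate Zipkin on the same port. Again a sketch from memory; depending on the Jaeger version, the environment variable may be COLLECTOR_ZIPKIN_HTTP_PORT instead:

```shell
# Jaeger all-in-one with the Zipkin-compatible collector enabled on 9411,
# so the Sleuth configuration needs no changes. The UI lives on 16686.
docker run -d --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
  -p 9411:9411 \
  -p 16686:16686 \
  jaegertracing/all-in-one
```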

Encouraged by this, I tried the long operation from production which had prompted all the monitoring in the first place. It went on and on. Jaeger’s trace ended at the six-hour mark, perhaps because of some builtin limit, but that was more than enough for me to locate the areas that were disproportionately slow: I had discovered yet another O(M*N) problem, in the form of a database query that executed a further query for each record returned (the classic ‘N+1 queries’ antipattern).[3] Replacing it with a single aggregation pipeline shortened the full operation to 35 minutes. Distributed tracing had already paid for itself, and I was only running it locally!
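To give a feel for the shape of the fix (with invented collection and field names, not our real schema), here is the pattern in mongosh-style JavaScript:

```javascript
// Before: one extra query per record returned by the outer query — the
// O(M*N) culprit. Roughly:
//   db.orders.find().forEach(order => {
//     order.items = db.items.find({ orderId: order._id }).toArray();
//   });

// After: a single aggregation pipeline that performs the join server-side.
// The $match stage and all names here are hypothetical.
const pipeline = [
  { $match: { status: 'PENDING' } },
  {
    $lookup: {
      from: 'items',          // the collection we used to query per record
      localField: '_id',
      foreignField: 'orderId',
      as: 'items',            // embed the joined records in each result
    },
  },
];
```

One round trip instead of M+1, and the database gets to plan the whole join at once.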

Getting the same applications to talk to Prometheus in Minikube as well as Jaeger in Docker was a bit of a nuisance, but I eventually had Grafana visualizing my metrics. I kept it simple, only adding counters and timers to two core operations in our applications.

Moving on to this site, I installed Prometheus and Grafana in my DigitalOcean cluster once more using the standalone Helm charts. This time, I didn’t need anything more than a few annotations on the resources I wanted Prometheus to monitor. Partway through my experiments, I had to redeploy Grafana as a StatefulSet rather than a Deployment to properly preserve its data, and the same for Prometheus. (Then I got to thinking I should reinstall everything with the Prometheus Operator so I can use ServiceMonitors and transfer labels from pods to metrics. Later.)
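The annotations in question are the convention that the Prometheus chart’s default scrape configuration looks for. Roughly this, where the name, port, and path are placeholders for whatever the application actually exposes:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app                      # placeholder
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"      # where the metrics endpoint listens
    prometheus.io/path: "/metrics"  # the default if omitted
```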

There are many promising dashboards on the Grafana website, but they often reveal themselves to be broken once installed. I fixed a few. I even installed an exporter for GitLab CI pipeline metrics. When I publish this entry will be the first time it shows any activity. (It didn’t.)

I’m a little uncomfortable exposing Grafana to the world. I would have liked to whitelist the admin user by IP so that logging in is only possible from localhost (through port forwarding), but Grafana delegates any authentication strategy more complicated than its builtin login to an upstream HTTP server. I’m not ready for that just yet.
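For my own future reference, the delegation mechanism Grafana offers is its auth proxy: it trusts a header set by a reverse proxy that has already authenticated the request, which also means the proxy must strip that header from incoming traffic. A minimal grafana.ini sketch:

```ini
[auth.proxy]
enabled = true
header_name = X-WEBAUTH-USER
header_property = username
```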

Brave, but not that brave

The numbers I could see, in conjunction with one blog post from Chris Siebenmann about the footprint of a Prometheus deployment and another about high cardinality metrics, allayed my biggest fears: contrary to what I had heard, metrics required very little space. Given the rate of ingestion I saw locally, we could have chosen to retain the data for a full year at almost no cost. I expected logs would require less space still.
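The arithmetic is simple enough to sketch. The ingestion rate below is an assumption standing in for what I measured locally; the bytes-per-sample figure is the 1–2 bytes the Prometheus documentation cites for compressed samples:

```javascript
// Back-of-the-envelope Prometheus disk usage: rate x bytes/sample x retention.
const samplesPerSecond = 1000;            // assumed local ingestion rate
const bytesPerSample = 2;                 // ~1-2 bytes/sample after compression
const retentionSeconds = 365 * 24 * 3600; // a full year
const gigabytes = (samplesPerSecond * bytesPerSample * retentionSeconds) / 1e9;
console.log(gigabytes); // about 63 GB, trivial next to what I feared
```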

Now that I had basic monitoring and tracing in place, I thought I would take a gander at service meshes. I explored the Linkerd documentation (in particular, the Helm alternative to the *nix-only CLI), but I concluded I didn’t need it just yet. Another day perhaps. I am curious about the Grafana dashboards it provides—it may be worth installing just to tinker with those. What’s more, we (unfortunately) intend to deploy Istio in production, so Linkerd may be a good midpoint to learn how meshes work once I’ve finished setting up all the monitoring tools.

I also briefly considered InfluxDB, which I understand to be specialized for storing time series data. It looks promising as a Prometheus backend but, again, too complicated for my requirements: it’s yet another moving part, or two with the Telegraf agent. (Chronograf appears to be obsolete, as the website only refers to InfluxDB v1.)

I don’t know how we’ll handle the cardinality problem (and cardinality is critical). Apparently, no existing open source product fits all the criteria of observability. Honeycomb does—naturally, considering the author outlined these criteria to demonstrate its value—but it’s a paid service. I suppose we should focus on what we can do right now. Grafana Tempo might someday offer a solution:

To do analytics like this, a columnar store is perfect, but I couldn't find a good one that is backed by an object store, schemaless and super simple to operate. I pray I am wrong and that there exists one already, because that would be awesome!

I should give a shoutout to sybil, but its sadly not embeddable and uses a forked process for everything. Nothing wrong with that in itself, but it also has very little activity and the code is far from idiomatic. Honeycomb is a great solution and is powered by such a columnar store and Slack uses Honeycomb for its trace indexing! But sadly it's closed source and I think (heavy emphasis on I) we will end up building an OSS columnar store as part of Tempo 2.0.

Looking to the future

I still need to install Jaeger in the cluster. It displeases me that it requires Elasticsearch for persistence, but needs must. I also have to try Loki, which I installed but couldn’t use earlier because there was no straightforward way to stream the application logs from the host machine into Minikube.

Once I’m confident I’ve set things up correctly in my personal cluster, I’ll try repeating the procedure for A Viral World. I must admit, though, I’m not looking forward to having to instrument all the backend Rust code myself. I hope Node (which we use on the frontend) has good automatic instrumentation, at least. Maybe I’ll even feel brave and give InfluxDB another try; anything is possible.

In any event, metrics are fascinating. I love dashboards (although I understand that they’re not the real thing, aren’t the point, aren’t the territory, and so on and so forth). I exult over every successful addition, and I already feel so important and skilled, what with having my own Grafana instance making queries that I’ve fixed with zero PromQL knowledge.

There’s much more to talk about, but this has taken long enough to detail already, so I’ll continue in another instalment.

  1. Upon rewatching it now I’m a bit confused about how the types he mentions map to Micrometer types.
  2. 50 lines of useless logs per request, most of which were being produced automatically by the framework. We need to do better, but there were other things to think about at the time.
  3. I say ‘yet another’, but I still need to write about the first one.