I watched Gil Tene’s talk ‘Your Load Generator Is Probably Lying To You’. A lot of it was beyond me, but all of it sounded bad. This is the little I was able to comprehend:

I’ll have to revisit this talk someday when I have a bit of experience.

Meanwhile, in the cluster

I’ve now reinstalled Prometheus and Grafana using kube-prometheus-stack. The Traefik dashboard’s ‘Service’ variable confused me. I assumed it was supposed to slice the Traefik traffic according to which Kubernetes Service it was going to, but I had to fiddle with the panels a lot before they showed any data for non-Traefik services, and much of what I saw didn’t make sense even then.

What I was missing is that ‘Service’ is not a Kubernetes Service but a Traefik Service, so the only value there will ever be is traefik (or maybe something like one per Traefik Pod, I don’t know). The dashboard was misleading me because its variable was defined with label_values(service), which returns every value of the service label anywhere in the system rather than only the ones on Traefik metrics. Once I changed the definition to label_values(traefik_service_requests_total, service), the irrelevant values disappeared and the dashboard became comprehensible.
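Side by side, the old and new variable queries (both taken straight from the dashboard definition):

# old: every value of the service label anywhere in Prometheus
label_values(service)

# new: only the values that appear on Traefik’s own metrics
label_values(traefik_service_requests_total, service)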

I’ve also got Loki up and running. Its chart says:

## If you set enabled as "True", you need :
## - create a pv which above 10Gi and has same namespace with loki
## - keep storageClassName same with below setting

I created a 10 GB volume. There was no ready-made way to see usage, so I tracked down the official dashboards (written in Jsonnet, which I haven’t used but which seems reasonably self-explanatory) and found the metric I was looking for in loki-writes-resources.libsonnet: kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*$Container.*"}. I added kubelet_volume_stats_available_bytes{persistentvolumeclaim=~".*$Container.*"} and set them both to stack so that I have an easy indicator of how much headroom is left…
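For anyone wondering what the volume bit involves, the chart’s persistence settings boil down to something like this (a sketch only: the exact key names may differ between chart versions, and do-block-storage is just DigitalOcean’s default storage class):

# Loki chart values, sketched from memory rather than copied from my file
persistence:
  enabled: true
  size: 10Gi
  storageClassName: do-block-storage   # DigitalOcean’s default block-storage class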

Disk usage grew by 25.3 MB in 48 hours, which works out to roughly 379.5 MB a month, or about 9.1 GB over two years, so I assume the 10 GB volume will suffice for the next two years.

Jaeger

I considered using the Jaeger Operator, but that chart doesn’t appear to be compatible with Helm 3, so I have to assume it’s outdated. I turned to the standalone Helm chart. That was a frustrating experience, to say the least. It required:

  1. Somewhere between two and three hours of changing heap sizes and fiddling with TLS certificates while trying to get ElasticSearch working, until I figured out that the Bash script that reported each Pod’s readiness couldn’t handle spaces in the password.

  2. Another hour of trying to make Jaeger connect to Kafka, which ended only when I understood that the chart wasn’t configuring the Jaeger components to connect to the Kafka installation it had itself created. It was better to install Kafka separately (see the sketch after this list).

  3. Another couple of hours lost connecting Jaeger to the ElasticSearch cluster, including regenerating the certificates with the right host names. I even tried importing the generated root certificate into my local store, tried to connect locally, saw it was rejected, gave up, and turned off TLS verification. Or thought I did.

  4. Another hour spent searching for the right incantations to force all the different Jaeger components to disable TLS verification. This is what did the job:

    # irrelevant details elided
    storage:
      type: elasticsearch
    
      elasticsearch:
        env:
          ES_TLS_ENABLED: "true"
          ES_TLS_SKIP_HOST_VERIFY: "true"
          ES_ARCHIVE_TLS_ENABLED: "true"
          ES_ARCHIVE_TLS_SKIP_HOST_VERIFY: "true"
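As for point 2, pointing the chart at a separately installed Kafka amounts to values along these lines. This is a sketch from memory: the key names are how I recall the Jaeger chart’s values, the broker address and topic name are made up, and your chart version may differ.

# sketch only: external Kafka for the Jaeger chart
provisionDataStore:
  kafka: false                # don’t let the Jaeger chart install its own Kafka
storage:
  kafka:
    brokers:
      - "my-kafka.kafka.svc.cluster.local:9092"   # hypothetical address of the separate Kafka
    topic: jaeger-spans                           # arbitrary topic name
ingester:
  enabled: true               # collector writes to Kafka, the ingester reads it into ElasticSearch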

It worked in the end, though. I started reflexively trying to enable tracing in Prometheus, ElasticSearch, et al. until I remembered that… isn’t useful to me.

What did I gain?

It was all worth it to see those beautiful graphs. 99.9% of my cluster now consists of monitoring tools watching other monitoring tools, and even themselves, like so:

Who needs software to run when you’ve got an observability stack, eh?

Monitoring containers by Helm chart

I installed a dashboard to monitor individual containers and updated it to use the correct variables. Then I created a few panels for it to show the resource usage history:

Two panels showing the CPU and memory usage for the Kafka container.

It works, but all the underlying queries match the container label against the value selected for the Container variable, which breaks down when the pods aren’t named accordingly (for example, when the container is foo but the pods are named f-master-0) and, as is visible in the graphs above, when more than one matching container has existed in the selected period. Since I use Helm charts for everything, I thought I could use metric relabeling to add a few of the labels Helm charts set and coalesce them; then I could match against app.kubernetes.io/name and app.kubernetes.io/component, and maybe some label that marked a new deployment.
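Those queries are all variations on roughly this shape (a sketch of the pattern, not copied from the dashboard):

# sketch: CPU usage for whatever the Container variable matches by name alone
sum by (pod) (
  rate(container_cpu_usage_seconds_total{container=~"$Container"}[5m])
)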

This seemed like something I might need to hack kube-state-metrics itself to do: I found where the labels are applied, and the code has access to the pod metadata at that point. Fortunately, though, I didn’t need to patch anything; instead, I set up some relabeling (or, more precisely, some metric relabeling) for kube-state-metrics:

# values.yaml for kube-prometheus-stack
kube-state-metrics:
  extraArgs:
    - "--metric-labels-allowlist=pods=[app,name,app.kubernetes.io/name,helm.sh/chart,app.kubernetes.io/component]"
kubeStateMetrics:
  serviceMonitor:
    metricRelabelings:
      - sourceLabels: [label_app, label_app_kubernetes_io_name]
        separator: ":"
        targetLabel: "app"
      - sourceLabels: [app]
        regex: "(.*):(.*)"
        replacement: "$1$2"
        targetLabel: app
      - action: "labeldrop"
        regex: "^label_app_kubernetes_io_name|label_app$"

The extra argument for kube-state-metrics tells it to transfer those specific labels from pods onto their metrics. The metricRelabelings section then normalizes the app and app.kubernetes.io/name labels in a roundabout fashion.[1] After that, the dashboard only needs to match against a single label. This worked perfectly:
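(To make the footnote concrete: the relabeling takes the kube_pod_labels series from the first pair below to the second; the pod names and label values are just for illustration.)

# before: whichever of the two labels the chart happens to set
kube_pod_labels{namespace="default", pod="kafka-0", label_app="kafka"}
kube_pod_labels{namespace="default", pod="f-master-0", label_app_kubernetes_io_name="foo"}

# after the metricRelabelings: one normalized label either way, the originals dropped
kube_pod_labels{namespace="default", pod="kafka-0", app="kafka"}
kube_pod_labels{namespace="default", pod="f-master-0", app="foo"}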

The trouble now is, I would need to transfer the labels and then relabel the metrics for every component that collects them, which isn’t always possible. For example, the metrics for CPU and memory usage are collected by cAdvisor, which is running in the cluster as part of the kubelet. Because I’m using DigitalOcean’s Managed Kubernetes, I don’t have any control over cAdvisor and therefore can’t pass it the command-line flag to transfer labels, rendering this entire endeavour somewhat useless.

I restored the original dashboard in defeat. (On the bright side, I’ve noted for the future that all this extra cardinality I was worrying about amounted to fewer than 200 new samples appended per second in Prometheus.)

Assorted titbits


  1. Since a given pod will only have at most one of those labels, it puts them both in a new app label with a : in between, removes the colon (leaving us with at most one full name), and drops the two original labels.