Cardinality Collapse

Reading time: 1 min
#devops #prometheus #monitoring #metrics #kubernetes

Signal vs. Noise

Modern observability tools like Prometheus make it really easy to collect metrics. Sometimes... too easy.

With just a few lines of code, you can track every edge case, every click, every moving piece of your system. Feels powerful at first. But there’s a catch: the more you measure, the harder it gets to see what actually matters.

Especially when every metric has a dozen labels, and each label has a thousand values. That’s when observability flips—from helpful to harmful.

This article digs into the hidden cost of high-cardinality metrics, how real teams got burned by it, and how you can avoid the same fate.


What Is Cardinality—and Why It Can Wreck Your Day

In Prometheus, a "time series" is just a unique combo of a metric name plus its labels. So the more labels—and the more different values those labels have—the more time series you end up with.

That’s cardinality.
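
For example, a hypothetical http_requests_total with 10 paths, 5 methods, and 6 status codes already yields 10 × 5 × 6 = 300 time series per instance. Add a user_id label with 100,000 values and you’re suddenly staring at 30 million.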

And high cardinality? It’s not just a data problem. It’s a blow-up-your-monitoring-stack problem. Think:

  • More memory used
  • Slower queries
  • Creaky dashboards
  • Systems tipping over when you need them most

Case Study #1: When Metrics Turned Against Us

One mid-sized e-commerce company wanted full visibility into user behavior. So they tracked everything—clicks, product views, purchases—per user.

Seemed smart. Until Prometheus hit 1.5 billion time series. In a month.

Dashboards that used to load in under a second now took 10 to 12 seconds:

sum(rate(http_requests_total{status="200", source="web-app", instance=~".*"}[5m])) by (instance)

Internal users noticed. “Why’s the dashboard so slow?” And suddenly, observability—meant to shine a light—became part of the mess.

The fix? The team stripped out user-level labels. Aggregated metrics by region, device type, and user tier. They got speed and sanity back.
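
The post doesn’t show the reworked code, but a minimal sketch of the idea in Go (using client_golang’s promauto; the metric and label names here are illustrative, not the team’s actual implementation) might look like this:

package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// Bounded labels only: a few event types, regions, device types, and user tiers.
// Crucially, no user ID anywhere near the metric.
var userEvents = promauto.NewCounterVec(prometheus.CounterOpts{
    Name: "user_events_total",
    Help: "User events aggregated by type, region, device type, and user tier.",
}, []string{"event_type", "region", "device_type", "user_tier"})

// RecordEvent increments the aggregate counter for one user action.
func RecordEvent(eventType, region, deviceType, userTier string) {
    userEvents.WithLabelValues(eventType, region, deviceType, userTier).Inc()
}

Worst case, that’s a few dozen label combinations per instance instead of one time series per user.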


Case Study #2: The 5,000-Label Disaster

A logistics company wanted to track package delivery in real time. So they wrote this:

// Every unique (package_id, sender, destination) combination becomes its own time series.
var deliveryMetrics = promauto.NewCounterVec(prometheus.CounterOpts{
    Name: "package_deliveries_total",
    Help: "Total number of package deliveries.",
}, []string{"package_id", "sender", "destination"})

Looks innocent. But that single metric exploded into 5,000+ unique label combos. Prometheus memory usage jumped 80%. Retention policies broke. Alerts misfired. Queries timed out.

In short: they couldn’t trust their monitoring anymore.

The recovery? They:

  • Dropped package_id (too unique)
  • Grouped destinations into regions
  • Shifted per-package data into logs and traces, which are better tools for high-detail data (a rough sketch follows)
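
The post doesn’t include the fixed-up code either, so here’s a rough sketch of the same idea in Go (client_golang plus log/slog from Go 1.21+; the names are assumptions, not the company’s actual implementation). The counter keeps only bounded labels, and the per-package detail moves into a structured log line:

package delivery

import (
    "log/slog"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// The counter keeps only bounded labels: a fixed set of regions and statuses.
var deliveries = promauto.NewCounterVec(prometheus.CounterOpts{
    Name: "package_deliveries_total",
    Help: "Package deliveries by destination region and delivery status.",
}, []string{"destination_region", "status"})

// RecordDelivery updates the cheap aggregate metric and ships the
// per-package detail as a log line instead of a label.
func RecordDelivery(packageID, sender, destinationRegion, status string) {
    deliveries.WithLabelValues(destinationRegion, status).Inc()

    slog.Info("package delivered",
        "package_id", packageID,
        "sender", sender,
        "destination_region", destinationRegion,
        "status", status,
    )
}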

The Real Problem: Misunderstanding Observability

In both cases, teams fell into the same trap: assuming more data means better insights.

But observability isn’t about hoarding data. It’s about knowing what’s happening—and why.

High-cardinality metrics cause pain because:

  • Each label combo adds a new time series
  • Prometheus keeps label indexes in memory
  • High-cardinality queries are slow and expensive
  • Downsampling becomes tricky, and deduplication nearly impossible

How to Keep Cardinality Under Control

Before you add a new metric, ask:

  • Do I need this level of detail?
  • Can I roll this up into broader categories?
  • Is this better suited for logs or traces?

Some solid practices:

  • Avoid unbounded labels like user_id, request_id, or timestamps
  • Use fixed vocabularies: status codes, regions, device types (see the helper sketched after this list)
  • Create label budgets as part of design reviews
  • Audit your metrics and clean up what no one’s using
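
To make the fixed-vocabulary point concrete, here’s a small hypothetical Go helper (not from either case study) that collapses an arbitrary HTTP status code into a handful of values before it ever becomes a label:

package metrics

// statusClass reduces any HTTP status code to one of a few fixed label
// values, so the status label can never grow beyond a known vocabulary.
func statusClass(code int) string {
    switch {
    case code >= 200 && code < 300:
        return "2xx"
    case code >= 300 && code < 400:
        return "3xx"
    case code >= 400 && code < 500:
        return "4xx"
    case code >= 500 && code < 600:
        return "5xx"
    default:
        return "other"
    }
}

A counter labelled with statusClass(code) tops out at five possible status values, no matter what the backend returns.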

The Bottom Line

High-cardinality metrics aren’t evil. They’re just dangerous when ignored.

They eat memory, slow everything down, and turn your observability tools into liabilities.

So start with less. Be picky. Aggregate where you can.

Because the goal isn’t to collect everything—it’s to understand what matters.