False Signals

Reading time: 1 min
#devops #latency #metrics #performance #kubernetes

In big systems, metrics are everything. They’re how we know things are working. Until they lie to us.

That’s exactly what happened to us. Everything looked good—dashboards were green, metrics in check, latency at the 99th percentile looking fine. But users? They were stuck. Waiting. Wondering what went wrong.

The short answer: we trusted the 99th percentile too much. It told a happy story. But the real story was happening somewhere deeper—and it wasn’t showing up in the charts.

Let’s walk through two real-world examples where metrics misled us, and how tools like tracing and chaos engineering helped us finally see what was really going on.


When Percentiles Hide the Truth

Percentiles are great at summarizing data. But sometimes they summarize too much. You can have a “healthy” 99th percentile latency and still have a decent chunk of users getting hit with multi-second delays.

Why? Because percentiles smooth things out. They hide the long tail. And in systems built on microservices, where one slow downstream call can ripple across the stack, that tail grows fast.

Here’s the problem: when fewer than one percent of requests are slow, they barely move the 99th percentile at all. But that sliver might contain the requests that matter most.
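
Here’s a toy example with made-up numbers. Half a percent of requests are painfully slow, and the 99th percentile never notices:

# Synthetic latencies: 9,950 requests at 100 ms, 50 requests stuck at 5 s.
# The slow group is 0.5% of traffic, i.e. below the 1% a p99 can "see".
latencies_ms = [100] * 9950 + [5000] * 50

latencies_ms.sort()
p99 = latencies_ms[int(0.99 * len(latencies_ms)) - 1]  # nearest-rank p99

print(f"p99 latency: {p99} ms")                                      # 100 ms, looks healthy
print(f"worst case:  {max(latencies_ms)} ms")                        # 5000 ms
print(f"requests over 1 s: {sum(l > 1000 for l in latencies_ms)}")   # 50

Fifty users stuck at five seconds, and the headline number still reads 100 ms.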


Case #1: Slowness Hidden in a Sea of Green

One of our clients—a financial services firm running hundreds of microservices—ran into exactly this.

Their dashboards looked perfect. Latency percentiles were healthy. No red flags.

But users were frustrated, especially during key moments like portfolio rebalancing. Things felt sluggish.

So we dug deeper. During high traffic, certain downstream services were intermittently overwhelmed. Not down. Just struggling. Some calls took too long, which triggered retries. Retries caused queues. Queues slowed everything else down. It became a feedback loop.

But because it only affected a slice of requests, the percentile metrics didn’t catch it. It was the classic “boiling frog” problem—something was clearly wrong, but the system didn’t think so.
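
Here’s a deliberately crude model of that loop, with invented numbers: retries add load, the extra load pushes the timeout rate up, and the loop feeds itself.

CAPACITY = 1000.0   # requests/s the downstream handles comfortably
BASE_LOAD = 900.0   # offered load during the traffic spike
MAX_RETRIES = 3

def timeout_rate(load):
    # pretend timeouts start once utilization passes 80% and climb from there
    utilization = load / CAPACITY
    return min(0.95, max(0.0, (utilization - 0.8) * 2.5))

load = BASE_LOAD
for step in range(5):
    p = timeout_rate(load)
    # the k-th retry only fires if the previous attempt timed out,
    # so each request costs roughly 1 + p + p^2 + ... attempts downstream
    attempts = sum(p ** k for k in range(MAX_RETRIES + 1))
    load = BASE_LOAD * attempts
    print(f"round {step}: timeout rate {p:.0%}, downstream load {load:.0f} req/s")

In this toy model, a 25% timeout rate turns 900 requests per second into more than 3,000 within a couple of rounds. That’s exactly the kind of spiral that never showed up in the percentile charts.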


Chaos to the Rescue

To get to the root, we ran controlled chaos experiments. Off-peak, safe window, no customer impact.

Using Terraform and the Chaos Toolkit, we injected latency into lower-risk services. We wanted to see how the system really handled stress—not in theory, but in action.

Here’s a stripped-down version of the setup:

resource "aws_ec2_instance" "chaos_instance" {
  ami           = "ami-0abcd1234efgh5678"
  instance_type = "t2.micro"

  tags = {
    Name        = "ChaosTester"
    Environment = "Testing"
  }
}

We watched how latency moved through the stack. We found weak spots: internal APIs without proper backoff, services with tight timeouts, and orchestration patterns that didn’t degrade gracefully.

Afterwards, the team rolled out a few fixes (the first two are sketched just after this list):

  • Adaptive retries
  • Circuit breakers
  • Local caching
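
Here is a minimal sketch of what those first two fixes look like. The names, thresholds, and timings are illustrative, not the client’s production code:

import random
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    """Stop calling a dependency for a cool-down period after repeated failures."""

    def __init__(self, failure_threshold=5, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency is cooling down")
            self.opened_at = None   # cool-down over: allow trial calls again
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

def call_with_backoff(fn, max_attempts=4, base_delay=0.05, max_delay=2.0):
    """Retry with exponential backoff and full jitter instead of hammering a slow service."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))   # jitter spreads out retry bursts

The jitter is the important part: synchronized retries are exactly how one slow dependency turns into a queueing spiral.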

Tail latency dropped from 8+ seconds to under 400 ms in the most critical paths.


Case #2: When Flash Sales Flatter the Metrics

We hit a similar issue in our own e-commerce stack. Big campaign. Huge spike in traffic.

Dashboards? All green.

  • Average latency: fine.
  • Error rate: low.
  • 99th percentile? Steady.

But support lit up. “Site’s laggy.” “Pages freezing.” Not everywhere—just some users.

So what happened?

Two things:

  • Our load balancer wasn’t balancing well. Round-robin looked fair on paper, but traffic wasn’t evenly distributed.
  • Our database had hidden connection limits that kicked in under load.

The result? Some API paths got throttled. Not enough to break, just enough to slow down. Average latency? Still smooth. But some users were waiting over 3 seconds per request.
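
Here’s a toy simulation of how a capped connection pool does that. The numbers are invented and exaggerated to make the shape obvious, and only the paths hitting this pool queue like this, which is why the site-wide averages barely moved:

import heapq

POOL_SIZE = 10       # the "hidden" connection limit
QUERY_TIME = 0.05    # seconds each request holds a connection
ARRIVAL_RATE = 300   # requests/s hitting this path during the spike
DURATION = 10        # seconds the spike lasts

free_at = [0.0] * POOL_SIZE              # when each connection frees up
waits = []
for i in range(ARRIVAL_RATE * DURATION):
    t = i / ARRIVAL_RATE                 # arrival time
    earliest = heapq.heappop(free_at)    # next connection to come free
    start = max(t, earliest)             # wait if the pool is exhausted
    waits.append(start - t)
    heapq.heappush(free_at, start + QUERY_TIME)

print(f"slowest wait on this path: {max(waits):.1f} s")
print(f"requests waiting over 1 s: {sum(w > 1 for w in waits)} of {len(waits)}")

The pool never throws an error. It just makes people wait.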

So we added OpenTelemetry for distributed tracing inside our Kubernetes cluster:

# Deploy OpenTelemetry Collector
kubectl apply -f otel-collector.yaml
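
The collector is only half of the picture; the services themselves have to emit spans. Here is roughly what that instrumentation looks like in Python. The service name, span names, and collector endpoint are examples, not our exact config:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Ship spans to the in-cluster collector over OTLP gRPC (default port 4317).
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-api"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def fetch_inventory(sku):
    ...  # stand-in for the real database read

def render_product_page(sku):
    # Each hop gets its own span, so a slow database read shows up as a long
    # child span instead of disappearing into an endpoint-level average.
    with tracer.start_as_current_span("render_product_page") as span:
        span.set_attribute("product.sku", sku)
        with tracer.start_as_current_span("db.fetch_inventory"):
            return fetch_inventory(sku)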

Now we could follow every request—every hop, every delay, every hiccup. We traced the slowdowns to database lock contention during peak reads. Queries were stacking up.

We fixed it with:

  • Better indexes
  • Smarter batching
  • Load-aware routing

By the next campaign, things held steady. Fast. Reliable. And no surprises.


The Bigger Lesson

Here’s what we learned the hard way:

Metrics are helpful. But they aren’t enough.

If you only watch percentiles, you’ll miss the users stuck in the long tail. And those are the ones who remember the experience.

To truly understand your system, you need:

  • Tracing (so you can follow real requests)
  • Chaos tests (so you can find pressure points)
  • Production-like load testing (so you don’t get surprised later)

Before You Trust the Numbers…

The 99th percentile is useful—but only if you understand what it hides. Relying on it alone is like checking a city’s average temperature and assuming everyone’s comfortable.

The world doesn’t work that way. Neither do systems.

So trace more. Break things (on purpose). Stress test smart. And most importantly, listen to your users—because they notice what your dashboards miss.

When it counts, performance isn’t just a metric. It’s a feeling.