False Promises

#devops #slo #availability #engineering #cloud

The Illusion of Reliability

99.9% uptime. Sounds great, right? That’s less than nine hours of downtime per year. Totally doable. Reassuring for customers. Easy to slap on a slide deck.

But here’s the thing: that number can be a trap.

Behind the scenes, teams chase it with dashboards and late-night deployments. Meanwhile, customers are left staring at loading spinners and outage messages.

What you're looking at is SLO debt: the gap that builds up when you keep making promises your systems can't keep. Like technical debt, it adds up fast. And it hurts just as much.

When Expectations Meet Reality

Let’s look at two real-world cases. Both companies publicly promised 99.9% uptime. Both learned the hard way what that actually costs.

Case 1: The Holiday Meltdown

One e-commerce company was gearing up for their biggest season ever. Their stack? Microservices. Cloud-native. Built to scale. Or so they thought.

Then traffic tripled.

The system cracked. Services collapsed. Dependencies froze. They racked up 36 hours of downtime at the height of the rush. Measured over the peak season, that's roughly 97.5% availability, not even close to the promised 99.9%.

The fallout? About $2 million in lost sales, a storm of angry customers, and a mountain of SLO debt. Fixing it wasn’t just about code. It was about regaining trust.

Case 2: Complexity in Disguise

A financial services firm also waved the 99.9% flag. But their cloud setup was a tangled web—too many third-party tools, too little visibility.

In just six months, they logged 15 incidents, each lasting around 90 minutes. Add it up, and that's more than 22 hours of downtime. A 99.9% target allows about 4.4 hours over that period. Way past the threshold.

Their customers noticed. So did regulators.

This wasn’t just a tech failure. It was a reality check: the architecture, monitoring, and expectations weren’t in sync.

What Creates SLO Debt?

SLO debt shows up when your service goals outpace what your systems can really deliver.

Some usual suspects:

  • Overpromising: Setting targets without looking at real-world performance
  • Ignoring dependencies: External APIs, DNS hiccups, flaky cloud services
  • Poor observability: Missing partial outages or slowdowns
  • Disjointed response: Manual, siloed, or just plain slow incident handling
  • Lack of alignment: Business leaders make promises engineers can’t deliver on

How to Pay It Down

SLO debt doesn’t go away on its own. You have to work it off—bit by bit. Here's how.

1. Start With the Right Monitoring

Don't just watch CPU or memory. Monitor actual availability. Use tools like Prometheus to catch when your service is down—not just when a machine is stressed.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: service-availability-rule
spec:
  groups:
  - name: availability-rules
    rules:
    - alert: ServiceDown
      # "up" is 1 when Prometheus can scrape the target and 0 when it can't;
      # replace your_service_name with your service's job label.
      expr: up{job="your_service_name"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Service {{ $labels.job }} is down"
        description: "Service {{ $labels.job }} has been down for more than 5 minutes."

2. Automate Your Recovery

Manual recovery is slow. And slow costs money. Use tools like Terraform to spin up backups or roll back broken deployments.

resource "aws_instance" "backup" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"

  tags = {
    Name = "BackupInstance"
  }
}
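
Declaring a standby instance is a start, but someone still has to notice the outage and press the button. One common next step, sketched here with placeholder names and subnet IDs and assuming the AWS provider, is to let an Auto Scaling group replace unhealthy instances on its own:

# Template describing how replacement instances are launched.
resource "aws_launch_template" "app" {
  name_prefix   = "app-"
  image_id      = "ami-0c55b159cbfafe1f0" # example AMI; use one valid in your region
  instance_type = "t2.micro"
}

resource "aws_autoscaling_group" "app" {
  name                = "app-asg"
  min_size            = 2
  desired_capacity    = 2
  max_size            = 4
  vpc_zone_identifier = ["subnet-aaaa1111", "subnet-bbbb2222"] # placeholder subnet IDs

  # With an attached load balancer, failed health checks trigger automatic replacement.
  health_check_type         = "ELB"
  health_check_grace_period = 120

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }
}

The specific resources matter less than the shift: recovery becomes a property of the infrastructure instead of a 3 a.m. human task.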

3. Set SLOs Based on Data

Don’t guess. Don’t hope. Build your SLOs from actual history. Look at what failed, why it failed, and how it impacted real users. Then set error budgets—and use them in planning.
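
If you're already scraping request metrics, Prometheus can do the bookkeeping for you. The recording rules below are a sketch, assuming the same http_requests_total counter as in the earlier sketch and a 99.9% target; the rule names are just examples:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: error-budget-rules
spec:
  groups:
  - name: error-budget
    rules:
    # Fraction of requests served without a 5xx over the last 30 days.
    - record: job:availability:ratio_30d
      expr: |
        sum by (job) (rate(http_requests_total{code!~"5.."}[30d]))
          /
        sum by (job) (rate(http_requests_total[30d]))
    # Share of the 0.1% error budget still unspent (1 = untouched, 0 = blown).
    - record: job:error_budget_remaining:ratio_30d
      expr: |
        1 - (1 - job:availability:ratio_30d) / 0.001

In practice you'd build the 30-day numbers from shorter-window rules for efficiency, but the shape is the same: when the remaining budget trends toward zero, the data, not a gut feeling, tells you it's time to trade feature work for reliability work.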

4. Build a Strong Incident Culture

Have playbooks. Run drills. Write postmortems people actually read. And don’t just fix the outage—learn from it.

5. Keep Talking to Users

Stuff breaks. Everyone knows that. What people care about is how you handle it. A clear, honest update in the middle of a mess? That builds more trust than a perfect record.

Rethinking Uptime

SLO debt doesn’t mean you’re doing it wrong. It means you’ve outgrown the old way of thinking.

It’s a signal. Maybe your systems are too complex. Or maybe your promises are too optimistic. Either way, it’s time for a reset.

Forget chasing a perfect 99.9%. Treat it like what it really is: a constraint, not a goal.

Because your users don’t expect perfection. They expect resilience. Honesty. Fast response when things go sideways.

Give them that—and you’ll have something better than a fancy uptime number. You’ll have their trust.

And maybe, just maybe, you’ll finally earn that vacation.