Signals of Collapse

Reading time: 1 min
#oncall #burnout #devops #metrics #engineering

Imagine it’s 3 AM. Your heart’s racing—not from a dream, but because your phone just buzzed. Pager alert. You grab it, already bracing yourself. This isn’t some rare emergency. It’s just another night on-call.

For a lot of engineers, this isn’t a fluke. It’s the norm.

On-call work is one of the biggest reasons engineers burn out. But here’s the thing: most teams don’t talk about it. And when they don’t? People don’t just get tired—they leave.


So What’s Really Burning People Out?

Let’s talk numbers: nearly 70% of on-call engineers say they’re feeling burned out. Not just a little tired. Properly worn down.

You see it in the tension during standups. The sighs when someone’s phone buzzes. The good engineers walking out the door after one escalation too many.

Here’s what that burnout actually looks like in the real world.


Story #1: TechCorp’s “Everything’s On Fire” Alerts

At TechCorp, a mid-sized SaaS company, engineers took on-call one week a month. It was manageable—until it wasn’t.

As the product grew, so did the noise. 300 alerts per week. And guess what? 80% were low-priority. Stuff that didn’t need to wake anyone up.

Morale dropped fast. Two senior engineers left in six months. The rest had to cover more shifts. And the cycle just kept spinning.


Story #2: DataDynasty’s Weekend From Hell

DataDynasty went all-in on microservices. But they didn’t plan for how messy that could get.

One weekend, a failure in the service mesh triggered 500 alerts. In two days.

The alerting system broke down. Dashboards went red. And engineers were stuck on calls almost nonstop. No sleep. No breaks. Just chaos.

Afterward? People were fried. Some took unplanned time off. Others started questioning whether they wanted to stay in tech at all.

These aren’t one-off disasters. They’re signs of a system that isn’t working.


How to Spot Burnout Before People Quit

The problem with burnout? You often don’t see it until it’s too late.

But if you know what to look for, you can catch it early.

1. Watch the Alert Numbers

Start tracking your alert load—seriously. Tools like Prometheus or Grafana can help.

Use a query like this to find noisy alerts (this version queries Prometheus's built-in ALERTS series; the cutoff of 100 is arbitrary, so tune it to your team):

sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[1w])) > 100

Things to monitor:

  • How many alerts each engineer gets
  • What time they’re getting pinged (especially at night)
  • Repeat alerts for the same problem

Set some limits. If the numbers spike, take action.
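
If your paging tool can export a simple log of pages, even a short script will surface these patterns. Here's a minimal sketch in Python, assuming a hypothetical CSV export named pages.csv with timestamp, engineer, and alertname columns (adjust the field names to whatever your tool actually produces):

import csv
from collections import Counter
from datetime import datetime

# Assumed export format (hypothetical): timestamp,engineer,alertname
# e.g. 2024-03-02T03:14:00,alice,HighCPU
per_engineer = Counter()   # how many alerts each engineer gets
night_pages = Counter()    # pages between 22:00 and 06:00
repeats = Counter()        # the same alert firing again and again

with open("pages.csv", newline="") as f:
    for row in csv.DictReader(f):
        ts = datetime.fromisoformat(row["timestamp"])
        per_engineer[row["engineer"]] += 1
        repeats[row["alertname"]] += 1
        if ts.hour >= 22 or ts.hour < 6:
            night_pages[row["engineer"]] += 1

print("Pages per engineer:", dict(per_engineer))
print("Night pages per engineer:", dict(night_pages))
print("Noisiest alerts:", repeats.most_common(5))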

2. Automate the Boring Stuff

Not everything needs a human. If something’s predictable, automate it.

Here’s an example using Terraform: a CloudWatch alarm that only fires when CPU stays high for a sustained stretch, so a brief spike doesn’t page anyone:

# Require three consecutive 5-minute periods above 80% before alarming,
# so a short CPU spike doesn't wake anyone up.
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "high_cpu_alarm"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "CPU above 80% for 15 minutes straight"
  treat_missing_data  = "notBreaching"
}

This way, alerts are tied to real problems—not vague signals that cause panic at 2 AM.

3. Track the Human Side

Pager volume is just one piece. To really understand burnout, track:

  • MTTA / MTTR: how long it takes to acknowledge an alert and how long it takes to resolve it
  • Sleep disruptions: How often someone’s woken up
  • Monthly satisfaction surveys (make them anonymous!)
  • Escalation count per engineer

You're not tracking people. You're spotting patterns—before they become resignations.
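
MTTA and MTTR in particular are easy to compute once you have incident timestamps. A rough sketch, again in Python, assuming each incident record carries triggered, acknowledged, and resolved times (the records below are made-up placeholders; in practice they'd come from your incident tool):

from datetime import datetime, timedelta

# Placeholder incidents; in practice, pull these from your incident tool.
incidents = [
    {"triggered": "2024-03-02T03:14:00", "acknowledged": "2024-03-02T03:22:00", "resolved": "2024-03-02T04:05:00"},
    {"triggered": "2024-03-05T14:10:00", "acknowledged": "2024-03-05T14:12:00", "resolved": "2024-03-05T14:40:00"},
]

def parse(ts):
    return datetime.fromisoformat(ts)

# MTTA: mean time from trigger to acknowledgement.
mtta = sum((parse(i["acknowledged"]) - parse(i["triggered"]) for i in incidents), timedelta()) / len(incidents)
# MTTR: mean time from trigger to resolution.
mttr = sum((parse(i["resolved"]) - parse(i["triggered"]) for i in incidents), timedelta()) / len(incidents)

print("MTTA:", mtta)
print("MTTR:", mttr)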


Culture Matters More Than Dashboards

Data’s important. But you won’t fix burnout with graphs alone.

What actually helps?

  • Give people recovery time after on-call weeks
  • Do blameless postmortems: no finger-pointing, just learning
  • Rotate alert ownership so no one’s always “the hero”
  • Fix the root causes—not just the symptoms

Incidents shouldn’t feel like punishment. They should be learning moments.


Don’t Wait for the Goodbye Email

If you're guessing at burnout, you're already behind.

Start measuring it. Pay attention to how often your team gets paged, and how they feel about it. Build systems and cultures that reduce the load.

Because when good engineers leave, it’s rarely just because of one bad night. It’s because they saw no one trying to make things better.


Some Tools That Can Help:

  • Prometheus and Grafana, for tracking alert volume and spotting the noisy alerts
  • Terraform, for keeping alert thresholds in code instead of in someone's head
  • Anonymous survey tools, for the monthly check-ins on how people are actually doing

Final Thought:
The next time your phone buzzes at 3 AM, pause. Listen to it—not just for the alert, but for what it’s telling you about your systems, your team, and your culture.

The pager isn’t just noise. It’s a signal. Don’t ignore it.