Stateful apps on Kubernetes can feel like juggling knives. On paper, they look resilient. In reality? One misstep and your data's gone — or worse, your architecture unravels mid-traffic spike.
One minute, your pods are scaling smoothly. Metrics look great. Then, boom: logs fill with “volumeMounts failed,” and suddenly your database has no clue where its data went.
This isn’t some edge-case scenario. Kubernetes storage can fail in ways that are subtle, expensive, and surprisingly easy to miss. Let’s walk through two real-world outages — complete with price tags — and unpack what went wrong.
The Hidden Complexity of Persistent Storage
Before we dive in, here’s a quick refresher. When you're running stateful workloads on Kubernetes, you’re typically dealing with:
- Persistent Volumes (PVs) and Persistent Volume Claims (PVCs)
- Storage Classes that define how those volumes get provisioned
- Cloud-specific quirks (hello, AWS EBS)
- Access modes like ReadWriteOnce, ReadOnlyMany, or ReadWriteMany
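To make that concrete, here’s roughly what a claim looks like when defined through Terraform’s Kubernetes provider (the names, namespace, and sizes below are placeholders). Note that the access mode and storage class both live right on the PVC:
resource "kubernetes_persistent_volume_claim" "db_data" {
  metadata {
    name      = "db-data"     # placeholder name
    namespace = "databases"   # placeholder namespace
  }

  spec {
    # ReadWriteOnce: a single node can mount the volume read-write.
    access_modes       = ["ReadWriteOnce"]
    storage_class_name = "gp3"   # must match an existing StorageClass

    resources {
      requests = {
        storage = "100Gi"
      }
    }
  }
}
The same object could just as easily be a YAML manifest; the point is that access mode, size, and storage class are choices you make per claim, and the defaults are rarely what a production database needs.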
It’s a layered system — and every layer can break in a different way. Here’s how that plays out in the real world.
⚠️ Story #1: Black Friday, Broken Storage
A retail company moved to Kubernetes just in time for Black Friday. Their main database — MongoDB — was running as a StatefulSet using dynamically provisioned EBS volumes.
Everything seemed fine until they scaled MongoDB across multiple availability zones. That’s when it broke.
The issue? The volume was set to ReadWriteOnce, which only lets one node access the volume at a time.
So when pods got scheduled to different zones, volumes couldn’t mount. That meant broken pods, failed writes — and no redundancy when they needed it most.
The damage? They lost about 25% of orders during peak traffic. That translated to $500,000 in lost revenue.
What went wrong:
- The team relied on the default access mode.
- They didn’t account for cross-AZ limitations of EBS.
- The storage backend didn’t support multi-node access.
Lessons learned:
- Know your access modes, and when to use ReadWriteMany with a backend that supports it, like NFS or EFS (see the sketch after this list).
- Don’t trust defaults for critical systems.
- Always test your architecture under real-world failover conditions.
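If you genuinely need multiple nodes writing the same data, the backend has to support ReadWriteMany. Here’s a rough sketch of what that could look like with the AWS EFS CSI driver and Terraform’s Kubernetes provider; the class name, claim name, and file system ID are placeholders, and whether a shared file system suits your particular database is a separate decision:
resource "kubernetes_storage_class" "efs_shared" {
  metadata {
    name = "efs-shared"
  }

  # EFS can be mounted by many nodes across AZs, unlike a single EBS volume.
  storage_provisioner = "efs.csi.aws.com"
  reclaim_policy      = "Retain"

  parameters = {
    provisioningMode = "efs-ap"                # dynamic provisioning via access points
    fileSystemId     = "fs-0123456789abcdef0"  # placeholder EFS file system ID
    directoryPerms   = "700"
  }
}

resource "kubernetes_persistent_volume_claim" "shared_data" {
  metadata {
    name = "shared-data"   # placeholder
  }

  spec {
    access_modes       = ["ReadWriteMany"]   # the mode this story actually needed
    storage_class_name = kubernetes_storage_class.efs_shared.metadata[0].name

    resources {
      requests = {
        storage = "100Gi"   # EFS doesn’t enforce this, but the field is required
      }
    }
  }
}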
⚠️ Story #2: Analytics Gone Silent
A data analytics company was processing terabytes of telemetry every day. PostgreSQL on Kubernetes, with EBS backing it. The setup was built with Terraform. Everything was automated.
And then writes started slowing down. Reports lagged. Dashboards stalled. Customers noticed.
Turns out they were using gp2 volumes, whose baseline IOPS scale with volume size and lean on burst credits. That wasn’t nearly enough for their growing workload. Disk throughput tanked. Write latency spiked. And their reporting pipeline came crashing down.
The real kicker? While the team debated scaling CPU and memory, the actual problem was storage. By the time they switched to io1 volumes with tuned IOPS, it was too late.
The fallout? They lost over 48 hours of telemetry data. SLA penalties and emergency response totaled $200,000+.
What went wrong:
- Default volume types couldn’t handle the load.
- No proactive monitoring of disk throughput.
- No IOPS validation for production workloads.
Takeaways:
- Know your storage performance profile — especially for databases.
- Parameterize Terraform configs so volumes get provisioned with the right specs.
- Monitor disk I/O, not just CPU and memory (see the sketch after this list).
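For the monitoring takeaway, one lightweight option on AWS is a CloudWatch alarm on the volume’s BurstBalance metric, since gp2 throughput collapses once burst credits run out. A minimal Terraform sketch, assuming an SNS topic already exists for alerts (the topic variable and thresholds are placeholders):
variable "alerts_sns_topic_arn" {
  type        = string
  description = "SNS topic for storage alerts (placeholder)"
}

resource "aws_cloudwatch_metric_alarm" "ebs_burst_balance_low" {
  alarm_name        = "postgres-ebs-burst-balance-low"
  alarm_description = "gp2 burst credits nearly exhausted; write latency will spike"

  namespace   = "AWS/EBS"
  metric_name = "BurstBalance"
  dimensions = {
    VolumeId = aws_ebs_volume.db_volume.id   # the volume from the Terraform example later in this post
  }

  statistic           = "Average"
  period              = 300   # 5-minute windows
  evaluation_periods  = 3
  comparison_operator = "LessThanThreshold"
  threshold           = 20    # alert well before credits hit zero

  alarm_actions = [var.alerts_sns_topic_arn]
}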
🔍 Debugging Persistent Volume Issues
Storage weirdness? Here’s where to start:
1. Check Your PVCs
kubectl get pvc -n your-namespace
kubectl describe pvc <claim-name> -n your-namespace
Look for:
- PVC stuck in Pending
- Mount access conflicts
- Events showing attach errors
2. Review Your Storage Classes
kubectl get storageclass
kubectl describe storageclass <name>
You want to know:
- What backend it uses (EBS? EFS?)
- What access modes it supports
- Whether it allows volume expansion, snapshotting, etc. (see the sketch below)
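If the defaults don’t cover those needs, define a class that does. Here’s a hedged sketch of a gp3-backed class for a database, with expansion enabled and WaitForFirstConsumer binding so the volume is provisioned in the same AZ as the pod that mounts it (the class name and numbers are illustrative):
resource "kubernetes_storage_class" "db_gp3" {
  metadata {
    name = "db-gp3"
  }

  storage_provisioner    = "ebs.csi.aws.com"
  reclaim_policy         = "Retain"
  allow_volume_expansion = true

  # Delay provisioning until a pod is scheduled, so the EBS volume
  # lands in the same AZ as the node that will mount it.
  volume_binding_mode = "WaitForFirstConsumer"

  parameters = {
    type       = "gp3"
    iops       = "6000"   # illustrative; size for your workload
    throughput = "250"    # MiB/s, illustrative
    encrypted  = "true"
  }
}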
3. Provision with Care (Terraform Example)
resource "aws_ebs_volume" "db_volume" {
availability_zone = "us-east-1a"
size = 100
type = "io1"
iops = 1000
encrypted = true
}
Avoid hardcoding weak defaults. Parameterize based on the environment — staging vs production should not look the same.
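As a sketch of what that parameterization could look like, the hardcoded resource above becomes variable-driven, with per-environment overrides (variable names and defaults here are illustrative, not a prescription):
variable "db_volume_size" {
  type    = number
  default = 100          # GiB; override per environment
}

variable "db_volume_type" {
  type    = string
  default = "gp3"        # staging default; production might use io1/io2
}

variable "db_volume_iops" {
  type    = number
  default = 3000
}

variable "db_az" {
  type    = string
  default = "us-east-1a"
}

resource "aws_ebs_volume" "db_volume" {
  availability_zone = var.db_az
  size              = var.db_volume_size
  type              = var.db_volume_type
  iops              = var.db_volume_iops
  encrypted         = true

  tags = {
    Environment = terraform.workspace   # e.g. staging vs production
  }
}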
🛠 Tools That Help
Some tools to keep in your pocket:
- Monitoring & Alerting:
- Prometheus, Grafana — Track disk IOPS, latency, and saturation
- Backups & Recovery:
- Velero, Stash — Automate snapshots and test restores regularly
- Infrastructure Automation:
- Terraform, Pulumi — Keep storage config consistent across environments
- Policy Enforcement:
- OPA, Gatekeeper, kube-scan — Catch misconfigurations before they hit prod
✅ What to Remember
Kubernetes gives you power, but it won’t save you from yourself. Stateful apps need extra attention — because if the data layer goes down, everything goes down.
Checklist to Stay Safe:
- Confirm access modes and volume compatibility
- Monitor disk IOPS and throughput — not just CPU
- Don’t let defaults define your architecture
- Know how your cloud provider handles volume mounts and failover
- Test failovers before they’re real
Running databases on Kubernetes isn’t impossible. But it is risky — unless you treat your volumes like first-class citizens.
So go ahead. Scale your apps. But make sure your data’s coming with you.