Architecting a Highly Resilient Kubernetes Cluster: Key Practices
Outages happen, regardless of intent or planning. The difference between hours of downtime and seamless failover usually comes down to what’s in place before the failure. A high-resilience Kubernetes cluster is not an accident—it’s engineered.
1. Multi-AZ Deployment: Don’t Let a Single Availability Zone Take You Down
AZs fail; networks partition. Spanning nodes and storage across multiple Availability Zones is non-negotiable for software with uptime requirements.
Implementation:
- Spread worker nodes across at least three AZs. EKS, AKS, and GKE all support this at cluster bootstrap.
- Managed control planes generally do this, but always confirm with `eksctl get cluster --name your-cluster` or your provider's equivalent.
- Example (EKS):

```bash
eksctl create cluster --name prod-ha \
  --zones us-east-1a,us-east-1b,us-east-1c \
  --nodegroup-name infra-nodes \
  --node-type m6i.large --nodes 3
```
- Note: Spot fleets can inadvertently concentrate nodes in a single AZ under pressure. Monitoring is necessary.
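A quick way to catch that kind of drift is to check the zone label on every node. A minimal sketch, assuming the standard topology.kubernetes.io/zone label set by the cloud provider:

```bash
# Show each node alongside its AZ label; an even spread is what you want.
kubectl get nodes -L topology.kubernetes.io/zone

# Or count nodes per zone directly.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}' | sort | uniq -c
```

Wiring this into a periodic check catches spot-driven skew before an AZ outage does.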
2. High Availability Control Plane: Managed or Self-Hosted?
Control plane downtime is cluster downtime. For production, avoid single-master clusters.
- Managed Services: EKS and GKE provide a multi-AZ HA control plane by default; recent AKS (>1.23) does so in regions with availability zone support. Always verify the feature is enabled (e.g., for EKS, control plane endpoints should be distributed across zones).
- Self-Managed: Run at least three etcd nodes, preferably across physical fault domains (not just separate VMs); a health-check sketch follows this list.
  - Sample etcd cluster error on quorum loss: `etcdserver: failed to reach the peer deadline error="context deadline exceeded"`
  - Use taints/affinities to ensure control-plane nodes stay segregated from workloads.
- Caution: Clock skew between etcd nodes (>1s drift) leads to unpredictable failures. Monitor time sync with chrony or ntpd.
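To verify quorum from a control-plane node, etcdctl can report member health and leader status. A minimal sketch; the endpoints and certificate paths are illustrative (they match a typical kubeadm layout) and depend on how your cluster was bootstrapped:

```bash
# Report per-member status (leader, DB size, raft term) in a table.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.0.1.10:2379,https://10.0.2.10:2379,https://10.0.3.10:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --write-out=table

# "endpoint health" exits non-zero if any listed member is unreachable,
# which makes it easy to wire into a cron job or an alert.
```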
3. Worker Node Architecture: Expect and Embrace Failure
Instances terminate for patching, host failure, or spot market volatility.
a) Auto-Replacement & Scaling
- Use Managed Node Groups (EKS) or Auto Scaling groups (ASGs) with health checks.
- Set `maxUnavailable: 1` in the rolling update strategy for Deployments (a minimal manifest sketch follows this list).
- Test node loss: kill a node and observe eviction and pod rescheduling.
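For reference, here is what that strategy looks like in context. A minimal sketch; the app name and image are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api                  # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1              # never take more than one replica down during a rollout
      maxSurge: 1
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      containers:
        - name: payment-api
          image: registry.example.com/payment-api:1.0.0   # placeholder image
```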
b) Intelligent Pod Scheduling
- Anti-affinities spread replicas. Example enforcing both zone and host diversity (common error: overlapping labels, pods still colocate); see also the spread-constraint sketch after this list:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: payment-api
        topologyKey: "topology.kubernetes.io/zone"
      - labelSelector:
          matchLabels:
            app: payment-api
        topologyKey: "kubernetes.io/hostname"
```
- Some workloads (legacy stateful sets) aren’t designed for restarts; refactor or choose alternatives.
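Where hard anti-affinity is too strict for the replica count (e.g., more replicas than zones), topologySpreadConstraints can achieve similar diversity with a tunable skew. A minimal sketch, reusing the hypothetical payment-api labels:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule   # use ScheduleAnyway for a soft preference
    labelSelector:
      matchLabels:
        app: payment-api
```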
c) Node Upgrades: Patch Without Outage
- Use `kubectl cordon` followed by `kubectl drain` for sequential node patching, as in the sketch below.
- Known issue: some older CNI plugins (e.g., Calico <3.20) fail to tear down cleanly during rapid upgrades.
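A typical single-node patch cycle looks like this; the node name is a placeholder:

```bash
# Stop new pods from landing on the node, then evict running pods while
# honoring PodDisruptionBudgets.
kubectl cordon ip-10-0-1-23.ec2.internal
kubectl drain ip-10-0-1-23.ec2.internal \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=5m

# Patch or replace the node, then let it accept pods again.
kubectl uncordon ip-10-0-1-23.ec2.internal
```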
4. Intracluster Redundancy: Stopping at "3 Replicas" Isn't Enough
High availability is more than replica count.
- Replicas: Three+ replicas, but always combine with anti-affinity and Pod Disruption Budgets.
- Probes:

```yaml
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 3
  timeoutSeconds: 2
livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  failureThreshold: 2
  timeoutSeconds: 2
```
- Real-world: Mistuned probes can create cascading restarts; always sanity-check thresholds during failure simulation.
- PodDisruptionBudget example:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: core-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: core-service
```
Gotcha: Draining more than one node simultaneously can violate PDBs; adjust automation accordingly.
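Before and during automated drains, it is worth checking how much headroom the budget actually leaves; ALLOWED DISRUPTIONS drops to zero when one more eviction would violate it:

```bash
kubectl get pdb core-pdb
kubectl describe pdb core-pdb
```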
5. Stateful Workloads: Storage Class and Volume Provisioning
Datastores, caches, and queues often undermine cluster durability if misconfigured.
- Use StorageClasses with volume replication/multi-AZ support (a PVC usage sketch follows this list):

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: multi-az-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
```
- AWS/Azure: Volumes can be provisioned anywhere in a multi-AZ cluster, but true cross-AZ mounts are typically not supported; use operator-level replication for apps like Cassandra or CockroachDB.
- Trade-off: Higher availability often comes at the cost of throughput (network-attached storage is slower than local SSD). Plan capacity accordingly.
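For reference, a PVC bound to the class above; with WaitForFirstConsumer, the volume is created in whichever AZ the scheduler places the consuming pod. The claim name and size are illustrative:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: orders-db-data          # illustrative name
spec:
  accessModes:
    - ReadWriteOnce             # EBS gp3 volumes are single-writer and live in one AZ
  storageClassName: multi-az-gp3
  resources:
    requests:
      storage: 100Gi
```

The volume itself still lives in a single AZ; surviving a zone outage for the data requires the application- or operator-level replication mentioned above.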
6. Backups: Assumptions Kill Recovery
Backups only matter if you test restore.
- Critical Data: For managed services, ensure cloud provider snapshots are scheduled (EBS, AKS managed disks). For self-hosted etcd, automate and test snapshots (a verification sketch follows this list):

```bash
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%F).db
```
- Cluster State: Use Velero or Stash for namespace/PV backup.
- Example: Nightly Velero schedule:

```bash
velero schedule create nightly --schedule "0 2 * * *"
```
- Note: Restores will not fix custom resource drift unless included in backup scope.
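Testing restore does not have to mean touching the live cluster. A minimal sketch; the snapshot filename is a placeholder, and on newer etcd releases the same subcommands are also available via etcdutl:

```bash
# Verify the snapshot is readable and sane (hash, revision, key count).
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-2024-06-01.db --write-out=table

# Rehearse a restore into a scratch data directory, away from the running members.
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-2024-06-01.db \
  --data-dir=/tmp/etcd-restore-test

# Confirm Velero backups are actually completing, not just scheduled.
velero backup get
```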
7. Monitoring, Alerting, and Automated Recovery
Discovery after the incident is common. Avoid it.
- Metrics: Prometheus scrapes nodes, kubelets, and external-dns. Grafana dashboards highlight resource pressure, pod restarts, etc.
- Alerting: Prometheus Alertmanager + on-call hooks (PagerDuty, Slack).
- Automation:
  - Kured for safe node reboots after kernel patching.
  - Watch for NodeNotReady events with `kubectl get events --field-selector involvedObject.kind=Node` (an example alert rule follows this list).
  - PDB enforcement: ensure no more than N pods are unavailable during disruptions.
  - Use the `kubectl deprecations` plug-in to identify APIs at risk during upgrades.
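As one concrete alerting example, a rule that fires when any node stays NotReady. This assumes kube-state-metrics is being scraped and that alerts are managed through the prometheus-operator PrometheusRule CRD; adapt it to plain Prometheus rule files if you are not running the operator:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-readiness
spec:
  groups:
    - name: node-health
      rules:
        - alert: NodeNotReady
          # kube-state-metrics exposes node conditions; 0 here means the Ready
          # condition is not "true" for that node.
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.node }} has been NotReady for 5 minutes"
```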
Final Notes
Cluster resilience is not a checklist; it’s a discipline. Decouple storage and compute, schedule pods with failure domains in mind, automate backups, then simulate actual outages—regularly. Not all workloads play nicely with every HA mechanism; some trade-offs (like performance hit for volume replication) are unavoidable.
If you haven’t run a node failure or AZ isolation drill, schedule one. The list above is baseline, not ceiling.
Practical next step: Re-examine node spread and pod anti-affinity rules. On a recent audit, a client’s "high-availability" cluster lost half its apps—AZ spread looked fine, but all pods landed on nodes sharing the same underlying hypervisor. The details matter.
What’s less obvious and often missed: Be cautious with autoscalers and PDBs—they can deadlock upgrades if tuned poorly. Always test change rollouts under simulated fault conditions.
Have an HA pitfall not covered here? Share it—failure data is more valuable than another dashboard.