Architecting a Highly Resilient Kubernetes Cluster: Key Practices
Outages happen, regardless of intent or planning. The difference between hours of downtime and seamless failover usually comes down to what’s in place before the failure. A high-resilience Kubernetes cluster is not an accident—it’s engineered.
1. Multi-AZ Deployment: Don’t Let a Single Availability Zone Take You Down
AZs fail; networks partition. Spanning nodes and storage across multiple Availability Zones is non-negotiable for software with uptime requirements.
Implementation:
- Spread worker nodes across at least three AZs. EKS, AKS, and GKE all support this at cluster bootstrap.
- Managed control planes generally do this, but always confirm with `eksctl get cluster --name your-cluster` or your provider's equivalent.
- Example (EKS):

```bash
eksctl create cluster --name prod-ha \
  --zones us-east-1a,us-east-1b,us-east-1c \
  --nodegroup-name infra-nodes \
  --node-type m6i.large --nodes 3
```
- Note: Spot fleets can inadvertently concentrate nodes in a single AZ under pressure. Monitoring is necessary.
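A quick way to catch that kind of drift is to check the zone label on every node. A minimal sketch, assuming the standard topology.kubernetes.io/zone label set by the cloud provider:

```bash
# Show each node alongside its AZ label; an even spread is what you want.
kubectl get nodes -L topology.kubernetes.io/zone

# Or count nodes per zone directly.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}' | sort | uniq -c
```

Wiring this into a periodic check catches spot-driven skew before an AZ outage does.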
2. High Availability Control Plane: Managed or Self-Hosted?
Control plane downtime is cluster downtime. For production, avoid single-master clusters.
- Managed Services: EKS and GKE provide a multi-AZ HA control plane by default; recent AKS (>1.23) does so in regions with availability zone support. Always verify the feature is enabled (e.g., for EKS, control plane endpoints should be distributed across zones).
- Self-Managed: Run at least three etcd nodes, preferably across physical fault domains (not just separate VMs); a health-check sketch follows this list.
  - Sample etcd cluster error on quorum loss: `etcdserver: failed to reach the peer deadline error="context deadline exceeded"`
  - Use taints/affinities to ensure control-plane nodes stay segregated from workloads.
- Caution: Clock skew between etcd nodes (>1s drift) leads to unpredictable failures. Monitor time sync with chrony or ntpd.
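To verify quorum from a control-plane node, etcdctl can report member health and leader status. A minimal sketch; the endpoints and certificate paths are illustrative (they match a typical kubeadm layout) and depend on how your cluster was bootstrapped:

```bash
# Report per-member status (leader, DB size, raft term) in a table.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.0.1.10:2379,https://10.0.2.10:2379,https://10.0.3.10:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --write-out=table

# "endpoint health" exits non-zero if any listed member is unreachable,
# which makes it easy to wire into a cron job or an alert.
```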
3. Worker Node Architecture: Expect and Embrace Failure
Instances terminate for patching, host failure, or spot market volatility.
a) Auto-Replacement & Scaling
- Use Managed Node Groups (EKS) or Auto Scaling groups (ASGs) with health checks.
- Set `maxUnavailable: 1` in the rolling update strategy for Deployments (a minimal manifest sketch follows this list).
- Test node loss: kill a node and observe eviction and pod rescheduling.
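For reference, here is what that strategy looks like in context. A minimal sketch; the app name and image are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api                  # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1              # never take more than one replica down during a rollout
      maxSurge: 1
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      containers:
        - name: payment-api
          image: registry.example.com/payment-api:1.0.0   # placeholder image
```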
b) Intelligent Pod Scheduling
- Anti-affinities spread replicas. Example enforcing both zone and host diversity (common error: overlapping labels, pods still colocate); see also the spread-constraint sketch after this list:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: payment-api
        topologyKey: "topology.kubernetes.io/zone"
      - labelSelector:
          matchLabels:
            app: payment-api
        topologyKey: "kubernetes.io/hostname"
```
- Some workloads (legacy stateful sets) aren’t designed for restarts; refactor or choose alternatives.
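Where hard anti-affinity is too strict for the replica count (e.g., more replicas than zones), topologySpreadConstraints can achieve similar diversity with a tunable skew. A minimal sketch, reusing the hypothetical payment-api labels:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule   # use ScheduleAnyway for a soft preference
    labelSelector:
      matchLabels:
        app: payment-api
```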
c) Node Upgrades: Patch Without Outage
- Use `kubectl cordon` followed by `kubectl drain` for sequential node patching, as in the sketch below.
- Known issue: some older CNI plugins (e.g., Calico <3.20) fail to tear down cleanly during rapid upgrades.
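A typical single-node patch cycle looks like this; the node name is a placeholder:

```bash
# Stop new pods from landing on the node, then evict running pods while
# honoring PodDisruptionBudgets.
kubectl cordon ip-10-0-1-23.ec2.internal
kubectl drain ip-10-0-1-23.ec2.internal \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=5m

# Patch or replace the node, then let it accept pods again.
kubectl uncordon ip-10-0-1-23.ec2.internal
```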
4. Intracluster Redundancy: Stopping at "3 Replicas" Isn't Enough
High availability is more than replica count.
- Replicas: Three+ replicas, but always combine with anti-affinity and Pod Disruption Budgets.
- Probes:

```yaml
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 3
  timeoutSeconds: 2
livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  failureThreshold: 2
  timeoutSeconds: 2
```
- Real-world: Mistuned probes can create cascading restarts; always sanity-check thresholds during failure simulation.
- PodDisruptionBudget example:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: core-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: core-service
```
Gotcha: Draining more than one node simultaneously can violate PDBs; adjust automation accordingly.
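Before and during automated drains, it is worth checking how much headroom the budget actually leaves; ALLOWED DISRUPTIONS drops to zero when one more eviction would violate it:

```bash
kubectl get pdb core-pdb
kubectl describe pdb core-pdb
```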
5. Stateful Workloads: Storage Class and Volume Provisioning
Datastores, caches, and queues often undermine cluster durability if misconfigured.
- Use StorageClasses with volume replication/multi-AZ support (a PVC usage sketch follows this list):

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: multi-az-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
```
- AWS/Azure: Volumes can be provisioned anywhere in a multi-AZ cluster, but true cross-AZ mounts are typically not supported; use operator-level replication for apps like Cassandra or CockroachDB.
- Trade-off: Higher availability often comes at the cost of throughput (network-attached storage is slower than local SSD). Plan capacity accordingly.
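For reference, a PVC bound to the class above; with WaitForFirstConsumer, the volume is created in whichever AZ the scheduler places the consuming pod. The claim name and size are illustrative:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: orders-db-data          # illustrative name
spec:
  accessModes:
    - ReadWriteOnce             # EBS gp3 volumes are single-writer and live in one AZ
  storageClassName: multi-az-gp3
  resources:
    requests:
      storage: 100Gi
```

The volume itself still lives in a single AZ; surviving a zone outage for the data requires the application- or operator-level replication mentioned above.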
6. Backups: Assumptions Kill Recovery
Backups only matter if you test restore.
- Critical Data: For managed services, ensure cloud provider snapshots are scheduled (EBS, AKS managed disks). For self-hosted etcd, automate and test snapshots (a verification sketch follows this list):

```bash
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%F).db
```
- Cluster State: Use Velero or Stash for namespace/PV backup.
- Example: Nightly Velero schedule:

```bash
velero schedule create nightly --schedule "0 2 * * *"
```
- Note: Restores will not fix custom resource drift unless included in backup scope.
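Testing restore does not have to mean touching the live cluster. A minimal sketch; the snapshot filename is a placeholder, and on newer etcd releases the same subcommands are also available via etcdutl:

```bash
# Verify the snapshot is readable and sane (hash, revision, key count).
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-2024-06-01.db --write-out=table

# Rehearse a restore into a scratch data directory, away from the running members.
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-2024-06-01.db \
  --data-dir=/tmp/etcd-restore-test

# Confirm Velero backups are actually completing, not just scheduled.
velero backup get
```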
7. Monitoring, Alerting, and Automated Recovery
Discovery after the incident is common. Avoid it.
- Metrics: Prometheus scrapes nodes, kubelets, and external-dns. Grafana dashboards highlight resource pressure, pod restarts, etc.
- Alerting: Prometheus Alertmanager + on-call hooks (PagerDuty, Slack).
- Automation:
  - Kured for safe node reboots after kernel patching.
  - Watch for NodeNotReady events with `kubectl get events --field-selector involvedObject.kind=Node` (an example alert rule follows this list).
  - PDB enforcement: ensure no more than N pods are unavailable during disruptions.
  - Use the `kubectl deprecations` plug-in to identify APIs at risk during upgrades.
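As one concrete alerting example, a rule that fires when any node stays NotReady. This assumes kube-state-metrics is being scraped and that alerts are managed through the prometheus-operator PrometheusRule CRD; adapt it to plain Prometheus rule files if you are not running the operator:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-readiness
spec:
  groups:
    - name: node-health
      rules:
        - alert: NodeNotReady
          # kube-state-metrics exposes node conditions; 0 here means the Ready
          # condition is not "true" for that node.
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.node }} has been NotReady for 5 minutes"
```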
Final Notes
Cluster resilience is not a checklist; it’s a discipline. Decouple storage and compute, schedule pods with failure domains in mind, automate backups, then simulate actual outages—regularly. Not all workloads play nicely with every HA mechanism; some trade-offs (like performance hit for volume replication) are unavoidable.
If you haven’t run a node failure or AZ isolation drill, schedule one. The list above is baseline, not ceiling.
Practical next step: Re-examine node spread and pod anti-affinity rules. On a recent audit, a client’s "high-availability" cluster lost half its apps—AZ spread looked fine, but all pods landed on nodes sharing the same underlying hypervisor. The details matter.
What’s less obvious and often missed: Be cautious with autoscalers and PDBs—they can deadlock upgrades if tuned poorly. Always test change rollouts under simulated fault conditions.
Have an HA pitfall not covered here? Share it—failure data is more valuable than another dashboard.