Blue-Green Deployments on AWS EKS: Reliable Zero-Downtime Release Pattern
Production rollouts routinely put availability at risk. In user-facing SaaS environments, even sub-minute interruptions during a deployment are unacceptable. The Kubernetes rolling update strategy (Deployment.spec.strategy.type: RollingUpdate) sometimes falters: liveness probe issues, delayed startup, or cache invalidation can all cause user-facing hiccups. In regulated industries or fintech, this is a non-starter.
Blue-green deployments eliminate the problem by decoupling release from in-place modification. With AWS EKS as the foundation, this pattern is both clean and infrastructure-agnostic. Here’s a practical implementation, including caveats the official docs rarely mention.
Blue-Green Model (Kubernetes, EKS)
Two parallel environments are mandatory:
- Blue: Current production version (“known good”).
- Green: Candidate version (new release).
Routing is handled exclusively at the Service (or Ingress) layer. There is no in-place pod replacement, and cutover is controlled atomically, which is critical for distributed or stateful workloads.
Visual:

        [ Service ]
             |
  -----------+-----------
  |                     |
[Blue pods]        [Green pods]
(v1, prod)         (v2, candidate)
Switching traffic is as simple as updating a label selector.
Prerequisites
- AWS account with EKS 1.27+ cluster (EKS control plane and worker nodes operational)
- kubectl v1.27+ configured (aws eks update-kubeconfig --name <cluster>)
- IAM permissions for deployments, services, and optional Route53 changes
- Familiarity with Kubernetes Deployment/Service YAML
Note: If using AWS-provisioned load balancers, account for Service-level delays caused by subnet or security group misconfiguration. Initial load balancer provisioning can take up to 3 minutes.
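A quick pre-flight check, assuming a cluster named my-cluster in us-west-2 (substitute your own):

aws eks update-kubeconfig --name my-cluster --region us-west-2
# All nodes should report Ready before proceeding
kubectl get nodes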
Step 1: Dual Deployments
Suppose the current release is v1. To stage a v2 upgrade:
# myapp-blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      track: blue
  template:
    metadata:
      labels:
        app: myapp
        track: blue
    spec:
      containers:
        - name: myapp
          image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/myapp:v1
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
          ports:
            - containerPort: 8080
# myapp-green-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      track: green
  template:
    metadata:
      labels:
        app: myapp
        track: green
    spec:
      containers:
        - name: myapp
          image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/myapp:v2
          env:
            - name: FEATURE_FLAG
              value: "true"
          ports:
            - containerPort: 8080
kubectl apply -f myapp-blue-deployment.yaml
kubectl apply -f myapp-green-deployment.yaml
The Deployments now exist side by side, each carrying a distinct track label (track: blue or track: green).
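To confirm both tracks are up and carry the expected labels (-L adds a column per label key):

kubectl get pods -l app=myapp -L track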
Step 2: Service Routing (Atomic Switch)
A single Service object routes traffic:
apiVersion: v1
kind: Service
metadata:
  name: myapp-svc
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 8080
  selector:
    app: myapp
    track: blue
Initial routing is to blue.
Wait for the ELB to provision (watch with kubectl get svc myapp-svc -w). Expect the external load balancer to occasionally hang or return 503s until its health checks stabilize.
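While waiting, it is worth confirming that only blue pods sit behind the Service:

# Endpoint list should contain only the blue pods' addresses at this point
kubectl get endpoints myapp-svc -o wide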
Testing green:
Port-forward to the green pods for validation:
kubectl port-forward deployment/myapp-green 9000:8080
or expose a temporary Service if ingress testing is required.
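A quick smoke test through the port-forward (the /healthz path is an assumption; substitute your app's health endpoint):

curl -i http://localhost:9000/healthz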
Step 3: Cutover
Switch all traffic with one patch:
kubectl patch svc myapp-svc -p '{"spec": {"selector": {"app": "myapp", "track": "green"}}}'
Effect is nearly instantaneous; all requests to the load balancer now route to v2. Observe via pod logs, e.g.:
kubectl logs deployment/myapp-green -c myapp | grep "Connection from"
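To double-check that the switch landed, read the live selector back from the Service:

# Expect the selector to show track: green
kubectl get svc myapp-svc -o jsonpath='{.spec.selector}'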
Note: Some ALBs cache target health for 20–30 seconds. In high-traffic scenarios, a brief mixing of old/new may occur as connections drain naturally. No perfect solution; for ultra-low-latency cutover, move traffic at the DNS or ALB Target Group level instead.
Step 4: Validate and Roll Back (If Necessary)
Monitor SLOs. If the green deployment degrades, revert routing as before:
kubectl patch svc myapp-svc -p '{"spec": {"selector": {"app": "myapp", "track": "blue"}}}'
Pods never die; recovery is sub-second.
Cleanup:
kubectl delete deployment myapp-blue
That said, keep blue running for a “hot standby” period first, especially after major schema changes.
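If outright deletion feels premature, one middle ground is to keep blue warm at reduced capacity during the standby window:

# Hot standby: shrink blue rather than removing it
kubectl scale deployment myapp-blue --replicas=1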
Advanced: DNS-Based Switchover
If latency requirements or third-party dependencies dictate, consider moving the cutover to Route53:
- Assign distinct ELBs to myapp-blue and myapp-green Service objects.
- Use weighted or failover Route53 records for traffic management.
- TTLs < 60s work, but actual client DNS caching may introduce lag.
This approach increases infra overhead, but decouples routing entirely from Kubernetes service logic.
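As a sketch of the weighted variant (hosted zone ID, record name, and ELB hostname are placeholders), shifting 10% of traffic to green could look like:

aws route53 change-resource-record-sets \
  --hosted-zone-id Z123EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "CNAME",
        "SetIdentifier": "green",
        "Weight": 10,
        "TTL": 60,
        "ResourceRecords": [{"Value": "green-elb.us-west-2.elb.amazonaws.com"}]
      }
    }]
  }'

A matching record with SetIdentifier "blue" and Weight 90 completes the split; repeat the UPSERT with new weights to rebalance.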
Automation and CI/CD
Automate the process in CodePipeline, Argo CD, or Jenkins:
- Template manifests via Helm; include deterministic selectors (track: blue|green).
- Run automated canary or smoke tests on the green environment before cutover.
- Integrate manual or automated approval stages before switching Service selectors.
- Always log the cutover event and the result (auditability matters for regulated pipelines).
Non-obvious tip: Set up Slack or SNS notifications triggered by cutover events, especially in high-frequency deploy shops.
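As one minimal sketch of such a stage (the script name, /healthz path, and log destination are assumptions to adapt):

#!/usr/bin/env bash
# cutover.sh: smoke-test the target track, then flip the Service selector atomically.
set -euo pipefail

TARGET=${1:?usage: cutover.sh blue|green}

# Throwaway port-forward for the smoke test (assumes an HTTP /healthz endpoint)
kubectl port-forward "deployment/myapp-${TARGET}" 9000:8080 &
PF_PID=$!
trap 'kill "${PF_PID}" 2>/dev/null || true' EXIT
sleep 3
curl -fsS "http://localhost:9000/healthz"

# Atomic cutover, then record the event for the audit trail
kubectl patch svc myapp-svc \
  -p "{\"spec\":{\"selector\":{\"app\":\"myapp\",\"track\":\"${TARGET}\"}}}"
echo "$(date -u +%FT%TZ) cutover to ${TARGET}" >> cutover.log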
Known Pitfalls
- Watch for lingering zombie pods when scaling down; ensure readiness/liveness probes are strictly enforced on both blue and green (see the probe sketch after this list).
- Cloud load balancers have propagation lag—don’t attempt to guarantee a hard 0s switch unless using direct client-side feature flags or DNS SRV records.
- If using stateful or session-aware applications, session stickiness (especially in AWS Classic ELB) can send returning users to old pods. ALB with cookie-based stickiness can mitigate.
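The Step 1 manifests omit probes for brevity. A minimal readiness probe sketch to add under each container spec, assuming the app serves an HTTP health endpoint at /healthz:

readinessProbe:
  httpGet:
    path: /healthz   # assumption: adjust to your app's health endpoint
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5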
Summary table:
| Step | Command/Config | Outcome |
|---|---|---|
| Blue up | kubectl apply -f myapp-blue-deployment.yaml | v1 pods running |
| Green up | kubectl apply -f myapp-green-deployment.yaml | v2 pods running (idle) |
| Route | Service selector track: blue | All traffic to blue |
| Test | Port-forward or temp Service to green | Green validated, not public |
| Switch | kubectl patch svc ... "track": "green" | Traffic moves to green |
| Rollback | kubectl patch svc ... "track": "blue" | Instantly revert to blue |
| Cleanup | kubectl delete deployment myapp-blue | Only green active |
Note: No deployment strategy is truly “one size fits all.” Evaluate rollout mechanics versus team supportability and operational complexity. Blue-green on EKS is robust—but not magic.
Got a stateful workload or critical 24/7 API? Consider layering feature flags, or hybrid blue-green/canary with traffic splitting (e.g. via Istio's VirtualService) for even finer granularity.
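For reference, a minimal sketch of such a split with Istio (assumes Istio is installed with sidecar injection enabled; names mirror the examples above):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp
spec:
  host: myapp-svc
  subsets:
    - name: blue
      labels:
        track: blue
    - name: green
      labels:
        track: green
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
    - myapp-svc
  http:
    - route:
        - destination:
            host: myapp-svc
            subset: blue
          weight: 90
        - destination:
            host: myapp-svc
            subset: green
          weight: 10

Note that for subset routing the Service selector should drop the track key so both Deployments back the same Service; Istio then chooses the subset per request.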