CloudWatch to Slack: Real-Time Notification Pipeline for Incident Response
Struggling with missed alerts during a critical incident because someone overlooked an email or didn’t have access to the AWS Console? For engineers running distributed systems and production SRE, notification latency translates directly into downtime risk and customer impact.
This guide cuts directly to a proven pattern: routing AWS CloudWatch alarms as actionable messages to a Slack channel—no more inbox hunting or dashboard refreshes.
Context and Prerequisites
- AWS Account with permissions for CloudWatch, SNS, Lambda, IAM (Administrator or fine-grained custom roles).
- Slack Workspace with owner/administrator ability to add apps.
- Python 3.9+ for Lambda runtime (alternatives exist, but most AWS SDK examples target Python).
- Pro Tip: Use a non-personal Slack channel (e.g., #ops-alerts) with persistent logging for audit trails.
Step 1: Provision a Slack Incoming Webhook
Slack exposes Incoming Webhooks to allow external systems to post structured messages.
- Navigate to https://api.slack.com/messaging/webhooks
- Create a new Slack App (e.g., cloudwatch-incident-feed)
- Activate Incoming Webhooks, then “Add New Webhook to Workspace”. Select your alert target channel.
- Save the generated Webhook URL and treat it as a secret.
Side note: The webhook URL is the only access control. Anyone who obtains it can post arbitrary messages to your Slack channel.
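Before wiring up AWS, it is worth a quick smoke test of the webhook itself. A minimal sketch in Python, assuming the URL has been exported locally as SLACK_WEBHOOK_URL (the same variable name the Lambda uses later):

import json
import os
import urllib.request

# Assumes: export SLACK_WEBHOOK_URL=https://hooks.slack.com/services/... (your generated URL)
webhook_url = os.environ["SLACK_WEBHOOK_URL"]

req = urllib.request.Request(
    webhook_url,
    data=json.dumps({"text": "Webhook smoke test for the CloudWatch pipeline"}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())  # Slack answers 200 with body "ok" on success

If the test message lands in #ops-alerts, the Slack side is done.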
Step 2: Configure CloudWatch Alarm
Most production teams already have a set of CloudWatch alarms on metrics like EC2 CPUUtilization, ELB 5xx rates, RDS Replica Lag, or custom application metrics.
Sample: High CPU on EC2 (us-east-1)
- CloudWatch → Alarms → Create Alarm
- Metric: EC2 > Per-Instance Metrics > CPUUtilization
- Condition: CPUUtilization > 85% for 3 datapoints within 10 minutes
- Action: Send notification to an SNS topic (e.g., arn:aws:sns:us-east-1:123456789012:OpsAlerts)
- If the SNS topic does not exist, create it now; this is the trigger bridge.
Known Issue: SNS has no native Slack delivery. An HTTPS subscription pointed at the webhook would post the raw SNS envelope, which Slack rejects, so a small Lambda is required as glue.
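If you script your alarms rather than clicking through the console, the same alarm can be created with boto3. A sketch under stated assumptions: the instance ID is a placeholder, and the “3 datapoints within 10 minutes” condition is modeled as 3 breaching datapoints out of ten 1-minute periods.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Hypothetical instance ID; the topic ARN matches the example above.
cloudwatch.put_metric_alarm(
    AlarmName="Prod-EC2-CPUHigh",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=60,                 # one-minute datapoints
    EvaluationPeriods=10,      # look at the last 10 minutes
    DatapointsToAlarm=3,       # alarm when 3 of those datapoints breach
    Threshold=85.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:OpsAlerts"],
    AlarmDescription="EC2 CPU above 85% - investigate load or scale out",
)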
Step 3: Lambda as Translational Bridge
CloudWatch → SNS → Lambda → Slack
Lambda Function (Python 3.9):
import json
import os
import urllib.request


def lambda_handler(event, context):
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]

    raw = None
    try:
        raw = event["Records"][0]["Sns"]["Message"]
        alarm = json.loads(raw)
    except Exception as e:
        # Some SNS payloads are plain strings or unexpected shapes, so keep a breadcrumb for debugging
        print(f"Malformed message: {e}\nRaw: {raw}")
        alarm = {
            "AlarmName": "Unknown",
            "NewStateValue": "UNKNOWN",
            "NewStateReason": raw or "",
        }

    # Red for ALARM, green for everything else (OK, INSUFFICIENT_DATA)
    color = "#FF0000" if alarm.get("NewStateValue") == "ALARM" else "#36a64f"

    payload = {
        "attachments": [
            {
                "color": color,
                "title": f"CloudWatch Alarm: {alarm.get('AlarmName', 'N/A')}",
                "fields": [
                    {"title": "State", "value": alarm.get("NewStateValue", ""), "short": True},
                    {"title": "Reason", "value": alarm.get("NewStateReason", alarm.get("AlarmDescription", ""))},
                ],
                "footer": "cloudwatch → slack",
            }
        ]
    }

    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req) as response:
            return {"statusCode": response.status, "body": response.read().decode("utf-8")}
    except Exception as e:
        # Re-raise so the invocation is marked as failed and can be retried or dead-lettered
        print(f"Slack post failed: {e}")
        raise
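Before deploying, the handler can be exercised locally with a stubbed SNS event. A sketch, assuming the code above is saved as lambda_function.py and SLACK_WEBHOOK_URL is set in your shell:

import json
from lambda_function import lambda_handler

# The shape mirrors what SNS delivers to Lambda: the CloudWatch alarm JSON is a string inside "Message"
fake_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps({
                    "AlarmName": "Prod-EC2-CPUHigh",
                    "NewStateValue": "ALARM",
                    "NewStateReason": "Threshold Crossed: 3 datapoints [89, 91, 88] > 85%",
                })
            }
        }
    ]
}

print(lambda_handler(fake_event, None))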
- Deploy in the Lambda console:
- Runtime: Python 3.9
- Handler: lambda_function.lambda_handler
- Set SLACK_WEBHOOK_URL in Lambda environment variables (never hardcode it in source)
- IAM Role: Attach the basic Lambda execution policy (AWSLambdaBasicExecutionRole)
- Memory: 128 MB; Timeout: 3 seconds is sufficient.
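If you manage function configuration from scripts, the same settings can be applied with boto3; the function name below is a hypothetical placeholder:

import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# Note: the Environment block replaces the function's full Variables map
lambda_client.update_function_configuration(
    FunctionName="cloudwatch-slack-notifier",   # hypothetical function name
    Timeout=3,
    MemorySize=128,
    Environment={"Variables": {"SLACK_WEBHOOK_URL": "https://hooks.slack.com/services/..."}},
)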
Known Issue: When CloudWatch sends a state other than "ALARM" (e.g., OK, INSUFFICIENT_DATA), your channel may get noise. Add logic to filter as needed; a minimal sketch follows.
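One way to implement the filter is a small guard that only lets ALARM transitions through; a sketch, to be called early in lambda_handler:

def should_notify(alarm: dict) -> bool:
    """Return True only for state transitions worth posting (here: ALARM)."""
    return alarm.get("NewStateValue") == "ALARM"

# Usage inside lambda_handler, right after the alarm payload is parsed:
#     if not should_notify(alarm):
#         print(f"Filtered state: {alarm.get('NewStateValue')}")
#         return {"statusCode": 200, "body": "filtered"}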
Step 4: SNS Subscription to Lambda
- SNS Console → Topics → select your alarm's notification topic.
- Actions → Create subscription
- Protocol: AWS Lambda
- Endpoint: your Lambda function ARN
- Confirm the Lambda function's resource policy allows invocation by SNS (the sns.amazonaws.com principal).
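The subscription and the invoke permission can also be created programmatically. A sketch using the example topic ARN from Step 2 and a hypothetical function ARN:

import boto3

sns = boto3.client("sns", region_name="us-east-1")
lambda_client = boto3.client("lambda", region_name="us-east-1")

topic_arn = "arn:aws:sns:us-east-1:123456789012:OpsAlerts"
function_arn = "arn:aws:lambda:us-east-1:123456789012:function:cloudwatch-slack-notifier"  # hypothetical

# Subscribe the function to the topic
sns.subscribe(TopicArn=topic_arn, Protocol="lambda", Endpoint=function_arn)

# Allow SNS to invoke the function (this is the resource policy mentioned above)
lambda_client.add_permission(
    FunctionName=function_arn,
    StatementId="allow-sns-opsalerts",
    Action="lambda:InvokeFunction",
    Principal="sns.amazonaws.com",
    SourceArn=topic_arn,
)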
Test: Use the SNS “Publish message” feature to send a manual notification before going live. Watch for formatting issues or missing color coding.
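For the manual test, publishing a CloudWatch-shaped JSON string to the topic exercises the real path end to end; the fields below mirror what the Lambda reads:

import json
import boto3

sns = boto3.client("sns", region_name="us-east-1")

sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:OpsAlerts",
    Subject="Test alarm",
    Message=json.dumps({
        "AlarmName": "Prod-EC2-CPUHigh",
        "NewStateValue": "ALARM",
        "NewStateReason": "Manual end-to-end test of the Slack pipeline",
    }),
)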
Practical Example: Real-Time CPU Alert
Triggered Example in Slack (as seen by SRE on-duty):
-------------------------------
CloudWatch Alarm: Prod-EC2-CPUHigh
State: ALARM
Reason: Threshold Crossed: 3 datapoints [89, 91, 88] > 85%
-------------------------------
With color highlighting in Slack and a direct @mention if configured, the SRE can respond or escalate immediately via /incident or a PagerDuty integration.
Optimization and Enhancements
- Severity Filtering: Suppress ‘OK’ states in Lambda if only incidents matter.
- Channel Routing: Use additional Lambda logic to route different alarm severities or sources to different Slack channels (see the sketch after this list).
- Delivery Failures: Cloud/Slack outages will silently drop messages; consider logging failed sends to CloudWatch Logs or enabling an SQS dead-letter queue on Lambda.
- Extended Workflow: Chain further with downstream integrations (PagerDuty, OpsGenie) by branching from SNS or Lambda, depending on your response policy.
- Alert Fatigue: Batch minor alerts or aggregate similar alarms in Lambda before posting, to prevent spamming the team.
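For the channel-routing item above, one approach is a prefix-to-webhook map resolved per alarm. A sketch, assuming each target channel has its own incoming webhook stored in hypothetical environment variables:

import os

# Hypothetical env vars: one webhook per channel, each provisioned as in Step 1
ROUTES = {
    "Prod-": os.environ.get("SLACK_WEBHOOK_PROD"),        # e.g. #ops-alerts
    "Staging-": os.environ.get("SLACK_WEBHOOK_STAGING"),  # e.g. #ops-staging
}

def resolve_webhook(alarm_name: str) -> str:
    """Pick a webhook by alarm-name prefix; fall back to the default channel."""
    for prefix, url in ROUTES.items():
        if url and alarm_name.startswith(prefix):
            return url
    return os.environ["SLACK_WEBHOOK_URL"]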
Observations
- This pipeline removes the manual steps and delays from incident notification: everything after the alarm triggers completes with sub-minute latency.
- Slack’s webhook-based integration is straightforward, but it lacks advanced routing and access control. For sensitive or regulated environments, alternatives such as bot tokens or encrypted transport may be required (a bot-token sketch follows these notes).
- Debugging tip: If your Lambda fails and you see no messages in Slack, always check CloudWatch Logs for stack traces. Most missed alerts are due to IAM role misconfigurations or invalid webhooks.
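For the bot-token alternative mentioned above, Slack’s chat.postMessage Web API replaces the webhook; unlike a webhook URL, the token can be scoped and rotated. A minimal sketch, assuming a bot token with the chat:write scope stored as SLACK_BOT_TOKEN:

import json
import os
import urllib.request

def post_via_bot(channel: str, text: str) -> dict:
    """Post with a bot token instead of an incoming webhook (chat:write scope required)."""
    req = urllib.request.Request(
        "https://slack.com/api/chat.postMessage",
        data=json.dumps({"channel": channel, "text": text}).encode("utf-8"),
        headers={
            "Content-Type": "application/json; charset=utf-8",
            "Authorization": f"Bearer {os.environ['SLACK_BOT_TOKEN']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())  # check the "ok" field in the response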
Terraform modules, detailed IAM policies, and production-grade enhancements are natural next steps for this integration, especially if you run IaC pipelines or need explicit permission boundaries.