Mastering Secure AWS Site-to-Site VPN Connections: Beyond the Basics
Default AWS VPN configurations often leave gaps—encryption defaults aren't always strong enough, failover can be unreliable, and network segmentation may be lacking. In enterprise and regulated environments, these weaknesses can turn into an attack vector or cause outages during maintenance or failure events.
Typical Architecture: Where Problems Emerge
Start with the common scenario—an organization linking its on-premises datacenter (e.g., Cisco ASA, Juniper SRX, Palo Alto) to an AWS VPC via a site-to-site IPsec VPN. AWS provisions dual VPN tunnels by default:
[On-Prem Firewall] <-----IPsec-----> [AWS VGW]
              (2 Tunnels)
   Dead Peer Detection + (Optional) BGP
Why dual tunnels? If one drops, traffic must shift immediately—outages here can trigger major incidents. But out of the box, failover may not behave predictably across devices. Vendor differences (ASA, SRX, FortiGate) often trip up teams.
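The expectation above can be sketched as simple selection logic. A minimal Python illustration (the function and tunnel labels are ours, not an AWS API—real devices make this decision via DPD state and routing):

```python
# Illustrative sketch of dual-tunnel selection (not an AWS API).
# In practice DPD marks a tunnel dead and routing shifts traffic.

def select_tunnel(tunnel1_up: bool, tunnel2_up: bool) -> str:
    """Prefer tunnel 1; fail over to tunnel 2; report a full outage otherwise."""
    if tunnel1_up:
        return "tunnel1"
    if tunnel2_up:
        return "tunnel2"
    return "DOWN"  # both tunnels dead: this is the incident case

assert select_tunnel(True, True) == "tunnel1"    # normal operation
assert select_tunnel(False, True) == "tunnel2"   # failover path
assert select_tunnel(False, False) == "DOWN"     # total outage
```

The point of the exercise: failover is only "immediate" if dead-peer detection actually fires quickly on both ends, which is exactly where vendor defaults diverge.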
Hardening IPsec Tunnels
Algorithm Selection: No Place for Legacy
Neglecting custom tunnel parameters leaves you with AES-128 and SHA-1. SHA-1 is deprecated (see NIST SP 800-131A), and some auditors will flag its presence.
Recommended configuration:
- Encryption: AES-256
- Integrity: SHA-256 (or SHA-384, supported since ASA 9.8 for IKEv2)
- PFS: Enabled, Group 14 (2048-bit MODP) or higher
Example for Cisco ASA running 9.12+ (IKEv2):
crypto ikev2 policy 10
encryption aes-256
integrity sha256
group 14
prf sha256
lifetime seconds 28800
Note: AWS site-to-site VPN supports IKEv2, but confirm the on-prem firmware revision. Use IKEv1 only for legacy kit; it can introduce instability.
Redundancy and Automated Failover
Both tunnels are provisioned, but symmetric routing isn't a given—misconfigurations cause silent packet blackholing. Aggressive DPD (dpd interval 10 on JunOS) is advised; AWS declares a tunnel DOWN after three missed DPD probes. Some on-prem appliances default to 30 seconds or longer, which delays failover.
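The failover delay is roughly the DPD interval times the retry count. A quick sketch of that arithmetic (the three-probe figure comes from the text above; the 30-second interval is the slow on-prem default it mentions):

```python
def worst_case_detection_s(dpd_interval_s: int, retries: int = 3) -> int:
    """Seconds before a dead peer is declared: every probe must time out."""
    return dpd_interval_s * retries

# Aggressive setting: 10s interval, 3 missed probes -> ~30s blackhole window.
assert worst_case_detection_s(10) == 30
# Slow on-prem default: 30s interval triples the window to ~90s.
assert worst_case_detection_s(30) == 90
```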
When possible, enable dynamic routing (BGP) instead of static routes. Here's a stripped-down example of an on-prem BGP config peering with AWS (replace the tunnel inside IPs and ASNs with your own):
router bgp 65001
neighbor 169.254.44.121 remote-as 7224
neighbor 169.254.44.121 timers 10 30
network 10.10.0.0 mask 255.255.0.0
Observed issue: ASA versions before 9.10 may drop the BGP session if rekey intervals are misaligned—set identical lifetimes on both ends.
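A trivial preflight check against this rekey mismatch: compare the configured lifetimes on both ends before deploying. This is a local sanity-check sketch of ours, not an AWS or Cisco tool:

```python
def lifetimes_aligned(on_prem_s: int, aws_s: int) -> bool:
    """True when both ends rekey on the same schedule."""
    return on_prem_s == aws_s

# The ASA example above uses 28800s; AWS's IKE phase-1 default is also 28800s.
assert lifetimes_aligned(28800, 28800)
# A mismatched on-prem value (e.g., a 24h lifetime) is the risky case.
assert not lifetimes_aligned(86400, 28800)
```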
Segmentation and Network Restrictions
A VPN’s job isn’t just “connect everything.” Lateral movement is the real risk post-compromise. Use AWS Security Groups and NACLs to enforce “least privilege” on both inbound and outbound rules.
Example: Only allow specific subnets to access database hosts via VPN:
{
"SecurityGroupRule": {
"Protocol": "tcp",
"Port": 5432,
"Source": "10.0.0.0/24"
}
}
Implement subnet-level ACLs, not just at the VPN or firewall. Avoid blanket 0.0.0.0/0 policies—these show up in many audits.
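In practice the JSON above maps onto the EC2 `IpPermissions` structure used by `authorize_security_group_ingress`. A small builder keeps rules narrow by construction—the helper name is ours; the dict shape follows the boto3/EC2 API:

```python
def db_ingress_rule(cidr: str, port: int = 5432) -> dict:
    """Build a least-privilege ingress rule in EC2 IpPermissions form."""
    if cidr.endswith("/0"):
        # Refuse the blanket policies that audits keep flagging.
        raise ValueError("refusing blanket /0 rule")
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr, "Description": "VPN subnet -> Postgres"}],
    }

rule = db_ingress_rule("10.0.0.0/24")
# Usage (assuming an ec2 boto3 client):
#   ec2.authorize_security_group_ingress(GroupId="sg-...", IpPermissions=[rule])
```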
Monitoring the Only Way That Matters: Before Failure
CloudWatch and CloudTrail gather data on tunnel state and API usage, but a less obvious pitfall: the TunnelState metric can drift out of sync if the firewall restarts or the AWS VPN service is restarted on the backend.
Push CloudWatch alarms to your on-prem SIEM or alert manager. Synthetic tests (e.g., periodic pings from both directions with "Do Not Fragment" set) catch MTU and blackhole errors not visible via tunnel metrics alone.
Sample alarm trigger (CloudWatch CLI):
aws cloudwatch put-metric-alarm \
--alarm-name "AWS-VPN-Tunnel1-Down" \
--namespace "AWS/VPN" \
--metric-name TunnelState \
--dimensions Name=VpnId,Value=vpn-xxxxxxx Name=TunnelIpAddress,Value=xx.xx.xx.xx \
--statistic Average --period 60 --threshold 0 --comparison-operator LessThanOrEqualToThreshold \
--evaluation-periods 1 --alarm-actions arn:aws:sns:...
MTU Tweaks: Reducing Blackholes
IPsec adds 60-80 bytes of header overhead. Leaving MTU at 1500 can silently drop packets or cause performance issues due to fragmentation.
Set relevant interfaces to MTU 1420–1436.
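The 1420–1436 range follows directly from subtracting IPsec overhead from a 1500-byte path. A sketch of the arithmetic (overhead figures from the text; the MSS rule of clamping 40 bytes below the IPv4 MTU is standard TCP/IP accounting):

```python
def tunnel_mtu(path_mtu: int = 1500, ipsec_overhead: int = 80) -> int:
    """Interface MTU assuming worst-case IPsec overhead from the path MTU."""
    return path_mtu - ipsec_overhead

def tcp_mss(mtu: int) -> int:
    """IPv4: subtract 20B IP header + 20B TCP header from the interface MTU."""
    return mtu - 40

assert tunnel_mtu() == 1420                      # low end of the range above
assert tunnel_mtu(ipsec_overhead=64) == 1436     # high end, lighter overhead
assert tcp_mss(1420) == 1380                     # a sane MSS clamp to pair with it
```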
Test with:
ping -M do -s 1400 <remote_ip>
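Note that ping's -s value is the ICMP payload, not the packet size—on IPv4, 28 bytes of IP and ICMP headers ride on top. A quick check of what -s 1400 actually puts on the wire:

```python
ICMP_IP_HEADERS = 28  # 20B IPv4 header + 8B ICMP echo header

def wire_size(payload: int) -> int:
    """Total on-wire packet size for `ping -s <payload>` with DF set."""
    return payload + ICMP_IP_HEADERS

def max_payload(mtu: int) -> int:
    """Largest -s value that still fits a given tunnel MTU."""
    return mtu - ICMP_IP_HEADERS

assert wire_size(1400) == 1428      # fits under a 1436 MTU, safely tests the path
assert max_payload(1420) == 1392    # the -s ceiling for a 1420-byte tunnel MTU
```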
Note: Some network gear doesn't handle non-standard MTUs well; check logs for errors like:
"IPSEC: Received fragment larger than MTU, dropped"
Document MTU settings in both AWS and on-prem configuration management.
Trade-offs and Non-Obvious Problems
- BGP: Auto-failover is cleaner, but misadvertised routes during tunnel flap can trigger asymmetric routing loops. Filter routes carefully.
- Certificate-based authentication is only supported on AWS via third-party appliances—pre-shared keys are standard, but rotate them regularly (the AWS console does not notify you about expiring keys).
- Long-lived tunnels can get "stuck" after changes—on some ASA versions, a tunnel teardown (clear crypto ikev2 sa peer <peer_ip>) is required after updating parameters.
Side-by-Side: Advanced AWS Site-to-Site VPN Reference
| Aspect | Recommendation/Setting | Issues to Watch For |
|---|---|---|
| Encryption | AES-256, SHA-256, PFS (DH Group 14+) | SHA-1 in legacy configs |
| Authentication | Strong PSK, rotate at least annually | Console doesn't prompt on expiry |
| Redundancy | 2 tunnels, BGP with aggressive DPD | ASA: DPD timing mismatches |
| MTU | 1420 (test per deployment) | Silent packet drops |
| Monitoring | CloudWatch, SIEM integration, synthetic tests | TunnelState drift on restarts |
| Segmentation | SGs/NACLs per subnet | Defaults allow overly broad access |
As hybrid cloud becomes standard, advanced AWS Site-to-Site VPN design isn’t an option—it’s table stakes. Layering robust encryption, resilient routing, and continuous monitoring makes the difference between a functional link and an enterprise-grade transport.
For specific interoperability challenges (e.g., Juniper vs. Cisco IPsec proposal mismatches), reference vendor field notes; rarely do two environments look exactly alike. Not perfect, but it’s what keeps the lights on.
Questions on vendor-specific negotiation quirks, trade-offs in BGP filtering strategy, or deep-dive debug flows? Bring logs, not just error codes.