AWS Lambda, RDS, EC2 Down? How to Diagnose Individual AWS Service Failures in 2026
The AWS Health Dashboard can show "All services operational" while specific services in your region experience intermittent failures. This guide covers granular diagnosis for the three most impactful AWS services: Lambda, RDS, and EC2.
AWS Status Pages — Which One to Check
AWS has multiple health status interfaces. Each serves a different purpose:
| Interface | URL | Best For |
|---|---|---|
| AWS Service Health Dashboard | health.aws.amazon.com/health/status | Broad service-level status, publicly visible |
| AWS Personal Health Dashboard | AWS Console → Health → Your Account Health | Issues affecting YOUR specific account/resources |
| AWS Health API | AWS Health API (requires Business/Enterprise support) | Programmatic alerts for your affected services |
| AWS Service Status RSS | status.aws.amazon.com/rss | Machine-readable status feed |
Important: The public dashboard often lags behind actual incidents by 15–30 minutes. If your service is failing but the dashboard shows green, check the Personal Health Dashboard first — AWS often notifies affected accounts before updating the public page.
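If your account has a Business or Enterprise support plan, you can pull the same account-scoped signal programmatically. A minimal sketch, assuming the AWS CLI v2 is installed and credentialed — the Health API is served from a global endpoint in us-east-1 regardless of where your workloads run:

```shell
# Query open AWS Health events for your account (requires a Business,
# Enterprise On-Ramp, or Enterprise support plan). The Health API's
# global endpoint lives in us-east-1.
# Guard + || true keep this sketch runnable without the CLI or credentials.
command -v aws >/dev/null 2>&1 || exit 0

aws health describe-events \
  --filter eventStatusCodes=open \
  --region us-east-1 \
  --query 'events[].{Service:service,Region:region,Status:statusCode,Start:startTime}' \
  || true
```

An empty result here while your service is failing points the investigation back at your own configuration rather than AWS infrastructure.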
AWS Lambda Diagnostics
Common Lambda Failure Patterns
| Symptom | Likely Cause | First Check |
|---|---|---|
| Task timed out (e.g., 3.00 seconds) | Cold start + initialization exceeds timeout, or downstream API slow | CloudWatch logs: INIT_DURATION line |
| ERR_CONNECTION_REFUSED from Lambda | VPC config: Lambda can't reach RDS/ElastiCache; wrong subnet or missing NAT | Check VPC config, security groups, route tables |
| 429 TooManyRequests | Account-level concurrent execution limit hit (default: 1,000 per region) | CloudWatch → Lambda → ConcurrentExecutions metric |
| 502/503 from API Gateway | Lambda error not caught → API Gateway timeout (29s max), or Lambda throttle | API Gateway execution logs, Lambda error rate |
| ENI creation failure | VPC Lambda exhausting ENIs (subnet /27 or smaller) | VPC → Network Interfaces, check subnet capacity |
| Runtime.ImportModuleError | Lambda layer missing, wrong architecture (x86 vs arm64), wrong runtime | Check layer ARNs, runtime version, architecture setting |
Lambda Diagnostic Commands (AWS CLI)
# Check function configuration
aws lambda get-function-configuration --function-name your-function-name
# Get last 5 minutes of invocation errors
# (GNU date syntax; on macOS/BSD use: date -v-5M +%s000)
aws logs filter-log-events \
--log-group-name /aws/lambda/your-function-name \
--start-time $(date -d '5 minutes ago' +%s000) \
--filter-pattern "ERROR"
# Check concurrent executions (CloudWatch metric)
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name ConcurrentExecutions \
--dimensions Name=FunctionName,Value=your-function-name \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 60 \
--statistics Maximum
# Check throttles
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Throttles \
--dimensions Name=FunctionName,Value=your-function-name \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 60 \
--statistics Sum
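A non-zero Throttles count only tells half the story — you also need the ceiling it's hitting. A quick sketch, assuming the same CLI setup as above:

```shell
# Compare the regional concurrency ceiling (default 1,000) against the
# portion left unreserved for on-demand invocations. If another function
# has large reserved concurrency, Unreserved can be much lower than Total.
# Guard + || true keep this sketch runnable without the CLI or credentials.
command -v aws >/dev/null 2>&1 || exit 0

aws lambda get-account-settings \
  --query 'AccountLimit.{Total:ConcurrentExecutions,Unreserved:UnreservedConcurrentExecutions}' \
  || true
```

If Unreserved is far below Total, a sibling function's reserved concurrency may be starving the one that's throttling.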
Lambda Cold Start Diagnosis
In CloudWatch Logs, filter for REPORT lines. A cold start shows Init Duration: X ms:
REPORT RequestId: abc123 Duration: 243.55 ms Billed Duration: 244 ms
Memory Size: 512 MB Max Memory Used: 89 MB Init Duration: 487.23 ms
If Init Duration is consuming most of your timeout budget, you have three options: increase the timeout, enable Provisioned Concurrency, or reduce initialization work (lazy imports, connection reuse).
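The Provisioned Concurrency route can be sketched with the CLI. Note it applies to a published version or alias (the `--qualifier`), never to `$LATEST`; the function and alias names below are placeholders:

```shell
# Pre-warm execution environments so invocations skip the init phase.
# Function name and alias ("live") are placeholders; the count is illustrative.
# Guard + || true keep this sketch runnable without the CLI or credentials.
command -v aws >/dev/null 2>&1 || exit 0

aws lambda put-provisioned-concurrency-config \
  --function-name your-function-name \
  --qualifier live \
  --provisioned-concurrent-executions 5 \
  || true

# Verify the config reached READY status before relying on it
aws lambda get-provisioned-concurrency-config \
  --function-name your-function-name \
  --qualifier live \
  || true
```

Provisioned Concurrency bills for the warm environments whether or not they're invoked, so size the count to your steady-state traffic, not your peak.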
Amazon RDS Diagnostics
Common RDS Failure Patterns
| Symptom | Likely Cause | First Check |
|---|---|---|
| Connection refused / ECONNREFUSED | Security group blocking access, RDS instance stopped, max_connections hit | Security groups → inbound rules on port 5432/3306; RDS console → Status |
| FATAL: sorry, too many clients | max_connections exhausted (common with Lambda at scale) | CloudWatch → DatabaseConnections; use RDS Proxy |
| SSL SYSCALL error: EOF detected | Failover in progress (Multi-AZ) or network interruption | RDS Events for failover events; implement retry logic |
| Read replica lag > threshold | High write load on primary, large transactions, binlog delay | CloudWatch → ReplicaLag metric |
| FreeStorageSpace = 0 | Disk full — instance will become read-only | CloudWatch → FreeStorageSpace; enable autoscaling storage |
| High CPUUtilization (>80%) | Missing index, N+1 queries, autovacuum conflict | Performance Insights → Top SQL statements |
RDS Diagnostic Commands (AWS CLI)
# Check RDS instance status
aws rds describe-db-instances --db-instance-identifier your-db-id \
--query 'DBInstances[0].{Status:DBInstanceStatus,Class:DBInstanceClass,AZ:AvailabilityZone,Endpoint:Endpoint.Address}'
# Check recent RDS events (last 1 hour)
aws rds describe-events \
--source-identifier your-db-id \
--source-type db-instance \
--duration 60
# Check CloudWatch metrics (CPU, connections, free storage)
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name DatabaseConnections \
--dimensions Name=DBInstanceIdentifier,Value=your-db-id \
--start-time $(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 60 \
--statistics Maximum
# List parameter groups (check max_connections setting)
aws rds describe-db-parameters \
--db-parameter-group-name your-parameter-group \
--query 'Parameters[?ParameterName==`max_connections`]'
RDS max_connections and Lambda
Lambda functions can spawn hundreds of concurrent instances, each opening its own database connection. On a small instance class like db.t3.micro, the default max_connections works out to under 100 (the exact value depends on engine and instance memory). This is almost always the cause of "too many clients" errors in Lambda + RDS architectures.
Fix: Use RDS Proxy (connection pooling) or a connection pool library (PgBouncer, pgx's built-in pool). RDS Proxy is especially effective for Lambda use cases — connections are pooled at the proxy level, not per-Lambda-instance.
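If an RDS Proxy is already in place, confirm the proxy itself is healthy before blaming the database. A sketch, with the proxy name as a placeholder:

```shell
# Check proxy availability, then the health of its database targets.
# "your-proxy-name" is a placeholder.
# Guard + || true keep this sketch runnable without the CLI or credentials.
command -v aws >/dev/null 2>&1 || exit 0

aws rds describe-db-proxies \
  --query 'DBProxies[].{Name:DBProxyName,Status:Status,Endpoint:Endpoint}' \
  || true

aws rds describe-db-proxy-targets \
  --db-proxy-name your-proxy-name \
  --query 'Targets[].{Target:RdsResourceId,State:TargetHealth.State}' \
  || true
```

A proxy in `available` status with an `UNAVAILABLE` target usually means the proxy can't authenticate to the database (Secrets Manager permissions) or a security group is blocking the proxy-to-database path.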
Amazon EC2 Diagnostics
Common EC2 Failure Patterns
| Symptom | Likely Cause | First Check |
|---|---|---|
| Instance unreachable (SSH timeout) | Security group rule missing, VPC routing issue, instance stopped/terminated | EC2 console → Instance State; security groups inbound 22/443 |
| Instance status check failed (1/2) | OS-level issue: disk full, OOM, kernel panic, filesystem error | EC2 console → Monitoring tab → Status Checks |
| System status check failed (2/2) | AWS infrastructure issue: host hardware failure, power/network issue | Check AWS Health Dashboard for your AZ; stop/start instance (migrates to new host) |
| EBS volume unresponsive | EBS service degradation (AZ-specific), I/O credit exhaustion (gp2), volume offline | CloudWatch → VolumeQueueLength spike; check EBS status in AWS console |
| Instance terminated unexpectedly | Spot instance interruption, Auto Scaling scale-in, account billing issue | CloudTrail → TerminateInstances events |
| ELB target health: unhealthy | Health check path returning non-200, security group blocking ELB | EC2 → Target Groups → Health status; check security group allows ELB CIDR |
EC2 Diagnostic Commands (AWS CLI)
# Check instance status and status checks
aws ec2 describe-instance-status \
--instance-ids i-0123456789abcdef0 \
--query 'InstanceStatuses[0].{State:InstanceState.Name,System:SystemStatus.Status,Instance:InstanceStatus.Status}'
# Get system log (last output before connectivity loss)
aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text
# Check recent CloudTrail events for instance
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=ResourceName,AttributeValue=i-0123456789abcdef0 \
--start-time $(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%SZ)
# Describe target group health
aws elbv2 describe-target-health \
--target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/name/id
System Status Check Failed — What To Do
A System Status Check failure (2/2) means AWS infrastructure is having problems with the underlying hardware. This is AWS's problem, not yours. Options:
- Stop and start the instance (not reboot) — a stop/start migrates the instance to a healthy host, while a reboot leaves it on the same hardware. Elastic IPs stay attached, but data on any instance store volumes is lost. Note: instance-store-backed instances cannot be stopped at all, only terminated.
- Check AWS Personal Health Dashboard — AWS may have already flagged the AZ/host issue and scheduled a maintenance event.
- Scheduled Retirement — AWS occasionally schedules retirement of instances on degraded hosts. Check the console for retirement notifications.
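The stop/start migration can be scripted end-to-end with the CLI's built-in waiters. A sketch, with the instance ID as a placeholder; this applies only to EBS-backed instances:

```shell
# Migrate off a degraded host: stop, wait, start, then wait for status
# checks to pass on the new hardware. Instance ID is a placeholder.
# Guard + || true keep this sketch runnable without the CLI or credentials.
command -v aws >/dev/null 2>&1 || exit 0

INSTANCE_ID="i-0123456789abcdef0"

aws ec2 stop-instances --instance-ids "$INSTANCE_ID" || true
aws ec2 wait instance-stopped --instance-ids "$INSTANCE_ID" || true
aws ec2 start-instances --instance-ids "$INSTANCE_ID" || true

# Blocks until both system and instance status checks pass
aws ec2 wait instance-status-ok --instance-ids "$INSTANCE_ID" || true
```

Expect the public IP to change on start unless an Elastic IP is attached — update any DNS records or allowlists that reference it.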
AWS Availability Zone Failures — Pattern Recognition
AWS incidents are often AZ-specific, not regional. Signs that an AZ is degraded rather than a whole region:
- Some instances/services in the same region are fine, others fail
- The issue correlates with a specific AZ suffix (e.g., us-east-1c vs us-east-1a)
- AWS Health shows "us-east-1c" in the affected scope
- Some RDS replicas fail but primary (in different AZ) is fine
Mitigation: Multi-AZ deployments for RDS, Auto Scaling groups spread across 3+ AZs, ELB with health checks to automatically route around unhealthy AZ instances.
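During an AZ-specific incident, the first question is how exposed you actually are. A quick sketch to count running instances per AZ, assuming the same CLI setup as the commands above:

```shell
# Count running instances per AZ — shows whether the fleet is spread
# across zones or concentrated in one.
# Guard + || true keep this sketch runnable without the CLI or credentials.
command -v aws >/dev/null 2>&1 || exit 0

aws ec2 describe-instances \
  --filters Name=instance-state-name,Values=running \
  --query 'Reservations[].Instances[].Placement.AvailabilityZone' \
  --output text | tr '\t' '\n' | sort | uniq -c | sort -rn \
  || true
```

If one zone dominates the count and matches the degraded AZ suffix, shifting Auto Scaling capacity or failing over RDS to another AZ is the fastest mitigation.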
Notable AWS Individual Service Incidents — Q1 2026
- AWS Lambda — us-east-1 (Feb 2026): Lambda cold start latency increased 3–5x for approximately 2 hours. Functions that worked within their timeout limit began timing out. AWS attributed it to internal capacity management changes. Workaround: Provisioned Concurrency on critical functions.
- Amazon RDS (Aurora) — ap-southeast-1 (Jan 2026): Aurora Serverless v2 auto-scaling delays caused connection pool exhaustion on rapidly scaling workloads. Instances were healthy; capacity wasn't scaling fast enough to match demand spike.
- EBS — us-west-2a (Mar 2026): Elevated error rates for EBS volumes in us-west-2a. EC2 instances with EBS root volumes in the affected AZ saw I/O stalls. Instances using instance store were unaffected.
Monitoring AWS Services with Ezmon
AWS's own status page tracks service availability, but your monitoring should answer a more specific question: is your application working for your users?
The layers:
- AWS status page: "Is the service having problems globally/regionally?"
- Personal Health Dashboard: "Is the service having problems for my specific account/resources?"
- CloudWatch metrics: "What's happening inside my resources right now?"
- External monitoring (Ezmon): "Can a user in Tokyo actually hit my endpoint and get a response in under 2 seconds?"
Ezmon monitors your actual application endpoints from 15+ global locations. This catches the gap between "AWS is healthy" and "your users can't reach you" — which is where most real outages live.
Monitor your AWS-hosted services from outside AWS →
Related Guides
- Is AWS Down? Platform-Level Status Checker
- Kubernetes Cluster Down? K8s Triage Guide
- Is Cloudflare Down? (Including Workers + CDN)
- Is GitHub Down? (Including Actions, Codespaces)
- Monitoring Best Practices 2026
AWS service status sourced from AWS Service Health Dashboard. All times UTC. For account-specific issues, always check the AWS Personal Health Dashboard in your AWS Console.