AWS Lambda, RDS, EC2 Down? How to Diagnose Individual AWS Service Failures in 2026
The AWS Health Dashboard can show "All services operational" while specific services in your region experience intermittent failures. This guide covers granular diagnosis for the three most impactful AWS services: Lambda, RDS, and EC2.
AWS Status Pages — Which One to Check
AWS has multiple health status interfaces. Each serves a different purpose:
| Interface | URL | Best For |
|---|---|---|
| AWS Service Health Dashboard | health.aws.amazon.com/health/status | Broad service-level status, publicly visible |
| AWS Personal Health Dashboard | AWS Console → Health → Your Account Health | Issues affecting YOUR specific account/resources |
| AWS Health API | AWS Health API (requires Business/Enterprise support) | Programmatic alerts for your affected services |
| AWS Service Status RSS | status.aws.amazon.com/rss | Machine-readable status feed |
Important: The public dashboard often lags behind actual incidents by 15–30 minutes. If your service is failing but the dashboard shows green, check the Personal Health Dashboard first — AWS often notifies affected accounts before updating the public page.
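If your account has a Business or Enterprise support plan, you can pull the same account-scoped signal programmatically. A minimal sketch, assuming the AWS CLI v2 is installed and credentialed — the Health API is served from a global endpoint in us-east-1 regardless of where your workloads run:

```shell
# Query open AWS Health events for your account (requires a Business,
# Enterprise On-Ramp, or Enterprise support plan). The Health API's
# global endpoint lives in us-east-1.
# Guard + || true keep this sketch runnable without the CLI or credentials.
command -v aws >/dev/null 2>&1 || exit 0

aws health describe-events \
  --filter eventStatusCodes=open \
  --region us-east-1 \
  --query 'events[].{Service:service,Region:region,Status:statusCode,Start:startTime}' \
  || true
```

An empty result here while your service is failing points the investigation back at your own configuration rather than AWS infrastructure.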
AWS Lambda Diagnostics
Common Lambda Failure Patterns
| Symptom | Likely Cause | First Check |
|---|---|---|
| Task timed out (e.g., 3.00 seconds) | Cold start + initialization exceeds timeout, or downstream API slow | CloudWatch logs: INIT_DURATION line |
| ERR_CONNECTION_REFUSED from Lambda | VPC config: Lambda can't reach RDS/ElastiCache; wrong subnet or missing NAT | Check VPC config, security groups, route tables |
| 429 TooManyRequests | Account-level concurrent execution limit hit (default: 1,000 per region) | CloudWatch → Lambda → ConcurrentExecutions metric |
| 502/503 from API Gateway | Lambda error not caught → API Gateway timeout (29s max), or Lambda throttle | API Gateway execution logs, Lambda error rate |
| ENI creation failure | VPC Lambda exhausting ENIs (subnet /27 or smaller) | VPC → Network Interfaces, check subnet capacity |
| Runtime.ImportModuleError | Lambda layer missing, wrong architecture (x86 vs arm64), wrong runtime | Check layer ARNs, runtime version, architecture setting |
Lambda Diagnostic Commands (AWS CLI)
# Check function configuration
aws lambda get-function-configuration --function-name your-function-name
# Get last 5 minutes of invocation errors
# (GNU date syntax; on macOS/BSD use: date -v-5M +%s000)
aws logs filter-log-events \
--log-group-name /aws/lambda/your-function-name \
--start-time $(date -d '5 minutes ago' +%s000) \
--filter-pattern "ERROR"
# Check concurrent executions (CloudWatch metric)
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name ConcurrentExecutions \
--dimensions Name=FunctionName,Value=your-function-name \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 60 \
--statistics Maximum
# Check throttles
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Throttles \
--dimensions Name=FunctionName,Value=your-function-name \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 60 \
--statistics Sum
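A non-zero Throttles count only tells half the story — you also need the ceiling it's hitting. A quick sketch, assuming the same CLI setup as above:

```shell
# Compare the regional concurrency ceiling (default 1,000) against the
# portion left unreserved for on-demand invocations. If another function
# has large reserved concurrency, Unreserved can be much lower than Total.
# Guard + || true keep this sketch runnable without the CLI or credentials.
command -v aws >/dev/null 2>&1 || exit 0

aws lambda get-account-settings \
  --query 'AccountLimit.{Total:ConcurrentExecutions,Unreserved:UnreservedConcurrentExecutions}' \
  || true
```

If Unreserved is far below Total, a sibling function's reserved concurrency may be starving the one that's throttling.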
Lambda Cold Start Diagnosis
In CloudWatch Logs, filter for REPORT lines. A cold start shows Init Duration: X ms:
REPORT RequestId: abc123 Duration: 243.55 ms Billed Duration: 244 ms
Memory Size: 512 MB Max Memory Used: 89 MB Init Duration: 487.23 ms
If Init Duration is consuming most of your timeout budget, you have three options: increase the timeout, enable Provisioned Concurrency, or reduce initialization work (lazy imports, connection reuse).
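The Provisioned Concurrency route can be sketched with the CLI. Note it applies to a published version or alias (the `--qualifier`), never to `$LATEST`; the function and alias names below are placeholders:

```shell
# Pre-warm execution environments so invocations skip the init phase.
# Function name and alias ("live") are placeholders; the count is illustrative.
# Guard + || true keep this sketch runnable without the CLI or credentials.
command -v aws >/dev/null 2>&1 || exit 0

aws lambda put-provisioned-concurrency-config \
  --function-name your-function-name \
  --qualifier live \
  --provisioned-concurrent-executions 5 \
  || true

# Verify the config reached READY status before relying on it
aws lambda get-provisioned-concurrency-config \
  --function-name your-function-name \
  --qualifier live \
  || true
```

Provisioned Concurrency bills for the warm environments whether or not they're invoked, so size the count to your steady-state traffic, not your peak.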
Amazon RDS Diagnostics
Common RDS Failure Patterns
| Symptom | Likely Cause | First Check |
|---|---|---|
| Connection refused / ECONNREFUSED | Security group blocking access, RDS instance stopped, max_connections hit | Security groups → inbound rules on port 5432/3306; RDS console → Status |
| FATAL: sorry, too many clients | max_connections exhausted (common with Lambda at scale) | CloudWatch → DatabaseConnections; use RDS Proxy |
| SSL SYSCALL error: EOF detected | Failover in progress (Multi-AZ) or network interruption | RDS Events for failover events; implement retry logic |
| Read replica lag > threshold | High write load on primary, large transactions, binlog delay | CloudWatch → ReplicaLag metric |
| FreeStorageSpace = 0 | Disk full — instance will become read-only | CloudWatch → FreeStorageSpace; enable autoscaling storage |
| High CPUUtilization (>80%) | Missing index, N+1 queries, autovacuum conflict | Performance Insights → Top SQL statements |
RDS Diagnostic Commands (AWS CLI)
# Check RDS instance status
aws rds describe-db-instances --db-instance-identifier your-db-id \
--query 'DBInstances[0].{Status:DBInstanceStatus,Class:DBInstanceClass,AZ:AvailabilityZone,Endpoint:Endpoint.Address}'
# Check recent RDS events (last 1 hour)
aws rds describe-events \
--source-identifier your-db-id \
--source-type db-instance \
--duration 60
# Check CloudWatch metrics (CPU, connections, free storage)
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name DatabaseConnections \
--dimensions Name=DBInstanceIdentifier,Value=your-db-id \
--start-time $(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 60 \
--statistics Maximum
# List parameter groups (check max_connections setting)
aws rds describe-db-parameters \
--db-parameter-group-name your-parameter-group \
--query 'Parameters[?ParameterName==`max_connections`]'
RDS max_connections and Lambda
Lambda functions can spawn hundreds of concurrent instances, each opening its own database connection. On a small instance class like db.t3.micro, the default max_connections works out to under 100 (the exact value depends on engine and instance memory). This is almost always the cause of "too many clients" errors in Lambda + RDS architectures.
Fix: Use RDS Proxy (connection pooling) or a connection pool library (PgBouncer, pgx's built-in pool). RDS Proxy is especially effective for Lambda use cases — connections are pooled at the proxy level, not per-Lambda-instance.
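If an RDS Proxy is already in place, confirm the proxy itself is healthy before blaming the database. A sketch, with the proxy name as a placeholder:

```shell
# Check proxy availability, then the health of its database targets.
# "your-proxy-name" is a placeholder.
# Guard + || true keep this sketch runnable without the CLI or credentials.
command -v aws >/dev/null 2>&1 || exit 0

aws rds describe-db-proxies \
  --query 'DBProxies[].{Name:DBProxyName,Status:Status,Endpoint:Endpoint}' \
  || true

aws rds describe-db-proxy-targets \
  --db-proxy-name your-proxy-name \
  --query 'Targets[].{Target:RdsResourceId,State:TargetHealth.State}' \
  || true
```

A proxy in `available` status with an `UNAVAILABLE` target usually means the proxy can't authenticate to the database (Secrets Manager permissions) or a security group is blocking the proxy-to-database path.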
Amazon EC2 Diagnostics
Common EC2 Failure Patterns
| Symptom | Likely Cause | First Check |
|---|---|---|
| Instance unreachable (SSH timeout) | Security group rule missing, VPC routing issue, instance stopped/terminated | EC2 console → Instance State; security groups inbound 22/443 |
| Instance status check failed (1/2) | OS-level issue: disk full, OOM, kernel panic, filesystem error | EC2 console → Monitoring tab → Status Checks |
| System status check failed (2/2) | AWS infrastructure issue: host hardware failure, power/network issue | Check AWS Health Dashboard for your AZ; stop/start instance (migrates to new host) |
| EBS volume unresponsive | EBS service degradation (AZ-specific), I/O credit exhaustion (gp2), volume offline | CloudWatch → VolumeQueueLength spike; check EBS status in AWS console |
| Instance terminated unexpectedly | Spot instance interruption, Auto Scaling scale-in, account billing issue | CloudTrail → TerminateInstances events |
| ELB target health: unhealthy | Health check path returning non-200, security group blocking ELB | EC2 → Target Groups → Health status; check security group allows ELB CIDR |
EC2 Diagnostic Commands (AWS CLI)
# Check instance status and status checks
aws ec2 describe-instance-status \
--instance-ids i-0123456789abcdef0 \
--query 'InstanceStatuses[0].{State:InstanceState.Name,System:SystemStatus.Status,Instance:InstanceStatus.Status}'
# Get system log (last output before connectivity loss)
aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text
# Check recent CloudTrail events for instance
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=ResourceName,AttributeValue=i-0123456789abcdef0 \
--start-time $(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%SZ)
# Describe target group health
aws elbv2 describe-target-health \
--target-group-arn arn:aws:elasticloadbalancing:region:account:targetgroup/name/id
System Status Check Failed — What To Do
A System Status Check failure (2/2) means AWS infrastructure is having problems with the underlying hardware. This is AWS's problem, not yours. Options:
- Stop and start the instance (not reboot) — a stop/start migrates the instance to a healthy host, while a reboot leaves it on the same hardware. Elastic IPs stay attached, but data on any instance store volumes is lost. Note: instance-store-backed instances cannot be stopped at all, only terminated.
- Check AWS Personal Health Dashboard — AWS may have already flagged the AZ/host issue and scheduled a maintenance event.
- Scheduled Retirement — AWS occasionally schedules retirement of instances on degraded hosts. Check the console for retirement notifications.
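The stop/start migration can be scripted end-to-end with the CLI's built-in waiters. A sketch, with the instance ID as a placeholder; this applies only to EBS-backed instances:

```shell
# Migrate off a degraded host: stop, wait, start, then wait for status
# checks to pass on the new hardware. Instance ID is a placeholder.
# Guard + || true keep this sketch runnable without the CLI or credentials.
command -v aws >/dev/null 2>&1 || exit 0

INSTANCE_ID="i-0123456789abcdef0"

aws ec2 stop-instances --instance-ids "$INSTANCE_ID" || true
aws ec2 wait instance-stopped --instance-ids "$INSTANCE_ID" || true
aws ec2 start-instances --instance-ids "$INSTANCE_ID" || true

# Blocks until both system and instance status checks pass
aws ec2 wait instance-status-ok --instance-ids "$INSTANCE_ID" || true
```

Expect the public IP to change on start unless an Elastic IP is attached — update any DNS records or allowlists that reference it.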
AWS Availability Zone Failures — Pattern Recognition
AWS incidents are often AZ-specific, not regional. Signs that an AZ is degraded rather than a whole region:
- Some instances/services in the same region are fine, others fail
- The issue correlates with a specific AZ suffix (e.g., us-east-1c vs us-east-1a)
- AWS Health shows "us-east-1c" in the affected scope
- Some RDS replicas fail but primary (in different AZ) is fine
Mitigation: Multi-AZ deployments for RDS, Auto Scaling groups spread across 3+ AZs, ELB with health checks to automatically route around unhealthy AZ instances.
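During an AZ-specific incident, the first question is how exposed you actually are. A quick sketch to count running instances per AZ, assuming the same CLI setup as the commands above:

```shell
# Count running instances per AZ — shows whether the fleet is spread
# across zones or concentrated in one.
# Guard + || true keep this sketch runnable without the CLI or credentials.
command -v aws >/dev/null 2>&1 || exit 0

aws ec2 describe-instances \
  --filters Name=instance-state-name,Values=running \
  --query 'Reservations[].Instances[].Placement.AvailabilityZone' \
  --output text | tr '\t' '\n' | sort | uniq -c | sort -rn \
  || true
```

If one zone dominates the count and matches the degraded AZ suffix, shifting Auto Scaling capacity or failing over RDS to another AZ is the fastest mitigation.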
Notable AWS Individual Service Incidents — Q1 2026
- AWS Lambda — us-east-1 (Feb 2026): Lambda cold start latency increased 3–5x for approximately 2 hours. Functions that worked within their timeout limit began timing out. AWS attributed it to internal capacity management changes. Workaround: Provisioned Concurrency on critical functions.
- Amazon RDS (Aurora) — ap-southeast-1 (Jan 2026): Aurora Serverless v2 auto-scaling delays caused connection pool exhaustion on rapidly scaling workloads. Instances were healthy; capacity wasn't scaling fast enough to match demand spike.
- EBS — us-west-2a (Mar 2026): Elevated error rates for EBS volumes in us-west-2a. EC2 instances with EBS root volumes in the affected AZ saw I/O stalls. Instances using instance store were unaffected.
Monitoring AWS Services with Ezmon
AWS's own status page tracks service availability, but your monitoring should answer a more specific question: is your application working for your users?
The layers:
- AWS status page: "Is the service having problems globally/regionally?"
- Personal Health Dashboard: "Is the service having problems for my specific account/resources?"
- CloudWatch metrics: "What's happening inside my resources right now?"
- External monitoring (Ezmon): "Can a user in Tokyo actually hit my endpoint and get a response in under 2 seconds?"
Ezmon monitors your actual application endpoints from 15+ global locations. This catches the gap between "AWS is healthy" and "your users can't reach you" — which is where most real outages live.
Monitor your AWS-hosted services from outside AWS →
Related Guides
- Is AWS Down? Platform-Level Status Checker
- Kubernetes Cluster Down? K8s Triage Guide
- Is Cloudflare Down? (Including Workers + CDN)
- Is GitHub Down? (Including Actions, Codespaces)
- Monitoring Best Practices 2026
AWS service status sourced from AWS Service Health Dashboard. All times UTC. For account-specific issues, always check the AWS Personal Health Dashboard in your AWS Console.