Is [Service] Down? A Developer's Guide to Diagnosing Outages Fast
Introduction
Your deployment pipeline just failed. Your monitoring dashboard is screaming. Users are tweeting "is [service] down?"
Before you spiral into debugging a bug that isn't yours, here's how to determine in under 2 minutes whether a third-party service is down, and what to do next.
Time cost of not having this process: The average engineering team wastes 47 minutes per incident just confirming whether the problem is theirs or the vendor's. At 20+ incidents per year, that is more than 15 hours lost annually.
The 2-Minute Outage Diagnosis Protocol
Step 1: Check Your Own Status (30 seconds)
Before looking outward, eliminate the obvious:
# Is it just your IP/network?
# (sk_test_xxx is a placeholder; substitute your own test key)
curl -I https://api.stripe.com/v1/charges \
  -H "Authorization: Bearer sk_test_xxx" \
  --connect-timeout 5 \
  --max-time 10 \
  -w "\n\nHTTP Status: %{http_code}\nTime to connect: %{time_connect}s\nTotal time: %{time_total}s"
What to look for:
- HTTP 200 → Your connection works; the problem is more likely in your own app
- HTTP 401/403 → The service is reachable and answering; check your credentials
- Connection timeout → Network issue (yours or theirs)
- HTTP 500/502/503 → Service-side error, likely their problem
- HTTP 429 → Rate limiting: almost always your own request volume, not an outage
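The mapping above can be sketched as a small function, which helps when the same triage has to run from a script or an alert handler. This is a minimal sketch: the name `classify_probe` and the returned labels are illustrative assumptions, not part of any vendor API.

```python
from typing import Optional


def classify_probe(http_status: Optional[int], timed_out: bool = False) -> str:
    """Map a probe result to a first-guess diagnosis (labels are hypothetical)."""
    if timed_out or http_status is None:
        return "network-issue"       # yours or theirs: retry from another network
    if http_status == 429:
        return "rate-limited"        # look at your own request volume first
    if http_status >= 500:
        return "vendor-side-error"   # 500/502/503: likely the vendor's problem
    if 200 <= http_status < 300:
        return "connection-ok"       # service answers; look inside your app
    return "check-request"           # other 3xx/4xx: inspect auth, URL, payload


if __name__ == "__main__":
    for status in (200, 429, 503, None):
        print(status, "->", classify_probe(status))
```

The result is only a first guess. A `vendor-side-error` still needs the corroboration described in Step 2.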
Step 2: Multi-Source Confirmation (60 seconds)
One monitoring probe failing proves nothing. Look for corroboration:
- ezmon.com: Real-time monitoring from 15 global probe locations
- Vendor status page: Always check (but be skeptical, vendors are slow to update)
- Twitter/X search: Search "[service name]" down. DevOps engineers are fast reporters
- DownDetector: Consumer-focused but shows geographic patterns
Red flags that confirm an outage:
- Multiple independent sources reporting the same issue
- Spike in reports from unrelated users in different regions
- Vendor acknowledges "investigating" on their status page
- Your ezmon.com dashboard shows probe failures from 3+ locations
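One way to make "corroboration" concrete is to count distinct sources rather than raw reports, so that ten tweets still count as one signal. The sketch below assumes a hypothetical `Signal` record and a threshold of three sources (the number used in the checklist later in this guide). Neither is a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    source: str           # e.g. "vendor-status", "social", "own-probe" (hypothetical names)
    reports_outage: bool


def confirm_outage(signals: list[Signal], min_sources: int = 3) -> bool:
    """Return True when enough distinct sources report an outage."""
    # A set collapses repeated reports from the same source into one vote.
    reporting = {s.source for s in signals if s.reports_outage}
    return len(reporting) >= min_sources


if __name__ == "__main__":
    seen = [
        Signal("own-probe", True),
        Signal("social", True),
        Signal("social", True),          # duplicate source, counted once
        Signal("vendor-status", False),  # status page not updated yet
    ]
    print(confirm_outage(seen))  # two distinct sources reporting, below the threshold
```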
Step 3: Scope the Blast Radius (30 seconds)
Not all outages are created equal. Determine:
| Question | Why It Matters |
|---|---|
| Which regions affected? | Route around the failure if possible |
| Which endpoints/features? | Partial outage may allow graceful degradation |
| How long has it been? | >30 min = likely a complex incident, plan for hours |
| Is there a stated ETA? | Vendors almost always underestimate recovery time |
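The first two questions in the table can be answered mechanically if your probes record where and what they tested. The following is a sketch only: it assumes each probe result is a dict with hypothetical `region`, `endpoint`, and `ok` keys, which is not the format of any particular monitoring tool.

```python
def scope_blast_radius(probes: list[dict]) -> dict:
    """Summarize which regions and endpoints are failing.

    A region or endpoint counts as failed if any probe against it failed.
    """
    all_regions = {p["region"] for p in probes}
    failed_regions = {p["region"] for p in probes if not p["ok"]}
    failed_endpoints = {p["endpoint"] for p in probes if not p["ok"]}
    return {
        "failed_regions": sorted(failed_regions),
        "healthy_regions": sorted(all_regions - failed_regions),  # candidates for rerouting
        "failed_endpoints": sorted(failed_endpoints),             # candidates for degradation
    }


if __name__ == "__main__":
    results = [
        {"region": "us-east-1", "endpoint": "/v1/charges", "ok": False},
        {"region": "us-east-1", "endpoint": "/v1/customers", "ok": True},
        {"region": "eu-west-1", "endpoint": "/v1/charges", "ok": True},
    ]
    print(scope_blast_radius(results))
```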
How to Check If a Specific Service Is Down
AWS
- Status page: status.aws.amazon.com (check specific region + service)
- ezmon.com monitoring: Real-time S3, EC2, Lambda, RDS probes
- Key tip: The AWS status page notoriously lags. Check Twitter #AWSOutage for faster intel.
- API check:
aws s3 ls s3://your-bucket --region us-east-1 2>&1 | head -5
GitHub
- Status page: githubstatus.com
- Quick check:
curl -s https://api.github.com/meta | python3 -m json.tool | head -5
- Common issue: API degradation before web UI fails — CI/CD pipelines are first to break
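The status page can also be read programmatically. The sketch below assumes the page exposes an Atlassian Statuspage-style summary at `/api/v2/status.json`, with a `status.indicator` field such as `none`, `minor`, `major`, or `critical`. Verify the path and fields against the live page before relying on them. The function names are illustrative.

```python
import json
import urllib.request

# Assumed endpoint: Statuspage-style summary for GitHub's status page.
STATUS_URL = "https://www.githubstatus.com/api/v2/status.json"


def parse_status(payload: dict) -> tuple[str, str]:
    """Extract (indicator, description) from a Statuspage-style payload."""
    status = payload.get("status", {})
    return status.get("indicator", "unknown"), status.get("description", "")


def fetch_status(url: str = STATUS_URL, timeout: float = 5.0) -> tuple[str, str]:
    """Fetch and parse the status summary; network errors propagate to the caller."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_status(json.load(resp))


if __name__ == "__main__":
    indicator, description = fetch_status()
    print(f"{indicator}: {description}")
```

Keeping `parse_status` separate from the network call makes it easy to test against a saved payload. Treat an indicator of `none` as "no acknowledged incident" rather than proof of health, since status pages lag.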
Cloudflare
- Status page: cloudflarestatus.com
- Impact: Roughly 20% of global web traffic is proxied through Cloudflare, so its outages are among the widest-blast-radius incidents on the internet.
- Signature symptom: 522 errors (connection timed out) across unrelated domains
Stripe / Payment Providers
- Status page: status.stripe.com
- Critical: Payment outages are compliance events. Document the timeline from first detection.
- Check:
curl -s https://status.stripe.com/api/v2/status.json | python3 -m json.tool
Google APIs (GCP, Maps, Search)
- Status page: status.cloud.google.com
- Gotcha: Google often shows "minor disruption" when the reality is widespread impact
What To Do When a Vendor Is Down
Immediate (0-5 minutes)
- Stop the bleeding — Disable the dependent feature, show a maintenance message
- Document the start time — For SLA claims and postmortem
- Alert your team — Don't let 5 engineers debug independently for 20 minutes
- Set up a status page entry — Even if it's just "We're monitoring a third-party issue"
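# Example: Feature flag for graceful degradation
# (payment_service and process_payment are application-specific placeholders)
def handle_payment(request):
    if not payment_service.is_healthy():
        # Fail fast with a clear message instead of letting the request time out
        return {"error": "Payment processing temporarily unavailable",
                "retry_after": 300}
    return process_payment(request)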
Short-term (5-60 minutes)
- Implement circuit breakers — Stop hammering a down service, protect your own resources
- Check SLA entitlements — Most vendors require you to report within the incident window
- Consider fallbacks — Can you queue requests for retry? Use a backup provider?
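A circuit breaker can be as simple as a failure counter plus a cool-down timer. The class below is a minimal, single-threaded sketch written for illustration. The names and the default thresholds are assumptions, and production code would more likely use an established resilience library.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures, then allows a trial call after a cool-down."""

    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after   # seconds to wait before a trial call
        self._clock = clock              # injectable so tests can control time
        self._failures = 0
        self._opened_at = None           # None means the breaker is closed

    def allow_request(self) -> bool:
        """False while open; True when closed or once the cool-down has elapsed."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self) -> None:
        """A successful call closes the breaker and clears the failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count a failure; (re)open the breaker once the threshold is reached."""
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

Typical usage is to call `allow_request()` before each vendor call, skip the call and serve the degraded response when it returns False, and report the outcome with `record_success()` or `record_failure()` otherwise.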
Communication (ongoing)
Post a brief status update every 15-30 minutes. Users hate silence more than they hate downtime.
Template:
[SERVICE NAME] INCIDENT — [TIME]
Status: Monitoring
Impact: [describe]
Cause: Third-party vendor ([vendor]) experiencing [type] outage
Next update: [time]
Building Outage Detection Into Your Stack
Active Health Checks
Don't wait for users to report outages. Probe your dependencies:
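import asyncio
import time

import httpx

async def check_dependency(name: str, url: str, timeout: float = 5.0) -> dict:
    """Probe one dependency and report its status and latency."""
    try:
        async with httpx.AsyncClient() as client:
            start = time.perf_counter()  # monotonic clock, safe for measuring durations
            resp = await client.get(url, timeout=timeout)
            latency = (time.perf_counter() - start) * 1000
            return {
                "name": name,
                # Unauthenticated probes often get 401/403, which still proves the
                # service is answering, so only 5xx counts as degraded here.
                "status": "healthy" if resp.status_code < 500 else "degraded",
                "latency_ms": latency,
                "http_status": resp.status_code
            }
    except httpx.TimeoutException:
        return {"name": name, "status": "timeout", "latency_ms": None}
    except Exception as e:
        return {"name": name, "status": "error", "error": str(e)}

dependencies = [
    ("stripe", "https://api.stripe.com/v1/charges"),
    ("github", "https://api.github.com/meta"),
    ("aws-s3", "https://s3.amazonaws.com"),
]

async def main():
    # Run every 60 seconds
    while True:
        # Probe all dependencies concurrently so one slow vendor doesn't delay the rest
        results = await asyncio.gather(
            *(check_dependency(name, url) for name, url in dependencies)
        )
        for result in results:
            print(result)
        await asyncio.sleep(60)

if __name__ == "__main__":
    asyncio.run(main())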
Multi-Region Probing
A service might be "up" from your server but down from where your users are. Always check from multiple geographic locations.
ezmon.com does this automatically: 15 probe locations, 60-second intervals, with incident reports published in real time.
The Real Cost of Missing Outages Early
| Detection Method | Avg Time to Detection | Cost at $10K/hr Revenue |
|---|---|---|
| User reports (Zendesk/support) | 23 minutes | $3,833 |
| Internal monitoring ping | 5 minutes | $833 |
| Real-time external probe (ezmon.com) | <2 minutes | <$333 |
The math is simple: early detection saves money. At $10K/hr, a $30/month monitoring subscription ($360/year) that catches one AWS outage 20 minutes earlier saves about $3,333, roughly 9x its annual cost.
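The figures in the table follow from a single formula: revenue per hour times minutes undetected, divided by 60. The sketch below reproduces them under the table's own assumption of $10K/hr, and it assumes that losses scale linearly with detection time. The function name is illustrative.

```python
def downtime_cost(minutes_to_detect: float, revenue_per_hour: float = 10_000) -> float:
    """Revenue at risk while an outage goes undetected (linear model)."""
    return revenue_per_hour * minutes_to_detect / 60


if __name__ == "__main__":
    for method, minutes in [("user reports", 23), ("internal ping", 5), ("external probe", 2)]:
        print(f"{method}: ${downtime_cost(minutes):,.0f}")

    saved = downtime_cost(20)        # one outage caught 20 minutes earlier
    annual_subscription = 30 * 12    # $30/month
    print(f"payback: {saved / annual_subscription:.1f}x")  # about 9.3x
```

Swap in your own revenue figure to see where the break-even point sits for your team.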
Quick Reference Checklist
When you suspect a service is down:
- [ ] Check your own logs first (eliminate self as cause)
- [ ] Curl the specific endpoint from your servers
- [ ] Check ezmon.com for real-time probe data
- [ ] Check vendor status page (add "lag factor")
- [ ] Search Twitter for "[service]" down, last 15 minutes
- [ ] Confirm from 3+ independent sources
- [ ] Document start time for SLA/postmortem
- [ ] Implement graceful degradation immediately
- [ ] Communicate to users within 5 minutes
Stay Ahead of Outages
ezmon.com monitors 500+ services from 15 global locations at 60-second intervals. When something breaks, we publish real-time incident reports with verified data: no speculation, no lag.
Subscribe to our weekly digest — every Monday, the top 5 outages of the week with impact analysis, SRE takeaways, and industry uptime stats.
Data sources: ezmon.com monitoring data, vendor status page APIs, SRE industry benchmarks.