Is [Service] Down? A Developer's Guide to Diagnosing Outages Fast

Introduction

Your deployment pipeline just failed. Your monitoring dashboard is screaming. Users are tweeting "is [service] down?"

Before you spiral into debugging a problem that isn't yours, here's how to determine in under 2 minutes whether a third-party service is down, and what to do next.

Time cost of not having this process: The average engineering team wastes 47 minutes per incident just confirming whether the problem is theirs or the vendor's. At 20+ incidents per year, that adds up to more than 15 hours annually spent on triage alone.


The 2-Minute Outage Diagnosis Protocol

Step 1: Check Your Own Status (30 seconds)

Before looking outward, eliminate the obvious:

# Is it just your IP/network?
curl -I https://api.stripe.com/v1/charges \
  -H "Authorization: Bearer sk_test_xxx" \
  --connect-timeout 5 \
  --max-time 10 \
  -w "\n\nHTTP Status: %{http_code}\nTime to connect: %{time_connect}s\nTotal time: %{time_total}s"

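What to look for:

  • HTTP 200 → Your connection works; the problem may be upstream in your app
  • HTTP 401 → The placeholder key above was rejected, but the API is reachable; rerun with a valid test key
  • Connection timeout → Network issue (yours or theirs)
  • HTTP 500/502/503 → Service-side error, likely their problem
  • HTTP 429 → Rate limiting, which usually points to your own request volume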
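
The status-code triage above can be sketched as a small helper. This is a minimal illustration, not a definitive rule set: the function name `classify_probe` and its return strings are hypothetical, and the 401/403 branch is an added assumption for probes sent with a placeholder or expired key.

```python
from typing import Optional


def classify_probe(http_status: Optional[int], timed_out: bool = False) -> str:
    """Map a single probe outcome to a first-guess diagnosis.

    http_status is None when no HTTP response was received at all.
    """
    if timed_out or http_status is None:
        return "network issue (yours or theirs)"
    if http_status == 429:
        return "rate limited: check your request volume"
    if http_status in (401, 403):
        # Assumption: auth failures mean the API answered, so it is reachable
        return "credentials or permissions: check your API key"
    if 500 <= http_status <= 599:
        return "service-side error: likely the vendor"
    if 200 <= http_status <= 299:
        return "vendor reachable: look upstream in your app"
    return "unclassified: inspect the response body"


print(classify_probe(503))  # service-side error: likely the vendor
```

A single classification is only a starting hypothesis; it still needs corroboration from other sources.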

Step 2: Multi-Source Confirmation (60 seconds)

One monitoring probe failing proves nothing. Look for corroboration:

  1. ezmon.com — Real-time monitoring from 15 global probe locations
  2. Vendor status page — Always check (but be skeptical — vendors are slow to update)
  3. Twitter/X search — Search "[service name]" down — DevOps engineers are fast reporters
  4. DownDetector — Consumer-focused but shows geographic patterns

Red flags that confirm an outage:

  • Multiple independent sources reporting the same issue
  • Spike in reports from unrelated users in different regions
  • Vendor acknowledges "investigating" on their status page
  • Your ezmon.com dashboard shows probe failures from 3+ locations
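
The "3+ locations" rule of thumb can be expressed as a simple threshold check. The sketch below assumes you already have one pass/fail result per probe location; the function name `outage_confirmed`, the location names, and the data shape are illustrative and not part of any ezmon.com API.

```python
from typing import Dict


def outage_confirmed(probe_results: Dict[str, bool], min_failing: int = 3) -> bool:
    """Return True when at least min_failing locations report a failure.

    probe_results maps a location name to True if the probe succeeded.
    """
    failing = [location for location, ok in probe_results.items() if not ok]
    return len(failing) >= min_failing


# Hypothetical results from four probe locations
results = {"us-east": False, "eu-west": False, "ap-south": False, "sa-east": True}
print(outage_confirmed(results))  # True
```

Requiring several independent failures guards against treating one flaky probe or one bad network path as a vendor outage.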

Step 3: Scope the Blast Radius (30 seconds)

Not all outages are created equal. Determine:

| Question | Why It Matters |
|---|---|
| Which regions affected? | Route around the failure if possible |
| Which endpoints/features? | Partial outage may allow graceful degradation |
| How long has it been? | >30 min = likely a complex incident, plan for hours |
| Is there a stated ETA? | Vendors almost always underestimate recovery time |
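
To answer the region and endpoint questions quickly, failed probes can be tallied along both dimensions. This is a rough sketch assuming each failure is recorded as a (region, endpoint) pair; `summarize_failures` and the sample values are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def summarize_failures(failures: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count failed probes per region and per endpoint."""
    by_region: Counter = Counter()
    by_endpoint: Counter = Counter()
    for region, endpoint in failures:
        by_region[region] += 1
        by_endpoint[endpoint] += 1
    return {"regions": by_region, "endpoints": by_endpoint}


# Illustrative failure records, not real incident data
failures = [
    ("us-east-1", "/v1/charges"),
    ("us-east-1", "/v1/refunds"),
    ("eu-west-1", "/v1/charges"),
]
summary = summarize_failures(failures)
print(summary["regions"].most_common(1))  # [('us-east-1', 2)]
```

If one region or one endpoint dominates the counts, routing around it or degrading a single feature may be enough.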

How to Check If a Specific Service Is Down

AWS

  • Status page: health.aws.amazon.com/health/status (the older status.aws.amazon.com address redirects here; check specific region + service)
  • ezmon.com monitoring: Real-time S3, EC2, Lambda, RDS probes
  • Key tip: AWS status page notoriously lags. Check Twitter #AWSOutage for faster intel.
  • API check:
aws s3 ls s3://your-bucket --region us-east-1 2>&1 | head -5

GitHub

  • Status page: githubstatus.com
  • Quick check:
curl -s https://api.github.com/meta | python3 -m json.tool | head -5
  • Common issue: API degradation before web UI fails — CI/CD pipelines are first to break

Cloudflare

  • Status page: cloudflarestatus.com
  • Impact: Cloudflare proxies roughly 20% of global web traffic, so its outages are among the widest-blast-radius incidents on the internet.
  • Signature symptom: 522 errors (connection timed out) across unrelated domains

Stripe / Payment Providers

  • Status page: status.stripe.com
  • Critical: Payment outages are compliance events. Document the timeline from first detection.
  • Check:
curl -s https://status.stripe.com/api/v2/status.json | python3 -m json.tool
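
A status JSON response like the one fetched above can be reduced to a one-line summary for alerts. The sketch below assumes a Statuspage-style payload with `status.indicator` and `status.description` fields; whether a given vendor returns exactly this shape is an assumption to verify, and the sample payload is made up for illustration.

```python
import json


def summarize_status(payload: str) -> str:
    """Extract the indicator and description from a status JSON document."""
    data = json.loads(payload)
    status = data.get("status", {})
    indicator = status.get("indicator", "unknown")
    description = status.get("description", "no description")
    return f"{indicator}: {description}"


# Hypothetical payload, not real vendor output
sample = '{"status": {"indicator": "minor", "description": "Partially Degraded Service"}}'
print(summarize_status(sample))  # minor: Partially Degraded Service
```

Falling back to "unknown" rather than raising keeps the check usable when a vendor changes or omits fields mid-incident.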

Google APIs (GCP, Maps, Search)

  • Status page: status.cloud.google.com
  • Gotcha: Google often shows "minor disruption" when the reality is widespread impact

What To Do When a Vendor Is Down

Immediate (0-5 minutes)

  1. Stop the bleeding — Disable the dependent feature, show a maintenance message
  2. Document the start time — For SLA claims and postmortem
  3. Alert your team — Don't let 5 engineers debug independently for 20 minutes
  4. Set up a status page entry — Even if it's just "We're monitoring a third-party issue"
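
# Example: feature flag for graceful degradation
# (payment_service is a hypothetical client exposing is_healthy() and charge())
def create_charge(payment_service, order):
    if not payment_service.is_healthy():
        # Fail fast with a retry hint instead of calling the down vendor
        return {"error": "Payment processing temporarily unavailable",
                "retry_after": 300}
    return payment_service.charge(order)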

Short-term (5-60 minutes)

  • Implement circuit breakers — Stop hammering a down service, protect your own resources
  • Check SLA entitlements — Many vendors only grant credits if you file a claim within a fixed window, so read the terms while the incident is still fresh
  • Consider fallbacks — Can you queue requests for retry? Use a backup provider?
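
A circuit breaker, as mentioned above, can be sketched in a few lines. This is a minimal single-threaded illustration, not a production implementation: the class name, thresholds, and the injectable `clock` parameter are assumptions chosen to keep the example small and testable.

```python
import time


class CircuitBreaker:
    """Open after N consecutive failures, then allow a trial call after a cooldown."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: skipping call to failing dependency")
            # Cooldown elapsed: allow one trial call (half-open state).
            # A single further failure will reopen the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success fully closes the circuit
        return result
```

Wrapping vendor calls as `breaker.call(client.charge, order)` (names hypothetical) means that once the vendor has failed repeatedly, later calls fail immediately instead of tying up threads and connections on timeouts. The caught `RuntimeError` is a natural place to queue the request for retry or switch to a backup provider.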

Communication (ongoing)

Post a brief status update every 15-30 minutes. Users hate silence more than they hate downtime.

Template:

[SERVICE NAME] INCIDENT — [TIME]
Status: Monitoring
Impact: [describe]
Cause: Third-party vendor ([vendor]) experiencing [type] outage
Next update: [time]

Building Outage Detection Into Your Stack

Active Health Checks

Don't wait for users to report outages. Probe your dependencies:

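import asyncio
import time

import httpx

async def check_dependency(name: str, url: str, timeout: float = 5.0) -> dict:
    """Probe one dependency and report its status and latency."""
    try:
        async with httpx.AsyncClient() as client:
            start = time.perf_counter()  # monotonic timer, safe against clock changes
            resp = await client.get(url, timeout=timeout)
            latency = (time.perf_counter() - start) * 1000
            return {
                "name": name,
                # Unauthenticated probes may get 4xx (e.g. 401) from a healthy
                # API, so only 5xx responses are treated as degraded.
                "status": "healthy" if resp.status_code < 500 else "degraded",
                "latency_ms": latency,
                "http_status": resp.status_code,
            }
    except httpx.TimeoutException:
        return {"name": name, "status": "timeout", "latency_ms": None}
    except Exception as e:
        return {"name": name, "status": "error", "error": str(e)}

dependencies = [
    ("stripe", "https://api.stripe.com/v1/charges"),
    ("github", "https://api.github.com/meta"),
    ("aws-s3", "https://s3.amazonaws.com"),
]

async def main(interval: float = 60.0) -> None:
    # Run every 60 seconds, probing all dependencies concurrently
    while True:
        results = await asyncio.gather(
            *(check_dependency(name, url) for name, url in dependencies)
        )
        for result in results:
            print(result)
        await asyncio.sleep(interval)

if __name__ == "__main__":
    asyncio.run(main())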

Multi-Region Probing

A service might be "up" from your server but down from where your users are. Always check from multiple geographic locations.

ezmon.com does this automatically — 15 probe locations, 60-second intervals, with incident reports published in real-time.


The Real Cost of Missing Outages Early

| Detection Method | Avg Time to Detection | Cost at $10K/hr Revenue |
|---|---|---|
| User reports (Zendesk/support) | 23 minutes | $3,833 |
| Internal monitoring ping | 5 minutes | $833 |
| Real-time external probe (ezmon.com) | <2 minutes | $333 |

The math is simple: early detection saves money. At $10K/hr, catching one AWS outage 20 minutes earlier saves about $3,333, roughly nine times the $360 annual cost of a $30/month monitoring subscription.
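
The arithmetic behind these figures is easy to rerun with your own revenue numbers. The sketch below uses the $10K/hr revenue and $30/month subscription figures from this section; the function name `downtime_cost` is illustrative.

```python
def downtime_cost(minutes: float, revenue_per_hour: float = 10_000) -> float:
    """Revenue lost while an outage goes undetected for the given minutes."""
    return revenue_per_hour * minutes / 60


for method, minutes in [
    ("User reports", 23),
    ("Internal monitoring ping", 5),
    ("Real-time external probe", 2),
]:
    print(f"{method}: ${downtime_cost(minutes):,.0f}")

# One outage caught 20 minutes earlier vs. a year of a $30/month subscription
saved = downtime_cost(20)
annual_subscription = 30 * 12
print(round(saved / annual_subscription, 1))  # 9.3
```

Swapping in your own hourly revenue shows how quickly the break-even point moves.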


Quick Reference Checklist

When you suspect a service is down:

  • [ ] Check your own logs first (eliminate self as cause)
  • [ ] Curl the specific endpoint from your servers
  • [ ] Check ezmon.com for real-time probe data
  • [ ] Check vendor status page (assume it lags reality)
  • [ ] Search Twitter/X for "[service] down" in the last 15 minutes
  • [ ] Confirm from 3+ independent sources
  • [ ] Document start time for SLA/postmortem
  • [ ] Implement graceful degradation immediately
  • [ ] Communicate to users within 5 minutes

Stay Ahead of Outages

ezmon.com monitors 500+ services from 15 global locations at 60-second intervals. When something breaks, we publish real-time incident reports with verified data: no speculation, no lag.

Subscribe to our weekly digest — every Monday, the top 5 outages of the week with impact analysis, SRE takeaways, and industry uptime stats.


Data sources: ezmon.com monitoring data, vendor status page APIs, SRE industry benchmarks.
