Is Gatus Down? Real-Time Status & Outage Checker
Gatus is an open-source automated health dashboard written in Go with over 7,000 GitHub stars. It monitors endpoints via HTTP, TCP, DNS, ICMP, and WebSocket checks with configurable thresholds — response time, status codes, response body content, and certificate expiry. Gatus renders a beautiful public-facing status page, supports rich alerting integrations (Slack, PagerDuty, Microsoft Teams, Discord, email, and more), and stores check history in SQLite or PostgreSQL. Created as a self-hosted alternative to Pingdom, Freshping, and Betteruptime, Gatus is widely used by indie developers, DevOps teams, and SRE teams who want full control over their monitoring infrastructure. Its single Go binary with a YAML configuration file makes deployment trivially simple — Docker, Kubernetes, or a bare VPS all work out of the box.
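As a sketch of how simple that YAML configuration is, here is a minimal `config.yaml` monitoring a single site (the endpoint name and URL are placeholders; the `[STATUS]`, `[RESPONSE_TIME]`, and `[CERTIFICATE_EXPIRATION]` condition placeholders are standard Gatus syntax):

```yaml
endpoints:
  - name: website                        # display name on the status page
    url: "https://example.com"           # placeholder target URL
    interval: 60s                        # how often the check runs
    conditions:
      - "[STATUS] == 200"                # HTTP status must be 200
      - "[RESPONSE_TIME] < 500"          # milliseconds
      - "[CERTIFICATE_EXPIRATION] > 72h" # warn before the cert expires
```

Point the binary at this file (or mount it at `/config/config.yaml` in the official Docker image) and Gatus starts checking immediately.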
The irony of monitoring tools is that they can go down too — and when Gatus fails, it does so silently. Endpoint checks stop executing, alerts stop firing, and your status page goes stale or returns a 502. If Gatus is your only monitoring layer, any downstream outage during a Gatus failure goes completely undetected. Because Gatus is typically the tool you use to catch outages, running a separate check on Gatus itself is a critical reliability practice.
Quick Status Check
```bash
#!/bin/bash
# Gatus health check
# Checks health endpoint, API, process, port, and config file

GATUS_HOST="${GATUS_HOST:-localhost}"
GATUS_PORT="${GATUS_PORT:-8080}"
GATUS_CONFIG="${GATUS_CONFIG:-/etc/gatus/config.yaml}"
FAIL=0

echo "=== Gatus Status Check ==="
echo "Host: ${GATUS_HOST}:${GATUS_PORT}"
echo ""

# Check /health endpoint (returns {"healthy":true})
HEALTH_RESP=$(curl -sf --max-time 5 "http://${GATUS_HOST}:${GATUS_PORT}/health" 2>&1)
if echo "${HEALTH_RESP}" | grep -q '"healthy":true'; then
    echo "[OK] Health endpoint: {\"healthy\":true}"
elif echo "${HEALTH_RESP}" | grep -q "healthy"; then
    echo "[WARN] Health endpoint responded but may be degraded: ${HEALTH_RESP}"
else
    echo "[FAIL] Health endpoint unreachable or returned unhealthy"
    FAIL=1
fi

# Check API endpoint statuses
HTTP_CODE=$(curl -so /dev/null -w "%{http_code}" --max-time 5 \
    "http://${GATUS_HOST}:${GATUS_PORT}/api/v1/endpoints/statuses" 2>/dev/null)
if [ "$HTTP_CODE" = "200" ]; then
    echo "[OK] Endpoint statuses API returned HTTP 200"
else
    echo "[FAIL] Endpoint statuses API returned HTTP ${HTTP_CODE}"
    FAIL=1
fi

# Check Gatus process
if pgrep -f "gatus" > /dev/null 2>&1; then
    echo "[OK] Gatus process is running"
else
    echo "[WARN] Gatus process not found via pgrep"
fi

# Check port is listening
if ss -tlnp 2>/dev/null | grep -q ":${GATUS_PORT}" || \
   netstat -tlnp 2>/dev/null | grep -q ":${GATUS_PORT}"; then
    echo "[OK] Port ${GATUS_PORT} is listening"
else
    echo "[FAIL] Port ${GATUS_PORT} not listening"
    FAIL=1
fi

# Check config file exists and is non-empty
if [ -f "${GATUS_CONFIG}" ]; then
    LINES=$(wc -l < "${GATUS_CONFIG}" 2>/dev/null || echo 0)
    echo "[OK] Config file found: ${GATUS_CONFIG} (${LINES} lines)"
else
    # Try common alternative locations
    for path in ./config.yaml /config/config.yaml /app/config.yaml; do
        if [ -f "$path" ]; then
            echo "[OK] Config found at ${path}"
            GATUS_CONFIG="$path"
            break
        fi
    done
    if [ ! -f "${GATUS_CONFIG}" ]; then
        echo "[WARN] Config file not found at ${GATUS_CONFIG}"
    fi
fi

echo ""
if [ "$FAIL" -eq 0 ]; then
    echo "Result: Gatus appears healthy"
else
    echo "Result: Gatus has failures — review output above"
    exit 1
fi
```
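Running a check like this from cron on a *different* host than the one running Gatus gives you a primitive dead-man's switch. A sketch of a crontab entry, assuming the script is saved as `/usr/local/bin/gatus-check.sh` and you have a notification webhook of your own (the URL below is a placeholder):

```shell
# Every 5 minutes, run the health check; on failure, post to a webhook.
# Crontab entries must be a single line.
*/5 * * * * /usr/local/bin/gatus-check.sh >/dev/null 2>&1 || curl -sf -X POST -H 'Content-Type: application/json' -d '{"text":"Gatus health check failed"}' https://hooks.example.com/notify
```

Because the watchdog lives outside the Gatus host, it still fires when Gatus, its container, or the whole machine goes down.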
Python Health Check
```python
#!/usr/bin/env python3
"""
Gatus health check
Verifies health endpoint, endpoint status API, and monitoring coverage
"""
import json
import os
import subprocess
import sys
import time
import urllib.error as urlerr
import urllib.request as urlreq
from pathlib import Path

HOST = os.environ.get("GATUS_HOST", "localhost")
PORT = int(os.environ.get("GATUS_PORT", "8080"))
BASE_URL = f"http://{HOST}:{PORT}"
GATUS_CONFIG = os.environ.get("GATUS_CONFIG", "/etc/gatus/config.yaml")
WARN_LATENCY_MS = 2000
TIMEOUT = 8

results = []

def check(label, ok, detail=""):
    # ok=True -> OK, ok=False -> FAIL, ok=None -> WARN (not counted as failure)
    status = "OK" if ok else ("WARN" if ok is None else "FAIL")
    msg = f"[{status}] {label}"
    if detail:
        msg += f" — {detail}"
    print(msg)
    if ok is not None:
        results.append(ok)
    return ok

def fetch(path, timeout=TIMEOUT):
    try:
        url = f"{BASE_URL}{path}"
        req = urlreq.Request(url, headers={"Accept": "application/json"})
        with urlreq.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urlerr.HTTPError as e:
        return e.code, e.read().decode("utf-8", errors="replace")
    except Exception as e:
        return 0, str(e)

print("=== Gatus Python Health Check ===")
print(f"Target: {BASE_URL}")
print()

# 1. Health endpoint
t0 = time.time()
status, body = fetch("/health")
latency_ms = (time.time() - t0) * 1000
if status == 200:
    try:
        data = json.loads(body)
        healthy = data.get("healthy", False)
        check("Health endpoint", healthy,
              f'{{"healthy":{str(healthy).lower()}}} ({latency_ms:.0f}ms)')
    except json.JSONDecodeError:
        check("Health endpoint", False, f"HTTP 200 but body not valid JSON: {body[:80]}")
else:
    check("Health endpoint", False, f"HTTP {status}")

# 2. Response latency (Go service — should be very fast)
if status == 200:
    if latency_ms > WARN_LATENCY_MS:
        check("Response latency", False,
              f"{latency_ms:.0f}ms exceeds {WARN_LATENCY_MS}ms threshold — Gatus may be overloaded")
    else:
        check("Response latency", True, f"{latency_ms:.0f}ms (Go service expected to be fast)")

# 3. Endpoint statuses API — also confirms at least one endpoint is monitored
status, body = fetch("/api/v1/endpoints/statuses")
if status == 200:
    try:
        data = json.loads(body)
        if isinstance(data, list):
            total = len(data)
            unhealthy = []
            for ep in data:
                name = ep.get("name", ep.get("key", "unknown"))
                results_list = ep.get("results", [])
                if results_list:
                    # Results are ordered oldest to newest; the last entry is the latest check
                    latest = results_list[-1]
                    if not latest.get("success", True):
                        unhealthy.append(name)
            check("Total endpoints monitored", total > 0, f"{total} endpoint(s) configured")
            if unhealthy:
                check("Endpoint health", False,
                      f"{len(unhealthy)}/{total} endpoint(s) currently failing: "
                      + ", ".join(unhealthy[:5]))
            else:
                check("Endpoint health", True, f"All {total} endpoint(s) passing")
        elif isinstance(data, dict):
            # Paginated or wrapped response
            check("Endpoint statuses API", True,
                  f"HTTP 200, response shape: dict with keys {list(data.keys())[:4]}")
        else:
            check("Endpoint statuses API", False, "Unexpected response format")
    except json.JSONDecodeError:
        check("Endpoint statuses API", False, "HTTP 200 but body not valid JSON")
else:
    check("Endpoint statuses API", False, f"HTTP {status}")

# 4. Config file exists
config_paths = [
    Path(GATUS_CONFIG),
    Path("./config.yaml"),
    Path("/config/config.yaml"),
    Path("/app/config.yaml"),
]
config_found = next((p for p in config_paths if p.exists()), None)
if config_found:
    size = config_found.stat().st_size
    check("Config file", True, f"Found at {config_found} ({size} bytes)")
else:
    check("Config file", None, "Not found at common paths (may be mounted differently in Docker)")

# 5. Gatus process
try:
    out = subprocess.run(["pgrep", "-f", "gatus"], capture_output=True, text=True)
    check("Gatus process", out.returncode == 0,
          "running" if out.returncode == 0 else "not found — may be in Docker")
except FileNotFoundError:
    check("Gatus process", None, "pgrep not available")

# 6. Database file (SQLite default)
db_paths = [
    Path("/data/data.db"),
    Path("/app/data.db"),
    Path("./data.db"),
    Path("/etc/gatus/data.db"),
]
db_found = next((p for p in db_paths if p.exists()), None)
if db_found:
    size_mb = db_found.stat().st_size / (1024 * 1024)
    check("SQLite database", True, f"Found at {db_found} ({size_mb:.1f} MB)")
else:
    check("SQLite database", None, "Not found at common paths — may use PostgreSQL or different mount")

print()
failures = [r for r in results if r is False]
if not failures:
    print("Result: Gatus appears healthy")
    sys.exit(0)
else:
    print(f"Result: {len(failures)} check(s) failed — review output above")
    sys.exit(1)
```
Common Gatus Outage Causes
| Symptom | Likely Cause | Resolution |
|---|---|---|
| Gatus refuses to start, all monitoring stops | YAML configuration parse error — invalid syntax or unknown key after upgrade | Validate the YAML with a linter such as yamllint before deploying; check the Gatus startup logs for the exact parse error; review release notes for deprecated config keys |
| Status page loads but history is missing or frozen | SQLite database locked — typically caused by two Gatus instances writing simultaneously | Ensure only one Gatus instance runs per SQLite file; for HA deployments, switch to PostgreSQL as the storage backend |
| Incidents occur but no alerts are sent | Alerting credentials expired — Slack webhook revoked, PagerDuty key rotated, SMTP password changed | Test alert credentials via Gatus' built-in alert testing; rotate and update secrets in config or environment variables |
| High false positive rate on endpoint checks | Endpoint timeout configured too low for the target's typical response time | Increase client.timeout per endpoint; review response-time condition thresholds; add failure-threshold: 3 to require consecutive failures before alerting |
| DNS check endpoints always failing | DNS resolver misconfigured — Gatus container using incorrect or unreachable nameserver | Set explicit dns.query-type and verify container DNS settings; use 8.8.8.8 for testing; check Docker DNS resolution from inside the container |
| External endpoint checks always timing out | Docker network isolation preventing outbound connections to monitored services | Ensure Gatus container is on a network with external internet access; use --network host for internal network monitoring; check egress firewall rules |
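Two of the fixes above — raising the client timeout and requiring consecutive failures before alerting — are per-endpoint settings in the Gatus config. A sketch, with the endpoint name and URL as placeholders:

```yaml
endpoints:
  - name: slow-api                # placeholder name
    url: "https://api.example.com/health"
    interval: 60s
    client:
      timeout: 10s                # raise above the target's worst-case response time
    conditions:
      - "[STATUS] == 200"
    alerts:
      - type: slack
        failure-threshold: 3      # alert only after 3 consecutive failures
        success-threshold: 2      # mark resolved after 2 consecutive successes
        send-on-resolved: true    # notify when the endpoint recovers
```

Tuning these two values eliminates most false positives from transient network blips without meaningfully delaying detection of real outages.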
Architecture Overview
| Component | Function | Failure Impact |
|---|---|---|
| Gatus Core (Go binary) | Executes all endpoint checks on configured intervals; evaluates conditions and triggers alerts | All monitoring stops; no checks execute and no alerts fire |
| YAML Configuration | Defines endpoints, check intervals, conditions, alerting rules, and UI settings | Parse errors prevent startup; misconfigured endpoints produce false positives or missed failures |
| Storage Backend (SQLite/PostgreSQL) | Persists check history, uptime percentages, and incident timeline for status page | Status page shows no history; uptime calculations reset; SQLite lock blocks all writes |
| HTTP/TCP/DNS/ICMP Check Engine | Executes protocol-specific probes against monitored endpoints | Specific check type fails silently; endpoints using that protocol appear always healthy |
| Alerting Integrations | Sends notifications to Slack, PagerDuty, Teams, Discord, email on threshold breach | Incidents detected but team not notified; silent outage without alert delivery |
| Status Page (port 8080) | Public-facing or internal dashboard showing endpoint health, uptime, and incident history | Stakeholders cannot view service health; 502 from reverse proxy if Gatus crashes |
Uptime History
| Date | Incident Type | Duration | Impact |
|---|---|---|---|
| 2026-01 | Breaking config schema change in a major release — services key renamed to endpoints | Until manually migrated | Existing configs caused startup failure after upgrade; all monitoring stopped until config updated |
| 2025-09 | SQLite WAL corruption after host power loss | Variable (user-managed) | Status page history lost; Gatus required database deletion and restart to recover |
| 2025-08 | Docker Hub rate limit blocking Gatus image pulls during CI/CD updates | ~3 hours | Automated Gatus deployments failed; running containers unaffected |
| 2025-07 | Slack alerting API changes breaking Gatus webhook format | Several days until patch release | Slack alerts silently failing; PagerDuty and email alerts continued normally |
Monitor Gatus Automatically
The fundamental problem with any monitoring tool is that it cannot alert you to its own failure. A crashed Gatus instance is indistinguishable from a perfect day with no incidents — checks simply stop executing silently. ezmon.com monitors your Gatus endpoints from multiple external probes and alerts your team via Slack, PagerDuty, or SMS the moment the /health endpoint stops returning {"healthy":true} or the status page becomes unreachable.