# SLA vs SLO vs SLI: A Practical Guide for DevOps Engineers in 2026
Everyone knows the acronyms. Few teams implement them correctly.
After analyzing uptime data for 500+ companies on ezmon.com, we see the same pattern repeatedly: engineers know *what* SLAs, SLOs, and SLIs are, but struggle with *how to set them* and *what to do when they breach*.
This guide is practical. No theory. Real examples, real numbers, real consequences.
---
## The 30-Second Version
| Term | What It Is | Who Cares |
|------|-----------|-----------|
| **SLI** (Service Level Indicator) | A metric you actually measure. E.g., "HTTP success rate over rolling 30 days." | Engineering |
| **SLO** (Service Level Objective) | The target you aim to hit. E.g., "99.9% HTTP success rate." | Engineering + Product |
| **SLA** (Service Level Agreement) | The contract with consequences. E.g., "99.9% uptime or we credit your bill." | Business + Legal |
The relationship: **SLIs measure → SLOs target → SLAs commit.**
---
## SLI: What You Measure
An SLI is just a number. It should be:
1. **Quantifiable** — a percentage, latency measurement, or count
2. **Directly tied to user experience** — not CPU usage, but request success rate
3. **Consistently measurable** — same methodology every time
### Common SLIs and how to measure them
**Availability SLI:**
```
availability = (successful_requests / total_requests) × 100
# Example over 30 days:
# total_requests = 8,640,000 (200 req/min × 60 × 24 × 30)
# failed_requests = 8,640 (simulating 0.1% error rate)
# availability = (8,631,360 / 8,640,000) × 100 = 99.90%
```
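The availability calculation above can be run directly; the request counts are the illustrative numbers from the example:

```python
# Availability SLI from raw request counts (numbers from the example above)
total_requests = 8_640_000
failed_requests = 8_640  # 0.1% error rate
successful_requests = total_requests - failed_requests

availability = successful_requests / total_requests * 100
print(f"{availability:.2f}%")  # 99.90%
```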
**Latency SLI:**
```
latency_sli = percentage_of_requests_under_threshold
# Example: "95% of requests complete in under 200ms"
# If p95 latency = 185ms → SLI = 100%
# If p95 latency = 210ms → SLI = 0% (threshold breached)
```
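Computed from raw measurements, the latency SLI is just the fraction of samples under the threshold. A minimal Python sketch with hypothetical latency samples:

```python
# Latency SLI: percentage of requests completing under a threshold
# (latencies_ms is a hypothetical sample, not real data)
latencies_ms = [120, 95, 210, 180, 450, 88, 150, 199, 220, 130]
threshold_ms = 200

under = sum(1 for latency in latencies_ms if latency < threshold_ms)
latency_sli = under / len(latencies_ms) * 100
print(f"{latency_sli:.0f}% of requests under {threshold_ms}ms")  # 70%
```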
**Error rate SLI:**
```
error_rate = (5xx_responses / total_responses) × 100
# Target: error_rate < 0.1%
```
**The common mistake:** Measuring the wrong thing. CPU utilization at 80% doesn't tell you whether users are affected. Request success rate does.
---
## SLO: The Target You Set
An SLO is your internal reliability commitment. It should be:
- **Slightly harder to achieve than your SLA** — your SLO is the internal guardrail before you breach the external commitment
- **Based on real measurement data** — don't just pick 99.99% because it sounds good
- **Achievable, not aspirational** — if you've never hit 99.95%, don't set 99.99% as your SLO
### Setting your first SLO: the right process
1. **Measure your current baseline** — what's your actual availability over the past 90 days?
2. **Identify your worst month** — what's the floor you can reliably commit to?
3. **Set SLO = (worst month - 0.05%)** — give yourself headroom for variance
4. **Define the measurement window** — rolling 30 days is standard
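The four steps above reduce to a small calculation. A sketch, assuming you already have per-month availability numbers for the last 90 days (the values below are hypothetical):

```python
# Derive a first SLO from measured monthly availability
monthly_availability = [99.92, 99.97, 99.88]  # hypothetical last-90-days data

worst_month = min(monthly_availability)  # the floor you can reliably commit to
slo = worst_month - 0.05                 # headroom for variance
print(f"Proposed SLO: {slo:.2f}%")       # Proposed SLO: 99.83%
```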
### Translating percentages to downtime
| Uptime % | Annual Downtime | Monthly Downtime | Weekly Downtime |
|----------|----------------|-----------------|----------------|
| 99.0% | 87.6 hours | 7.3 hours | 1.68 hours |
| 99.5% | 43.8 hours | 3.65 hours | 50 minutes |
| 99.9% | 8.76 hours | 43.8 minutes | 10 minutes |
| 99.95% | 4.38 hours | 21.9 minutes | 5 minutes |
| 99.99% | 52.6 minutes | 4.38 minutes | 1 minute |
| 99.999% | 5.26 minutes | 26 seconds | 6 seconds |
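The table values come from one formula: allowed downtime = (1 − uptime) × window length. A quick converter (using a 30-day month, so monthly figures differ slightly from the table's average-month values):

```python
# Convert an uptime percentage into allowed downtime for a window
def downtime_minutes(uptime_pct: float, window_hours: float) -> float:
    return (1 - uptime_pct / 100) * window_hours * 60

# 30-day month (720 hours):
print(round(downtime_minutes(99.9, 720), 1))   # 43.2 minutes
print(round(downtime_minutes(99.99, 720), 2))  # 4.32 minutes
```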
**The 99.99% reality check:** You have 4.38 minutes of downtime budget per month. One slow deploy, one network blip, one bad config push — you're done. Most teams should not commit to four nines without serious investment in redundancy and runbooks.
---
## Error Budgets: SLOs Made Actionable
The SRE practice that makes SLOs actually useful.
**Your error budget = (1 - SLO) × time window**
If your SLO is 99.9% availability for 30 days:
```
error_budget = (1 - 0.999) × (30 × 24 × 60 minutes) = 43.2 minutes
```
You have **43.2 minutes** of downtime per month before you breach your SLO.
**How to use it:**
| Budget Remaining | Engineering Policy |
|-----------------|-------------------|
| > 50% | Deploy freely, run experiments, ship features |
| 25–50% | Normal deployments, extra monitoring |
| 10–25% | No risky deploys, investigate instability |
| < 10% | Freeze changes, focus on reliability only |
| 0% | Incident review mandatory, reliability sprint before new features |
This turns "are we reliable enough?" from a subjective argument into an objective decision.
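The policy table above can be encoded so the decision really is objective. A sketch (the tier boundaries are the ones from the table; the function name is illustrative):

```python
# Map remaining error-budget fraction (0.0-1.0) to the policy tiers above
def deploy_policy(budget_remaining: float) -> str:
    if budget_remaining > 0.50:
        return "deploy freely, run experiments, ship features"
    if budget_remaining > 0.25:
        return "normal deployments, extra monitoring"
    if budget_remaining > 0.10:
        return "no risky deploys, investigate instability"
    if budget_remaining > 0.0:
        return "freeze changes, focus on reliability only"
    return "incident review mandatory, reliability sprint"

print(deploy_policy(0.60))  # deploy freely, run experiments, ship features
print(deploy_policy(0.05))  # freeze changes, focus on reliability only
```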
---
## SLA: The External Commitment
An SLA is a business contract. Breaking it has financial and legal consequences. Rules for SLAs:
1. **SLA < SLO** — your SLA should be easier to hit than your internal SLO. Example: SLO = 99.9%, SLA = 99.5%.
2. **Define the measurement period explicitly** — "per calendar month" vs "rolling 30 days" matters
3. **Specify what counts as downtime** — planned maintenance? partial outages? degraded performance?
4. **Define the remedy** — service credits are standard; refunds are rare; termination clauses exist in enterprise contracts
### Typical SLA tiers by product type
| Product Type | Standard SLA | Premium SLA | Enterprise SLA |
|-------------|-------------|-------------|----------------|
| SaaS (free) | No SLA | — | — |
| SaaS (paid) | 99.5% | 99.9% | 99.95% |
| Cloud Infrastructure | 99.9% | 99.95% | 99.99% |
| Financial/Healthcare | 99.95% | 99.99% | Custom |
---
## The Real-World SLI/SLO/SLA Stack: An Example
**Service:** Customer-facing API for a B2B SaaS product
**SLIs defined:**
- Request success rate (HTTP 2xx/3xx ÷ total requests)
- p99 latency (99th percentile response time)
- Error rate (HTTP 5xx ÷ total requests)
**SLOs (internal targets, measured rolling 30 days):**
- Availability: 99.95% (21.6 minutes/month budget over 30 days)
- p99 latency: < 500ms (i.e., 99% of requests complete in under 500ms)
- Error rate: < 0.05%
**SLA (customer contract):**
- Uptime: 99.9% per calendar month
- Measurement: Excludes planned maintenance windows (announced 72h in advance)
- Remedy: 10% service credit for each 0.1% below SLA; cap at 30%
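The credit schedule in that SLA is easy to get wrong at the edges, so it helps to write it down as code. A sketch of one reasonable reading (10% credit per full or partial 0.1% of shortfall, capped at 30%; your contract's exact rounding rules may differ):

```python
import math

# Service credit: 10% per 0.1% below a 99.9% SLA, capped at 30%
def service_credit_pct(measured_uptime: float, sla: float = 99.9) -> float:
    if measured_uptime >= sla:
        return 0.0
    shortfall = sla - measured_uptime
    # round before ceil to avoid float noise pushing us into the next tier
    tenths = math.ceil(round(shortfall / 0.1, 6))
    return min(tenths * 10, 30)

print(service_credit_pct(99.85))  # 10
print(service_credit_pct(99.5))   # 30 (capped)
```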
**Monitoring setup:**
- 60-second health checks from 8 global locations (ezmon.com)
- PagerDuty alert when error rate > 1% for 3+ consecutive minutes
- Weekly error budget report to engineering leads
- Monthly SLA compliance report to customers
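The "error rate > 1% for 3+ consecutive minutes" alert rule above amounts to a streak check over per-minute samples. A minimal sketch (the function name and sample data are illustrative, not a real PagerDuty integration):

```python
# Fire an alert when error rate exceeds the threshold for N consecutive minutes
def should_alert(per_minute_error_rates: list, threshold: float = 1.0,
                 minutes: int = 3) -> bool:
    streak = 0
    for rate in per_minute_error_rates:
        streak = streak + 1 if rate > threshold else 0
        if streak >= minutes:
            return True
    return False

print(should_alert([0.2, 1.5, 2.0, 0.3, 1.1]))  # False (streak broken)
print(should_alert([0.2, 1.5, 2.0, 1.2, 0.1]))  # True (3 consecutive minutes)
```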
---
## Common Mistakes That Get Teams Paged at 3am
**Mistake 1: Setting SLOs without measuring first**
You pick 99.99% because it sounds professional. Your actual baseline is 99.7%. You've committed to an SLO you'll breach constantly.
**Mistake 2: Measuring availability from one location**
Your health check server is in us-east-1. Your users are in APAC. You're measuring your own infra, not user experience.
**Mistake 3: Counting successful health check pings as uptime**
A health check returning 200 proves the server responds. It doesn't prove your app works. Use real user transaction monitoring.
**Mistake 4: Ignoring latency SLOs**
A page that loads in 8 seconds isn't "down" — but users experience it as broken. Include p95/p99 latency in your SLIs.
**Mistake 5: No error budget policy**
You set an SLO. You track it on a dashboard. Nothing changes based on it. Without a written error budget policy, the SLO is just a number.
---
## Getting Started Today
1. **Pick one SLI** — start with request success rate. It's the most meaningful and easiest to measure.
2. **Measure for 30 days** — don't set a target until you know your baseline.
3. **Set a conservative SLO** — current_average minus 0.5 percentage points.
4. **Write an error budget policy** — one paragraph: what happens when budget is at 50%, 10%, 0%.
5. **Monitor from outside your infrastructure** — use ezmon.com or similar to get an objective measurement your own metrics can't game.
The goal isn't perfect uptime. It's calibrated commitments and fast, clear decisions when things go wrong.
---
*ezmon.com monitors 500+ companies from 12 global probe locations. [See current uptime status →](/)*
*Tags: sla vs slo vs sli, what is slo, sre reliability targets, how to set uptime sla, error budget, service level objectives*