The SLI/SLO/SLA Hierarchy
Three terms are constantly confused, but they describe a clear hierarchy:
Service Level Indicator (SLI) — a quantitative measurement of service behavior. The raw number:
- HTTP request success rate (non-5xx responses / total requests)
- Request latency at p99 (99th percentile response time)
- Availability (minutes the service responded to probes / total minutes)
Service Level Objective (SLO) — a target for an SLI over a time window. The internal commitment:
- success_rate >= 99.9%, measured over 30 days
- p99_latency < 500ms, measured over 7 days
Service Level Agreement (SLA) — a contractual commitment between you and your customers. SLAs typically have financial penalties for breach. SLAs should be looser than SLOs to provide a buffer:
- SLO: 99.9% (internal target)
- SLA: 99.5% (contractual commitment — breached only if SLO is badly missed)
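The buffer relationship can be made concrete in code. This is an illustrative sketch (the `ServiceLevel` class and its thresholds are hypothetical, not a standard API): a measured SLI is checked against the SLO first, and only a much worse value touches the SLA.

```python
from dataclasses import dataclass

@dataclass
class ServiceLevel:
    sli_value: float    # measured SLI over the window, e.g. success rate
    slo_target: float   # internal objective, e.g. 0.999
    sla_target: float   # contractual commitment, e.g. 0.995 (looser)

    def status(self) -> str:
        if self.sli_value < self.sla_target:
            return "SLA breached"   # contractual penalty territory
        if self.sli_value < self.slo_target:
            return "SLO missed"     # internal alarm; SLA buffer still holding
        return "healthy"

print(ServiceLevel(0.9992, 0.999, 0.995).status())  # healthy
print(ServiceLevel(0.9971, 0.999, 0.995).status())  # SLO missed
print(ServiceLevel(0.9900, 0.999, 0.995).status())  # SLA breached
```

The gap between 0.999 and 0.995 is exactly the buffer described above: you can miss the SLO, investigate, and recover before any contractual penalty applies.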
Choosing Meaningful SLIs
The best SLIs are things users directly experience:
User experience → SLI to measure
────────────────────────────────
Page loads successfully → HTTP success rate (non-5xx / total)
Page loads quickly → p99 latency of HTML responses
Search returns results → search API success rate + p95 latency
Payment goes through → payment endpoint success rate
Avoid SLIs that do not correlate with user experience: CPU usage, memory consumption, and disk I/O are internal signals that may or may not indicate user-visible problems.
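The first two SLIs in the table above can be computed directly from request records. A minimal sketch, assuming each request is a `(http_status, latency_ms)` pair (the record shape and function name are illustrative):

```python
import math

def compute_slis(requests: list[tuple[int, float]]) -> dict:
    """Compute success rate and p99 latency from (status, latency_ms) pairs."""
    total = len(requests)
    # Success = any non-5xx response, matching the table above
    successes = sum(1 for status, _ in requests if status < 500)
    latencies = sorted(lat for _, lat in requests)
    # Nearest-rank p99: the latency 99% of requests were at or below
    p99 = latencies[math.ceil(0.99 * total) - 1]
    return {"success_rate": successes / total, "p99_latency_ms": p99}

sample = [(200, 50.0)] * 98 + [(500, 900.0), (200, 800.0)]
print(compute_slis(sample))  # success_rate 0.99, p99 800.0 ms
```

In production these numbers come from your metrics pipeline rather than in-memory lists, but the definitions are the same.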
Calculating Error Budgets
The error budget is the flip side of an SLO: the amount of 'bad' behavior you are allowed before breaching your objective.
Error budget = 1 - SLO target
For common SLO targets over a 30-day window:
| SLO | Error budget | Monthly downtime equivalent |
|---|---|---|
| 99% | 1% | 7.2 hours |
| 99.5% | 0.5% | 3.6 hours |
| 99.9% | 0.1% | 43.2 minutes |
| 99.95% | 0.05% | 21.6 minutes |
| 99.99% | 0.01% | 4.32 minutes |
99.99% availability sounds impressive, but it means your entire error budget for the month is 4.32 minutes. A single 5-minute deployment that does not drain connections properly burns through the budget completely.
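The table values follow directly from the formula `error budget = 1 - SLO target` applied to a 30-day window. A quick sketch to reproduce them (the function name is illustrative):

```python
def downtime_budget(slo: float, window_days: int = 30) -> float:
    """Minutes of total downtime allowed per window for a given SLO target."""
    return (1 - slo) * window_days * 24 * 60

for slo in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    print(f"{slo:.2%} -> {downtime_budget(slo):.1f} min/month")
```

Note this assumes the budget is spent as total downtime; in practice a partial outage (say, 10% of requests failing) drains the budget ten times more slowly.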
Error Budget Consumption Rate
Track how fast you are burning the budget:
```python
def calculate_burn_rate(
    error_rate_1h: float,  # error rate over the last 1 hour
    slo_target: float,     # e.g., 0.999 for 99.9%
) -> float:
    """
    Burn rate of 1.0 = consuming budget at exactly the rate that will
    exhaust it over the SLO window.
    Burn rate of 2.0 = will exhaust budget in half the window.
    """
    error_budget = 1 - slo_target  # e.g., 0.001
    return error_rate_1h / error_budget

# Example: 0.3% error rate, SLO is 99.9%
burn_rate = calculate_burn_rate(0.003, 0.999)  # => 3.0
# Burning 3x the sustainable rate — will exhaust budget in 10 days, not 30
```
Google's SRE book recommends alerting at specific burn rates:
- Burn rate > 14.4 → page immediately (budget exhausted in ~2 days)
- Burn rate > 6 → page (exhausted in 5 days)
- Burn rate > 3 → ticket (exhausted in 10 days)
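These thresholds map naturally to a severity function. A sketch using the values quoted above (the function name and return strings are illustrative):

```python
def alert_severity(burn_rate: float) -> str:
    """Map a measured burn rate to an alerting action."""
    if burn_rate > 14.4:
        return "page"    # budget gone in ~2 days at this rate
    if burn_rate > 6:
        return "page"    # budget gone in 5 days
    if burn_rate > 3:
        return "ticket"  # budget gone in 10 days
    return "none"        # sustainable, no action

print(alert_severity(15.0))  # page
print(alert_severity(4.0))   # ticket
```

In practice each threshold is evaluated over a different lookback window (short windows for paging, long windows for tickets) so that brief spikes do not page and slow leaks are not missed.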
Error Budget Policy
The error budget only has power if it drives decisions. Define a policy before you are in a crisis:
Error Budget Status → Policy Action
═══════════════════════════════════════
Budget healthy (>50%) → Normal development velocity, experiments allowed
Budget at risk (20-50%)→ Reliability work gets priority in sprint planning
Budget low (<20%) → Feature freeze; all engineers focus on reliability
Budget exhausted → No production changes; incident retrospective required
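The policy table can be encoded directly, which is useful for surfacing the current action in a dashboard or CI gate. A sketch mirroring the thresholds above (the function name is illustrative):

```python
def policy_action(budget_remaining: float) -> str:
    """Map the fraction of error budget left (0.0-1.0) to a policy action."""
    if budget_remaining <= 0:
        return "No production changes; incident retrospective required"
    if budget_remaining < 0.20:
        return "Feature freeze; all engineers focus on reliability"
    if budget_remaining < 0.50:
        return "Reliability work gets priority in sprint planning"
    return "Normal development velocity, experiments allowed"

print(policy_action(0.75))  # normal velocity
print(policy_action(0.10))  # feature freeze
```

Wiring this into a deploy pipeline (block merges when the action is a freeze) is what turns the policy from a document into an enforced agreement.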
This policy makes the reliability-vs-feature-velocity trade-off explicit and data-driven. Instead of 'we can't deploy because reliability is bad' (subjective), you have 'we cannot deploy because the error budget is exhausted' (objective).
Measuring Error Rates in Practice
HTTP Success Rate
```python
# Prometheus query for HTTP success rate
# success_rate = 1 - (5xx_rate / total_rate)
success_rate_query = """
1 - (
  sum(rate(http_requests_total{status=~'5..'}[5m]))
  /
  sum(rate(http_requests_total[5m]))
)
"""

# Fraction of the monthly error budget consumed (SLO 99.9% → budget 0.001).
# Note the outer parentheses: the error rate must be computed first,
# then divided by the budget.
budget_consumed_query = """
(
  1 - (
    sum(increase(http_requests_total{status!~'5..'}[30d]))
    /
    sum(increase(http_requests_total[30d]))
  )
)
/ 0.001 # divide by error budget (1 - 0.999)
"""
```
gRPC Non-OK Rate
For gRPC services, success means a response with status OK (code 0). All other codes (UNAVAILABLE, INTERNAL, DEADLINE_EXCEEDED, etc.) count as errors:
```
# gRPC success rate: OK responses / all handled responses
sum(rate(grpc_server_handled_total{grpc_code='OK'}[5m]))
  /
sum(rate(grpc_server_handled_total[5m]))
```
Latency SLOs
Latency SLOs use percentiles. Targeting p50 (the median) hides the tail entirely: half of all requests are slower than the median, and the slowest few percent are where real users are actually frustrated. p99 captures the worst 1% of requests:
```
# 99th percentile latency over 5-minute window
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
```
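It helps to see what `histogram_quantile` actually computes: Prometheus histograms store cumulative bucket counts, and the quantile is linearly interpolated within the bucket that crosses the target rank. A rough Python equivalent (simplified: it omits Prometheus edge cases such as the `+Inf` bucket):

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """buckets: (upper_bound_seconds, cumulative_count), sorted ascending."""
    total = buckets[-1][1]
    rank = q * total                      # target cumulative count
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # Linearly interpolate within the bucket that crosses the rank
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + fraction * (bound - prev_bound)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 1000 requests: 900 under 0.1s, 990 under 0.5s, all under 1s
print(histogram_quantile(0.99, [(0.1, 900), (0.5, 990), (1.0, 1000)]))  # ≈ 0.5
```

The interpolation is why bucket boundaries matter: the estimate can only be as precise as the bucket containing the quantile, so choose bucket bounds near your SLO threshold.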
Cultural Impact
Error budgets shift team incentives in a healthy direction:
Development teams are incentivized to ship reliably, not just fast. Burning the error budget with careless deployments has a concrete cost: the team loses the ability to ship features until the budget recovers.
Operations teams cannot demand unrealistic reliability targets. An SLO of 99.99% that costs 40 engineering-hours per month to maintain may not be worth the trade-off over 99.9% that costs 4 hours.
Blameless postmortems become easier: the question shifts from 'who caused the outage?' to 'how much budget did this incident consume, and what systemic changes prevent recurrence?'
Start conservatively. A 99.5% SLO for a new service gives you 3.6 hours of monthly error budget — enough to learn what actually fails in production before committing to tighter targets. Tighten the SLO as you invest in reliability, never the reverse.