Error Handling Patterns

Error Budgets and SLOs: When to Accept Errors

How to define Service Level Objectives, calculate error budgets, and make data-driven decisions about reliability vs feature velocity.

The SLI/SLO/SLA Hierarchy

Three terms are constantly confused, but they describe a clear hierarchy:

Service Level Indicator (SLI) — a quantitative measurement of service behavior. The raw number:

  • HTTP request success rate (non-5xx responses / total requests)
  • Request latency at p99 (99th percentile response time)
  • Availability (minutes the service responded to probes / total minutes)

Service Level Objective (SLO) — a target for an SLI over a time window. The internal commitment:

  • success_rate >= 99.9% measured over 30 days
  • p99_latency < 500ms measured over 7 days

Service Level Agreement (SLA) — a contractual commitment between you and your customers. SLAs typically have financial penalties for breach. SLAs should be looser than SLOs to provide a buffer:

  • SLO: 99.9% (internal target)
  • SLA: 99.5% (contractual commitment — breached only if SLO is badly missed)
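The SLO-to-SLA buffer can be checked mechanically. A minimal sketch, assuming the 99.9%/99.5% targets from the example above (the function name and status labels are ours):

```python
def reliability_status(measured: float, slo: float = 0.999, sla: float = 0.995) -> str:
    """Classify a measured success rate against the internal SLO
    and the contractual SLA (illustrative defaults from the example above)."""
    if measured >= slo:
        return "healthy"       # meeting the internal target
    if measured >= sla:
        return "slo_missed"    # burning the buffer, but the contract is intact
    return "sla_breached"      # contractual commitment broken

print(reliability_status(0.9995))  # -> healthy
print(reliability_status(0.997))   # -> slo_missed
print(reliability_status(0.993))   # -> sla_breached
```

The middle state is the whole point of the buffer: it gives you room to notice and fix an SLO miss before money changes hands.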

Choosing Meaningful SLIs

The best SLIs are things users directly experience:

User experience → SLI to measure
────────────────────────────────
Page loads successfully → HTTP success rate (non-5xx / total)
Page loads quickly      → p99 latency of HTML responses
Search returns results  → search API success rate + p95 latency
Payment goes through    → payment endpoint success rate

Avoid SLIs that do not correlate with user experience: CPU usage, memory consumption, and disk I/O are internal signals that may or may not indicate user-visible problems.
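The SLIs in the table above reduce to simple computations over request records. A sketch with a hypothetical record shape (field names are ours, not from any particular library):

```python
import math
from dataclasses import dataclass

@dataclass
class Request:            # hypothetical record shape for illustration
    status: int           # HTTP status code
    duration_ms: float

def success_rate(requests: list[Request]) -> float:
    """SLI: non-5xx responses / total requests."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)

def latency_percentile(requests: list[Request], p: float) -> float:
    """SLI: p-th percentile latency (nearest-rank method)."""
    durations = sorted(r.duration_ms for r in requests)
    rank = max(0, math.ceil(p * len(durations)) - 1)
    return durations[rank]

reqs = [Request(200, 40.0), Request(200, 55.0), Request(500, 300.0), Request(200, 45.0)]
print(success_rate(reqs))              # -> 0.75
print(latency_percentile(reqs, 0.99))  # -> 300.0
```

In production these come from your metrics pipeline rather than in-memory lists, but the definitions are the same.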

Calculating Error Budgets

The error budget is the flip side of an SLO: the amount of 'bad' behavior you are allowed before breaching your objective.

Error budget = 1 - SLO target

For common SLO targets over a 30-day window:

SLO       Error budget   Monthly downtime equivalent
────────────────────────────────────────────────────
99%       1%             7.2 hours
99.5%     0.5%           3.6 hours
99.9%     0.1%           43.2 minutes
99.95%    0.05%          21.6 minutes
99.99%    0.01%          4.32 minutes

99.99% availability sounds impressive, but it means your entire error budget for the month is 4.32 minutes. A single 5-minute deployment that does not drain connections properly burns through the budget completely.
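The downtime equivalents follow directly from Error budget = 1 - SLO target. A quick sketch to compute them for any target (for an exact 30-day window, 99.9% works out to 43.2 minutes):

```python
def downtime_budget_minutes(slo: float, window_days: float = 30) -> float:
    """Minutes of full downtime the error budget allows over the window."""
    return (1 - slo) * window_days * 24 * 60

for slo in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    print(f"{slo:.2%} -> {downtime_budget_minutes(slo):.2f} min")
```

Note that many published availability tables use an average month (365/12 ≈ 30.44 days), which yields slightly larger figures such as 4.38 minutes for 99.99%.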

Error Budget Consumption Rate

Track how fast you are burning the budget:

def calculate_burn_rate(
    error_rate_1h: float,  # Errors in last 1 hour
    slo_target: float,     # e.g., 0.999 for 99.9%
) -> float:
    """
    Burn rate of 1.0 = consuming budget at exactly the rate that will
    exhaust it over the SLO window.
    Burn rate of 2.0 = will exhaust budget in half the window.
    """
    error_budget = 1 - slo_target      # e.g., 0.001
    return error_rate_1h / error_budget

# Example: 0.3% error rate, SLO is 99.9%
burn_rate = calculate_burn_rate(0.003, 0.999)  # => 3.0
# Burning 3x the sustainable rate — will exhaust budget in 10 days, not 30

Google's Site Reliability Workbook recommends alerting at specific burn rates:

  • Burn rate > 14.4 → page immediately (budget exhausted in about 2 days)
  • Burn rate > 6 → page (exhausted in 5 days)
  • Burn rate > 3 → ticket (exhausted in 10 days)

Error Budget Policy

The error budget only has power if it drives decisions. Define a policy before you are in a crisis:

Error Budget Status     → Policy Action
═══════════════════════════════════════
Budget healthy (>50%)   → Normal development velocity, experiments allowed
Budget at risk (20-50%) → Reliability work gets priority in sprint planning
Budget low (<20%)       → Feature freeze; all engineers focus on reliability
Budget exhausted        → No production changes; incident retrospective required

This policy makes the reliability-versus-feature-velocity trade-off explicit and data-driven. Instead of 'we can't deploy because reliability is bad' (subjective), you have 'we cannot deploy because the error budget is exhausted' (objective).
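A deploy gate built on the policy table above might look like this sketch (thresholds and labels mirror the table; the function name is ours):

```python
def policy_action(budget_remaining: float) -> str:
    """Map the fraction of error budget remaining to the policy table above."""
    if budget_remaining > 0.5:
        return "normal velocity"
    if budget_remaining >= 0.2:
        return "prioritize reliability work"
    if budget_remaining > 0.0:
        return "feature freeze"
    return "no production changes"

print(policy_action(0.8))  # -> normal velocity
print(policy_action(0.3))  # -> prioritize reliability work
print(policy_action(0.1))  # -> feature freeze
print(policy_action(0.0))  # -> no production changes
```

Wiring a check like this into CI is what turns the policy from a document into an enforced agreement.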

Measuring Error Rates in Practice

HTTP Success Rate

# Prometheus query for HTTP success rate
# success_rate = 1 - (5xx_rate / total_rate)
success_rate_query = """
    1 - (
      sum(rate(http_requests_total{status=~'5..'}[5m]))
      /
      sum(rate(http_requests_total[5m]))
    )
"""

# Fraction of the error budget consumed this month
# (outer parentheses matter: in PromQL, / binds tighter than -)
budget_consumed_query = """
    (
      1 - (
        sum(increase(http_requests_total{status!~'5..'}[30d]))
        /
        sum(increase(http_requests_total[30d]))
      )
    )
    / 0.001  # divide by the error budget (1 - 0.999)
"""

gRPC Non-OK Rate

For gRPC services, success means a response with status OK (code 0). All other codes (UNAVAILABLE, INTERNAL, DEADLINE_EXCEEDED, etc.) count as errors:

grpc_success_rate = sum(rate(grpc_server_handled_total{grpc_code='OK'}[5m]))
                    /
                    sum(rate(grpc_server_handled_total[5m]))

Note the rate() wrapper: dividing the raw counters would give the success ratio over the server's entire lifetime, not its current behavior.

Latency SLOs

Latency SLOs use percentiles. Targeting p50 (the median) describes only the typical request and hides the tail entirely. p99 captures the slowest 1% of requests, which is where users actually feel degradation:

# 99th percentile latency over 5-minute window
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
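Under the hood, histogram_quantile estimates the quantile by interpolating within cumulative buckets. A simplified Python sketch of that idea, ignoring Prometheus specifics such as the +Inf bucket and native histograms:

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Approximate the q-th quantile from cumulative histogram buckets.

    buckets: (upper_bound_seconds, cumulative_count), sorted by bound.
    Linearly interpolates inside the bucket containing the target rank,
    roughly as Prometheus does for classic histograms.
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]

# 100 requests: 50 under 100ms, 90 under 500ms, all under 1s
print(histogram_quantile(0.99, [(0.1, 50), (0.5, 90), (1.0, 100)]))  # p99 ≈ 0.95s
```

This is also why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile lands in.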

Cultural Impact

Error budgets shift team incentives in a healthy direction:

Development teams are incentivized to ship reliably, not just fast. Burning the error budget with careless deployments has a concrete cost: the team loses the ability to ship features until the budget recovers.

Operations teams cannot demand unrealistic reliability targets. An SLO of 99.99% that costs 40 engineering-hours per month to maintain may not be worth the trade-off over 99.9% that costs 4 hours.

Blameless postmortems become easier: the question shifts from 'who caused the outage?' to 'how much budget did this incident consume, and what systemic changes prevent recurrence?'

Start conservatively. A 99.5% SLO for a new service gives you 3.6 hours of monthly error budget — enough to learn what actually fails in production before committing to tighter targets. Tighten the SLO as you invest in reliability, never the reverse.
