Error Handling Patterns

Error Budgets and SLOs: When to Accept Errors

How to define Service Level Objectives, calculate error budgets, and make data-driven decisions about reliability vs feature velocity.

The SLI/SLO/SLA Hierarchy

Three terms are constantly confused, but they describe a clear hierarchy:

Service Level Indicator (SLI) — a quantitative measurement of service behavior. The raw number:

  • HTTP request success rate (non-5xx responses / total requests)
  • Request latency at p99 (99th percentile response time)
  • Availability (minutes the service responded to probes / total minutes)

Service Level Objective (SLO) — a target for an SLI over a time window. The internal commitment:

  • success_rate >= 99.9% measured over 30 days
  • p99_latency < 500ms measured over 7 days

Service Level Agreement (SLA) — a contractual commitment between you and your customers. SLAs typically have financial penalties for breach. SLAs should be looser than SLOs to provide a buffer:

  • SLO: 99.9% (internal target)
  • SLA: 99.5% (contractual commitment — breached only if SLO is badly missed)
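The hierarchy can be sketched in a few lines of Python. This is purely illustrative (the `ServiceLevel` class and its methods are hypothetical, not from any library): the SLI is the measured number, the SLO check uses the internal target, and the SLA check uses the looser contractual threshold.

```python
from dataclasses import dataclass

@dataclass
class ServiceLevel:
    slo_target: float  # internal objective, e.g. 0.999
    sla_target: float  # contractual commitment, e.g. 0.995

    def slo_met(self, sli: float) -> bool:
        # The SLI is the raw measurement; the SLO is the target it must meet
        return sli >= self.slo_target

    def sla_met(self, sli: float) -> bool:
        # The SLA is deliberately looser, so an SLO miss is an early warning
        return sli >= self.sla_target

levels = ServiceLevel(slo_target=0.999, sla_target=0.995)
sli = 0.997  # measured success rate over the window
levels.slo_met(sli)  # False -- internal target missed, reliability work needed
levels.sla_met(sli)  # True  -- contract intact, thanks to the buffer
```

The buffer between the two thresholds is what lets you react internally before a miss becomes a contractual breach.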

Choosing Meaningful SLIs

The best SLIs are things users directly experience:

User experience → SLI to measure
────────────────────────────────
Page loads successfully → HTTP success rate (non-5xx / total)
Page loads quickly      → p99 latency of HTML responses
Search returns results  → search API success rate + p95 latency
Payment goes through    → payment endpoint success rate

Avoid SLIs that do not correlate with user experience: CPU usage, memory consumption, and disk I/O are internal signals that may or may not indicate user-visible problems.

Calculating Error Budgets

The error budget is the flip side of an SLO: the amount of 'bad' behavior you are allowed before breaching your objective.

Error budget = 1 - SLO target

For common SLO targets over a 30-day window:

SLO      Error budget   Monthly downtime equivalent
───────────────────────────────────────────────────
99%      1%             7.2 hours
99.5%    0.5%           3.6 hours
99.9%    0.1%           43.2 minutes
99.95%   0.05%          21.6 minutes
99.99%   0.01%          4.32 minutes
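The table rows above follow from a one-line calculation: the downtime equivalent is the error budget multiplied by the window length. A minimal sketch, assuming a 30-day window:

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day window

def monthly_downtime_minutes(slo_target: float) -> float:
    # Error budget is the allowed failure fraction; scale it to minutes
    error_budget = 1 - slo_target
    return error_budget * MINUTES_PER_MONTH

for slo in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    print(f"{slo:.2%}: {monthly_downtime_minutes(slo):.1f} minutes")
# 99.9% -> 43.2 minutes; 99.99% -> 4.3 minutes
```

The same formula works for any window: replace `MINUTES_PER_MONTH` with the window length in whatever unit you want the answer in.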

99.99% availability sounds impressive, but it means your entire error budget for the month is 4.32 minutes. A single 5-minute deployment that does not drain connections properly burns through the budget completely.

Error Budget Consumption Rate

Track how fast you are burning the budget:

def calculate_burn_rate(
    error_rate_1h: float,  # Errors in last 1 hour
    slo_target: float,     # e.g., 0.999 for 99.9%
) -> float:
    """
    Burn rate of 1.0 = consuming budget at exactly the rate that will
    exhaust it over the SLO window.
    Burn rate of 2.0 = will exhaust budget in half the window.
    """
    error_budget = 1 - slo_target      # e.g., 0.001
    return error_rate_1h / error_budget

# Example: 0.3% error rate, SLO is 99.9%
burn_rate = calculate_burn_rate(0.003, 0.999)  # => 3.0
# Burning 3x the sustainable rate — will exhaust budget in 10 days, not 30

Google's SRE workbook recommends alerting at specific burn rates:

  • Burn rate > 14.4 → page immediately (budget exhausted in about 2 days)
  • Burn rate > 6 → page (exhausted in 5 days)
  • Burn rate > 3 → ticket (exhausted in 10 days)
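The tiers above can be sketched as a simple threshold function. The thresholds come from the text; the function name and return values are illustrative, and in production you would express these as Prometheus alerting rules (ideally with the multi-window refinement from the SRE workbook) rather than application code.

```python
from typing import Optional

def alert_for_burn_rate(burn_rate: float) -> Optional[str]:
    if burn_rate > 14.4:
        return "page"    # 30 days / 14.4 ~= 2 days until budget is gone
    if burn_rate > 6:
        return "page"    # 30 / 6 = 5 days until budget is gone
    if burn_rate > 3:
        return "ticket"  # 30 / 3 = 10 days -- urgent, but not a wake-up
    return None          # sustainable consumption; no action

alert_for_burn_rate(15.0)  # 'page'
alert_for_burn_rate(4.0)   # 'ticket'
alert_for_burn_rate(0.8)   # None
```

Checking each threshold against a different lookback window (e.g. 14.4 over 1 hour, 6 over 6 hours) keeps fast burns from hiding inside long averages while avoiding pages for brief blips.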

Error Budget Policy

The error budget only has power if it drives decisions. Define a policy before you are in a crisis:

Error Budget Status     → Policy Action
═══════════════════════════════════════
Budget healthy (>50%)   → Normal development velocity, experiments allowed
Budget at risk (20-50%) → Reliability work gets priority in sprint planning
Budget low (<20%)       → Feature freeze; all engineers focus on reliability
Budget exhausted        → No production changes; incident retrospective required
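As a sketch, the policy table maps directly onto a threshold function (the thresholds are from the table; the function name and return strings are illustrative):

```python
def policy_action(budget_remaining: float) -> str:
    """budget_remaining: fraction of the error budget left, 0.0-1.0."""
    if budget_remaining <= 0:
        return "no production changes; incident retrospective required"
    if budget_remaining < 0.20:
        return "feature freeze; focus on reliability"
    if budget_remaining <= 0.50:
        return "prioritize reliability work in sprint planning"
    return "normal velocity; experiments allowed"

policy_action(0.65)  # normal velocity
policy_action(0.30)  # prioritize reliability work
policy_action(0.10)  # feature freeze
```

The value of encoding the policy is not the code itself but the pre-commitment: the thresholds are agreed on in calm times, so no one has to argue about them mid-incident.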

This policy makes the reliability-vs-feature-velocity trade-off explicit and data-driven. Instead of 'we can't deploy because reliability is bad' (subjective), you have 'we cannot deploy because the error budget is exhausted' (objective).

Measuring Error Rates in Practice

HTTP Success Rate

# Prometheus query for HTTP success rate
# success_rate = 1 - (5xx_rate / total_rate)
success_rate_query = """
    1 - (
      sum(rate(http_requests_total{status=~'5..'}[5m]))
      /
      sum(rate(http_requests_total[5m]))
    )
"""

# Error budget consumed this month
# (error rate over the window divided by the error budget;
#  the outer parentheses matter: '/' binds tighter than '-' in PromQL)
budget_consumed_query = """
    (
      1 - (
        sum(increase(http_requests_total{status!~'5..'}[30d]))
        /
        sum(increase(http_requests_total[30d]))
      )
    )
    / 0.001  # divide by error budget (1 - 0.999)
"""

gRPC Non-OK Rate

For gRPC services, success means a response with status OK (code 0). All other codes (UNAVAILABLE, INTERNAL, DEADLINE_EXCEEDED, etc.) count as errors:

grpc_success_rate = sum(rate(grpc_server_handled_total{grpc_code='OK'}[5m])) /
                    sum(rate(grpc_server_handled_total[5m]))

Latency SLOs

Latency SLOs use percentiles. Targeting p50 (the median) hides the tail entirely: a service can have a healthy median while its slowest requests time out. p99 bounds the experience of all but the worst 1% of requests, which is where users actually get frustrated:

# 99th percentile latency over 5-minute window
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
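To make the query above less magical, here is a simplified sketch of what histogram_quantile computes: Prometheus histograms expose cumulative bucket counts, and the quantile is found by linear interpolation inside the bucket containing the target rank. The bucket bounds and counts below are made-up example data, and real PromQL handles edge cases (the +Inf bucket, empty histograms) that this sketch omits.

```python
def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """buckets: sorted (upper_bound, cumulative_count) pairs."""
    total = buckets[-1][1]
    rank = q * total  # the rank of the request at quantile q
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Interpolate linearly within the bucket containing the rank
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 1000 requests: 900 finished under 0.1s, 960 under 0.25s, 990 under 0.5s
buckets = [(0.1, 900), (0.25, 960), (0.5, 990), (1.0, 1000)]
histogram_quantile(0.99, buckets)  # 0.5 -- the 990th request sits at the 0.5s bound
```

One practical consequence: the reported quantile can never be more precise than your bucket layout, so bucket boundaries should cluster around your SLO threshold.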

Cultural Impact

Error budgets shift team incentives in a healthy direction:

Development teams are incentivized to ship reliably, not just fast. Burning the error budget with careless deployments has a concrete cost: the team loses the ability to ship features until the budget recovers.

Operations teams cannot demand unrealistic reliability targets. An SLO of 99.99% that costs 40 engineering-hours per month to maintain may not be worth the trade-off over 99.9% that costs 4 hours.

Blameless postmortems become easier: the question shifts from 'who caused the outage?' to 'how much budget did this incident consume, and what systemic changes prevent recurrence?'

Start conservatively. A 99.5% SLO for a new service gives you 3.6 hours of monthly error budget — enough to learn what actually fails in production before committing to tighter targets. Tighten the SLO as you invest in reliability, never the reverse.
