Chaos Engineering Principles
Chaos engineering is the discipline of deliberately injecting failures into a system to discover weaknesses before they manifest as production incidents. Netflix popularized the practice with its Chaos Monkey tool and later coined the term, but the principles apply to any distributed system.
The core methodology:
- Define steady state: Measure what normal looks like — requests/sec, error rate, p99 latency
- Hypothesize: 'If service X receives 503s from service Y, our retry logic will handle it'
- Inject fault: Make service Y return 503s for some percentage of requests
- Observe: Does the system maintain steady state? Do alerts fire? Does the user experience degrade?
- Learn: Fix weaknesses, document findings, increase blast radius next time
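The inject/observe/rollback loop above can be sketched in a few lines of Python. Everything here is a hypothetical harness, not a real tool: the callables, the metric dict shape, and the 95% threshold are placeholders you would wire to your own metrics system.

```python
import time

def run_experiment(get_metrics, inject, rollback, duration_s=600,
                   min_success=0.95, poll_every_s=10):
    """Minimal chaos-experiment loop: inject a fault, watch steady state,
    abort on breach, and always roll back."""
    inject()
    try:
        deadline = time.time() + duration_s
        while time.time() < deadline:
            m = get_metrics()  # e.g. {'success_rate': 0.997, 'p99_s': 1.2}
            if m['success_rate'] < min_success:
                return 'aborted: steady state breached'
            time.sleep(poll_every_s)
        return 'completed: hypothesis held'
    finally:
        rollback()  # remove the fault even when aborting or on error
```

The `finally` block is the kill switch: whatever happens, the fault is removed.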
Minimize Blast Radius
Start small. Inject faults in a single data center, for a small percentage of requests, or for a short duration. Have a kill switch ready. Only expand scope once you have confidence in your observability and rollback mechanisms.
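One way to make the small blast radius explicit in code is a gate that combines an allowlisted scope with a sampling fraction. This is an illustrative sketch only; the region names and the 1% default are made up.

```python
import random

def in_blast_radius(request_region, target_region='us-east-1',
                    fraction=0.01, rng=random):
    """True only for requests in the targeted region, sampled at `fraction`.

    Faults should be applied only when this returns True, so the scope
    and percentage of the experiment are explicit and easy to audit.
    """
    return request_region == target_region and rng.random() < fraction
```

Passing `rng` explicitly also makes the gate deterministic in tests.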
Toxiproxy
Toxiproxy is a TCP proxy that sits between your service and its dependencies. It intercepts traffic and can inject latency, limit bandwidth, drop connections, or close connections abruptly.
# Install
docker run -d -p 8474:8474 -p 5433:5433 shopify/toxiproxy
# Create a proxy for your database
toxiproxy-cli create --listen 0.0.0.0:5433 --upstream postgres:5432 postgres-proxy
# Add a latency toxic (500ms +/- 100ms jitter)
toxiproxy-cli toxic add -t latency -a latency=500 -a jitter=100 postgres-proxy
# Add a timeout toxic (stop data and close the connection after 1 second)
toxiproxy-cli toxic add -t timeout -a timeout=1000 postgres-proxy
# Limit bandwidth to 10KB/s
toxiproxy-cli toxic add -t bandwidth -a rate=10 postgres-proxy
Using Toxiproxy in Tests
import requests
import pytest
TOXIPROXY_API = 'http://localhost:8474'
@pytest.fixture
def slow_database():
    # Add a 2-second latency toxic via the Toxiproxy HTTP API
    requests.post(f'{TOXIPROXY_API}/proxies/postgres-proxy/toxics', json={
        'name': 'slow_db', 'type': 'latency',
        'attributes': {'latency': 2000, 'jitter': 0}
    })
    yield
    # Remove the toxic after the test
    requests.delete(f'{TOXIPROXY_API}/proxies/postgres-proxy/toxics/slow_db')

def test_api_returns_503_when_db_is_slow(slow_database):
    # With a 2-second DB delay and a 1.5-second client timeout, either a
    # fast 503/504 from the server or a client-side timeout is acceptable
    try:
        response = requests.get('http://localhost:8000/api/users', timeout=1.5)
    except requests.exceptions.Timeout:
        return  # client timed out before the server responded: acceptable
    assert response.status_code in (503, 504)
Envoy Fault Injection
If you use Envoy as a sidecar proxy (common in Kubernetes/Istio setups), you can inject faults at the HTTP layer using Envoy's fault filter:
# Envoy config: inject 503 for 10% of /payment requests
http_filters:
- name: envoy.filters.http.fault
  typed_config:
    '@type': type.googleapis.com/envoy.extensions.filters.http.fault.v3.HTTPFault
    abort:
      http_status: 503
      percentage:
        numerator: 10
        denominator: HUNDRED
    headers:
    - name: ':path'
      string_match:
        prefix: '/payment'
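To verify that the filter injects roughly the configured percentage, you can sample the endpoint and count the aborts. This sketch uses only the standard library; the URL and sample size are yours to supply, and the observed fraction will only approximate the configured one.

```python
import urllib.request
import urllib.error

def sample_statuses(url, n=200):
    """Hit the endpoint n times and record each HTTP status code."""
    statuses = []
    for _ in range(n):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                statuses.append(resp.status)
        except urllib.error.HTTPError as e:
            statuses.append(e.code)  # urllib raises on 4xx/5xx responses
    return statuses

def abort_fraction(statuses, expected=503):
    """Fraction of sampled responses that were injected aborts."""
    return sum(1 for s in statuses if s == expected) / len(statuses)
```

With `numerator: 10` above, `abort_fraction(sample_statuses(url))` should hover around 0.10.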
Istio Fault Injection
With Istio, fault injection is even simpler via VirtualService resources:
# Inject 503 for 20% of requests to payment-service
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payment-service-chaos
spec:
  hosts:
  - payment-service
  http:
  - fault:
      abort:
        httpStatus: 503
        percentage:
          value: 20
    route:
    - destination:
        host: payment-service
  # Alternative fault: a 5-second delay for 30% of requests. Routes match
  # in order, so use this in place of the abort rule above (or add `match`
  # conditions); listed after it, it would never fire.
  - fault:
      delay:
        percentage:
          value: 30
        fixedDelay: 5s
    route:
    - destination:
        host: payment-service
Chaos Monkey and Instance-Level Chaos
Chaos Monkey operates at the infrastructure level, randomly terminating EC2 instances or Kubernetes pods. This tests your service's ability to survive the sudden loss of a node.
# Chaos Monkey for Kubernetes (kube-monkey)
# Deploy to your cluster:
kubectl apply -f kube-monkey/deployment.yaml
# Mark a deployment as a chaos target by labeling it
# (kube-monkey also expects these labels on the pod template)
kubectl label deployment payment-api kube-monkey/enabled=enabled
kubectl label deployment payment-api kube-monkey/identifier=payment-api
kubectl label deployment payment-api kube-monkey/mtbf=2  # mean time between failures (days)
kubectl label deployment payment-api kube-monkey/kill-mode=kill-all
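A minimal DIY pod-killer in the same spirit can be a few lines of Python around kubectl. This is an illustrative sketch, not a hardened tool: it assumes kubectl access to the cluster, and the namespace and label selector are yours to supply. Keeping victim selection separate from the delete makes the selection logic testable.

```python
import json
import random
import subprocess

def list_pods(namespace, selector):
    """Return pod names matching a label selector (shells out to kubectl)."""
    out = subprocess.run(
        ['kubectl', 'get', 'pods', '-n', namespace, '-l', selector, '-o', 'json'],
        capture_output=True, text=True, check=True).stdout
    return [item['metadata']['name'] for item in json.loads(out)['items']]

def choose_victim(pod_names, rng=random):
    """Pick one pod to terminate; raise if the selector matched nothing."""
    if not pod_names:
        raise ValueError('no pods match the selector')
    return rng.choice(pod_names)

def kill_random_pod(namespace, selector):
    """Delete one randomly chosen matching pod and return its name."""
    victim = choose_victim(list_pods(namespace, selector))
    subprocess.run(['kubectl', 'delete', 'pod', victim, '-n', namespace],
                   check=True)
    return victim
```

Run this against a deployment with more than one replica first, and only in an environment where losing a pod is acceptable.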
Experiment Design
A well-designed chaos experiment has clear structure:
## Experiment: Payment Service 503 Resilience
**Hypothesis**: When payment-service returns 503 for 20% of requests,
order-service retries with exponential backoff and the user success
rate stays above 95%.
**Steady State**:
- Order creation success rate: >99%
- p99 latency for POST /orders: <2s
- Error rate: <0.5%
**Method**: Inject 503 via Istio VirtualService for 20% of calls
from order-service to payment-service. Duration: 10 minutes.
**Blast Radius**: 20% of orders during the test window.
**Abort Conditions**: If success rate drops below 90% or p99 > 5s.
**Rollback**: `kubectl delete virtualservice payment-service-chaos`
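Abort conditions like those in the template translate directly into a check an experiment harness can poll. The thresholds below mirror the example above; the shape of the metrics dict is an assumption about your monitoring client.

```python
def should_abort(metrics, min_success_rate=0.90, max_p99_s=5.0):
    """Evaluate the experiment's abort conditions against current metrics.

    Returns True when the blast radius has exceeded what the experiment
    design allows and the fault should be rolled back immediately.
    """
    return (metrics['success_rate'] < min_success_rate
            or metrics['p99_s'] > max_p99_s)
```

Polling this every few seconds during the 10-minute window turns the written abort conditions into an automatic kill switch.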
Learning from Chaos
Run chaos experiments regularly and document findings:
- Unknown failure modes discovered: 'We found that the database connection pool exhausts silently after 30 seconds of latency, causing all subsequent requests to hang'
- Retry amplification: 'Our retry-on-503 strategy caused a 3x amplification in load to the payment service during a degraded period'
- Alert gaps: 'We had no alert for sustained 503 rates below 50%'
Each finding becomes a follow-up task: add circuit breakers, fix retry budgets, add missing alerts. Over time, your system genuinely becomes more resilient — not just hypothetically, but provably.
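One common fix for the retry-amplification finding above is a retry budget: retries are allowed only while they stay under a fixed fraction of recent requests. This is a simplified sketch of the idea; real implementations (for example, in several RPC frameworks) track the counts over a sliding time window rather than for the process lifetime.

```python
class RetryBudget:
    """Cap retries at a fraction of observed requests to prevent retry storms.

    With ratio=0.1, a fully degraded dependency sees at most ~1.1x the
    normal load, instead of (1 + max_retries) times it.
    """
    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        """Count every first attempt against the budget's denominator."""
        self.requests += 1

    def can_retry(self):
        """Consume one unit of budget if any remains."""
        if self.retries < self.ratio * self.requests:
            self.retries += 1
            return True
        return False
```

A circuit breaker addresses the same failure mode from the other direction, by stopping first attempts once the error rate is high enough.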