Chaos Engineering Principles
Chaos engineering is the discipline of deliberately injecting failures into a system to discover weaknesses before they manifest as production incidents. Netflix popularized the practice with its Chaos Monkey tool and later coined the term, but the principles apply to any distributed system.
The core methodology:
- Define steady state: Measure what normal looks like — requests/sec, error rate, p99 latency
- Hypothesize: 'If service X receives 503s from service Y, our retry logic will handle it'
- Inject fault: Make service Y return 503s for some percentage of requests
- Observe: Does the system maintain steady state? Do alerts fire? Does the user experience degrade?
- Learn: Fix weaknesses, document findings, increase blast radius next time
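The inject/observe/rollback loop above can be sketched in a few lines of Python. Everything here is a hypothetical harness, not a real tool: the callables, the metric dict shape, and the 95% threshold are placeholders you would wire to your own metrics system.

```python
import time

def run_experiment(get_metrics, inject, rollback, duration_s=600,
                   min_success=0.95, poll_every_s=10):
    """Minimal chaos-experiment loop: inject a fault, watch steady state,
    abort on breach, and always roll back."""
    inject()
    try:
        deadline = time.time() + duration_s
        while time.time() < deadline:
            m = get_metrics()  # e.g. {'success_rate': 0.997, 'p99_s': 1.2}
            if m['success_rate'] < min_success:
                return 'aborted: steady state breached'
            time.sleep(poll_every_s)
        return 'completed: hypothesis held'
    finally:
        rollback()  # remove the fault even when aborting or on error
```

The `finally` block is the kill switch: whatever happens, the fault is removed.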
Minimize Blast Radius
Start small. Inject faults in a single data center, for a small percentage of requests, or for a short duration. Have a kill switch ready. Only expand scope once you have confidence in your observability and rollback mechanisms.
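One way to make the small blast radius explicit in code is a gate that combines an allowlisted scope with a sampling fraction. This is an illustrative sketch only; the region names and the 1% default are made up.

```python
import random

def in_blast_radius(request_region, target_region='us-east-1',
                    fraction=0.01, rng=random):
    """True only for requests in the targeted region, sampled at `fraction`.

    Faults should be applied only when this returns True, so the scope
    and percentage of the experiment are explicit and easy to audit.
    """
    return request_region == target_region and rng.random() < fraction
```

Passing `rng` explicitly also makes the gate deterministic in tests.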
Toxiproxy
Toxiproxy is a TCP proxy that sits between your service and its dependencies. It intercepts traffic and can inject latency, limit bandwidth, drop connections, or close connections abruptly.
# Install
docker run -d -p 8474:8474 -p 5433:5433 shopify/toxiproxy
# Create a proxy for your database
toxiproxy-cli create --listen 0.0.0.0:5433 --upstream postgres:5432 postgres-proxy
# Add a latency toxic (500ms +/- 100ms jitter)
toxiproxy-cli toxic add -t latency -a latency=500 -a jitter=100 postgres-proxy
# Add a timeout toxic (stop data and close the connection after 1 second)
toxiproxy-cli toxic add -t timeout -a timeout=1000 postgres-proxy
# Limit bandwidth to 10KB/s
toxiproxy-cli toxic add -t bandwidth -a rate=10 postgres-proxy
Using Toxiproxy in Tests
import requests
import pytest
TOXIPROXY_API = 'http://localhost:8474'
@pytest.fixture
def slow_database():
    # Add a 2-second latency toxic via the Toxiproxy HTTP API
    requests.post(f'{TOXIPROXY_API}/proxies/postgres-proxy/toxics', json={
        'name': 'slow_db', 'type': 'latency',
        'attributes': {'latency': 2000, 'jitter': 0}
    })
    yield
    # Remove the toxic after the test
    requests.delete(f'{TOXIPROXY_API}/proxies/postgres-proxy/toxics/slow_db')

def test_api_returns_503_when_db_is_slow(slow_database):
    # With a 2-second DB delay and a 1.5-second client timeout, either a
    # fast 503/504 from the server or a client-side timeout is acceptable
    try:
        response = requests.get('http://localhost:8000/api/users', timeout=1.5)
    except requests.exceptions.Timeout:
        return  # client timed out before the server responded: acceptable
    assert response.status_code in (503, 504)
Envoy Fault Injection
If you use Envoy as a sidecar proxy (common in Kubernetes/Istio setups), you can inject faults at the HTTP layer using Envoy's fault filter:
# Envoy config: inject 503 for 10% of /payment requests
http_filters:
- name: envoy.filters.http.fault
  typed_config:
    '@type': type.googleapis.com/envoy.extensions.filters.http.fault.v3.HTTPFault
    abort:
      http_status: 503
      percentage:
        numerator: 10
        denominator: HUNDRED
    headers:
    - name: ':path'
      string_match:
        prefix: '/payment'
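To verify that the filter injects roughly the configured percentage, you can sample the endpoint and count the aborts. This sketch uses only the standard library; the URL and sample size are yours to supply, and the observed fraction will only approximate the configured one.

```python
import urllib.request
import urllib.error

def sample_statuses(url, n=200):
    """Hit the endpoint n times and record each HTTP status code."""
    statuses = []
    for _ in range(n):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                statuses.append(resp.status)
        except urllib.error.HTTPError as e:
            statuses.append(e.code)  # urllib raises on 4xx/5xx responses
    return statuses

def abort_fraction(statuses, expected=503):
    """Fraction of sampled responses that were injected aborts."""
    return sum(1 for s in statuses if s == expected) / len(statuses)
```

With `numerator: 10` above, `abort_fraction(sample_statuses(url))` should hover around 0.10.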
Istio Fault Injection
With Istio, fault injection is even simpler via VirtualService resources:
# Inject 503 for 20% of requests to payment-service
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payment-service-chaos
spec:
  hosts:
  - payment-service
  http:
  - fault:
      abort:
        httpStatus: 503
        percentage:
          value: 20
    route:
    - destination:
        host: payment-service
  # Alternative fault: a 5-second delay for 30% of requests. Routes match
  # in order, so use this in place of the abort rule above (or add `match`
  # conditions); listed after it, it would never fire.
  - fault:
      delay:
        percentage:
          value: 30
        fixedDelay: 5s
    route:
    - destination:
        host: payment-service
Chaos Monkey and Instance-Level Chaos
Chaos Monkey operates at the infrastructure level, randomly terminating EC2 instances or Kubernetes pods. This tests your service's ability to survive the sudden loss of a node.
# Chaos Monkey for Kubernetes (kube-monkey)
# Deploy to your cluster:
kubectl apply -f kube-monkey/deployment.yaml
# Mark a deployment as a chaos target by labeling it
# (kube-monkey also expects these labels on the pod template)
kubectl label deployment payment-api kube-monkey/enabled=enabled
kubectl label deployment payment-api kube-monkey/identifier=payment-api
kubectl label deployment payment-api kube-monkey/mtbf=2  # mean time between failures (days)
kubectl label deployment payment-api kube-monkey/kill-mode=kill-all
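A minimal DIY pod-killer in the same spirit can be a few lines of Python around kubectl. This is an illustrative sketch, not a hardened tool: it assumes kubectl access to the cluster, and the namespace and label selector are yours to supply. Keeping victim selection separate from the delete makes the selection logic testable.

```python
import json
import random
import subprocess

def list_pods(namespace, selector):
    """Return pod names matching a label selector (shells out to kubectl)."""
    out = subprocess.run(
        ['kubectl', 'get', 'pods', '-n', namespace, '-l', selector, '-o', 'json'],
        capture_output=True, text=True, check=True).stdout
    return [item['metadata']['name'] for item in json.loads(out)['items']]

def choose_victim(pod_names, rng=random):
    """Pick one pod to terminate; raise if the selector matched nothing."""
    if not pod_names:
        raise ValueError('no pods match the selector')
    return rng.choice(pod_names)

def kill_random_pod(namespace, selector):
    """Delete one randomly chosen matching pod and return its name."""
    victim = choose_victim(list_pods(namespace, selector))
    subprocess.run(['kubectl', 'delete', 'pod', victim, '-n', namespace],
                   check=True)
    return victim
```

Run this against a deployment with more than one replica first, and only in an environment where losing a pod is acceptable.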
Experiment Design
A well-designed chaos experiment has clear structure:
## Experiment: Payment Service 503 Resilience
**Hypothesis**: When payment-service returns 503 for 20% of requests,
order-service retries with exponential backoff and the user success
rate stays above 95%.
**Steady State**:
- Order creation success rate: >99%
- p99 latency for POST /orders: <2s
- Error rate: <0.5%
**Method**: Inject 503 via Istio VirtualService for 20% of calls
from order-service to payment-service. Duration: 10 minutes.
**Blast Radius**: 20% of orders during the test window.
**Abort Conditions**: If success rate drops below 90% or p99 > 5s.
**Rollback**: `kubectl delete virtualservice payment-service-chaos`
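Abort conditions like those in the template translate directly into a check an experiment harness can poll. The thresholds below mirror the example above; the shape of the metrics dict is an assumption about your monitoring client.

```python
def should_abort(metrics, min_success_rate=0.90, max_p99_s=5.0):
    """Evaluate the experiment's abort conditions against current metrics.

    Returns True when the blast radius has exceeded what the experiment
    design allows and the fault should be rolled back immediately.
    """
    return (metrics['success_rate'] < min_success_rate
            or metrics['p99_s'] > max_p99_s)
```

Polling this every few seconds during the 10-minute window turns the written abort conditions into an automatic kill switch.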
Learning from Chaos
Run chaos experiments regularly and document findings:
- Unknown failure modes discovered: 'We found that the database connection pool exhausts silently after 30 seconds of latency, causing all subsequent requests to hang'
- Retry amplification: 'Our retry-on-503 strategy caused a 3x amplification in load to the payment service during a degraded period'
- Alert gaps: 'We had no alert for sustained 503 rates below 50%'
Each finding becomes a follow-up task: add circuit breakers, fix retry budgets, add missing alerts. Over time, your system genuinely becomes more resilient — not just hypothetically, but provably.
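One common fix for the retry-amplification finding above is a retry budget: retries are allowed only while they stay under a fixed fraction of recent requests. This is a simplified sketch of the idea; real implementations (for example, in several RPC frameworks) track the counts over a sliding time window rather than for the process lifetime.

```python
class RetryBudget:
    """Cap retries at a fraction of observed requests to prevent retry storms.

    With ratio=0.1, a fully degraded dependency sees at most ~1.1x the
    normal load, instead of (1 + max_retries) times it.
    """
    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        """Count every first attempt against the budget's denominator."""
        self.requests += 1

    def can_retry(self):
        """Consume one unit of budget if any remains."""
        if self.retries < self.ratio * self.requests:
            self.retries += 1
            return True
        return False
```

A circuit breaker addresses the same failure mode from the other direction, by stopping first attempts once the error rate is high enough.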