503 Service Unavailable During Traffic Spikes
Symptoms
- Users see "503 Service Unavailable" during peak hours but the site recovers on its own minutes later
- Nginx error log shows `no live upstreams while connecting to upstream` or `upstream timed out (110: Connection timed out)`
- Server CPU or memory spikes to 100% on dashboards during the outage window
- Gunicorn/uWSGI logs show workers stuck processing long-running requests or OOM kills
- Response headers contain `Retry-After: 30`, indicating the server is intentionally signalling temporary unavailability
Root Causes
- Insufficient application workers — all Gunicorn/uWSGI workers are busy processing slow requests, leaving no capacity for new ones
- Database connection pool exhausted — every worker is waiting for a DB connection, causing cascading slowness across the whole app
- A slow external API call (payment gateway, third-party webhook) blocks workers for 30+ seconds each, multiplying the impact of each request
- Memory exhaustion triggering OOM kills of worker processes — workers die faster than the supervisor can restart them
- Scheduled batch job (nightly import, cron task) fires during peak hours and monopolises CPU, leaving nothing for web requests
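The slow-external-call cause is easy to quantify: with synchronous workers, throughput is capped at workers divided by per-call latency. A back-of-envelope sketch (the worker count and 30 s latency are illustrative, taken from the scenario above):

```python
# Throughput ceiling when synchronous workers block on a slow call.
# Numbers are illustrative: 5 workers, 30 s per external API call.
WORKERS = 5
CALL_SECONDS = 30.0

capacity_rps = WORKERS / CALL_SECONDS
print(f"Max throughput: {capacity_rps:.2f} requests/second")
# -> Max throughput: 0.17 requests/second
```

Anything above roughly one request every six seconds queues up behind the busy workers, which is exactly when Nginx starts reporting `no live upstreams`.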
Diagnosis
**Step 1: Check current system load**
```bash
top -bn1 | head -5
free -h
vmstat 1 5
```
**Step 2: Count active Gunicorn workers and their states**
```bash
ps aux | grep gunicorn | grep -v grep
# Check worker utilisation — 'D' state = blocked on I/O
ps -eo pid,stat,comm | grep gunicorn
```
**Step 3: Find slow queries consuming DB connections**
```sql
-- PostgreSQL: see currently running queries > 5s
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;
```
**Step 4: Inspect Nginx access log for request rate**
```bash
# requests per hour — $4 is the [dd/Mon/yyyy:HH:MM:SS timestamp field
sudo awk '{print $4}' /var/log/nginx/access.log \
    | cut -d: -f2 | sort | uniq -c | sort -rn | head -20
```
**Step 5: Check for OOM kills in kernel log**
```bash
sudo dmesg | grep -i 'oom\|killed process'
```
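To see whether the 503s cluster in specific minutes rather than spreading evenly, you can bucket the access log by timestamp and status. A stdlib-only sketch, assuming the default combined log format (field 4 is the timestamp, field 9 the status code):

```python
# Count 503 responses per minute from an Nginx combined-format access log.
# Field 4 is "[dd/Mon/yyyy:HH:MM:SS" and field 9 is the status code.
from collections import Counter

def count_503_per_minute(lines):
    """Return a Counter mapping 'HH:MM' -> number of 503 responses."""
    buckets = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) > 8 and fields[8] == "503":
            # "[10/Oct/2024:13:55:36" -> hour and minute
            _, hh, mm, _ = fields[3].split(":")
            buckets[f"{hh}:{mm}"] += 1
    return buckets

# Usage:
#   with open("/var/log/nginx/access.log") as f:
#       for minute, n in count_503_per_minute(f).most_common(10):
#           print(n, minute)
```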
Solution
**Fix 1: Scale up Gunicorn workers**
```ini
# systemd unit fragment
# Rule of thumb: (2 × vCPU) + 1 synchronous workers,
# or fewer workers with --worker-class=gevent for I/O-bound apps
ExecStart=...gunicorn --workers 5 --worker-class gevent \
    --worker-connections 100 myapp.wsgi:application
```
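The (2 × vCPU) + 1 rule can be computed on the host itself; a quick sketch:

```python
# Compute the (2 × vCPU) + 1 rule of thumb for synchronous Gunicorn workers.
import os

cpus = os.cpu_count() or 1
sync_workers = 2 * cpus + 1
print(f"{cpus} vCPUs -> {sync_workers} synchronous workers")
```

The output depends on the machine; on a 2-vCPU host this suggests 5 workers, matching the `--workers 5` above.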
**Fix 2: Add Nginx rate limiting to protect workers**
```nginx
# /etc/nginx/nginx.conf (limit_req_zone goes in the http context)
limit_req_zone $binary_remote_addr zone=api:10m rate=20r/s;

server {
    location /api/ {
        limit_req zone=api burst=40 nodelay;
        limit_req_status 429;
        proxy_pass http://backend;
    }
}
```
**Fix 3: Return a proper 503 with Retry-After**
```python
# Django middleware — shed load during overload
import os

from django.http import HttpResponse


class LoadSheddingMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        if self._overloaded():
            resp = HttpResponse('Service temporarily unavailable', status=503)
            resp['Retry-After'] = '30'
            return resp
        return self.get_response(request)

    def _overloaded(self):
        # Example policy: shed when the 1-minute load average exceeds CPU count
        load1, _, _ = os.getloadavg()
        return load1 > (os.cpu_count() or 1)
```
**Fix 4: Add a DB connection pool**
```python
# settings.py — CONN_MAX_AGE keeps connections open between requests
# (not a true pool); for real pooling, put PgBouncer in front of PostgreSQL
DATABASES['default']['CONN_MAX_AGE'] = 60  # seconds to keep a connection open
```
Prevention
- **Load test before launch**: Use `locust` or `k6` to find your breaking point before users do
- **Circuit breaker pattern**: Wrap slow external calls with a circuit breaker (e.g. `pybreaker`) so one slow dependency doesn't cascade
- **Async task queue**: Move slow work (email, PDF generation, API calls) to a Celery/django-tasks background worker so web workers stay free
- **Horizontal scaling**: Use an Auto Scaling Group or multiple Gunicorn instances behind a load balancer so capacity grows with demand
- **Add swap**: For low-memory servers, a 2 GB swapfile buys time for workers to complete rather than OOM-killing mid-request
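The circuit-breaker idea above is small enough to sketch by hand. In production, prefer a maintained library such as `pybreaker`; the thresholds here are illustrative:

```python
# Minimal circuit breaker: after fail_max consecutive failures the circuit
# opens and calls fail fast for reset_timeout seconds, instead of letting
# each one block a web worker for 30+ seconds.
import time


class CircuitOpenError(Exception):
    pass


class CircuitBreaker:
    def __init__(self, fail_max=5, reset_timeout=60.0):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast; dependency marked down")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.fail_max:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

Wrap the payment-gateway or webhook call in `breaker.call(...)`; once the dependency is marked down, requests fail in microseconds rather than tying up a worker.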