
504 Gateway Timeout — Upstream Not Responding

Symptoms

- Browser or API client receives `504 Gateway Timeout` after a long wait (default 60 s)
- Nginx error log shows: `upstream timed out (110: Connection timed out) while reading response header from upstream`
- AWS ALB access logs record a `504` with `-1` in the target processing time field
- The endpoint works fine for simple requests but times out under load or for complex queries
- Health check pages return 200 while specific API routes return 504 intermittently
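
To confirm the intermittent pattern and see which routes are affected, you can tally 504s per request path in the Nginx access log. A sketch assuming the default `combined` log format (status is field 9, request path field 7); the demo input below stands in for `cat /var/log/nginx/access.log`:

```shell
# Tally 504 responses per request path. Demo input shown; in production,
# replace the printf with:  cat /var/log/nginx/access.log
printf '%s\n' \
  '10.0.0.5 - - [01/Jan/2025:00:00:00 +0000] "GET /api/slow/ HTTP/1.1" 504 160 "-" "curl/8"' \
  '10.0.0.5 - - [01/Jan/2025:00:00:01 +0000] "GET /api/slow/ HTTP/1.1" 504 160 "-" "curl/8"' |
awk '$9 == 504 { n[$7]++ } END { for (p in n) print n[p], p }' | sort -rn
# → 2 /api/slow/
```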

Root causes

- Gunicorn worker processing a slow database query exceeds the `proxy_read_timeout` window
- Database lock contention causing queries to wait indefinitely for row-level locks
- External API call inside a Django view has no timeout and hangs the worker
- Too few Gunicorn workers to handle concurrent requests, causing queue buildup
- Nginx `proxy_read_timeout` set lower than the Gunicorn `--timeout`, creating a mismatch

Diagnosis

1. **Identify which endpoint is timing out** and measure its latency baseline:
```bash
time curl -sI https://example.com/api/slow-endpoint/
# If > 60 s, the Nginx default timeout is the boundary
```

2. **Check Nginx timeout configuration:**
```bash
grep -r 'proxy_read_timeout\|proxy_connect_timeout\|proxy_send_timeout' \
/etc/nginx/sites-available/ /etc/nginx/nginx.conf
```

3. **Monitor active database queries** to find long-running or locked queries:
```sql
-- PostgreSQL: find queries running > 10 s
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '10 seconds'
ORDER BY duration DESC;
```
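
If step 3 shows sessions stuck waiting on locks, a follow-up query can show who is blocking whom. A sketch using `pg_blocking_pids` (PostgreSQL 9.6+):

```sql
-- Which sessions are blocked, and which PIDs hold the locks they wait on
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       state,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```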

4. **Check Gunicorn worker saturation:**
```bash
# Count gunicorn processes (the total is one master plus N workers);
# the [g] trick keeps grep from matching itself
ps aux | grep '[g]unicorn' | wc -l
# ps alone cannot distinguish busy from idle: if latency climbs while
# these processes sit idle on CPU, workers are blocked and requests queue
```
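
To see what a saturated worker is actually doing, you can dump its Python stack (assuming `py-spy` is installed); a worker hung on a DB query or external HTTP call shows up immediately in the stack:

```bash
# Replace <PID> with a worker PID from the ps listing above
sudo py-spy dump --pid <PID>
```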

5. **Reproduce against the affected endpoint** with a client-side timeout longer than the proxy's, to confirm the 504 is produced at the Nginx boundary rather than by the client:
```bash
# --max-time 70 outlives Nginx's 60 s default, so any 504 seen here
# came from the proxy, not from curl giving up first
curl --max-time 70 -sv https://example.com/api/slow/ 2>&1 | \
  grep -E 'HTTP/|504|timed out'
```

Resolution

**Fix 1: Align Nginx and Gunicorn timeouts** (Nginx must be higher than Gunicorn):
```nginx
# /etc/nginx/sites-available/myapp
location / {
    proxy_pass http://127.0.0.1:8000;
    proxy_read_timeout 120s;     # > gunicorn --timeout
    proxy_connect_timeout 10s;
    proxy_send_timeout 120s;
}
```
```bash
sudo nginx -t && sudo systemctl reload nginx
```

**Fix 2: Increase Gunicorn worker count and timeout:**
```ini
# /etc/systemd/system/gunicorn-myapp.service
ExecStart=... gunicorn \
    --workers 8 \
    --timeout 90 \
    --worker-class gthread \
    --threads 2 \
    config.wsgi:application
```
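
The `--workers 8` above is only an example; the Gunicorn documentation suggests `(2 × cores) + 1` as a starting point, which you can compute on the host:

```python
import multiprocessing

# Gunicorn docs' rule of thumb for sync/gthread workers: (2 x cores) + 1
workers = 2 * multiprocessing.cpu_count() + 1
print(workers)
```

After editing the unit file, apply it with `sudo systemctl daemon-reload && sudo systemctl restart gunicorn-myapp` (unit name taken from the path above).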

**Fix 3: Add a statement timeout to PostgreSQL** to surface slow queries early:
```python
# config/settings/base.py
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        # Abort any statement running longer than 30 000 ms (30 s)
        'OPTIONS': {'options': '-c statement_timeout=30000'},
    }
}
```
Queries that exceed the limit now fail fast with `OperationalError: canceling statement due to statement timeout` instead of holding a worker until the proxy gives up.

**Fix 4: Add timeouts to all external HTTP calls** in Django views:
```python
import httpx

# Never call an external service without a timeout: a hung upstream
# ties up a Gunicorn worker for the full proxy timeout window
try:
    response = httpx.get('https://external-api.com/data', timeout=10.0)
except httpx.TimeoutException:
    ...  # fail fast / degrade gracefully instead of hanging the worker
```

Prevention

- **Set timeouts at every layer** (ALB → Nginx → Gunicorn → DB → external HTTP) and ensure each layer's timeout is larger than the one below it
- **Offload slow work to background tasks** — any view taking > 3 seconds is a 504 risk under load; use django-tasks or Celery
- **Run `EXPLAIN ANALYZE` on slow Django queries** and add missing indexes before they cause production timeouts
- **Use APM or Sentry performance monitoring** to catch p99 latency regressions before they manifest as 504s at the Nginx boundary
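
As one way to apply the offloading rule above, a minimal Celery sketch (task and module names are hypothetical; assumes a configured Celery app and broker). The view enqueues the slow work and returns immediately instead of blocking a Gunicorn worker:

```python
# tasks.py -- hypothetical example; requires a configured Celery app/broker
from celery import shared_task

@shared_task
def generate_report(report_id):
    # the slow query or external API call moves here, off the request path
    ...

# In the view: enqueue and respond right away (e.g. with a 202)
# generate_report.delay(report_id)
```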
