Intermediate 15 min HTTP 408

408 Request Timeout on Slow API Calls

Symptoms

- Server closes the connection and returns 408 before the client finishes sending the request body
- `curl: (28) Operation timed out` or `requests.exceptions.ReadTimeout` in client logs
- Timeouts appear intermittently, correlating with large payloads or heavy server load
- Keep-alive connections go idle and are closed mid-request by the server
- Nginx error logs show `upstream timed out (110: Connection timed out)` entries when the upstream application is slow (Nginx reports this case to the client as 504 rather than 408)

Root Causes

- Server-side `client_header_timeout` or `client_body_timeout` set too low (Nginx default: 60 s each)
- Slow client upload speed, so the request body stalls or arrives after the timeout window
- Idle keep-alive connections being reused after the server has already closed them
- Upstream application server (Gunicorn, uWSGI) worker taking longer than the proxy timeout allows
- Network congestion or high latency between client and server inflating transfer time
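
To judge whether upload speed alone could push a request past a timeout, a rough estimate of the transfer time can help. The sketch below is only an idealized calculation (no protocol overhead, no congestion), and the function names are hypothetical. Note that Nginx's `client_body_timeout` applies between successive reads rather than to the whole body, so this estimate is most relevant to end-to-end limits such as curl's `--max-time` or a client read timeout.

```python
def estimated_upload_seconds(payload_bytes: int, uplink_bits_per_second: float) -> float:
    """Idealized time to send payload_bytes over an uplink of the given speed."""
    if uplink_bits_per_second <= 0:
        raise ValueError("uplink_bits_per_second must be positive")
    return payload_bytes * 8 / uplink_bits_per_second


def exceeds_timeout(
    payload_bytes: int, uplink_bits_per_second: float, timeout_seconds: float
) -> bool:
    """True if the idealized transfer time is longer than the timeout."""
    return estimated_upload_seconds(payload_bytes, uplink_bits_per_second) > timeout_seconds


if __name__ == "__main__":
    # 50 MB over a 5 Mbit/s uplink against a 60 s end-to-end limit
    print(estimated_upload_seconds(50_000_000, 5_000_000))
    print(exceeds_timeout(50_000_000, 5_000_000, 60.0))
```

If the estimate is already close to the limit, real-world overhead will likely push the request over it.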

Diagnosis

1. **Reproduce with curl and measure timing** to isolate which phase is timing out:
```bash
curl -v --max-time 30 \
  -w '\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nFirst byte: %{time_starttransfer}s\nTotal: %{time_total}s\n' \
  -X POST https://api.example.com/upload \
  -H 'Content-Type: application/json' \
  -d @large_payload.json
```

2. **Check Nginx timeout settings** on the server:
```bash
grep -r 'timeout' /etc/nginx/sites-available/ /etc/nginx/nginx.conf
# Key directives: client_body_timeout, client_header_timeout,
# proxy_read_timeout, proxy_send_timeout
```

3. **Inspect Gunicorn worker timeout** in the systemd service or config:
```bash
systemctl cat gunicorn-myapp | grep -i timeout
# Default: --timeout 30 (kills workers taking longer than 30 s)
```

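4. **Test connection reuse after an idle gap** to see if stale keep-alive connections are the culprit:
```bash
# Two requests in one curl invocation, about 15 s apart (--rate needs curl 7.84.0 or later)
curl -v --rate 4/m https://api.example.com/ping https://api.example.com/ping
# Look for "Re-using existing connection" in the verbose output.
# If curl has to open a new connection, or the second request returns 408,
# the server closed the idle connection in the gap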
```

5. **Profile the slow endpoint** to find the bottleneck:
```bash
# List recent log entries slower than 5 s; assumes the app logs the duration in ms as the last field
grep -i 'duration' logs/app.log | awk '$NF > 5000' | tail -20
```
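
Where the shell pipeline is awkward to extend, the same filter can be sketched in Python. This makes the same assumption as the `awk` command above: each relevant line mentions "duration" and ends with a numeric duration in milliseconds. The function name and log format are illustrative, not part of any framework.

```python
from typing import Iterable, List, Tuple


def slow_entries(lines: Iterable[str], threshold_ms: float = 5000.0) -> List[Tuple[float, str]]:
    """Return (duration_ms, line) pairs for lines slower than threshold_ms."""
    slow = []
    for line in lines:
        if "duration" not in line.lower():
            continue  # mirror `grep -i 'duration'`
        fields = line.split()
        try:
            duration_ms = float(fields[-1])  # last field, like awk's $NF
        except ValueError:
            continue  # last field is not a number; skip the line
        if duration_ms > threshold_ms:
            slow.append((duration_ms, line.rstrip("\n")))
    return slow


if __name__ == "__main__":
    sample = [
        "GET /api/items duration 120",
        "POST /api/upload duration 7300",
    ]
    for duration_ms, line in slow_entries(sample):
        print(f"{duration_ms:.0f} ms  {line}")
```

Returning the parsed duration alongside the line makes it easy to sort the results or group them by endpoint afterwards.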

Resolution

**1. Increase Nginx timeout directives** for slow endpoints:
```nginx
# /etc/nginx/sites-available/myapp
location /api/upload/ {
    proxy_pass http://127.0.0.1:8000;
    proxy_read_timeout 120s;   # was 60s
    proxy_send_timeout 120s;
    client_body_timeout 60s;   # Nginx default; raise it if clients upload slowly
}
```

**2. Increase Gunicorn worker timeout** in the systemd service:
```ini
# /etc/systemd/system/gunicorn-myapp.service
[Service]
ExecStart=/var/www/myapp/.venv/bin/gunicorn \
    --workers 4 \
    --timeout 120 \
    --graceful-timeout 30 \
    config.wsgi:application
```
```bash
sudo systemctl daemon-reload && sudo systemctl restart gunicorn-myapp
```

**3. Disable keep-alive on the client side** if stale connections cause spurious 408s:
```python
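import httpx

# max_keepalive_connections=0 closes each connection after use, so a stale
# socket is never reused; retries=3 only retries failed connection attempts
limits = httpx.Limits(max_keepalive_connections=0)
transport = httpx.HTTPTransport(limits=limits, retries=3)
client = httpx.Client(transport=transport, timeout=60.0)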
```
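
When disabling keep-alive is too costly, another option is to retry a request that came back as 408, since the retry normally goes out on a fresh connection. The following is a minimal sketch, not an httpx feature: `send_with_408_retry` and its parameters are hypothetical names, and `send` is any zero-argument callable returning an object with a `status_code` attribute. It should only wrap requests that are safe to repeat.

```python
import time
from typing import Callable, Protocol


class HasStatus(Protocol):
    status_code: int


def send_with_408_retry(
    send: Callable[[], HasStatus],
    attempts: int = 3,
    backoff_seconds: float = 0.5,
) -> HasStatus:
    """Call send() up to `attempts` times, retrying only on HTTP 408."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    response = send()
    for attempt in range(1, attempts):
        if response.status_code != 408:
            break
        # Exponential backoff: 0.5 s, 1 s, 2 s, ... with the default setting
        time.sleep(backoff_seconds * 2 ** (attempt - 1))
        response = send()
    return response
```

With the client from the previous snippet, a call might look like `send_with_408_retry(lambda: client.get("https://api.example.com/ping"))`. The last response is returned even if it is still a 408, so the caller decides how to handle a persistent failure.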

**4. Move long-running work off the request path** to a background task:
```python
# views.py
from django.http import JsonResponse
from django_tasks import task


@task()
def process_upload(file_id: int) -> None:
    ...  # slow processing here


def upload_view(request):
    file_id = save_file(request)  # save_file: the app's own storage helper (not shown)
    process_upload.enqueue(file_id)  # returns immediately; the work runs in a task worker
    return JsonResponse({'status': 'queued', 'id': file_id})
```

Prevention

- **Always set explicit timeouts on HTTP clients** instead of relying on the default (often infinite; `requests` has none): `httpx.Client(timeout=httpx.Timeout(30.0, connect=5.0))`
- **Use connection pooling with health checks** so stale keep-alive connections are detected and replaced before they cause 408s
- **Offload anything over 5 seconds** to a background queue; HTTP is not designed for long-running synchronous work
- **Monitor p95/p99 endpoint latency** in production and set timeout thresholds at 2–3x the p99 to provide headroom without masking real slowness
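
The last rule of thumb can be turned into a small calculation. The sketch below assumes a list of observed request latencies in seconds is available (for example, exported from a metrics system); the function name and the default multiplier of 2.5 are illustrative choices within the 2–3x range mentioned above.

```python
import statistics
from typing import Sequence


def suggested_timeout_seconds(latencies_seconds: Sequence[float], multiplier: float = 2.5) -> float:
    """Return multiplier x p99 of the observed latencies."""
    if len(latencies_seconds) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) returns the 99 cut points p1..p99; index 98 is p99
    percentiles = statistics.quantiles(latencies_seconds, n=100, method="inclusive")
    p99 = percentiles[98]
    return p99 * multiplier


if __name__ == "__main__":
    samples = [0.2, 0.3, 0.25, 0.4, 1.8, 0.35, 0.5, 0.45, 2.2, 0.3]
    print(f"Suggested timeout: {suggested_timeout_seconds(samples):.1f} s")
```

A small sample gives a noisy p99, so in practice the input should cover enough traffic to include the slow tail.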
