
1013 Try Again Later — Server Overloaded

Symptoms

- New WebSocket connections are immediately closed with code 1013 and reason 'Try Again Later'
- Existing connections continue working while new ones are rejected
- Server CPU or memory is at or near capacity
- Connection count in monitoring shows the server has hit its configured max concurrent connections
- Burst of reconnection attempts after an outage triggers a thundering herd, amplifying the overload

Main Causes

- The server's maximum WebSocket connection limit has been reached
- Memory usage is too high — the server proactively closes connections to avoid an OOM kill
- Event loop lag in Node.js, or thread pool exhaustion in a Java server, causes new connections to be rejected
- Thundering herd — many clients reconnecting simultaneously after an outage overload the server
- No connection backpressure or admission control mechanism to gracefully shed load

Diagnosis

**Step 1 — Count active WebSocket connections**

```bash
# Summary of all established TCP connections:
ss -s
# Per-port count for the WebSocket listener (adjust the port):
ss -tp | grep ':8001' | wc -l
```

For a Node.js server using the `ws` library, log the tracked client set:

```javascript
// wss.clients is a Set of active connections
// (requires clientTracking, which is on by default)
console.log('Active connections:', wss.clients.size);
```

**Step 2 — Check server resource usage**

```bash
# Memory:
free -h
grep VmRSS /proc/$(pgrep -of gunicorn)/status  # -o picks the oldest (master) PID

# CPU and memory per process:
ps aux --sort=-%mem | head -10

# File descriptor limits (each WebSocket = 1 fd):
grep 'open files' /proc/$(pgrep -of yourapp)/limits
ulimit -n # current fd limit
```

**Step 3 — Observe the thundering herd pattern**

```bash
# Watch connection rate per second after an outage:
watch -n 1 'ss -s | grep estab'

# Check if all reconnects hit within the same 1-second window:
# In application logs, count connection events per second
awk '/WebSocket connected/{t=substr($1,1,19); count[t]++} END{for(k in count) print k, count[k]}' \
/var/log/myapp/app.log | sort | tail -20
```

**Step 4 — Identify the configured connection limit**

```javascript
// Node.js ws — the library has no built-in connection cap, so look for
// a manual check like this in the connection handler:
const wss = new WebSocket.Server({ port: 8001 });
wss.on('connection', (ws) => {
  if (wss.clients.size > MAX_CONNECTIONS) { // ← find this value
    ws.close(1013, 'Try Again Later');
  }
});
```

```python
# Django Channels — check channel layer capacity:
CHANNEL_LAYERS = {
    'default': {
        'BACKEND': 'channels_redis.core.RedisChannelLayer',
        'CONFIG': {
            'capacity': 1000,  # messages per channel before backpressure
        },
    },
}
```

**Step 5 — Check Node.js event loop lag**

```javascript
// Measure event loop lag:
const { monitorEventLoopDelay } = require('perf_hooks');
const h = monitorEventLoopDelay({ resolution: 10 });
h.enable();
setInterval(() => {
console.log('Event loop lag P99:', h.percentile(99), 'ms');
h.reset();
}, 5000);
// > 100ms lag = event loop is saturated
```

Resolution

**Fix 1 — Client-side: exponential backoff on 1013**

```javascript
function connectWithBackoff(url) {
  let attempt = 0;
  const MAX_DELAY = 60_000;

  function connect() {
    const ws = new WebSocket(url);

    ws.onopen = () => { attempt = 0; };

    ws.onclose = (e) => {
      if (e.code === 1013) {
        // Server asked us to try later — use exponential backoff with jitter
        const jitter = Math.random() * 1000;
        const delay = Math.min(1000 * 2 ** attempt + jitter, MAX_DELAY);
        attempt++;
        console.log(`Server overloaded, retrying in ${Math.round(delay)}ms`);
        setTimeout(connect, delay);
      } else {
        attempt = 0; // reset on non-1013 close
      }
    };
  }
  connect();
}
```
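The backoff schedule above grows geometrically until it hits the cap. Ignoring jitter for clarity, a quick sketch of the resulting delays (the function name is illustrative):

```python
MAX_DELAY = 60_000  # ms, matching the client snippet above

def backoff_delay(attempt: int) -> int:
    """Delay in ms before reconnect attempt `attempt` (jitter omitted)."""
    return min(1000 * 2 ** attempt, MAX_DELAY)

# Attempts 0..7 → 1s, 2s, 4s, 8s, 16s, 32s, 60s (capped), 60s
print([backoff_delay(a) for a in range(8)])
```

With jitter added on top, clients that dropped at the same instant spread their retries out instead of reconnecting in lockstep.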

**Fix 2 — Server-side: implement connection admission control**

```python
# Django Channels — limit concurrent connections:
MAX_CONNECTIONS = 5000
_active_connections = 0

class AdmissionConsumer(AsyncWebsocketConsumer):
async def connect(self):
global _active_connections
if _active_connections >= MAX_CONNECTIONS:
await self.close(code=1013)
return
_active_connections += 1
await self.accept()

async def disconnect(self, close_code: int) -> None:
global _active_connections
_active_connections = max(0, _active_connections - 1)
```

**Fix 3 — Scale horizontally with sticky sessions**

```nginx
# Nginx upstream with WebSocket-aware load balancing:
upstream websocket_backends {
ip_hash; # sticky sessions — client always hits same backend
server backend1.example.com:8001;
server backend2.example.com:8001;
server backend3.example.com:8001;
}

# Or use least_conn for more even distribution:
upstream websocket_backends {
least_conn;
server backend1.example.com:8001;
server backend2.example.com:8001;
}
```

**Fix 4 — Increase OS file descriptor limits**

```bash
# /etc/security/limits.conf — raise the fd limit for the app user (needs root):
echo 'www-data soft nofile 65536' | sudo tee -a /etc/security/limits.conf
echo 'www-data hard nofile 65536' | sudo tee -a /etc/security/limits.conf

# systemd service override:
sudo systemctl edit gunicorn-myapp
# Add:
# [Service]
# LimitNOFILE=65536
sudo systemctl daemon-reload && sudo systemctl restart gunicorn-myapp
```

Prevention

- Implement client-side exponential backoff with jitter for all WebSocket reconnect logic
- Set a maximum connection limit on the server and monitor it against capacity headroom
- Use horizontal scaling with a shared pub/sub backend (Redis, NATS) so connections are distributed across multiple server processes
- Add connection rate limiting at the load balancer level to prevent thundering herd after outages
- Alert when active connection count exceeds 70% of the server's capacity ceiling
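The 70% alerting rule above reduces to a simple comparison that a metrics pipeline can evaluate each scrape (a sketch; the function name and threshold default are illustrative):

```python
def over_alert_threshold(active: int, capacity: int,
                         threshold: float = 0.70) -> bool:
    """True when active connections exceed the capacity headroom threshold."""
    return active >= capacity * threshold

# 3600 of 5000 connections = 72% → fire the alert; 3000 = 60% → stay quiet
print(over_alert_threshold(3600, 5000), over_alert_threshold(3000, 5000))
```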
