
gRPC DEADLINE_EXCEEDED — Timeout Issues

Symptoms

- Client error: `StatusCode.DEADLINE_EXCEEDED: Deadline Exceeded` after a fixed number of seconds (see the snippet after this list for confirming the status code)
- Server log shows RPC still in progress when client has already timed out
- Streaming RPCs disconnect mid-transfer on large payloads
- Intermittent timeouts correlating with high server CPU or memory pressure
- Load testing shows p99 latency exceeds client deadline at moderate concurrency
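
On the client, the failure surfaces as a `grpc.RpcError` whose code is `DEADLINE_EXCEEDED`. A minimal sketch for confirming this is the status you are hitting, assuming hypothetical generated `my_pb2`/`my_pb2_grpc` modules:

```python
import grpc
import my_pb2
import my_pb2_grpc  # hypothetical generated modules, for illustration

channel = grpc.insecure_channel('localhost:50051')
stub = my_pb2_grpc.MyServiceStub(channel)

try:
    # Deliberately short 2-second deadline to reproduce the failure
    stub.MyMethod(my_pb2.MyRequest(), timeout=2)
except grpc.RpcError as e:
    if e.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
        print(f'Deadline exceeded: {e.details()}')
```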

Root Causes

- Client deadline is set too short relative to actual server processing time
- Server is overloaded — high CPU, memory pressure, or I/O wait causing slow responses
- Network latency between client and server exceeds the deadline budget
- Large request/response payload triggers serialization delay or message fragmentation
- Cascading deadlines — an intermediate service passes the original deadline downstream without propagating the remaining budget (see the anti-pattern sketch after this list)
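
The cascading case is the subtlest: an intermediate service that grants a downstream call a fresh, fixed timeout can keep working long after its own caller's deadline has fired, which is exactly the "RPC still in progress" symptom above. A sketch of the anti-pattern, with a hypothetical `downstream_stub`:

```python
# Anti-pattern: our caller may have only 3s of budget left, but we
# grant the downstream call a fresh 10s, so it outlives the original
# deadline and its result is thrown away
def MyMethod(self, request, context):
    return downstream_stub.OtherMethod(request, timeout=10)
```

Fix 2 below shows the corrected version, which derives the downstream timeout from `context.time_remaining()`.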

Diagnosis

**Step 1 — Measure actual RPC latency**

```bash
# Use grpcurl with timing
time grpcurl -plaintext -d '{"name": "world"}' \
localhost:50051 helloworld.Greeter/SayHello

# For production: enable Prometheus gRPC metrics
# grpc_server_handling_seconds_bucket histogram shows latency distribution
```
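
For a quick latency distribution without a metrics stack, a client-side timing loop is often enough. A minimal sketch, reusing the hypothetical stub from the snippet above:

```python
import time

# Call the RPC repeatedly and report rough percentiles; compare p99
# against the client deadline (if the two are close, timeouts under
# load are expected rather than anomalous)
latencies = []
for _ in range(100):
    start = time.perf_counter()
    stub.MyMethod(my_pb2.MyRequest(), timeout=30)
    latencies.append(time.perf_counter() - start)

latencies.sort()
print(f'p50={latencies[49]:.3f}s p95={latencies[94]:.3f}s p99={latencies[98]:.3f}s')
```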

**Step 2 — Check server-side deadline propagation**

```python
import grpc
import my_pb2_grpc  # hypothetical generated stubs, as above

class MyServicer(my_pb2_grpc.MyServiceServicer):
    def MyMethod(self, request, context):
        # Log the remaining deadline; time_remaining() returns None
        # when the client set no deadline at all
        remaining = context.time_remaining()
        if remaining is not None:
            print(f'Deadline remaining: {remaining:.3f}s')
            if remaining < 0.1:
                # Fail fast rather than do work the client will never see
                context.abort(grpc.StatusCode.DEADLINE_EXCEEDED, 'No time left')
        # ... normal request handling ...
```

**Step 3 — Profile server resource usage during RPC**

```bash
# CPU and memory during load
top -p $(pgrep -f my_grpc_server)

# I/O wait
iostat -x 2

# Check for DB slow queries if server calls a database
# PostgreSQL: SELECT query, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;
```

**Step 4 — Inspect payload size**

```python
import grpc

# The default max message size is 4 MB in each direction; raise both
# limits to 50 MB if large payloads are legitimate (the server needs
# the same options, or it will still reject oversized messages)
channel = grpc.insecure_channel(
    'localhost:50051',
    options=[
        ('grpc.max_send_message_length', 50 * 1024 * 1024),
        ('grpc.max_receive_message_length', 50 * 1024 * 1024),
    ],
)
```
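
To check how close a given message comes to that limit, the generated protobuf classes expose `ByteSize()`; a quick check, assuming `request` is any generated message instance:

```python
# Sizes approaching the configured limit are candidates for
# streaming instead of a single message (see Fix 3 below)
size_mb = request.ByteSize() / (1024 * 1024)
print(f'Request payload: {size_mb:.2f} MB')
```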

**Step 5 — Enable gRPC server reflection and inspect slow methods**

```bash
# List all services and methods
grpcurl -plaintext localhost:50051 list
grpcurl -plaintext localhost:50051 describe mypackage.MyService
```
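
Note that `list` and `describe` only work when the server has reflection enabled. With the grpcio-reflection package this is a few lines at server startup; a sketch using the hypothetical `my_pb2` module from earlier:

```python
from grpc_reflection.v1alpha import reflection
import my_pb2

# Register both your own services and the reflection service itself
SERVICE_NAMES = (
    my_pb2.DESCRIPTOR.services_by_name['MyService'].full_name,
    reflection.SERVICE_NAME,
)
reflection.enable_server_reflection(SERVICE_NAMES, server)  # server: grpc.server instance
```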

Resolution

**Fix 1 — Increase client deadline**

```python
import grpc
import my_pb2_grpc

channel = grpc.insecure_channel('localhost:50051')
stub = my_pb2_grpc.MyServiceStub(channel)
# timeout is the deadline in seconds, measured from the moment of the call
response = stub.MyMethod(request, timeout=30)  # 30-second deadline
```

**Fix 2 — Propagate deadline budget in server-to-server calls**

```python
def MyMethod(self, request, context):
    # Derive the downstream timeout from the caller's remaining budget,
    # minus a 100ms buffer for our own post-processing
    remaining = context.time_remaining()
    if remaining is None:  # caller set no deadline; fall back to a default
        remaining = 30
    downstream_timeout = max(0, remaining - 0.1)
    result = downstream_stub.OtherMethod(sub_request, timeout=downstream_timeout)
    return result
```

**Fix 3 — Use server-side streaming for large responses**

```protobuf
service MyService {
  // Instead of returning one large list in a single message,
  // stream items back individually:
  rpc ListItems(ListRequest) returns (stream Item);
}
```
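
On the Python server, the handler for a streaming response is a plain generator. A sketch assuming the generated types above and a hypothetical `fetch_items` data source:

```python
def ListItems(self, request, context):
    # Each yielded Item is sent as its own message, so the client
    # starts consuming before the full result set exists in memory
    for item in fetch_items(request):  # fetch_items: hypothetical data source
        yield item
```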

**Fix 4 — Add hedging for latency-critical RPCs**

```python
import json
import grpc

# Hedging fans the same RPC out to additional backends when the first
# attempt has not answered within hedgingDelay; the first response wins.
# Only use it for idempotent methods.
service_config = json.dumps({
    'methodConfig': [{
        'name': [{'service': 'mypackage.MyService', 'method': 'LatencyCritical'}],
        'hedgingPolicy': {
            'maxAttempts': 3,
            'hedgingDelay': '0.5s',
            'nonFatalStatusCodes': ['UNAVAILABLE'],
        },
    }]
})

# The config takes effect when passed as a channel option
channel = grpc.insecure_channel(
    'localhost:50051',
    options=[('grpc.service_config', service_config)],
)
```

Prevention

- Measure p95/p99 latency in staging before setting production deadlines, then add a 2x buffer
- Always propagate the remaining deadline budget through service chains — never re-use the full original deadline
- Use server-side streaming for responses over 1 MB to avoid single large message delays
- Set CPU and memory autoscaling thresholds to prevent overload-induced deadline failures
- Instrument every RPC with OpenTelemetry traces to identify slow spans quickly (see the sketch below)
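
For the last point, the opentelemetry-instrumentation-grpc package can auto-instrument both sides. A minimal sketch, assuming a tracer provider is already configured elsewhere:

```python
from opentelemetry.instrumentation.grpc import (
    GrpcInstrumentorClient,
    GrpcInstrumentorServer,
)

# Channels and servers created after this point emit one span per RPC,
# carrying the method name, status code, and duration
GrpcInstrumentorClient().instrument()
GrpcInstrumentorServer().instrument()
```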
