CANCELLED — Client Abandoning Slow gRPC Requests
Symptoms
- Server logs show a high rate of `StatusCode.CANCELLED` errors that do not correspond to application failures
- The CANCELLED errors spike when users navigate away from a page or when mobile apps are backgrounded
- Server is processing requests to completion even though the client cancelled
- CANCELLED rate increases with server latency — slower responses = more cancellations
- gRPC-Web applications show CANCELLED errors in the browser console when the user closes a tab or navigates
Root Causes
- User navigating away from a page before the response arrives, cancelling the browser's gRPC-Web request
- Client-side timeout (distinct from gRPC deadline) cancelling the Go/Python context passed to the gRPC call
- Mobile app moving to the background, causing the OS to cancel pending network requests
- Load balancer routing a retry to a different server instance while the original server is still processing
- Streaming RPC where the client stops reading (buffer full or error) causing the stream to be cancelled server-side
Diagnosis
**Step 1 — Distinguish CANCELLED from DEADLINE_EXCEEDED**
- `CANCELLED` (code 1): the **client** explicitly cancelled the call
- `DEADLINE_EXCEEDED` (code 4): the deadline set on the **call** expired
Both appear as errors in server logs, but only DEADLINE_EXCEEDED indicates your server is too slow.
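This distinction can be made mechanical when post-processing logs. A stdlib-only sketch (the numeric values 1 and 4 come from the gRPC status-code specification; `should_page` is a hypothetical helper name, not part of any gRPC library):

```python
# gRPC status codes per the spec
CANCELLED = 1
DEADLINE_EXCEEDED = 4

def should_page(status_code: int) -> bool:
    """Only DEADLINE_EXCEEDED points at a slow server; CANCELLED is
    normal client behavior and should not wake anyone up."""
    return status_code == DEADLINE_EXCEEDED
```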
**Step 2 — Check if the server is doing wasted work after cancellation**
```python
# Python gRPC server — check context.is_active() before expensive operations
def GetReport(self, request, context):
    data = fetch_raw_data(request.id)   # fast
    if not context.is_active():
        return GetReportResponse()      # client cancelled, stop processing
    report = build_report(data)         # slow — only do this if client is still there
    return GetReportResponse(report=report)
```
**Step 3 — Measure cancellation rate vs total requests**
```python
# In your gRPC interceptor, count CANCELLED vs total RPCs
# If CANCELLED / total > 5%, investigate server latency
# If CANCELLED / total < 1%, it's normal user behavior — suppress from error alerts
```
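The tally itself can be kept in a small thread-safe object that the interceptor updates; the interceptor wiring is omitted here, and `CancellationCounter` is a hypothetical name for this sketch:

```python
import threading

class CancellationCounter:
    """Thread-safe tally; call record() once per finished RPC."""
    def __init__(self):
        self._lock = threading.Lock()
        self.total = 0
        self.cancelled = 0

    def record(self, was_cancelled: bool):
        with self._lock:
            self.total += 1
            self.cancelled += was_cancelled  # bool adds as 0 or 1

    def ratio(self) -> float:
        with self._lock:
            return self.cancelled / self.total if self.total else 0.0
```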
**Step 4 — Identify the slowest RPCs causing most cancellations**
```bash
# Correlate P99 latency with CANCELLED rate per RPC method.
# In Prometheus/Grafana (metric names as exported by go-grpc-prometheus):
#   histogram_quantile(0.99,
#     sum(rate(grpc_server_handling_seconds_bucket[5m])) by (le, grpc_method))
#   sum(rate(grpc_server_handled_total{grpc_code="Canceled"}[5m])) by (grpc_method)
```
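The same per-method check can be run offline against request logs. A stdlib-only sketch, reusing the 5% rule of thumb from Step 3 (`flag_latency_suspects` is a hypothetical name; status code 1 is CANCELLED):

```python
from collections import defaultdict

def flag_latency_suspects(rpcs, threshold=0.05):
    """rpcs: iterable of (method, status_code) pairs.
    Returns the set of methods whose cancellation ratio exceeds threshold."""
    totals = defaultdict(int)
    cancelled = defaultdict(int)
    for method, code in rpcs:
        totals[method] += 1
        cancelled[method] += code == 1  # CANCELLED
    return {m for m in totals if cancelled[m] / totals[m] > threshold}
```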
Solution
**Fix 1 — Propagate context cancellation to stop wasted work**
```python
# Python: pass context cancellation to downstream calls
import threading

def GetUserReport(self, request, context):
    # Bridge the gRPC context into a threading event
    stop_event = threading.Event()
    context.add_callback(stop_event.set)  # called when client cancels
    for step in long_running_steps():
        if stop_event.is_set() or not context.is_active():
            return GetUserReportResponse()  # client cancelled, abort early
        process_step(step)
    return GetUserReportResponse()  # all steps completed
```
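The abort behavior can be exercised without a gRPC server: a plain `threading.Event` plays the role of `context.add_callback`, and the worker stops as soon as it is set. A self-contained sketch:

```python
import threading
import time

def run_steps(stop_event, steps, done):
    for step in range(steps):
        if stop_event.is_set():
            return done          # abort early, exactly like the handler above
        done.append(step)
        time.sleep(0.01)
    return done

stop = threading.Event()
done = []
worker = threading.Thread(target=run_steps, args=(stop, 1000, done))
worker.start()
time.sleep(0.05)   # let a few steps run, then "cancel"
stop.set()
worker.join()
# worker stopped long before completing all 1000 steps
```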
**Fix 2 — Reduce latency to reduce cancellation rate**
```python
# Profile your slowest RPCs and add caching for read-heavy calls
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_cached_report(report_id: str):
    return build_expensive_report(report_id)

def GetReport(self, request, context):
    return GetReportResponse(report=get_cached_report(request.report_id))
```
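One caveat: `lru_cache` entries never expire, so a cached report can go stale indefinitely. Where freshness matters, a small TTL wrapper is one alternative (a hand-rolled sketch; `ttl_cache` is not a standard-library decorator):

```python
import time
from functools import wraps

def ttl_cache(ttl_seconds, maxsize=1000):
    def decorator(fn):
        cache = {}                       # key -> (expires_at, value)
        @wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = cache.get(args)
            if hit and hit[0] > now:
                return hit[1]            # fresh entry, skip the rebuild
            value = fn(*args)
            if len(cache) >= maxsize:
                cache.clear()            # crude eviction; fine for a sketch
            cache[args] = (now + ttl_seconds, value)
            return value
        return wrapper
    return decorator
```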
**Fix 3 — Suppress CANCELLED from error alerting**
```python
# In a server interceptor, log CANCELLED at DEBUG instead of ERROR.
# Sketch for unary-unary handlers, using grpc-python's ServerInterceptor API.
import grpc
import logging

class CancelledAwareInterceptor(grpc.ServerInterceptor):
    def intercept_service(self, continuation, handler_call_details):
        handler = continuation(handler_call_details)
        if handler is None or handler.request_streaming or handler.response_streaming:
            return handler  # only wrap unary-unary handlers in this sketch
        method_name = handler_call_details.method

        def wrapper(request, context):
            try:
                return handler.unary_unary(request, context)
            except grpc.RpcError as e:
                if e.code() == grpc.StatusCode.CANCELLED:
                    logging.debug('Client cancelled RPC %s', method_name)
                else:
                    logging.error('RPC error %s: %s', method_name, e)
                raise

        return grpc.unary_unary_rpc_method_handler(
            wrapper,
            request_deserializer=handler.request_deserializer,
            response_serializer=handler.response_serializer,
        )

# Register: grpc.server(executor, interceptors=[CancelledAwareInterceptor()])
```
**Fix 4 — For gRPC-Web: use AbortController client-side**
```javascript
// Cancel old requests when a new search is triggered.
// Assumes a client whose call options accept an AbortSignal (e.g. Connect-ES);
// the classic grpc-web client instead returns a call object with .cancel().
let controller = null;
function search(query) {
  if (controller) controller.abort();   // cancel previous in-flight request
  controller = new AbortController();
  grpcClient.search({ query }, { signal: controller.signal })
    .then(handleResults)
    .catch(err => { if (err.name !== 'AbortError') handleError(err); });
}
```
Prevention
- Check `context.is_active()` before every expensive operation in long-running RPCs to stop wasted work when clients cancel
- Log CANCELLED at DEBUG level, not ERROR — it is normal user behavior, not a server fault; alerting on CANCELLED creates alert fatigue
- Reduce your P99 latency for high-traffic RPCs — the longer requests take, the more clients cancel them
- For search/autocomplete endpoints, implement client-side debouncing to reduce spurious requests that will likely be cancelled
- Track the cancellation ratio per RPC method in your metrics — a sudden spike in CANCELLED rate is an early signal of latency regression
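The client-side debouncing bullet above can be sketched in Python with `threading.Timer` (in a browser you would use `setTimeout`/`clearTimeout` instead); only the last call in a burst actually fires:

```python
import threading

def debounce(wait_seconds):
    """Delay calls to fn; a newer call cancels the pending one."""
    def decorator(fn):
        timer = None
        lock = threading.Lock()
        def wrapper(*args, **kwargs):
            nonlocal timer
            with lock:
                if timer is not None:
                    timer.cancel()       # drop the superseded call
                timer = threading.Timer(wait_seconds, fn, args, kwargs)
                timer.start()
        return wrapper
    return decorator
```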