
CANCELLED — Client Abandoning Slow gRPC Requests

Symptoms

- Server logs show a high rate of `StatusCode.CANCELLED` errors that do not correspond to application failures
- The CANCELLED errors spike when users navigate away from a page or when mobile apps are backgrounded
- The server keeps processing requests to completion even after the client has cancelled
- CANCELLED rate increases with server latency — slower responses = more cancellations
- gRPC-Web applications show CANCELLED errors in the browser console when the user closes a tab or navigates

Root Causes

- User navigating away from a page before the response arrives, cancelling the browser's gRPC-Web request
- Client-side timeout (distinct from the gRPC deadline) cancelling the Go/Python context passed to the gRPC call
- Mobile app moving to the background, causing the OS to cancel pending network requests
- Load balancer routing a retry to a different server instance while the original server is still processing
- Streaming RPC where the client stops reading (buffer full or error), causing the stream to be cancelled server-side

Diagnosis

**Step 1 — Distinguish CANCELLED from DEADLINE_EXCEEDED**

- `CANCELLED` (code 1): the **client** explicitly cancelled the call
- `DEADLINE_EXCEEDED` (code 4): the deadline set on the **call** expired

Both appear as errors in server logs, but only DEADLINE_EXCEEDED indicates your server is too slow.
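Since this distinction drives alerting policy, here is a minimal stdlib-only sketch that maps the two numeric codes above to log levels (the function name and return values are illustrative, not part of any gRPC API):

```python
# Map gRPC status codes to log levels, using the numeric values above:
# CANCELLED = 1, DEADLINE_EXCEEDED = 4. Illustrative helper, not a gRPC API.
CANCELLED = 1
DEADLINE_EXCEEDED = 4

def log_level_for_status(code: int) -> str:
    if code == CANCELLED:
        return "DEBUG"   # client walked away, not a server fault
    if code == DEADLINE_EXCEEDED:
        return "ERROR"   # the server was too slow to meet the deadline
    return "ERROR"       # any other non-OK status
```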

**Step 2 — Check if the server is doing wasted work after cancellation**

```python
# Python gRPC server — check context.is_active() before expensive operations
def GetReport(self, request, context):
    data = fetch_raw_data(request.id)  # fast

    if not context.is_active():
        return GetReportResponse()  # client cancelled, stop processing

    report = build_report(data)  # slow — only do this if client is still there
    return GetReportResponse(report=report)
```

**Step 3 — Measure cancellation rate vs total requests**

```python
# In your gRPC interceptor, count CANCELLED vs total RPCs
# If CANCELLED / total > 5%, investigate server latency
# If CANCELLED / total < 1%, it's normal user behavior — suppress from error alerts
```
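The comment block above can be fleshed out into a runnable, stdlib-only tracker; the class name and the 5% threshold are illustrative, and a production setup would export these counters to Prometheus instead:

```python
from collections import Counter

class CancellationTracker:
    """Counts RPC outcomes per method and flags a high CANCELLED ratio."""

    def __init__(self, alert_threshold=0.05):
        self.total = Counter()
        self.cancelled = Counter()
        self.alert_threshold = alert_threshold

    def record(self, method: str, status: str) -> None:
        self.total[method] += 1
        if status == "CANCELLED":
            self.cancelled[method] += 1

    def ratio(self, method: str) -> float:
        if self.total[method] == 0:
            return 0.0
        return self.cancelled[method] / self.total[method]

    def should_alert(self, method: str) -> bool:
        # Only alert when the ratio exceeds the threshold
        return self.ratio(method) > self.alert_threshold
```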

**Step 4 — Identify the slowest RPCs causing most cancellations**

```bash
# Correlate P99 latency with CANCELLED rate per RPC method.
# In Prometheus/Grafana (metric names from go-grpc-prometheus):
#   histogram_quantile(0.99,
#     sum by (le, grpc_method) (rate(grpc_server_handling_seconds_bucket[5m])))
#   sum by (grpc_method) (rate(grpc_server_handled_total{grpc_code="Canceled"}[5m]))
```

Fixes

**Fix 1 — Propagate context cancellation to stop wasted work**

```python
# Python: stop server-side work as soon as the client cancels
import threading

def GetUserReport(self, request, context):
    # Create a threading event that is set when the gRPC context is cancelled
    stop_event = threading.Event()
    context.add_callback(stop_event.set)  # called when client cancels

    for step in long_running_steps():
        if stop_event.is_set() or not context.is_active():
            return GetUserReportResponse()  # abort early
        process_step(step)
    return GetUserReportResponse()  # all steps completed
```

**Fix 2 — Reduce latency to reduce cancellation rate**

```python
# Profile your slowest RPCs and add caching for read-heavy calls
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_cached_report(report_id: str):
    return build_expensive_report(report_id)

def GetReport(self, request, context):
    return GetReportResponse(report=get_cached_report(request.report_id))
```
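One caveat: `lru_cache` entries never expire, so a cached report can serve stale data indefinitely. A TTL-expiring variant can be sketched with the stdlib alone (the decorator name and the crude eviction policy are illustrative):

```python
import time
import functools

def ttl_cache(ttl_seconds=60.0, maxsize=1000):
    """Like lru_cache, but entries expire after ttl_seconds."""
    def decorator(fn):
        cache = {}  # key -> (expiry_timestamp, value)

        @functools.wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = cache.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]  # fresh cache hit
            value = fn(*args)
            if len(cache) >= maxsize:
                cache.clear()  # crude eviction; real code would evict oldest
            cache[args] = (now + ttl_seconds, value)
            return value
        return wrapper
    return decorator
```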

**Fix 3 — Suppress CANCELLED from error alerting**

```python
# In your server interceptor, log CANCELLED at DEBUG not ERROR level
import grpc
import logging

def intercept_service(self, handler, method_name, request_streaming, response_streaming):
    def wrapper(request, context):
        try:
            return handler(request, context)
        except grpc.RpcError as e:
            if e.code() == grpc.StatusCode.CANCELLED:
                logging.debug('Client cancelled RPC %s', method_name)
            else:
                logging.error('RPC error %s: %s', method_name, e)
            raise
    return wrapper
```

**Fix 4 — For gRPC-Web: use AbortController client-side**

```javascript
// Cancel old requests when a new search is triggered
let controller = null;

function search(query) {
  if (controller) controller.abort(); // cancel previous
  controller = new AbortController();
  grpcClient.search({ query }, { signal: controller.signal })
    .then(handleResults)
    .catch(err => { if (err.name !== 'AbortError') handleError(err); });
}
```

Prevention

- Check `context.is_active()` before every expensive operation in long-running RPCs to stop wasted work when clients cancel
- Log CANCELLED at DEBUG level, not ERROR — it is normal user behavior, not a server fault; alerting on CANCELLED creates alert fatigue
- Reduce your P99 latency for high-traffic RPCs — the longer requests take, the more clients cancel them
- For search/autocomplete endpoints, implement client-side debouncing to reduce spurious requests that will likely be cancelled
- Track the cancellation ratio per RPC method in your metrics — a sudden spike in CANCELLED rate is an early signal of latency regression
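The debouncing idea from the list above, sketched in Python with the stdlib (browser clients would use `setTimeout`/`clearTimeout` for the same effect; the decorator is illustrative):

```python
import threading

def debounce(wait_seconds):
    """Delay calls to fn; only the last call within wait_seconds fires."""
    def decorator(fn):
        timer = None
        lock = threading.Lock()

        def wrapper(*args, **kwargs):
            nonlocal timer
            with lock:
                if timer is not None:
                    timer.cancel()  # drop the superseded call
                timer = threading.Timer(wait_seconds, fn, args, kwargs)
                timer.start()
        return wrapper
    return decorator
```

Decorating the function that issues the search RPC with, say, `@debounce(0.3)` means a burst of keystrokes results in a single request instead of one per character, so there is nothing to cancel in the first place.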
