[BUG]: (sse): Granian CPU spikes to 800% after load stops, recovers when load resumes #2357
Status: Closed
Labels: MUST · P1: Non-negotiable, critical requirements without which the product is non-functional or unsafe · bug: Something isn't working · performance: Performance related items · python: Python / backend development (FastAPI)
Description
Related Issue
Follow-up to #2355 (High-load performance degradation)
Summary
After fixing the database locking issues in #2355, a new, counterintuitive behavior emerged: CPU usage rises AFTER load stops and falls again when load resumes.
Behavior Observed
| State | CPU per Gateway | Notes |
|---|---|---|
| Under load (4000 users) | ~600% | Normal, expected with 16 workers |
| After load stops | ~800% | 🔴 Spin loop - HIGHER than under load! |
| Load resumes | ~600% | Recovers immediately |
| Container restart | ~0.5% | Also recovers |
Root Cause Analysis
Investigation Findings
- py-spy profiling showed Python threads mostly idle - the spin is in Granian's Rust code, not Python
- 800-1700 ESTABLISHED TCP connections remain open after clients disconnect
- SendError spam from Granian initially:
  `[INFO] ASGI transport error: SendError { .. }`
- `request.is_disconnected()` returns `False` even for dead connections
- Python SSE code correctly waits 30s between keepalive yields
- Granian accepts the yield but silently fails to send to dead sockets
Why Python-level detection doesn't work
- Our rapid yield detection looks for yields <100ms apart
- Python yields are 30s apart (keepalive interval works correctly)
- The CPU burn is in Granian's Rust event loop, not Python's asyncio
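A rough reconstruction of the rapid-yield heuristic described above (the wrapper and its names are hypothetical, not the gateway's actual detector) shows why it stays blind here: it only counts consecutive yields closer than 100ms, and a 30s keepalive cadence never trips it.

```python
import asyncio
import time

RAPID_YIELD_THRESHOLD = 0.1  # 100 ms, matching the detection described above

async def detect_rapid_yields(agen, threshold=RAPID_YIELD_THRESHOLD):
    # Wrap an async generator and count consecutive yields closer together
    # than `threshold` seconds. A 30 s keepalive never trips this, which is
    # why the heuristic cannot see a spin loop that lives in Granian's Rust
    # code rather than in the Python generator.
    rapid = 0
    last = None
    async for item in agen:
        now = time.monotonic()
        if last is not None and now - last < threshold:
            rapid += 1
        last = now
        yield item, rapid

async def demo():
    async def spammy():  # pathological generator: yields back-to-back
        for _ in range(5):
            yield "x"
    return [rapid async for _, rapid in detect_rapid_yields(spammy())]

counts = asyncio.run(demo())
print(counts)  # every back-to-back yield after the first counts as rapid
```

Only a misbehaving Python generator would register here; a busy Rust event loop produces no Python-visible yields at all.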
Likely Granian Bug
This appears related to known Granian issues:
- Disconnect event is not propogated to the middleware while a request is processing emmett-framework/granian#286 - HTTP disconnect events queued, not sent immediately
- [WARNING] [_granian.asgi.io] ASGI transport error: SendError { .. } paperless-ngx/paperless-ngx#9592 - Similar high CPU after client disconnects
Environment
- Granian with 16 workers
- SSE (Server-Sent Events) transport
- sse-starlette EventSourceResponse
- Load test: 4000 concurrent users via Locust
Potential Solutions to Investigate
- Test with Gunicorn - Confirm this is Granian-specific (Gunicorn is ~50% slower but may not have this issue)
- Granian connection timeout - Find/add idle connection timeout option
- TCP keepalive tuning - Detect dead connections at OS level
- Aggressive send_timeout - Force faster failure on dead connections
- Report upstream - File issue with Granian if confirmed as their bug
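For the "TCP keepalive tuning" option, this is a sketch of the per-socket knobs involved (the helper name is hypothetical; since we don't control Granian's accept loop directly, applying this server-wide would likely mean sysctls such as `net.ipv4.tcp_keepalive_time` or upstream support rather than this function):

```python
import socket

def enable_tcp_keepalive(sock, idle=30, interval=10, probes=3):
    # OS-level dead-peer detection: after `idle` seconds of silence the kernel
    # sends up to `probes` keepalive probes `interval` seconds apart, then
    # resets the connection. This would let the OS reap the 800-1700
    # ESTABLISHED sockets left behind by vanished clients, independently of
    # whether Granian ever notices the disconnect.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-connection knobs are platform-specific (these exist on Linux):
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock

s = enable_tcp_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
s.close()
print(keepalive_on)
```

Note this only bounds how long a dead connection lingers; it would not by itself explain or fix the Rust-side spin, so it is a mitigation to test, not a root-cause fix.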
Workaround
Container restart immediately recovers CPU to normal levels.
What was fixed in #2355
- ✅ Root Cause #1: PR #2211 - Cascading FOR UPDATE Locks
- ✅ Root Cause #2: PR #2253 - Server Deactivation (WORST)
- ✅ Root Cause #3: PR #2170 - Logging Changes
- ✅ Database locking issues resolved
- ❌ Post-load CPU spike remains (this issue)