[BUG]: (sse): Granian CPU spikes to 800% after load stops, recovers when load resumes #2357
Status: Closed
Labels: MUST · P1: Non-negotiable, critical requirements without which the product is non-functional or unsafe · bug: Something isn't working · performance: Performance related items · python: Python / backend development (FastAPI)
Description
Related Issue
Follow-up to #2355 (High-load performance degradation)
Summary
After fixing the database locking issues in #2355, a new, counterintuitive behavior emerged: CPU usage rises AFTER load stops and falls again when load resumes.
Behavior Observed
| State | CPU per Gateway | Notes |
|---|---|---|
| Under load (4000 users) | ~600% | Normal, expected with 16 workers |
| After load stops | ~800% | 🔴 Spin loop - HIGHER than under load! |
| Load resumes | ~600% | Recovers immediately |
| Container restart | ~0.5% | Also recovers |
Root Cause Analysis
Investigation Findings
- py-spy profiling showed Python threads mostly idle - the spin is in Granian's Rust code, not Python
- 800-1700 ESTABLISHED TCP connections remain open after clients disconnect
- SendError spam from Granian initially:
  `[INFO] ASGI transport error: SendError { .. }`
- `request.is_disconnected()` returns `False` even for dead connections
- Python SSE code correctly waits 30s between keepalive yields
- Granian accepts the yield but silently fails to send to dead sockets
Why Python-level detection doesn't work
- Our rapid yield detection looks for yields <100ms apart
- Python yields are 30s apart (keepalive interval works correctly)
- The CPU burn is in Granian's Rust event loop, not Python's asyncio
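A rough reconstruction of the rapid-yield heuristic described above (the wrapper and its names are hypothetical, not the gateway's actual detector) shows why it stays blind here: it only counts consecutive yields closer than 100ms, and a 30s keepalive cadence never trips it.

```python
import asyncio
import time

RAPID_YIELD_THRESHOLD = 0.1  # 100 ms, matching the detection described above

async def detect_rapid_yields(agen, threshold=RAPID_YIELD_THRESHOLD):
    # Wrap an async generator and count consecutive yields closer together
    # than `threshold` seconds. A 30 s keepalive never trips this, which is
    # why the heuristic cannot see a spin loop that lives in Granian's Rust
    # code rather than in the Python generator.
    rapid = 0
    last = None
    async for item in agen:
        now = time.monotonic()
        if last is not None and now - last < threshold:
            rapid += 1
        last = now
        yield item, rapid

async def demo():
    async def spammy():  # pathological generator: yields back-to-back
        for _ in range(5):
            yield "x"
    return [rapid async for _, rapid in detect_rapid_yields(spammy())]

counts = asyncio.run(demo())
print(counts)  # every back-to-back yield after the first counts as rapid
```

Only a misbehaving Python generator would register here; a busy Rust event loop produces no Python-visible yields at all.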
Likely Granian Bug
This appears related to known Granian issues:
- Disconnect event is not propogated to the middleware while a request is processing emmett-framework/granian#286 - HTTP disconnect events queued, not sent immediately
- [WARNING] [_granian.asgi.io] ASGI transport error: SendError { .. } paperless-ngx/paperless-ngx#9592 - Similar high CPU after client disconnects
Environment
- Granian with 16 workers
- SSE (Server-Sent Events) transport
- sse-starlette EventSourceResponse
- Load test: 4000 concurrent users via Locust
Potential Solutions to Investigate
- Test with Gunicorn - Confirm this is Granian-specific (Gunicorn is ~50% slower but may not have this issue)
- Granian connection timeout - Find/add idle connection timeout option
- TCP keepalive tuning - Detect dead connections at OS level
- Aggressive send_timeout - Force faster failure on dead connections
- Report upstream - File issue with Granian if confirmed as their bug
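For the "TCP keepalive tuning" option, this is a sketch of the per-socket knobs involved (the helper name is hypothetical; since we don't control Granian's accept loop directly, applying this server-wide would likely mean sysctls such as `net.ipv4.tcp_keepalive_time` or upstream support rather than this function):

```python
import socket

def enable_tcp_keepalive(sock, idle=30, interval=10, probes=3):
    # OS-level dead-peer detection: after `idle` seconds of silence the kernel
    # sends up to `probes` keepalive probes `interval` seconds apart, then
    # resets the connection. This would let the OS reap the 800-1700
    # ESTABLISHED sockets left behind by vanished clients, independently of
    # whether Granian ever notices the disconnect.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-connection knobs are platform-specific (these exist on Linux):
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock

s = enable_tcp_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
s.close()
print(keepalive_on)
```

Note this only bounds how long a dead connection lingers; it would not by itself explain or fix the Rust-side spin, so it is a mitigation to test, not a root-cause fix.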
Workaround
Container restart immediately recovers CPU to normal levels.
What was fixed in #2355
- ✅ Root Cause #1: PR #2211 - Cascading FOR UPDATE Locks
- ✅ Root Cause #2: PR #2253 - Server Deactivation (WORST)
- ✅ Root Cause #3: PR #2170 - Logging Changes
- ✅ Database locking issues resolved
- ❌ Post-load CPU spike remains (this issue)