
[BUG]: (sse): Granian CPU spikes to 800% after load stops, recovers when load resumes #2357

@crivetimihai

Description


Related Issue

Follow-up to #2355 (High-load performance degradation)

Summary

After fixing the database locking issues in #2355, a new and counterintuitive behavior emerged: CPU usage increases AFTER load stops, and decreases when load resumes.

Behavior Observed

| State | CPU per gateway | Notes |
| --- | --- | --- |
| Under load (4000 users) | ~600% | Normal, expected with 16 workers |
| After load stops | ~800% | 🔴 Spin loop, HIGHER than under load! |
| Load resumes | ~600% | Recovers immediately |
| Container restart | ~0.5% | Also recovers |

Root Cause Analysis

Investigation Findings

  1. py-spy profiling showed Python threads mostly idle - the spin is in Granian's Rust code, not Python
  2. 800-1700 ESTABLISHED TCP connections remain open after clients disconnect
  3. SendError spam from Granian initially: [INFO] ASGI transport error: SendError { .. }
  4. request.is_disconnected() returns False even for dead connections
  5. Python SSE code correctly waits 30s between keepalive yields
  6. Granian accepts the yield but silently fails to send to dead sockets
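Findings 4-6 can be reproduced in miniature: the Python side dutifully yields keepalives while the disconnect check keeps reporting the client as alive. This is a minimal sketch, not the project's actual SSE handler; `is_disconnected` stands in for Starlette's `request.is_disconnected()`, and the interval is shortened from 30 s for illustration.

```python
import asyncio

KEEPALIVE_INTERVAL = 0.01  # 30 s in production; shortened for this sketch


async def sse_keepalives(is_disconnected, max_events=3):
    """Mimic the Python-side SSE loop from the findings above.

    Per finding 4, is_disconnected() can keep returning False for a dead
    socket, so this loop keeps yielding keepalives that the server layer
    (Granian, per finding 6) then silently fails to deliver.
    """
    sent = []
    while len(sent) < max_events:
        if await is_disconnected():
            break  # never taken when the disconnect check lies
        sent.append(": keepalive\n\n")
        await asyncio.sleep(KEEPALIVE_INTERVAL)
    return sent


async def always_connected():
    return False  # finding 4: a dead connection is still reported as alive


events = asyncio.run(sse_keepalives(always_connected))
print(len(events))  # 3 keepalives emitted toward a peer that may be long gone
```

Because the generator itself behaves correctly, nothing at the Python layer looks wrong; the failure is only visible below it.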

Why Python-level detection doesn't work

  • Our rapid yield detection looks for yields <100ms apart
  • Python yields are 30s apart (keepalive interval works correctly)
  • The CPU burn is in Granian's Rust event loop, not Python's asyncio
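To make the first two bullets concrete, here is a hedged sketch of what rapid-yield detection might look like (the project's real implementation is not shown in this issue; the wrapper and all names are illustrative):

```python
import asyncio
import time


async def detect_rapid_yields(gen, threshold=0.1):
    """Wrap an async generator and flag consecutive yields < threshold s apart.

    This catches Python-level spin loops, but not this bug: per the bullets
    above, the SSE generator yields 30 s apart while Granian's Rust event
    loop burns CPU between yields, invisible to this check.
    """
    last = None
    async for item in gen:
        now = time.monotonic()
        if last is not None and now - last < threshold:
            raise RuntimeError("rapid yields: possible Python-level spin loop")
        last = now
        yield item


async def slow_keepalives(n=3, interval=0.2):
    for _ in range(n):
        await asyncio.sleep(interval)  # stands in for the 30 s keepalive gap
        yield ": keepalive\n\n"


async def main():
    # Healthy spacing: the detector stays quiet, exactly as observed here.
    return [e async for e in detect_rapid_yields(slow_keepalives())]


events = asyncio.run(main())
print(len(events))  # 3
```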

Likely Granian Bug

This appears to be related to known Granian issues.

Environment

  • Granian with 16 workers
  • SSE (Server-Sent Events) transport
  • sse-starlette EventSourceResponse
  • Load test: 4000 concurrent users via Locust

Potential Solutions to Investigate

  1. Test with Gunicorn - Confirm this is Granian-specific (Gunicorn is ~50% slower but may not have this issue)
  2. Granian connection timeout - Find/add idle connection timeout option
  3. TCP keepalive tuning - Detect dead connections at OS level
  4. Aggressive send_timeout - Force faster failure on dead connections
  5. Report upstream - File issue with Granian if confirmed as their bug
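For option 3, OS-level TCP keepalive can be enabled per socket. A sketch follows (Linux-specific option names, illustrative timer values; whether it helps depends on whether Granian exposes or applies these options to its accepted sockets, which would need to be confirmed):

```python
import socket


def enable_tcp_keepalive(sock, idle=60, interval=10, count=3):
    """Ask the kernel to probe idle peers so dead connections are torn down.

    With these illustrative values, a dead peer is detected after roughly
    idle + interval * count = 90 s instead of lingering in ESTABLISHED.
    The TCP_KEEP* constants are platform-specific, hence the hasattr guards.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock


s = enable_tcp_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
s.close()
print(keepalive_on != 0)  # True
```

This only addresses the symptom (stale ESTABLISHED connections); the spin loop itself would still need the upstream fix in option 5.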

Workaround

Container restart immediately recovers CPU to normal levels.

What was fixed in #2355

Metadata

Labels

  • MUST - P1: Non-negotiable, critical requirements without which the product is non-functional or unsafe
  • bug - Something isn't working
  • performance - Performance related items
  • python - Python / backend development (FastAPI)
