fix(gateway): clear stale agent slot after session_reset to prevent zombie thread #28689

Open
yifengingit wants to merge 2 commits into NousResearch:main from yifengingit:fix/session-zombie-after-session-reset
@yifengingit

Contributor

Closes #28686

Problem

When a session_reset skill fires while an agent turn is in-flight and the credential pool is simultaneously exhausted, the affected Telegram thread enters a permanent zombie state: every subsequent message is silently dropped as "agent busy", requiring a gateway restart to recover.

The root cause is a gap in the outer _process_message_or_command finally block: when the run-generation guard correctly blocks the inner _release_running_agent_state call (to protect a newer run), the outer else branch only clears the metadata dicts (_running_agents_ts, _busy_ack_ts) but leaves the dead agent reference in _running_agents[session_key]. The staleness-eviction path can't recover it either because _running_agents_ts was already popped.

See issue #28686 for the full step-by-step trace and log evidence.
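
The gap described above can be reproduced in a toy model. This is a minimal sketch, not the gateway's code: the dict names come from this PR, but `SlotModel`, its method names, the integer generation counter, the busy check, and the eviction logic are all assumptions made for illustration.

```python
import time


class SlotModel:
    """Toy model of per-session bookkeeping (hypothetical, not the real class)."""

    def __init__(self):
        self._running_agents = {}     # session_key -> agent reference
        self._running_agents_ts = {}  # session_key -> run start time
        self._busy_ack_ts = {}        # session_key -> last "agent busy" ack
        self._generation = {}         # session_key -> run generation (assumed shape)

    def start_run(self, session_key, agent):
        self._running_agents[session_key] = agent
        self._running_agents_ts[session_key] = time.monotonic()
        return self._generation.get(session_key, 0)

    def invalidate_generation(self, session_key):
        self._generation[session_key] = self._generation.get(session_key, 0) + 1

    def guarded_release(self, session_key, run_generation):
        """Inner release: refuses to touch state owned by a newer generation."""
        if self._generation.get(session_key, 0) != run_generation:
            return False
        self._running_agents.pop(session_key, None)
        self._running_agents_ts.pop(session_key, None)
        self._busy_ack_ts.pop(session_key, None)
        return True

    def outer_cleanup_before_fix(self, session_key):
        """Pre-fix else branch: metadata is popped, the agent slot is not."""
        self._running_agents_ts.pop(session_key, None)
        self._busy_ack_ts.pop(session_key, None)

    def is_busy(self, session_key):
        return session_key in self._running_agents

    def evict_if_stale(self, session_key, max_age_s):
        """Staleness eviction needs a timestamp to judge the slot's age."""
        started = self._running_agents_ts.get(session_key)
        if started is None or time.monotonic() - started < max_age_s:
            return False
        self._running_agents.pop(session_key, None)
        self._running_agents_ts.pop(session_key, None)
        return True


def reproduce_zombie():
    model = SlotModel()
    key = "telegram:thread-1"  # hypothetical session key
    gen = model.start_run(key, agent=object())
    model.invalidate_generation(key)         # session_reset fires mid-turn
    if not model.guarded_release(key, gen):  # guard blocks the inner release
        model.outer_cleanup_before_fix(key)
    evicted = model.evict_if_stale(key, max_age_s=0.0)
    return model.is_busy(key), evicted
```

In this model `reproduce_zombie()` leaves the session marked busy with no timestamp left for eviction to act on, which matches the zombie state reported in the issue.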

Changes

gateway/run.py — Fix 1: outer finally in _process_message_or_command

Replace the sentinel-conditional cleanup with an unconditional _release_running_agent_state call. This is safe because the method is idempotent (pop on an absent key is harmless) and no new agent for the same session key can start while the outer frame is unwinding.
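
The shape of Fix 1 can be sketched as follows. The `GatewaySketch` class, the `process` method, and the `handler` callable are hypothetical stand-ins; only the dict names and `_release_running_agent_state` appear in this PR, and the real method's signature may differ.

```python
class GatewaySketch:
    """Hypothetical stand-in for the gateway runner; signatures are assumed."""

    def __init__(self):
        self._running_agents = {}
        self._running_agents_ts = {}
        self._busy_ack_ts = {}

    def _release_running_agent_state(self, session_key):
        # dict.pop with a default never raises on an absent key,
        # so calling this twice for the same session is harmless.
        self._running_agents.pop(session_key, None)
        self._running_agents_ts.pop(session_key, None)
        self._busy_ack_ts.pop(session_key, None)

    def process(self, session_key, handler):
        # Claim the slot, run the turn, and always release on the way out.
        self._running_agents[session_key] = object()
        self._running_agents_ts[session_key] = 0.0
        try:
            return handler()
        finally:
            # Unconditional: no sentinel check, no generation guard.
            self._release_running_agent_state(session_key)
```

Because the release sits in `finally` with no condition, the slot is cleared whether the handler returns normally or raises.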

gateway/run.py — Fix 2: _handle_reset_command

Add an explicit _release_running_agent_state(session_key) call immediately after _invalidate_session_run_generation. This makes the reset path self-contained: even if the outer finally doesn't run for this session key (e.g. the reset arrives from a different coroutine context), the slot is still cleared.
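
A rough sketch of the ordering in Fix 2, assuming the handler is a coroutine and both helpers are synchronous. `ResetSketch`, the generation counter, the return value, and the `asyncio.sleep(0)` stand-in for later async work are illustrative assumptions, not the code in gateway/run.py.

```python
import asyncio


class ResetSketch:
    """Hypothetical stand-in for the gateway runner; signatures are assumed."""

    def __init__(self):
        self._running_agents = {}
        self._running_agents_ts = {}
        self._busy_ack_ts = {}
        self._generation = {}  # session_key -> run generation (assumed shape)

    def _invalidate_session_run_generation(self, session_key):
        self._generation[session_key] = self._generation.get(session_key, 0) + 1

    def _release_running_agent_state(self, session_key):
        self._running_agents.pop(session_key, None)
        self._running_agents_ts.pop(session_key, None)
        self._busy_ack_ts.pop(session_key, None)

    async def _handle_reset_command(self, session_key):
        # Both calls run before the first await, so no other coroutine
        # can observe a bumped generation with the old slot still in place.
        self._invalidate_session_run_generation(session_key)
        self._release_running_agent_state(session_key)
        await asyncio.sleep(0)  # stand-in for the reset's real async work
        return "session reset"
```

Placing the release ahead of any `await` means the reset path does not rely on another frame's `finally` block to free the slot.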

Test plan

  • tests/gateway/test_pending_event_none.py — existing tests still pass (guard for the related pending-event path)
  • Manual: trigger /cc on a thread while an agent turn is in-flight with a depleted credential pool → thread continues accepting messages after the reset
  • Manual: trigger /new while an agent turn is in-flight (normal path) → behavior unchanged
  • No regressions on other threads during/after the reset

…ombie thread

When a skill fires session_reset (e.g. /cc) exactly as the credential pool
exhausts, the bumped run-generation makes the in-flight run's
generation-guarded _release_running_agent_state a no-op.  The outer
_process_message_or_command finally only popped the metadata dicts, leaving
the dead agent reference in _running_agents.  Every subsequent message to
the thread hit the busy-guard and was silently dropped, requiring a gateway
restart to recover.

Fix 1 – outer finally in _process_message_or_command: unconditionally call
_release_running_agent_state instead of the sentinel-only branch so the slot
is always cleared when the outer frame unwinds.

Fix 2 – _handle_reset_command: explicitly release the agent slot immediately
after _invalidate_session_run_generation, before any async work, ensuring
the reset path is self-contained.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@alt-glitch added the labels type/bug (Something isn't working), P1 (High: major feature broken, no workaround), comp/gateway (Gateway runner, session dispatch, delivery), platform/telegram (Telegram bot adapter), and area/auth (Authentication, OAuth, credential pools) on May 19, 2026

Required by check-attribution CI for PR NousResearch#28689.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

Bug: session_reset + credential pool exhaustion leaves thread session in zombie state — subsequent messages silently dropped
