fix(gateway): clear stale agent slot after session_reset to prevent zombie thread #28689
Open
yifengingit wants to merge 2 commits into
…ombie thread

When a skill fires session_reset (e.g. /cc) exactly as the credential pool exhausts, the bumped run-generation makes the in-flight run's generation-guarded _release_running_agent_state a no-op. The outer _process_message_or_command finally only popped the metadata dicts, leaving the dead agent reference in _running_agents. Every subsequent message to the thread hit the busy-guard and was silently dropped, requiring a gateway restart to recover.

Fix 1 – outer finally in _process_message_or_command: unconditionally call _release_running_agent_state instead of the sentinel-only branch so the slot is always cleared when the outer frame unwinds.

Fix 2 – _handle_reset_command: explicitly release the agent slot immediately after _invalidate_session_run_generation, before any async work, ensuring the reset path is self-contained.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Required by check-attribution CI for PR NousResearch#28689. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes #28686
Problem
When a session_reset skill fires while an agent turn is in-flight and the credential pool is simultaneously exhausted, the affected Telegram thread enters a permanent zombie state: every subsequent message is silently dropped as "agent busy", requiring a gateway restart to recover.

The root cause is a gap in the outer _process_message_or_command finally block: when the run-generation guard correctly blocks the inner _release_running_agent_state call (to protect a newer run), the outer else branch only clears the metadata dicts (_running_agents_ts, _busy_ack_ts) but leaves the dead agent reference in _running_agents[session_key]. The staleness-eviction path can't recover it either because _running_agents_ts was already popped.

See issue #28686 for the full step-by-step trace and log evidence.
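For reviewers who want to see the failure in isolation, here is a minimal toy model of the sequence described in this section. It is a sketch, not the code in gateway/run.py: the ToyGateway class, the _run_generation dict, the run_generation keyword, and the is_busy helper are all assumptions made for illustration, and the pre-fix outer finally is reduced to its else branch.

```python
class ToyGateway:
    """Toy stand-in for the gateway's per-session bookkeeping (hypothetical)."""

    def __init__(self):
        self._running_agents = {}     # session_key -> agent reference
        self._running_agents_ts = {}  # session_key -> turn start timestamp
        self._busy_ack_ts = {}        # session_key -> last "busy" ack timestamp
        self._run_generation = {}     # session_key -> generation counter (assumed)

    def _invalidate_session_run_generation(self, session_key):
        # A reset bumps the generation so older runs can be told apart.
        current = self._run_generation.get(session_key, 0)
        self._run_generation[session_key] = current + 1

    def _release_running_agent_state(self, session_key, run_generation=None):
        # Guarded release: a run from an older generation must not clear
        # state that may belong to a newer run. Returns True if it cleared.
        current = self._run_generation.get(session_key, 0)
        if run_generation is not None and run_generation != current:
            return False
        self._running_agents.pop(session_key, None)
        self._running_agents_ts.pop(session_key, None)
        self._busy_ack_ts.pop(session_key, None)
        return True

    def is_busy(self, session_key):
        # Busy-guard: any entry in _running_agents blocks new messages.
        return session_key in self._running_agents


def reproduce_zombie():
    gw = ToyGateway()
    key = "telegram:thread-1"  # made-up session key

    # 1. An agent turn starts and records its generation.
    gen = gw._run_generation.get(key, 0)
    gw._running_agents[key] = object()
    gw._running_agents_ts[key] = 0.0

    # 2. session_reset fires mid-turn and bumps the generation.
    gw._invalidate_session_run_generation(key)

    # 3. The in-flight run tries to release; the guard blocks it.
    released = gw._release_running_agent_state(key, run_generation=gen)

    # 4. Pre-fix outer finally (else branch): metadata only.
    gw._running_agents_ts.pop(key, None)
    gw._busy_ack_ts.pop(key, None)
    return gw, key, released
```

After reproduce_zombie() returns, the key is still present in _running_agents while its timestamp is gone, which is the combination that neither the guarded release nor a timestamp-based eviction can clear in this model.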
Changes
gateway/run.py - Fix 1: outer finally in _process_message_or_command

Replace the sentinel-conditional cleanup with an unconditional _release_running_agent_state call. This is safe because the method is idempotent (pop on an absent key is harmless) and no new agent for the same session key can start while the outer frame is unwinding.

gateway/run.py - Fix 2: _handle_reset_command

Add an explicit _release_running_agent_state(session_key) call immediately after _invalidate_session_run_generation. This makes the reset path self-contained: even if the outer finally doesn't run for this session key (e.g. the reset arrives from a different coroutine context), the slot is still cleared.

Test plan

- tests/gateway/test_pending_event_none.py: existing tests still pass (guard for the related pending-event path)
- /cc on a thread while an agent turn is in-flight with a depleted credential pool → thread continues accepting messages after the reset
- /new while an agent turn is in-flight (normal path) → behavior unchanged
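The two fixes listed under Changes can be illustrated with a small, self-contained sketch. Everything here is hypothetical scaffolding rather than the actual gateway code: ToySlots, its attribute names, the run_generation keyword, and the synchronous process_turn and handle_reset functions are assumptions that only mirror the shape of the change.

```python
class ToySlots:
    """Hypothetical model of the per-session agent slot and its metadata."""

    def __init__(self):
        self.running_agents = {}
        self.running_agents_ts = {}
        self.busy_ack_ts = {}
        self.generation = {}

    def invalidate_generation(self, key):
        self.generation[key] = self.generation.get(key, 0) + 1

    def release(self, key, run_generation=None):
        # Guarded when run_generation is given, unconditional otherwise.
        # dict.pop with a default makes repeated calls harmless.
        if run_generation is not None and run_generation != self.generation.get(key, 0):
            return False
        for table in (self.running_agents, self.running_agents_ts, self.busy_ack_ts):
            table.pop(key, None)
        return True


def handle_reset(slots, key):
    # Fix 2 (sketch): release the slot right after bumping the generation,
    # so the reset path does not rely on any other frame to clean up.
    slots.invalidate_generation(key)
    slots.release(key)


def old_reset(slots, key):
    # Pre-fix reset behavior in this model: bump the generation only.
    slots.invalidate_generation(key)


def process_turn(slots, key, reset_handler=None):
    gen = slots.generation.get(key, 0)
    slots.running_agents[key] = object()
    slots.running_agents_ts[key] = 0.0
    try:
        if reset_handler is not None:
            reset_handler(slots, key)  # reset arrives mid-turn
        # Inner guarded release: becomes a no-op once the generation moved.
        slots.release(key, run_generation=gen)
    finally:
        # Fix 1 (sketch): unconditional release when the outer frame unwinds.
        slots.release(key)
```

Passing old_reset to process_turn exercises Fix 1 on its own: the guarded release is skipped, and the finally clause still clears the slot. Calling handle_reset directly on a populated ToySlots exercises Fix 2 without any outer frame involved.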