fix(gateway): prevent concurrent agent runs for the same session via sentinel guard#2086
Closed
Gutslabs wants to merge 1 commit into
Closed
fix(gateway): prevent concurrent agent runs for the same session via sentinel guard#2086Gutslabs wants to merge 1 commit into
Gutslabs wants to merge 1 commit into
Conversation
Place a sentinel in _running_agents immediately after the "already running" guard check passes — before any await. Without this, the numerous await points between the guard (line 1324) and agent registration (track_agent at line 4790) create a window where a second message for the same session can bypass the guard and start a duplicate agent, corrupting the transcript. The await gap includes: hook emissions, vision enrichment (external API call), audio transcription (external API call), session hygiene compression, and the run_in_executor call itself. For messages with media attachments the window can be several seconds wide. The sentinel is wrapped in try/finally so it is always cleaned up — even if the handler raises or takes an early-return path. When the real AIAgent is created, track_agent() overwrites the sentinel with the actual instance (preserving interrupt support). Also handles the edge case where a message arrives while the sentinel is set but no real agent exists yet: the message is queued via the adapter's pending-message mechanism instead of attempting to call interrupt() on the sentinel object.
Contributor
|
Merged via PR #2113 with your commit cherry-picked onto current main (authorship preserved). Added follow-up hardening for two edge cases (/stop during sentinel, shutdown skipping sentinel) with additional tests. Thanks for the thorough analysis and fix! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fix a race condition in
_handle_messagewhere two messages arriving in rapid succession for the same session can both bypass the_running_agentsguard and start duplicate agent instances — corrupting the session transcript.Root Cause
The guard at line 1324 checks
if _quick_key in self._running_agents, but the session key is only registered in_running_agentsmuch later — insidetrack_agent()(line 4790), which polls asynchronously until the agent object is created in a thread pool.Between the guard check and registration, there are 10+
awaitpoints:hooks.emit("session:start")hooks.emit("command:*")_enrich_message_with_vision()_enrich_message_with_transcription()run_in_executor(compression)hooks.emit("agent:start")run_in_executor(run_sync)During any of these yields, the event loop can process a second message for the same session. That message sees
_quick_key not in self._running_agents, passes the guard, and starts a second concurrent agent — both reading the same transcript and both appending results.Consequences:
Fix
Place a sentinel object (
_AGENT_PENDING_SENTINEL) into_running_agentsimmediately after all command dispatch returns (before any await), wrapped intry/finallyfor guaranteed cleanup:The long async message processing path (
_handle_message_with_agent) is extracted into its own method so thetry/finallycleanly wraps the entire flow. Whentrack_agent()fires, it overwrites the sentinel with the realAIAgentinstance — interrupt support works as before.Also handles the edge case where a message arrives while the sentinel is set (agent not yet created): instead of calling
interrupt()on the sentinel, the message is queued via the adapter's pending-message mechanism.How to Reproduce
Test Plan
pytest tests/gateway/test_session_race_guard.py -v— 5 new tests:/help,/status) do not leave stale sentinelspytest tests/gateway/test_telegram_photo_interrupts.py— existing photo interrupt tests passpytest tests/gateway/test_interrupt_key_match.py— existing interrupt key tests passpytest tests/gateway/test_session.py— existing session tests passpytest tests/gateway/test_session_env.py— existing session env tests pass