fix(gateway): prevent concurrent agent runs for the same session via sentinel guard #2086

Closed
Gutslabs wants to merge 1 commit into NousResearch:main from Gutslabs:fix/gateway-session-race-guard

@Gutslabs

Summary

Fix a race condition in _handle_message where two messages arriving in rapid succession for the same session can both bypass the _running_agents guard and start duplicate agent instances — corrupting the session transcript.

Root Cause

The guard at line 1324 checks if _quick_key in self._running_agents, but the session key is only registered in _running_agents much later — inside track_agent() (line 4790), which polls asynchronously until the agent object is created in a thread pool.

Between the guard check and registration, there are 10+ await points, including:

| Await | What it does | Typical latency |
|---|---|---|
| hooks.emit("session:start") | Hook callbacks | <10ms |
| hooks.emit("command:*") | Command hooks | <10ms |
| _enrich_message_with_vision() | Vision API call | 1-5s |
| _enrich_message_with_transcription() | STT API call | 1-5s |
| run_in_executor(compression) | Session hygiene | 0.5-3s |
| hooks.emit("agent:start") | Pre-agent hooks | <10ms |
| run_in_executor(run_sync) | Agent construction | 0.1-1s |

During any of these yields, the event loop can process a second message for the same session. That message sees _quick_key not in self._running_agents, passes the guard, and starts a second concurrent agent — both reading the same transcript and both appending results.
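
The window described above can be reproduced in isolation. The following is a minimal sketch, not the gateway code: `running_agents`, `started`, and `handle_message` are hypothetical stand-ins, and the `asyncio.sleep` calls only simulate the hook and enrichment awaits.

```python
import asyncio

running_agents = {}  # hypothetical stand-in for self._running_agents
started = []         # records every simulated agent run that actually began


async def handle_message(session_key, text):
    # Guard check: nothing is registered yet, so this passes for both messages.
    if session_key in running_agents:
        return "busy"
    # Simulated await gap (hooks, vision/STT enrichment, compression).
    await asyncio.sleep(0.01)
    # Late registration, analogous to what track_agent() does.
    running_agents[session_key] = object()
    started.append(text)
    await asyncio.sleep(0.01)  # simulated agent work
    running_agents.pop(session_key, None)
    return "done"


async def main():
    # Two messages for the same session arrive back to back.
    return await asyncio.gather(
        handle_message("s1", "first"),
        handle_message("s1", "second"),
    )


results = asyncio.run(main())
```

In this sketch both calls pass the guard and both simulated agents start, which is the duplicate-run behavior reported in this PR.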

Consequences:

  • Duplicate responses sent to the user
  • Transcript corruption (interleaved messages from two agent runs)
  • Potential crashes from concurrent file writes to the same JSONL transcript

Fix

Place a sentinel object (_AGENT_PENDING_SENTINEL) into _running_agents immediately after all command dispatch returns (before any await), wrapped in try/finally for guaranteed cleanup:

_AGENT_PENDING_SENTINEL = object()

# In _handle_message, after command handlers return:
self._running_agents[_quick_key] = _AGENT_PENDING_SENTINEL
try:
    return await self._handle_message_with_agent(event, source, _quick_key)
finally:
    if self._running_agents.get(_quick_key) is _AGENT_PENDING_SENTINEL:
        del self._running_agents[_quick_key]

The long async message processing path (_handle_message_with_agent) is extracted into its own method so the try/finally cleanly wraps the entire flow. When track_agent() fires, it overwrites the sentinel with the real AIAgent instance — interrupt support works as before.
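
For reviewers who want to see the pattern end to end, here is a self-contained sketch of the same idea outside the gateway. The `Gateway` class, the `started` list, and the way the real agent is removed after the run are assumptions made for illustration; only the sentinel placement and the `try`/`finally` cleanup mirror the change in this PR.

```python
import asyncio

_AGENT_PENDING_SENTINEL = object()


class Gateway:
    """Hypothetical, heavily reduced stand-in for the real gateway class."""

    def __init__(self):
        self._running_agents = {}
        self.started = []

    async def _handle_message(self, key, text):
        if key in self._running_agents:
            return "busy"
        # Claim the session synchronously: there is no await between the
        # guard check above and this assignment.
        self._running_agents[key] = _AGENT_PENDING_SENTINEL
        try:
            return await self._handle_message_with_agent(key, text)
        finally:
            # Remove only our own placeholder, never a real agent.
            if self._running_agents.get(key) is _AGENT_PENDING_SENTINEL:
                del self._running_agents[key]

    async def _handle_message_with_agent(self, key, text):
        await asyncio.sleep(0.01)          # simulated enrichment awaits
        agent = object()                   # simulated AIAgent instance
        self._running_agents[key] = agent  # stands in for track_agent()
        self.started.append(text)
        try:
            await asyncio.sleep(0.01)      # simulated agent run
            return "done"
        finally:
            # Assumption: the real agent is unregistered when the run ends.
            if self._running_agents.get(key) is agent:
                del self._running_agents[key]


async def main(gw):
    return await asyncio.gather(
        gw._handle_message("s1", "first"),
        gw._handle_message("s1", "second"),
    )


gw = Gateway()
results = asyncio.run(main(gw))
```

With the sentinel in place, the second call in this sketch sees the key already claimed and returns early, so only one simulated agent starts.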

Also handles the edge case where a message arrives while the sentinel is set (agent not yet created): instead of calling interrupt() on the sentinel, the message is queued via the adapter's pending-message mechanism.

How to Reproduce

  1. Run the gateway with any platform (Telegram, Discord, etc.)
  2. Send two messages in rapid succession (within ~100ms)
  3. If the first message triggers vision/STT enrichment or session hygiene compression, the window widens to several seconds
  4. Both messages start separate agent runs for the same session
  5. User receives duplicate (often contradictory) responses; transcript is corrupted

Test Plan

  • pytest tests/gateway/test_session_race_guard.py -v — 5 new tests:
    • Sentinel is placed before any await in the agent setup path
    • Sentinel is cleaned up after normal completion
    • Sentinel is cleaned up on exception (no permanent session lockout)
    • Second message during sentinel is queued, not duplicated
    • Command messages (/help, /status) do not leave stale sentinels
  • pytest tests/gateway/test_telegram_photo_interrupts.py — existing photo interrupt tests pass
  • pytest tests/gateway/test_interrupt_key_match.py — existing interrupt key tests pass
  • pytest tests/gateway/test_session.py — existing session tests pass
  • pytest tests/gateway/test_session_env.py — existing session env tests pass
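
As an illustration of the "cleaned up on exception" case, a check along these lines can be written without any gateway dependencies. This is a sketch under assumptions, not the contents of test_session_race_guard.py: `guarded_run`, `failing_body`, and `check_cleanup_on_exception` are hypothetical helpers.

```python
import asyncio

_AGENT_PENDING_SENTINEL = object()


async def guarded_run(running_agents, key, body):
    # Same shape as the fix: claim the key, run the body, always clean up.
    running_agents[key] = _AGENT_PENDING_SENTINEL
    try:
        return await body()
    finally:
        if running_agents.get(key) is _AGENT_PENDING_SENTINEL:
            del running_agents[key]


async def failing_body():
    await asyncio.sleep(0)  # yield once, as a real handler would
    raise RuntimeError("simulated enrichment failure")


def check_cleanup_on_exception():
    running = {}
    try:
        asyncio.run(guarded_run(running, "s1", failing_body))
    except RuntimeError:
        pass  # expected: the error propagates to the caller
    else:
        raise AssertionError("expected RuntimeError")
    # The sentinel must be gone, otherwise the session stays locked out.
    return running
```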

Place a sentinel in _running_agents immediately after the "already
running" guard check passes — before any await.  Without this, the
numerous await points between the guard (line 1324) and agent
registration (track_agent at line 4790) create a window where a
second message for the same session can bypass the guard and start
a duplicate agent, corrupting the transcript.

The await gap includes: hook emissions, vision enrichment (external
API call), audio transcription (external API call), session hygiene
compression, and the run_in_executor call itself.  For messages with
media attachments the window can be several seconds wide.

The sentinel is wrapped in try/finally so it is always cleaned up —
even if the handler raises or takes an early-return path.  When the
real AIAgent is created, track_agent() overwrites the sentinel with
the actual instance (preserving interrupt support).

Also handles the edge case where a message arrives while the sentinel
is set but no real agent exists yet: the message is queued via the
adapter's pending-message mechanism instead of attempting to call
interrupt() on the sentinel object.
@teknium1

Merged via PR #2113 with your commit cherry-picked onto current main (authorship preserved). Added follow-up hardening for two edge cases (/stop during sentinel, shutdown skipping sentinel) with additional tests. Thanks for the thorough analysis and fix!

@teknium1 closed this on Mar 20, 2026