fix(gateway): prevent concurrent agent runs for the same session via sentinel guard by Gutslabs · Pull Request #2086 · NousResearch/hermes-agent

Gutslabs · 2026-03-19T19:33:37Z

Summary

Fix a race condition in _handle_message where two messages arriving in rapid succession for the same session can both bypass the _running_agents guard and start duplicate agent instances — corrupting the session transcript.

Root Cause

The guard at line 1324 checks if _quick_key in self._running_agents, but the session key is only registered in _running_agents much later — inside track_agent() (line 4790), which polls asynchronously until the agent object is created in a thread pool.

Between the guard check and registration, there are 10+ await points, including:

Await	What it does	Typical latency
`hooks.emit("session:start")`	Hook callbacks	<10ms
`hooks.emit("command:*")`	Command hooks	<10ms
`_enrich_message_with_vision()`	Vision API call	1-5s
`_enrich_message_with_transcription()`	STT API call	1-5s
`run_in_executor(compression)`	Session hygiene	0.5-3s
`hooks.emit("agent:start")`	Pre-agent hooks	<10ms
`run_in_executor(run_sync)`	Agent construction	0.1-1s

During any of these yields, the event loop can process a second message for the same session. That message sees _quick_key not in self._running_agents, passes the guard, and starts a second concurrent agent — both reading the same transcript and both appending results.
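The check-then-await-then-register pattern described above can be reproduced in isolation. The following is a minimal sketch, not the gateway code: `running_agents`, `started`, and `handle_message` are hypothetical stand-ins, and `asyncio.sleep` stands in for the hook, enrichment, and executor awaits.

```python
import asyncio

running_agents = {}   # session key -> agent object (stand-in for _running_agents)
started = []          # records every agent run that actually started


async def handle_message(session_key, text):
    # Guard: a check only. Nothing is registered at this point.
    if session_key in running_agents:
        return "busy"
    # Stand-in for the hook / enrichment / compression awaits.
    # The event loop is free to run the second message here.
    await asyncio.sleep(0.01)
    # Registration happens only after the awaits (the role of track_agent()).
    running_agents[session_key] = object()
    started.append(text)
    try:
        await asyncio.sleep(0.01)  # stand-in for the agent run
        return "done"
    finally:
        running_agents.pop(session_key, None)


async def main():
    # Two messages for the same session, dispatched back to back.
    return await asyncio.gather(
        handle_message("s1", "first"),
        handle_message("s1", "second"),
    )


results = asyncio.run(main())
print(results, started)
```

In this sketch both calls pass the guard, so `started` ends up with two entries for the same session key, which is the duplicate-agent condition described above.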

Consequences:

- Duplicate responses sent to the user
- Transcript corruption (interleaved messages from two agent runs)
- Potential crashes from concurrent file writes to the same JSONL transcript

Fix

Place a sentinel object (_AGENT_PENDING_SENTINEL) into _running_agents immediately after all command dispatch returns (before any await), wrapped in try/finally for guaranteed cleanup:

_AGENT_PENDING_SENTINEL = object()

# In _handle_message, after command handlers return:
self._running_agents[_quick_key] = _AGENT_PENDING_SENTINEL
try:
    return await self._handle_message_with_agent(event, source, _quick_key)
finally:
    if self._running_agents.get(_quick_key) is _AGENT_PENDING_SENTINEL:
        del self._running_agents[_quick_key]

The long async message processing path (_handle_message_with_agent) is extracted into its own method so the try/finally cleanly wraps the entire flow. When track_agent() fires, it overwrites the sentinel with the real AIAgent instance — interrupt support works as before.
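To make the effect of the sentinel concrete, here is a minimal self-contained sketch of the guarded flow. It is an illustration under assumptions, not the gateway implementation: `Gateway`, `FakeAgent`, and `started` are hypothetical, the method signatures are simplified (the real `_handle_message_with_agent` takes `event, source, _quick_key`), and the way the real agent entry is eventually removed is assumed.

```python
import asyncio

_AGENT_PENDING_SENTINEL = object()


class FakeAgent:
    """Hypothetical stand-in for AIAgent."""


class Gateway:
    def __init__(self):
        self._running_agents = {}
        self.started = []

    async def _handle_message(self, key, text):
        if key in self._running_agents:
            return "busy"
        # Claim the session synchronously: no await between check and set.
        self._running_agents[key] = _AGENT_PENDING_SENTINEL
        try:
            return await self._handle_message_with_agent(key, text)
        finally:
            # Remove the entry only if it is still the placeholder.
            if self._running_agents.get(key) is _AGENT_PENDING_SENTINEL:
                del self._running_agents[key]

    async def _handle_message_with_agent(self, key, text):
        await asyncio.sleep(0.01)                 # stand-in for hooks, enrichment
        self._running_agents[key] = FakeAgent()   # what track_agent() would do
        self.started.append(text)
        try:
            await asyncio.sleep(0.01)             # stand-in for the agent run
            return "done"
        finally:
            # Assumed cleanup of the real agent entry after the run.
            self._running_agents.pop(key, None)


async def main():
    gw = Gateway()
    results = await asyncio.gather(
        gw._handle_message("s1", "first"),
        gw._handle_message("s1", "second"),
    )
    return gw, results


gateway, results = asyncio.run(main())
print(results, gateway.started)
```

Because the sentinel is written before the first `await`, the second call sees the key and is turned away. The identity check in `finally` is what keeps the cleanup from deleting a real agent that has already replaced the placeholder.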

Also handles the edge case where a message arrives while the sentinel is set (agent not yet created): instead of calling interrupt() on the sentinel, the message is queued via the adapter's pending-message mechanism.
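The three cases a follow-up message can hit (no run, pending sentinel, real agent) can be sketched as a small routing function. This is a hedged illustration: `route_followup`, `pending_messages`, and `FakeAgent.interrupt(text)` are hypothetical names and signatures, and the plain dict only stands in for the adapter's pending-message mechanism.

```python
_AGENT_PENDING_SENTINEL = object()


class FakeAgent:
    """Hypothetical stand-in for AIAgent with an assumed interrupt(text) method."""

    def __init__(self):
        self.interrupts = []

    def interrupt(self, text):
        self.interrupts.append(text)


def route_followup(running_agents, pending_messages, key, text):
    """Decide what to do with a new message for a session key."""
    current = running_agents.get(key)
    if current is None:
        return "start"  # no run in flight: the caller may start one
    if current is _AGENT_PENDING_SENTINEL:
        # No agent object exists yet, so there is nothing to interrupt.
        pending_messages.setdefault(key, []).append(text)
        return "queued"
    current.interrupt(text)  # real agent: interrupt as before
    return "interrupted"


agents = {"s1": _AGENT_PENDING_SENTINEL}
pending = {}
first = route_followup(agents, pending, "s1", "hello")

agent = FakeAgent()
agents["s1"] = agent  # the sentinel has been overwritten by the real agent
second = route_followup(agents, pending, "s1", "wait")
third = route_followup(agents, pending, "s2", "hi")
print(first, second, third)
```

Comparing with `is` rather than `==` matters here: the sentinel is a bare `object()`, so identity is the only reliable way to distinguish it from a real agent.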

How to Reproduce

1. Run the gateway with any platform (Telegram, Discord, etc.)
2. Send two messages in rapid succession (within ~100ms)
3. If the first message triggers vision/STT enrichment or session hygiene compression, the window widens to several seconds
4. Both messages start separate agent runs for the same session
5. User receives duplicate (often contradictory) responses; transcript is corrupted

Test Plan

pytest tests/gateway/test_session_race_guard.py -v — 5 new tests:
- Sentinel is placed before any await in the agent setup path
- Sentinel is cleaned up after normal completion
- Sentinel is cleaned up on exception (no permanent session lockout)
- Second message during sentinel is queued, not duplicated
- Command messages (/help, /status) do not leave stale sentinels
pytest tests/gateway/test_telegram_photo_interrupts.py — existing photo interrupt tests pass
pytest tests/gateway/test_interrupt_key_match.py — existing interrupt key tests pass
pytest tests/gateway/test_session.py — existing session tests pass
pytest tests/gateway/test_session_env.py — existing session env tests pass
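As an illustration of what the cleanup tests in the list above might check, here is a minimal sketch. It does not reproduce `tests/gateway/test_session_race_guard.py`: `guarded` is a hypothetical, reduced copy of the sentinel guard, and the tests use plain `asyncio.run` rather than whatever async test plugin the repository actually uses.

```python
import asyncio

_AGENT_PENDING_SENTINEL = object()


async def guarded(running_agents, key, work):
    """Reduced copy of the sentinel guard; `work` is a hypothetical coroutine function."""
    running_agents[key] = _AGENT_PENDING_SENTINEL
    try:
        return await work()
    finally:
        if running_agents.get(key) is _AGENT_PENDING_SENTINEL:
            del running_agents[key]


def test_sentinel_cleaned_up_after_success():
    running_agents = {}

    async def ok():
        # The sentinel must be visible while the handler is still running.
        assert running_agents["s1"] is _AGENT_PENDING_SENTINEL
        return "done"

    assert asyncio.run(guarded(running_agents, "s1", ok)) == "done"
    assert "s1" not in running_agents


def test_sentinel_cleaned_up_on_exception():
    running_agents = {}

    async def boom():
        raise RuntimeError("agent setup failed")

    try:
        asyncio.run(guarded(running_agents, "s1", boom))
    except RuntimeError:
        pass
    else:
        raise AssertionError("expected RuntimeError")
    # No permanent session lockout: the key is free again.
    assert "s1" not in running_agents
```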

Place a sentinel in _running_agents immediately after the "already running" guard check passes, before any await. Without this, the numerous await points between the guard (line 1324) and agent registration (track_agent at line 4790) create a window where a second message for the same session can bypass the guard and start a duplicate agent, corrupting the transcript.

The await gap includes: hook emissions, vision enrichment (external API call), audio transcription (external API call), session hygiene compression, and the run_in_executor call itself. For messages with media attachments the window can be several seconds wide.

The sentinel is wrapped in try/finally so it is always cleaned up, even if the handler raises or takes an early-return path. When the real AIAgent is created, track_agent() overwrites the sentinel with the actual instance (preserving interrupt support).

Also handles the edge case where a message arrives while the sentinel is set but no real agent exists yet: the message is queued via the adapter's pending-message mechanism instead of attempting to call interrupt() on the sentinel object.

teknium1 · 2026-03-20T01:27:00Z

Merged via PR #2113 with your commit cherry-picked onto current main (authorship preserved). Added follow-up hardening for two edge cases (/stop during sentinel, shutdown skipping sentinel) with additional tests. Thanks for the thorough analysis and fix!

teknium1 mentioned this pull request Mar 20, 2026

feat: optional FastMCP skill + fix: gateway session race guard #2113

Merged

teknium1 closed this Mar 20, 2026

Gutslabs wants to merge 1 commit into NousResearch:main from Gutslabs:fix/gateway-session-race-guard
