Skip to content

fix(gateway): bypass active-session guard for /approve and /deny commands#4926

Merged
teknium1 merged 1 commit into
mainfrom
hermes/hermes-a82207c8
Apr 4, 2026
Merged

fix(gateway): bypass active-session guard for /approve and /deny commands#4926
teknium1 merged 1 commit into
mainfrom
hermes/hermes-a82207c8

Conversation

@teknium1

@teknium1 teknium1 commented Apr 4, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes a deadlock where /approve and /deny commands on Telegram (and all gateway platforms) are silently queued while the agent is waiting for approval, creating a circular dependency that times out.

Based on the diagnosis from PR #4904 by @mechovation (excellent root cause analysis in issue #4898). Reimplemented with a cleaner dispatch approach.

The deadlock

  1. Agent hits dangerous command → blocks on threading.Event.wait() in tools/approval.py
  2. User sends /approve via Telegram
  3. Base adapter sees session_key in self._active_sessions → queues in _pending_messages, returns
  4. Queued message waits for agent to finish → agent waits for approval → deadlock

The fix

/approve and /deny now bypass the active-session guard and dispatch directly to self._message_handler(event). This reaches gateway/run.py's _handle_approve_command() which calls resolve_gateway_approval() and unblocks the agent thread.

Key design choice: dispatches directly via await self._message_handler(event) instead of spawning a second _process_message_background task. The original PR (#4904) used _process_message_background which would create a competing background task that manages session lifecycle — its finally block removes the session guard, opening a race window where new messages could spawn duplicate agent tasks.

E2E verified

  • /approve dispatched to handler while session active, response sent, guard intact
  • /deny dispatched, guard intact
  • Normal text still queued (not bypassed)
  • /approve@botname suffix stripped correctly
  • Handler errors caught gracefully, guard survives
  • /stop and other commands still queued (only approve/deny bypass)

Test results

  • 1,866 gateway tests passed (5 pre-existing failures in signal/DAVE)

Fixes #4898, closes #4904

…ands

The base adapter's active-session guard queues all messages when an agent
is running. This creates a deadlock for /approve and /deny: the agent
thread is blocked on threading.Event.wait() in tools/approval.py waiting
for resolve_gateway_approval(), but the /approve command is queued waiting
for the agent to finish.

Dispatch /approve and /deny directly to the message handler (which routes
to gateway/run.py's _handle_approve_command) without going through
_process_message_background — avoids spawning a competing background task
that would mess with session lifecycle/guards.

Fixes #4898
Co-authored-by: mechovation (original diagnosis in PR #4904)
@teknium1 teknium1 merged commit fb68c22 into main Apr 4, 2026
3 of 4 checks passed
Tommyeds pushed a commit to Tommyeds/hermes-agent that referenced this pull request Apr 12, 2026
…ands (NousResearch#4926)

The base adapter's active-session guard queues all messages when an agent
is running. This creates a deadlock for /approve and /deny: the agent
thread is blocked on threading.Event.wait() in tools/approval.py waiting
for resolve_gateway_approval(), but the /approve command is queued waiting
for the agent to finish.

Dispatch /approve and /deny directly to the message handler (which routes
to gateway/run.py's _handle_approve_command) without going through
_process_message_background — avoids spawning a competing background task
that would mess with session lifecycle/guards.

Fixes NousResearch#4898
Co-authored-by: mechovation (original diagnosis in PR NousResearch#4904)
angelburgosrosado pushed a commit to angelburgosrosado/hermes-agent that referenced this pull request Apr 27, 2026
…ands (NousResearch#4926)

The base adapter's active-session guard queues all messages when an agent
is running. This creates a deadlock for /approve and /deny: the agent
thread is blocked on threading.Event.wait() in tools/approval.py waiting
for resolve_gateway_approval(), but the /approve command is queued waiting
for the agent to finish.

Dispatch /approve and /deny directly to the message handler (which routes
to gateway/run.py's _handle_approve_command) without going through
_process_message_background — avoids spawning a competing background task
that would mess with session lifecycle/guards.

Fixes NousResearch#4898
Co-authored-by: mechovation (original diagnosis in PR NousResearch#4904)
teknium1 added a commit that referenced this pull request May 12, 2026
…gram

The clarify tool returned 'not available in this execution context' for
every gateway-mode agent because gateway/run.py never passed
clarify_callback into the AIAgent constructor. Schema actively encouraged
calling it; users never saw the question.

Changes:

- tools/clarify_gateway.py — new event-based primitive mirroring
  tools/approval.py: register/wait_for_response/resolve_gateway_clarify
  with per-session FIFO, threading.Event blocking with 1s heartbeat
  slices (so the inactivity watchdog keeps ticking), and
  clear_session for boundary cleanup.

- gateway/platforms/base.py — abstract send_clarify with a numbered-text
  fallback so every adapter (Discord, Slack, WhatsApp, Signal, Matrix,
  etc.) gets a working clarify out of the box. Plus an active-session
  bypass: when the agent is blocked on a text-awaiting clarify, the next
  non-command message routes inline to the runner's intercept instead
  of being queued + triggering an interrupt. Same shape as the /approve
  deadlock fix from PR #4926.

- gateway/platforms/telegram.py — concrete send_clarify renders one
  inline button per choice plus '✏️ Other (type answer)'. cl: callback
  handler resolves numeric choices immediately, flips to text-capture
  mode for Other, with the same authorization guards as exec/slash
  approvals.

- gateway/run.py — clarify_callback wired at the cached-agent per-turn
  callback assignment site (only the user-facing agent path; cron and
  hygiene-compress agents have no human attached). Bridges sync→async
  via run_coroutine_threadsafe, blocks with the configured timeout, and
  returns a '[user did not respond within Xm]' sentinel on timeout so
  the agent adapts rather than pinning the running-agent guard. Text-
  intercept added to _handle_message before slash-confirm intercept
  (skipping slash commands). clear_session called in the run's finally
  to cancel any orphan entries.

- hermes_cli/config.py — agent.clarify_timeout default 600s.

- website/docs/user-guide/messaging/telegram.md — Interactive Prompts
  section.

Tests:

- tests/tools/test_clarify_gateway.py (14 tests) — full primitive
  coverage: button resolve, open-ended auto-await, Other flip, timeout
  None, unknown-id idempotency, clear_session cancellation, FIFO
  ordering, register/unregister notify, config default.

- tests/gateway/test_telegram_clarify_buttons.py (12 tests) — render
  paths (multi-choice/open-ended/long-label/HTML-escape/not-connected),
  callback dispatch (numeric resolve/Other flip/already-resolved/
  unauthorized/invalid-token), and base-adapter text fallback.

Out of scope: bot-to-bot, guest mode, checklists, poll media, live
photos. Closes #24191.
teknium1 added a commit that referenced this pull request May 12, 2026
…gram (#24199)

The clarify tool returned 'not available in this execution context' for
every gateway-mode agent because gateway/run.py never passed
clarify_callback into the AIAgent constructor. Schema actively encouraged
calling it; users never saw the question.

Changes:

- tools/clarify_gateway.py — new event-based primitive mirroring
  tools/approval.py: register/wait_for_response/resolve_gateway_clarify
  with per-session FIFO, threading.Event blocking with 1s heartbeat
  slices (so the inactivity watchdog keeps ticking), and
  clear_session for boundary cleanup.

- gateway/platforms/base.py — abstract send_clarify with a numbered-text
  fallback so every adapter (Discord, Slack, WhatsApp, Signal, Matrix,
  etc.) gets a working clarify out of the box. Plus an active-session
  bypass: when the agent is blocked on a text-awaiting clarify, the next
  non-command message routes inline to the runner's intercept instead
  of being queued + triggering an interrupt. Same shape as the /approve
  deadlock fix from PR #4926.

- gateway/platforms/telegram.py — concrete send_clarify renders one
  inline button per choice plus '✏️ Other (type answer)'. cl: callback
  handler resolves numeric choices immediately, flips to text-capture
  mode for Other, with the same authorization guards as exec/slash
  approvals.

- gateway/run.py — clarify_callback wired at the cached-agent per-turn
  callback assignment site (only the user-facing agent path; cron and
  hygiene-compress agents have no human attached). Bridges sync→async
  via run_coroutine_threadsafe, blocks with the configured timeout, and
  returns a '[user did not respond within Xm]' sentinel on timeout so
  the agent adapts rather than pinning the running-agent guard. Text-
  intercept added to _handle_message before slash-confirm intercept
  (skipping slash commands). clear_session called in the run's finally
  to cancel any orphan entries.

- hermes_cli/config.py — agent.clarify_timeout default 600s.

- website/docs/user-guide/messaging/telegram.md — Interactive Prompts
  section.

Tests:

- tests/tools/test_clarify_gateway.py (14 tests) — full primitive
  coverage: button resolve, open-ended auto-await, Other flip, timeout
  None, unknown-id idempotency, clear_session cancellation, FIFO
  ordering, register/unregister notify, config default.

- tests/gateway/test_telegram_clarify_buttons.py (12 tests) — render
  paths (multi-choice/open-ended/long-label/HTML-escape/not-connected),
  callback dispatch (numeric resolve/Other flip/already-resolved/
  unauthorized/invalid-token), and base-adapter text fallback.

Out of scope: bot-to-bot, guest mode, checklists, poll media, live
photos. Closes #24191.
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
…ands (NousResearch#4926)

The base adapter's active-session guard queues all messages when an agent
is running. This creates a deadlock for /approve and /deny: the agent
thread is blocked on threading.Event.wait() in tools/approval.py waiting
for resolve_gateway_approval(), but the /approve command is queued waiting
for the agent to finish.

Dispatch /approve and /deny directly to the message handler (which routes
to gateway/run.py's _handle_approve_command) without going through
_process_message_background — avoids spawning a competing background task
that would mess with session lifecycle/guards.

Fixes NousResearch#4898
Co-authored-by: mechovation (original diagnosis in PR NousResearch#4904)
sunJose pushed a commit to sunJose/hermes-agent that referenced this pull request May 14, 2026
…gram (NousResearch#24199)

The clarify tool returned 'not available in this execution context' for
every gateway-mode agent because gateway/run.py never passed
clarify_callback into the AIAgent constructor. Schema actively encouraged
calling it; users never saw the question.

Changes:

- tools/clarify_gateway.py — new event-based primitive mirroring
  tools/approval.py: register/wait_for_response/resolve_gateway_clarify
  with per-session FIFO, threading.Event blocking with 1s heartbeat
  slices (so the inactivity watchdog keeps ticking), and
  clear_session for boundary cleanup.

- gateway/platforms/base.py — abstract send_clarify with a numbered-text
  fallback so every adapter (Discord, Slack, WhatsApp, Signal, Matrix,
  etc.) gets a working clarify out of the box. Plus an active-session
  bypass: when the agent is blocked on a text-awaiting clarify, the next
  non-command message routes inline to the runner's intercept instead
  of being queued + triggering an interrupt. Same shape as the /approve
  deadlock fix from PR NousResearch#4926.

- gateway/platforms/telegram.py — concrete send_clarify renders one
  inline button per choice plus '✏️ Other (type answer)'. cl: callback
  handler resolves numeric choices immediately, flips to text-capture
  mode for Other, with the same authorization guards as exec/slash
  approvals.

- gateway/run.py — clarify_callback wired at the cached-agent per-turn
  callback assignment site (only the user-facing agent path; cron and
  hygiene-compress agents have no human attached). Bridges sync→async
  via run_coroutine_threadsafe, blocks with the configured timeout, and
  returns a '[user did not respond within Xm]' sentinel on timeout so
  the agent adapts rather than pinning the running-agent guard. Text-
  intercept added to _handle_message before slash-confirm intercept
  (skipping slash commands). clear_session called in the run's finally
  to cancel any orphan entries.

- hermes_cli/config.py — agent.clarify_timeout default 600s.

- website/docs/user-guide/messaging/telegram.md — Interactive Prompts
  section.

Tests:

- tests/tools/test_clarify_gateway.py (14 tests) — full primitive
  coverage: button resolve, open-ended auto-await, Other flip, timeout
  None, unknown-id idempotency, clear_session cancellation, FIFO
  ordering, register/unregister notify, config default.

- tests/gateway/test_telegram_clarify_buttons.py (12 tests) — render
  paths (multi-choice/open-ended/long-label/HTML-escape/not-connected),
  callback dispatch (numeric resolve/Other flip/already-resolved/
  unauthorized/invalid-token), and base-adapter text fallback.

Out of scope: bot-to-bot, guest mode, checklists, poll media, live
photos. Closes NousResearch#24191.
jsboige pushed a commit to jsboige/hermes-agent that referenced this pull request May 14, 2026
…gram (NousResearch#24199)

The clarify tool returned 'not available in this execution context' for
every gateway-mode agent because gateway/run.py never passed
clarify_callback into the AIAgent constructor. Schema actively encouraged
calling it; users never saw the question.

Changes:

- tools/clarify_gateway.py — new event-based primitive mirroring
  tools/approval.py: register/wait_for_response/resolve_gateway_clarify
  with per-session FIFO, threading.Event blocking with 1s heartbeat
  slices (so the inactivity watchdog keeps ticking), and
  clear_session for boundary cleanup.

- gateway/platforms/base.py — abstract send_clarify with a numbered-text
  fallback so every adapter (Discord, Slack, WhatsApp, Signal, Matrix,
  etc.) gets a working clarify out of the box. Plus an active-session
  bypass: when the agent is blocked on a text-awaiting clarify, the next
  non-command message routes inline to the runner's intercept instead
  of being queued + triggering an interrupt. Same shape as the /approve
  deadlock fix from PR NousResearch#4926.

- gateway/platforms/telegram.py — concrete send_clarify renders one
  inline button per choice plus '✏️ Other (type answer)'. cl: callback
  handler resolves numeric choices immediately, flips to text-capture
  mode for Other, with the same authorization guards as exec/slash
  approvals.

- gateway/run.py — clarify_callback wired at the cached-agent per-turn
  callback assignment site (only the user-facing agent path; cron and
  hygiene-compress agents have no human attached). Bridges sync→async
  via run_coroutine_threadsafe, blocks with the configured timeout, and
  returns a '[user did not respond within Xm]' sentinel on timeout so
  the agent adapts rather than pinning the running-agent guard. Text-
  intercept added to _handle_message before slash-confirm intercept
  (skipping slash commands). clear_session called in the run's finally
  to cancel any orphan entries.

- hermes_cli/config.py — agent.clarify_timeout default 600s.

- website/docs/user-guide/messaging/telegram.md — Interactive Prompts
  section.

Tests:

- tests/tools/test_clarify_gateway.py (14 tests) — full primitive
  coverage: button resolve, open-ended auto-await, Other flip, timeout
  None, unknown-id idempotency, clear_session cancellation, FIFO
  ordering, register/unregister notify, config default.

- tests/gateway/test_telegram_clarify_buttons.py (12 tests) — render
  paths (multi-choice/open-ended/long-label/HTML-escape/not-connected),
  callback dispatch (numeric resolve/Other flip/already-resolved/
  unauthorized/invalid-token), and base-adapter text fallback.

Out of scope: bot-to-bot, guest mode, checklists, poll media, live
photos. Closes NousResearch#24191.
olympus-terminal pushed a commit to olympus-terminal/hermes-agent that referenced this pull request May 16, 2026
…ands (NousResearch#4926)

The base adapter's active-session guard queues all messages when an agent
is running. This creates a deadlock for /approve and /deny: the agent
thread is blocked on threading.Event.wait() in tools/approval.py waiting
for resolve_gateway_approval(), but the /approve command is queued waiting
for the agent to finish.

Dispatch /approve and /deny directly to the message handler (which routes
to gateway/run.py's _handle_approve_command) without going through
_process_message_background — avoids spawning a competing background task
that would mess with session lifecycle/guards.

Fixes NousResearch#4898
Co-authored-by: mechovation (original diagnosis in PR NousResearch#4904)
AlexFoxD pushed a commit to AlexFoxD/hermes-agent that referenced this pull request May 21, 2026
…gram (NousResearch#24199)

The clarify tool returned 'not available in this execution context' for
every gateway-mode agent because gateway/run.py never passed
clarify_callback into the AIAgent constructor. Schema actively encouraged
calling it; users never saw the question.

Changes:

- tools/clarify_gateway.py — new event-based primitive mirroring
  tools/approval.py: register/wait_for_response/resolve_gateway_clarify
  with per-session FIFO, threading.Event blocking with 1s heartbeat
  slices (so the inactivity watchdog keeps ticking), and
  clear_session for boundary cleanup.

- gateway/platforms/base.py — abstract send_clarify with a numbered-text
  fallback so every adapter (Discord, Slack, WhatsApp, Signal, Matrix,
  etc.) gets a working clarify out of the box. Plus an active-session
  bypass: when the agent is blocked on a text-awaiting clarify, the next
  non-command message routes inline to the runner's intercept instead
  of being queued + triggering an interrupt. Same shape as the /approve
  deadlock fix from PR NousResearch#4926.

- gateway/platforms/telegram.py — concrete send_clarify renders one
  inline button per choice plus '✏️ Other (type answer)'. cl: callback
  handler resolves numeric choices immediately, flips to text-capture
  mode for Other, with the same authorization guards as exec/slash
  approvals.

- gateway/run.py — clarify_callback wired at the cached-agent per-turn
  callback assignment site (only the user-facing agent path; cron and
  hygiene-compress agents have no human attached). Bridges sync→async
  via run_coroutine_threadsafe, blocks with the configured timeout, and
  returns a '[user did not respond within Xm]' sentinel on timeout so
  the agent adapts rather than pinning the running-agent guard. Text-
  intercept added to _handle_message before slash-confirm intercept
  (skipping slash commands). clear_session called in the run's finally
  to cancel any orphan entries.

- hermes_cli/config.py — agent.clarify_timeout default 600s.

- website/docs/user-guide/messaging/telegram.md — Interactive Prompts
  section.

Tests:

- tests/tools/test_clarify_gateway.py (14 tests) — full primitive
  coverage: button resolve, open-ended auto-await, Other flip, timeout
  None, unknown-id idempotency, clear_session cancellation, FIFO
  ordering, register/unregister notify, config default.

- tests/gateway/test_telegram_clarify_buttons.py (12 tests) — render
  paths (multi-choice/open-ended/long-label/HTML-escape/not-connected),
  callback dispatch (numeric resolve/Other flip/already-resolved/
  unauthorized/invalid-token), and base-adapter text fallback.

Out of scope: bot-to-bot, guest mode, checklists, poll media, live
photos. Closes NousResearch#24191.
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…ands (NousResearch#4926)

The base adapter's active-session guard queues all messages when an agent
is running. This creates a deadlock for /approve and /deny: the agent
thread is blocked on threading.Event.wait() in tools/approval.py waiting
for resolve_gateway_approval(), but the /approve command is queued waiting
for the agent to finish.

Dispatch /approve and /deny directly to the message handler (which routes
to gateway/run.py's _handle_approve_command) without going through
_process_message_background — avoids spawning a competing background task
that would mess with session lifecycle/guards.

Fixes NousResearch#4898
Co-authored-by: mechovation (original diagnosis in PR NousResearch#4904)
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…gram (NousResearch#24199)

The clarify tool returned 'not available in this execution context' for
every gateway-mode agent because gateway/run.py never passed
clarify_callback into the AIAgent constructor. Schema actively encouraged
calling it; users never saw the question.

Changes:

- tools/clarify_gateway.py — new event-based primitive mirroring
  tools/approval.py: register/wait_for_response/resolve_gateway_clarify
  with per-session FIFO, threading.Event blocking with 1s heartbeat
  slices (so the inactivity watchdog keeps ticking), and
  clear_session for boundary cleanup.

- gateway/platforms/base.py — abstract send_clarify with a numbered-text
  fallback so every adapter (Discord, Slack, WhatsApp, Signal, Matrix,
  etc.) gets a working clarify out of the box. Plus an active-session
  bypass: when the agent is blocked on a text-awaiting clarify, the next
  non-command message routes inline to the runner's intercept instead
  of being queued + triggering an interrupt. Same shape as the /approve
  deadlock fix from PR NousResearch#4926.

- gateway/platforms/telegram.py — concrete send_clarify renders one
  inline button per choice plus '✏️ Other (type answer)'. cl: callback
  handler resolves numeric choices immediately, flips to text-capture
  mode for Other, with the same authorization guards as exec/slash
  approvals.

- gateway/run.py — clarify_callback wired at the cached-agent per-turn
  callback assignment site (only the user-facing agent path; cron and
  hygiene-compress agents have no human attached). Bridges sync→async
  via run_coroutine_threadsafe, blocks with the configured timeout, and
  returns a '[user did not respond within Xm]' sentinel on timeout so
  the agent adapts rather than pinning the running-agent guard. Text-
  intercept added to _handle_message before slash-confirm intercept
  (skipping slash commands). clear_session called in the run's finally
  to cancel any orphan entries.

- hermes_cli/config.py — agent.clarify_timeout default 600s.

- website/docs/user-guide/messaging/telegram.md — Interactive Prompts
  section.

Tests:

- tests/tools/test_clarify_gateway.py (14 tests) — full primitive
  coverage: button resolve, open-ended auto-await, Other flip, timeout
  None, unknown-id idempotency, clear_session cancellation, FIFO
  ordering, register/unregister notify, config default.

- tests/gateway/test_telegram_clarify_buttons.py (12 tests) — render
  paths (multi-choice/open-ended/long-label/HTML-escape/not-connected),
  callback dispatch (numeric resolve/Other flip/already-resolved/
  unauthorized/invalid-token), and base-adapter text fallback.

Out of scope: bot-to-bot, guest mode, checklists, poll media, live
photos. Closes NousResearch#24191.
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
…ands (NousResearch#4926)

The base adapter's active-session guard queues all messages when an agent
is running. This creates a deadlock for /approve and /deny: the agent
thread is blocked on threading.Event.wait() in tools/approval.py waiting
for resolve_gateway_approval(), but the /approve command is queued waiting
for the agent to finish.

Dispatch /approve and /deny directly to the message handler (which routes
to gateway/run.py's _handle_approve_command) without going through
_process_message_background — avoids spawning a competing background task
that would mess with session lifecycle/guards.

Fixes NousResearch#4898
Co-authored-by: mechovation (original diagnosis in PR NousResearch#4904)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Telegram /approve and /deny Commands Silently Drop During Active Agent Runs

1 participant