Skip to content

fix(security): restore approval/sudo context in execute_code RPC threads + guard entry points#34131

Closed
banditburai wants to merge 4 commits into
NousResearch:mainfrom
banditburai:fix/execute-code-approval-cluster
Closed

fix(security): restore approval/sudo context in execute_code RPC threads + guard entry points#34131
banditburai wants to merge 4 commits into
NousResearch:mainfrom
banditburai:fix/execute-code-approval-cluster

Conversation

@banditburai

Copy link
Copy Markdown
Contributor

Closes #4146 · Closes #27303 · Closes #33057 · Closes #30882 · supersedes #27304, #30893, #33246

execute_code's RPC worker threads were raw threading.Threads — they began with an empty context and dropped both the HERMES_SESSION_PLATFORM ContextVar (routing gateway sessions into check_dangerous_command's non-interactive auto-approve branch) and the thread-local approval/sudo callbacks (so the CLI prompt couldn't reach the user). Dangerous commands a sandbox script issued therefore ran without approval.

The approval gate is a documented heuristic in SECURITY.md, not a security boundary; this restores its intended behavior across execution entry points — it doesn't add one.

Fixes

Issue Vector Fix
#4146 sandbox terminal() skips the approval guard guard fires once context is restored; terminal stays in the allowlist
#27303 broad HERMES_* env passthrough leaks config + unguarded terminal() env scrub → operational allowlist + DSN/WEBHOOK; guard restored
#30882 gateway manual-approval silently bypassed, incl. via execute_code context propagation + fail-closed entry guard
#33057 approval/session context lost in execute_code RPC threads shared context+callback propagation into both RPC threads

How

  • tools/thread_context.py — new propagate_context_to_thread factoring the ContextVar + thread-local approval/sudo capture→install→clear lifecycle (mirrors the GHSA-qg5c-hvr5-hjgr pattern). Applied to both execute_code RPC threads; agent/tool_executor.py refactored onto it (one audited copy, not three). Strictly more complete than [codex] Preserve context in execute_code RPC threads #33246, which propagated the ContextVar only and would leave the CLI callbacks unrestored.
  • check_execute_code_guard (adapted from [codex] Guard execute_code behind gateway approvals #30893) — one-shot whole-script approval in gateway/ask/cron-deny before the child spawns; fail-closed on deny/timeout/missing-notify; skips isolated container backends.
  • Env scrub — drop the broad HERMES_ prefix → explicit operational allowlist (HERMES_HOME/PROFILE/CONFIG/ENV) plus DSN/WEBHOOK secret substrings. Skill/config env_passthrough vars remain an intentional opt-in (checked first).
  • Gateway — startup warning when approvals.mode=manual with neither tirith nor auxiliary.approval configured.

Limitation

Purely-local, non-interactive, non-gateway sessions still auto-run code — there is no operator surface to prompt against, matching the existing terminal auto-approve contract; the hardline floor still blocks catastrophic patterns. This makes the gateway case consistent with interactive sessions; it does not make headless-local fail-closed (see below).

Out of scope

  • Least-privilege default gateway toolset (execute_code opt-in)
  • tirith as default when a gateway platform is active
  • OS-level isolation (systemd) guidance in docs
  • Flipping headless-local to fail-closed

Notes

Validation

451 passed / 3 skipped across the cluster suites; ruff clean. Repro (#30882): a gateway session whose script ran terminal("<dangerous>") was silently auto-approved before this change; now it routes to gateway approval and blocks on deny/timeout.

Worker threads that dispatch Hermes tools started with an empty contextvars.Context and no thread-local approval/sudo callbacks. Add tools/thread_context.propagate_context_to_thread factoring that capture/install/clear lifecycle (mirrors the GHSA-qg5c-hvr5-hjgr pattern), and refactor agent/tool_executor onto it so the security-critical logic lives in one audited place. Update the contextvar-propagation source guard for the new call shape.

Refs NousResearch#33057
…+ guard entry

Wrap both execute_code RPC threads (local UDS + remote file-RPC) with propagate_context_to_thread so gateway sessions no longer fall into check_dangerous_command's non-interactive auto-approve branch and the CLI approval prompt stays reachable. Add check_execute_code_guard: one-shot fail-closed approval of the whole script in gateway/ask/cron-deny before the child spawns (skips isolated backends; command-string built only past the early returns). Drop the broad HERMES_ env passthrough for an explicit operational allowlist plus DSN/WEBHOOK secret substrings, and update the POSIX-equivalence oracle.

Refs NousResearch#4146, NousResearch#27303, NousResearch#30882, NousResearch#33057
When approvals.mode=manual with security.tirith_enabled off and no auxiliary.approval model, dangerous commands and execute_code scripts can only be gated by live in-chat approval; with routing fixed they now fail closed (block) rather than silently auto-run. Surface that at startup so operators knowingly enable tirith or auxiliary.approval for unattended gateways.

Refs NousResearch#30882
Cover context+callback propagation and teardown-clears, a source guard that both RPC threads stay wrapped, the check_execute_code_guard decision matrix (isolated backend, headless-local, cron-deny, gateway approve/deny/timeout/missing-notify, smart mode, session-yolo), the env-scrub allowlist/secret rules, and a behavioral test that execute_code() blocks before spawning on denial.

Refs NousResearch#4146, NousResearch#27303, NousResearch#30882, NousResearch#33057
@alt-glitch alt-glitch added type/security Security vulnerability or hardening P1 High — major feature broken, no workaround comp/agent Core agent loop, run_agent.py, prompt builder tool/code-exec execute_code sandbox comp/gateway Gateway runner, session dispatch, delivery labels May 29, 2026
@teknium1

Copy link
Copy Markdown
Contributor

Closing as redundant — the fix this PR implements has already landed on main via a parallel contribution, so there's nothing left to merge here.

Your diagnosis was exactly right: the execute_code RPC worker threads started with an empty contextvars.Context and no approval/sudo callbacks, so gateway sessions fell into check_dangerous_command's non-interactive auto-approve branch and dangerous commands ran without prompting (#4146, #27303, #30882, #33057). That's the correct root cause and the correct fix mechanism.

It landed across two commits, both by @firefly:

  • 21aeefe5f — introduced tools/thread_context.py::propagate_context_to_thread and wired it into agent/tool_executor.py
  • 108397726 — wrapped both execute_code RPC threads (local UDS + remote file-RPC) with it, and added check_execute_code_guard for one-shot fail-closed whole-script approval in gateway/ask/cron-deny contexts

That closed #4146, #27303, #30882, and #33057. Your PR is now CONFLICTING because main moved past the shape it targets — the agent/tool_executor.py portion here is a refactor toward code that's already in place.

This was a genuine multi-contributor race on a real P0, and your independent diagnosis confirming the same root cause and fix direction was valuable corroboration. Thank you for the thorough writeup — sorry it got pipped at the post by a parallel PR.

For visibility, I've also documented the one user-facing behavior change that came with the env-scrub hardening (the narrowed HERMES_* passthrough and how to opt a var back in) in #34594.

@teknium1 teknium1 closed this May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder comp/gateway Gateway runner, session dispatch, delivery P1 High — major feature broken, no workaround tool/code-exec execute_code sandbox type/security Security vulnerability or hardening

Projects

None yet

3 participants