Async hooks#5
Draft
pefontana wants to merge 13 commits into
Draft
Conversation
Rewrites _spawn to use subprocess.Popen + Popen.communicate instead of subprocess.run, and registers every running child in a module-level _live_procs set guarded by _async_tracking_lock. Sync callers see semantically identical behaviour (same result-dict shape, same timeout and error dispatch), but the child handle is now reachable from a future async-hook shutdown path so a blocked worker thread can be unblocked by terminating its subprocess externally. Adds TestPopenRefactorPreservesSyncSemantics as a regression gate: the result-dict shape, timeout path (terminate + kill with drained output), missing-command error, and _live_procs drain invariant are all pinned.
Fires once per delegate_task call, after every per-child subagent_stop, with a five-counter partition of the status keyspace (completed/failed/errored/interrupted/timeout_count) that sums to child_count exactly. Role is stashed into _child_role_for_batch before the per-child loop pops _child_role, then stripped before serialisation so it doesn't leak into the tool result. Adds the synthetic _DEFAULT_PAYLOADS entry so `hermes hooks test subagent_batch_complete` works out of the box, registers the event name in VALID_HOOKS, and pins the contract with a test suite covering: single-mode and batch-mode firing, counter partitioning including the timeout status, per-child role preservation, invocation order after subagent_stop, parent-thread serialisation, extra.* stdin nesting for shell hooks, and the tool-result not leaking the helper field.
Declaring `async: true` on a hook entry now runs the subprocess in a
background thread pool so the agent loop doesn't wait. Introduces:
* A three-tier compatibility gate at config-parse time — `post_*`,
session lifecycle, and `subagent_*` events are accepted silently;
`pre_tool_call` is warned (its late block directive is logged and
discarded); `pre_llm_call` and the transform_* events are rejected
with an ERROR log because their return value is the whole product.
* A ThreadPoolExecutor fronted by a BoundedSemaphore so saturation
drops firings with a WARN instead of queueing unboundedly — real
backpressure under a chatty `post_tool_call`.
* Process-handle tracking via the _live_procs set (introduced in the
Popen refactor) and a _live_futures set, both guarded by
_async_tracking_lock.
* shutdown_async_hooks() sweeps tracked subprocesses with SIGTERM,
waits up to `hooks_async_shutdown_grace_seconds`, then SIGKILLs
survivors. Registered via atexit on first pool materialisation.
* CLI mode installs a SIGINT handler that terminates tracked
subprocesses immediately and chains to the previous handler so
KeyboardInterrupt propagation still happens; gateway mode plugs
shutdown_async_hooks() into its asyncio-native teardown sequence
instead (asymmetry documented in the SIGINT handling comment).
* _reset_async_pool() test-only helper for teardowns that varied pool
size, grace, or signal disposition — without it module-level
singletons leak across cases.
* Two new config keys — `hooks_async_pool_size` (default 10,
[1,100]) and `hooks_async_shutdown_grace_seconds` (default 5,
[0,60]) — with clamp-on-read so bad values warn and coerce.
* Runtime compat: a sync hook that emits `{"async": true}` on stdout
(the Claude-Code runtime marker) logs a warning and is treated as
a no-op — copy-pasted Claude Code scripts don't silently change
semantics.
* `hermes hooks list` tags async entries with a trailing ``[async]`` marker so operators can spot fire-and-forget hooks at a glance. * `hermes hooks doctor` emits a warning line for async `pre_tool_call` entries — block directives would be discarded because the tool has already executed by the time the background subprocess completes. * `hermes hooks test <event>` now handles async specs sanely: by default it runs synchronously via run_once so the caller sees stdout + parsed response. The new ``--no-wait`` flag exercises the real fire-and-forget path (via _make_async_callback) and reports the schedule latency, useful for testing that the bounded semaphore and tracking set behave under load.
New tests/agent/test_shell_hooks_async.py pins the full async bridge
contract with 16 cases:
* config-parse validation tiers — accepted, warned (`pre_tool_call`),
rejected-at-config-parse (`pre_llm_call`, `transform_*`)
* fire-and-forget callback returns in under ~100 ms while the
subprocess runs in the background
* the BoundedSemaphore drops firings on saturation, the subprocess is
never spawned, and a later firing after release succeeds
* async `pre_tool_call` block directives are logged WARN and discarded
* shutdown with a small grace terminates running subprocesses via
SIGTERM and returns promptly, never burning the full grace when the
child honors the signal
* the Claude-Code runtime marker `{"async": true}` on a sync hook
emits a warning and is otherwise a no-op
* thread-safety — 50 concurrent firings from 10 threads drain to zero
* CLI-mode SIGINT handler terminates tracked subprocesses and chains
to the previous handler (including the SIG_DFL → KeyboardInterrupt
fallback)
The autouse `_reset_async_state` fixture calls `_reset_async_pool` in
teardown so module-level singletons never leak across cases.
* New "Async hooks" section in the shell-hooks chapter with the three-tier compatibility table (accepted / warned / rejected), notes on the `hooks_async_pool_size` and `hooks_async_shutdown_grace_seconds` config keys, the backpressure semantic (drop-on-saturation with a WARN), the sync/async allowlist invariant, and the CLI-vs-gateway SIGINT asymmetry. * New worked example `slack-swarm-summary.sh` alongside the existing `log-orchestration.sh` — the first uses subagent_batch_complete for single-firing aggregate notifications, the second uses subagent_stop for per-child auditing. Contrast is explicit. * New `### subagent_batch_complete` plugin-hook reference section with the full callback signature, the five-counter partition and its invariant (sum equals child_count), and an example plugin. * Quick-reference table updated with the new event; shell-hooks configuration schema shows the `async` flag and the two new config keys. * subagent_stop status-keyspace table updated from four values to five — the `timeout` status has been live since the delegation hard-timeout landed (PR NousResearch#13770) but wasn't previously documented.
Three small correctness fixes on the async-hook bridge, surfaced by a
review pass:
* Double-checked locking in ``_async_pool_get`` and ``_async_sem_get``.
Without the lock, two concurrent firings could race past the
``is None`` check and each construct their own pool/semaphore —
the first would leak. A shared ``_async_init_lock`` guards the
creation window for both getters.
* ``_on_async_future_done`` now snapshots ``_async_sem_inst`` directly
instead of round-tripping through ``_async_sem_get``. If
``_reset_async_pool`` clears ``_async_sem_inst`` between submit and
callback, the getter would lazy-create a *new* semaphore just to
release it — leaving that sem one over its bound. Snapshot + None
check + swallow ``ValueError`` leaves the release safely idempotent
against reset.
* ``_reset_async_pool`` only restores the SIGINT handler it actually
installed. Tests that flip ``_sigint_handler_installed`` manually
(to exercise ``_async_pool_sigint_handler`` without running
``_maybe_install_sigint_handler``) never mutated the process's real
signal disposition; the teardown was writing a fake handler back
into it and leaking across tests. A new
``_sigint_handler_owned_by_us`` flag gates the restore.
Also simplifies a bit:
* Drop the ``_ACCEPTED_ASYNC_EVENTS`` frozenset — defined with a
documentation comment but never referenced anywhere.
* Drop the ``"ShellHookSpec"`` forward-reference in
``_run_async_body`` — the class is defined above in the same file.
* Remove the dead ``except KeyboardInterrupt: raise`` in the SIGINT
handler (re-raising the same exception is a no-op).
* Use the exception arg via ``exc_info=`` rather than both ``%s`` and
``exc_info=`` (double-formats the same exception).
* Fold the Claude-Code runtime ``{"async": true}`` check into a new
``_parse_response_with_extras`` so ``_make_sync_callback`` decodes
the stdout once instead of twice.
* ``tools/delegate_tool.py`` no longer smuggles ``_child_role`` through a sibling ``_child_role_for_batch`` key. The subagent_stop loop now reads ``_child_role`` with ``get()`` so the batch aggregate below can still see it; both loops are followed by one strip pass that pops ``_child_role`` from every entry before the tool result is serialised. * ``hermes_cli/hooks.py`` imports ``time`` at module level instead of wrapping ``time.monotonic()`` in a ``_time_now`` helper whose only content was a local ``import time``. No behaviour change — same firing order, same payload shape, same tool-result JSON.
Three independent reviews surfaced a handful of real bugs. Fixing all of them here: * **SIGTERM orphans hook subprocesses (codex #1).** The CLI only installed a SIGINT handler — SIGTERM (from ``kill``, ``timeout``, systemd stop, CI harnesses) skips atexit entirely and leaves every in-flight hook subprocess running as an orphan owned by init. Adds ``_async_pool_sigterm_handler`` which terminates tracked subprocess groups inline, then routes to ``sys.exit(128 + SIGTERM)``. Inline termination is required because ``ThreadPoolExecutor`` uses non-daemon threads: Python waits for every worker to return before running atexit, and workers block inside ``proc.communicate(timeout=spec.timeout)`` until the subprocess dies. Renamed ``_maybe_install_sigint_handler`` → ``_maybe_install_signal_handlers`` (with back-compat alias). Verified: ``kill -TERM`` on a hermes CLI running a 4 s ``sleep`` hook now exits in ~0.7 s with no orphan, was 4 s + orphan. * **Subprocess groups for reliable termination.** Hooks are now spawned with ``start_new_session=True`` so the subprocess is its own PGID leader. Shutdown / SIGINT / SIGTERM paths call ``os.killpg`` on the group instead of ``proc.terminate()`` — without this, a bash script's orphaned ``sleep`` child kept the parent stdout FD open and blocked ``proc.communicate`` for the full sleep duration. ``_terminate_group`` / ``_kill_group`` helpers fall back to plain ``terminate`` / ``kill`` on edge cases where ``getpgid`` fails (already-exited proc, non-POSIX). * **``hermes hooks test --no-wait`` blocks for full hook runtime (codex #2).** The flag advertised fire-and-forget but the CLI's ``ThreadPoolExecutor`` atexit ``pool.shutdown(wait=True)`` joined the worker anyway, which in turn waited for the subprocess. ``_cmd_test`` now polls briefly for ``_live_procs`` to fill (so the subprocess definitely spawned), then ``os._exit(0)`` — skipping atexit entirely. The subprocess keeps running under init because of ``start_new_session=True``. Verified: CLI exit dropped from 2.3 s to 76 ms for a 2-second hook, and the hook still writes its audit log 3 s later after the CLI is gone. * **Stale ``_child_role_for_batch`` test (claude #1 / hermes #2).** The test from commit 76d3ffd4 asserted the *old* helper field name — no code path sets it post-refactor (455c136f), so the test passed trivially without verifying anything. Fixed to assert ``_child_role`` (the real field) is stripped, and added an explanatory message so a future failure is easier to diagnose. Module-header docstring updated too. * **``submit()`` RuntimeError branch: stale-semaphore parity fix (claude #3).** Same pattern I already fixed in ``_on_async_future_done``, missed here: a concurrent ``_reset_async_pool`` between ``acquire`` and ``release`` would cause ``_async_sem_get()`` to lazy-create a fresh sem and over- release on it. Snapshot ``_async_sem_inst`` + swallow ``ValueError`` like the symmetric path. * **Shutdown race: proc registered after the snapshot (claude #4 / hermes #1).** Worker that got between ``subprocess.Popen()`` and ``_register_live_proc(proc)`` would miss the shutdown-sweep snapshot and block for the full ``spec.timeout``. After registering, the worker now checks ``_async_shutting_down`` and self-terminates its subprocess group. * **WARN log noise on SIGTERM'd children (claude #5).** Shutdown-induced exits (rc = -15 / -9) no longer spam a per-proc ``WARNING`` — demoted to ``DEBUG`` when ``_async_shutting_down`` is set. Both the atexit path and the signal handlers now set the flag before terminating, so a Ctrl-C or a ``kill -TERM`` with 10 running hooks emits zero warn lines instead of 10. Still outstanding (documented trade-offs, not fixed here): * Gateway shutdown blocks the event loop for up to ``grace_seconds`` (claude #2). Acknowledged as a follow-up candidate via ``loop.run_in_executor``. * ``_maybe_install_signal_handlers`` is still leading-underscore (claude NousResearch#6). Cosmetic; kept consistent with the rest of the module's private-by-convention API. All 101 hook tests still pass.
Two more real bugs surfaced by a follow-up review round: * **Windows regression (codex #1 / hermes).** The subprocess termination helpers called ``os.killpg(os.getpgid(proc.pid), signal.SIGKILL)`` guarded only by ``except (ProcessLookupError, OSError)``. On Windows those module attributes don't exist (``AttributeError``) and ``signal.SIGKILL`` is undefined, so any timeout / shutdown path would crash instead of cleaning up. Adds ``_IS_WINDOWS`` and the platform-guarded ``proc.terminate()`` / ``proc.kill()`` fallback, matching the convention in ``tools/process_registry.py``. Adds ``agent/shell_hooks.py`` to ``tests/tools/test_windows_compat.py`` ``GUARDED_FILES`` so the AST check enforces this going forward. * **SIGKILL escalation skipped under signal-initiated shutdown (codex #2).** ``_async_pool_sig{int,term}_handler`` flip ``_async_shutting_down = True`` before atexit / gateway shutdown runs ``shutdown_async_hooks``, which previously used that same flag for idempotency. Result: the documented SIGTERM-wait-SIGKILL escalation was silently skipped on every signal-initiated shutdown — a hook script with ``trap '' TERM`` only died via the per-hook timeout (60 s default) instead of the shutdown grace (5 s). Decouples the two concerns: ``_async_shutting_down`` still gates new submissions and demotes worker-exit log levels; idempotency now lives on a separate ``_async_cleanup_ran`` under ``_async_cleanup_lock`` (which also makes concurrent shutdown callers thread-safe, which the bare-bool gate wasn't). Regression test (``test_shutdown_escalates_even_when_shutting_down_flag_preset``) covers the TERM-trap path: pre-sets the flag, spawns a TERM-trapping subprocess, calls shutdown, and asserts the child was killed within the grace window. 112 hook tests pass. The 24 broader failures seen in tests/tools and tests/gateway reproduce identically on the pre-change branch — they are unrelated pre-existing breaks from being cut from an older merge base.
Live-testing the prior commit's flag decoupling surfaced that the CLI
signal path still blocked for up to ``spec.timeout`` (60 s default,
300 s max) whenever a hook script ignored SIGTERM — because the
handlers only sent SIGTERM and relied on atexit to run the SIGKILL
escalation, but atexit fires *after* non-daemon worker threads
finish. A worker stuck in ``proc.communicate`` waiting on a
TERM-ignoring subprocess kept the interpreter alive for the full
per-hook timeout.
Run ``shutdown_async_hooks()`` inline from
``_async_pool_sig{int,term}_handler`` so the escalation
(SIGTERM → wait ``grace_seconds`` → SIGKILL survivors →
``pool.shutdown(wait=True)``) completes before the handler chains /
``sys.exit``s. atexit's subsequent call to ``shutdown_async_hooks``
sees ``_async_cleanup_ran=True`` and no-ops.
End-to-end measurement on a harness that registers a Python hook
which truly ignores SIGTERM (``signal.signal(SIGTERM, lambda *a:
None)`` — bash's ``trap '' TERM`` doesn't qualify because bash's
external ``sleep`` child still dies from SIGTERM):
before: harness exited after 60.551s (blocked on spec.timeout)
after: harness exited after 2.104s (grace_seconds=2)
Test ``test_sigint_escalates_to_sigkill_when_term_is_ignored``
locks this in: spawns the Python-ignores-TERM subprocess, calls
the handler directly, asserts the handler returns within
``grace_seconds + buffer`` and that ``returncode == -SIGKILL`` (the
signature of the escalation running — a TERM-responsive exit would
have been ``-SIGTERM``). Uses a sentinel file to close the race
where SIGTERM can arrive before the child's Python-level handler
is installed (in which case Python's default disposition would
terminate the child on SIGTERM and miss the code path under test).
125 hook tests pass.
``cli.py`` installs its own SIGTERM handlers (``_signal_handler`` for interactive chat, ``_signal_handler_q`` for single-query ``-q`` mode) *after* ``hermes_cli/main.py`` calls ``agent.shell_hooks._maybe_install_signal_handlers``. ``signal.signal`` overwrites — no chaining — so our ``_async_pool_sigterm_handler`` is never reached when the agent is actually running. The live-test harness didn't hit this because it doesn't install cli.py's handler. Symptom: ``kill -TERM`` on ``hermes chat -q ...`` with a hook that ignores SIGTERM took 62.8 s to exit (``spec.timeout`` default), not the 2 s ``hooks_async_shutdown_grace_seconds`` the PR description advertises. The interpreter waits for non-daemon ``ThreadPoolExecutor`` workers, which block in ``proc.communicate`` until the subprocess dies. Fix: call ``shutdown_async_hooks()`` inline in both cli.py handlers before raising ``KeyboardInterrupt``. Idempotent — no-op if the async pool was never materialised. Verified end-to-end: before: agent exit 62.818s (TERM-ignoring hook, grace=2s, timeout=60s) after : agent exit 4.487s (= 1.5s cli.py grace + 2s hook grace + ~1s unwind) Happy path untouched — a TERM-responsive hook still exits in 2.6 s (cli.py grace + pool shutdown, no grace wait).
# Conflicts: # hermes_cli/plugins.py # website/docs/user-guide/features/hooks.md
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Builds on #4. Adds two things to shell hooks:
async: trueso slow hooks don't block the agent loop, and a newsubagent_batch_completeevent that fires once perdelegate_taskcall with an aggregate view of the whole batch.Two things you can do now that were awkward before:
async: trueto a hook entry and the subprocess runs on a background thread pool. The agent loop doesn't wait — a 2-second webhook round-trip costs ~2 ms of schedule time instead of 2 s of blocked tool call.delegate_taskswarm finishes, not per child. The newsubagent_batch_completeevent fires exactly once perdelegate_task— whether you fanned out to 1 worker or 20 — with aggregate counters and per-child summaries. No more de-duping N per-child notifications.Example — async Slack audit on every tool call
Example — one Slack post when a batch finishes
Which events can go async
Not every event makes sense as async — some only propagate through their return value.
post_tool_call,post_llm_call,pre_api_request,post_api_request,on_session_*,subagent_stop,subagent_batch_complete.pre_tool_call— accepted for audit-style use, but any{"action": "block"}return is discarded because the tool has already executed.pre_llm_call,transform_tool_result,transform_terminal_output— the return value IS the product, so async would make the hook useless. Logged as anERRORand not registered.subagent_batch_completepayloadStdin shape (fires once per
delegate_task, after every per-childsubagent_stop):{ "hook_event_name": "subagent_batch_complete", "session_id": "parent-session", "extra": { "child_count": 3, "completed_count": 2, "failed_count": 0, "errored_count": 0, "interrupted_count": 1, "timeout_count": 0, "total_duration_ms": 5000, "children": [ {"task_index": 0, "role": null, "status": "completed", "duration_ms": 1200, "summary": "..."} ] } }Invariant: the five counters partition the status keyspace exactly —
completed + failed + errored + interrupted + timeout == child_count. No "unknown" bucket.Use it alongside
subagent_stop(still fires per child) when you want both per-child audit and a single batch summary.Backpressure + shutdown
hooks_async_pool_size(default 10, range[1, 100]). Over-saturation drops firings with aWARNlog instead of queuing unboundedly.SIGTERM. Survivors pasthooks_async_shutdown_grace_seconds(default 5, range[0, 60]) getSIGKILL— no webhook posts left orphaned after the agent exits.hermes hooksCLI updateshermes hooks listtags async entries with a trailing[async].hermes hooks doctorwarns whenpre_tool_callis async (block directives will be ignored).hermes hooks test <event>runs async hooks synchronously by default so you see stdout;--no-waitexercises the real fire-and-forget path.Config keys
Full docs + worked examples in
website/docs/user-guide/features/hooks.md.Type of Change