Skip to content

Async hooks#5

Draft
pefontana wants to merge 13 commits into
mainfrom
async-hooks
Draft

Async hooks#5
pefontana wants to merge 13 commits into
mainfrom
async-hooks

Conversation

@pefontana

@pefontana pefontana commented Apr 22, 2026

Copy link
Copy Markdown
Owner

Builds on #4. Adds two things to shell hooks: async: true so slow hooks don't block the agent loop, and a new subagent_batch_complete event that fires once per delegate_task call with an aggregate view of the whole batch.

Two things you can do now that were awkward before:

  • Post to webhooks, Slack, or any slow sink without slowing the agent. Add async: true to a hook entry and the subprocess runs on a background thread pool. The agent loop doesn't wait — a 2-second webhook round-trip costs ~2 ms of schedule time instead of 2 s of blocked tool call.
  • Notify once when the whole delegate_task swarm finishes, not per child. The new subagent_batch_complete event fires exactly once per delegate_task — whether you fanned out to 1 worker or 20 — with aggregate counters and per-child summaries. No more de-duping N per-child notifications.

Example — async Slack audit on every tool call

# ~/.hermes/config.yaml
hooks:
  post_tool_call:
    - command: "~/.hermes/agent-hooks/audit-slack.sh"
      async: true
      timeout: 120
hooks_async_pool_size: 10
hooks_async_shutdown_grace_seconds: 5

Example — one Slack post when a batch finishes

# ~/.hermes/agent-hooks/slack-swarm-summary.sh
payload=$(cat)
total=$(echo  "$payload" | jq -r '.extra.child_count')
ok=$(echo     "$payload" | jq -r '.extra.completed_count')
failed=$(echo "$payload" | jq -r '.extra.failed_count + .extra.errored_count')
curl -sS -X POST -H 'Content-Type: application/json' \
     -d "{\"text\": \"Batch: $ok/$total ok, $failed failed\"}" "$SLACK_WEBHOOK_URL"
echo '{}'
hooks:
  subagent_batch_complete:
    - command: "~/.hermes/agent-hooks/slack-swarm-summary.sh"

Which events can go async

Not every event makes sense as async — some only propagate through their return value.

  • Accepted silently: post_tool_call, post_llm_call, pre_api_request, post_api_request, on_session_*, subagent_stop, subagent_batch_complete.
  • ⚠️ Warned: pre_tool_call — accepted for audit-style use, but any {"action": "block"} return is discarded because the tool has already executed.
  • Rejected at config-parse: pre_llm_call, transform_tool_result, transform_terminal_output — the return value IS the product, so async would make the hook useless. Logged as an ERROR and not registered.

subagent_batch_complete payload

Stdin shape (fires once per delegate_task, after every per-child subagent_stop):

{
  "hook_event_name": "subagent_batch_complete",
  "session_id": "parent-session",
  "extra": {
    "child_count": 3,
    "completed_count": 2,
    "failed_count": 0,
    "errored_count": 0,
    "interrupted_count": 1,
    "timeout_count": 0,
    "total_duration_ms": 5000,
    "children": [
      {"task_index": 0, "role": null, "status": "completed",
       "duration_ms": 1200, "summary": "..."}
    ]
  }
}

Invariant: the five counters partition the status keyspace exactly — completed + failed + errored + interrupted + timeout == child_count. No "unknown" bucket.

Use it alongside subagent_stop (still fires per child) when you want both per-child audit and a single batch summary.

Backpressure + shutdown

  • Bounded thread pool sized to hooks_async_pool_size (default 10, range [1, 100]). Over-saturation drops firings with a WARN log instead of queuing unboundedly.
  • On shutdown (Ctrl-C, systemd stop, normal exit), every tracked hook subprocess gets SIGTERM. Survivors past hooks_async_shutdown_grace_seconds (default 5, range [0, 60]) get SIGKILL — no webhook posts left orphaned after the agent exits.
  • Ctrl-C in CLI mode terminates running hooks immediately. Gateway mode integrates the sweep into its asyncio-native shutdown path.

hermes hooks CLI updates

  • hermes hooks list tags async entries with a trailing [async].
  • hermes hooks doctor warns when pre_tool_call is async (block directives will be ignored).
  • hermes hooks test <event> runs async hooks synchronously by default so you see stdout; --no-wait exercises the real fire-and-forget path.

Config keys

hooks_async_pool_size: 10                      # [1, 100]
hooks_async_shutdown_grace_seconds: 5          # [0, 60]

Full docs + worked examples in website/docs/user-guide/features/hooks.md.

Type of Change

  • ✨ New feature (non-breaking change that adds functionality)
  • 📝 Documentation update
  • ✅ Tests

Rewrites _spawn to use subprocess.Popen + Popen.communicate instead of
subprocess.run, and registers every running child in a module-level
_live_procs set guarded by _async_tracking_lock. Sync callers see
semantically identical behaviour (same result-dict shape, same timeout
and error dispatch), but the child handle is now reachable from a
future async-hook shutdown path so a blocked worker thread can be
unblocked by terminating its subprocess externally.

Adds TestPopenRefactorPreservesSyncSemantics as a regression gate: the
result-dict shape, timeout path (terminate + kill with drained
output), missing-command error, and _live_procs drain invariant are
all pinned.
Fires once per delegate_task call, after every per-child subagent_stop,
with a five-counter partition of the status keyspace
(completed/failed/errored/interrupted/timeout_count) that sums to
child_count exactly. Role is stashed into _child_role_for_batch before
the per-child loop pops _child_role, then stripped before serialisation
so it doesn't leak into the tool result.

Adds the synthetic _DEFAULT_PAYLOADS entry so `hermes hooks test
subagent_batch_complete` works out of the box, registers the event name
in VALID_HOOKS, and pins the contract with a test suite covering:
single-mode and batch-mode firing, counter partitioning including the
timeout status, per-child role preservation, invocation order after
subagent_stop, parent-thread serialisation, extra.* stdin nesting for
shell hooks, and the tool-result not leaking the helper field.
Declaring `async: true` on a hook entry now runs the subprocess in a
background thread pool so the agent loop doesn't wait. Introduces:

* A three-tier compatibility gate at config-parse time — `post_*`,
  session lifecycle, and `subagent_*` events are accepted silently;
  `pre_tool_call` is warned (its late block directive is logged and
  discarded); `pre_llm_call` and the transform_* events are rejected
  with an ERROR log because their return value is the whole product.
* A ThreadPoolExecutor fronted by a BoundedSemaphore so saturation
  drops firings with a WARN instead of queueing unboundedly — real
  backpressure under a chatty `post_tool_call`.
* Process-handle tracking via the _live_procs set (introduced in the
  Popen refactor) and a _live_futures set, both guarded by
  _async_tracking_lock.
* shutdown_async_hooks() sweeps tracked subprocesses with SIGTERM,
  waits up to `hooks_async_shutdown_grace_seconds`, then SIGKILLs
  survivors. Registered via atexit on first pool materialisation.
* CLI mode installs a SIGINT handler that terminates tracked
  subprocesses immediately and chains to the previous handler so
  KeyboardInterrupt propagation still happens; gateway mode plugs
  shutdown_async_hooks() into its asyncio-native teardown sequence
  instead (asymmetry documented in the SIGINT handling comment).
* _reset_async_pool() test-only helper for teardowns that varied pool
  size, grace, or signal disposition — without it module-level
  singletons leak across cases.
* Two new config keys — `hooks_async_pool_size` (default 10,
  [1,100]) and `hooks_async_shutdown_grace_seconds` (default 5,
  [0,60]) — with clamp-on-read so bad values warn and coerce.
* Runtime compat: a sync hook that emits `{"async": true}` on stdout
  (the Claude-Code runtime marker) logs a warning and is treated as
  a no-op — copy-pasted Claude Code scripts don't silently change
  semantics.
* `hermes hooks list` tags async entries with a trailing ``[async]``
  marker so operators can spot fire-and-forget hooks at a glance.
* `hermes hooks doctor` emits a warning line for async `pre_tool_call`
  entries — block directives would be discarded because the tool has
  already executed by the time the background subprocess completes.
* `hermes hooks test <event>` now handles async specs sanely: by
  default it runs synchronously via run_once so the caller sees
  stdout + parsed response. The new ``--no-wait`` flag exercises the
  real fire-and-forget path (via _make_async_callback) and reports
  the schedule latency, useful for testing that the bounded semaphore
  and tracking set behave under load.
New tests/agent/test_shell_hooks_async.py pins the full async bridge
contract with 16 cases:

* config-parse validation tiers — accepted, warned (`pre_tool_call`),
  rejected-at-config-parse (`pre_llm_call`, `transform_*`)
* fire-and-forget callback returns in under ~100 ms while the
  subprocess runs in the background
* the BoundedSemaphore drops firings on saturation, the subprocess is
  never spawned, and a later firing after release succeeds
* async `pre_tool_call` block directives are logged WARN and discarded
* shutdown with a small grace terminates running subprocesses via
  SIGTERM and returns promptly, never burning the full grace when the
  child honors the signal
* the Claude-Code runtime marker `{"async": true}` on a sync hook
  emits a warning and is otherwise a no-op
* thread-safety — 50 concurrent firings from 10 threads drain to zero
* CLI-mode SIGINT handler terminates tracked subprocesses and chains
  to the previous handler (including the SIG_DFL → KeyboardInterrupt
  fallback)

The autouse `_reset_async_state` fixture calls `_reset_async_pool` in
teardown so module-level singletons never leak across cases.
* New "Async hooks" section in the shell-hooks chapter with the
  three-tier compatibility table (accepted / warned / rejected),
  notes on the `hooks_async_pool_size` and
  `hooks_async_shutdown_grace_seconds` config keys, the backpressure
  semantic (drop-on-saturation with a WARN), the sync/async
  allowlist invariant, and the CLI-vs-gateway SIGINT asymmetry.
* New worked example `slack-swarm-summary.sh` alongside the existing
  `log-orchestration.sh` — the first uses subagent_batch_complete
  for single-firing aggregate notifications, the second uses
  subagent_stop for per-child auditing. Contrast is explicit.
* New `### subagent_batch_complete` plugin-hook reference section
  with the full callback signature, the five-counter partition and
  its invariant (sum equals child_count), and an example plugin.
* Quick-reference table updated with the new event; shell-hooks
  configuration schema shows the `async` flag and the two new
  config keys.
* subagent_stop status-keyspace table updated from four values to
  five — the `timeout` status has been live since the delegation
  hard-timeout landed (PR NousResearch#13770) but wasn't previously documented.
Three small correctness fixes on the async-hook bridge, surfaced by a
review pass:

* Double-checked locking in ``_async_pool_get`` and ``_async_sem_get``.
  Without the lock, two concurrent firings could race past the
  ``is None`` check and each construct their own pool/semaphore —
  the first would leak.  A shared ``_async_init_lock`` guards the
  creation window for both getters.

* ``_on_async_future_done`` now snapshots ``_async_sem_inst`` directly
  instead of round-tripping through ``_async_sem_get``.  If
  ``_reset_async_pool`` clears ``_async_sem_inst`` between submit and
  callback, the getter would lazy-create a *new* semaphore just to
  release it — leaving that sem one over its bound.  Snapshot + None
  check + swallow ``ValueError`` leaves the release safely idempotent
  against reset.

* ``_reset_async_pool`` only restores the SIGINT handler it actually
  installed.  Tests that flip ``_sigint_handler_installed`` manually
  (to exercise ``_async_pool_sigint_handler`` without running
  ``_maybe_install_sigint_handler``) never mutated the process's real
  signal disposition; the teardown was writing a fake handler back
  into it and leaking across tests.  A new
  ``_sigint_handler_owned_by_us`` flag gates the restore.

Also simplifies a bit:

* Drop the ``_ACCEPTED_ASYNC_EVENTS`` frozenset — defined with a
  documentation comment but never referenced anywhere.
* Drop the ``"ShellHookSpec"`` forward-reference in
  ``_run_async_body`` — the class is defined above in the same file.
* Remove the dead ``except KeyboardInterrupt: raise`` in the SIGINT
  handler (re-raising the same exception is a no-op).
* Use the exception arg via ``exc_info=`` rather than both ``%s`` and
  ``exc_info=`` (double-formats the same exception).
* Fold the Claude-Code runtime ``{"async": true}`` check into a new
  ``_parse_response_with_extras`` so ``_make_sync_callback`` decodes
  the stdout once instead of twice.
* ``tools/delegate_tool.py`` no longer smuggles ``_child_role`` through
  a sibling ``_child_role_for_batch`` key.  The subagent_stop loop
  now reads ``_child_role`` with ``get()`` so the batch aggregate
  below can still see it; both loops are followed by one strip pass
  that pops ``_child_role`` from every entry before the tool result
  is serialised.

* ``hermes_cli/hooks.py`` imports ``time`` at module level instead of
  wrapping ``time.monotonic()`` in a ``_time_now`` helper whose only
  content was a local ``import time``.

No behaviour change — same firing order, same payload shape, same
tool-result JSON.
Three independent reviews surfaced a handful of real bugs. Fixing all
of them here:

* **SIGTERM orphans hook subprocesses (codex #1).** The CLI only
  installed a SIGINT handler — SIGTERM (from ``kill``, ``timeout``,
  systemd stop, CI harnesses) skips atexit entirely and leaves every
  in-flight hook subprocess running as an orphan owned by init.  Adds
  ``_async_pool_sigterm_handler`` which terminates tracked subprocess
  groups inline, then routes to ``sys.exit(128 + SIGTERM)``.  Inline
  termination is required because ``ThreadPoolExecutor`` uses
  non-daemon threads: Python waits for every worker to return before
  running atexit, and workers block inside
  ``proc.communicate(timeout=spec.timeout)`` until the subprocess
  dies.  Renamed ``_maybe_install_sigint_handler`` →
  ``_maybe_install_signal_handlers`` (with back-compat alias).
  Verified: ``kill -TERM`` on a hermes CLI running a 4 s ``sleep``
  hook now exits in ~0.7 s with no orphan, was 4 s + orphan.

* **Subprocess groups for reliable termination.** Hooks are now
  spawned with ``start_new_session=True`` so the subprocess is its
  own PGID leader.  Shutdown / SIGINT / SIGTERM paths call
  ``os.killpg`` on the group instead of ``proc.terminate()`` — without
  this, a bash script's orphaned ``sleep`` child kept the parent
  stdout FD open and blocked ``proc.communicate`` for the full sleep
  duration.  ``_terminate_group`` / ``_kill_group`` helpers fall back
  to plain ``terminate`` / ``kill`` on edge cases where ``getpgid``
  fails (already-exited proc, non-POSIX).

* **``hermes hooks test --no-wait`` blocks for full hook runtime
  (codex #2).** The flag advertised fire-and-forget but the CLI's
  ``ThreadPoolExecutor`` atexit ``pool.shutdown(wait=True)`` joined
  the worker anyway, which in turn waited for the subprocess.
  ``_cmd_test`` now polls briefly for ``_live_procs`` to fill (so the
  subprocess definitely spawned), then ``os._exit(0)`` — skipping
  atexit entirely.  The subprocess keeps running under init because
  of ``start_new_session=True``.  Verified: CLI exit dropped from 2.3
  s to 76 ms for a 2-second hook, and the hook still writes its
  audit log 3 s later after the CLI is gone.

* **Stale ``_child_role_for_batch`` test (claude #1 / hermes #2).**
  The test from commit 76d3ffd4 asserted the *old* helper field name
  — no code path sets it post-refactor (455c136f), so the test
  passed trivially without verifying anything.  Fixed to assert
  ``_child_role`` (the real field) is stripped, and added an
  explanatory message so a future failure is easier to diagnose.
  Module-header docstring updated too.

* **``submit()`` RuntimeError branch: stale-semaphore parity fix
  (claude #3).**  Same pattern I already fixed in
  ``_on_async_future_done``, missed here: a concurrent
  ``_reset_async_pool`` between ``acquire`` and ``release`` would
  cause ``_async_sem_get()`` to lazy-create a fresh sem and over-
  release on it.  Snapshot ``_async_sem_inst`` + swallow
  ``ValueError`` like the symmetric path.

* **Shutdown race: proc registered after the snapshot (claude #4 /
  hermes #1).** Worker that got between ``subprocess.Popen()`` and
  ``_register_live_proc(proc)`` would miss the shutdown-sweep
  snapshot and block for the full ``spec.timeout``.  After
  registering, the worker now checks ``_async_shutting_down`` and
  self-terminates its subprocess group.

* **WARN log noise on SIGTERM'd children (claude #5).**
  Shutdown-induced exits (rc = -15 / -9) no longer spam a per-proc
  ``WARNING`` — demoted to ``DEBUG`` when ``_async_shutting_down``
  is set.  Both the atexit path and the signal handlers now set the
  flag before terminating, so a Ctrl-C or a ``kill -TERM`` with 10
  running hooks emits zero warn lines instead of 10.

Still outstanding (documented trade-offs, not fixed here):

* Gateway shutdown blocks the event loop for up to ``grace_seconds``
  (claude #2).  Acknowledged as a follow-up candidate via
  ``loop.run_in_executor``.
* ``_maybe_install_signal_handlers`` is still leading-underscore
  (claude NousResearch#6).  Cosmetic; kept consistent with the rest of the
  module's private-by-convention API.

All 101 hook tests still pass.
Two more real bugs surfaced by a follow-up review round:

* **Windows regression (codex #1 / hermes).** The subprocess termination
  helpers called ``os.killpg(os.getpgid(proc.pid), signal.SIGKILL)``
  guarded only by ``except (ProcessLookupError, OSError)``.  On Windows
  those module attributes don't exist (``AttributeError``) and
  ``signal.SIGKILL`` is undefined, so any timeout / shutdown path would
  crash instead of cleaning up.  Adds ``_IS_WINDOWS`` and the
  platform-guarded ``proc.terminate()`` / ``proc.kill()`` fallback,
  matching the convention in ``tools/process_registry.py``.  Adds
  ``agent/shell_hooks.py`` to ``tests/tools/test_windows_compat.py``
  ``GUARDED_FILES`` so the AST check enforces this going forward.

* **SIGKILL escalation skipped under signal-initiated shutdown
  (codex #2).** ``_async_pool_sig{int,term}_handler`` flip
  ``_async_shutting_down = True`` before atexit / gateway shutdown
  runs ``shutdown_async_hooks``, which previously used that same flag
  for idempotency.  Result: the documented SIGTERM-wait-SIGKILL
  escalation was silently skipped on every signal-initiated shutdown —
  a hook script with ``trap '' TERM`` only died via the per-hook
  timeout (60 s default) instead of the shutdown grace (5 s).
  Decouples the two concerns: ``_async_shutting_down`` still gates new
  submissions and demotes worker-exit log levels; idempotency now lives
  on a separate ``_async_cleanup_ran`` under ``_async_cleanup_lock``
  (which also makes concurrent shutdown callers thread-safe, which the
  bare-bool gate wasn't).

Regression test (``test_shutdown_escalates_even_when_shutting_down_flag_preset``)
covers the TERM-trap path: pre-sets the flag, spawns a TERM-trapping
subprocess, calls shutdown, and asserts the child was killed within the
grace window.

112 hook tests pass.  The 24 broader failures seen in tests/tools and
tests/gateway reproduce identically on the pre-change branch — they are
unrelated pre-existing breaks from being cut from an older merge base.
Live-testing the prior commit's flag decoupling surfaced that the CLI
signal path still blocked for up to ``spec.timeout`` (60 s default,
300 s max) whenever a hook script ignored SIGTERM — because the
handlers only sent SIGTERM and relied on atexit to run the SIGKILL
escalation, but atexit fires *after* non-daemon worker threads
finish.  A worker stuck in ``proc.communicate`` waiting on a
TERM-ignoring subprocess kept the interpreter alive for the full
per-hook timeout.

Run ``shutdown_async_hooks()`` inline from
``_async_pool_sig{int,term}_handler`` so the escalation
(SIGTERM → wait ``grace_seconds`` → SIGKILL survivors →
``pool.shutdown(wait=True)``) completes before the handler chains /
``sys.exit``s.  atexit's subsequent call to ``shutdown_async_hooks``
sees ``_async_cleanup_ran=True`` and no-ops.

End-to-end measurement on a harness that registers a Python hook
which truly ignores SIGTERM (``signal.signal(SIGTERM, lambda *a:
None)`` — bash's ``trap '' TERM`` doesn't qualify because bash's
external ``sleep`` child still dies from SIGTERM):

  before: harness exited after 60.551s (blocked on spec.timeout)
  after:  harness exited after 2.104s  (grace_seconds=2)

Test ``test_sigint_escalates_to_sigkill_when_term_is_ignored``
locks this in: spawns the Python-ignores-TERM subprocess, calls
the handler directly, asserts the handler returns within
``grace_seconds + buffer`` and that ``returncode == -SIGKILL`` (the
signature of the escalation running — a TERM-responsive exit would
have been ``-SIGTERM``).  Uses a sentinel file to close the race
where SIGTERM can arrive before the child's Python-level handler
is installed (in which case Python's default disposition would
terminate the child on SIGTERM and miss the code path under test).

125 hook tests pass.
``cli.py`` installs its own SIGTERM handlers (``_signal_handler`` for
interactive chat, ``_signal_handler_q`` for single-query ``-q`` mode)
*after* ``hermes_cli/main.py`` calls
``agent.shell_hooks._maybe_install_signal_handlers``.  ``signal.signal``
overwrites — no chaining — so our ``_async_pool_sigterm_handler`` is
never reached when the agent is actually running.  The live-test
harness didn't hit this because it doesn't install cli.py's handler.

Symptom: ``kill -TERM`` on ``hermes chat -q ...`` with a hook that
ignores SIGTERM took 62.8 s to exit (``spec.timeout`` default), not
the 2 s ``hooks_async_shutdown_grace_seconds`` the PR description
advertises.  The interpreter waits for non-daemon ``ThreadPoolExecutor``
workers, which block in ``proc.communicate`` until the subprocess dies.

Fix: call ``shutdown_async_hooks()`` inline in both cli.py handlers
before raising ``KeyboardInterrupt``.  Idempotent — no-op if the async
pool was never materialised.  Verified end-to-end:

  before: agent exit 62.818s (TERM-ignoring hook, grace=2s, timeout=60s)
  after : agent exit  4.487s  (= 1.5s cli.py grace + 2s hook grace + ~1s unwind)

Happy path untouched — a TERM-responsive hook still exits in 2.6 s
(cli.py grace + pool shutdown, no grace wait).
# Conflicts:
#	hermes_cli/plugins.py
#	website/docs/user-guide/features/hooks.md
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant