Async hooks by pefontana · Pull Request #5 · pefontana/hermes-agent

pefontana · 2026-04-22T21:20:16Z

Builds on #4. Adds two things to shell hooks: async: true so slow hooks don't block the agent loop, and a new subagent_batch_complete event that fires once per delegate_task call with an aggregate view of the whole batch.

Two things you can do now that were awkward before:

Post to webhooks, Slack, or any slow sink without slowing the agent. Add async: true to a hook entry and the subprocess runs on a background thread pool. The agent loop doesn't wait — a 2-second webhook round-trip costs ~2 ms of schedule time instead of 2 s of blocked tool call.
Notify once when the whole delegate_task swarm finishes, not per child. The new subagent_batch_complete event fires exactly once per delegate_task — whether you fanned out to 1 worker or 20 — with aggregate counters and per-child summaries. No more de-duping N per-child notifications.

Example — async Slack audit on every tool call

# ~/.hermes/config.yaml
hooks:
  post_tool_call:
    - command: "~/.hermes/agent-hooks/audit-slack.sh"
      async: true
      timeout: 120
hooks_async_pool_size: 10
hooks_async_shutdown_grace_seconds: 5

Example — one Slack post when a batch finishes

# ~/.hermes/agent-hooks/slack-swarm-summary.sh
payload=$(cat)
total=$(echo  "$payload" | jq -r '.extra.child_count')
ok=$(echo     "$payload" | jq -r '.extra.completed_count')
failed=$(echo "$payload" | jq -r '.extra.failed_count + .extra.errored_count')
curl -sS -X POST -H 'Content-Type: application/json' \
     -d "{\"text\": \"Batch: $ok/$total ok, $failed failed\"}" "$SLACK_WEBHOOK_URL"
echo '{}'

hooks:
  subagent_batch_complete:
    - command: "~/.hermes/agent-hooks/slack-swarm-summary.sh"

Which events can go async

Not every event makes sense as async — some only propagate through their return value.

✅ Accepted silently: post_tool_call, post_llm_call, pre_api_request, post_api_request, on_session_*, subagent_stop, subagent_batch_complete.
⚠️ Warned: pre_tool_call — accepted for audit-style use, but any {"action": "block"} return is discarded because the tool has already executed.
❌ Rejected at config-parse: pre_llm_call, transform_tool_result, transform_terminal_output — the return value IS the product, so async would make the hook useless. Logged as an ERROR and not registered.

`subagent_batch_complete` payload

Stdin shape (fires once per delegate_task, after every per-child subagent_stop):

{
  "hook_event_name": "subagent_batch_complete",
  "session_id": "parent-session",
  "extra": {
    "child_count": 3,
    "completed_count": 2,
    "failed_count": 0,
    "errored_count": 0,
    "interrupted_count": 1,
    "timeout_count": 0,
    "total_duration_ms": 5000,
    "children": [
      {"task_index": 0, "role": null, "status": "completed",
       "duration_ms": 1200, "summary": "..."}
    ]
  }
}

Invariant: the five counters partition the status keyspace exactly — completed + failed + errored + interrupted + timeout == child_count. No "unknown" bucket.

Use it alongside subagent_stop (still fires per child) when you want both per-child audit and a single batch summary.

Backpressure + shutdown

Bounded thread pool sized to hooks_async_pool_size (default 10, range [1, 100]). Over-saturation drops firings with a WARN log instead of queuing unboundedly.
On shutdown (Ctrl-C, systemd stop, normal exit), every tracked hook subprocess gets SIGTERM. Survivors past hooks_async_shutdown_grace_seconds (default 5, range [0, 60]) get SIGKILL — no webhook posts left orphaned after the agent exits.
Ctrl-C in CLI mode terminates running hooks immediately. Gateway mode integrates the sweep into its asyncio-native shutdown path.

`hermes hooks` CLI updates

hermes hooks list tags async entries with a trailing [async].
hermes hooks doctor warns when pre_tool_call is async (block directives will be ignored).
hermes hooks test <event> runs async hooks synchronously by default so you see stdout; --no-wait exercises the real fire-and-forget path.

Config keys

hooks_async_pool_size: 10                      # [1, 100]
hooks_async_shutdown_grace_seconds: 5          # [0, 60]

Full docs + worked examples in website/docs/user-guide/features/hooks.md.

Type of Change

✨ New feature (non-breaking change that adds functionality)
📝 Documentation update
✅ Tests

Rewrites _spawn to use subprocess.Popen + Popen.communicate instead of subprocess.run, and registers every running child in a module-level _live_procs set guarded by _async_tracking_lock. Sync callers see semantically identical behaviour (same result-dict shape, same timeout and error dispatch), but the child handle is now reachable from a future async-hook shutdown path so a blocked worker thread can be unblocked by terminating its subprocess externally. Adds TestPopenRefactorPreservesSyncSemantics as a regression gate: the result-dict shape, timeout path (terminate + kill with drained output), missing-command error, and _live_procs drain invariant are all pinned.

Fires once per delegate_task call, after every per-child subagent_stop, with a five-counter partition of the status keyspace (completed/failed/errored/interrupted/timeout_count) that sums to child_count exactly. Role is stashed into _child_role_for_batch before the per-child loop pops _child_role, then stripped before serialisation so it doesn't leak into the tool result. Adds the synthetic _DEFAULT_PAYLOADS entry so `hermes hooks test subagent_batch_complete` works out of the box, registers the event name in VALID_HOOKS, and pins the contract with a test suite covering: single-mode and batch-mode firing, counter partitioning including the timeout status, per-child role preservation, invocation order after subagent_stop, parent-thread serialisation, extra.* stdin nesting for shell hooks, and the tool-result not leaking the helper field.

Declaring `async: true` on a hook entry now runs the subprocess in a background thread pool so the agent loop doesn't wait. Introduces: * A three-tier compatibility gate at config-parse time — `post_*`, session lifecycle, and `subagent_*` events are accepted silently; `pre_tool_call` is warned (its late block directive is logged and discarded); `pre_llm_call` and the transform_* events are rejected with an ERROR log because their return value is the whole product. * A ThreadPoolExecutor fronted by a BoundedSemaphore so saturation drops firings with a WARN instead of queueing unboundedly — real backpressure under a chatty `post_tool_call`. * Process-handle tracking via the _live_procs set (introduced in the Popen refactor) and a _live_futures set, both guarded by _async_tracking_lock. * shutdown_async_hooks() sweeps tracked subprocesses with SIGTERM, waits up to `hooks_async_shutdown_grace_seconds`, then SIGKILLs survivors. Registered via atexit on first pool materialisation. * CLI mode installs a SIGINT handler that terminates tracked subprocesses immediately and chains to the previous handler so KeyboardInterrupt propagation still happens; gateway mode plugs shutdown_async_hooks() into its asyncio-native teardown sequence instead (asymmetry documented in the SIGINT handling comment). * _reset_async_pool() test-only helper for teardowns that varied pool size, grace, or signal disposition — without it module-level singletons leak across cases. * Two new config keys — `hooks_async_pool_size` (default 10, [1,100]) and `hooks_async_shutdown_grace_seconds` (default 5, [0,60]) — with clamp-on-read so bad values warn and coerce. * Runtime compat: a sync hook that emits `{"async": true}` on stdout (the Claude-Code runtime marker) logs a warning and is treated as a no-op — copy-pasted Claude Code scripts don't silently change semantics.

* `hermes hooks list` tags async entries with a trailing ``[async]`` marker so operators can spot fire-and-forget hooks at a glance. * `hermes hooks doctor` emits a warning line for async `pre_tool_call` entries — block directives would be discarded because the tool has already executed by the time the background subprocess completes. * `hermes hooks test <event>` now handles async specs sanely: by default it runs synchronously via run_once so the caller sees stdout + parsed response. The new ``--no-wait`` flag exercises the real fire-and-forget path (via _make_async_callback) and reports the schedule latency, useful for testing that the bounded semaphore and tracking set behave under load.

New tests/agent/test_shell_hooks_async.py pins the full async bridge contract with 16 cases: * config-parse validation tiers — accepted, warned (`pre_tool_call`), rejected-at-config-parse (`pre_llm_call`, `transform_*`) * fire-and-forget callback returns in under ~100 ms while the subprocess runs in the background * the BoundedSemaphore drops firings on saturation, the subprocess is never spawned, and a later firing after release succeeds * async `pre_tool_call` block directives are logged WARN and discarded * shutdown with a small grace terminates running subprocesses via SIGTERM and returns promptly, never burning the full grace when the child honors the signal * the Claude-Code runtime marker `{"async": true}` on a sync hook emits a warning and is otherwise a no-op * thread-safety — 50 concurrent firings from 10 threads drain to zero * CLI-mode SIGINT handler terminates tracked subprocesses and chains to the previous handler (including the SIG_DFL → KeyboardInterrupt fallback) The autouse `_reset_async_state` fixture calls `_reset_async_pool` in teardown so module-level singletons never leak across cases.

* New "Async hooks" section in the shell-hooks chapter with the three-tier compatibility table (accepted / warned / rejected), notes on the `hooks_async_pool_size` and `hooks_async_shutdown_grace_seconds` config keys, the backpressure semantic (drop-on-saturation with a WARN), the sync/async allowlist invariant, and the CLI-vs-gateway SIGINT asymmetry. * New worked example `slack-swarm-summary.sh` alongside the existing `log-orchestration.sh` — the first uses subagent_batch_complete for single-firing aggregate notifications, the second uses subagent_stop for per-child auditing. Contrast is explicit. * New `### subagent_batch_complete` plugin-hook reference section with the full callback signature, the five-counter partition and its invariant (sum equals child_count), and an example plugin. * Quick-reference table updated with the new event; shell-hooks configuration schema shows the `async` flag and the two new config keys. * subagent_stop status-keyspace table updated from four values to five — the `timeout` status has been live since the delegation hard-timeout landed (PR NousResearch#13770) but wasn't previously documented.

Three small correctness fixes on the async-hook bridge, surfaced by a review pass: * Double-checked locking in ``_async_pool_get`` and ``_async_sem_get``. Without the lock, two concurrent firings could race past the ``is None`` check and each construct their own pool/semaphore — the first would leak. A shared ``_async_init_lock`` guards the creation window for both getters. * ``_on_async_future_done`` now snapshots ``_async_sem_inst`` directly instead of round-tripping through ``_async_sem_get``. If ``_reset_async_pool`` clears ``_async_sem_inst`` between submit and callback, the getter would lazy-create a *new* semaphore just to release it — leaving that sem one over its bound. Snapshot + None check + swallow ``ValueError`` leaves the release safely idempotent against reset. * ``_reset_async_pool`` only restores the SIGINT handler it actually installed. Tests that flip ``_sigint_handler_installed`` manually (to exercise ``_async_pool_sigint_handler`` without running ``_maybe_install_sigint_handler``) never mutated the process's real signal disposition; the teardown was writing a fake handler back into it and leaking across tests. A new ``_sigint_handler_owned_by_us`` flag gates the restore. Also simplifies a bit: * Drop the ``_ACCEPTED_ASYNC_EVENTS`` frozenset — defined with a documentation comment but never referenced anywhere. * Drop the ``"ShellHookSpec"`` forward-reference in ``_run_async_body`` — the class is defined above in the same file. * Remove the dead ``except KeyboardInterrupt: raise`` in the SIGINT handler (re-raising the same exception is a no-op). * Use the exception arg via ``exc_info=`` rather than both ``%s`` and ``exc_info=`` (double-formats the same exception). * Fold the Claude-Code runtime ``{"async": true}`` check into a new ``_parse_response_with_extras`` so ``_make_sync_callback`` decodes the stdout once instead of twice.

* ``tools/delegate_tool.py`` no longer smuggles ``_child_role`` through a sibling ``_child_role_for_batch`` key. The subagent_stop loop now reads ``_child_role`` with ``get()`` so the batch aggregate below can still see it; both loops are followed by one strip pass that pops ``_child_role`` from every entry before the tool result is serialised. * ``hermes_cli/hooks.py`` imports ``time`` at module level instead of wrapping ``time.monotonic()`` in a ``_time_now`` helper whose only content was a local ``import time``. No behaviour change — same firing order, same payload shape, same tool-result JSON.

Three independent reviews surfaced a handful of real bugs. Fixing all of them here: * **SIGTERM orphans hook subprocesses (codex #1).** The CLI only installed a SIGINT handler — SIGTERM (from ``kill``, ``timeout``, systemd stop, CI harnesses) skips atexit entirely and leaves every in-flight hook subprocess running as an orphan owned by init. Adds ``_async_pool_sigterm_handler`` which terminates tracked subprocess groups inline, then routes to ``sys.exit(128 + SIGTERM)``. Inline termination is required because ``ThreadPoolExecutor`` uses non-daemon threads: Python waits for every worker to return before running atexit, and workers block inside ``proc.communicate(timeout=spec.timeout)`` until the subprocess dies. Renamed ``_maybe_install_sigint_handler`` → ``_maybe_install_signal_handlers`` (with back-compat alias). Verified: ``kill -TERM`` on a hermes CLI running a 4 s ``sleep`` hook now exits in ~0.7 s with no orphan, was 4 s + orphan. * **Subprocess groups for reliable termination.** Hooks are now spawned with ``start_new_session=True`` so the subprocess is its own PGID leader. Shutdown / SIGINT / SIGTERM paths call ``os.killpg`` on the group instead of ``proc.terminate()`` — without this, a bash script's orphaned ``sleep`` child kept the parent stdout FD open and blocked ``proc.communicate`` for the full sleep duration. ``_terminate_group`` / ``_kill_group`` helpers fall back to plain ``terminate`` / ``kill`` on edge cases where ``getpgid`` fails (already-exited proc, non-POSIX). * **``hermes hooks test --no-wait`` blocks for full hook runtime (codex #2).** The flag advertised fire-and-forget but the CLI's ``ThreadPoolExecutor`` atexit ``pool.shutdown(wait=True)`` joined the worker anyway, which in turn waited for the subprocess. ``_cmd_test`` now polls briefly for ``_live_procs`` to fill (so the subprocess definitely spawned), then ``os._exit(0)`` — skipping atexit entirely. The subprocess keeps running under init because of ``start_new_session=True``. Verified: CLI exit dropped from 2.3 s to 76 ms for a 2-second hook, and the hook still writes its audit log 3 s later after the CLI is gone. * **Stale ``_child_role_for_batch`` test (claude #1 / hermes #2).** The test from commit 76d3ffd4 asserted the *old* helper field name — no code path sets it post-refactor (455c136f), so the test passed trivially without verifying anything. Fixed to assert ``_child_role`` (the real field) is stripped, and added an explanatory message so a future failure is easier to diagnose. Module-header docstring updated too. * **``submit()`` RuntimeError branch: stale-semaphore parity fix (claude #3).** Same pattern I already fixed in ``_on_async_future_done``, missed here: a concurrent ``_reset_async_pool`` between ``acquire`` and ``release`` would cause ``_async_sem_get()`` to lazy-create a fresh sem and over- release on it. Snapshot ``_async_sem_inst`` + swallow ``ValueError`` like the symmetric path. * **Shutdown race: proc registered after the snapshot (claude #4 / hermes #1).** Worker that got between ``subprocess.Popen()`` and ``_register_live_proc(proc)`` would miss the shutdown-sweep snapshot and block for the full ``spec.timeout``. After registering, the worker now checks ``_async_shutting_down`` and self-terminates its subprocess group. * **WARN log noise on SIGTERM'd children (claude #5).** Shutdown-induced exits (rc = -15 / -9) no longer spam a per-proc ``WARNING`` — demoted to ``DEBUG`` when ``_async_shutting_down`` is set. Both the atexit path and the signal handlers now set the flag before terminating, so a Ctrl-C or a ``kill -TERM`` with 10 running hooks emits zero warn lines instead of 10. Still outstanding (documented trade-offs, not fixed here): * Gateway shutdown blocks the event loop for up to ``grace_seconds`` (claude #2). Acknowledged as a follow-up candidate via ``loop.run_in_executor``. * ``_maybe_install_signal_handlers`` is still leading-underscore (claude NousResearch#6). Cosmetic; kept consistent with the rest of the module's private-by-convention API. All 101 hook tests still pass.

Two more real bugs surfaced by a follow-up review round: * **Windows regression (codex #1 / hermes).** The subprocess termination helpers called ``os.killpg(os.getpgid(proc.pid), signal.SIGKILL)`` guarded only by ``except (ProcessLookupError, OSError)``. On Windows those module attributes don't exist (``AttributeError``) and ``signal.SIGKILL`` is undefined, so any timeout / shutdown path would crash instead of cleaning up. Adds ``_IS_WINDOWS`` and the platform-guarded ``proc.terminate()`` / ``proc.kill()`` fallback, matching the convention in ``tools/process_registry.py``. Adds ``agent/shell_hooks.py`` to ``tests/tools/test_windows_compat.py`` ``GUARDED_FILES`` so the AST check enforces this going forward. * **SIGKILL escalation skipped under signal-initiated shutdown (codex #2).** ``_async_pool_sig{int,term}_handler`` flip ``_async_shutting_down = True`` before atexit / gateway shutdown runs ``shutdown_async_hooks``, which previously used that same flag for idempotency. Result: the documented SIGTERM-wait-SIGKILL escalation was silently skipped on every signal-initiated shutdown — a hook script with ``trap '' TERM`` only died via the per-hook timeout (60 s default) instead of the shutdown grace (5 s). Decouples the two concerns: ``_async_shutting_down`` still gates new submissions and demotes worker-exit log levels; idempotency now lives on a separate ``_async_cleanup_ran`` under ``_async_cleanup_lock`` (which also makes concurrent shutdown callers thread-safe, which the bare-bool gate wasn't). Regression test (``test_shutdown_escalates_even_when_shutting_down_flag_preset``) covers the TERM-trap path: pre-sets the flag, spawns a TERM-trapping subprocess, calls shutdown, and asserts the child was killed within the grace window. 112 hook tests pass. The 24 broader failures seen in tests/tools and tests/gateway reproduce identically on the pre-change branch — they are unrelated pre-existing breaks from being cut from an older merge base.

Live-testing the prior commit's flag decoupling surfaced that the CLI signal path still blocked for up to ``spec.timeout`` (60 s default, 300 s max) whenever a hook script ignored SIGTERM — because the handlers only sent SIGTERM and relied on atexit to run the SIGKILL escalation, but atexit fires *after* non-daemon worker threads finish. A worker stuck in ``proc.communicate`` waiting on a TERM-ignoring subprocess kept the interpreter alive for the full per-hook timeout. Run ``shutdown_async_hooks()`` inline from ``_async_pool_sig{int,term}_handler`` so the escalation (SIGTERM → wait ``grace_seconds`` → SIGKILL survivors → ``pool.shutdown(wait=True)``) completes before the handler chains / ``sys.exit``s. atexit's subsequent call to ``shutdown_async_hooks`` sees ``_async_cleanup_ran=True`` and no-ops. End-to-end measurement on a harness that registers a Python hook which truly ignores SIGTERM (``signal.signal(SIGTERM, lambda *a: None)`` — bash's ``trap '' TERM`` doesn't qualify because bash's external ``sleep`` child still dies from SIGTERM): before: harness exited after 60.551s (blocked on spec.timeout) after: harness exited after 2.104s (grace_seconds=2) Test ``test_sigint_escalates_to_sigkill_when_term_is_ignored`` locks this in: spawns the Python-ignores-TERM subprocess, calls the handler directly, asserts the handler returns within ``grace_seconds + buffer`` and that ``returncode == -SIGKILL`` (the signature of the escalation running — a TERM-responsive exit would have been ``-SIGTERM``). Uses a sentinel file to close the race where SIGTERM can arrive before the child's Python-level handler is installed (in which case Python's default disposition would terminate the child on SIGTERM and miss the code path under test). 125 hook tests pass.

``cli.py`` installs its own SIGTERM handlers (``_signal_handler`` for interactive chat, ``_signal_handler_q`` for single-query ``-q`` mode) *after* ``hermes_cli/main.py`` calls ``agent.shell_hooks._maybe_install_signal_handlers``. ``signal.signal`` overwrites — no chaining — so our ``_async_pool_sigterm_handler`` is never reached when the agent is actually running. The live-test harness didn't hit this because it doesn't install cli.py's handler. Symptom: ``kill -TERM`` on ``hermes chat -q ...`` with a hook that ignores SIGTERM took 62.8 s to exit (``spec.timeout`` default), not the 2 s ``hooks_async_shutdown_grace_seconds`` the PR description advertises. The interpreter waits for non-daemon ``ThreadPoolExecutor`` workers, which block in ``proc.communicate`` until the subprocess dies. Fix: call ``shutdown_async_hooks()`` inline in both cli.py handlers before raising ``KeyboardInterrupt``. Idempotent — no-op if the async pool was never materialised. Verified end-to-end: before: agent exit 62.818s (TERM-ignoring hook, grace=2s, timeout=60s) after : agent exit 4.487s (= 1.5s cli.py grace + 2s hook grace + ~1s unwind) Happy path untouched — a TERM-responsive hook still exits in 2.6 s (cli.py grace + pool shutdown, no grace wait).

# Conflicts: # hermes_cli/plugins.py # website/docs/user-guide/features/hooks.md

pefontana added 9 commits April 22, 2026 18:13

pefontana added 4 commits April 24, 2026 12:57

Merge remote-tracking branch 'origin/main' into async-hooks

841d3dd

# Conflicts: # hermes_cli/plugins.py # website/docs/user-guide/features/hooks.md

pefontana force-pushed the async-hooks branch from 1c05a22 to 841d3dd Compare April 24, 2026 19:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Async hooks#5

Async hooks#5
pefontana wants to merge 13 commits into
mainfrom
async-hooks

pefontana commented Apr 22, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pefontana commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Example — async Slack audit on every tool call

Example — one Slack post when a batch finishes

Which events can go async

subagent_batch_complete payload

Backpressure + shutdown

hermes hooks CLI updates

Config keys

Type of Change

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

pefontana commented Apr 22, 2026 •

edited

Loading

`subagent_batch_complete` payload

`hermes hooks` CLI updates