Cluster failure tracking by hjc-puro · Pull Request #10 · NousResearch/hermes-agent

hjc-puro · 2025-11-15T05:01:36Z

triggers consolidation into trajectories.jsonl upon detecting 10 errors from same tool
also logs what kinds of errors

…on export, pinned sessions, context meter Ported from ibelick/webclaw PRs NousResearch#24, #10, NousResearch#14, NousResearch#13: - Command palette (⌘K): search and switch sessions instantly - Conversation export: download as Markdown, JSON, or Plain Text - Pinned sessions: pin/unpin from context menu, shown at top of sidebar - Context meter: token usage ring in chat header with hover details - Keyboard shortcuts: ⌘K search, ⌘⇧O new session New UI primitives: autocomplete, command, input, preview-card Attachment button/preview components (composer already has built-in support)

- Active (blue pulse), Waiting (amber), Error (red), Idle counts - Pending approval count badge - Responsive: hidden on mobile, visible on sm+ - Updates in real-time from agentWorkingRows

…en accounting' (NousResearch#10) from fix/gateway-status-tokens into main

…search#5989) Reverts the fork-local accumulator divergence and adopts upstream PR NousResearch#5989's architecturally correct fix: read /status token totals from SessionDB (the source of truth where the agent persists tokens directly), not from the gateway-side SessionStore accumulator. ## Background The user reported `/status` on Telegram showed `Tokens: 0` while the local CLI status bar correctly showed token usage. We initially shipped PR NousResearch#10 with a `_lifetime_mirror` accumulator-delta-tracking machinery to fix the symptom — until we found that: 1. Issue NousResearch#5960 had been filed 2 days earlier upstream by Louise-Qiuqiu with a more thorough root-cause analysis. 2. PR NousResearch#5989 by Tranquil-Flow was already open with a much better fix. 3. The fork was carrying a fork-local commit `1daa37bb` ("fix(gateway): wire LLM token usage into session store for /status") that re-introduced an accumulator pattern upstream had deliberately removed in commit 20441cf ("fix(insights): persist token usage for non-CLI sessions"). The accumulator divergence was the actual mistake; PR NousResearch#10 was patching its symptoms instead of reverting the mistake. Architecturally correct flow (upstream + this commit): - Agent persists token counts directly into SessionDB via `_flush_messages_to_session_db` after each turn. - SessionDB (`hermes_state.py`) is the source of truth. - Gateway's `/status` command reads from SessionDB via `get_session_token_totals(session_id)`. - SessionStore's `update_session` only tracks lightweight metadata (`last_prompt_tokens`) for compression / context-window decisions. ## Changes - `hermes_state.py`: add `get_session_token_totals(session_id)` that aggregates input/output/cache_read/cache_write/reasoning columns from the `sessions` table and returns the sum as `total_tokens`. Returns `None` if the session is not in the DB. Verbatim from upstream PR NousResearch#5989. - `gateway/run.py:_handle_status_command`: query SessionDB for token totals; fall back to `session_entry.total_tokens` only if the DB row is missing (fresh install, DB unavailable, pre-SessionDB session). Verbatim from upstream PR NousResearch#5989. - `gateway/run.py:_run_agent`: revert PR NousResearch#10's `_total_toks` plumbing in both return dicts. No longer needed — token persistence happens via the agent → SessionDB path. - `gateway/run.py` (turn-end persistence call): revert PR NousResearch#10's `update_session(input_tokens=, output_tokens=, total_tokens=)` call to upstream's `update_session(session_key, last_prompt_tokens=...)` shape. Token totals are not gateway's concern. - `gateway/session.py:update_session`: revert to upstream's lightweight signature (`session_key, last_prompt_tokens=None`). Drop the `_lifetime_mirror` accumulator infrastructure entirely. - `tests/gateway/test_session.py`: drop the 5 lifetime_mirror regression tests added in PR NousResearch#10. They test machinery that no longer exists. - `tests/gateway/test_status_command.py`: add 2 tests from PR NousResearch#5989 covering SessionDB-preferred totals + the SessionStore fallback when the DB row is missing. - `tests/test_hermes_state.py`: add 2 tests from PR NousResearch#5989 covering the new helper (column sum + missing-row None behavior). ## Result | | Before PR NousResearch#10 | After PR NousResearch#10 | After this commit | |---|---|---|---| | Architecture | fork accumulator (broken) | fork accumulator + mirror band-aid | upstream SessionDB read | | /status accuracy | Tokens: 0 | correct | correct | | Maintenance burden | high (diverges from upstream) | high | low (matches upstream) | | Subagent-found bugs | - | zero-call mirror corruption + reset_session leak | n/a — machinery gone | ## Test results `pytest tests/test_hermes_state.py tests/gateway/test_status_command.py tests/gateway/test_session.py -q` 202 passed, 0 failed.

- connection.py: cap header read at 8KB to prevent DoS from malicious handler - handler.py: use .find() instead of `in` + .index() to eliminate race in patch - handler.py: add truncated field to execute response when output exceeds 50KB - server.py: include error data field in formatted error messages - test: add timeout to test client recv, handle TimeoutExpired in close Fixes issues NousResearch#1, NousResearch#4, NousResearch#5, NousResearch#6, NousResearch#8, NousResearch#10 from Qwen 3.5 peer review on PR NousResearch#19. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Barbate91

LGTM. This is a major improvement in system robustness, especially the cluster failure detection and consolidated error handling. Great work on making the agent more fault-tolerant!

Barbate91

Code Review Summary\n\nVerdict: Changes Requested (2 minor suggestions, 0 critical issues)\n\nThis PR introduces significant improvements to fault tolerance, error handling, and observability, which is excellent. The logic for extracting tool errors and implementing batch failure thresholds is robust.\n\n### ⚠️ Warnings\n- batch_runner.py: The function (if intended to be used) is not fully implemented/used in the provided diff structure. Please ensure this function is fully integrated and utilized to provide complete statistical reporting.\n\n### 💡 Suggestions\n- batch_runner.py: Consider adding a formal unit test suite for the new and functions to verify complex error parsing paths (especially non-JSON/malformed tool responses).\n- safe_print.py: While the block is good for fallbacks, ensure that the rich library installation is clearly documented in the repository's setup guide for new contributors.\n\n### ✅ Looks Good\n- Clear separation of concerns in the new error extraction logic.\n- Robust implementation of early exit conditions based on tool failure rate/count.\n- Excellent documentation within the functions and parameters.

Barbate91

Code Review Summary\n\nVerdict: Changes Requested (2 minor suggestions, 0 critical issues)\n\nThis PR introduces significant improvements to fault tolerance, error handling, and observability, which is excellent. The logic for extracting tool errors and implementing batch failure thresholds is robust.\n\n### ⚠️ Warnings\n- batch_runner.py: The function (if intended to be used) is not fully implemented/used in the provided diff structure. Please ensure this function is fully integrated and utilized to provide complete statistical reporting.\n\n### 💡 Suggestions\n- batch_runner.py: Consider adding a formal unit test suite for the new and functions to verify complex error parsing paths (especially non-JSON/malformed tool responses).\n- safe_print.py: While the block is good for fallbacks, ensure that the rich library installation is clearly documented in the repository's setup guide for new contributors.\n\n### ✅ Looks Good\n- Clear separation of concerns in the new error extraction logic.\n- Robust implementation of early exit conditions based on tool failure rate/count.

…ption - Add memory_review_prompt and combined_review_prompt to config.yaml - run_agent.py reads all three review prompts from config, fallback to class attrs - skill_manage tool description loaded from config.yaml (eliminate contradiction with review prompts) - _load_skill_manage_description() reads config, minimal fallback if missing - Update LOCAL_MODIFICATIONS.md entry NousResearch#10

Captures Tier 3 tooling work that was considered during the 2026-04-26 overhaul but deferred. Each section explains the design, why it wasn't shipped, and the trigger condition that should reopen it. Tier 3 items deferred: NousResearch#7 Lazy MCP discovery — high refactor surface, low immediate value (gateway RAM/startup not currently constrained). NousResearch#8 Health-aware tool router — overlaps with already-shipped manual fallback (Tavily→DDG/httpx); marginal gain doesn't justify scaffolding. NousResearch#10 Batch tool API — concurrent path + within-turn dedup cache already mitigate the common case; would be disruptive. Each section includes a reference design so future work can pick up without re-deriving rationale, plus a "reopen when..." condition tied to observable signals (memory > 4GB, fallback rate > 30%, etc.) so the weekly health check (or operator) can flag when revisit is justified. Tier 3 NousResearch#9 (LLM-summarize large results) was shipped in 9b3cef1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…le-vision into feature/browser-use-api Reviewed-on: https://git.lambda.coredump.ru/APEX/BrowserUse_and_ComputerUse_skills/pulls/10

@pty819

#20226) * docs(AGENTS.md): add curator/cron/delegation/toolsets, fix plugin tree, frontmatter, auto-discovery caveat Closes #19101 and #19107 (@pty819). Verified 16 claims from those two issues against current main. 12 were real gaps; 2 were generated/hallucinated (#10 unverified --now flag is actually real and already cited in AGENTS.md; #11 stale PR refs #5587 and #4950 do not appear in AGENTS.md at all); 2 were low-prio nits (memory provider hierarchy, --now scope enumeration) deferred. Changes: - Project tree: add yuanbao to platforms comment; expand plugins/ subtree with real directory names (kanban, hermes-achievements, observability, image_gen) instead of vague '<others>'. - Test-count blurb: 15k/700 Apr → 17k/900 May (verified: 17,375 test defs, 915 files). - Adding New Tools: clarify that auto-discovery wires up schemas but the tool only reaches an agent if its name is added to a toolset in toolsets.py. _HERMES_CORE_TOOLS is not dead code. - Adding Configuration: enumerate top-level config.yaml sections including auxiliary and curator; note auxiliary is per-task overrides for side-LLM work. - SKILL.md frontmatter: add author, license, related_skills. Note top-level tags/category are mirrored from metadata.hermes.*. - New section 'Toolsets' — enumerates the 30 current TOOLSETS keys (including yuanbao, kanban, moa, spotify, safe, debugging). - New section 'Delegation (delegate_task)' — sync semantics, batch mode, leaf vs orchestrator roles, config knobs, durability caveat. - New section 'Curator (skill lifecycle)' — core files, 11 CLI verbs, telemetry sidecar, invariants (pin/delete split after PR #20220, bundled/hub off-limits), curator.* config section. - New section 'Cron (scheduled jobs)' — 4 schedule formats, 7 CLI verbs, per-job fields, 3-min hard interrupt, catchup/grace windows, tick.lock, cron→session isolation. Skipped (invalid claims): - #19107 item 10: --now is real (hermes_cli/skills_hub.py:624/966/1013/1470) - #19107 item 11: no '#5587' or '#4950' or 'async_delegation' in AGENTS.md * docs(AGENTS.md): add Kanban section Adds a Kanban entry alongside Curator / Cron / Delegation so the major durable background systems are all represented. Covers the CLI verbs, the HERMES_KANBAN_TASK-gated worker toolset, the in-gateway dispatcher, plugin assets, and the board/tenant isolation model. Points at the full 742-line user docs for detail.

@pty819

NousResearch#20226) * docs(AGENTS.md): add curator/cron/delegation/toolsets, fix plugin tree, frontmatter, auto-discovery caveat Closes NousResearch#19101 and NousResearch#19107 (@pty819). Verified 16 claims from those two issues against current main. 12 were real gaps; 2 were generated/hallucinated (NousResearch#10 unverified --now flag is actually real and already cited in AGENTS.md; NousResearch#11 stale PR refs NousResearch#5587 and NousResearch#4950 do not appear in AGENTS.md at all); 2 were low-prio nits (memory provider hierarchy, --now scope enumeration) deferred. Changes: - Project tree: add yuanbao to platforms comment; expand plugins/ subtree with real directory names (kanban, hermes-achievements, observability, image_gen) instead of vague '<others>'. - Test-count blurb: 15k/700 Apr → 17k/900 May (verified: 17,375 test defs, 915 files). - Adding New Tools: clarify that auto-discovery wires up schemas but the tool only reaches an agent if its name is added to a toolset in toolsets.py. _HERMES_CORE_TOOLS is not dead code. - Adding Configuration: enumerate top-level config.yaml sections including auxiliary and curator; note auxiliary is per-task overrides for side-LLM work. - SKILL.md frontmatter: add author, license, related_skills. Note top-level tags/category are mirrored from metadata.hermes.*. - New section 'Toolsets' — enumerates the 30 current TOOLSETS keys (including yuanbao, kanban, moa, spotify, safe, debugging). - New section 'Delegation (delegate_task)' — sync semantics, batch mode, leaf vs orchestrator roles, config knobs, durability caveat. - New section 'Curator (skill lifecycle)' — core files, 11 CLI verbs, telemetry sidecar, invariants (pin/delete split after PR NousResearch#20220, bundled/hub off-limits), curator.* config section. - New section 'Cron (scheduled jobs)' — 4 schedule formats, 7 CLI verbs, per-job fields, 3-min hard interrupt, catchup/grace windows, tick.lock, cron→session isolation. Skipped (invalid claims): - NousResearch#19107 item 10: --now is real (hermes_cli/skills_hub.py:624/966/1013/1470) - NousResearch#19107 item 11: no 'NousResearch#5587' or 'NousResearch#4950' or 'async_delegation' in AGENTS.md * docs(AGENTS.md): add Kanban section Adds a Kanban entry alongside Curator / Cron / Delegation so the major durable background systems are all represented. Covers the CLI verbs, the HERMES_KANBAN_TASK-gated worker toolset, the in-gateway dispatcher, plugin assets, and the board/tenant isolation model. Points at the full 742-line user docs for detail.

Messages starting with `//` or `cmd:` (case-insensitive, optional leading whitespace) now skip idea capture and fall through to the normal agent path. Use this when you want to give an instruction about existing ideas (e.g. `// 把 NousResearch#10 合并到 NousResearch#8`) instead of having the message recorded as a brand-new 灵感. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-20260525-v2026.5.16 chore: sync upstream release v2026.5.16 into downstream main

Four findings from Copilot's review on PR NousResearch#22891, all in the AX elements-array cap added by 22fa1ed: 1. The truncation note ("response truncated to N of M elements") was appended unconditionally — including in the som/vision multimodal path, whose response carries a screenshot rather than an `elements` array. The note described a payload field that wasn't present. Moved the note into the AX-text branch where the array actually appears. 2. `_format_elements(cap.elements)` ran on the full untrimmed list with its own `max_lines=40` cap, so a caller passing `max_elements=10` would see summary lines referencing `NousResearch#11..NousResearch#40` even though the JSON `elements` array only held NousResearch#1..NousResearch#10. Format on `visible_elements` instead so the summary indices always exist in the response. 3. `_coerce_max_elements` enforced a lower bound but no upper bound, so `max_elements=10_000_000` silently disabled the safeguard and reintroduced the original context-blow-up. Added a hard cap (`_MAX_ALLOWED_MAX_ELEMENTS = 1000`) that clamps oversized values. 4. The schema string said "Default 100" but the property carried no `default` field, and claimed `max_elements` had no effect on som/ vision while the image-missing fallback path can still return an elements array. Added `"default": 100`, `"maximum": 1000`, and clarified the fallback-path wording. Each finding gets a regression test: - test_capture_ax_clamps_oversized_max_elements_to_hard_cap - test_capture_ax_summary_indices_match_returned_elements - test_capture_multimodal_summary_omits_truncation_note - test_schema_max_elements_documents_default_and_upper_bound Verified with `pytest tests/tools/test_computer_use.py` (53 passed, including the 5 new cases). Confirmed each new test fails on the pre-fix code path before applying the production change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> #AI commit#

Four findings from Copilot's review on PR NousResearch#22891, all in the AX elements-array cap added by 22fa1ed: 1. The truncation note ("response truncated to N of M elements") was appended unconditionally — including in the som/vision multimodal path, whose response carries a screenshot rather than an `elements` array. The note described a payload field that wasn't present. Moved the note into the AX-text branch where the array actually appears. 2. `_format_elements(cap.elements)` ran on the full untrimmed list with its own `max_lines=40` cap, so a caller passing `max_elements=10` would see summary lines referencing `NousResearch#11..NousResearch#40` even though the JSON `elements` array only held NousResearch#1..NousResearch#10. Format on `visible_elements` instead so the summary indices always exist in the response. 3. `_coerce_max_elements` enforced a lower bound but no upper bound, so `max_elements=10_000_000` silently disabled the safeguard and reintroduced the original context-blow-up. Added a hard cap (`_MAX_ALLOWED_MAX_ELEMENTS = 1000`) that clamps oversized values. 4. The schema string said "Default 100" but the property carried no `default` field, and claimed `max_elements` had no effect on som/ vision while the image-missing fallback path can still return an elements array. Added `"default": 100`, `"maximum": 1000`, and clarified the fallback-path wording. Each finding gets a regression test: - test_capture_ax_clamps_oversized_max_elements_to_hard_cap - test_capture_ax_summary_indices_match_returned_elements - test_capture_multimodal_summary_omits_truncation_note - test_schema_max_elements_documents_default_and_upper_bound Verified with `pytest tests/tools/test_computer_use.py` (53 passed, including the 5 new cases). Confirmed each new test fails on the pre-fix code path before applying the production change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Match the other sandbox backends' per-init filesystem isolation. Docker stamps a fresh 'hermes-<uuid>' container name on every _init (docker.py:508), so a destroyed-then-recreated env always sees a brand-new filesystem. Gondolin's sandbox_dir is deterministic from task_id, and _setup_overlay_mounts keeps the scratch dir (overlays/<safe>/{upper,work,merged}) on disk across env lifecycles. The next env that mounts the same guest_path under the same sandbox_dir inherits the prior session's writes via the persisted upper layer — a real cross-session contamination bug, not just a disk leak. Fix: _teardown_overlay_mounts now rmtrees the per-mount scratch dir (merged.parent) after the lazy unmount returns. Lazy unmount + open-fd-keeps- inode-alive means this is safe even if the daemon hasn't fully released handles. Crash recovery still preserves upper/ because the import-time sweep only unmounts and never rmtrees. This also closes design-doc revisit item NousResearch#9 (failed-init cleanup). Test: tests/integration/test_gondolin_terminal.py::test_overlay_writes_do_not_leak_ between_env_lifecycles A KVM-gated integration test that asserts the behavioural invariant via the public GondolinEnvironment.execute() API: env1 writes a file into an overlay extra_mount, env2 (same sandbox_dir, same mount config) must not see it. Implementation-agnostic — no mention of upper/ or fuse-overlayfs — so a future migration to a custom upstream VFSProvider (the @earendil-works/ gondolin package ships vfs/provider) satisfies the same contract trivially and the test passes for free. Doc updates (DO NOT MERGE revisit list): - NousResearch#9 marked resolved (this fix) - NousResearch#6 narrowed: lists the one test we now have and what's still missing - NousResearch#10 added: task_id='default' is shared across all top-level agents at the hermes/gateway layer; concurrent-tenancy isolation needs a per-session task_id and is out of scope for this branch - NousResearch#11 added: overlay=true + missing readonly is a silent UX trap (host-side scratch is created, daemon makes guest mount EROFS) Regression: all 118 gondolin unit + integration tests pass. DO NOT MERGE — see docs/design/gondolin-terminal-backend.md.

@pty819

NousResearch#20226) * docs(AGENTS.md): add curator/cron/delegation/toolsets, fix plugin tree, frontmatter, auto-discovery caveat Closes NousResearch#19101 and NousResearch#19107 (@pty819). Verified 16 claims from those two issues against current main. 12 were real gaps; 2 were generated/hallucinated (NousResearch#10 unverified --now flag is actually real and already cited in AGENTS.md; NousResearch#11 stale PR refs NousResearch#5587 and NousResearch#4950 do not appear in AGENTS.md at all); 2 were low-prio nits (memory provider hierarchy, --now scope enumeration) deferred. Changes: - Project tree: add yuanbao to platforms comment; expand plugins/ subtree with real directory names (kanban, hermes-achievements, observability, image_gen) instead of vague '<others>'. - Test-count blurb: 15k/700 Apr → 17k/900 May (verified: 17,375 test defs, 915 files). - Adding New Tools: clarify that auto-discovery wires up schemas but the tool only reaches an agent if its name is added to a toolset in toolsets.py. _HERMES_CORE_TOOLS is not dead code. - Adding Configuration: enumerate top-level config.yaml sections including auxiliary and curator; note auxiliary is per-task overrides for side-LLM work. - SKILL.md frontmatter: add author, license, related_skills. Note top-level tags/category are mirrored from metadata.hermes.*. - New section 'Toolsets' — enumerates the 30 current TOOLSETS keys (including yuanbao, kanban, moa, spotify, safe, debugging). - New section 'Delegation (delegate_task)' — sync semantics, batch mode, leaf vs orchestrator roles, config knobs, durability caveat. - New section 'Curator (skill lifecycle)' — core files, 11 CLI verbs, telemetry sidecar, invariants (pin/delete split after PR NousResearch#20220, bundled/hub off-limits), curator.* config section. - New section 'Cron (scheduled jobs)' — 4 schedule formats, 7 CLI verbs, per-job fields, 3-min hard interrupt, catchup/grace windows, tick.lock, cron→session isolation. Skipped (invalid claims): - NousResearch#19107 item 10: --now is real (hermes_cli/skills_hub.py:624/966/1013/1470) - NousResearch#19107 item 11: no 'NousResearch#5587' or 'NousResearch#4950' or 'async_delegation' in AGENTS.md * docs(AGENTS.md): add Kanban section Adds a Kanban entry alongside Curator / Cron / Delegation so the major durable background systems are all represented. Covers the CLI verbs, the HERMES_KANBAN_TASK-gated worker toolset, the in-gateway dispatcher, plugin assets, and the board/tenant isolation model. Points at the full 742-line user docs for detail.

…ex models (NousResearch#24182) * feat(codex-runtime): scaffold optional codex app-server runtime Foundational commit for an opt-in alternate runtime that hands OpenAI/Codex turns to a 'codex app-server' subprocess instead of Hermes' tool dispatch. Default behavior is unchanged. Lands in three pieces: 1. agent/transports/codex_app_server.py — JSON-RPC 2.0 over stdio speaker for codex's app-server protocol (codex-rs/app-server). Spawn, init handshake, request/response, notification queue, server-initiated request queue (for approval round-trips), interrupt-friendly blocking reads. Tested against real codex 0.130.0 binary end-to-end during development. 2. hermes_cli/runtime_provider.py: - Adds 'codex_app_server' to _VALID_API_MODES. - Adds _maybe_apply_codex_app_server_runtime() helper, called at the end of _resolve_runtime_from_pool_entry(). Inert unless 'model.openai_runtime: codex_app_server' is set in config.yaml AND provider in {openai, openai-codex}. Other providers cannot be rerouted (anthropic, openrouter, etc. preserved). 3. tests/agent/transports/test_codex_app_server_runtime.py — 24 tests covering api_mode registration, the rewriter helper (default-off, case-insensitive, opt-in, non-eligible providers preserved), version parser, missing-binary handling, error class. Does NOT require codex CLI installed. This commit is wire-only: the api_mode is recognized but AIAgent does not yet branch on it. Followup commits add the session adapter, event projector, approval bridge, transcript projection (so memory/skill review still works), plugin migration, and slash command. Existing tests remain green: - tests/cli/test_cli_provider_resolution.py (29 passed) - tests/agent/test_credential_pool_routing.py (included above) * feat(codex-runtime): add codex item projector for memory/skill review The translator that lets Hermes' self-improvement loop keep working under the Codex runtime: converts codex 'item/*' notifications into Hermes' standard {role, content, tool_calls, tool_call_id} message shape that agent/curator.py already knows how to read. Item taxonomy (matches codex-rs/app-server-protocol/src/protocol/v2/item.rs): - userMessage → {role: user, content} - agentMessage → {role: assistant, content: text} - reasoning → stashed in next assistant's 'reasoning' field - commandExecution → assistant tool_call(name='exec_command') + tool result - fileChange → assistant tool_call(name='apply_patch') + tool result - mcpToolCall → assistant tool_call(name='mcp.<server>.<tool>') + tool result - dynamicToolCall → assistant tool_call(name=<tool>) + tool result - plan/hookPrompt/etc → opaque assistant note, no fabricated tool_calls Invariants preserved: - Message role alternation never violated: each tool item produces at most one assistant + one tool message in that order, correlated by call_id. - Streaming deltas (item/<type>/outputDelta, item/agentMessage/delta) don't materialize messages — only item/completed does. Mirrors how Hermes already only writes the assistant message after streaming ends. - Tool call ids are deterministic (codex item id-based) so replays produce identical messages and prefix caches stay valid (AGENTS.md pitfall NousResearch#16). - JSON args use sorted_keys for the same reason. Real wire formats verified against codex 0.130.0 by capturing live notifications from thread/shellCommand and including one as a fixture (COMMAND_EXEC_COMPLETED). 23 new tests, all green: - Streaming deltas don't materialize (3 paths) - Turn/thread frame events are silent - commandExecution: 5 tests including non-zero exit annotation + deterministic id stability across replays - agentMessage + reasoning attachment + reasoning consumption - fileChange: summary without inlined content - mcpToolCall: namespaced naming + error surfacing - userMessage: text fragments only (drops images/etc) - opaque items: no fabricated tool_calls - Helpers: deterministic id stability + sorted JSON args - Role alternation invariant across all four tool-shaped item types This commit is a pure addition. AIAgent integration (the wire that uses the projector) is the next commit. * feat(codex-runtime): add session adapter + approval bridge The third self-contained module: CodexAppServerSession owns one Codex thread per Hermes session, drives turn/start, consumes streaming notifications via CodexEventProjector, handles server-initiated approval requests, and translates cancellation into turn/interrupt. The adapter has a single public per-turn method: result = session.run_turn(user_input='...', turn_timeout=600) # result.final_text → assistant text for the caller # result.projected_messages → list ready to splice into AIAgent.messages # result.tool_iterations → tick count for _iters_since_skill nudge # result.interrupted → True on Ctrl+C / deadline / interrupt # result.error → error string when the turn cannot complete # result.turn_id, thread_id → for sessions DB / resume Behavior: - ensure_started() spawns codex, does the initialize handshake, and issues thread/start with cwd + permissions profile. Idempotent. - run_turn() blocks until turn/completed, drains server-initiated requests (approvals) before reading notifications so codex never deadlocks waiting for us, projects every item/completed via the projector, and increments tool_iterations for the skill nudge gate. - request_interrupt() is thread-safe (threading.Event); the next loop iteration issues turn/interrupt and unwinds. - turn_timeout deadlock guard issues turn/interrupt and records an error if the turn never completes. - close() escalates terminate → kill via the underlying client. Approval bridge: Codex emits server-initiated requests for execCommandApproval and applyPatchApproval. The adapter translates Hermes' approval choice vocabulary onto codex's decision vocabulary: Hermes 'once' → codex 'approved' Hermes 'session' or 'always' → codex 'approvedForSession' Hermes 'deny' / anything else → codex 'denied' Routing precedence: 1. _ServerRequestRouting.auto_approve_* flags (cron / non-interactive) 2. approval_callback wired by the CLI (defers to tools.approval.prompt_dangerous_approval()) 3. Fail-closed denial when neither is wired Unknown server-request methods are answered with JSON-RPC error -32601 so codex doesn't hang waiting for us. Permission profile mapping mirrors AGENTS.md: Hermes 'auto' → codex 'workspace-write' Hermes 'approval-required' → codex 'read-only-with-approval' Hermes 'unrestricted/yolo' → codex 'full-access' 20 new tests, all green. Combined with prior commits this PR now has 67 tests across three modules: - test_codex_app_server_runtime.py: 24 (api_mode + transport surface) - test_codex_event_projector.py: 23 (item taxonomy projections) - test_codex_app_server_session.py: 20 (turn loop + approvals + interrupts) Full tests/agent/transports/ directory: 249/249 pass — no regressions to existing transport tests. Still no wire into AIAgent.run_conversation(); that integration commit is small and goes next. * feat(codex-runtime): wire codex_app_server runtime into AIAgent The integration commit. AIAgent.run_conversation() now early-returns to a new helper _run_codex_app_server_turn() when self.api_mode == 'codex_app_server', bypassing the chat_completions tool loop entirely. Three small surgical edits to run_agent.py (~105 LOC total): 1. Line ~1204 (constructor api_mode validation set): Add 'codex_app_server' so an explicit api_mode='codex_app_server' passed to AIAgent() isn't silently rewritten to 'chat_completions'. 2. Line ~12048 (run_conversation, just before the while loop): Early-return to _run_codex_app_server_turn() when self.api_mode is 'codex_app_server'. Placed AFTER all standard pre-loop setup — logging context, session DB, surrogate sanitization, _user_turn_count and _turns_since_memory increments, _ext_prefetch_cache, memory manager on_turn_start — so behavior outside the model-call loop is identical between paths. Default Hermes flow is unchanged when the flag is off. 3. End-of-class (line ~15497): New method _run_codex_app_server_turn(). Lazy-instantiates one CodexAppServerSession per AIAgent (reused across turns), runs the turn, splices projected_messages into messages, increments _iters_since_skill by tool_iterations (since the chat_completions loop normally does that per iteration), fires _spawn_background_review on the same cadence as the default path. Counter accounting: _turns_since_memory ← already incremented at run_conversation:11817 (gated on memory store configured) — codex helper does NOT touch it (would double-count). _user_turn_count ← already incremented at run_conversation:11793 — codex helper does NOT touch it. _iters_since_skill ← incremented in the chat_completions loop per tool iteration. Codex helper increments by turn.tool_iterations since the loop is bypassed. User message: ALREADY appended to messages by run_conversation pre-loop (line 11823) before the early-return reaches us. Helper does NOT append again. Regression test test_user_message_not_duplicated guards this. Approval callback wiring: Lazy-fetches tools.terminal_tool._get_approval_callback at session spawn time, passes to CodexAppServerSession. CLI threads with prompt_toolkit get interactive approvals; gateway/cron contexts get the codex-side fail-closed deny. Error path: Codex session exceptions become a 'partial' result with completed=False and a final_response that explicitly tells the user how to switch back: 'Codex app-server turn failed: ... Fall back to default runtime with /codex-runtime auto.' Same return-dict shape as the chat_completions path so all callers (gateway, CLI, batch_runner, ACP) work unchanged. 9 new integration tests in tests/run_agent/test_codex_app_server_integration.py: - api_mode='codex_app_server' is accepted on AIAgent construction - run_conversation returns the expected codex shape (final_response, codex_thread_id, codex_turn_id, completed, partial) - Projected messages are spliced into messages list - _iters_since_skill ticks per tool iteration - _user_turn_count delegated to standard flow (not double-counted) - User message appears exactly once (regression guard) - _spawn_background_review IS invoked (memory/skill review keeps working) - chat.completions.create is NEVER called (loop fully bypassed) - Session exception → partial result with /codex-runtime auto hint - Interrupted turn → partial result with error preserved Adjacent test runs confirm no regressions: - tests/run_agent/test_memory_nudge_counter_hydration.py: green - tests/run_agent/test_background_review.py: green - tests/run_agent/test_fallback_model.py: green - tests/agent/transports/: 249/249 green Still missing for full feature: /codex-runtime slash command, plugin migration helper, docs page, live e2e test gated on codex binary. Those are the remaining followup commits. * feat(codex-runtime): add /codex-runtime slash command (CLI + gateway) User-facing toggle for the optional codex app-server runtime. Follows the 'Adding a Slash Command (All Platforms)' pattern from AGENTS.md exactly: single CommandDef in the central registry → CLI handler → gateway handler → running-agent guard → all surfaces (autocomplete, /help, Telegram menu, Slack subcommands) update automatically. Surface: /codex-runtime — show current state + codex CLI status /codex-runtime auto — Hermes default runtime /codex-runtime codex_app_server — codex subprocess runtime /codex-runtime on / off — synonyms Files changed: hermes_cli/codex_runtime_switch.py (new): Pure-Python state machine shared by CLI and gateway. Parse args, read/write model.openai_runtime in the config dict, gate enabling behind a codex --version check (don't let users opt in to a runtime they have no binary for; print npm install hint instead). Returns a CodexRuntimeStatus dataclass that callers render however suits their surface. hermes_cli/commands.py: Single CommandDef entry, no aliases (codex-runtime is its own thing). cli.py: Dispatch in process_command() + _handle_codex_runtime() handler that delegates to the shared module and renders results via _cprint. gateway/run.py: Dispatch in _handle_message() + _handle_codex_runtime_command() that returns a string (gateway sends as message). On a successful change that requires a new session, _evict_cached_agent() forces the next inbound message to construct a fresh AIAgent with the new api_mode — avoids prompt-cache invalidation mid-session. gateway/run.py running-agent guard: /codex-runtime joins /model in the early-intercept block so a runtime flip mid-turn can't split a turn across two transports. Tests: tests/hermes_cli/test_codex_runtime_switch.py — 25 tests covering the state machine: arg parsing (10 cases incl. case-insensitive and synonyms), reading current runtime (5 cases incl. malformed configs), writing runtime (3 cases), apply() entry point covering read-only, no-op, codex-missing-blocked, codex-present-success, disable-no-binary-check, and persist-failure paths (8 cases). All green. Adjacent test suites confirm no regressions: - tests/hermes_cli/test_commands.py + test_codex_runtime_switch.py: 167/167 green - tests/agent/transports/: 283/283 green when combined with prior commits Still missing: plugin migration helper, docs page, live e2e test gated on codex binary. Followup commits. * feat(codex-runtime): auto-migrate Hermes MCP servers to ~/.codex/config.toml Translates the user's mcp_servers config from ~/.hermes/config.yaml into the TOML format codex's MCP client expects. Wired into the /codex-runtime codex_app_server enable path so users get their MCP tool surface in the spawned subprocess automatically. The migration runs on every enable. Failures are non-fatal — the runtime change still proceeds and the user gets a warning so they can fix the codex config manually. What translates (mapping verified against codex-rs/core/src/config/edit.rs): Hermes mcp_servers.<n>.command/args/env → codex stdio transport Hermes mcp_servers.<n>.url/headers → codex streamable_http transport Hermes mcp_servers.<n>.timeout → codex tool_timeout_sec Hermes mcp_servers.<n>.connect_timeout → codex startup_timeout_sec Hermes mcp_servers.<n>.cwd → codex stdio cwd Hermes mcp_servers.<n>.enabled: false → codex enabled = false What does NOT translate (warned + skipped per server): Hermes-specific keys (sampling, etc.) — codex's MCP client has no equivalent. Listed in the per-server skipped[] field of the report. What's NOT migrated (intentional): AGENTS.md — codex respects this file natively in its cwd. Hermes' own AGENTS.md (project-level) is already in the worktree, so codex picks it up without translation. No code needed. Idempotency design: All managed content lives between a 'managed by hermes-agent' marker and the next non-mcp_servers section header. _strip_existing_managed_block removes the prior managed region cleanly, preserving any user-added codex config (model, providers.openai, sandbox profiles, etc.) above or below. Files added: hermes_cli/codex_runtime_plugin_migration.py — pure-Python migration helper. Public API: migrate(hermes_config, codex_home=None, dry_run=False) returns MigrationReport with .migrated/.errors/ .skipped_keys_per_server. No external TOML dependency — minimal formatter handles strings/numbers/booleans/lists/inline-tables. tests/hermes_cli/test_codex_runtime_plugin_migration.py — 39 tests covering: - per-server translation (12): stdio/http/sse, cwd, timeouts, enabled flag, command+url precedence, sampling drop, unknown keys - TOML formatter (8): types, escaping, inline tables, error case - existing-block stripping (4): no marker, alone, with user content above, with user content below - end-to-end migrate() (8): empty, dry-run, round-trip, idempotent re-run, preserves user config, error reporting, invalid input, summary formatting Files changed: hermes_cli/codex_runtime_switch.py — apply() now calls migrate() in the codex_app_server enable branch. Migration failure logs a warning in the result message but does NOT fail the runtime change. Disable path (auto) explicitly skips migration. tests/hermes_cli/test_codex_runtime_switch.py — 3 new tests: test_enable_triggers_mcp_migration, test_disable_does_not_trigger_migration, test_migration_failure_does_not_block_enable. All 325 feature tests green: - tests/agent/transports/: 249 (incl. 67 new) - tests/run_agent/test_codex_app_server_integration.py: 9 - tests/hermes_cli/test_codex_runtime_switch.py: 28 (3 new) - tests/hermes_cli/test_codex_runtime_plugin_migration.py: 39 (new) * perf(codex-runtime): cache codex --version check within apply() Single /codex-runtime invocation could spawn 'codex --version' up to 3 times (state report, enable gate, success message). Each spawn is ~50ms, so the cumulative cost wasn't a crisis, but it was wasteful and turned a trivial slash command into something noticeably laggy on slower systems. Refactored to lazy-once via a closure over a nonlocal cache. First call spawns; subsequent calls in the same apply() reuse the result. Behavior unchanged — same return shape, same error handling, same install hint when codex is missing. Just one subprocess per call instead of three. Two regression-guard tests added: - test_binary_check_cached_within_apply: enable path → call_count == 1 - test_binary_check_cached_on_read_only_call: state-report path → call_count == 1 Total tests for /codex-runtime now 30 (was 28); all 143 codex-runtime tests still green. * fix(codex-runtime): correct protocol field names found via live e2e test Three real bugs caught only by running a turn end-to-end against codex 0.130.0 with a real ChatGPT subscription. Unit tests passed because they asserted on our own (incorrect) wire shapes; the wire format from codex-rs/app-server-protocol/src/protocol/v2/* is the source of truth and my initial reading of the README was incomplete. Bug 1: thread/start.permissions wire format Was sending {"profileId": "workspace-write"}. Real format per PermissionProfileSelectionParams enum (tagged union): {"type": "profile", "id": "workspace-write"} AND requires the experimentalApi capability declared during initialize. AND requires a matching [permissions] table in ~/.codex/config.toml or codex fails the request with 'default_permissions requires a [permissions] table'. Fix: stop overriding permissions on thread/start. Codex picks its default profile (read-only unless user configures otherwise), which matches what codex CLI users expect — they configure their default permission profile in ~/.codex/config.toml the standard way. Trying to be clever about profile selection broke every turn we tested. Live error before fix: 'Invalid request: missing field type' on every turn/start, even though our turn/start payload was correct — the field codex was complaining about was inside the permissions sub-object we shouldn't have been sending. Bug 2: server-request method names Was matching 'execCommandApproval' and 'applyPatchApproval'. Real names per common.rs ServerRequest enum: item/commandExecution/requestApproval item/fileChange/requestApproval item/permissions/requestApproval (new third method) Fix: match the documented names. Added handler for item/permissions/requestApproval that always declines — codex sometimes asks to escalate permissions mid-turn and silent acceptance would surprise users. Live symptom before fix: agent.log showed 'Unknown codex server request: item/commandExecution/requestApproval' and codex stalled because we replied with -32601 (unsupported method) instead of an approval decision. The agent reported back 'The write command was rejected' even though Hermes never showed the user an approval prompt. Bug 3: approval decision values Was sending decision strings 'approved'/'approvedForSession'/'denied'. Real values per CommandExecutionApprovalDecision enum (camelCase): accept, acceptForSession, decline, cancel (also AcceptWithExecpolicyAmendment and ApplyNetworkPolicyAmendment variants we don't currently use). Fix: rename _approval_choice_to_codex_decision return values; update auto_approve_* fallbacks; update fail-closed default from 'denied' to 'decline'. Test mapping table updated to match. Live test verified after fixes: $ hermes (with model.openai_runtime: codex_app_server) > Run the shell command: echo hermes-codex-livetest > .../proof.txt then read it back Approval prompt fired with 'Codex requests exec in <cwd>'. User chose 'Allow once'. Codex executed the command, wrote the file, read it back. Final response: 'Read back from proof.txt: hermes-codex-livetest'. File contents on disk match. agent.log confirms: codex app-server thread started: id=019e200e profile=workspace-write cwd=/tmp/hermes-codex-livetest/workspace All 20 session tests still green after wire-format updates. * fix(codex-runtime): correct apply_patch approval params + ship docs Live e2e revealed FileChangeRequestApprovalParams doesn't carry the changeset (just itemId, threadId, turnId, reason, grantRoot) — Codex's 'reason' field describes what the patch wants to do. Test config and display logic updated to use it. The first 'apply_patch (0 change(s))' display from the live test is now 'apply_patch: <reason>'. Adds website/docs/user-guide/features/codex-app-server-runtime.md covering enable/disable, prerequisites, approval UX, MCP migration behavior, permission profile delegation to ~/.codex/config.toml, known limitations, and the architecture diagram. Wired into the Automation category in sidebars.ts. Live e2e validation across the path matrix: ✓ thread/start handshake ✓ turn/start with text input ✓ commandExecution items + projection ✓ item/commandExecution/requestApproval → Hermes UI → response ✓ Approve once → command runs ✓ Deny → command rejected, codex falls back to read-only message ✓ Multi-turn (codex remembers prior turn's results) ✓ apply_patch via Codex's fileChange path ✓ item/fileChange/requestApproval → Hermes UI ✓ MCP server migration loads inside spawned codex (verified via 'use the filesystem MCP tool' prompt) ✓ /codex-runtime auto → codex_app_server toggle cycle ✓ Disable doesn't trigger migration ✓ Enable with codex CLI present succeeds + migrates ✓ Hermes-side interrupt path (turn/interrupt request issued cleanly even if codex finishes before the interrupt lands) Known live-validated limitations now documented in the docs page: - delegate_task subagents unavailable on this runtime - permission profile selection delegated to ~/.codex/config.toml - apply_patch approval prompt has no inline changeset (codex protocol doesn't expose it) 145/145 codex-runtime tests still green. * feat(codex-runtime): native plugin migration + UX polish (quirks 2/4/5/10/11) Major: migrate native Codex plugins (NousResearch#7 in OpenClaw's PR list) Discovers installed curated plugins via codex's plugin/list RPC and writes [plugins."<name>@<marketplace>"] entries to ~/.codex/config.toml so they're enabled in the spawned Codex sessions. This is the 'YouTube-video-worthy' bit Pash highlighted: when a user has google-calendar, github, etc. installed in their Codex CLI, those plugins activate automatically when they enable Hermes' codex runtime. Implementation: - hermes_cli/codex_runtime_plugin_migration.py: new _query_codex_plugins() helper spawns 'codex app-server' briefly and walks plugin/list. Returns (plugins, error) — failures are non-fatal so MCP migration still works. - render_codex_toml_section() now takes plugins + permissions args. - migrate() defaults: discover_plugins=True, default_permission_profile= 'workspace-write'. Explicit None on either disables that side. - _strip_existing_managed_block() now also strips [plugins.*] and [permissions]/[permissions.*] sections inside the managed block, so re-runs replace plugins cleanly without touching codex's own config. Quirk fixes: NousResearch#2 Default permissions profile written on enable. Without this, Codex's read-only default kicks in and EVERY write triggers an approval prompt. Now writes [permissions] default = 'workspace-write' so the runtime feels normal out of the box. Set default_permission_profile=None to opt out. NousResearch#4 apply_patch approval prompt now shows what's changing. Codex's FileChangeRequestApprovalParams doesn't carry the changeset. Session adapter now caches the fileChange item from item/started notifications and looks it up by itemId when codex requests approval. Prompt shows '1 add, 1 update: /tmp/new.py, /tmp/old.py' instead of 'apply_patch (0 change(s))'. Side benefit: also drains pending notifications BEFORE handling a server request, so the projector and per-turn caches are up to date when the approval decision fires. Bounded to 8 notifications per loop iter to avoid starving codex's response. NousResearch#5/NousResearch#10 Exec approval prompt never shows empty cwd. When codex omits cwd in CommandExecutionRequestApprovalParams, fall back to the session's cwd. If somehow neither is available, show '<unknown>' explicitly instead of an empty string. Also surfaces 'reason' from the approval params when codex provides it — gives users more context on why codex wants to run something. NousResearch#11 Banner indicates the codex_app_server runtime when active. New 'Runtime: codex app-server (terminal/file ops/MCP run inside codex)' line appears in the welcome banner only when the runtime is on. Default banner is unchanged. Tests: - 7 new tests in test_codex_runtime_plugin_migration.py covering plugin discovery (mocked), failure handling, dry-run skip, opt-out flag, idempotent re-runs, and permissions writing. - 3 new tests in test_codex_app_server_session.py covering the enriched approval prompts: cwd fallback, change summary on apply_patch, fallback when no item/started cache exists. - All 26 session tests + 46 migration tests green; 153 total in PR. * feat(codex-runtime): hermes-tools MCP callback + native plugin migration The big architectural addition: when codex_app_server runtime is on, Hermes registers its own tool surface as an MCP server in ~/.codex/config.toml so the codex subprocess can call back into Hermes for tools codex doesn't ship with — web_search, browser_*, vision, image_generate, skills, TTS. Also: 'migrate native codex plugins' (Pash's YouTube-video-worthy bit) — when the user has plugins like Linear, GitHub, Gmail, Calendar, Canva installed via 'codex plugin', Hermes discovers them via plugin/list and writes [plugins.<name>@openai-curated] entries so they activate automatically. New module: agent/transports/hermes_tools_mcp_server.py FastMCP stdio server exposing 17 Hermes tools. Each call dispatches through model_tools.handle_function_call() — same code path as the Hermes default runtime. Run with: python -m agent.transports.hermes_tools_mcp_server [--verbose] Exposed: web_search, web_extract, browser_navigate / _click / _type / _press / _snapshot / _scroll / _back / _get_images / _console / _vision, vision_analyze, image_generate, skill_view, skills_list, text_to_speech. NOT exposed (deliberately): - terminal/shell/read_file/write_file/patch — codex has built-ins - delegate_task/memory/session_search/todo — _AGENT_LOOP_TOOLS in model_tools.py:493, require running AIAgent context. Documented as a limitation and surfaced in the slash command output. Migration changes (hermes_cli/codex_runtime_plugin_migration.py): - _query_codex_plugins() spawns 'codex app-server' briefly to walk plugin/list and pull installed openai-curated plugins. Failures are non-fatal — MCP migration still completes. - render_codex_toml_section() now takes plugins + permissions args AND wraps the managed block with a MIGRATION_END_MARKER comment so the stripper can reliably find both ends, even when the block contains top-level keys (default_permissions = ...). - migrate() defaults: discover_plugins=True, expose_hermes_tools=True, default_permission_profile=':workspace' (built-in codex profile name — must be prefixed with ':'). All three opt-out via explicit args. - _build_hermes_tools_mcp_entry() builds the codex stdio entry with HERMES_HOME and PYTHONPATH passthrough so a worktree-launched Hermes points the MCP subprocess at the same module layout. Live-caught wire bugs fixed during this turn: 1. Permission profile config key is top-level , NOT a [permissions] table. The [permissions] table is for *user-defined* profiles with structured fields. Built-in profile names start with ':' (':workspace', ':read-only', ':danger-no-sandbox'). Was emitting which codex rejected with 'invalid type: string "X", expected struct PermissionProfileToml'. 2. Built-in profile is , NOT . Codex rejected with 'unknown built-in profile'. 3. Codex's MCP layer sends for tool-call confirmation. We weren't handling it, so codex stalled and returned 'MCP tool call was rejected'. Now: auto-accept for our own hermes-tools server (user already opted in by enabling the runtime), decline for third-party servers. Quirk fixes shipped (from the limitations list): NousResearch#2 default permissions: workspace profile written on enable. No more approval prompt on every write. NousResearch#4 apply_patch approval shows what's changing: cache fileChange items from item/started, look up by itemId when codex sends item/fileChange/requestApproval. Prompt: '1 add, 1 update: /tmp/new.py, /tmp/old.py' instead of '0 change(s)'. NousResearch#5/NousResearch#10 exec approval cwd never empty: fall back to session cwd, then '<unknown>'. Also surfaces 'reason' from codex when present. NousResearch#11 banner shows 'Runtime: codex app-server' line when active so users understand why tool counts may not match what's reachable. Tests: - 5 new tests in test_codex_runtime_plugin_migration.py covering plugin discovery, expose_hermes_tools entry generation, idempotent re-runs, opt-out flag, permissions profile. - 3 new tests in test_codex_app_server_session.py covering enriched approval prompts (cwd fallback, fileChange summary). - 2 new tests for mcpServer/elicitation/request handling (accept hermes-tools, decline others). - New test file test_hermes_tools_mcp_server.py covering module surface, EXPOSED_TOOLS safety invariants (no shell/file_ops, no agent-loop tools), and main() error paths. - 166 codex-runtime tests total, all green. Live e2e validated against codex 0.130.0 + ChatGPT subscription: ✓ /codex-runtime codex_app_server enables, migrates filesystem MCP, registers hermes-tools, writes default_permissions = ':workspace' ✓ Banner shows 'Runtime: codex app-server' line in subsequent sessions ✓ Shell command runs without approval prompt (workspace profile works) ✓ Multi-turn — codex remembers prior turn's results ✓ apply_patch path via fileChange request approval ✓ web_search via hermes-tools MCP callback returns real Firecrawl results: 'OpenAI Codex CLI – Getting Started' end-to-end in 13s ✓ Disable cycle clean Docs updated: website/docs/user-guide/features/codex-app-server-runtime.md Full re-write covering native plugin migration, the hermes-tools callback architecture, the prerequisites change ('codex login is separate from hermes auth login codex'), the trade-off table now reflecting which Hermes tools work via callback, and the limitations list updated with what's actually unavailable on this runtime. * feat(codex-runtime): pin user-config preservation invariant for quirk NousResearch#6 Quirk NousResearch#6 from the limitations list — user MCP servers / overrides / codex-only sections in ~/.codex/config.toml that live OUTSIDE the hermes-managed block must survive re-migration verbatim. This already worked thanks to the MIGRATION_MARKER + MIGRATION_END_MARKER pair I added when fixing the default_permissions wire format (so the strip can find both ends of the managed region even with top-level keys like default_permissions). But it was an emergent property without a test pinning it. Now explicitly tested: - User MCP server above the managed block survives migration - User MCP server below the managed block survives migration - Both above + below survive a second re-migration - User content (model, providers, sandbox, otel, etc.) outside our region is left untouched Docs added a section "Editing ~/.codex/config.toml safely" explaining the marker contract — so users know they can add their own MCP servers, override permissions, configure codex-only options, etc. without fear of Hermes overwriting their work. 167 codex-runtime tests, all green. * docs(codex-runtime): clarify the actual tool surface — shell covers terminal/read/write/find Previous docs and PR description undersold what codex's built-in toolset actually provides. apply_patch alone made it sound like the runtime could only edit files in patch format — implying you'd lose terminal use, read_file, write_file, search/find. That was wrong. Codex's 'shell' tool runs arbitrary shell commands inside the sandbox, which covers everything you'd do in bash: cat/head/tail (read), echo> or heredocs (write), find/rg/grep (search), ls/cd (navigate), build/ test/git/etc. apply_patch is for structured multi-file edits on top of that. update_plan is its in-runtime todo. view_image loads images. And codex has its own web_search built in (in addition to the Firecrawl-backed one Hermes exposes via MCP callback). Docs now have a 'What tools the model actually has' section right after Why, breaking the surface into three clearly-labeled buckets: 1. Codex's built-in toolset (always on) — shell, apply_patch, update_plan, view_image, web_search; covers everything terminal- adjacent. 2. Native Codex plugins (auto-migrated from your codex plugin install) — Linear, GitHub, Gmail, Calendar, Outlook, Canva, etc. 3. Hermes tool callback (MCP server in ~/.codex/config.toml) — web_search/web_extract via Firecrawl, browser_*, vision_analyze, image_generate, skill_view/skills_list, text_to_speech. Plus a 'What's NOT available' callout listing the four agent-loop tools (delegate_task, memory, session_search, todo) that need running AIAgent context and can't reach the codex runtime. Trade-offs table broken out: shell, apply_patch, update_plan, view_image, sandbox each get their own row with a one-line description so users can see at a glance what's available natively. Architecture diagram updated to list the codex built-ins by name instead of 'apply_patch + shell + sandbox'. No code changes — purely docs clarification. 167 codex-runtime tests still green. * fix(codex-runtime): _spawn_background_review signature + review fork api_mode downgrade Two real bugs in the self-improvement loop integration that the previous test mocked away. Bug 1: wrong call signature The codex helper was calling self._spawn_background_review() with no args after every turn. That function actually requires: messages_snapshot=list (positional or keyword) review_memory=bool (at least one trigger must be True) review_skills=bool So the call would have raised TypeError at runtime — except the only test that exercised this path mocked _spawn_background_review entirely and just asserted spawn.called, so the wrong-arg shape never surfaced. Bug 2: review fork inherits codex_app_server api_mode The review fork is constructed with: api_mode = _parent_runtime.get('api_mode') So when the parent is codex_app_server, the review fork ALSO runs as codex_app_server. But the review fork's whole job is to call agent-loop tools (memory, skill_manage) which require Hermes' own dispatch — they short-circuit with 'must be handled by the agent loop' on the codex runtime. So the review fork would have run, decided to save something, called memory or skill_manage, and silently no-op'd. Fixed in run_agent.py:_spawn_background_review() — when the parent api_mode is 'codex_app_server', the review fork is downgraded to 'codex_responses' (same OAuth credentials, same openai-codex provider, but talks to OpenAI's Responses API directly so Hermes owns the loop). Also rewrote the codex helper's review wiring to match the chat_completions path: - Computes _should_review_memory in the pre-loop block (was already being computed; now passed through to the helper as an arg). - Computes _should_review_skills AFTER the codex turn returns + counters tick (line ~15432 pattern in chat_completions). - Calls _spawn_background_review(messages_snapshot=, review_memory=, review_skills=) only when at least one trigger fires. - Adds the external memory provider sync (_sync_external_memory_for_turn) that the chat_completions path runs after every turn. Tests: Replaced the broken test_background_review_invoked (which only asserted spawn.called) with three sharper tests: - test_background_review_NOT_invoked_below_threshold: single turn at default thresholds → no review fires (would have caught the original 'every turn calls spawn with no args' bug) - test_background_review_skill_trigger_fires_above_threshold: 10 tool_iterations at threshold=10 → review fires with messages_snapshot=list, review_skills=True, counter resets - test_background_review_signature_never_breaks: regression guard asserting positional args are always empty and kwargs include messages_snapshot New TestReviewForkApiModeDowngrade class: - test_codex_app_server_parent_downgrades_review_fork: drives the real _spawn_background_review function (no mock at that level), asserts the review_agent gets api_mode='codex_responses' when the parent was codex_app_server. Live-validated against real run_conversation: - Counter ticked from 0 to 5 after a 5-tool-iteration turn - _spawn_background_review fired exactly once with kwargs-only signature - review_skills=True, review_memory=False - messages_snapshot was 12 entries (5 assistant tool_calls + 5 tool results + 1 final assistant + initial system/user) - Counter reset to 0 after fire 170 codex-runtime tests, all green. Docs: added a Self-improvement loop section to the codex runtime page explaining both how the trigger logic stays equivalent and that the review fork is auto-downgraded to codex_responses for the agent-loop tools. Also clarified that apply_patch and update_plan ARE codex's built-in tools (the previous version made it sound like they were separate from 'codex's stuff' — they're not, all five tools listed in 'What tools the model actually has' section 1 are codex built-ins). * feat(codex-runtime): expose kanban tools through Hermes MCP callback Kanban workers spawn as separate hermes chat -q subprocesses that read the user's config.yaml. If model.openai_runtime: codex_app_server is set globally (which is the whole point of opt-in), every dispatched worker ALSO comes up on the codex runtime. That mostly works — codex's built-in shell + apply_patch + update_plan do the actual task work fine — but it had one critical break: the worker handoff tools (kanban_complete, kanban_block, kanban_comment, kanban_heartbeat) are Hermes-registered tools, not codex built-ins. On the codex runtime, codex builds its own tool list and these never reach the model, so the worker would do the work but not be able to report back, hanging until the dispatcher's timeout escalates it as zombie. Fix: add all 9 kanban tools to the EXPOSED_TOOLS list in the Hermes MCP callback. They dispatch statelessly through handle_function_call() just like web_search and the others — they read HERMES_KANBAN_TASK from env (set by the dispatcher), gate correctly (worker tools require the env var, orchestrator tools require it unset), and write to ~/.hermes/kanban.db. Why kanban tools work via stateless dispatch when delegate_task/memory/ session_search/todo don't: those four are listed in _AGENT_LOOP_TOOLS (model_tools.py:493) and short-circuit in handle_function_call() with 'must be handled by the agent loop' — they need to mutate AIAgent's mid-loop state. Kanban tools have no such requirement; they're pure side-effect functions against the kanban.db plus state_meta. Tools exposed: Worker handoff (require HERMES_KANBAN_TASK): kanban_complete, kanban_block, kanban_comment, kanban_heartbeat Read-only board queries: kanban_show, kanban_list Orchestrator (require HERMES_KANBAN_TASK unset): kanban_create, kanban_unblock, kanban_link Tests: - test_kanban_worker_tools_exposed: complete/block/comment/heartbeat in EXPOSED_TOOLS (regression guard for the would-hang-worker bug) - test_kanban_orchestrator_tools_exposed: create/show/list/unblock/link Docs: - New 'Workflow features' section in the docs page covering /goal, kanban, and cron behavior on this runtime - /goal: works fully via run_conversation feedback; only caveat is approval-prompt noise on long writes-heavy goals (mitigated by the default :workspace permission profile) - Kanban: enumerated which tools are reachable via the callback and why the env var propagates correctly through the codex subprocess to the MCP server subprocess - Cron: documented as 'not specifically tested' — same rules as the CLI apply since cron runs through AIAgent.run_conversation - Trade-offs table gained rows for /goal, kanban worker, kanban orchestrator 172/172 codex-runtime tests green (+2 from kanban tests). * docs(codex-runtime): wire /codex-runtime into slash-commands ref + flag aux token cost Three docs gaps caught during a final audit: 1. /codex-runtime was only in the feature docs page, not in the slash-commands reference. Added rows to both the CLI section and the Messaging section so users discover it where they'd look for slash command syntax. 2. CODEX_HOME and HERMES_KANBAN_TASK weren't in environment-variables.md. CODEX_HOME lets users redirect Codex CLI's config dir (the migration honors it). HERMES_KANBAN_TASK is set by the kanban dispatcher and propagates to the codex subprocess + the hermes-tools MCP subprocess so kanban worker tools gate correctly — documented as 'don't set manually' since it's an internal handoff. 3. Aux client behavior on this runtime. When openai_runtime= codex_app_server is on with the openai-codex provider, every aux task (title generation, context compression, vision auto-detect, session search summarization, the background self-improvement review fork) flows through the user's ChatGPT subscription by default. This is true for the existing codex_responses path too, but it's more visible / important here because users explicitly opted in for subscription billing. Added a 'Auxiliary tasks and ChatGPT subscription token cost' section to the docs page with a YAML example showing how to override specific aux tasks to a cheaper model (typically google/gemini-3-flash-preview via OpenRouter). Also documents how the self-improvement review fork gets auto-downgraded from codex_app_server to codex_responses by the fix earlier in this PR. No code changes — pure docs. 172 codex-runtime tests still green. * docs+test(codex-runtime): pin HOME passthrough, document multi-profile + CODEX_HOME OpenClaw hit a real footgun in openclaw/openclaw#81562: when spawning codex app-server they were synthesizing a per-agent HOME alongside CODEX_HOME. That made every subprocess codex's shell tool launches (gh, git, aws, npm, gcloud, ...) see a fake $HOME and miss the user's real config files. They had to back it out in PR #81562 — keep CODEX_HOME isolation, leave HOME alone. Audit confirms Hermes' codex spawn doesn't have this problem. We do os.environ.copy() and only overlay CODEX_HOME (when provided) and RUST_LOG. HOME passes through unchanged. But it was an emergent property without a test pinning it, so adding a regression guard: test_spawn_env_preserves_HOME — confirms parent HOME survives intact in the subprocess env test_spawn_env_sets_CODEX_HOME_when_provided — confirms codex_home arg still isolates codex state correctly Docs additions: 'HOME environment variable passthrough' section — calls out the contract explicitly: CODEX_HOME isolates codex's own state, HOME stays user-real so gh/git/aws/npm/etc. find their normal config. Cites openclaw#81562 as the cautionary tale. 'Multi-profile / multi-tenant setups' section — addresses the related concern: profiles share ~/.codex/ by default. For users who want per-profile codex isolation (separate auth, separate plugins), documents the manual CODEX_HOME=<profile-scoped-dir> approach. Explains why we DON'T auto-scope CODEX_HOME per profile: doing so would silently invalidate existing codex login state for anyone upgrading to this PR with tokens already at ~/.codex/auth.json. Opt-in is safer than surprising users. 174 codex-runtime tests (+2 from HOME guards), all green. * fix(codex-runtime): TOML control-char escapes + atomic config.toml write Two footguns caught in a final audit pass before merge. Bug 1: TOML control characters not escaped The _format_toml_value() helper escaped backslashes and double quotes but passed literal control characters (\n, \t, \r, \f, \b) through unchanged. TOML basic strings don't allow literal control characters — a path or env var containing a newline would produce invalid TOML that codex refuses to load. Realistic exposure: pathological cases like a HERMES_HOME with a trailing newline (env var concatenation accident), or a PYTHONPATH with a tab from a multi-line shell heredoc. Fix: escape all five TOML basic-string control sequences (\b \t \n \f \r) in addition to \\ and \" that we already did. Order matters — backslash must come first or the other escapes get re-escaped. Bug 2: config.toml write wasn't atomic If the python process crashed between target.mkdir() and the write_text() finishing, a half-written config.toml could be left behind. On NFS / Windows / some FUSE mounts this is a real concern; on ext4/APFS small writes are usually atomic in practice but not guaranteed. Fix: write to a tempfile.mkstemp() temp file in the same directory, then Path.replace() (atomic same-dir rename on POSIX, ReplaceFile on Windows). On rename failure, clean up the temp file so repeated failed migrations don't pile up .config.toml.* files. Tests: - test_string_with_newline_escaped — \n in value → \n in output - test_string_with_tab_escaped — \t in value → \t in output - test_string_with_other_controls_escaped — \r, \f, \b - test_windows_path_escaped_correctly — backslash doubling - test_atomic_write_no_temp_leak_on_success — no .config.toml.* left over after a successful write - test_atomic_write_cleanup_on_rename_failure — temp file removed when Path.replace raises (simulated disk full) 180 codex-runtime tests, all green (+6 from this commit). Footguns audited but NOT fixed (with rationale): - Concurrent migrations race. Two Hermes processes hitting /codex-runtime codex_app_server within seconds of each other could cause one writer to lose entries. Low probability (you'd have to enable from two surfaces simultaneously) and low impact (just re-run migration). Adding fcntl/msvcrt locking is more code than it's worth here. The atomic rename above means each individual write is consistent — only the merge step is racy. - Codex protocol version drift. We pin MIN_CODEX_VERSION=0.125 and check at runtime but don't reject too-new versions. Right call — the protocol has been stable through 0.125 → 0.130. If OpenAI breaks it later we'd see the error in test_codex_app_server_runtime on CI before users hit it.

Four findings from Copilot's review on PR NousResearch#22891, all in the AX elements-array cap added by 22fa1ed: 1. The truncation note ("response truncated to N of M elements") was appended unconditionally — including in the som/vision multimodal path, whose response carries a screenshot rather than an `elements` array. The note described a payload field that wasn't present. Moved the note into the AX-text branch where the array actually appears. 2. `_format_elements(cap.elements)` ran on the full untrimmed list with its own `max_lines=40` cap, so a caller passing `max_elements=10` would see summary lines referencing `NousResearch#11..NousResearch#40` even though the JSON `elements` array only held NousResearch#1..NousResearch#10. Format on `visible_elements` instead so the summary indices always exist in the response. 3. `_coerce_max_elements` enforced a lower bound but no upper bound, so `max_elements=10_000_000` silently disabled the safeguard and reintroduced the original context-blow-up. Added a hard cap (`_MAX_ALLOWED_MAX_ELEMENTS = 1000`) that clamps oversized values. 4. The schema string said "Default 100" but the property carried no `default` field, and claimed `max_elements` had no effect on som/ vision while the image-missing fallback path can still return an elements array. Added `"default": 100`, `"maximum": 1000`, and clarified the fallback-path wording. Each finding gets a regression test: - test_capture_ax_clamps_oversized_max_elements_to_hard_cap - test_capture_ax_summary_indices_match_returned_elements - test_capture_multimodal_summary_omits_truncation_note - test_schema_max_elements_documents_default_and_upper_bound Verified with `pytest tests/tools/test_computer_use.py` (53 passed, including the 5 new cases). Confirmed each new test fails on the pre-fix code path before applying the production change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e removal, LOOP target warnings NousResearch#8: Card ID parsing uses regex 'Created\s+card\s+(\S+)' with fallback NousResearch#9: validate checks gate→revision pairs (referenced gates must have dependents) NousResearch#10: _find_revision_node documented single-revision assumption NousResearch#11: LOOP target mismatch warns before fallback NousResearch#12: Removed unused _find_dependents NousResearch#7: 32 unit tests — DAG, cycles, LOOP regex, failure propagation, state persistence, validation, card parsing

Production-blocking: - P1+P2: Add recall_embed and preflight_rerank workloads to providers.yaml - P3+P4: Fix embedding_source tracking — explicit source in apply_hybrid_scoring returns (hits, source) with 'disabled'/'ok'/'cache'/'failed'/'partial' values - P5: Wire config.recall.top_k into dedupe_and_cap with validate_top_k Bugs in shipped code: - P6: Reranker preserves un-promoted positions 4-8 (was silently dropping them) - P7: Introduce hybrid_score field on TrajectoryHit to avoid double-weighting BM25 through rank_trajectories - P8: Skip caching empty embedding vectors (prevents sticky 0-similarity) - P9: Cache key includes provider:model to prevent cross-contamination - P10: Consistent 500-char truncation for query and candidate embeddings - P11: Guard negative access_count in strength factor - P12: Fix bool('false')==True for same_provider_ok (Hard Invariant NousResearch#10 bypass) - P13: All bare 'from hermes_llm import' use try/except fallback pattern - P15: _text_hash guards non-string input - P16: Filter empty/whitespace FTS terms before query Validation/robustness: - P17: Validate recency exponent < 0 at config load - P18: _parse_rerank_indices handles dict-with-'indices' JSON shape - P19: RerankIndices schema built dynamically from _RERANK_TOP_N - P20: Module-level except narrowed to ImportError,AttributeError - P21: Single _load_config() call per fire path (was 2) - P22: rerank_outcome 'empty-response' distinguished from 'parse-failed' Decision resolutions: - D2: Multiplicative source_boost (*= 1.3) matching upstream TYPE_BOOSTS - D3: Unicode-aware YAKE regex (\w + re.UNICODE) - D4: Defensive threading.Locks for _gates, _citations, _embedding_cache Tests updated: 4 hybrid_scoring tests unpack (hits, source) tuple. Full suite: 116 pass, 6 pre-existing failures (unchanged), 0 regressions.

hjc-puro added 3 commits November 7, 2025 14:08

log first 20 chars

2d8f6c4

add logging of prefix of tool call and tool response

0c61848

add tracking for cluster failurse

31c7333

hjc-puro requested a review from teknium1 November 16, 2025 09:26

hjc-puro closed this Nov 18, 2025

SHL0MS mentioned this pull request Mar 30, 2026

[UX] Enable mouse support (cursor positioning, scrolling) with config toggle #4064

Open

teknium1 mentioned this pull request Apr 2, 2026

feat(memory): pluggable memory provider interface with profile isolation, review fixes, and honcho CLI restoration #4623

Merged

malaiwah pushed a commit to malaiwah/hermes-agent that referenced this pull request Apr 11, 2026

Merge pull request 'fix(gateway): /status Tokens: 0 — two bugs in tok…

09554a1

…en accounting' (NousResearch#10) from fix/gateway-status-tokens into main

Barbate91 approved these changes Apr 18, 2026

View reviewed changes

Barbate91 suggested changes Apr 18, 2026

View reviewed changes

OutThisLife mentioned this pull request Apr 21, 2026

fix(tui): v2 blitz-test UX pack #13589

Closed

8 tasks

CHANTXU64 pushed a commit to CHANTXU64/hermes-agent that referenced this pull request Apr 21, 2026

docs: add entry NousResearch#10 for skill_review_prompt configurability

afc0f3d

aiagent0619 mentioned this pull request Apr 28, 2026

Credential pool rotation should not count toward api_max_retries #16830

Open

Interstellar-code mentioned this pull request May 28, 2026

[a2a_fleet] CRITICAL: server never auto-starts — register() runs outside an event loop #34161

Closed

victor-develop pushed a commit to victor-develop/hermes-agent that referenced this pull request May 29, 2026

Merge pull request NousResearch#10 from asdigitos/integration/release…

131b6c6

…-20260525-v2026.5.16 chore: sync upstream release v2026.5.16 into downstream main

ricardocamiloconsir mentioned this pull request Jun 2, 2026

feat(gateway): session model pool — concurrency-aware auto-assignment with auxiliary slot tracking #37519

Open

cristianmgm7 mentioned this pull request Jun 10, 2026

feat(platforms): add Carbon Voice as a native messaging platform #43226

Open

zenc-cp mentioned this pull request Jun 11, 2026

feat(adr-032): observability sink — consumer hop zenc-cp/hermes-agent#6

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cluster failure tracking#10

Cluster failure tracking#10
hjc-puro wants to merge 3 commits into
mainfrom
cluster-fail

hjc-puro commented Nov 15, 2025 •

edited

Loading

Uh oh!

Barbate91 left a comment

Uh oh!

Barbate91 left a comment

Uh oh!

Barbate91 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

hjc-puro commented Nov 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Barbate91 left a comment

Choose a reason for hiding this comment

Uh oh!

Barbate91 left a comment

Choose a reason for hiding this comment

Uh oh!

Barbate91 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hjc-puro commented Nov 15, 2025 •

edited

Loading