Skip to content

feat: add optional skill backend probing and command checks#1

Closed
kshitijk4poor wants to merge 277 commits into
feat/skill-prerequisitesfrom
feat/skill-prerequisites-followup
Closed

feat: add optional skill backend probing and command checks#1
kshitijk4poor wants to merge 277 commits into
feat/skill-prerequisitesfrom
feat/skill-prerequisites-followup

Conversation

@kshitijk4poor

Copy link
Copy Markdown
Owner

Summary

  • add optional local/remote environment probing paths for skill readiness checks
  • add command-prerequisite evaluation (required_commands / missing_required_commands)
  • include terminal/runtime coupling updates required by probe execution paths

Why split

This branch is intentionally stacked on top of feat/skill-prerequisites so core NousResearch#688 secure setup can be reviewed separately from runtime/readiness expansion.

Test plan

  • python -m pytest tests/tools/test_skills_tool.py tests/agent/test_skill_commands.py tests/test_cli_secret_capture.py tests/agent/test_prompt_builder.py tests/tools/test_terminal_tool.py -q
  • python -m pytest tests/hermes_cli/test_config.py tests/test_run_agent.py tests/tools/test_registry.py -q
  • python -m pytest tests/ -q

All commands pass on this branch.

JackTheGit and others added 30 commits March 10, 2026 08:10
_fetch_models_from_api checked for "hide" while _read_cache_models
checked for "hidden", causing models hidden by the API to still
appear when loaded from cache. Both now accept either value.
…allMode

Adds full Honcho memory integration to Hermes:

- Session manager with async background writes, memory modes (honcho/hybrid/local),
  and dialectic prefetch for first-turn context warming
- Agent integration: prefetch pipeline, tool surface gated by recallMode,
  system prompt context injection, SIGTERM/SIGINT flush handlers
- CLI commands: setup, status, mode, tokens, peer, identity, migrate
- recallMode setting (auto | context | tools) for A/B testing retrieval strategies
- Session strategies: per-session, per-repo (git tree root), per-directory, global
- Polymorphic memoryMode config: string shorthand or per-peer object overrides
- 97 tests covering async writes, client config, session resolution, and memory modes
Tell users to go to app.honcho.dev > Settings > API Keys.
Updated in setup walkthrough, setup prompt, and client error message.
Explain what context vs dialectic actually do in plain language:
context = raw memory retrieval, dialectic = AI-to-AI inference
for session continuity. Describe what user/AI peer cards are.
Matches the mental model: hybrid = context + tools,
context = context only, tools = tools only.
New tool lets Hermes persist conclusions about the user (preferences,
corrections, project context) directly to Honcho via the conclusions
API. Feeds into the user's peer card and representation.
Consistent naming: all honcho tools now prefixed with honcho_
(honcho_context, honcho_search, honcho_profile, honcho_conclude).
Optional 'peer' parameter: "user" (default) or "ai". Allows asking
about the AI assistant's history/identity, not just the user's.
…-skill

Add ASCII video skill to creative category
Replaces the stub docs with comprehensive coverage: setup (interactive +
manual), all config fields, memory modes, recall modes, write frequency,
session strategies, host blocks, async prefetch pipeline, dual-peer
architecture, dynamic reasoning, gateway integration, four tools, full
CLI reference, migration paths, and AI peer identity. Trims the Honcho
section in memory.md to a cross-reference.
… language

Adds back use cases section and example tool queries from the original
docs. Clarifies that built-in memory and Honcho can work together or be
configured separately via memoryMode.
Setup wizard now writes memoryMode, writeFrequency, recallMode, and
sessionStrategy into hosts.hermes instead of the config root. Client
resolution updated to read sessionStrategy and sessionPeerPrefix from
host block first. Docs updated to show hosts-based config as the default
example so other integrations can coexist cleanly.
…_paths/audio_path/document_paths

The Signal adapter was passing image_paths, audio_path, and document_paths
to MessageEvent.__init__(), but those fields don't exist on the dataclass.
MessageEvent uses media_urls (List[str]) and media_types (List[str]).

Changes:
- Replace separate image_paths/audio_path/document_paths with unified
  media_urls and media_types lists (matching Discord, Slack, etc.)
- Add _ext_to_mime() helper to map file extensions to MIME types
- Use Signal's contentType from attachment metadata when available,
  falling back to extension-based mapping
- Update message type detection to check media_types prefixes

Fixes TypeError: MessageEvent.__init__() got an unexpected keyword
argument 'image_paths'
_keep_typing() was called with metadata= for thread-aware typing
indicators, but neither it nor the base send_typing() accepted
that parameter. Most adapter overrides (Slack, Discord, Telegram,
WhatsApp, HA) already accept metadata=None, but the base class
and Signal adapter did not.

- Add metadata=None to BasePlatformAdapter.send_typing()
- Add metadata=None to BasePlatformAdapter._keep_typing(), pass through
- Add metadata=None to SignalAdapter.send_typing()

Fixes TypeError in _process_message_background for Signal.
…o, metadata)

Signal's send() used 'text' instead of 'content' and 'reply_to_message_id'
instead of 'reply_to', mismatching BasePlatformAdapter.send(). Callers in
gateway/run.py use keyword args matching the base interface, so Signal's
send() was missing its required 'text' positional arg.

Fixes: 'SignalAdapter.send() missing 1 required positional argument: text'
…sResearch#860)

Three separate code paths all wrote to the same SQLite state.db with
no deduplication, inflating session transcripts by 3-4x:

1. _log_msg_to_db() — wrote each message individually after append
2. _flush_messages_to_session_db() — re-wrote ALL new messages at
   every _persist_session() call (~18 exit points), with no tracking
   of what was already written
3. gateway append_to_transcript() — wrote everything a third time
   after the agent returned

Since load_transcript() prefers SQLite over JSONL, the inflated data
was loaded on every session resume, causing proportional token waste.

Fix:
- Remove _log_msg_to_db() and all 16 call sites (redundant with flush)
- Add _last_flushed_db_idx tracking in _flush_messages_to_session_db()
  so repeated _persist_session() calls only write truly new messages
- Reset flush cursor on compression (new session ID)
- Add skip_db parameter to SessionStore.append_to_transcript() so the
  gateway skips SQLite writes when the agent already persisted them
- Gateway now passes skip_db=True for agent-managed messages, still
  writes to JSONL as backup

Verified: a 12-message CLI session with tool calls produces exactly
12 SQLite rows with zero duplicates (previously would be 36-48).

Tests: 9 new tests covering flush deduplication, skip_db behavior,
compression reset, and initialization. Full suite passes (2869 tests).
Use rich_box.HORIZONTALS instead of the default ROUNDED box style
for the agent response panel. This keeps the top/bottom horizontal
rules (with title) but removes the vertical │ borders on left and
right, making it much easier to copy-paste response text from the
terminal.
…search loops after context compression

Authored by 0xbyt4. Adds read/search loop detection, file history injection after compression, and todo filtering for active items only.
…ds, fix bugs

Follow-up to PR NousResearch#705 (merged from 0xbyt4). Addresses several issues:

1. CONSECUTIVE-ONLY TRACKING: Redesigned the read/search tracker to only
   warn/block on truly consecutive identical calls. Any other tool call
   in between (write, patch, terminal, etc.) resets the counter via
   notify_other_tool_call(), called from handle_function_call() in
   model_tools.py. This prevents false blocks in read→edit→verify flows.

2. THRESHOLD ADJUSTMENT: Warn on 3rd consecutive (was 2nd), block on
   4th+ consecutive (was 3rd+). Gives the model more room before
   intervening.

3. TUPLE UNPACKING BUG: Fixed get_read_files_summary() which crashed on
   search keys (5-tuple) when trying to unpack as 3-tuple. Now uses a
   separate read_history set that only tracks file reads.

4. WEB_EXTRACT DOCSTRING: Reverted incorrect removal of 'title' from
   web_extract return docs in code_execution_tool.py — the field IS
   returned by web_tools.py.

5. TESTS: Rewrote test_read_loop_detection.py (35 tests) to cover
   consecutive-only behavior, notify_other_tool_call, interleaved
   read/search, and summary-unaffected-by-searches.
teknium1 and others added 19 commits March 12, 2026 19:34
fix(doctor): treat configured honcho as available
Haiku models don't support extended thinking at all. Without this
guard, claude-haiku-4-5-20251001 would receive type=enabled +
budget_tokens and return a 400 error.

Incorporates the fix from PR NousResearch#1127 (by frizynn) on top of NousResearch#1128's
adaptive thinking refactor.

Verified live with Claude Code OAuth:
  claude-opus-4-6       → adaptive thinking ✓
  claude-haiku-4-5      → no thinking params ✓
  claude-sonnet-4       → enabled thinking ✓
…c877bdeb

fix(anthropic): skip thinking params for Haiku models
…on, reauthentication (NousResearch#1132)

Fixes Anthropic OAuth/subscription authentication end-to-end:

Auth failures (401 errors):
- Add missing 'claude-code-20250219' beta header for OAuth tokens. Both
  clawdbot and OpenCode include this alongside 'oauth-2025-04-20' — without
  it, Anthropic's API rejects OAuth tokens with 401 authentication errors.
- Fix _fetch_anthropic_models() to use canonical beta headers from
  _COMMON_BETAS + _OAUTH_ONLY_BETAS instead of hardcoding.

Token refresh:
- Add _refresh_oauth_token() — when Claude Code credentials from
  ~/.claude/.credentials.json are expired but have a refresh token,
  automatically POST to console.anthropic.com/v1/oauth/token to get
  a new access token. Uses the same client_id as Claude Code / OpenCode.
- Add _write_claude_code_credentials() — writes refreshed tokens back
  to ~/.claude/.credentials.json, preserving other fields.
- resolve_anthropic_token() now auto-refreshes expired tokens before
  returning None.

Config contamination:
- Anthropic's _model_flow_anthropic() no longer saves base_url to config.
  Since resolve_runtime_provider() always hardcodes Anthropic's URL, the
  stale base_url was contaminating other providers when users switched
  without re-running 'hermes model' (e.g., Codex hitting api.anthropic.com).
- _update_config_for_provider() now pops base_url when passed empty string.
- Same fix in setup.py.

Flow/UX (hermes model command):
- CLAUDE_CODE_OAUTH_TOKEN env var now checked in credential detection
- Reauthentication option when existing credentials found
- run_oauth_setup_token() runs 'claude setup-token' as interactive
  subprocess, then auto-detects saved credentials
- Clean has_creds/needs_auth flow in both main.py and setup.py

Tests (14 new):
- Beta header assertions for claude-code-20250219
- Token refresh: successful refresh with credential writeback, failed
  refresh returns None, no refresh token returns None
- Credential writeback: new file creation, preserving existing fields
- Auto-refresh integration in resolve_anthropic_token()
- CLAUDE_CODE_OAUTH_TOKEN fallback, credential file auto-discovery
- run_oauth_setup_token() (5 scenarios)
Complete rewrite of the neuroskill-bci skill based on actual source material
from the NeuroSkill desktop app and NeuroLoop CLI repos. Supersedes PR NousResearch#708.

Key improvements over NousResearch#708:
- All CLI commands verified against actual NeuroSkill/NeuroLoop source
- Added --json flag usage throughout (critical for reliable parsing)
- Fixed metric formulas: Focus = σ(β/(α+θ)), Relaxation = σ(α/(β+θ))
- Scores are 0-1 scale (not 0-100 as in NousResearch#708)
- Added all 40+ metrics: FAA, TAR, BAR, TBR, APF, SNR, coherence,
  consciousness (LZC, wakefulness, integration), complexity (PE, HFD, DFA),
  cardiac (RMSSD, SDNN, pNN50, LF/HF, stress index, SpO2),
  motion (stillness, blinks, jaw clenches, nods, shakes)
- Added all missing CLI subcommands: session, search-labels, interactive,
  listen, umap, calibrate, timer, notify, raw
- Protocols sourced from actual NeuroLoop protocol repertoire (70+)
  organized by category (attention, stress, emotional, sleep, somatic,
  digital, dietary, motivation)
- Added full WebSocket/HTTP API reference with all endpoints and
  JSON response formats
- Fixed gamma range: 30-50 Hz (not 30-100)
- Added signal quality per electrode with thresholds
- Added composite state patterns (flow, fatigue, anxiety, creative, etc.)
- Added ZUNA embedding documentation
- Placed as optional built-in skill (not bundled by default)

Files:
- optional-skills/health/DESCRIPTION.md (new category)
- optional-skills/health/neuroskill-bci/SKILL.md (main skill)
- optional-skills/health/neuroskill-bci/references/metrics.md
- optional-skills/health/neuroskill-bci/references/protocols.md
- optional-skills/health/neuroskill-bci/references/api.md

Refs: NousResearch#694, NousResearch#708
…6ec3b1a9

feat(skills): add NeuroSkill BCI integration as optional built-in skill
- Updated command output handling to use RichText for ANSI formatting.
- Improved response display in chat console with RichText integration.
- Ensured fallback for empty command outputs with a clear message.
Persist OAuth/setup tokens in ANTHROPIC_TOKEN instead of ANTHROPIC_API_KEY.
Reserve ANTHROPIC_API_KEY for regular Console API keys.

Changes:
- anthropic_adapter: reorder resolve_anthropic_token() priority —
  ANTHROPIC_TOKEN first, ANTHROPIC_API_KEY as legacy fallback
- config: add save_anthropic_oauth_token() / save_anthropic_api_key() helpers
  that clear the opposing slot to prevent priority conflicts
- config: show_config() prefers ANTHROPIC_TOKEN for display
- setup: OAuth login and pasted setup-tokens write to ANTHROPIC_TOKEN
- setup: API key entry writes to ANTHROPIC_API_KEY and clears ANTHROPIC_TOKEN
- main: same fixes in _run_anthropic_oauth_flow() and _model_flow_anthropic()
- main: _has_any_provider_configured() checks ANTHROPIC_TOKEN
- doctor: use _is_oauth_token() for correct auth method validation
- runtime_provider: updated error message
- run_agent: simplified client init to use resolve_anthropic_token()
- run_agent: updated 401 troubleshooting messages
- status: prefer ANTHROPIC_TOKEN in status display
- tests: updated priority test, added persistence helper tests

Cherry-picked from PR NousResearch#1141 by kshitijk4poor, rebased onto current main
with unrelated changes (web_policy config, blocklist CLI) removed.

Co-authored-by: kshitijk4poor <kshitijk4poor@users.noreply.github.com>
…6ec3b1a9

fix: separate Anthropic OAuth tokens from API keys
First Atropos environment to populate distill_token_ids / distill_logprobs
on ScoredDataGroup, enabling on-policy distillation training.

Based on OpenClaw-RL (Princeton, arXiv:2603.10165):
- Extracts hindsight hints from next-state signals (tool results, errors)
- Uses LLM judge with majority voting for hint extraction
- Scores student tokens under hint-enhanced distribution via get_logprobs
- Packages teacher's top-K predictions as distillation targets

Architecture:
- AgenticOPDEnv extends HermesAgentBaseEnv
- Overrides collect_trajectories to add OPD pipeline after standard rollouts
- Uses Atropos's built-in get_logprobs (VLLM prompt_logprobs) for teacher scoring
- No external servers needed — same VLLM backend handles both rollouts and scoring

Task: Coding problems with test verification (8 built-in tasks, HF dataset support)
Reward: correctness (0.7) + efficiency (0.15) + tool usage (0.15)
OPD: Per-turn hint extraction → enhanced prompt → teacher top-K logprobs

Configurable: opd_enabled, distill_topk, prm_votes, hint truncation length
Metrics: opd/mean_hints_per_rollout, opd/mean_turns_scored, opd/hint_rate
When the model returns multiple tool calls in a single response, they are
now executed concurrently using a thread pool instead of sequentially.
This significantly reduces wall-clock time when multiple independent tools
are batched (e.g. parallel web_search, read_file, terminal calls).

Architecture:
- _execute_tool_calls() dispatches to sequential or concurrent path
- Single tool calls and batches containing 'clarify' use sequential path
- Multiple non-interactive tools use ThreadPoolExecutor (max 8 workers)
- Results are collected and appended to messages in original order
- _invoke_tool() extracted as shared tool invocation helper

Safety:
- Pre-flight interrupt check skips all tools if interrupted
- Per-tool exception handling: one failure doesn't crash the batch
- Result truncation (100k char limit) applied per tool
- Budget pressure injection after all tools complete
- Checkpoints taken before file-mutating tools
- CLI spinner shows batch progress, then per-tool completion messages

Tests: 10 new tests covering dispatch logic, ordering, error handling,
interrupt behavior, truncation, and _invoke_tool routing.
anthropic/claude-opus-4.6 (OpenRouter format) was being sent as
claude-opus-4.6 to the Anthropic API, which expects claude-opus-4-6
(hyphens, not dots).

normalize_model_name() now converts dots to hyphens after stripping
the provider prefix, matching Anthropic's naming convention.

Fixes 404: 'model: claude-opus-4.6 was not found'
…d28bf447

feat: Agentic On-Policy Distillation (OPD) environment
When a skill declares required_environment_variables in its YAML
frontmatter, missing env vars trigger a secure TUI prompt (identical
to the sudo password widget) when the skill is loaded. Secrets flow
directly to ~/.hermes/.env, never entering LLM context.

Key changes:
- New required_environment_variables frontmatter field for skills
- Secure TUI widget (masked input, 120s timeout)
- Gateway safety: messaging platforms show local setup guidance
- Legacy prerequisites.env_vars normalized into new format
- Remote backend handling: conservative setup_needed=True
- Env var name validation, file permissions hardened to 0o600
- Redact patterns extended for secret-related JSON fields
- 12 existing skills updated with prerequisites declarations
- ~48 new tests covering skip, timeout, gateway, remote backends
- Dynamic panel widget sizing (fixes hardcoded width from original PR)

Cherry-picked from PR NousResearch#723 by kshitijk4poor, rebased onto current main
with conflict resolution.

Fixes NousResearch#688

Co-authored-by: kshitijk4poor <kshitijk4poor@users.noreply.github.com>
…42bc21fb

feat: secure skill env setup on load (core NousResearch#688)
fix: add missing packages to setuptools config
…f47f71c0

feat: concurrent tool execution with ThreadPoolExecutor
@kshitijk4poor kshitijk4poor force-pushed the feat/skill-prerequisites-followup branch from 9ad2741 to 35436e3 Compare March 13, 2026 10:28
kshitijk4poor pushed a commit that referenced this pull request Mar 19, 2026
…ult (NousResearch#1922)

SOUL.md now loads in slot #1 of the system prompt, replacing the
hardcoded DEFAULT_AGENT_IDENTITY. This lets users fully customize
the agent's identity and personality by editing ~/.hermes/SOUL.md
without it conflicting with the built-in identity text.

When SOUL.md is loaded as identity, it's excluded from the context
files section to avoid appearing twice. When SOUL.md is missing,
empty, unreadable, or skip_context_files is set, the hardcoded
DEFAULT_AGENT_IDENTITY is used as a fallback.

The default SOUL.md (seeded on first run) already contains the full
Hermes personality, so existing installs are unaffected.

Co-authored-by: Test <test@test.com>
kshitijk4poor pushed a commit that referenced this pull request Mar 19, 2026
Update all SOUL.md documentation to reflect that it now occupies
slot #1 in the system prompt, replacing the hardcoded default identity.

Updated pages:
- user-guide/features/personality.md — SOUL.md is primary identity, not just a layer
- developer-guide/prompt-assembly.md — updated prompt layer order, context files list
- guides/use-soul-with-hermes.md — SOUL.md replaces built-in identity
- user-guide/configuration.md — updated context files table and directory tree

Co-authored-by: Test <test@test.com>
kshitijk4poor added a commit that referenced this pull request Apr 18, 2026
…ts (NousResearch#11745)

Move moonshotai/kimi-k2.5 to position #1 in every model picker list:
- OPENROUTER_MODELS (with 'recommended' tag)
- _PROVIDER_MODELS: nous, kimi-coding, opencode-zen, opencode-go, alibaba, huggingface
- _model_flow_kimi() Coding Plan model list in main.py

kimi-coding-cn and moonshot lists already had kimi-k2.5 first.
kshitijk4poor pushed a commit that referenced this pull request Apr 23, 2026
- entry.tsx no longer writes bootBanner() to the main screen before the
  alt-screen enters. The <Banner> renders inside the alt screen via the
  seeded intro row, so nothing is lost — just the flash that preceded it.
  Fixes the torn first frame reported on Alacritty (blitz row 5 NousResearch#17) and
  shaves the 'starting agent' hang perception (row 5 #1) since the UI
  paints straight into the steady-state view
- AlternateScreen prefixes ERASE_SCROLLBACK (\x1b[3J) to its entry so
  strict emulators start from a pristine grid; named constants replace
  the inline sequences for clarity
- bootBanner.ts deleted — dead code
kshitijk4poor pushed a commit that referenced this pull request May 1, 2026
The scheme-validation commit (e77a3f2c) was too strict: a user with
legacy ''baseUrl: localhost:8000'' (no ''http://'' prefix) in their
''~/.honcho/config.json'' would get ''No API key configured'' from the
CLI after that change, even though their setup worked before.

urlparse on a schemeless host:port treats the host segment as the
scheme and leaves netloc empty, so the http/https check rejected it.

Falls back to a lenient check for schemeless strings that look like
hosts: contain '.' or ':', aren't a boolean/null literal, aren't pure
digits. The SDK still rejects truly malformed URLs at connect time
with a clearer error than ours.

Three new tests: legacy schemeless hosts accepted; obvious garbage
literals (''true'', ''null'', ''12345'') still rejected.  Reviewer
noted concern #1: schemeless regression for self-hosters with old
configs.
kshitijk4poor pushed a commit that referenced this pull request May 1, 2026
* ci(nix): auto-fix stale npm hashes on push to main

When a PR merges to main with updated package-lock.json or package.json
in ui-tui/ or web/, the new auto-fix-main job detects stale npmDepsHash
values and pushes a fix commit directly to main.

This eliminates the recurring manual hash-bump PRs (NousResearch#15420, NousResearch#15314,
NousResearch#15272, NousResearch#15244) by reusing the existing fix-lockfiles --apply pipeline.

The fix commit only touches nix/*.nix files, which are outside the push
path filter (package-lock.json / package.json), so it cannot re-trigger
itself.

Closes NousResearch#15314

* fix(ci): use GitHub App token for auto-fix-main push

GITHUB_TOKEN commits are invisible to workflow triggers (GitHub's
infinite-loop prevention). The auto-fix-main job pushes directly to
main, so the fix commit never triggered downstream nix.yml verification.

Mint a short-lived token via the repo's GitHub App (daimon-nous, APP_ID
+ APP_PRIVATE_KEY secrets) so the push is treated as a real event and
nix.yml fires to verify the corrected hashes.

Tested via workflow_dispatch dry-run: app token minted successfully,
checkout with app token succeeded, fix job correctly gated.

Resolves review feedback from Bugbot (r3144569551).

* ci(nix): rename lockfile check job for required status check

Rename 'check' → 'nix-lockfile-check' so the status check name is
unambiguous when added as a required check on main.

* fix(ci): harden auto-fix-main against races, loops, and silent failures

Address adversarial review findings:

1. Race condition (#1): Job-level concurrency with cancel-in-progress
   collapses back-to-back pushes; ref: main checkout always gets latest
   branch state; explicit push target (origin HEAD:main).

2. Loop prevention (#2): File-whitelist check before commit aborts if
   any file outside nix/{tui,web}.nix was modified, preventing
   accidental self-triggering.

3. Silent infra failures (#8): nix-lockfile-check now fails explicitly
   when fix-lockfiles exits without reporting stale status (catches nix
   setup failures, network errors, script bugs that bypass continue-on-error).

4. Commit traceability (NousResearch#11): Auto-fix commits include source SHA and
   workflow run URL in the commit body.

5. Explicit push target (NousResearch#12): git push origin HEAD:main instead of
   bare git push.

---------

Co-authored-by: alt-glitch <alt-glitch@users.noreply.github.com>
kshitijk4poor pushed a commit that referenced this pull request May 1, 2026
…ch#16706)

* fix(tui): drop stale stream events after ctrl-c interrupt

Once interruptTurn() flips this.interrupted, only recordMessageDelta
short-circuited.  recordReasoningDelta/Available, recordToolStart/
Progress/Complete, and recordInlineDiffToolComplete kept populating
turnState until the python loop reached its next _interrupt_requested
check (~1s on busy turns), making it look like ctrl-c was ignored
while late "thinking" + tool calls kept landing in the UI.

Add the same interrupted guard to every stream-side recorder, and
clear the flag at startMessage() so the next turn isn't suppressed
if the previous turn never delivered message.complete.

* fix(tui): guard recordTodos against post-interrupt mutation; fake-timers in test

Copilot review on PR NousResearch#16706:

1. `recordToolStart` is interruption-guarded, but `tool.start`
   handler also calls `recordTodos(payload.todos)` first — so a
   late tool.start carrying todos could still mutate `turnState.todos`
   after Ctrl-C, leaving ghost rows in the panel.  Adds the same
   `if (this.interrupted) return` early-exit to `recordTodos` so
   *all* tool.start side-effects are dropped post-interrupt.

2. The interrupt test was leaking a real `setTimeout` (interrupt
   cooldown) across test files, which could fire later and mutate
   uiStore from the wrong test context.  Wraps the test in
   `vi.useFakeTimers()` + `vi.runAllTimers()` and restores real
   timers in finally.

3. Extends the same test with a todos payload on the post-interrupt
   tool.start so we have explicit regression coverage for #1.

* fix(tui): guard pushTrail post-interrupt; harden interrupt-test cleanup

Round 2 Copilot review on PR NousResearch#16706:

1. `tool.generating` events route through `pushTrail`, which was not
   interruption-guarded — late events could still write 'drafting …'
   into `turnTrail` after Ctrl-C, leaving a stale shimmer in the UI.
   Adds the same `if (this.interrupted) return` early-exit.

2. Test cleanup moved `vi.runAllTimers()` into `finally` (before
   `vi.useRealTimers()`) so a mid-test assertion failure can't leak
   the interrupt-cooldown setTimeout across other test files.

3. Replaced the misleading 'pre-interrupt todos … expected to be
   cleared by the interrupt cycle' comment with an accurate one
   reflecting current behaviour (interrupt does NOT clear todos).

4. Added an explicit assertion that a post-interrupt `tool.generating`
   event does not extend `turnTrail` — regression coverage for #1.
kshitijk4poor pushed a commit that referenced this pull request May 11, 2026
…3456)

* feat(goals): /goal checklist + /subgoal user controls

Two-phase judge for /goal — Phase A decomposes the goal into a detailed
checklist on first turn; Phase B evaluates each pending item harshly
against the agent's most recent response. The goal completes only when
every item is in a terminal status (completed or impossible). Adds
/subgoal so the user can append, complete, mark impossible, undo,
remove, or clear items the judge missed or got wrong.

Mechanics:
- GoalState gains `checklist` and `decomposed` fields, both backwards
  compatible (old state_meta rows load unchanged).
- Phase A: aux call writes a harsh, exhaustive checklist; biased toward
  more items not fewer. Falls through to legacy freeform judge when
  decompose fails.
- Phase B: judge gets the checklist + last-response snippet + path to
  a per-session conversation dump at <HERMES_HOME>/goals/<sid>.json.
  A bounded read_file tool (max 5 calls per turn, restricted to that
  one file) lets the judge inspect history when the snippet is
  ambiguous. Stickiness in code: terminal items are frozen, only the
  user can revert via /subgoal undo.
- Continuation prompt shows checklist progress when non-empty;
  reverts to old prompt when empty.
- Status line shows M/N done counts.

CLI + gateway + TUI gateway all pass the agent reference into
evaluate_after_turn so the dump can be written. Gateway-side
/subgoal is allowed mid-run since it only modifies the checklist
the judge consults at turn boundaries.

Tests: 24 new cases — backcompat round-trip, Phase A decompose,
Phase B updates + new_items + stickiness, user override flows,
conversation dump (incl. unsafe-sid sanitization), judge read_file
restriction. Existing freeform-mode tests updated to patch the
renamed `judge_goal_freeform` and skip Phase A explicitly.

* fix(goals): off-by-one in judge index, message-list plumbing, prompt tuning

Three live-test findings from running /goal end-to-end against
gemini-3-flash-preview as the judge:

1. Off-by-one bug — the judge sees the checklist rendered with 1-based
   indices ('1. [ ] foo, 2. [ ] bar') but the apply layer indexed
   state.checklist as 0-based. Result: every judge update landed on
   the wrong item, evidence got attached to neighbouring rows, and
   the genuine 'first pending' item (usually #1) never got marked.
   Fix: convert 1 → 0 in _parse_evaluate_response. Also tightened the
   user prompt to call out the 1-based scheme explicitly. New tests
   cover the parser conversion + an end-to-end fake-judge round-trip.

2. Conversation dump never happened — _extract_agent_messages tried
   common AIAgent attribute names (.messages, .conversation_history,
   etc.) but AIAgent doesn't expose the message list as an instance
   attribute; it lives inside run_conversation()'s scope. Result: the
   judge's read_file tool always saw history_path=unavailable. Fix:
   added an explicit messages= kwarg to evaluate_after_turn that all
   three call sites (CLI, gateway, TUI gateway) now pass directly.
   Agent-attribute extraction kept as back-compat fallback.

3. Prompt was too harsh on simple goals. The original 'be HARSH,
   default to leaving items pending' wording made the judge refuse
   to mark 'file exists' completed even after the agent ran ls,
   test -f, os.path.isfile, and find — burning the entire 8-turn
   budget on a fizzbuzz task. Softened to 'strict but not absurd'
   with explicit guidance on what counts as evidence and a directive
   not to require re-proving items already established earlier.

Re-tested live with the same fizzbuzz goal: now terminates in 2
turns with all 8 checklist items correctly attributed to their
own evidence. /subgoal user-action flow (add / complete / undo /
impossible) verified live as well.
kshitijk4poor added a commit that referenced this pull request May 13, 2026
…registries

Both web_search_registry._resolve() and image_gen_registry.get_active_provider()
walked their registered providers and returned the first one matching the
capability flag — without checking whether that provider was actually
usable. On a fresh install with no credentials at all, this meant
get_active_search_provider() returned `brave-free` (legacy preference
order) even though BRAVE_SEARCH_API_KEY was unset, leading the
dispatcher to surface a "BRAVE_SEARCH_API_KEY is not set" error for a
provider the user never chose. Same bug shape in image_gen for FAL.

Resolution semantics now match tools.web_tools._get_backend():

  1. Explicit config name wins, ignoring is_available() — the dispatcher
     surfaces a precise "X_API_KEY is not set" error rather than silently
     switching backends. Matches user expectation: "I configured X, tell
     me what's wrong with X."
  2. Fallback (no explicit config) walks the legacy preference order
     filtered by is_available() — pick the highest-priority backend the
     user actually has credentials for.

is_available() is wrapped in a try/except so a buggy provider doesn't
brick resolution.

E2E verified:
  - No creds + no config: get_active_search_provider() -> None
  - Explicit brave-free + no key: get_active_search_provider() -> brave-free
    (and .is_available() correctly reports False)

This fix was identified during the spike (NousResearch#25182 finding #1) and is
fold-in to the same PR rather than a follow-up.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.