feat: add optional skill backend probing and command checks#1
Closed
kshitijk4poor wants to merge 277 commits into
_fetch_models_from_api checked for "hide" while _read_cache_models checked for "hidden", causing models hidden by the API to still appear when loaded from cache. Both now accept either value.
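A minimal sketch of the shared check, assuming each model entry is a dict and that the flag lives under a `status` key. Both the key name and the helper names here are hypothetical; the source only says that one path looked for "hide" and the other for "hidden".

```python
# Hypothetical helper: the "status" key and function names are assumptions,
# not taken from the actual Hermes code.
HIDDEN_VALUES = {"hide", "hidden"}


def is_hidden(model: dict) -> bool:
    """Return True when a model entry is marked hidden under either spelling."""
    return str(model.get("status", "")).lower() in HIDDEN_VALUES


def visible_models(models: list[dict]) -> list[dict]:
    """Filter used by both the API path and the cache path, so they cannot drift."""
    return [m for m in models if not is_hidden(m)]
```

Routing both the fetch path and the cache path through one predicate is what keeps the two spellings from diverging again.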
…load path is reached
…allMode Adds full Honcho memory integration to Hermes: - Session manager with async background writes, memory modes (honcho/hybrid/local), and dialectic prefetch for first-turn context warming - Agent integration: prefetch pipeline, tool surface gated by recallMode, system prompt context injection, SIGTERM/SIGINT flush handlers - CLI commands: setup, status, mode, tokens, peer, identity, migrate - recallMode setting (auto | context | tools) for A/B testing retrieval strategies - Session strategies: per-session, per-repo (git tree root), per-directory, global - Polymorphic memoryMode config: string shorthand or per-peer object overrides - 97 tests covering async writes, client config, session resolution, and memory modes
Tell users to go to app.honcho.dev > Settings > API Keys. Updated in setup walkthrough, setup prompt, and client error message.
Explain what context vs dialectic actually do in plain language: context = raw memory retrieval, dialectic = AI-to-AI inference for session continuity. Describe what user/AI peer cards are.
Matches the mental model: hybrid = context + tools, context = context only, tools = tools only.
New tool lets Hermes persist conclusions about the user (preferences, corrections, project context) directly to Honcho via the conclusions API. Feeds into the user's peer card and representation.
Consistent naming: all honcho tools now prefixed with honcho_ (honcho_context, honcho_search, honcho_profile, honcho_conclude).
Optional 'peer' parameter: "user" (default) or "ai". Allows asking about the AI assistant's history/identity, not just the user's.
…-skill Add ASCII video skill to creative category
Replaces the stub docs with comprehensive coverage: setup (interactive + manual), all config fields, memory modes, recall modes, write frequency, session strategies, host blocks, async prefetch pipeline, dual-peer architecture, dynamic reasoning, gateway integration, four tools, full CLI reference, migration paths, and AI peer identity. Trims the Honcho section in memory.md to a cross-reference.
… language Adds back use cases section and example tool queries from the original docs. Clarifies that built-in memory and Honcho can work together or be configured separately via memoryMode.
Setup wizard now writes memoryMode, writeFrequency, recallMode, and sessionStrategy into hosts.hermes instead of the config root. Client resolution updated to read sessionStrategy and sessionPeerPrefix from host block first. Docs updated to show hosts-based config as the default example so other integrations can coexist cleanly.
…lock" This reverts commit c90ba02.
…_paths/audio_path/document_paths
The Signal adapter was passing image_paths, audio_path, and document_paths to MessageEvent.__init__(), but those fields don't exist on the dataclass. MessageEvent uses media_urls (List[str]) and media_types (List[str]).
Changes:
- Replace separate image_paths/audio_path/document_paths with unified media_urls and media_types lists (matching Discord, Slack, etc.)
- Add _ext_to_mime() helper to map file extensions to MIME types
- Use Signal's contentType from attachment metadata when available, falling back to extension-based mapping
- Update message type detection to check media_types prefixes
Fixes TypeError: MessageEvent.__init__() got an unexpected keyword argument 'image_paths'
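The contentType-first, extension-second lookup described above might look roughly like this. It is a sketch only: the mapping table, the `filename` key, and the function names are assumptions rather than the adapter's real code.

```python
import mimetypes
from pathlib import Path

# Small explicit table (an assumption) so common types resolve the same way
# on every platform; anything else falls through to the stdlib guess.
_EXT_TO_MIME = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".ogg": "audio/ogg",
    ".mp3": "audio/mpeg",
    ".pdf": "application/pdf",
}


def ext_to_mime(path: str) -> str:
    """Map a file extension to a MIME type, with a generic fallback."""
    ext = Path(path).suffix.lower()
    if ext in _EXT_TO_MIME:
        return _EXT_TO_MIME[ext]
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed or "application/octet-stream"


def media_type_for(attachment: dict) -> str:
    """Prefer the attachment's own contentType; otherwise infer from the name."""
    return attachment.get("contentType") or ext_to_mime(attachment.get("filename", ""))
```

Message type detection can then check prefixes such as `media_type.startswith("image/")` instead of relying on separate per-kind fields.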
… MessageEvent construction
_keep_typing() was called with metadata= for thread-aware typing indicators, but neither it nor the base send_typing() accepted that parameter. Most adapter overrides (Slack, Discord, Telegram, WhatsApp, HA) already accept metadata=None, but the base class and Signal adapter did not.
- Add metadata=None to BasePlatformAdapter.send_typing()
- Add metadata=None to BasePlatformAdapter._keep_typing(), pass through
- Add metadata=None to SignalAdapter.send_typing()
Fixes TypeError in _process_message_background for Signal.
…o, metadata) Signal's send() used 'text' instead of 'content' and 'reply_to_message_id' instead of 'reply_to', mismatching BasePlatformAdapter.send(). Callers in gateway/run.py use keyword args matching the base interface, so Signal's send() was missing its required 'text' positional arg. Fixes: 'SignalAdapter.send() missing 1 required positional argument: text'
…sResearch#860)
Three separate code paths all wrote to the same SQLite state.db with no deduplication, inflating session transcripts by 3-4x:
1. _log_msg_to_db(): wrote each message individually after append
2. _flush_messages_to_session_db(): re-wrote ALL new messages at every _persist_session() call (~18 exit points), with no tracking of what was already written
3. gateway append_to_transcript(): wrote everything a third time after the agent returned
Since load_transcript() prefers SQLite over JSONL, the inflated data was loaded on every session resume, causing proportional token waste.
Fix:
- Remove _log_msg_to_db() and all 16 call sites (redundant with flush)
- Add _last_flushed_db_idx tracking in _flush_messages_to_session_db() so repeated _persist_session() calls only write truly new messages
- Reset flush cursor on compression (new session ID)
- Add skip_db parameter to SessionStore.append_to_transcript() so the gateway skips SQLite writes when the agent already persisted them
- Gateway now passes skip_db=True for agent-managed messages, still writes to JSONL as backup
Verified: a 12-message CLI session with tool calls produces exactly 12 SQLite rows with zero duplicates (previously would be 36-48).
Tests: 9 new tests covering flush deduplication, skip_db behavior, compression reset, and initialization. Full suite passes (2869 tests).
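The flush-cursor idea can be illustrated in isolation. This is a simplified sketch, not the project's code: the class name is hypothetical and a plain list stands in for the SQLite table.

```python
class SessionFlusher:
    """Writes only messages that have not been flushed before."""

    def __init__(self) -> None:
        self.rows: list[str] = []      # stand-in for the SQLite transcript table
        self._last_flushed_idx = 0     # cursor into the in-memory message list

    def flush(self, messages: list[str]) -> int:
        """Persist messages past the cursor and return how many were written."""
        new_messages = messages[self._last_flushed_idx:]
        self.rows.extend(new_messages)
        self._last_flushed_idx = len(messages)
        return len(new_messages)

    def reset_cursor(self) -> None:
        """Called when compression starts a new session with a fresh message list."""
        self._last_flushed_idx = 0
```

Because the cursor advances to `len(messages)` on every call, calling `flush()` from many exit points is harmless: repeated calls with no new messages write nothing.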
…on in gateway sessions Fixes NousResearch#860.
Use rich_box.HORIZONTALS instead of the default ROUNDED box style for the agent response panel. This keeps the top/bottom horizontal rules (with title) but removes the vertical │ borders on left and right, making it much easier to copy-paste response text from the terminal.
…search loops after context compression Authored by 0xbyt4. Adds read/search loop detection, file history injection after compression, and todo filtering for active items only.
…ds, fix bugs
Follow-up to PR NousResearch#705 (merged from 0xbyt4). Addresses several issues:
1. CONSECUTIVE-ONLY TRACKING: Redesigned the read/search tracker to only warn/block on truly consecutive identical calls. Any other tool call in between (write, patch, terminal, etc.) resets the counter via notify_other_tool_call(), called from handle_function_call() in model_tools.py. This prevents false blocks in read→edit→verify flows.
2. THRESHOLD ADJUSTMENT: Warn on 3rd consecutive (was 2nd), block on 4th+ consecutive (was 3rd+). Gives the model more room before intervening.
3. TUPLE UNPACKING BUG: Fixed get_read_files_summary() which crashed on search keys (5-tuple) when trying to unpack as 3-tuple. Now uses a separate read_history set that only tracks file reads.
4. WEB_EXTRACT DOCSTRING: Reverted incorrect removal of 'title' from web_extract return docs in code_execution_tool.py; the field IS returned by web_tools.py.
5. TESTS: Rewrote test_read_loop_detection.py (35 tests) to cover consecutive-only behavior, notify_other_tool_call, interleaved read/search, and summary-unaffected-by-searches.
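A compact sketch of consecutive-only tracking with the thresholds stated above (warn on the 3rd identical call, block from the 4th). The class name and the string verdicts are illustrative assumptions; only `notify_other_tool_call` is named in the commit message.

```python
from typing import Hashable, Optional


class ConsecutiveCallTracker:
    WARN_AT = 3   # warn on the 3rd consecutive identical call
    BLOCK_AT = 4  # block on the 4th and later

    def __init__(self) -> None:
        self._last_key: Optional[Hashable] = None
        self._count = 0

    def record(self, key: Hashable) -> str:
        """Register a read/search call and return 'ok', 'warn', or 'block'."""
        if key == self._last_key:
            self._count += 1
        else:
            self._last_key, self._count = key, 1
        if self._count >= self.BLOCK_AT:
            return "block"
        if self._count == self.WARN_AT:
            return "warn"
        return "ok"

    def notify_other_tool_call(self) -> None:
        """Any unrelated tool call (write, patch, terminal) breaks the streak."""
        self._last_key, self._count = None, 0
```

In a read, edit, verify flow the edit step calls `notify_other_tool_call()`, so the second read of the same file starts a fresh streak instead of counting toward a block.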
fix(doctor): treat configured honcho as available
Haiku models don't support extended thinking at all. Without this guard, claude-haiku-4-5-20251001 would receive type=enabled + budget_tokens and return a 400 error. Incorporates the fix from PR NousResearch#1127 (by frizynn) on top of NousResearch#1128's adaptive thinking refactor. Verified live with Claude Code OAuth: claude-opus-4-6 → adaptive thinking ✓ claude-haiku-4-5 → no thinking params ✓ claude-sonnet-4 → enabled thinking ✓
…c877bdeb fix(anthropic): skip thinking params for Haiku models
…on, reauthentication (NousResearch#1132) Fixes Anthropic OAuth/subscription authentication end-to-end: Auth failures (401 errors): - Add missing 'claude-code-20250219' beta header for OAuth tokens. Both clawdbot and OpenCode include this alongside 'oauth-2025-04-20' — without it, Anthropic's API rejects OAuth tokens with 401 authentication errors. - Fix _fetch_anthropic_models() to use canonical beta headers from _COMMON_BETAS + _OAUTH_ONLY_BETAS instead of hardcoding. Token refresh: - Add _refresh_oauth_token() — when Claude Code credentials from ~/.claude/.credentials.json are expired but have a refresh token, automatically POST to console.anthropic.com/v1/oauth/token to get a new access token. Uses the same client_id as Claude Code / OpenCode. - Add _write_claude_code_credentials() — writes refreshed tokens back to ~/.claude/.credentials.json, preserving other fields. - resolve_anthropic_token() now auto-refreshes expired tokens before returning None. Config contamination: - Anthropic's _model_flow_anthropic() no longer saves base_url to config. Since resolve_runtime_provider() always hardcodes Anthropic's URL, the stale base_url was contaminating other providers when users switched without re-running 'hermes model' (e.g., Codex hitting api.anthropic.com). - _update_config_for_provider() now pops base_url when passed empty string. - Same fix in setup.py. 
Flow/UX (hermes model command): - CLAUDE_CODE_OAUTH_TOKEN env var now checked in credential detection - Reauthentication option when existing credentials found - run_oauth_setup_token() runs 'claude setup-token' as interactive subprocess, then auto-detects saved credentials - Clean has_creds/needs_auth flow in both main.py and setup.py Tests (14 new): - Beta header assertions for claude-code-20250219 - Token refresh: successful refresh with credential writeback, failed refresh returns None, no refresh token returns None - Credential writeback: new file creation, preserving existing fields - Auto-refresh integration in resolve_anthropic_token() - CLAUDE_CODE_OAUTH_TOKEN fallback, credential file auto-discovery - run_oauth_setup_token() (5 scenarios)
Complete rewrite of the neuroskill-bci skill based on actual source material from the NeuroSkill desktop app and NeuroLoop CLI repos. Supersedes PR NousResearch#708. Key improvements over NousResearch#708: - All CLI commands verified against actual NeuroSkill/NeuroLoop source - Added --json flag usage throughout (critical for reliable parsing) - Fixed metric formulas: Focus = σ(β/(α+θ)), Relaxation = σ(α/(β+θ)) - Scores are 0-1 scale (not 0-100 as in NousResearch#708) - Added all 40+ metrics: FAA, TAR, BAR, TBR, APF, SNR, coherence, consciousness (LZC, wakefulness, integration), complexity (PE, HFD, DFA), cardiac (RMSSD, SDNN, pNN50, LF/HF, stress index, SpO2), motion (stillness, blinks, jaw clenches, nods, shakes) - Added all missing CLI subcommands: session, search-labels, interactive, listen, umap, calibrate, timer, notify, raw - Protocols sourced from actual NeuroLoop protocol repertoire (70+) organized by category (attention, stress, emotional, sleep, somatic, digital, dietary, motivation) - Added full WebSocket/HTTP API reference with all endpoints and JSON response formats - Fixed gamma range: 30-50 Hz (not 30-100) - Added signal quality per electrode with thresholds - Added composite state patterns (flow, fatigue, anxiety, creative, etc.) - Added ZUNA embedding documentation - Placed as optional built-in skill (not bundled by default) Files: - optional-skills/health/DESCRIPTION.md (new category) - optional-skills/health/neuroskill-bci/SKILL.md (main skill) - optional-skills/health/neuroskill-bci/references/metrics.md - optional-skills/health/neuroskill-bci/references/protocols.md - optional-skills/health/neuroskill-bci/references/api.md Refs: NousResearch#694, NousResearch#708
…6ec3b1a9 feat(skills): add NeuroSkill BCI integration as optional built-in skill
- Updated command output handling to use RichText for ANSI formatting. - Improved response display in chat console with RichText integration. - Ensured fallback for empty command outputs with a clear message.
Persist OAuth/setup tokens in ANTHROPIC_TOKEN instead of ANTHROPIC_API_KEY. Reserve ANTHROPIC_API_KEY for regular Console API keys. Changes: - anthropic_adapter: reorder resolve_anthropic_token() priority — ANTHROPIC_TOKEN first, ANTHROPIC_API_KEY as legacy fallback - config: add save_anthropic_oauth_token() / save_anthropic_api_key() helpers that clear the opposing slot to prevent priority conflicts - config: show_config() prefers ANTHROPIC_TOKEN for display - setup: OAuth login and pasted setup-tokens write to ANTHROPIC_TOKEN - setup: API key entry writes to ANTHROPIC_API_KEY and clears ANTHROPIC_TOKEN - main: same fixes in _run_anthropic_oauth_flow() and _model_flow_anthropic() - main: _has_any_provider_configured() checks ANTHROPIC_TOKEN - doctor: use _is_oauth_token() for correct auth method validation - runtime_provider: updated error message - run_agent: simplified client init to use resolve_anthropic_token() - run_agent: updated 401 troubleshooting messages - status: prefer ANTHROPIC_TOKEN in status display - tests: updated priority test, added persistence helper tests Cherry-picked from PR NousResearch#1141 by kshitijk4poor, rebased onto current main with unrelated changes (web_policy config, blocklist CLI) removed. Co-authored-by: kshitijk4poor <kshitijk4poor@users.noreply.github.com>
…6ec3b1a9 fix: separate Anthropic OAuth tokens from API keys
First Atropos environment to populate distill_token_ids / distill_logprobs on ScoredDataGroup, enabling on-policy distillation training. Based on OpenClaw-RL (Princeton, arXiv:2603.10165): - Extracts hindsight hints from next-state signals (tool results, errors) - Uses LLM judge with majority voting for hint extraction - Scores student tokens under hint-enhanced distribution via get_logprobs - Packages teacher's top-K predictions as distillation targets Architecture: - AgenticOPDEnv extends HermesAgentBaseEnv - Overrides collect_trajectories to add OPD pipeline after standard rollouts - Uses Atropos's built-in get_logprobs (VLLM prompt_logprobs) for teacher scoring - No external servers needed — same VLLM backend handles both rollouts and scoring Task: Coding problems with test verification (8 built-in tasks, HF dataset support) Reward: correctness (0.7) + efficiency (0.15) + tool usage (0.15) OPD: Per-turn hint extraction → enhanced prompt → teacher top-K logprobs Configurable: opd_enabled, distill_topk, prm_votes, hint truncation length Metrics: opd/mean_hints_per_rollout, opd/mean_turns_scored, opd/hint_rate
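The reward mix listed above is a weighted sum. A sketch, assuming each component score is already normalized to the range 0 to 1 (the source does not say how the components are computed, and the key names are hypothetical):

```python
# Weights come from the commit message; the component key names are assumptions.
REWARD_WEIGHTS = {"correctness": 0.7, "efficiency": 0.15, "tool_usage": 0.15}


def combined_reward(scores: dict[str, float]) -> float:
    """Weighted sum of component scores, each clamped to [0, 1]."""
    total = 0.0
    for name, weight in REWARD_WEIGHTS.items():
        value = min(1.0, max(0.0, scores.get(name, 0.0)))
        total += weight * value
    return total
```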
When the model returns multiple tool calls in a single response, they are now executed concurrently using a thread pool instead of sequentially. This significantly reduces wall-clock time when multiple independent tools are batched (e.g. parallel web_search, read_file, terminal calls).
Architecture:
- _execute_tool_calls() dispatches to sequential or concurrent path
- Single tool calls and batches containing 'clarify' use sequential path
- Multiple non-interactive tools use ThreadPoolExecutor (max 8 workers)
- Results are collected and appended to messages in original order
- _invoke_tool() extracted as shared tool invocation helper
Safety:
- Pre-flight interrupt check skips all tools if interrupted
- Per-tool exception handling: one failure doesn't crash the batch
- Result truncation (100k char limit) applied per tool
- Budget pressure injection after all tools complete
- Checkpoints taken before file-mutating tools
- CLI spinner shows batch progress, then per-tool completion messages
Tests: 10 new tests covering dispatch logic, ordering, error handling, interrupt behavior, truncation, and _invoke_tool routing.
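The dispatch rule and ordering guarantee can be sketched with the standard library alone. This is an illustration under assumptions: `invoke_tool` is a stand-in for the real helper, and tool calls are modeled as `(name, args)` tuples.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 8  # cap taken from the description above


def invoke_tool(name: str, args: dict) -> str:
    """Hypothetical stand-in for the real tool invocation helper."""
    if name == "fail":
        raise RuntimeError("boom")
    return f"{name}:{sorted(args.items())}"


def _safe_invoke(call: tuple[str, dict]) -> str:
    """Run one tool; turn an exception into an error string so the batch survives."""
    name, args = call
    try:
        return invoke_tool(name, args)
    except Exception as exc:
        return f"error: {exc}"


def execute_tool_calls(calls: list[tuple[str, dict]]) -> list[str]:
    """Run sequentially for single or interactive batches, else concurrently."""
    if len(calls) <= 1 or any(name == "clarify" for name, _ in calls):
        return [_safe_invoke(call) for call in calls]
    workers = min(MAX_WORKERS, len(calls))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order regardless of finish order.
        return list(pool.map(_safe_invoke, calls))
```

Using `Executor.map` rather than `as_completed` is one simple way to keep results in the original order; the actual implementation may collect futures differently.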
anthropic/claude-opus-4.6 (OpenRouter format) was being sent as claude-opus-4.6 to the Anthropic API, which expects claude-opus-4-6 (hyphens, not dots). normalize_model_name() now converts dots to hyphens after stripping the provider prefix, matching Anthropic's naming convention. Fixes 404: 'model: claude-opus-4.6 was not found'
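A minimal sketch of that normalization, assuming the provider prefix is whatever precedes the first slash (the real function may strip prefixes more selectively):

```python
def normalize_model_name(model: str) -> str:
    """Strip an OpenRouter-style provider prefix, then convert dots to hyphens."""
    name = model.split("/", 1)[-1]   # "anthropic/claude-opus-4.6" -> "claude-opus-4.6"
    return name.replace(".", "-")    # "claude-opus-4.6" -> "claude-opus-4-6"
```

Names that are already in the hyphenated form pass through unchanged.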
…d28bf447 feat: Agentic On-Policy Distillation (OPD) environment
When a skill declares required_environment_variables in its YAML frontmatter, missing env vars trigger a secure TUI prompt (identical to the sudo password widget) when the skill is loaded. Secrets flow directly to ~/.hermes/.env, never entering LLM context. Key changes: - New required_environment_variables frontmatter field for skills - Secure TUI widget (masked input, 120s timeout) - Gateway safety: messaging platforms show local setup guidance - Legacy prerequisites.env_vars normalized into new format - Remote backend handling: conservative setup_needed=True - Env var name validation, file permissions hardened to 0o600 - Redact patterns extended for secret-related JSON fields - 12 existing skills updated with prerequisites declarations - ~48 new tests covering skip, timeout, gateway, remote backends - Dynamic panel widget sizing (fixes hardcoded width from original PR) Cherry-picked from PR NousResearch#723 by kshitijk4poor, rebased onto current main with conflict resolution. Fixes NousResearch#688 Co-authored-by: kshitijk4poor <kshitijk4poor@users.noreply.github.com>
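Two of the hardening steps above, env var name validation and 0o600 file permissions, can be sketched as follows. The function name, the regex, and the append-only write are assumptions for illustration; the real code presumably also handles updating existing keys.

```python
import os
import re

# Conservative pattern for environment variable names (an assumption).
_ENV_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def save_secret(env_path: str, name: str, value: str) -> None:
    """Append NAME=value to an env file readable only by its owner."""
    if not _ENV_NAME.match(name):
        raise ValueError(f"invalid environment variable name: {name!r}")
    # Create with 0o600 so the file is never briefly world-readable.
    fd = os.open(env_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    with os.fdopen(fd, "a") as handle:
        handle.write(f"{name}={value}\n")
    # Tighten permissions on files that existed before with a looser mode.
    os.chmod(env_path, 0o600)
```

The permission bits apply on POSIX systems; on Windows `os.chmod` only toggles the read-only flag.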
…42bc21fb feat: secure skill env setup on load (core NousResearch#688)
fix: add missing packages to setuptools config
…f47f71c0 feat: concurrent tool execution with ThreadPoolExecutor
9ad2741 to
35436e3
Compare
kshitijk4poor
pushed a commit
that referenced
this pull request
Mar 19, 2026
…ult (NousResearch#1922) SOUL.md now loads in slot #1 of the system prompt, replacing the hardcoded DEFAULT_AGENT_IDENTITY. This lets users fully customize the agent's identity and personality by editing ~/.hermes/SOUL.md without it conflicting with the built-in identity text. When SOUL.md is loaded as identity, it's excluded from the context files section to avoid appearing twice. When SOUL.md is missing, empty, unreadable, or skip_context_files is set, the hardcoded DEFAULT_AGENT_IDENTITY is used as a fallback. The default SOUL.md (seeded on first run) already contains the full Hermes personality, so existing installs are unaffected. Co-authored-by: Test <test@test.com>
kshitijk4poor
pushed a commit
that referenced
this pull request
Mar 19, 2026
Update all SOUL.md documentation to reflect that it now occupies slot #1 in the system prompt, replacing the hardcoded default identity. Updated pages: - user-guide/features/personality.md — SOUL.md is primary identity, not just a layer - developer-guide/prompt-assembly.md — updated prompt layer order, context files list - guides/use-soul-with-hermes.md — SOUL.md replaces built-in identity - user-guide/configuration.md — updated context files table and directory tree Co-authored-by: Test <test@test.com>
kshitijk4poor
added a commit
that referenced
this pull request
Apr 18, 2026
…ts (NousResearch#11745) Move moonshotai/kimi-k2.5 to position #1 in every model picker list: - OPENROUTER_MODELS (with 'recommended' tag) - _PROVIDER_MODELS: nous, kimi-coding, opencode-zen, opencode-go, alibaba, huggingface - _model_flow_kimi() Coding Plan model list in main.py kimi-coding-cn and moonshot lists already had kimi-k2.5 first.
kshitijk4poor
pushed a commit
that referenced
this pull request
Apr 23, 2026
- entry.tsx no longer writes bootBanner() to the main screen before the alt-screen enters. The <Banner> renders inside the alt screen via the seeded intro row, so nothing is lost — just the flash that preceded it. Fixes the torn first frame reported on Alacritty (blitz row 5 NousResearch#17) and shaves the 'starting agent' hang perception (row 5 #1) since the UI paints straight into the steady-state view - AlternateScreen prefixes ERASE_SCROLLBACK (\x1b[3J) to its entry so strict emulators start from a pristine grid; named constants replace the inline sequences for clarity - bootBanner.ts deleted — dead code
kshitijk4poor
pushed a commit
that referenced
this pull request
May 1, 2026
The scheme-validation commit (e77a3f2c) was too strict: a user with
legacy `baseUrl: localhost:8000` (no `http://` prefix) in their
`~/.honcho/config.json` would get `No API key configured` from the
CLI after that change, even though their setup worked before.
urlparse on a schemeless host:port treats the host segment as the
scheme and leaves netloc empty, so the http/https check rejected it.
The check now falls back to a lenient test for schemeless strings that
look like hosts: they contain '.' or ':', aren't a boolean/null literal,
and aren't pure digits. The SDK still rejects truly malformed URLs at
connect time with a clearer error than ours.
Three new tests: legacy schemeless hosts accepted; obvious garbage
literals (`true`, `null`, `12345`) still rejected. Reviewer
noted concern #1: schemeless regression for self-hosters with old
configs.
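The lenient fallback can be sketched with `urllib.parse.urlparse`. The function name and the exact set of rejected literals are assumptions; the source only states the three conditions.

```python
from urllib.parse import urlparse

_REJECTED_LITERALS = {"true", "false", "null"}  # assumed spelling of the literals


def looks_like_base_url(value: str) -> bool:
    """Accept http(s) URLs, plus schemeless strings that plausibly name a host."""
    value = value.strip()
    if not value:
        return False
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return True
    if "://" in value:
        return False  # has an explicit scheme, but not one we accept
    if value.lower() in _REJECTED_LITERALS or value.isdigit():
        return False
    # Schemeless: require something host-like such as "localhost:8000".
    return "." in value or ":" in value
```

For `localhost:8000`, `urlparse` leaves `netloc` empty, so the first branch does not match and the value is accepted by the host-like test instead.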
kshitijk4poor
pushed a commit
that referenced
this pull request
May 1, 2026
* ci(nix): auto-fix stale npm hashes on push to main When a PR merges to main with updated package-lock.json or package.json in ui-tui/ or web/, the new auto-fix-main job detects stale npmDepsHash values and pushes a fix commit directly to main. This eliminates the recurring manual hash-bump PRs (NousResearch#15420, NousResearch#15314, NousResearch#15272, NousResearch#15244) by reusing the existing fix-lockfiles --apply pipeline. The fix commit only touches nix/*.nix files, which are outside the push path filter (package-lock.json / package.json), so it cannot re-trigger itself. Closes NousResearch#15314 * fix(ci): use GitHub App token for auto-fix-main push GITHUB_TOKEN commits are invisible to workflow triggers (GitHub's infinite-loop prevention). The auto-fix-main job pushes directly to main, so the fix commit never triggered downstream nix.yml verification. Mint a short-lived token via the repo's GitHub App (daimon-nous, APP_ID + APP_PRIVATE_KEY secrets) so the push is treated as a real event and nix.yml fires to verify the corrected hashes. Tested via workflow_dispatch dry-run: app token minted successfully, checkout with app token succeeded, fix job correctly gated. Resolves review feedback from Bugbot (r3144569551). * ci(nix): rename lockfile check job for required status check Rename 'check' → 'nix-lockfile-check' so the status check name is unambiguous when added as a required check on main. * fix(ci): harden auto-fix-main against races, loops, and silent failures Address adversarial review findings: 1. Race condition (#1): Job-level concurrency with cancel-in-progress collapses back-to-back pushes; ref: main checkout always gets latest branch state; explicit push target (origin HEAD:main). 2. Loop prevention (#2): File-whitelist check before commit aborts if any file outside nix/{tui,web}.nix was modified, preventing accidental self-triggering. 3. 
Silent infra failures (#8): nix-lockfile-check now fails explicitly when fix-lockfiles exits without reporting stale status (catches nix setup failures, network errors, script bugs that bypass continue-on-error). 4. Commit traceability (NousResearch#11): Auto-fix commits include source SHA and workflow run URL in the commit body. 5. Explicit push target (NousResearch#12): git push origin HEAD:main instead of bare git push. --------- Co-authored-by: alt-glitch <alt-glitch@users.noreply.github.com>
kshitijk4poor
pushed a commit
that referenced
this pull request
May 1, 2026
…ch#16706) * fix(tui): drop stale stream events after ctrl-c interrupt Once interruptTurn() flips this.interrupted, only recordMessageDelta short-circuited. recordReasoningDelta/Available, recordToolStart/ Progress/Complete, and recordInlineDiffToolComplete kept populating turnState until the python loop reached its next _interrupt_requested check (~1s on busy turns), making it look like ctrl-c was ignored while late "thinking" + tool calls kept landing in the UI. Add the same interrupted guard to every stream-side recorder, and clear the flag at startMessage() so the next turn isn't suppressed if the previous turn never delivered message.complete. * fix(tui): guard recordTodos against post-interrupt mutation; fake-timers in test Copilot review on PR NousResearch#16706: 1. `recordToolStart` is interruption-guarded, but `tool.start` handler also calls `recordTodos(payload.todos)` first — so a late tool.start carrying todos could still mutate `turnState.todos` after Ctrl-C, leaving ghost rows in the panel. Adds the same `if (this.interrupted) return` early-exit to `recordTodos` so *all* tool.start side-effects are dropped post-interrupt. 2. The interrupt test was leaking a real `setTimeout` (interrupt cooldown) across test files, which could fire later and mutate uiStore from the wrong test context. Wraps the test in `vi.useFakeTimers()` + `vi.runAllTimers()` and restores real timers in finally. 3. Extends the same test with a todos payload on the post-interrupt tool.start so we have explicit regression coverage for #1. * fix(tui): guard pushTrail post-interrupt; harden interrupt-test cleanup Round 2 Copilot review on PR NousResearch#16706: 1. `tool.generating` events route through `pushTrail`, which was not interruption-guarded — late events could still write 'drafting …' into `turnTrail` after Ctrl-C, leaving a stale shimmer in the UI. Adds the same `if (this.interrupted) return` early-exit. 2. 
Test cleanup moved `vi.runAllTimers()` into `finally` (before `vi.useRealTimers()`) so a mid-test assertion failure can't leak the interrupt-cooldown setTimeout across other test files. 3. Replaced the misleading 'pre-interrupt todos … expected to be cleared by the interrupt cycle' comment with an accurate one reflecting current behaviour (interrupt does NOT clear todos). 4. Added an explicit assertion that a post-interrupt `tool.generating` event does not extend `turnTrail` — regression coverage for #1.
kshitijk4poor
pushed a commit
that referenced
this pull request
May 11, 2026
…3456) * feat(goals): /goal checklist + /subgoal user controls Two-phase judge for /goal — Phase A decomposes the goal into a detailed checklist on first turn; Phase B evaluates each pending item harshly against the agent's most recent response. The goal completes only when every item is in a terminal status (completed or impossible). Adds /subgoal so the user can append, complete, mark impossible, undo, remove, or clear items the judge missed or got wrong. Mechanics: - GoalState gains `checklist` and `decomposed` fields, both backwards compatible (old state_meta rows load unchanged). - Phase A: aux call writes a harsh, exhaustive checklist; biased toward more items not fewer. Falls through to legacy freeform judge when decompose fails. - Phase B: judge gets the checklist + last-response snippet + path to a per-session conversation dump at <HERMES_HOME>/goals/<sid>.json. A bounded read_file tool (max 5 calls per turn, restricted to that one file) lets the judge inspect history when the snippet is ambiguous. Stickiness in code: terminal items are frozen, only the user can revert via /subgoal undo. - Continuation prompt shows checklist progress when non-empty; reverts to old prompt when empty. - Status line shows M/N done counts. CLI + gateway + TUI gateway all pass the agent reference into evaluate_after_turn so the dump can be written. Gateway-side /subgoal is allowed mid-run since it only modifies the checklist the judge consults at turn boundaries. Tests: 24 new cases — backcompat round-trip, Phase A decompose, Phase B updates + new_items + stickiness, user override flows, conversation dump (incl. unsafe-sid sanitization), judge read_file restriction. Existing freeform-mode tests updated to patch the renamed `judge_goal_freeform` and skip Phase A explicitly. * fix(goals): off-by-one in judge index, message-list plumbing, prompt tuning Three live-test findings from running /goal end-to-end against gemini-3-flash-preview as the judge: 1. 
Off-by-one bug — the judge sees the checklist rendered with 1-based indices ('1. [ ] foo, 2. [ ] bar') but the apply layer indexed state.checklist as 0-based. Result: every judge update landed on the wrong item, evidence got attached to neighbouring rows, and the genuine 'first pending' item (usually #1) never got marked. Fix: convert 1 → 0 in _parse_evaluate_response. Also tightened the user prompt to call out the 1-based scheme explicitly. New tests cover the parser conversion + an end-to-end fake-judge round-trip. 2. Conversation dump never happened — _extract_agent_messages tried common AIAgent attribute names (.messages, .conversation_history, etc.) but AIAgent doesn't expose the message list as an instance attribute; it lives inside run_conversation()'s scope. Result: the judge's read_file tool always saw history_path=unavailable. Fix: added an explicit messages= kwarg to evaluate_after_turn that all three call sites (CLI, gateway, TUI gateway) now pass directly. Agent-attribute extraction kept as back-compat fallback. 3. Prompt was too harsh on simple goals. The original 'be HARSH, default to leaving items pending' wording made the judge refuse to mark 'file exists' completed even after the agent ran ls, test -f, os.path.isfile, and find — burning the entire 8-turn budget on a fizzbuzz task. Softened to 'strict but not absurd' with explicit guidance on what counts as evidence and a directive not to require re-proving items already established earlier. Re-tested live with the same fizzbuzz goal: now terminates in 2 turns with all 8 checklist items correctly attributed to their own evidence. /subgoal user-action flow (add / complete / undo / impossible) verified live as well.
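The index conversion from finding 1, combined with the stickiness rule for terminal items, might be sketched like this. The update and checklist shapes are assumptions made for illustration, not the real GoalState schema.

```python
TERMINAL = {"completed", "impossible"}


def apply_judge_updates(checklist: list[dict], updates: list[dict]) -> list[dict]:
    """Apply judge updates that use 1-based indices to a 0-based checklist."""
    for update in updates:
        idx = int(update["index"]) - 1        # judge sees "1. [ ] foo", list is 0-based
        if not 0 <= idx < len(checklist):
            continue                          # ignore out-of-range indices
        item = checklist[idx]
        if item["status"] in TERMINAL:
            continue                          # terminal items are frozen
        item["status"] = update["status"]
        item["evidence"] = update.get("evidence", "")
    return checklist
```

Without the `- 1`, an update for item 1 would land on the second row, which matches the symptom described above where evidence was attached to neighbouring items.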
kshitijk4poor
added a commit
that referenced
this pull request
May 13, 2026
…registries
Both web_search_registry._resolve() and image_gen_registry.get_active_provider()
walked their registered providers and returned the first one matching the
capability flag — without checking whether that provider was actually
usable. On a fresh install with no credentials at all, this meant
get_active_search_provider() returned `brave-free` (legacy preference
order) even though BRAVE_SEARCH_API_KEY was unset, leading the
dispatcher to surface a "BRAVE_SEARCH_API_KEY is not set" error for a
provider the user never chose. Same bug shape in image_gen for FAL.
Resolution semantics now match tools.web_tools._get_backend():
1. Explicit config name wins, ignoring is_available() — the dispatcher
surfaces a precise "X_API_KEY is not set" error rather than silently
switching backends. Matches user expectation: "I configured X, tell
me what's wrong with X."
2. Fallback (no explicit config) walks the legacy preference order
filtered by is_available() — pick the highest-priority backend the
user actually has credentials for.
is_available() is wrapped in a try/except so a buggy provider doesn't
brick resolution.
E2E verified:
- No creds + no config: get_active_search_provider() -> None
- Explicit brave-free + no key: get_active_search_provider() -> brave-free
(and .is_available() correctly reports False)
This fix was identified during the spike (NousResearch#25182 finding #1) and is
folded into the same PR rather than deferred to a follow-up.
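The two-step resolution described above can be sketched generically. The `Provider` class and `resolve_provider` signature are hypothetical; only `is_available()` and the explicit-config-wins rule come from the commit message.

```python
from typing import Callable, Optional


class Provider:
    """Hypothetical provider record with an availability probe."""

    def __init__(self, name: str, probe: Callable[[], bool]) -> None:
        self.name = name
        self._probe = probe

    def is_available(self) -> bool:
        return self._probe()


def resolve_provider(
    providers: list[Provider],
    preference_order: list[str],
    configured: Optional[str] = None,
) -> Optional[Provider]:
    by_name = {p.name: p for p in providers}
    # 1. Explicit config wins even if unavailable, so the caller can report
    #    a precise error about the provider the user actually chose.
    if configured:
        return by_name.get(configured)
    # 2. Otherwise pick the highest-priority provider that is usable.
    for name in preference_order:
        provider = by_name.get(name)
        if provider is None:
            continue
        try:
            if provider.is_available():
                return provider
        except Exception:
            continue  # a buggy probe must not break resolution
    return None
```

With no credentials and no explicit config this returns `None`, matching the first E2E case; with an explicit name it returns that provider regardless of availability, matching the second.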
Summary
required_commands/missing_required_commands)
Why split
This branch is intentionally stacked on top of feat/skill-prerequisites so core NousResearch#688 secure setup can be reviewed separately from runtime/readiness expansion.
Test plan
python -m pytest tests/tools/test_skills_tool.py tests/agent/test_skill_commands.py tests/test_cli_secret_capture.py tests/agent/test_prompt_builder.py tests/tools/test_terminal_tool.py -q
python -m pytest tests/hermes_cli/test_config.py tests/test_run_agent.py tests/tools/test_registry.py -q
python -m pytest tests/ -q
All commands pass on this branch.