Skip to content

feat(delegate): subagent architecture v2#3387

Closed
Mibayy wants to merge 14 commits into
NousResearch:mainfrom
Mibayy:feat/subagent-v2
Closed

feat(delegate): subagent architecture v2#3387
Mibayy wants to merge 14 commits into
NousResearch:mainfrom
Mibayy:feat/subagent-v2

Conversation

@Mibayy

@Mibayy Mibayy commented Mar 27, 2026

Copy link
Copy Markdown
Contributor

feat(delegate): subagent architecture v2

Extends delegate_task with a set of opt-in features for multi-agent coordination, quality control, and resilience. All existing calls work unchanged.

What this PR adds

Coordination

  • skills field on each task: inject skill content directly into child system prompt
  • blackboard.enabled: shared thread-safe key-value store visible to all siblings in a batch
  • dag.enabled + depends_on field: topological sort with predecessor summary injection

Quality

  • verify: true on a task (or verify.enabled globally): run a critic subagent after the generator, attaches verdict (valid/invalid/unknown) and critic_summary to the result
  • retry.max_retries: retry failed tasks with the previous error injected as context

Observability and resilience

Config

Manager-first strategy (new)

  • DELEGATION_GUIDANCE added to prompt_builder.py: injected into the system prompt whenever delegate_task is available
  • Instructs Hermes to act as a manager agent — plan, coordinate, synthesize — and delegate implementation tasks (3+ distinct steps, multi-file changes, 5+ tool calls) rather than executing them directly
  • Does not affect quick lookups, iterative debugging, or direct replies

MCP workspace guard (new)

  • Subagents share the same MCP process, so a set_project_root / switch_project / reindex call inside a subagent would corrupt the shared codebase-index state for all siblings and the parent
  • Thread-local guard (_tl.subagent_guard) activated in _run_single_child before child.run_conversation(), always restored in finally
  • _make_tool_handler checks the flag and returns a clear error for the three blocked tool names — parent and sibling threads are unaffected
  • Read-only codebase-index tools (find_symbol, search_codebase, get_structure_summary, etc.) work normally inside subagents

Relationship to other PRs

Dependency matrix

Feature Depends on Default
memory_access read/read-write #3093 (graceful no-op) off
Skill injection #3294 or any skill in skills/ off
Blackboard none off
DAG task deps none off
Generator-critic none off
Intelligent retry none off
Semantic dedup cache #3093 (graceful no-op) off
Detailed trace none off
Checkpointing none off
max_depth configurable none 2
Manager-first guidance none (prompt injection) on when delegate_task loaded
MCP workspace guard none always active in subagent threads

New files

tools/delegate_dag.py         -- topological_sort + resolve_deps
tools/delegate_checkpoint.py  -- SQLite-backed CheckpointStore
tests/tools/test_delegate_config.py
tests/tools/test_delegate_depth.py
tests/tools/test_delegate_memory.py
tests/tools/test_delegate_skills.py
tests/tools/test_delegate_blackboard.py
tests/tools/test_delegate_callback.py
tests/tools/test_delegate_dag.py
tests/tools/test_delegate_verify.py
tests/tools/test_delegate_retry.py
tests/tools/test_delegate_cache.py
tests/tools/test_delegate_observability.py
tests/tools/test_delegate_checkpoint.py
tests/tools/test_mcp_subagent_guard.py   -- new
website/docs/user-guide/features/delegate-task.md

Tests

70 new tests, all passing.

Example: subagent with codebase-index skill (pairs with #3294)

{
  "tasks": [
    {
      "goal": "Find all callers of send_message and check for missing error handling",
      "skills": ["codebase-index"],
      "toolsets": ["terminal", "file"]
    }
  ]
}

The subagent will call get_dependents("send_message") and get_function_source instead of grep/cat -- the token savings from #3294 apply inside delegated tasks too.

Example: full DAG batch

{
  "tasks": [
    {
      "id": "design",
      "goal": "Design the notification system schema",
      "skills": ["codebase-index", "nextjs-supabase-notifications"],
      "verify": true
    },
    {
      "id": "impl",
      "goal": "Implement the notification system",
      "depends_on": ["design"],
      "skills": ["codebase-index", "test-driven-development"],
      "verify": true
    }
  ]
}

Config:

delegation:
  dag:
    enabled: true
  retry:
    max_retries: 1
  blackboard:
    enabled: true

Mibayy added 10 commits March 27, 2026 11:17
Adds three tools for querying a self-hosted OpenViking server:
- viking_search: semantic search over memories, resources, and skills
  (auto/fast/deep modes)
- viking_read: read content at a viking:// URI with configurable detail
  levels (abstract / overview / read)
- viking_browse: explore the OpenViking filesystem layout (tree/list/stat)

All tools are gated behind check_fn that verifies the server is reachable,
so they are silently absent when OpenViking is not running — same pattern as
Honcho. The 'openviking' named toolset is also added to TOOLSETS and to
the shared tools list so gateway and CLI pick them up automatically.

Configure via:
  OPENVIKING_ENDPOINT  (default: http://127.0.0.1:1933)
  OPENVIKING_API_KEY   (optional, for secured deployments)

Closes NousResearch#3368
/model was fully implemented in hermes_cli/model_switch.py and
hermes_cli/main.py but was never registered in COMMAND_REGISTRY,
causing it to be absent from /help output and Telegram bot commands.

Fixes NousResearch#3371
…ic cache, detailed trace, opt-in checkpointing

@Mibayy Mibayy left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — feat(delegate): subagent architecture v2

Reviewed by Hermes Agent. Overall the PR is solid — correct behavior, good test coverage, clean module separation. Found 2 bugs, 4 warnings, and 6 suggestions.

Summary

New modulesdelegate_dag.py, delegate_checkpoint.py, delegate_blackboard.py, openviking_tool.py — are all well-scoped and independently testable. The wiring in delegate_tool.py is the complex part; see inline comments.

Verdict: Changes Requested — two issues must be fixed before merge (race condition on DAG results read, and double config load on depth check). The rest are improvements.

@Mibayy Mibayy left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-review — feat(delegate): subagent architecture v2

Reviewed commit: 1cde0fca. 127 tests passing. Found 2 bugs to fix before merge, plus warnings and suggestions.


🔴 Must fix

1. Race condition on results list in DAG mode (delegate_tool.py ~line 893)

_run_task reads the shared results list inside a ThreadPoolExecutor to build completed_so_far for DAG dependency injection — no lock. In batch mode, concurrent tasks can read a stale snapshot and miss predecessor summaries.

Fix: protect with a threading.Lock or use a separate thread-safe dict:

_results_lock = threading.Lock()
_completed: dict[str, dict] = {}

def _run_task(i, t):
    if dag_enabled:
        with _results_lock:
            completed_so_far = dict(_completed)
        resolved_task = resolve_deps(t, completed_so_far)
    ...
    with _results_lock:
        _completed[str(i)] = result
    return result

2. _get_max_depth() called twice per depth check (delegate_tool.py ~line 774-777)

Each call loads the config file from disk. Cache in a local:

max_depth = _get_max_depth()
if depth >= max_depth:
    return json.dumps({"error": f"Delegation depth limit reached ({max_depth})..."})

⚠️ Warnings

3. CheckpointStore._connect() — new connection per call
check_same_thread=False is correct but creates connection churn under parallel checkpoint saves. A single threading.Lock + reused connection, or WAL mode + connection pool, would be cleaner.

4. queue.pop(0) in topological_sort is O(n)
collections.deque + popleft() is idiomatic O(1). Fine for 3-task batches but worth fixing.

5. Critic agent gets toolsets=['terminal']
A critic that only reads a summary string has no need for terminal access. toolsets=[] is more minimal and avoids unintended write capabilities in the critic subagent.

6. openviking_tool.py — no error handling on r.json()
A 502 from a proxy returns non-JSON and raises an unhandled json.JSONDecodeError. Wrap in try/except:

try:
    data = r.json()
except Exception:
    return json.dumps({"error": f"Non-JSON response from server: {r.status_code}"})

💡 Suggestions

7. _load_skill_content — add debug log on miss
logger.debug(f"Skill '{name}' not found") helps users diagnose typos in skill names.

8. _format_structured_context — key capitalization inconsistent
Known keys get title-case headers; unknown keys get key.capitalize() which leaves underscores: my_custom_keyMy_custom_key. Either strip underscores or normalize all headers the same way.

9. Critic summary truncation — magic number
summary[:4000] should be a named constant (_CRITIC_MAX_SUMMARY_CHARS = 4000).

10. openviking_tool.py — consider splitting to its own PR
It's a separate feature (OpenViking memory integration) bundled here. The current PR is already 3270 lines. Easier to review and roll back independently.

11. docs/plans/ directory — remove before merge
The 4 subagent-v2-phase*.md files are internal planning artifacts. Add to .gitignore or drop from the branch.


✅ Looks good

  • Module separation clean: dag / checkpoint / blackboard are independently testable with zero coupling to each other.
  • All new features opt-in — backward compatibility preserved for all existing delegate_task callers.
  • _SM_AVAILABLE graceful no-op pattern applied consistently across cache, hot facts, and memory mode.
  • _compute_child_toolsets / _strip_blocked_tools compat wrapper is correct.
  • Hot facts injection reads parent's already-loaded _memory_store — no extra disk I/O.
  • Test coverage thorough: every new helper has unit tests before integration-level wiring.

Mibayy added 2 commits March 27, 2026 13:59
- delegate_dag: deque.popleft() replaces queue.pop(0) -- O(1) vs O(n)
- delegate_checkpoint: single persistent conn + threading.Lock, WAL mode
- delegate_tool: _CRITIC_MAX_SUMMARY_CHARS constant, critic toolsets=[],
  debug log on skill miss, key.replace('_',' ').title() for unknown keys,
  blackboard comment
- openviking_tool: try/except on r.json() in search and read paths
- Remove docs/plans/subagent-v2-*.md planning artifacts, gitignore them
@Mibayy

Mibayy commented Mar 27, 2026

Copy link
Copy Markdown
Contributor Author

Follow-up — all review items resolved (commit 727e379f)

All 12 points from the self-review have been addressed. Summary:

🔴 Bugs fixed

  • ① Race condition on DAG results — replaced direct results list read with a thread-safe _completed_by_id dict protected by threading.Lock
  • ② Double _get_max_depth() call — cached in local variable, single disk read per depth check

⚠️ Warnings fixed

  • ③ CheckpointStore connection churn — single persistent connection per instance + threading.Lock + PRAGMA journal_mode=WAL
  • queue.pop(0) O(n) — replaced with collections.deque + popleft() O(1)
  • ⑤ Critic agent toolsets=['terminal'] — changed to toolsets=[], critic only reads a summary string
  • openviking_tool.py unhandled r.json() — wrapped in try/except in both viking_search and viking_read, returns structured error dict on non-JSON response

💡 Suggestions applied

  • ⑦ Skill miss debug loglogger.debug("Skill '%s' not found in search paths: ...") added to _load_skill_content
  • ⑧ Key capitalizationkey.capitalize()key.replace('_', ' ').title() for consistent header formatting
  • ⑨ Critic summary truncation4000 → named constant _CRITIC_MAX_SUMMARY_CHARS = 4000
  • ⑩ openviking in separate PR — N/A, fixes already applied, no reason to split
  • ⑪ Blackboard comment — added clarifying comment on blackboard=bb shared instance
  • docs/plans/ artifacts — removed from repo, added docs/plans/subagent-v2-*.md to .gitignore

All 127 tests passing.

…for subagents

- Add DELEGATION_GUIDANCE to prompt_builder.py: injected into system prompt
  when delegate_task is available. Instructs Hermes to act as a manager agent
  — decompose implementation tasks (3+ steps) and delegate rather than executing
  directly. Criteria: feature impl, multi-file changes, DAG-able deps, 5+ tool
  calls. Exceptions: quick lookups, iterative debugging, direct replies.

- Add MCP workspace guard (mcp_tool.py):
  - Thread-local _tl.subagent_guard flag, off by default
  - _SUBAGENT_BLOCKED_TOOL_NAMES: {set_project_root, switch_project, reindex}
  - _make_tool_handler checks flag before dispatching — blocked tools return
    a clear error without touching the MCP session
  - set_subagent_mcp_guard(active) public API for delegate_tool.py

- Wire guard into delegate_tool.py:
  - Import set_subagent_mcp_guard
  - set_subagent_mcp_guard(True) before child.run_conversation()
  - set_subagent_mcp_guard(False) in finally block (always restored)
  - Thread-local: each subagent thread isolated, parent unaffected

- Tests (test_mcp_subagent_guard.py, 10 tests):
  - Blocked names coverage
  - Read-only tools not blocked
  - Block active when guard on, pass-through when guard off
  - Thread-local isolation (guard in main does not leak to child thread)
@teknium1

Copy link
Copy Markdown
Contributor

Thanks for the thorough work here, Mibayy — there's clearly a lot of thought behind this. We're closing this PR for now, but want to explain the reasoning in detail so you can address these concerns if you'd like to reopen or submit a new PR.

Overall Design Concern

The core question we kept coming back to: what real problem does this solve?

The current delegate_task is a simple fan-out — spawn 1-3 subagents in parallel, each gets an isolated context, results come back as summaries. It works well for what it does. Most of the features in this PR add coordination complexity for workflows that don't arise much in practice:

DAG dependencies defeat the purpose of delegation. If task B depends on task A's output, they run sequentially. The parent agent could just do them in order and keep full context between steps. Delegation's value is parallelism for independent work. Sequential delegation with dependency injection is strictly worse than the parent doing it — you lose context continuity and pay subagent overhead.

Blackboard has a timing problem. Siblings run in parallel. By the time one writes a key, the others may already be done or past the point where that information would help. Non-deterministic timing makes this unreliable for anything important.

Generator-critic doubles cost for something the parent already does. The parent agent reads every subagent's summary and decides what to do next — that IS the verification step. Spawning a separate critic subagent to check the work first is redundant and doubles API cost and wall-clock time.

Retry is redundant with parent-level re-delegation. The parent sees failed results and can re-delegate with adjusted context. Building retry into the delegation layer adds complexity in a place where it's harder to observe and debug.

Checkpointing is infrastructure without a consumer. The PR saves subagent state to SQLite but nothing ever resumes from a checkpoint. It's a storage layer for a feature that doesn't exist yet.

Manager-first delegation guidance (DELEGATION_GUIDANCE) is the most concerning change. It injects a prompt telling Hermes to act as a "manager agent" and prefer delegating over direct execution — on by default, no config to disable. This would fundamentally change agent behavior for all users and frustrate people who want the agent to just do things.

What IS Valuable

  • Skill injection — letting subagents load skills closes a real gap. We'd take this as a standalone PR.
  • MCP workspace guard — legitimate safety fix preventing subagents from corrupting shared codebase index state. Also welcome as a standalone PR.
  • Configurable max_depth — small and useful, easy standalone PR.

Bundled Unrelated Changes

The PR also includes several unrelated pieces that should be separate PRs:

  • tools/openviking_tool.py — entire new tool integration not mentioned in the PR description
  • /model CommandDef registration without a handler
  • .codebase-index-cache.json — 28MB+ file that should not be committed

Test Issues

Nearly all test files hardcode sys.path.insert(0, '/root/hermes-agent') — this won't work in our test suite. Our conftest.py handles path setup, so these inserts should be removed.

Path Forward

You're welcome to address these concerns and open new PRs. Our suggestion:

  1. Skill injection + MCP guard as one focused PR (high chance of merge)
  2. Configurable max_depth as a tiny standalone PR
  3. If you want to pursue the coordination features (DAG, blackboard, verify, retry), we'd want to see concrete usage scenarios where the current simple fan-out falls short — real examples where these features would have made a meaningful difference

Thanks again for the effort here. The code quality on the individual pieces is solid — it's the overall design direction we want to think through more carefully.

@teknium1 teknium1 closed this Mar 28, 2026
Mibayy added a commit to Mibayy/hermes-agent that referenced this pull request Mar 28, 2026
… feedback

Teknium feedback: generator-critic doubles cost for what the parent already does,
retry is redundant with parent-level re-delegation.

Changes:
- Disabled _run_with_verify() calls (3 locations)
- Disabled _run_with_retry() via 'if False' guards (3 locations)
- Removed verify and depends_on from tool description
- Functions remain in codebase for future opt-in use

This keeps the feat/subagent-v2 branch for personal use while aligning
with upstream direction.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants