feat(tools/wot_engine): add Web-of-Thought multi-agent reasoning by Abd0r · Pull Request #20158 · NousResearch/hermes-agent

Abd0r · 2026-05-05T10:55:33Z

Summary

Adds a self-contained multi-agent reasoning engine that coordinates 3-7 LLM agents through a shared message bus. Generic agents (no role taxonomy hardcoded) talk to each other across four communication modes — parallel, streaming, sequential, queue — over any OpenAI-compatible backend. Exposed as a single Hermes tool (wot_chat) under a new wot toolset, plus a methodology skill at skills/coordination/web-of-thought/.

This is a separate concern from PRs #19607 / #19796 (free-tier search backends). They touch different surfaces and are independently reviewable.

Files

tools/wot_engine.py — engine + tool registration (1,034 lines)
tests/tools/test_wot_engine.py — 36 unit tests, all green
skills/coordination/web-of-thought/SKILL.md — methodology guidance for callers (when to invoke, how to design agents, mode selection, cost discipline)

Engine design

No role taxonomy. Agents are differentiated only by caller-supplied name + system_prompt. Engine is content-agnostic; the model decides agent personalities dynamically.
Four communication modes:
- parallel — all agents react to the task simultaneously, see peers' completed messages on round boundaries
- streaming — same as parallel but agents see partial CoT tokens as they're generated
- sequential — round-robin; each agent gets full prior transcript
- queue — tag-driven pull (agents declare interests, only act when relevant tag appears)
Backend probe at startup. Detects llama.cpp / Ollama / vLLM / OpenAI-compat. Uses id_slot pinning + cache_prompt: true on llama.cpp for KV-cache reuse across agents. Strips trailing /v1 from base_url so callers can pass either form.
Reasoning content extraction. Reads delta.reasoning_content separately from delta.content for thinking-mode models (DeepSeek-R1, QwQ, Qwen3.5/3.6 with thinking on). Stored in Message.reasoning so peer messages can choose to propagate raw / strip / summarize CoT (the propagate_reasoning knob; summary is currently stubbed to strip — flagged in the docstring).
Cost rails. Per-channel token_budget, per-agent turn_timeout via asyncio.wait_for, monotonic seq per agent on Message envelope.
AgentSpec auto-sanitization — real LLM callers emit names like "Critical Thinker" or "Agent A". Whitespace becomes _, disallowed characters dropped. Raises only if the result is empty.
Caller-supplied model field is stripped from inner agent specs at wot_chat_tool boundary — outer Hermes tends to hallucinate names like gpt-4o. Engine uses LLM_DEFAULT_MODEL (env-driven) for all inner agents.
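The name sanitization and `/v1` handling described above can be sketched as two small helpers. This is an illustration under stated assumptions, not the code in tools/wot_engine.py: the function names and the exact allowed character set are hypothetical, since the description only says that whitespace becomes `_`, disallowed characters are dropped, and an empty result raises.

```python
import re

# Assumption: letters, digits, "_" and "-" are allowed. The PR does not
# state the real character set.
_DISALLOWED = re.compile(r"[^A-Za-z0-9_-]")


def sanitize_agent_name(raw: str) -> str:
    """Turn a model-emitted name such as "Critical Thinker" into a safe identifier."""
    name = re.sub(r"\s+", "_", raw.strip())  # each whitespace run becomes "_"
    name = _DISALLOWED.sub("", name)         # drop every other disallowed character
    if not name:
        raise ValueError(f"agent name {raw!r} is empty after sanitization")
    return name


def normalize_base_url(base_url: str) -> str:
    """Strip a trailing /v1 so the client can append its own path without doubling."""
    url = base_url.rstrip("/")
    if url.endswith("/v1"):
        url = url[: -len("/v1")]
    return url
```

With these helpers, `http://host:8088` and `http://host:8088/v1` normalize to the same base, which is the property the `/v1` fix relies on.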

Hermes integration

Tool registration is at module top-level (not wrapped in try/except) so tools/registry.py:_module_registers_tools AST scanner picks it up.
wot toolset is auto-created at module load time via toolsets.create_custom_toolset(...) so -t wot validates without modifying toolsets.py.
Skill loads via --skills coordination/web-of-thought and prefixes the system prompt with methodology guidance.

How this fits next to existing Hermes multi-agent surfaces

Hermes already ships several multi-agent / delegation primitives. WoT is additive, not redundant — it fills a specific gap none of them serve.

| Capability | `delegate_task` | `mixture_of_agents` | Kanban | WoT (`wot_chat`) |
| --- | --- | --- | --- | --- |
| Parent sees children's intermediate outputs | summary only | aggregator-only | polled comments | full transcript every turn |
| Children talk to each other | no (per #344) | no cross-reference | polling comments | direct `@name`-mentions |
| Children see each other's CoT | no | no | no | streaming mode pipes partial CoT |
| Multi-round refinement | one-shot per child | one-shot per reference model | heavyweight (board cycle) | native, default 5 rounds |
| Process model | subprocess per child | parallel HTTP calls | cross-process, durable | in-process asyncio |
| Latency floor | process spawn time | API round-trip | DB persist + claim | single API round-trip per agent per round |
| State persistence | none (ephemeral) | none (ephemeral) | SQLite-backed | none (live in-memory) |
| Best for | durable cross-process delegation with isolation | Best-of-N synthesis via aggregator | long-running multi-profile workflows | live multi-perspective reasoning within one task |

What WoT specifically adds: in-process live multi-agent reasoning where inner agents can address each other directly and the outer agent sees the full transcript as it forms. That's the niche the existing surfaces don't fill — delegate_task deliberately hides intermediate output, MoA's reference models don't see each other, and Kanban's polling comments aren't real-time. Several long-open feature requests (#412 consensus/voting, #376 adversarial debate, #479 best-of-N + judge, #5876 multi-agent council) all reduce to this missing primitive.

WoT does not replace any of the above. Compose: outer Hermes can call delegate_task for durable cross-process work, dispatch wot_chat for live debate within its own turn, and use Kanban for cross-session orchestration. They're complementary.

Validated end-to-end

Setup: Ubuntu 24.04 + RTX 4050. Isolated Hermes install (separate HERMES_HOME, no overlap with any production setup).

1. Local llama.cpp + Qwen3-4B-Instruct-Q4_K_M (--parallel 4 --jinja --ctx-size 65536):

5/5 sessions completed, 48 WoT messages across runs, 0 inner errors
Range of behaviors: deep multi-round debate (18 msgs over 6 rounds), smart short-circuit on triviality (3 msgs in 1 round when all agents emit DONE), self-healing on bad arg shapes (Hermes retried with corrected payload)

2. OpenRouter + DeepSeek-V4-Flash (with skill loaded):

5/5 sessions completed, 23 WoT messages, 1 inner error (token truncation mid-thinking on round 3 of one run; engine surfaced it cleanly via errors[])
Skill methodology measurably moved model behavior toward leaner invocations: avg agent name length ~10 chars (vs ~22 unloaded), max_rounds: 3 explicitly set on 5/5, token_budget on 3/5, ~43% latency drop vs no-skill baseline

Coverage — honest framing

Integration-validated end-to-end (with V4 Flash via OpenRouter as inner agents, full session JSONL captured):

parallel mode — 5/5 sessions clean, 48 WoT messages, multi-round @-mention emergence
streaming mode — 22 streaming chunks + 3 final messages produced, stop_reason=all_done clean
sequential mode — agent ordering preserved across 2 rounds with cross-round @-mentions (round-2 alpha addresses round-1 beta)
queue mode — interests tags drove tag-prefixed output ([design] → [code] → [review]), 3 rounds completed
Per-agent turn_timeout — standalone test, 2/2 agents timed out at 2.0s as configured, errors surfaced via errors[]
Backend probe (llama.cpp + OpenRouter), slot pinning on llama.cpp
Reasoning content extraction (R1 + V4 Flash thinking traces visible in transcripts)
AgentSpec auto-sanitization (model-emitted role-y names sanitized cleanly)
Caller-supplied model stripping (saved a run when V4 Flash hallucinated gpt-4o)
/v1 suffix doubling fix (caught the OpenRouter 404)
Hermes tool auto-discovery + custom toolset registration
Skill load + methodology effect on model behavior (43% latency drop, lean invocations)
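The per-agent `turn_timeout` behaviour listed above (a stalled agent times out and the failure lands in `errors[]` rather than aborting the session) can be sketched with `asyncio.wait_for`. The `run_turn` helper and the shape of the error entries are assumptions made for illustration; only `asyncio.wait_for`, `turn_timeout`, and `errors[]` come from the description.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional


async def run_turn(
    name: str,
    turn: Callable[[], Awaitable[str]],
    turn_timeout: float,
    errors: List[Dict[str, str]],
) -> Optional[str]:
    """Run one agent turn; on timeout, record an error and return None."""
    try:
        return await asyncio.wait_for(turn(), timeout=turn_timeout)
    except asyncio.TimeoutError:
        # Hypothetical error shape: the real envelope is not shown in the PR.
        errors.append({"agent": name, "error": f"turn timed out after {turn_timeout}s"})
        return None


async def demo() -> tuple:
    errors: List[Dict[str, str]] = []

    async def slow() -> str:
        await asyncio.sleep(1.0)  # stands in for a stalled LLM call
        return "never delivered"

    async def fast() -> str:
        return "done"

    # Both agents run concurrently; one timeout does not cancel the other.
    results = await asyncio.gather(
        run_turn("alpha", slow, 0.05, errors),
        run_turn("beta", fast, 0.05, errors),
    )
    return results, errors
```

Running `asyncio.run(demo())` yields one `None` result for the slow agent, one normal result for the fast agent, and a single error entry attributed to the slow agent.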

Routing-validated (engine routes correctly; downstream model output quality is upstream's concern):

Ollama native /api/chat path for thinking models: the backend probe identifies kind='ollama', the request hits /api/chat (not /v1/), and the engine parses both the message.content and message.thinking fields. Tested live with deepseek-r1:1.5b and deepseek-r1:7b. Output quality from small R1 distills, together with Ollama's template handling for them, is a well-known upstream problem; the engine correctly returns whatever Ollama emits.

Unit-test only (no integration run on this PR):

vLLM backend branch — code path exists, would need a vLLM-serving instance to validate. Same probe + dispatcher pattern as the validated paths, low risk.

Multi-backend mix — integration-validated (added in second commit b1e8872):

AgentSpec now has optional base_url + api_key fields for per-agent backend override
_LLMClient caches backend probes per-base_url so each unique target is only probed once
Validated live with one session running two agents on different backends simultaneously:
- alpha on DeepSeek-V4-Flash via OpenRouter (https://openrouter.ai/api/v1)
- beta on deepseek-r1:1.5b via local Ollama (http://127.0.0.1:11434)
Probe cache after run showed both: https://openrouter.ai/api → openai-compat and http://127.0.0.1:11434 → ollama
0 engine errors; both responses assembled into the transcript with correct from-attribution
Defensive design: wot_chat_tool boundary strips model + base_url + api_key from outer-Hermes-supplied args (Hermes hallucinates them); direct Python callers using AgentSpec(base_url=..., api_key=...) still work
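The two mechanisms in this list, per-`base_url` probe caching and stripping caller-supplied control fields at the tool boundary, can be sketched as follows. The names `strip_control_fields` and `ProbeCache` are hypothetical, and the probe is injected as a plain function so the sketch stays self-contained; the actual `_LLMClient` probes over HTTP and is not reproduced here.

```python
from typing import Any, Callable, Dict, List

# Fields the outer model tends to hallucinate, per the PR description.
CALLER_CONTROL_FIELDS = ("model", "base_url", "api_key")


def strip_control_fields(agent_args: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the agent dicts without caller-supplied backend fields."""
    return [
        {k: v for k, v in spec.items() if k not in CALLER_CONTROL_FIELDS}
        for spec in agent_args
    ]


class ProbeCache:
    """Probe each distinct base_url once and reuse the answer afterwards."""

    def __init__(self, probe: Callable[[str], str]) -> None:
        self._probe = probe  # hypothetical probe: base_url -> backend kind
        self._cache: Dict[str, str] = {}

    def kind_for(self, base_url: str) -> str:
        if base_url not in self._cache:
            self._cache[base_url] = self._probe(base_url)
        return self._cache[base_url]
```

Stripping only at the tool boundary keeps the override available to direct Python callers, which matches the split described above.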

Stubbed:

propagate_reasoning="summary" — currently behaves identically to "strip". A real summary mode would distill peer CoT through a small model; deferred to a follow-up. Docstring is honest about this.
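A minimal sketch of how the `propagate_reasoning` knob could shape what a peer sees, with `"summary"` falling back to `"strip"` as described. The `Message` fields other than `reasoning`, the `render_for_peer` helper, and the output format are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: str          # hypothetical field name for the originating agent
    content: str         # the agent's visible answer
    reasoning: str = ""  # chain-of-thought captured from reasoning_content


def render_for_peer(msg: Message, propagate_reasoning: str = "strip") -> str:
    """Format a message as a peer agent would see it under the given knob."""
    if propagate_reasoning not in ("raw", "strip", "summary"):
        raise ValueError(f"unknown propagate_reasoning: {propagate_reasoning!r}")
    visible = f"{msg.sender}: {msg.content}"
    if propagate_reasoning == "raw" and msg.reasoning:
        # Raw mode forwards the full chain-of-thought ahead of the answer.
        return f"[{msg.sender} reasoning] {msg.reasoning}\n{visible}"
    # "strip" drops the reasoning; "summary" is stubbed and behaves the same.
    return visible
```

A real summary mode would replace the stubbed branch with a call that distills `msg.reasoning` through a small model, which the PR defers to a follow-up.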

Linked issues

Closes (auto-close on merge):

Closes #412 (Feature: Consensus & Voting Engine for Multi-Agent Decision Making, inspired by AgentWorkforce/relay). WoT's parallel mode plus a synthesizer agent delivers the AgentWorkforce-style consensus pattern.
Closes #376 (Feature: Adversarial Debate Mode for Delegation, Two-Agent Iterative Refinement, inspired by CAMEL-AI). WoT's sequential mode with two agents is iterative-refinement debate.
Closes #5876 (Multi-Agent Council Skill, 4-Persona Reasoning with Nested Multi-Agent). The coordination/web-of-thought skill is the council methodology; the engine is the substrate.
Closes #479 (Feature: Best-of-N Competitive Evaluation, Judge-Based Selection from Parallel Agent Outputs, inspired by Blackbox AI Chairman). parallel mode with a synthesizer/judge agent is the Best-of-N pattern.

Refs (does not auto-close — partial coverage):

#377 (Feature: Shared Memory Pools Between Sub-Agents in Workflows, inspired by CAMEL-AI). The Channel primitive is the shared memory pool; CAMEL-AI-specific patterns are out of scope here.
#488 (Feature: Zeroshot Skill, Multi-Agent Blind Validation Orchestration via CLI). Closely related pattern; a thin wrapper skill on top of WoT could deliver this.
#11922 (Multi-agent communication & per-channel persona like openclaw). Engine provides the comm primitive (Channel) and per-agent persona (system_prompt).

Test plan

pytest -p no:xdist tests/tools/test_wot_engine.py — 36/36 passing on this branch
Engine integration tested against llama.cpp + Qwen3-4B-Instruct (5/5 sessions, 0 errors)
Engine integration tested against OpenRouter + DeepSeek-V4-Flash (5/5 sessions, 1 truncation surfaced honestly)
Skill load + invocation-pattern A/B tested (skill measurably moves model behavior)
CI green (will fix anything pytest tests/ flags)

Backwards compatibility

Pure-add. New tool, new toolset, new skill, new test file. Zero changes to existing code paths.

License

MIT (auto per CONTRIBUTING.md).

Adds a self-contained multi-agent reasoning engine that coordinates 3-7 LLM agents through a shared message bus. Generic agents (no role taxonomy) talk to each other across four communication modes (parallel, streaming, sequential, queue) over any OpenAI-compatible backend.

The engine is exposed as a single Hermes tool, `wot_chat`, registered under a new `wot` toolset. Caller passes agent specs (name + system_prompt) and a task; the engine orchestrates the conversation and returns a structured transcript for the outer agent to synthesize.

Files:
- tools/wot_engine.py: engine + tool registration (~1040 lines)
- tests/tools/test_wot_engine.py: 36 unit tests, all green
- skills/coordination/web-of-thought/SKILL.md: methodology guidance (when to invoke, how to design agents, mode selection, cost discipline)

Engine specifics:
- Backend probe at startup: detects llama.cpp / Ollama / vLLM / OpenAI-compat. Uses id_slot pinning + cache_prompt: true on llama.cpp for KV-cache reuse across agents.
- Reasoning content extraction: handles delta.reasoning_content from thinking-mode models (DeepSeek-R1, QwQ, etc.) separately from content, so peer messages can choose to propagate raw / strip / summarize CoT.
- Per-agent timeout via asyncio.wait_for, per-channel token budget, monotonic seq counter on Message envelope for stream debugging.
- AgentSpec auto-sanitizes whitespace in names (real LLMs emit "Critical Thinker" / "Agent A"); raises only when sanitized name is empty.
- /v1 suffix is stripped from base_url at client init so callers can pass either form (http://host:8088 OR http://host:8088/v1) without doubling.
- Hermes tool registration is at module top-level (not wrapped in try/except) so tools/registry.py:_module_registers_tools picks it up via AST scan.

Skill methodology:
- When to invoke wot_chat (multi-perspective questions, tradeoffs, decisions with real downside) and when NOT to (lookups, single-fact, simple chat).
- Agent design rule: minimal differentiating system_prompt, no scripted personas, no role-cargo names. Engine remains role-agnostic.
- Mode selection: parallel (default) / streaming / sequential / queue.
- Cost discipline: max_rounds: 2-3 for most cases, set token_budget for hard caps.
- Reading the result: errors first, then agents_done, then transcript.

Validated end-to-end on Ubuntu 24.04 + RTX 4050 with two model configurations:
1. Local llama.cpp + Qwen3-4B-Instruct-Q4_K_M (--parallel 4 --jinja): 5/5 sessions completed, 48 WoT messages across runs, 0 inner errors.
2. OpenRouter + DeepSeek-V4-Flash (with skill loaded): 5/5 sessions, skill methodology measurably moved model behavior toward leaner invocations (avg agent name ~10 chars vs ~22 unloaded; max_rounds explicitly set 5/5; 43% latency drop).

License: MIT (auto per CONTRIBUTING.md).

Adds per-agent base_url + api_key fields to AgentSpec, enabling a single WoT session to mix backends (e.g. one agent on local Ollama, another on OpenRouter). _LLMClient caches backend probes per-base_url so each unique target is only probed once across the run.

Engine changes:
- AgentSpec: new optional fields base_url + api_key
- _LLMClient: _probe_cache: Dict[str, BackendInfo], ensure_probed() now takes optional base_url_override and caches per-target
- _resolve_target() helper composes the right URL + auth headers per call
- _openai_payload_for(backend, ...) takes backend explicitly (so id_slot + cache_prompt only land when this request actually targets llama-server)
- complete() and stream() take base_url_override + api_key_override kwargs
- _stream_openai and _stream_ollama_native take per-call base + headers
- Agent.turn_batch + turn_streaming pass spec.base_url + spec.api_key
- wot_chat_tool boundary strips model + base_url + api_key from outer-Hermes args (defensive: outer model hallucinates these); direct Python callers using AgentSpec(base_url=..., api_key=...) still work

Tests:
- 39/39 unit tests passing (up from 36)
- New: MultiBackendMixTests verifies per-agent base_url threads to client
- New: WotChatToolStripsCallerControlFields verifies tool boundary strips caller-supplied model/base_url/api_key

Validated end-to-end:
- One WoT session with 2 agents on different backends:
  - alpha on DeepSeek-V4-Flash via OpenRouter
  - beta on deepseek-r1:1.5b via local Ollama
- Probe cache shows both targets: https://openrouter.ai/api → openai-compat, http://127.0.0.1:11434 → ollama
- 0 engine errors, both transcripts assembled with correct from-attribution

Abd0r changed the title ~~feat(tools/wot_engine): Web-of-Thought multi-agent reasoning~~ feat(tools/wot_engine): add Web-of-Thought multi-agent reasoning May 5, 2026

alt-glitch added the labels type/feature (New feature or request), comp/tools (Tool registry, model_tools, toolsets), and P3 (Low: cosmetic, nice to have) May 5, 2026

Abd0r closed this May 6, 2026

Abd0r reopened this May 6, 2026

Abd0r mentioned this pull request May 7, 2026

Feature Request: Native Multi-Agent Support #7517

Open


Abd0r wants to merge 2 commits into NousResearch:main from Abd0r:feat/wot-engine


2 participants
