
feat(tools/wot_engine): add Web-of-Thought multi-agent reasoning #20158

Open
Abd0r wants to merge 2 commits into NousResearch:main from Abd0r:feat/wot-engine

Abd0r (Contributor) commented May 5, 2026

Summary

Adds a self-contained multi-agent reasoning engine that coordinates 3-7 LLM agents through a shared message bus. Generic agents (no role taxonomy hardcoded) talk to each other across four communication modes — parallel, streaming, sequential, queue — over any OpenAI-compatible backend. Exposed as a single Hermes tool (wot_chat) under a new wot toolset, plus a methodology skill at skills/coordination/web-of-thought/.

This is a separate concern from PRs #19607 / #19796 (free-tier search backends). They touch different surfaces and are independently reviewable.

Files

  • tools/wot_engine.py — engine + tool registration (1,034 lines)
  • tests/tools/test_wot_engine.py — 36 unit tests, all green
  • skills/coordination/web-of-thought/SKILL.md — methodology guidance for callers (when to invoke, how to design agents, mode selection, cost discipline)

Engine design

  • No role taxonomy. Agents are differentiated only by caller-supplied name + system_prompt. Engine is content-agnostic; the model decides agent personalities dynamically.
  • Four communication modes:
    • parallel — all agents react to the task simultaneously, see peers' completed messages on round boundaries
    • streaming — same as parallel but agents see partial CoT tokens as they're generated
    • sequential — round-robin; each agent gets full prior transcript
    • queue — tag-driven pull (agents declare interests, only act when relevant tag appears)
  • Backend probe at startup. Detects llama.cpp / Ollama / vLLM / OpenAI-compat. Uses id_slot pinning + cache_prompt: true on llama.cpp for KV-cache reuse across agents. Strips trailing /v1 from base_url so callers can pass either form.
  • Reasoning content extraction. Reads delta.reasoning_content separately from delta.content for thinking-mode models (DeepSeek-R1, QwQ, Qwen3.5/3.6 with thinking on). Stored in Message.reasoning so peer messages can choose to propagate raw / strip / summarize CoT (the propagate_reasoning knob; summary is currently stubbed to strip — flagged in the docstring).
  • Cost rails. Per-channel token_budget, per-agent turn_timeout via asyncio.wait_for, monotonic seq per agent on Message envelope.
  • AgentSpec auto-sanitization — real LLM callers emit names like "Critical Thinker" or "Agent A". Whitespace becomes _, disallowed characters dropped. Raises only if the result is empty.
  • Caller-supplied model field is stripped from inner agent specs at wot_chat_tool boundary — outer Hermes tends to hallucinate names like gpt-4o. Engine uses LLM_DEFAULT_MODEL (env-driven) for all inner agents.
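The AgentSpec name sanitization described above can be sketched as follows. This is a minimal illustration, not the engine's code: the function name `sanitize_agent_name` and the allow-list of letters, digits, underscore, and hyphen are assumptions, since the PR does not state which characters are permitted.

```python
import re

# Hypothetical allow-list (an assumption): letters, digits, underscore, hyphen.
_DISALLOWED = re.compile(r"[^A-Za-z0-9_-]")


def sanitize_agent_name(raw: str) -> str:
    """Turn a model-emitted display name into a safe identifier."""
    # Each run of whitespace becomes a single underscore.
    name = re.sub(r"\s+", "_", raw.strip())
    # Characters outside the allow-list are dropped.
    name = _DISALLOWED.sub("", name)
    # Raise only when nothing usable is left.
    if not name:
        raise ValueError(f"agent name {raw!r} is empty after sanitization")
    return name
```

Under these assumptions, "Critical Thinker" becomes `Critical_Thinker` and "Agent A" becomes `Agent_A`; only a name made up entirely of disallowed characters raises.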

Hermes integration

  • Tool registration is at module top-level (not wrapped in try/except) so tools/registry.py:_module_registers_tools AST scanner picks it up.
  • wot toolset is auto-created at module load time via toolsets.create_custom_toolset(...) so -t wot validates without modifying toolsets.py.
  • Skill loads via --skills coordination/web-of-thought and prefixes the system prompt with methodology guidance.
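To illustrate why top-level registration matters, here is a sketch of a scanner that inspects only a module's top-level statements. The real `_module_registers_tools` implementation is not shown in this PR, so the function below (`registers_at_top_level`) and its matching rule are assumptions meant to show the general shape of such a check.

```python
import ast


def registers_at_top_level(source: str, func_name: str = "register") -> bool:
    """Return True if a top-level statement calls `register(...)` or `<obj>.register(...)`."""
    tree = ast.parse(source)
    # Only module-level statements are examined; there is no descent into
    # try/if/def bodies, so a call wrapped in try/except is not seen.
    for node in tree.body:
        if not isinstance(node, ast.Expr) or not isinstance(node.value, ast.Call):
            continue
        callee = node.value.func
        if isinstance(callee, ast.Attribute):
            name = callee.attr
        else:
            name = getattr(callee, "id", None)
        if name == func_name:
            return True
    return False
```

With a scanner of this kind, `registry.register(...)` written directly at module level is detected, while the same call inside a `try:` block is skipped.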

How this fits next to existing Hermes multi-agent surfaces

Hermes already ships several multi-agent / delegation primitives. WoT is additive, not redundant — it fills a specific gap none of them serve.

| Capability | delegate_task | mixture_of_agents | Kanban | WoT (wot_chat) |
| --- | --- | --- | --- | --- |
| Parent sees children's intermediate outputs | summary only | aggregator-only | polled comments | full transcript every turn |
| Children talk to each other | no (per #344) | no cross-reference | polling comments | direct @name-mentions |
| Children see each other's CoT | no | no | no | streaming mode pipes partial CoT |
| Multi-round refinement | one-shot per child | one-shot per reference model | heavyweight (board cycle) | native, default 5 rounds |
| Process model | subprocess per child | parallel HTTP calls | cross-process, durable | in-process asyncio |
| Latency floor | process spawn time | API round-trip | DB persist + claim | single API round-trip per agent per round |
| State persistence | none (ephemeral) | none (ephemeral) | SQLite-backed | none (live in-memory) |
| Best for | durable cross-process delegation with isolation | Best-of-N synthesis via aggregator | long-running multi-profile workflows | live multi-perspective reasoning within one task |

What WoT specifically adds: in-process live multi-agent reasoning where inner agents can address each other directly and the outer agent sees the full transcript as it forms. That's the niche the existing surfaces don't fill — delegate_task deliberately hides intermediate output, MoA's reference models don't see each other, and Kanban's polling comments aren't real-time. Several long-open feature requests (#412 consensus/voting, #376 adversarial debate, #479 best-of-N + judge, #5876 multi-agent council) all reduce to this missing primitive.

WoT does not replace any of the above. Compose: outer Hermes can call delegate_task for durable cross-process work, dispatch wot_chat for live debate within its own turn, and use Kanban for cross-session orchestration. They're complementary.

Validated end-to-end

Setup: Ubuntu 24.04 + RTX 4050. Isolated Hermes install (separate HERMES_HOME, no overlap with any production setup).

1. Local llama.cpp + Qwen3-4B-Instruct-Q4_K_M (--parallel 4 --jinja --ctx-size 65536):

  • 5/5 sessions completed, 48 WoT messages across runs, 0 inner errors
  • Range of behaviors: deep multi-round debate (18 msgs over 6 rounds), smart short-circuit on triviality (3 msgs in 1 round when all agents emit DONE), self-healing on bad arg shapes (Hermes retried with corrected payload)

2. OpenRouter + DeepSeek-V4-Flash (with skill loaded):

  • 5/5 sessions completed, 23 WoT messages, 1 inner error (token truncation mid-thinking on round 3 of one run; engine surfaced it cleanly via errors[])
  • Skill methodology measurably moved model behavior toward leaner invocations: avg agent name length ~10 chars (vs ~22 unloaded), max_rounds: 3 explicitly set on 5/5, token_budget on 3/5, ~43% latency drop vs no-skill baseline

Coverage — honest framing

Integration-validated end-to-end (with V4 Flash via OpenRouter as inner agents, full session JSONL captured):

  • parallel mode — 5/5 sessions clean, 48 WoT messages, multi-round @-mention emergence
  • streaming mode — 22 streaming chunks + 3 final messages produced, stop_reason=all_done clean
  • sequential mode — agent ordering preserved across 2 rounds with cross-round @-mentions (round-2 alpha addresses round-1 beta)
  • queue mode — interests tags drove tag-prefixed output ([design][code][review]), 3 rounds completed
  • Per-agent turn_timeout — standalone test, 2/2 agents timed out at 2.0s as configured, errors surfaced via errors[]
  • Backend probe (llama.cpp + OpenRouter), slot pinning on llama.cpp
  • Reasoning content extraction (R1 + V4 Flash thinking traces visible in transcripts)
  • AgentSpec auto-sanitization (model-emitted role-y names sanitized cleanly)
  • Caller-supplied model stripping (saved a run when V4 Flash hallucinated gpt-4o)
  • /v1 suffix doubling fix (caught the OpenRouter 404)
  • Hermes tool auto-discovery + custom toolset registration
  • Skill load + methodology effect on model behavior (43% latency drop, lean invocations)
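The per-agent `turn_timeout` behavior listed above (timeouts enforced with `asyncio.wait_for`, failures surfaced through `errors[]`) can be sketched like this. The helper name `run_turn_with_timeout` and the shape of the error records are assumptions for illustration; only `asyncio.wait_for` and the `errors[]` idea come from the PR text.

```python
import asyncio


async def run_turn_with_timeout(name, turn, timeout_s, errors):
    """Await one agent turn; on timeout, record an error and return None."""
    try:
        return await asyncio.wait_for(turn, timeout=timeout_s)
    except asyncio.TimeoutError:
        # The error is recorded rather than raised, so one slow agent
        # does not abort the whole round.
        errors.append({"agent": name, "error": f"turn timed out after {timeout_s}s"})
        return None


async def demo():
    errors = []

    async def slow_agent():
        await asyncio.sleep(1.0)  # stands in for a slow LLM call
        return "late reply"

    async def fast_agent():
        return "quick reply"

    results = await asyncio.gather(
        run_turn_with_timeout("alpha", slow_agent(), 0.05, errors),
        run_turn_with_timeout("beta", fast_agent(), 0.05, errors),
    )
    return results, errors
```

Running `asyncio.run(demo())` yields one result for the fast agent and one timeout entry in `errors` for the slow one.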

Routing-validated (engine routes correctly; downstream model output quality is upstream's concern):

  • Ollama native /api/chat path for thinking models: the backend probe identifies kind='ollama', the request hits /api/chat (not /v1/), and the engine parses both the message.content and message.thinking fields. Tested live with deepseek-r1:1.5b and deepseek-r1:7b. The output quality of small R1 distills under Ollama's template handling is a well-known upstream problem; the engine correctly returns whatever Ollama emits.
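A minimal sketch of the parsing step for a non-streaming Ollama /api/chat response, assuming the response carries a `message` object with `content` and an optional `thinking` field as described above. The helper name `split_ollama_message` is hypothetical.

```python
from typing import Tuple


def split_ollama_message(response: dict) -> Tuple[str, str]:
    """Return (content, thinking) from an Ollama /api/chat response body."""
    message = response.get("message") or {}
    # Either field may be absent or empty; normalize both to strings.
    content = message.get("content") or ""
    thinking = message.get("thinking") or ""
    return content, thinking
```

Keeping the two fields separate lets the caller store the trace alongside the answer without mixing them.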

Unit-test only (no integration run on this PR):

  • vLLM backend branch — code path exists, would need a vLLM-serving instance to validate. Same probe + dispatcher pattern as the validated paths, low risk.

Multi-backend mix — integration-validated (added in second commit b1e8872):

  • AgentSpec now has optional base_url + api_key fields for per-agent backend override
  • _LLMClient caches backend probes per-base_url so each unique target is only probed once
  • Validated live with one session running two agents on different backends simultaneously:
    • alpha on DeepSeek-V4-Flash via OpenRouter (https://openrouter.ai/api/v1)
    • beta on deepseek-r1:1.5b via local Ollama (http://127.0.0.1:11434)
  • Probe cache after run showed both: https://openrouter.ai/api → openai-compat and http://127.0.0.1:11434 → ollama
  • 0 engine errors; both responses assembled into the transcript with correct from-attribution
  • Defensive design: wot_chat_tool boundary strips model + base_url + api_key from outer-Hermes-supplied args (Hermes hallucinates them); direct Python callers using AgentSpec(base_url=..., api_key=...) still work
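The per-base_url probe cache can be sketched as below. The class name `ProbeCache`, the `normalize_base_url` helper, and the injected `probe` callable are assumptions for illustration; the sketch only reflects the two behaviors stated in this PR: a trailing /v1 is stripped, and each unique target is probed once.

```python
from typing import Callable, Dict


def normalize_base_url(base_url: str) -> str:
    """Strip a trailing slash and a trailing /v1 so both URL forms share one key."""
    url = base_url.rstrip("/")
    if url.endswith("/v1"):
        url = url[: -len("/v1")]
    return url


class ProbeCache:
    """Caches the detected backend kind per normalized base URL."""

    def __init__(self, probe: Callable[[str], str]) -> None:
        self._probe = probe  # hypothetical: returns e.g. "ollama" or "openai-compat"
        self._cache: Dict[str, str] = {}

    def kind_for(self, base_url: str) -> str:
        key = normalize_base_url(base_url)
        if key not in self._cache:
            # Probe only on the first request for this target.
            self._cache[key] = self._probe(key)
        return self._cache[key]
```

With this normalization, `https://openrouter.ai/api/v1` and `https://openrouter.ai/api` map to the same cache entry, which matches the probe cache keys reported above.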

Stubbed:

  • propagate_reasoning="summary" — currently behaves identically to "strip". A real summary mode would distill peer CoT through a small model; deferred to a follow-up. Docstring is honest about this.
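The three `propagate_reasoning` settings, including the stubbed "summary" mode, might behave roughly like this. The function name `render_for_peers` and the `<reasoning>` wrapper used for raw propagation are assumptions; the PR only states that "summary" currently behaves like "strip".

```python
def render_for_peers(content: str, reasoning: str, propagate_reasoning: str = "strip") -> str:
    """Build the text a peer agent sees for one message."""
    if propagate_reasoning not in ("raw", "strip", "summary"):
        raise ValueError(f"unknown propagate_reasoning: {propagate_reasoning!r}")
    if propagate_reasoning == "raw" and reasoning:
        # Hypothetical wrapper format; the engine's actual formatting is not shown.
        return f"<reasoning>\n{reasoning}\n</reasoning>\n{content}"
    # "strip" drops the trace. "summary" is a stub that falls through to the
    # same behavior until a real summarizer exists.
    return content
```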

Linked issues

Closes (auto-close on merge):

Refs (does not auto-close — partial coverage):

Test plan

  • pytest -p no:xdist tests/tools/test_wot_engine.py — 36/36 passing on this branch
  • Engine integration tested against llama.cpp + Qwen3-4B-Instruct (5/5 sessions, 0 errors)
  • Engine integration tested against OpenRouter + DeepSeek-V4-Flash (5/5 sessions, 1 truncation surfaced honestly)
  • Skill load + invocation-pattern A/B tested (skill measurably moves model behavior)
  • CI green (will fix anything pytest tests/ flags)

Backwards compatibility

Pure-add. New tool, new toolset, new skill, new test file. Zero changes to existing code paths.

License

MIT (auto per CONTRIBUTING.md).

Adds a self-contained multi-agent reasoning engine that coordinates
3-7 LLM agents through a shared message bus. Generic agents (no role
taxonomy) talk to each other across four communication modes — parallel,
streaming, sequential, queue — over any OpenAI-compatible backend.

The engine is exposed as a single Hermes tool, `wot_chat`, registered
under a new `wot` toolset. Caller passes agent specs (name +
system_prompt) and a task; the engine orchestrates the conversation and
returns a structured transcript for the outer agent to synthesize.

Files:
- tools/wot_engine.py — engine + tool registration (~1040 lines)
- tests/tools/test_wot_engine.py — 36 unit tests, all green
- skills/coordination/web-of-thought/SKILL.md — methodology guidance
  (when to invoke, how to design agents, mode selection, cost discipline)

Engine specifics:
- Backend probe at startup: detects llama.cpp / Ollama / vLLM /
  OpenAI-compat. Uses id_slot pinning + cache_prompt: true on llama.cpp
  for KV-cache reuse across agents.
- Reasoning content extraction: handles delta.reasoning_content from
  thinking-mode models (DeepSeek-R1, QwQ, etc.) separately from content,
  so peer messages can choose to propagate raw / strip / summarize CoT.
- Per-agent timeout via asyncio.wait_for, per-channel token budget,
  monotonic seq counter on Message envelope for stream debugging.
- AgentSpec auto-sanitizes whitespace in names (real LLMs emit
  "Critical Thinker" / "Agent A"); raises only when sanitized name is
  empty.
- /v1 suffix is stripped from base_url at client init so callers can
  pass either form (http://host:8088 OR http://host:8088/v1) without
  doubling.
- Hermes tool registration is at module top-level (not wrapped in
  try/except) so tools/registry.py:_module_registers_tools picks it
  up via AST scan.
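As an illustration of the reasoning content extraction listed above, the following sketch accumulates `delta.reasoning_content` and `delta.content` from already-parsed streaming chunks. The function name `collect_stream` is hypothetical, and the chunk shape follows the OpenAI-style `choices[0].delta` layout, which is assumed here rather than taken from the engine's code.

```python
from typing import Iterable, Tuple


def collect_stream(chunks: Iterable[dict]) -> Tuple[str, str]:
    """Return (content, reasoning) accumulated from streamed chat chunks."""
    content_parts = []
    reasoning_parts = []
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            continue  # e.g. a usage-only chunk
        delta = choices[0].get("delta") or {}
        # Thinking-mode models send the trace in a separate field.
        if delta.get("reasoning_content"):
            reasoning_parts.append(delta["reasoning_content"])
        if delta.get("content"):
            content_parts.append(delta["content"])
    return "".join(content_parts), "".join(reasoning_parts)
```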

Skill methodology:
- When to invoke wot_chat (multi-perspective questions, tradeoffs,
  decisions with real downside) and when NOT to (lookups, single-fact,
  simple chat).
- Agent design rule: minimal differentiating system_prompt, no
  scripted personas, no role-cargo names. Engine remains role-agnostic.
- Mode selection: parallel (default) / streaming / sequential / queue.
- Cost discipline: max_rounds: 2-3 for most cases, set token_budget
  for hard caps.
- Reading the result: errors first, then agents_done, then transcript.
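The reading order in the last point (errors, then agents_done, then transcript) could be applied like this. The result shape is an assumption: `errors` as a list, `agents_done` as a list of agent names, and `transcript` as a list of dicts with `from` and `content` keys. The helper name `triage` is hypothetical.

```python
def triage(result: dict) -> list:
    """Summarize a wot_chat-style result in the recommended reading order."""
    lines = []
    # 1. Errors first: they qualify everything that follows.
    for err in result.get("errors") or []:
        lines.append(f"ERROR: {err}")
    # 2. Which agents declared themselves done (assumed to be a list of names).
    done = result.get("agents_done") or []
    lines.append(f"done: {', '.join(done) if done else 'nobody'}")
    # 3. The transcript itself, in order.
    for msg in result.get("transcript") or []:
        lines.append(f"{msg.get('from', '?')}: {msg.get('content', '')}")
    return lines
```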

Validated end-to-end on Ubuntu 24.04 + RTX 4050 with two model
configurations:
1. Local llama.cpp + Qwen3-4B-Instruct-Q4_K_M (--parallel 4 --jinja):
   5/5 sessions completed, 48 WoT messages across runs, 0 inner errors.
2. OpenRouter + DeepSeek-V4-Flash (with skill loaded): 5/5 sessions,
   skill methodology measurably moved model behavior toward leaner
   invocations (avg agent name ~10 chars vs ~22 unloaded; max_rounds
   explicitly set 5/5; 43% latency drop).

License: MIT (auto per CONTRIBUTING.md).
Abd0r changed the title from "feat(tools/wot_engine): Web-of-Thought multi-agent reasoning" to "feat(tools/wot_engine): add Web-of-Thought multi-agent reasoning" on May 5, 2026
alt-glitch added labels on May 5, 2026: type/feature (New feature or request), comp/tools (Tool registry, model_tools, toolsets), P3 (Low: cosmetic, nice to have)
Adds per-agent base_url + api_key fields to AgentSpec, enabling a single
WoT session to mix backends (e.g. one agent on local Ollama, another on
OpenRouter). _LLMClient caches backend probes per-base_url so each unique
target is only probed once across the run.

Engine changes:
- AgentSpec: new optional fields base_url + api_key
- _LLMClient: _probe_cache: Dict[str, BackendInfo], ensure_probed() now
  takes optional base_url_override and caches per-target
- _resolve_target() helper composes the right URL + auth headers per call
- _openai_payload_for(backend, ...) takes backend explicitly (so id_slot +
  cache_prompt only land when THIS request actually targets llama-server)
- complete() and stream() take base_url_override + api_key_override kwargs
- _stream_openai and _stream_ollama_native take per-call base + headers
- Agent.turn_batch + turn_streaming pass spec.base_url + spec.api_key
- wot_chat_tool boundary strips model + base_url + api_key from outer-Hermes
  args (defensive: outer model hallucinates these); direct Python callers
  using AgentSpec(base_url=..., api_key=...) still work
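A sketch of the backend-gated payload construction: the llama.cpp-specific fields are added only when the request actually targets that backend. The function name `payload_for_backend` and the kind string "llama.cpp" are assumptions; `id_slot` and `cache_prompt` are the fields named in this commit.

```python
from typing import Optional


def payload_for_backend(backend_kind: str, model: str, messages: list,
                        slot_id: Optional[int] = None) -> dict:
    """Build a chat-completions payload, adding llama.cpp extras only when relevant."""
    payload = {"model": model, "messages": messages, "stream": True}
    # "llama.cpp" is a hypothetical kind label for llama-server targets.
    if backend_kind == "llama.cpp":
        payload["cache_prompt"] = True
        if slot_id is not None:
            # Pin the agent to one slot so its KV cache can be reused.
            payload["id_slot"] = slot_id
    return payload
```

Passing the backend per call, instead of reading it from shared client state, keeps these fields out of requests that go to a different backend in a mixed session.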

Tests:
- 39/39 unit tests passing (up from 36)
- New: MultiBackendMixTests verifies per-agent base_url threads to client
- New: WotChatToolStripsCallerControlFields verifies tool boundary strips
  caller-supplied model/base_url/api_key
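The tool-boundary stripping that the second test covers can be sketched as follows, assuming agent specs arrive as plain dicts from the outer model. The names `CALLER_CONTROL_FIELDS` and `strip_control_fields` are hypothetical.

```python
# Fields the outer model is not allowed to set at the tool boundary.
CALLER_CONTROL_FIELDS = ("model", "base_url", "api_key")


def strip_control_fields(agent_specs: list) -> list:
    """Return copies of the specs without model/base_url/api_key."""
    return [
        {key: value for key, value in spec.items() if key not in CALLER_CONTROL_FIELDS}
        for spec in agent_specs
    ]
```

Because the function returns new dicts, the caller's original arguments are left untouched, and direct Python callers that build `AgentSpec` themselves never pass through this filter.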

Validated end-to-end:
- One WoT session with 2 agents on different backends:
  - alpha on DeepSeek-V4-Flash via OpenRouter
  - beta on deepseek-r1:1.5b via local Ollama
- Probe cache shows both targets:
  https://openrouter.ai/api → openai-compat
  http://127.0.0.1:11434 → ollama
- 0 engine errors, both transcripts assembled with correct from-attribution