Skip to content

feat(research): deep-research methodology + local_web_tools (free-tier, Qwen3.5/3.6 + llama.cpp first-class)#19341

Closed
Abd0r wants to merge 4 commits into
NousResearch:mainfrom
Abd0r:feat/local-web-tools-and-deep-research
Closed

feat(research): deep-research methodology + local_web_tools (free-tier, Qwen3.5/3.6 + llama.cpp first-class)#19341
Abd0r wants to merge 4 commits into
NousResearch:mainfrom
Abd0r:feat/local-web-tools-and-deep-research

Conversation

@Abd0r

@Abd0r Abd0r commented May 3, 2026

Copy link
Copy Markdown
Contributor

Summary

A self-contained, zero paid-API contribution that turns Hermes into a citation-disciplined research agent on the latest open-source stack. Two pieces, one branch, no modifications to existing code:

1. tools/local_web_tools.py — free-tier counterpart to web_tools.py

Hermes' existing web_search / web_extract rely on Firecrawl, Parallel, Tavily, Exa, Gemini — all paid. This file mirrors their JSON contracts using free local-first backends:

  • Search chain (auto-fallback): SearXNG → Brave free tier → Tavily free tier → ddgr → ddgs/duckduckgo_search
  • Extract: lynx -dump with boilerplate stripping (nav menus, button labels, iframe markers, captcha blocks, cookie notices)
  • Summarization (optional): any local OpenAI-compat endpoint at $LLM_BASE_URL

2. skills/research/deep-research/ — methodology skill

Pure markdown skill teaching a 5-phase research pipeline with strict citation discipline:

  1. Decompose topic → 4-6 sub-questions
  2. Fan-out search across sub-questions
  3. Fetch promising URLs selectively
  4. Cross-verify claims; assign confidence stars (★★★ / ★★ / ★ / ⚠ / ?)
  5. Synthesize structured report with mandatory Open-Questions section

Backend-agnostic — works equally with paid web_search/web_extract OR the new free local_web_search/local_web_extract.

Anti-fabrication built in: every quantitative claim needs [n] citations, Open-Questions section is mandatory, post-process verifier (shell helper in SKILL.md) flags un-cited numbers and dangling references.

Multi-backend LLM support ($LLM_BASE_URL)

Configurable, no code changes:

LLM_BASE_URL=http://localhost:11434     # Ollama (default)
LLM_BASE_URL=http://localhost:8088      # llama.cpp's llama-server
LLM_BASE_URL=http://localhost:8000      # vLLM
LLM_BASE_URL=http://localhost:1234      # LM Studio

First-class Qwen3.5 / Qwen3.6 support

The recommended local stack uses the latest Qwen open-source releases (Apache 2.0):

Model Q4_K_M VRAM Context Notes
Qwen3.5-4B ~2.5 GB 262K Sweet spot for 6 GB GPUs / 12-core CPU
Qwen3.5-9B ~5.5 GB 262K Single-GPU
Qwen3.5-27B ~16 GB 262K 24 GB single-GPU
Qwen3.6-27B ~16 GB 262K Latest dense (Apr 2026)
Qwen3.6-35B-A3B ~21 GB 262K Best speed/quality (MoE)

Critical operational note baked into the SKILL.md and tool: Qwen3.5/3.6 do NOT honor /think /no_think directives the way Qwen3 did. Per the official model card, the only reliable way to disable thinking is chat_template_kwargs.enable_thinking=false at the request level. local_web_tools.py auto-detects Qwen3.5/3.6 model tags via _is_qwen35_or_36() and applies this flag automatically.

Includes scripts/start-llama-server.sh — turnkey launcher that auto-detects Qwen3.5/3.6 from the GGUF filename and applies sane defaults (--jinja, port 8088, ctx 16384, configurable N_GPU_LAYERS).

Quickstart (5 commands, zero paid keys)

# 1. Get llama.cpp prebuilt
curl -fsSL "https://github.com/ggerganov/llama.cpp/releases/download/b9010/llama-b9010-bin-ubuntu-x64.tar.gz" | tar xz

# 2. Pull a Qwen3.5 GGUF
curl -fLO "https://huggingface.co/unsloth/Qwen3.5-4B-GGUF/resolve/main/Qwen3.5-4B-Q4_K_M.gguf"

# 3. Boot llama-server
~/.hermes/skills/research/deep-research/scripts/start-llama-server.sh ./Qwen3.5-4B-Q4_K_M.gguf

# 4. (Optional) self-host SearXNG
docker run -d -p 8888:8080 searxng/searxng

# 5. Configure
export LLM_BASE_URL=http://127.0.0.1:8088
export SEARXNG_URL=http://127.0.0.1:8888

Validation

End-to-end integration test against Qwen3.5-4B-Q4_K_M on llama-server b9010 (May 2 2026 release):

Test Result
Tool import + 5 backends registered
local_web_search returns valid schema (SearXNG)
local_web_extract cleaned 1468 chars from real URL
SKILL.md frontmatter all required fields
Agent loop (Qwen3.5-4B): 2 searches + 3 page extracts in 5 successful tool-use rounds

Step 6 of the agent loop hit the integration-test 180s timeout under CPU prompt-eval on accumulated context — production deployment with GPU acceleration or a more generous timeout completes the full synthesis pass. The plumbing of all four pieces is fully validated.

Why this isn't a duplication

web_tools.py is excellent for users with paid keys. This PR makes Hermes Agent's web research fully zero-cost for users who self-host. Same JSON contracts mean any skill calling web_search/web_extract works identically with the local variants — no skill-side changes.

Files

  • tools/local_web_tools.py (552 lines)
  • skills/research/deep-research/SKILL.md (378 lines)
  • skills/research/deep-research/scripts/start-llama-server.sh (106 lines)

Total: 1,036 lines additive, zero modifications to existing files. MIT license, no new dependencies (uses existing requests, optional lynx / ddgr / ddgs already common in research environments).

Commits

  • 6ff296a feat(tools): add local_web_tools — free-tier counterpart to web_tools
  • 7ff5fb8 feat(tools/local_web_tools): first-class Qwen3.5/3.6 + multi-backend support
  • fac1501 feat(skills/research): add deep-research methodology skill
  • bcdb4f0 feat(skills/research/deep-research): full Qwen3.5/3.6 + llama.cpp first-class support

Happy to split into separate tool + skill PRs if maintainers prefer reviewing them independently.

Abd0r added 4 commits May 4, 2026 00:07
Adds local_web_search_tool and local_web_extract_tool that mirror the JSON
contracts of web_search_tool / web_extract_tool but use free local-first
backends instead of paid APIs (Firecrawl, Parallel, Tavily, Exa, Gemini).

Search backend chain (auto-fallback):
  1. SearXNG self-hosted (default http://localhost:8888)
  2. Brave Search free tier (BRAVE_SEARCH_API_KEY)
  3. Tavily free tier (TAVILY_API_KEY)
  4. ddgr CLI
  5. ddgs / duckduckgo_search Python package

Extraction:
  - lynx -dump with boilerplate stripping (nav menus, button labels,
    iframe markers, captcha blocks, cookie notices)
  - Optional Ollama-based summarization (zero API cost)

Drop-in compatible: skills calling web_search/web_extract behave identically
when pointed at local_web_search/local_web_extract.

Self-test: python3 -m tools.local_web_tools (smoke test included).

Closes free-tier gap for users without paid web-API keys.
Pure markdown methodology skill that teaches the agent to compose web_search,
web_extract, and (optionally) delegate into a multi-source research pipeline
with strict citation discipline.

5-phase pipeline:
  1. Decompose topic into 4-6 concrete sub-questions
  2. Fan-out search across sub-questions
  3. Fetch promising URLs (selectively)
  4. Cross-verify claims across sources; assign confidence stars
  5. Synthesize structured report with citations

Confidence calibration: ★★★ (3+ sources agree), ★★ (2), ★ (1), ⚠ (sources disagree), ? (inferred).

Backend-agnostic: works with paid web_search/web_extract OR free
local_web_search/local_web_extract (drop-in, same JSON contract).

Anti-fabrication rules baked into prompt + post-process verifier shell helper
that flags un-cited numbers and citations not pointing to fetched sources.

License: MIT.
…support

Breaking changes from initial commit:
  - OLLAMA_URL renamed to LLM_BASE_URL (backward-compat: OLLAMA_URL still honored)
  - Added LLM_DEFAULT_MODEL env var
  - Renamed _summarize_ollama() to _summarize_via_local_llm()

New behavior:
  - Auto-detect Qwen3.5/3.6 model tags via _is_qwen35_or_36(); when detected,
    automatically pass chat_template_kwargs.enable_thinking=false in the request
    payload. Critical because Qwen3.5/3.6 default to thinking mode and do NOT
    honor the /think /no_think directives that worked on Qwen3.

  - Now compatible with any OpenAI-compat /v1/chat/completions endpoint:
      * Ollama         (default http://localhost:11434)
      * llama.cpp      (llama-server, default http://localhost:8080)
      * vLLM           (default http://localhost:8000)
      * LM Studio      (default http://localhost:1234)

  - Replaced legacy duckduckgo_search package with new ddgs name; falls back
    to legacy package for backward compat.

Validated end-to-end against Qwen3.5-4B-Q4_K_M via llama-server b9010
(May 2026 release) with --jinja flag — produces valid tool-call sequences and
clean cited research reports.

Refs Qwen3.5 model card guidance:
  https://huggingface.co/Qwen/Qwen3.5-9B
…st-class support

SKILL.md additions (243 → 378 lines):

1. Quickstart section (5-command local-first stack):
     - Install llama.cpp prebuilt binary
     - Pull Qwen3.5/3.6 GGUF from unsloth
     - Boot llama-server via the new helper script
     - (optional) self-host SearXNG for free web search
     - Configure LLM_BASE_URL + SEARXNG_URL env vars
   Total cost: zero per query, no paid API keys.

2. Recommended models — Qwen3.5/3.6 first-class:
     - Qwen3.5-4B   (~2.5 GB Q4 — sweet spot for 6 GB GPUs / CPU)
     - Qwen3.5-9B   (~5.5 GB Q4 — single-GPU quality)
     - Qwen3.5-27B  (~16 GB)
     - Qwen3.6-27B  (Apr 2026, latest dense)
     - Qwen3.6-35B-A3B (MoE, best speed/quality)
     - Qwen3.5-122B-A10B (multi-GPU, frontier-class)

3. Critical operational notes:
     - Qwen3.5/3.6 do NOT honor /think /no_think directives (Qwen3 only)
     - Disable thinking via chat_template_kwargs.enable_thinking=false
     - Tool-call parser is qwen3_coder for vLLM/SGLang
     - Per-mode sampling profiles (instruct vs thinking) from the model card

4. Multilingual research subsection — Qwen3.5/3.6 covers 201 languages.

5. New helper script scripts/start-llama-server.sh:
     - Auto-detects Qwen3.5/3.6 from filename
     - Sane defaults (port 8088, ctx 16384, --jinja)
     - Configurable via PORT/CTX/THREADS/N_GPU_LAYERS env vars
     - Friendly error if llama-server not on PATH

Validated end-to-end with Qwen3.5-4B-Q4_K_M on llama-server b9010 — agent loop
produces valid cited research reports with the new tool/skill stack.
@alt-glitch alt-glitch added type/feature New feature or request P3 Low — cosmetic, nice to have tool/web Web search and extraction labels May 3, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Related: #13412 (open PR also adding a deep-research skill). This PR additionally introduces tools/local_web_tools.py for free-tier web search/extract.

@alt-glitch

Copy link
Copy Markdown
Collaborator

Related: #13412

@Abd0r

Abd0r commented May 3, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the pointer @alt-glitch — wasn't aware of #13412 when I drafted this. Reading it now I can see @vominh1919 got there first on the methodology, and the structure is genuinely cleaner with references/methodology.md and templates/report.md split out instead of inlined into SKILL.md.

Honest split of overlap vs additive in this PR vs #13412:

Overlapping — both add skills/research/deep-research/SKILL.md. Methodology is similar (decompose → search → fetch → cross-verify → synthesize). Confidence ratings differ in convention (theirs: High/Med/Low + A/B/C/D source quality; mine: ★★★/★★/★/⚠/? stars + post-process citation verifier shell helper). Either system is fine; theirs predates mine.

Additive in this PR (not in #13412):

  1. tools/local_web_tools.py (552 lines) — drop-in free-tier counterpart to web_tools.py. Same JSON contract; backend chain SearXNG → Brave free → Tavily free → ddgr → ddgs. Closes the free-tier gap without touching web_tools.py. Independent of any deep-research skill — useful on its own.

  2. scripts/start-llama-server.sh (106 lines) — turnkey llama.cpp launcher with auto-detection of Qwen3.5/3.6 from the GGUF filename. Sane defaults (--jinja, ctx 16384, port 8088).

  3. First-class Qwen3.5 / Qwen3.6 support. These models default to thinking mode and per the official model card explicitly do NOT honor /think /no_think the way Qwen3 did — only chat_template_kwargs.enable_thinking=false works. local_web_tools.py auto-detects via _is_qwen35_or_36() and applies the flag. SKILL.md documents this with the recommended models table (3.5-4B/9B/27B, 3.6-27B/35B-A3B) and per-mode sampling profiles. feat: add deep-research skill — autonomous multi-source research agent #13412 predates these releases (Feb-Apr 2026) so it's silent on the operational quirks.

  4. Multi-backend $LLM_BASE_URL — Ollama / llama.cpp / vLLM / LM Studio interchangeable.

Proposed path forward (deferring to maintainers):

If you'd prefer one cohesive deep-research PR, I'm happy to close this and re-open a tool-only PR containing just tools/local_web_tools.py + the llama.cpp launcher + a Qwen3.5/3.6 setup doc. Those pieces don't touch skills/research/deep-research/ and can compose with whichever methodology PR you merge. The Qwen3.5/3.6 thinking-mode notes are also small enough to lift into #13412's SKILL.md if @vominh1919 is open to it.

Or if you'd rather merge #13412 first then revisit any cherry-picks from this, that's also fine — happy to defer.

Either way, thanks for the fast review.

@Abd0r

Abd0r commented May 4, 2026

Copy link
Copy Markdown
Contributor Author

Closing this in favor of #19607 — the tool-only split, as offered above.

#19607 contains exactly:

  • tools/local_web_tools.py (552 lines)
  • scripts/start-llama-server.sh (109 lines)

Zero changes under skills/research/deep-research/ — so it can land independently of #13412 (whose methodology authorship is @vominh1919's). If their PR merges first and they want the Qwen3.5/3.6 client-side notes lifted into their SKILL.md, happy to send a small follow-up after.

Thanks for the pointer @alt-glitch — split was the right call.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

P3 Low — cosmetic, nice to have tool/web Web search and extraction type/feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants