Skip to content

feat(tools): local_web_tools — free-tier web search/extract + llama-server launcher#19607

Closed
Abd0r wants to merge 3 commits into
NousResearch:mainfrom
Abd0r:feat/local-web-tools-only
Closed

feat(tools): local_web_tools — free-tier web search/extract + llama-server launcher#19607
Abd0r wants to merge 3 commits into
NousResearch:mainfrom
Abd0r:feat/local-web-tools-only

Conversation

@Abd0r

@Abd0r Abd0r commented May 4, 2026

Copy link
Copy Markdown
Contributor

Summary

This is the tool-only split of #19341 (which I'm closing in favor of this). #19341 also added a skills/research/deep-research/SKILL.md that overlapped with @vominh1919's existing #13412; that overlap is removed here so this PR can land independently of any deep-research methodology decision.

What's left in this PR is purely additive infrastructure — a free-tier counterpart to web_tools.py plus a turnkey llama.cpp launcher. Useful on its own; composes with whichever deep-research skill (or other research tool) ends up shipping.

What this PR adds

tools/local_web_tools.py (552 lines) — drop-in free-tier counterpart to web_tools.py:

  • Same JSON contract — local_web_search / local_web_extract are interchangeable with web_search / web_extract.
  • Backend chain: SearXNG → Brave Search free tier → Tavily free tier → ddgr → ddgs. Fails over cleanly.
  • First-class Qwen3.5 / Qwen3.6 support — auto-detects via _is_qwen35_or_36() and applies chat_template_kwargs={"enable_thinking": false} because per the official Qwen3.5 model card these models do not honor /think /no_think the way Qwen3 did — only the chat-template flag works.
  • Multi-backend: $LLM_BASE_URL works for Ollama / llama.cpp's llama-server / vLLM / LM Studio interchangeably.
  • Self-test: python3 -m tools.local_web_tools (smoke).

scripts/start-llama-server.sh (109 lines) — turnkey llama.cpp launcher:

  • Auto-detects Qwen3.5/3.6 from GGUF filename.
  • Sane defaults (--jinja, ctx 16384, port 8088).
  • nproc fallback for non-Linux (macOS).
  • Friendly errors when GGUF missing / llama-server not on PATH (with install hints for Linux/macOS/pip).
  • Standalone — no skill or tool dependency.

Why split

#19341 bundled this with a deep-research skill. @alt-glitch correctly pointed out the skill overlapped with @vominh1919's #13412 (open since 2026-04-21). Splitting lets the tool + launcher land on their own merits, and lets #13412's methodology PR proceed without coordination overhead. If @vominh1919 wants to lift the Qwen3.5/3.6 client-side notes into their SKILL.md after this lands, I'm happy to send a small follow-up; otherwise the operational quirks live cleanly in tools/local_web_tools.py itself.

Related issues

This PR closes three open feature requests by shipping their requested backends, and partially addresses two more.

Closes (auto-close on merge):

Addresses (does not auto-close):

If maintainers prefer a different close/keep-open call on any of these, happy to adjust.

Two implementation options — maintainers' choice

This PR currently ships Option A, which is the lower-risk drop-in. Option B is functionally equivalent but a cleaner long-term design. Happy to refactor on request.

Option A — parallel module (this PR as-is)

Option B — integrate into existing web_tools.py (also open: #19796)

Reviewers can pick whichever of #19607 / #19796 is cleaner to merge; the other will be closed as superseded.

Files changed

  • tools/local_web_tools.py — new (552)
  • scripts/start-llama-server.sh — new (109, executable)

No existing files modified.

Test plan

Validated on two platforms with the same llama.cpp build (b9010, May 2026 release):

Ubuntu 24.04 (x86_64, RTX 4050 Laptop GPU)

  • Self-test: python3 -m tools.local_web_tools
  • Smoke against SearXNG public instance + Brave free key + ddgs fallback
  • llama-server boot via scripts/start-llama-server.sh ~/models/Qwen3.5-4B-Q4_K_M.gguf — serves on http://127.0.0.1:8088/v1/chat/completions
  • End-to-end agent loop with Hermes pointing at LLM_BASE_URL=http://127.0.0.1:8088 produces valid cited research reports

macOS 26 Tahoe (Apple Silicon M2)

  • brew install llama.cppllama-server resolves on PATH
  • scripts/start-llama-server.sh boots Qwen3.5-4B-Q4_K_M cleanly on Metal backend (3.7 GB GPU memory) — server listening within 19s
  • /v1/models and /props both respond; chat template loads with thinking=1 (auto-detected from Qwen3.5 GGUF)
  • nproc fallback path exercised (macOS lacks nproc; THREADS defaults to 8 per script)
  • python3 -m tools.local_web_tools smoke on macOS — ddg-python backend (via pip-installed ddgs) returned 3 real DuckDuckGo results; lynx extraction path exercised (brew install lynx).
  • End-to-end Hermes agent loop with custom llamacpp_local provider (api: http://127.0.0.1:8089/v1, transport: openai_chat):
    • Single-turn chat completion roundtrip via hermes chat -q '...' --provider llamacpp_local -m Qwen3.5-4B-Q4_K_M.gguf -Q produces valid completion (session opens, response returned, session closes cleanly)
    • Hermes's built-in llama.cpp auto-detection works: /v1/models emits owned_by: llamacpp; /props reports correct n_ctx
  • Direct curl POST to /v1/chat/completions with tool_choice: "required" + chat_template_kwargs: {enable_thinking: false} → Qwen3.5-4B-Q4_K_M emits proper structured tool_calls JSON (verified the model handles tool calling correctly when the chat-template flag is passed)

CI: will fix anything pytest tests/ flags.

License

MIT (auto per CONTRIBUTING.md).

Abd0r and others added 3 commits May 4, 2026 14:28
Adds local_web_search_tool and local_web_extract_tool that mirror the JSON
contracts of web_search_tool / web_extract_tool but use free local-first
backends instead of paid APIs (Firecrawl, Parallel, Tavily, Exa, Gemini).

Search backend chain (auto-fallback):
  1. SearXNG self-hosted (default http://localhost:8888)
  2. Brave Search free tier (BRAVE_SEARCH_API_KEY)
  3. Tavily free tier (TAVILY_API_KEY)
  4. ddgr CLI
  5. ddgs / duckduckgo_search Python package

Extraction:
  - lynx -dump with boilerplate stripping (nav menus, button labels,
    iframe markers, captcha blocks, cookie notices)
  - Optional Ollama-based summarization (zero API cost)

Drop-in compatible: skills calling web_search/web_extract behave identically
when pointed at local_web_search/local_web_extract.

Self-test: python3 -m tools.local_web_tools (smoke test included).

Closes free-tier gap for users without paid web-API keys.
…support

Breaking changes from initial commit:
  - OLLAMA_URL renamed to LLM_BASE_URL (backward-compat: OLLAMA_URL still honored)
  - Added LLM_DEFAULT_MODEL env var
  - Renamed _summarize_ollama() to _summarize_via_local_llm()

New behavior:
  - Auto-detect Qwen3.5/3.6 model tags via _is_qwen35_or_36(); when detected,
    automatically pass chat_template_kwargs.enable_thinking=false in the request
    payload. Critical because Qwen3.5/3.6 default to thinking mode and do NOT
    honor the /think /no_think directives that worked on Qwen3.

  - Now compatible with any OpenAI-compat /v1/chat/completions endpoint:
      * Ollama         (default http://localhost:11434)
      * llama.cpp      (llama-server, default http://localhost:8080)
      * vLLM           (default http://localhost:8000)
      * LM Studio      (default http://localhost:1234)

  - Replaced legacy duckduckgo_search package with new ddgs name; falls back
    to legacy package for backward compat.

Validated end-to-end against Qwen3.5-4B-Q4_K_M via llama-server b9010
(May 2026 release) with --jinja flag — produces valid tool-call sequences and
clean cited research reports.

Refs Qwen3.5 model card guidance:
  https://huggingface.co/Qwen/Qwen3.5-9B
Companion script to tools/local_web_tools.py. Boots llama.cpp's llama-server
with the correct flags for OpenAI-compatible local inference, with first-class
Qwen3.5/3.6 detection from the GGUF filename.

Defaults:
  - port 8088, ctx 16384, threads = nproc
  - --jinja (required for Qwen3.5/3.6 chat-template + tool calling)
  - --n-gpu-layers 0 (CPU; override via N_GPU_LAYERS=-1 for all-on-GPU)

Detects Qwen3.5/3.6 from filename and prints the required client-side flag
(chat_template_kwargs.enable_thinking=false) per the official model card —
since Qwen3.5+ default to thinking mode and ignore /think /no_think.

Useful out of the box for any Hermes tool that talks to an OpenAI-compatible
endpoint (point LLM_BASE_URL at http://127.0.0.1:8088).

Friendly errors when:
  - GGUF path missing or invalid
  - llama-server not on PATH (with install hints for Linux/macOS/pip)

Standalone — no skill or tool dependency. MIT.
@alt-glitch alt-glitch added type/feature New feature or request comp/tools Tool registry, model_tools, toolsets tool/web Web search and extraction P3 Low — cosmetic, nice to have labels May 4, 2026
@Abd0r Abd0r closed this May 7, 2026
@Abd0r Abd0r reopened this May 7, 2026
@Abd0r Abd0r closed this May 7, 2026
Abd0r added a commit to Abd0r/hermes-agent that referenced this pull request May 15, 2026
llama.cpp's `llama-server` already speaks OpenAI chat-completions, so users
could already point Hermes at it via `--provider custom`. But "custom" means
they have to set OPENAI_BASE_URL by hand, the model picker doesn't list it,
and the dashboard has no way to surface the running server. This PR makes
llama.cpp a discoverable, zero-config backend.

What ships
==========

* `plugins/model-providers/llama-cpp/` — new ProviderProfile with default
  base_url `http://127.0.0.1:8088/v1`, aliases `llamacpp` / `llama.cpp` /
  `llama_cpp` / `llama-server`, and an offline-tolerant fetch_models override
  (returns None instead of raising when the local server is down).
* `hermes_cli/auth.py` — adds llama-cpp to PROVIDER_REGISTRY (modeled on
  the lmstudio entry: api_key auth_type with optional LLAMA_CPP_API_KEY +
  LLAMA_CPP_BASE_URL env vars). Removes the old llama.cpp/llamacpp/llama-cpp
  hardcoded aliases that pointed at `custom`, so the plugin's aliases win.
* `hermes_cli/models.py` — adds the same alias mappings to _PROVIDER_ALIASES
  so `--provider llama.cpp` resolves correctly through the CLI parser path.
* `hermes_cli/model_switch.py` — adds a probe-and-surface block in
  list_authenticated_providers, mirroring the existing lmstudio pattern.
  Three surfacing modes:
    1. Live probe: `${LLAMA_CPP_BASE_URL}/models` with a 300 ms cold-discovery
       timeout. If `llama-server` responds, the row appears with the loaded
       model. This is what makes the dashboard "magically" pick up a running
       server with no config.
    2. Hint mode: LLAMA_CPP_API_KEY or LLAMA_CPP_BASE_URL set, or current
       provider matches one of the aliases — 1.5 s timeout.
    3. Sticky current: when llama-cpp is the user's selected provider but
       the server is offline, the row still appears with current_model so
       the user doesn't lose access after restart.
  When no env vars, no current selection, and no live server, the row is
  not injected — keeps the picker tidy for non-llama.cpp users.
* `plugins/model-providers/custom/__init__.py` — drops the llamacpp / llama.cpp
  / llama-cpp aliases from the generic `custom` profile (they now belong to
  the dedicated provider).
* `scripts/start-llama-server.sh` — turnkey llama-server launcher whose
  default port (8088) lines up with the plugin's default base_url, so the
  end-to-end UX is just:
      ./scripts/start-llama-server.sh ~/models/foo.gguf
      hermes chat --provider llama-cpp
  Prints an alignment hint when PORT/HOST diverge from the plugin default.
* `tests/providers/test_llama_cpp_profile.py` — 12 tests covering plugin
  registration, alias resolution end-to-end through hermes_cli.auth,
  CANONICAL_PROVIDERS auto-injection, PROVIDER_REGISTRY entry shape, picker
  surfacing in three modes (current+offline, no-clutter, alias resolution),
  and the offline-graceful fetch_models override.
* `tests/providers/test_plugin_discovery.py` — bumped expected profile count
  33 → 34.
* `website/docs/guides/local-llamacpp-setup.md` — user-facing setup guide
  modeled on the existing local-ollama-setup.md.
* `website/docs/reference/environment-variables.md` — documents
  LLAMA_CPP_API_KEY / LLAMA_CPP_BASE_URL and adds llama-cpp to the
  HERMES_INFERENCE_PROVIDER accepted-values list.

Test plan
=========

  pytest tests/providers/                                # 90 passed
  pytest tests/providers/test_llama_cpp_profile.py -v    # 12 passed
  pytest tests/hermes_cli/test_model_switch_custom_providers.py \
         tests/hermes_cli/test_user_providers_model_switch.py \
         tests/hermes_cli/test_custom_provider_model_switch.py \
         tests/hermes_cli/test_api_key_providers.py \
         tests/hermes_cli/test_auth_provider_gate.py     # 221 passed

Tested on macOS 26.4 (arm64). The launcher uses
`nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 8` so it works
on Linux + macOS + WSL2; not exercised on native Windows.

Notes
=====

* Existing PR NousResearch#19607 also adds `scripts/start-llama-server.sh`. The version
  in this PR supersedes that one — it's stripped of the Qwen-specific
  detection branches (this PR is intentionally generic-llama.cpp only) and
  reworded around the new `llama-cpp` provider's defaults. Whichever PR
  lands second will need a one-line conflict resolution.
* Does not include `tools/local_web_tools.py` — that's orthogonal web-search
  work and remains in NousResearch#19607.
Abd0r added a commit to Abd0r/hermes-agent that referenced this pull request May 15, 2026
llama.cpp's `llama-server` already speaks OpenAI chat-completions, so users
could already point Hermes at it via `--provider custom`. But "custom" means
they have to set OPENAI_BASE_URL by hand, the model picker doesn't list it,
and the dashboard has no way to surface the running server. This PR makes
llama.cpp a discoverable, zero-config backend.

What ships
==========

* `plugins/model-providers/llama-cpp/` — new ProviderProfile with default
  base_url `http://127.0.0.1:8088/v1`, aliases `llamacpp` / `llama.cpp` /
  `llama_cpp` / `llama-server`, and an offline-tolerant fetch_models override
  (returns None instead of raising when the local server is down).
* `hermes_cli/auth.py` — adds llama-cpp to PROVIDER_REGISTRY (modeled on
  the lmstudio entry: api_key auth_type with optional LLAMA_CPP_API_KEY +
  LLAMA_CPP_BASE_URL env vars). Removes the old llama.cpp/llamacpp/llama-cpp
  hardcoded aliases that pointed at `custom`, so the plugin's aliases win.
* `hermes_cli/models.py` — adds the same alias mappings to _PROVIDER_ALIASES
  so `--provider llama.cpp` resolves correctly through the CLI parser path.
* `hermes_cli/model_switch.py` — adds a probe-and-surface block in
  list_authenticated_providers, mirroring the existing lmstudio pattern.
  Three surfacing modes:
    1. Live probe: `${LLAMA_CPP_BASE_URL}/models` with a 300 ms cold-discovery
       timeout. If `llama-server` responds, the row appears with the loaded
       model. This is what makes the dashboard "magically" pick up a running
       server with no config.
    2. Hint mode: LLAMA_CPP_API_KEY or LLAMA_CPP_BASE_URL set, or current
       provider matches one of the aliases — 1.5 s timeout.
    3. Sticky current: when llama-cpp is the user's selected provider but
       the server is offline, the row still appears with current_model so
       the user doesn't lose access after restart.
  When no env vars, no current selection, and no live server, the row is
  not injected — keeps the picker tidy for non-llama.cpp users.
* `plugins/model-providers/custom/__init__.py` — drops the llamacpp / llama.cpp
  / llama-cpp aliases from the generic `custom` profile (they now belong to
  the dedicated provider).
* `scripts/start-llama-server.sh` — turnkey llama-server launcher whose
  default port (8088) lines up with the plugin's default base_url, so the
  end-to-end UX is just:
      ./scripts/start-llama-server.sh ~/models/foo.gguf
      hermes chat --provider llama-cpp
  Prints an alignment hint when PORT/HOST diverge from the plugin default.
* `tests/providers/test_llama_cpp_profile.py` — 12 tests covering plugin
  registration, alias resolution end-to-end through hermes_cli.auth,
  CANONICAL_PROVIDERS auto-injection, PROVIDER_REGISTRY entry shape, picker
  surfacing in three modes (current+offline, no-clutter, alias resolution),
  and the offline-graceful fetch_models override.
* `tests/providers/test_plugin_discovery.py` — bumped expected profile count
  33 → 34.
* `website/docs/guides/local-llamacpp-setup.md` — user-facing setup guide
  modeled on the existing local-ollama-setup.md.
* `website/docs/reference/environment-variables.md` — documents
  LLAMA_CPP_API_KEY / LLAMA_CPP_BASE_URL and adds llama-cpp to the
  HERMES_INFERENCE_PROVIDER accepted-values list.

Test plan
=========

  pytest tests/providers/                                # 90 passed
  pytest tests/providers/test_llama_cpp_profile.py -v    # 12 passed
  pytest tests/hermes_cli/test_model_switch_custom_providers.py \
         tests/hermes_cli/test_user_providers_model_switch.py \
         tests/hermes_cli/test_custom_provider_model_switch.py \
         tests/hermes_cli/test_api_key_providers.py \
         tests/hermes_cli/test_auth_provider_gate.py     # 221 passed

Tested on macOS 26.4 (arm64). The launcher uses
`nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 8` so it works
on Linux + macOS + WSL2; not exercised on native Windows.

Notes
=====

* Existing PR NousResearch#19607 also adds `scripts/start-llama-server.sh`. The version
  in this PR supersedes that one — it's stripped of the Qwen-specific
  detection branches (this PR is intentionally generic-llama.cpp only) and
  reworded around the new `llama-cpp` provider's defaults. Whichever PR
  lands second will need a one-line conflict resolution.
* Does not include `tools/local_web_tools.py` — that's orthogonal web-search
  work and remains in NousResearch#19607.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/tools Tool registry, model_tools, toolsets P3 Low — cosmetic, nice to have tool/web Web search and extraction type/feature New feature or request

Projects

None yet

2 participants