feat(tools): local_web_tools — free-tier web search/extract + llama-server launcher#19607
Closed
Abd0r wants to merge 3 commits into
Closed
feat(tools): local_web_tools — free-tier web search/extract + llama-server launcher#19607Abd0r wants to merge 3 commits into
Abd0r wants to merge 3 commits into
Conversation
Adds local_web_search_tool and local_web_extract_tool that mirror the JSON
contracts of web_search_tool / web_extract_tool but use free local-first
backends instead of paid APIs (Firecrawl, Parallel, Tavily, Exa, Gemini).
Search backend chain (auto-fallback):
1. SearXNG self-hosted (default http://localhost:8888)
2. Brave Search free tier (BRAVE_SEARCH_API_KEY)
3. Tavily free tier (TAVILY_API_KEY)
4. ddgr CLI
5. ddgs / duckduckgo_search Python package
Extraction:
- lynx -dump with boilerplate stripping (nav menus, button labels,
iframe markers, captcha blocks, cookie notices)
- Optional Ollama-based summarization (zero API cost)
Drop-in compatible: skills calling web_search/web_extract behave identically
when pointed at local_web_search/local_web_extract.
Self-test: python3 -m tools.local_web_tools (smoke test included).
Closes free-tier gap for users without paid web-API keys.
…support
Breaking changes from initial commit:
- OLLAMA_URL renamed to LLM_BASE_URL (backward-compat: OLLAMA_URL still honored)
- Added LLM_DEFAULT_MODEL env var
- Renamed _summarize_ollama() to _summarize_via_local_llm()
New behavior:
- Auto-detect Qwen3.5/3.6 model tags via _is_qwen35_or_36(); when detected,
automatically pass chat_template_kwargs.enable_thinking=false in the request
payload. Critical because Qwen3.5/3.6 default to thinking mode and do NOT
honor the /think /no_think directives that worked on Qwen3.
- Now compatible with any OpenAI-compat /v1/chat/completions endpoint:
* Ollama (default http://localhost:11434)
* llama.cpp (llama-server, default http://localhost:8080)
* vLLM (default http://localhost:8000)
* LM Studio (default http://localhost:1234)
- Replaced legacy duckduckgo_search package with new ddgs name; falls back
to legacy package for backward compat.
Validated end-to-end against Qwen3.5-4B-Q4_K_M via llama-server b9010
(May 2026 release) with --jinja flag — produces valid tool-call sequences and
clean cited research reports.
Refs Qwen3.5 model card guidance:
https://huggingface.co/Qwen/Qwen3.5-9B
Companion script to tools/local_web_tools.py. Boots llama.cpp's llama-server with the correct flags for OpenAI-compatible local inference, with first-class Qwen3.5/3.6 detection from the GGUF filename. Defaults: - port 8088, ctx 16384, threads = nproc - --jinja (required for Qwen3.5/3.6 chat-template + tool calling) - --n-gpu-layers 0 (CPU; override via N_GPU_LAYERS=-1 for all-on-GPU) Detects Qwen3.5/3.6 from filename and prints the required client-side flag (chat_template_kwargs.enable_thinking=false) per the official model card — since Qwen3.5+ default to thinking mode and ignore /think /no_think. Useful out of the box for any Hermes tool that talks to an OpenAI-compatible endpoint (point LLM_BASE_URL at http://127.0.0.1:8088). Friendly errors when: - GGUF path missing or invalid - llama-server not on PATH (with install hints for Linux/macOS/pip) Standalone — no skill or tool dependency. MIT.
4 tasks
This was referenced May 4, 2026
15 tasks
Abd0r
added a commit
to Abd0r/hermes-agent
that referenced
this pull request
May 15, 2026
llama.cpp's `llama-server` already speaks OpenAI chat-completions, so users could already point Hermes at it via `--provider custom`. But "custom" means they have to set OPENAI_BASE_URL by hand, the model picker doesn't list it, and the dashboard has no way to surface the running server. This PR makes llama.cpp a discoverable, zero-config backend. What ships ========== * `plugins/model-providers/llama-cpp/` — new ProviderProfile with default base_url `http://127.0.0.1:8088/v1`, aliases `llamacpp` / `llama.cpp` / `llama_cpp` / `llama-server`, and an offline-tolerant fetch_models override (returns None instead of raising when the local server is down). * `hermes_cli/auth.py` — adds llama-cpp to PROVIDER_REGISTRY (modeled on the lmstudio entry: api_key auth_type with optional LLAMA_CPP_API_KEY + LLAMA_CPP_BASE_URL env vars). Removes the old llama.cpp/llamacpp/llama-cpp hardcoded aliases that pointed at `custom`, so the plugin's aliases win. * `hermes_cli/models.py` — adds the same alias mappings to _PROVIDER_ALIASES so `--provider llama.cpp` resolves correctly through the CLI parser path. * `hermes_cli/model_switch.py` — adds a probe-and-surface block in list_authenticated_providers, mirroring the existing lmstudio pattern. Three surfacing modes: 1. Live probe: `${LLAMA_CPP_BASE_URL}/models` with a 300 ms cold-discovery timeout. If `llama-server` responds, the row appears with the loaded model. This is what makes the dashboard "magically" pick up a running server with no config. 2. Hint mode: LLAMA_CPP_API_KEY or LLAMA_CPP_BASE_URL set, or current provider matches one of the aliases — 1.5 s timeout. 3. Sticky current: when llama-cpp is the user's selected provider but the server is offline, the row still appears with current_model so the user doesn't lose access after restart. When no env vars, no current selection, and no live server, the row is not injected — keeps the picker tidy for non-llama.cpp users. * `plugins/model-providers/custom/__init__.py` — drops the llamacpp / llama.cpp / llama-cpp aliases from the generic `custom` profile (they now belong to the dedicated provider). * `scripts/start-llama-server.sh` — turnkey llama-server launcher whose default port (8088) lines up with the plugin's default base_url, so the end-to-end UX is just: ./scripts/start-llama-server.sh ~/models/foo.gguf hermes chat --provider llama-cpp Prints an alignment hint when PORT/HOST diverge from the plugin default. * `tests/providers/test_llama_cpp_profile.py` — 12 tests covering plugin registration, alias resolution end-to-end through hermes_cli.auth, CANONICAL_PROVIDERS auto-injection, PROVIDER_REGISTRY entry shape, picker surfacing in three modes (current+offline, no-clutter, alias resolution), and the offline-graceful fetch_models override. * `tests/providers/test_plugin_discovery.py` — bumped expected profile count 33 → 34. * `website/docs/guides/local-llamacpp-setup.md` — user-facing setup guide modeled on the existing local-ollama-setup.md. * `website/docs/reference/environment-variables.md` — documents LLAMA_CPP_API_KEY / LLAMA_CPP_BASE_URL and adds llama-cpp to the HERMES_INFERENCE_PROVIDER accepted-values list. Test plan ========= pytest tests/providers/ # 90 passed pytest tests/providers/test_llama_cpp_profile.py -v # 12 passed pytest tests/hermes_cli/test_model_switch_custom_providers.py \ tests/hermes_cli/test_user_providers_model_switch.py \ tests/hermes_cli/test_custom_provider_model_switch.py \ tests/hermes_cli/test_api_key_providers.py \ tests/hermes_cli/test_auth_provider_gate.py # 221 passed Tested on macOS 26.4 (arm64). The launcher uses `nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 8` so it works on Linux + macOS + WSL2; not exercised on native Windows. Notes ===== * Existing PR NousResearch#19607 also adds `scripts/start-llama-server.sh`. The version in this PR supersedes that one — it's stripped of the Qwen-specific detection branches (this PR is intentionally generic-llama.cpp only) and reworded around the new `llama-cpp` provider's defaults. Whichever PR lands second will need a one-line conflict resolution. * Does not include `tools/local_web_tools.py` — that's orthogonal web-search work and remains in NousResearch#19607.
Abd0r
added a commit
to Abd0r/hermes-agent
that referenced
this pull request
May 15, 2026
llama.cpp's `llama-server` already speaks OpenAI chat-completions, so users could already point Hermes at it via `--provider custom`. But "custom" means they have to set OPENAI_BASE_URL by hand, the model picker doesn't list it, and the dashboard has no way to surface the running server. This PR makes llama.cpp a discoverable, zero-config backend. What ships ========== * `plugins/model-providers/llama-cpp/` — new ProviderProfile with default base_url `http://127.0.0.1:8088/v1`, aliases `llamacpp` / `llama.cpp` / `llama_cpp` / `llama-server`, and an offline-tolerant fetch_models override (returns None instead of raising when the local server is down). * `hermes_cli/auth.py` — adds llama-cpp to PROVIDER_REGISTRY (modeled on the lmstudio entry: api_key auth_type with optional LLAMA_CPP_API_KEY + LLAMA_CPP_BASE_URL env vars). Removes the old llama.cpp/llamacpp/llama-cpp hardcoded aliases that pointed at `custom`, so the plugin's aliases win. * `hermes_cli/models.py` — adds the same alias mappings to _PROVIDER_ALIASES so `--provider llama.cpp` resolves correctly through the CLI parser path. * `hermes_cli/model_switch.py` — adds a probe-and-surface block in list_authenticated_providers, mirroring the existing lmstudio pattern. Three surfacing modes: 1. Live probe: `${LLAMA_CPP_BASE_URL}/models` with a 300 ms cold-discovery timeout. If `llama-server` responds, the row appears with the loaded model. This is what makes the dashboard "magically" pick up a running server with no config. 2. Hint mode: LLAMA_CPP_API_KEY or LLAMA_CPP_BASE_URL set, or current provider matches one of the aliases — 1.5 s timeout. 3. Sticky current: when llama-cpp is the user's selected provider but the server is offline, the row still appears with current_model so the user doesn't lose access after restart. When no env vars, no current selection, and no live server, the row is not injected — keeps the picker tidy for non-llama.cpp users. * `plugins/model-providers/custom/__init__.py` — drops the llamacpp / llama.cpp / llama-cpp aliases from the generic `custom` profile (they now belong to the dedicated provider). * `scripts/start-llama-server.sh` — turnkey llama-server launcher whose default port (8088) lines up with the plugin's default base_url, so the end-to-end UX is just: ./scripts/start-llama-server.sh ~/models/foo.gguf hermes chat --provider llama-cpp Prints an alignment hint when PORT/HOST diverge from the plugin default. * `tests/providers/test_llama_cpp_profile.py` — 12 tests covering plugin registration, alias resolution end-to-end through hermes_cli.auth, CANONICAL_PROVIDERS auto-injection, PROVIDER_REGISTRY entry shape, picker surfacing in three modes (current+offline, no-clutter, alias resolution), and the offline-graceful fetch_models override. * `tests/providers/test_plugin_discovery.py` — bumped expected profile count 33 → 34. * `website/docs/guides/local-llamacpp-setup.md` — user-facing setup guide modeled on the existing local-ollama-setup.md. * `website/docs/reference/environment-variables.md` — documents LLAMA_CPP_API_KEY / LLAMA_CPP_BASE_URL and adds llama-cpp to the HERMES_INFERENCE_PROVIDER accepted-values list. Test plan ========= pytest tests/providers/ # 90 passed pytest tests/providers/test_llama_cpp_profile.py -v # 12 passed pytest tests/hermes_cli/test_model_switch_custom_providers.py \ tests/hermes_cli/test_user_providers_model_switch.py \ tests/hermes_cli/test_custom_provider_model_switch.py \ tests/hermes_cli/test_api_key_providers.py \ tests/hermes_cli/test_auth_provider_gate.py # 221 passed Tested on macOS 26.4 (arm64). The launcher uses `nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 8` so it works on Linux + macOS + WSL2; not exercised on native Windows. Notes ===== * Existing PR NousResearch#19607 also adds `scripts/start-llama-server.sh`. The version in this PR supersedes that one — it's stripped of the Qwen-specific detection branches (this PR is intentionally generic-llama.cpp only) and reworded around the new `llama-cpp` provider's defaults. Whichever PR lands second will need a one-line conflict resolution. * Does not include `tools/local_web_tools.py` — that's orthogonal web-search work and remains in NousResearch#19607.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This is the tool-only split of #19341 (which I'm closing in favor of this). #19341 also added a
skills/research/deep-research/SKILL.mdthat overlapped with @vominh1919's existing #13412; that overlap is removed here so this PR can land independently of any deep-research methodology decision.What's left in this PR is purely additive infrastructure — a free-tier counterpart to
web_tools.pyplus a turnkey llama.cpp launcher. Useful on its own; composes with whichever deep-research skill (or other research tool) ends up shipping.What this PR adds
tools/local_web_tools.py(552 lines) — drop-in free-tier counterpart toweb_tools.py:local_web_search/local_web_extractare interchangeable withweb_search/web_extract._is_qwen35_or_36()and applieschat_template_kwargs={"enable_thinking": false}because per the official Qwen3.5 model card these models do not honor/think/no_thinkthe way Qwen3 did — only the chat-template flag works.$LLM_BASE_URLworks for Ollama / llama.cpp's llama-server / vLLM / LM Studio interchangeably.python3 -m tools.local_web_tools(smoke).scripts/start-llama-server.sh(109 lines) — turnkey llama.cpp launcher:--jinja, ctx 16384, port 8088).nprocfallback for non-Linux (macOS).llama-servernot on PATH (with install hints for Linux/macOS/pip).Why split
#19341 bundled this with a deep-research skill. @alt-glitch correctly pointed out the skill overlapped with @vominh1919's #13412 (open since 2026-04-21). Splitting lets the tool + launcher land on their own merits, and lets #13412's methodology PR proceed without coordination overhead. If @vominh1919 wants to lift the Qwen3.5/3.6 client-side notes into their SKILL.md after this lands, I'm happy to send a small follow-up; otherwise the operational quirks live cleanly in
tools/local_web_tools.pyitself.Related issues
This PR closes three open feature requests by shipping their requested backends, and partially addresses two more.
Closes (auto-close on merge):
local_web_tools.pyincludes Brave free tier in its fallback chain.local_web_tools.pyhas SearXNG as the first backend in its chain.Addresses (does not auto-close):
local_web_tools.pyships a multi-backend fallback chain (SearXNG → Brave free → Tavily free → ddgr → ddgs) all sharing the same JSON contract asweb_search. Doesn't add a user-configurable arbitrary JSON endpoint, but the multi-backend infrastructure is there.scripts/start-llama-server.shprovides the llama.cpp piece with auto-detected Qwen3.5/3.6 setup, sane defaults, and friendly errors. Ollama and vLLM remain out of scope here.If maintainers prefer a different close/keep-open call on any of these, happy to adjust.
Two implementation options — maintainers' choice
This PR currently ships Option A, which is the lower-risk drop-in. Option B is functionally equivalent but a cleaner long-term design. Happy to refactor on request.
Option A — parallel module (this PR as-is)
tools/local_web_tools.py(552 lines), new toolslocal_web_search/local_web_extract.tools/web_tools.py.web_search/web_extractusers; trivial to audit; trivial to revert.Option B — integrate into existing
web_tools.py(also open: #19796)_get_backend()priority chain.web_search/web_extractgain free-tier fallbacks transparently.hermes setup/hermes toolsinteractive selector viahermes_cli/tools_config.py.Reviewers can pick whichever of #19607 / #19796 is cleaner to merge; the other will be closed as superseded.
Files changed
tools/local_web_tools.py— new (552)scripts/start-llama-server.sh— new (109, executable)No existing files modified.
Test plan
Validated on two platforms with the same llama.cpp build (b9010, May 2026 release):
Ubuntu 24.04 (x86_64, RTX 4050 Laptop GPU)
python3 -m tools.local_web_toolsscripts/start-llama-server.sh ~/models/Qwen3.5-4B-Q4_K_M.gguf— serves onhttp://127.0.0.1:8088/v1/chat/completionsLLM_BASE_URL=http://127.0.0.1:8088produces valid cited research reportsmacOS 26 Tahoe (Apple Silicon M2)
brew install llama.cpp→llama-serverresolves on PATHscripts/start-llama-server.shboots Qwen3.5-4B-Q4_K_M cleanly on Metal backend (3.7 GB GPU memory) — server listening within 19s/v1/modelsand/propsboth respond; chat template loads withthinking=1(auto-detected from Qwen3.5 GGUF)nprocfallback path exercised (macOS lacksnproc;THREADSdefaults to8per script)python3 -m tools.local_web_toolssmoke on macOS —ddg-pythonbackend (via pip-installedddgs) returned 3 real DuckDuckGo results; lynx extraction path exercised (brew install lynx).llamacpp_localprovider (api: http://127.0.0.1:8089/v1, transport: openai_chat):hermes chat -q '...' --provider llamacpp_local -m Qwen3.5-4B-Q4_K_M.gguf -Qproduces valid completion (session opens, response returned, session closes cleanly)/v1/modelsemitsowned_by: llamacpp;/propsreports correctn_ctx/v1/chat/completionswithtool_choice: "required"+chat_template_kwargs: {enable_thinking: false}→ Qwen3.5-4B-Q4_K_M emits proper structuredtool_callsJSON (verified the model handles tool calling correctly when the chat-template flag is passed)CI: will fix anything
pytest tests/flags.License
MIT (auto per
CONTRIBUTING.md).