fix(agent): add qwen and deepseek to TOOL_USE_ENFORCEMENT_MODELS by briandevans · Pull Request #28195 · NousResearch/hermes-agent

briandevans · 2026-05-18T20:41:50Z

What does this PR do?

When agent.tool_use_enforcement is left at its default "auto", agent/system_prompt.py only injects TOOL_USE_ENFORCEMENT_GUIDANCE if the active model name contains a substring from TOOL_USE_ENFORCEMENT_MODELS in agent/prompt_builder.py:271. The tuple was ("gpt", "codex", "gemini", "gemma", "grok", "glm") — both qwen and deepseek were missing, even though both families exhibit the exact failure mode the enforcement prompt was written for (describing intended actions instead of calling tools, hallucinating execution, ignoring existing context/memory, silently stopping mid-task).

This PR adds "qwen" and "deepseek" to the tuple, mirroring the established additive pattern in this area (merged #5595 added grok, #24715 added glm, #27797 widened grok to xai-oauth).

Mirrors agent/system_prompt.py::_build_system_prompt matching logic (substring in model_lower against the tuple) — this is the single source of truth for tool_use_enforcement="auto", so no other input precedence chain needs covering.
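The matching rule described above can be sketched as follows. This is a minimal illustration, not the project's code: `should_inject_guidance` is a hypothetical helper, and treating explicit (non-"auto") settings as plain booleans is an assumption. Only the tuple contents and the lowercase substring check come from this PR.

```python
# Tuple contents as they stand after this PR (per the description above).
TOOL_USE_ENFORCEMENT_MODELS = (
    "gpt", "codex", "gemini", "gemma", "grok", "glm", "qwen", "deepseek",
)


def should_inject_guidance(model: str, setting: object = "auto") -> bool:
    """Hypothetical helper: decide whether the guidance is added to the prompt."""
    if setting != "auto":
        # Assumption: an explicit value is a simple on/off boolean.
        return setting is True
    # Under "auto", any listed substring in the lowercased model name matches.
    model_lower = model.lower()
    return any(pattern in model_lower for pattern in TOOL_USE_ENFORCEMENT_MODELS)


if __name__ == "__main__":
    for name in ("qwen/qwen-plus", "deepseek/deepseek-r1", "anthropic/claude"):
        print(name, should_inject_guidance(name))
```

Because the check is a plain substring test, provider prefixes and version suffixes in the model name do not affect the result.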

The "robust" alternative from the issue (default-true for all models) is intentionally not taken — it would silently flip behavior for users who currently rely on auto leaving Claude and other non-listed families unsteered, and every prior merged change in this area has been additive.

Related Issue

Fixes #28079

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
🔒 Security fix
📝 Documentation update
✅ Tests (adding or improving test coverage)
♻️ Refactor (no behavior change)
🎯 New skill (bundled or hub)

Changes Made

agent/prompt_builder.py — append "qwen" and "deepseek" to TOOL_USE_ENFORCEMENT_MODELS. 1-line tuple edit.
tests/agent/test_prompt_builder.py — add test_enforcement_models_includes_qwen and test_enforcement_models_includes_deepseek, mirroring the existing _includes_gpt / _includes_codex / _includes_grok pattern.
tests/run_agent/test_run_agent.py — add test_auto_injects_for_qwen (model qwen/qwen3.6-plus) and test_auto_injects_for_deepseek (model deepseek/deepseek-r1) in TestToolUseEnforcementConfig, confirming that tool_use_enforcement="auto" now causes the guidance string to appear in the system prompt for both families.
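The shape of the added tests can be sketched roughly as below. To stay self-contained, the sketch relies on stand-ins: `build_system_prompt` and the placeholder guidance text are hypothetical, whereas the real tests import from `agent.prompt_builder` and construct an agent via `self._make_agent`.

```python
# Stand-ins so the sketch runs on its own; the real values live in
# agent/prompt_builder.py and are not reproduced here.
TOOL_USE_ENFORCEMENT_MODELS = (
    "gpt", "codex", "gemini", "gemma", "grok", "glm", "qwen", "deepseek",
)
TOOL_USE_ENFORCEMENT_GUIDANCE = "<placeholder guidance text>"


def build_system_prompt(model: str) -> str:
    """Hypothetical stand-in for prompt assembly under tool_use_enforcement="auto"."""
    prompt = "You are a helpful agent."
    if any(p in model.lower() for p in TOOL_USE_ENFORCEMENT_MODELS):
        prompt += "\n\n" + TOOL_USE_ENFORCEMENT_GUIDANCE
    return prompt


# Unit level: the tuple itself contains the new entries.
def test_enforcement_models_includes_qwen():
    assert "qwen" in TOOL_USE_ENFORCEMENT_MODELS


def test_enforcement_models_includes_deepseek():
    assert "deepseek" in TOOL_USE_ENFORCEMENT_MODELS


# Integration level: the guidance string ends up in the assembled prompt.
def test_auto_injects_for_qwen():
    assert TOOL_USE_ENFORCEMENT_GUIDANCE in build_system_prompt("qwen/qwen-plus")


def test_auto_injects_for_deepseek():
    assert TOOL_USE_ENFORCEMENT_GUIDANCE in build_system_prompt("deepseek/deepseek-r1")
```

Removing `"qwen"` and `"deepseek"` from the stand-in tuple makes all four checks fail, which is the same regression-guard property the PR claims for the real tests.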

How to Test

Regression guard (proves the fix is necessary):

git stash push -- agent/prompt_builder.py
uv run --with pytest --with pytest-xdist --with pytest-asyncio python3 -m pytest \
  tests/run_agent/test_run_agent.py::TestToolUseEnforcementConfig::test_auto_injects_for_qwen \
  tests/run_agent/test_run_agent.py::TestToolUseEnforcementConfig::test_auto_injects_for_deepseek \
  tests/agent/test_prompt_builder.py::TestToolUseEnforcementGuidance::test_enforcement_models_includes_qwen \
  tests/agent/test_prompt_builder.py::TestToolUseEnforcementGuidance::test_enforcement_models_includes_deepseek -v
# 4 failures (regression proved)
git stash pop

With the production fix:

uv run --with pytest --with pytest-xdist --with pytest-asyncio python3 -m pytest \
  tests/agent/test_prompt_builder.py::TestToolUseEnforcementGuidance \
  tests/run_agent/test_run_agent.py::TestToolUseEnforcementConfig -v
# 9 + 18 = 27 passed

Real-world: set model.default: qwen/qwen3.6-plus (or any deepseek model), leave agent.tool_use_enforcement at its default "auto", run hermes chat -q "list files in this directory". Before the fix the agent often replies with a narrated plan and no tool call; after the fix TOOL_USE_ENFORCEMENT_GUIDANCE appears in the system prompt, which steers the model toward issuing the tool call instead of narrating it.

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(agent):)
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix (no unrelated commits)
I've run focused tests for the touched code and all pass (9 unit + 18 integration tests in the two test classes)
I've added tests for my changes — 4 new tests that all fail without the production change
I've tested on my platform: macOS 26.x, Python 3.11

Documentation & Housekeeping

I've updated relevant documentation — N/A (the tuple has an inline # Add new patterns here when a model family needs explicit steering. comment that already documents the maintenance pattern)
I've updated cli-config.yaml.example if I added/changed config keys — N/A (no config keys changed; the existing agent.tool_use_enforcement key behaviour is unchanged for explicit values, only the auto default is widened)
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
I've considered cross-platform impact — N/A (pure string-substring match, no platform-specific code paths)
I've updated tool descriptions/schemas if I changed tool behavior — N/A

Sibling code paths that may need the same fix

The issue body explicitly flags mistral and llama as potentially affected (no concrete repro provided). I left them out of this PR's scope to keep the diff minimal and the additions evidence-driven — happy to widen to ("mistral", "llama") during cherry-pick if the same chatty-narration failure mode is confirmed on those families.
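One way to see which families fall outside the current tuple is to list the patterns that match a given model name. The helper `matching_patterns` and the Mistral and Llama model identifiers below are illustrative assumptions, not taken from the repository.

```python
# Tuple contents as they stand after this PR.
TOOL_USE_ENFORCEMENT_MODELS = (
    "gpt", "codex", "gemini", "gemma", "grok", "glm", "qwen", "deepseek",
)


def matching_patterns(model: str, patterns=TOOL_USE_ENFORCEMENT_MODELS) -> list[str]:
    """Hypothetical helper: return every pattern found in the lowercased model name."""
    model_lower = model.lower()
    return [p for p in patterns if p in model_lower]


# Illustrative model identifiers (assumptions, chosen only to exercise the check).
candidates = ["mistralai/mistral-large", "meta-llama/llama-3-70b", "deepseek/deepseek-r1"]
for name in candidates:
    hits = matching_patterns(name)
    print(f"{name}: {hits if hits else 'no match, guidance not injected under auto'}")

# What the proposed follow-up widening would look like.
widened = TOOL_USE_ENFORCEMENT_MODELS + ("mistral", "llama")
```

With the current tuple the two sibling families produce no match, so covering them would take the same kind of additive edit as this PR.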

Screenshots / Logs

N/A — guidance-string injection has no UI surface; the change is observable only via system-prompt assembly, which is exercised by the new tests.

When `agent.tool_use_enforcement` is `"auto"` (the default), the runtime checks the active model name against `TOOL_USE_ENFORCEMENT_MODELS` in `agent/prompt_builder.py` and only injects `TOOL_USE_ENFORCEMENT_GUIDANCE` if a substring matches. Qwen and DeepSeek hit the same chatty/hallucinatory failure mode as GPT, Codex, Grok, and GLM (describing intended actions instead of calling tools, ignoring memory, silently stopping mid-execution), but neither substring was in the tuple, so the enforcement prompt was never injected for users on those families, even with `auto` left at its default.

Add `"qwen"` and `"deepseek"` to the tuple, matching the established additive pattern (NousResearch#5595 added grok, NousResearch#24715 added glm, NousResearch#27797 widened grok to xai-oauth).

Add four regression-guard tests that fail before the production change and pass after: two unit assertions in `test_prompt_builder.py` mirroring the existing grok/gpt checks, and two integration tests in `test_run_agent.py` confirming that a qwen/deepseek model under `tool_use_enforcement="auto"` now gets the guidance string in its system prompt.

The "robust" alternative from the issue (default-true for all models) is intentionally not taken: it would silently flip behavior for users who currently rely on `auto` leaving Claude / non-listed families unsteered, and the maintainer's prior merged work in this area is uniformly additive.

Fixes NousResearch#28079

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds Qwen and DeepSeek model families to the tool-use enforcement list so that their system prompts include enforcement guidance by default.

Changes:

Append "qwen" and "deepseek" to TOOL_USE_ENFORCEMENT_MODELS.
Add unit tests verifying the tuple includes the new entries.
Add integration tests verifying enforcement guidance is injected for Qwen and DeepSeek models under auto mode.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| agent/prompt_builder.py | Adds `qwen` and `deepseek` substrings to the enforcement models tuple. |
| tests/agent/test_prompt_builder.py | Adds membership tests for new model substrings. |
| tests/run_agent/test_run_agent.py | Adds auto-injection tests for Qwen and DeepSeek model IDs. |

+    def test_auto_injects_for_qwen(self):
+        """Qwen models default to chatty/hallucinatory tool use without enforcement."""
+        from agent.prompt_builder import TOOL_USE_ENFORCEMENT_GUIDANCE
+        agent = self._make_agent(model="qwen/qwen3.6-plus", tool_use_enforcement="auto")


Copilot flagged that `qwen/qwen3.6-plus` is not a real Qwen model identifier (no such version exists). Substring matching only needs "qwen" so the test still proves the path, but using a name that matches the issue body (`qwen-plus`, Alibaba Cloud) is clearer.

briandevans · 2026-05-18T20:48:07Z

@copilot Addressed in commit 9433eab: switched the qwen integration-test model identifier from qwen/qwen3.6-plus (not a real version) to qwen/qwen-plus (the actual Alibaba Cloud model referenced in the issue body). Substring matching is unchanged, but the test now reads as a representative deployment.

alt-glitch · 2026-05-18T20:53:17Z

Competing fix for #28079 alongside #28081. Both add qwen and deepseek to TOOL_USE_ENFORCEMENT_MODELS; #28081 additionally (and redundantly) adds claude/anthropic. This PR is cleaner in scope but covers the same ground.

briandevans · 2026-05-18T21:13:48Z

Thanks @alt-glitch — flagging the overlap is appreciated. Quick positioning note for the maintainer to choose between:

This PR (#28195, fix(agent): add qwen and deepseek to TOOL_USE_ENFORCEMENT_MODELS): scope is exactly what issue #28079 ([Bug/Regression] tool_use_enforcement auto-mode excludes Qwen/DeepSeek causing hallucination) calls out (qwen + deepseek), plus 4 new tests (2 unit, 2 integration) that all fail without the production line, and 1 regression-guard reproduction in the PR body. Mirrors the established additive pattern: #5595 (feat: add grok to TOOL_USE_ENFORCEMENT_MODELS for direct xAI usage), #24715 (fix(prompt_builder): inject tool-use enforcement for GLM models), #27797 (feat(grok): apply OpenAI execution guidance to xAI Grok / xai-oauth models).
#28081 (fix: add qwen, deepseek, claude to TOOL_USE_ENFORCEMENT_MODELS): also adds claude + anthropic, which is a separable behaviour change for users currently relying on auto leaving Claude unsteered, and ships without tests.

Happy to defer to #28081 if widening to Claude/Anthropic is the desired direction; happy to widen this PR to mistral/llama during cherry-pick if the issue body's tentative mention of those families is worth covering too.

teknium1 · 2026-05-19T03:06:55Z

Merged via PR #28348 (squashed your 2-commit stack into one with your authorship preserved via rebase-merge — commit 7569007). Thanks for catching this, and for the explicit test coverage on the new entries.

Copilot AI review requested due to automatic review settings May 18, 2026 20:41

Copilot AI reviewed May 18, 2026

alt-glitch added labels on May 18, 2026: type/bug (Something isn't working), comp/agent (Core agent loop, run_agent.py, prompt builder), provider/qwen (Qwen / Alibaba Cloud (OAuth)), provider/deepseek (DeepSeek API), P2 (Medium; degraded but workaround exists)

This was referenced May 19, 2026

fix: add qwen, deepseek, claude to TOOL_USE_ENFORCEMENT_MODELS #28081 (Closed)

fix(agent): add qwen and deepseek to TOOL_USE_ENFORCEMENT_MODELS (#28195) #28348 (Merged)

teknium1 closed this May 19, 2026

briandevans wants to merge 2 commits into NousResearch:main from briandevans:fix/tool-use-enforcement-add-qwen-deepseek-28079

4 participants
