fix(agent): add qwen and deepseek to TOOL_USE_ENFORCEMENT_MODELS#28195

Closed
briandevans wants to merge 2 commits into NousResearch:main from briandevans:fix/tool-use-enforcement-add-qwen-deepseek-28079

Conversation

@briandevans

Contributor

What does this PR do?

When agent.tool_use_enforcement is left at its default "auto", agent/system_prompt.py only injects TOOL_USE_ENFORCEMENT_GUIDANCE if the active model name contains a substring from TOOL_USE_ENFORCEMENT_MODELS in agent/prompt_builder.py:271. The tuple was ("gpt", "codex", "gemini", "gemma", "grok", "glm") — both qwen and deepseek were missing, even though both families exhibit the exact failure mode the enforcement prompt was written for (describing intended actions instead of calling tools, hallucinating execution, ignoring existing context/memory, silently stopping mid-task).

This PR adds "qwen" and "deepseek" to the tuple, mirroring the established additive pattern in this area (merged #5595 added grok, #24715 added glm, #27797 widened grok to xai-oauth).

This mirrors the matching logic in agent/system_prompt.py::_build_system_prompt (a substring check of each tuple entry against model_lower). That check is the single source of truth for tool_use_enforcement="auto", so no other input precedence chain needs covering.
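
For readers who want to see the effect of the tuple change in isolation, here is a minimal sketch of the substring check described above. The helper name `matches_enforcement_family` is hypothetical and the real logic lives inside `_build_system_prompt`; only the tuple contents are taken from this description, and `openai/gpt-4o` is an illustrative model ID.

```python
# Tuple contents before and after this PR, as quoted in the description.
MODELS_BEFORE = ("gpt", "codex", "gemini", "gemma", "grok", "glm")
MODELS_AFTER = MODELS_BEFORE + ("qwen", "deepseek")


def matches_enforcement_family(model: str, families: tuple) -> bool:
    """Hypothetical helper: True if any family substring occurs in the lowercased model name."""
    model_lower = model.lower()
    return any(family in model_lower for family in families)


# Compare the old and new tuples for a few model IDs.
for model in ("qwen/qwen-plus", "deepseek/deepseek-r1", "openai/gpt-4o"):
    before = matches_enforcement_family(model, MODELS_BEFORE)
    after = matches_enforcement_family(model, MODELS_AFTER)
    print(f"{model}: before={before} after={after}")
```

Under these assumptions the qwen and deepseek IDs only match once the two new entries are present, while the gpt ID matches in both cases.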

The "robust" alternative from the issue (default-true for all models) is intentionally not taken — it would silently flip behavior for users who currently rely on auto leaving Claude and other non-listed families unsteered, and every prior merged change in this area has been additive.

Related Issue

Fixes #28079

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • agent/prompt_builder.py — append "qwen" and "deepseek" to TOOL_USE_ENFORCEMENT_MODELS. 1-line tuple edit.
  • tests/agent/test_prompt_builder.py — add test_enforcement_models_includes_qwen and test_enforcement_models_includes_deepseek, mirroring the existing _includes_gpt / _includes_codex / _includes_grok pattern.
  • tests/run_agent/test_run_agent.py — add test_auto_injects_for_qwen (model qwen/qwen3.6-plus) and test_auto_injects_for_deepseek (model deepseek/deepseek-r1) in TestToolUseEnforcementConfig, confirming that tool_use_enforcement="auto" now causes the guidance string to appear in the system prompt for both families.
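
The shape of the four new tests can be sketched roughly as follows. This is a self-contained approximation, not the code in the diff: `build_system_prompt` is a hypothetical stand-in for the real prompt assembly, the guidance text is a placeholder, and the actual integration tests go through `self._make_agent(...)` in `TestToolUseEnforcementConfig`.

```python
# Stand-ins so the sketch runs on its own; the real constants live in agent/prompt_builder.py.
TOOL_USE_ENFORCEMENT_MODELS = ("gpt", "codex", "gemini", "gemma", "grok", "glm", "qwen", "deepseek")
TOOL_USE_ENFORCEMENT_GUIDANCE = "<placeholder guidance text>"


def build_system_prompt(model: str, tool_use_enforcement: str = "auto") -> str:
    """Hypothetical stand-in for prompt assembly; only the "auto" branch is modelled."""
    prompt = "You are a helpful agent."
    matched = any(s in model.lower() for s in TOOL_USE_ENFORCEMENT_MODELS)
    if tool_use_enforcement == "auto" and matched:
        prompt += "\n\n" + TOOL_USE_ENFORCEMENT_GUIDANCE
    return prompt


# Unit-level checks: the tuple contains the new entries.
def test_enforcement_models_includes_qwen():
    assert "qwen" in TOOL_USE_ENFORCEMENT_MODELS


def test_enforcement_models_includes_deepseek():
    assert "deepseek" in TOOL_USE_ENFORCEMENT_MODELS


# Integration-style checks: "auto" puts the guidance string into the prompt.
def test_auto_injects_for_qwen():
    assert TOOL_USE_ENFORCEMENT_GUIDANCE in build_system_prompt("qwen/qwen-plus")


def test_auto_injects_for_deepseek():
    assert TOOL_USE_ENFORCEMENT_GUIDANCE in build_system_prompt("deepseek/deepseek-r1")
```

Removing `"qwen"` and `"deepseek"` from the stand-in tuple makes all four functions fail, which is the same regression-guard property the real tests are meant to have.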

How to Test

  1. Regression guard (proves the fix is necessary):
    git stash push -- agent/prompt_builder.py
    uv run --with pytest --with pytest-xdist --with pytest-asyncio python3 -m pytest \
      tests/run_agent/test_run_agent.py::TestToolUseEnforcementConfig::test_auto_injects_for_qwen \
      tests/run_agent/test_run_agent.py::TestToolUseEnforcementConfig::test_auto_injects_for_deepseek \
      tests/agent/test_prompt_builder.py::TestToolUseEnforcementGuidance::test_enforcement_models_includes_qwen \
      tests/agent/test_prompt_builder.py::TestToolUseEnforcementGuidance::test_enforcement_models_includes_deepseek -v
    # 4 failures (regression proved)
    git stash pop
  2. With the production fix:
    uv run --with pytest --with pytest-xdist --with pytest-asyncio python3 -m pytest \
      tests/agent/test_prompt_builder.py::TestToolUseEnforcementGuidance \
      tests/run_agent/test_run_agent.py::TestToolUseEnforcementConfig -v
    # 9 + 18 = 27 passed
  3. Real-world: set model.default: qwen/qwen3.6-plus (or any deepseek model), leave agent.tool_use_enforcement at its default "auto", run hermes chat -q "list files in this directory". Before the fix the agent often replies with a narrated plan and no tool call; after the fix TOOL_USE_ENFORCEMENT_GUIDANCE appears in the system prompt and tool use is enforced.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(agent):)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix (no unrelated commits)
  • I've run focused tests for the touched code and all pass (9 unit + 18 integration assertions in the two test classes)
  • I've added tests for my changes — 4 new tests that all fail without the production change
  • I've tested on my platform: macOS 26.x, Python 3.11

Documentation & Housekeeping

  • I've updated relevant documentation — N/A (the tuple has an inline # Add new patterns here when a model family needs explicit steering. comment that already documents the maintenance pattern)
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A (no config keys changed; the existing agent.tool_use_enforcement key behaviour is unchanged for explicit values, only the auto default is widened)
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
  • I've considered cross-platform impact — N/A (pure string-substring match, no platform-specific code paths)
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A

Sibling code paths that may need the same fix

The issue body explicitly flags mistral and llama as potentially affected (no concrete repro provided). I left them out of this PR's scope to keep the diff minimal and the additions evidence-driven — happy to widen to ("mistral", "llama") during cherry-pick if the same chatty-narration failure mode is confirmed on those families.
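
To make the remaining gap concrete, the sketch below lists which model IDs would still be left unsteered under `auto` after this PR, and what widening the tuple would change. The helper `unsteered` is hypothetical, and the mistral and llama model IDs are illustrative only; neither comes from the issue or the repo.

```python
# Tuple as it stands after this PR (values quoted from the description).
TOOL_USE_ENFORCEMENT_MODELS = ("gpt", "codex", "gemini", "gemma", "grok", "glm", "qwen", "deepseek")

# Illustrative model IDs only; not taken from the issue or the repo.
CANDIDATES = ("mistralai/mistral-large", "meta-llama/llama-3.1-70b-instruct", "qwen/qwen-plus")


def unsteered(model_ids, families=TOOL_USE_ENFORCEMENT_MODELS):
    """Hypothetical helper: return the model IDs that match no family substring."""
    return [m for m in model_ids if not any(f in m.lower() for f in families)]


# With the tuple from this PR, the mistral and llama IDs are not matched.
print(unsteered(CANDIDATES))

# Widening by ("mistral", "llama") would cover them as well.
print(unsteered(CANDIDATES, TOOL_USE_ENFORCEMENT_MODELS + ("mistral", "llama")))
```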

Screenshots / Logs

N/A — guidance-string injection has no UI surface; the change is observable only via system-prompt assembly, which is exercised by the new tests.

When `agent.tool_use_enforcement` is `"auto"` (the default), the
runtime checks the active model name against `TOOL_USE_ENFORCEMENT_MODELS`
in `agent/prompt_builder.py` and only injects `TOOL_USE_ENFORCEMENT_GUIDANCE`
if a substring matches. Qwen and DeepSeek hit the same chatty/hallucinatory
failure mode as GPT, Codex, Grok, and GLM (describing intended actions
instead of calling tools, ignoring memory, silently stopping mid-execution),
but neither substring was in the tuple — so the enforcement prompt was
never injected for users on those families, even with `auto` left at
its default.

Add `"qwen"` and `"deepseek"` to the tuple, matching the established
additive pattern (NousResearch#5595 added grok, NousResearch#24715 added glm, NousResearch#27797 widened
grok to xai-oauth). Add four regression-guard tests that fail before
the production change and pass after: two unit assertions in
`test_prompt_builder.py` mirroring the existing grok/gpt checks, and
two integration tests in `test_run_agent.py` confirming that a
qwen/deepseek model under `tool_use_enforcement="auto"` now gets the
guidance string in its system prompt.

The "robust" alternative from the issue (default-true for all models)
is intentionally not taken: it would silently flip behavior for users
who currently rely on `auto` leaving Claude / non-listed families
unsteered, and the maintainer's prior merged work in this area is
uniformly additive.

Fixes NousResearch#28079
Copilot AI review requested due to automatic review settings May 18, 2026 20:41

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds Qwen and DeepSeek model families to the tool-use enforcement list so that their system prompts include enforcement guidance by default.

Changes:

  • Append "qwen" and "deepseek" to TOOL_USE_ENFORCEMENT_MODELS.
  • Add unit tests verifying the tuple includes the new entries.
  • Add integration tests verifying enforcement guidance is injected for Qwen and DeepSeek models under auto mode.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| agent/prompt_builder.py | Adds qwen and deepseek substrings to the enforcement models tuple. |
| tests/agent/test_prompt_builder.py | Adds membership tests for new model substrings. |
| tests/run_agent/test_run_agent.py | Adds auto-injection tests for Qwen and DeepSeek model IDs. |

Comment thread tests/run_agent/test_run_agent.py Outdated
    def test_auto_injects_for_qwen(self):
        """Qwen models default to chatty/hallucinatory tool use without enforcement."""
        from agent.prompt_builder import TOOL_USE_ENFORCEMENT_GUIDANCE
        agent = self._make_agent(model="qwen/qwen3.6-plus", tool_use_enforcement="auto")

Copilot flagged that `qwen/qwen3.6-plus` is not a real Qwen model
identifier (no such version exists). Substring matching only needs
"qwen" so the test still proves the path, but using a name that
matches the issue body (`qwen-plus`, Alibaba Cloud) is clearer.
@briandevans

Contributor Author

@copilot Addressed in commit 9433eab: switched the qwen integration-test model identifier from qwen/qwen3.6-plus (not a real version) to qwen/qwen-plus (the actual Alibaba Cloud model referenced in the issue body). Substring matching is unchanged, but the test now reads as a representative deployment.

@alt-glitch added labels on May 18, 2026: type/bug (Something isn't working); comp/agent (Core agent loop, run_agent.py, prompt builder); provider/qwen (Qwen / Alibaba Cloud, OAuth); provider/deepseek (DeepSeek API); P2 (Medium: degraded but workaround exists)
@alt-glitch

Collaborator

Competing fix for #28079 alongside #28081. Both add qwen and deepseek to TOOL_USE_ENFORCEMENT_MODELS; #28081 additionally (and redundantly) adds claude/anthropic. This PR is cleaner in scope but covers the same ground.

@briandevans

Contributor Author

Thanks @alt-glitch — flagging the overlap is appreciated. Quick positioning note for the maintainer to choose between:

Happy to defer to #28081 if widening to Claude/Anthropic is the desired direction; happy to widen this PR to mistral/llama during cherry-pick if the issue body's tentative mention of those families is worth covering too.

@teknium1

Contributor

Merged via PR #28348 (squashed your 2-commit stack into one with your authorship preserved via rebase-merge — commit 7569007). Thanks for catching this, and for the explicit test coverage on the new entries.

@teknium1 closed this on May 19, 2026

Development

Successfully merging this pull request may close these issues.

[Bug/Regression] tool_use_enforcement auto-mode excludes Qwen/DeepSeek causing hallucination
