feat(grok): apply OpenAI execution guidance to xAI Grok / xai-oauth models #27797

Merged
teknium1 merged 1 commit into main from hermes/hermes-e3fd584d on May 18, 2026
@teknium1

Contributor

Summary

xAI Grok models (xai-oauth + OpenRouter grok-*) now get the same family-specific execution discipline block (OPENAI_MODEL_EXECUTION_GUIDANCE) that GPT/Codex get. Grok shows the same failure modes in practice: it claims completion without making tool calls ("to be honest, I didn't create the file yet"), suggests workarounds instead of using existing tools (proposing a folder-based memory system when the memory tool exists), and replies with plans instead of executing.

The base TOOL_USE_ENFORCEMENT_GUIDANCE was already firing for grok ("grok" is in TOOL_USE_ENFORCEMENT_MODELS). This PR adds only the family-specific second tier that GPT/Codex got and Grok didn't.

Changes

  • agent/system_prompt.py: gate at L159 also matches "grok" in _model_lower
  • agent/prompt_builder.py: docstring note that OPENAI_ prefix reflects origin, not exclusivity (body is family-agnostic — tool_persistence / mandatory_tool_use / act_dont_ask / prerequisite_checks / verification / missing_context)
  • tests/run_agent/test_run_agent.py: 4 new tests covering OpenRouter slug, xai-oauth bare name, and a claude negative control
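
The gating described above can be sketched roughly as follows. This is an illustration, not the code in agent/system_prompt.py: the function name `select_guidance_blocks` and the contents of both marker tuples (other than "grok") are assumptions. Only the two constant names and the substring match against the lowercased model name come from this PR.

```python
# Hypothetical sketch of the two-tier gate; not the actual agent/system_prompt.py code.

# Assumed contents: the PR only confirms that "grok" is in TOOL_USE_ENFORCEMENT_MODELS.
TOOL_USE_ENFORCEMENT_MODELS = ("gpt", "codex", "grok")

# Assumed markers for the family-specific second tier (GPT/Codex before, Grok added).
EXECUTION_GUIDANCE_MARKERS = ("gpt", "codex", "grok")


def select_guidance_blocks(model: str) -> list[str]:
    """Return the names of the guidance blocks a model name would receive."""
    model_lower = model.lower()
    blocks = []
    # Tier 1: base tool-use enforcement.
    if any(marker in model_lower for marker in TOOL_USE_ENFORCEMENT_MODELS):
        blocks.append("TOOL_USE_ENFORCEMENT_GUIDANCE")
    # Tier 2: family-specific execution discipline.
    if any(marker in model_lower for marker in EXECUTION_GUIDANCE_MARKERS):
        blocks.append("OPENAI_MODEL_EXECUTION_GUIDANCE")
    return blocks


if __name__ == "__main__":
    for name in ("grok-4.3", "x-ai/grok-4.3", "claude-sonnet-4"):
        print(name, select_guidance_blocks(name))
```

Because the match is a substring test on the lowercased name, the OpenRouter slug and the xai-oauth bare name take the same path.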

Validation

| | Before | After |
|---|---|---|
| Grok base enforcement block | injected | injected |
| Grok exec discipline (verification, mandatory_tool_use, act_dont_ask) | missing | injected |
| Claude exec discipline | not injected | not injected |

  • TestToolUseEnforcementConfig: 16/16 pass (12 pre-existing + 4 new)
  • test_prompt_builder.py: 122/122 pass
  • E2E with real AIAgent._build_system_prompt(): grok-4.3 (xai-oauth) and x-ai/grok-4.20 (openrouter) both inject the full block including <verification>, <mandatory_tool_use>, <act_dont_ask>; claude-sonnet-4 does not.
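
A check along the lines of the E2E validation could look like the sketch below. The helper `present_sections` and the stand-in prompt strings are hypothetical. A real check would pass in the output of AIAgent._build_system_prompt(), whose construction arguments are not shown on this page. Only the section tag names are taken from the PR description.

```python
# Hypothetical helper for inspecting a built system prompt; not part of the PR.

# Section names listed in the PR for OPENAI_MODEL_EXECUTION_GUIDANCE.
SECTION_TAGS = (
    "tool_persistence",
    "mandatory_tool_use",
    "act_dont_ask",
    "prerequisite_checks",
    "verification",
    "missing_context",
)


def present_sections(system_prompt: str) -> list[str]:
    """Return the section tags whose opening tag appears in the prompt."""
    return [tag for tag in SECTION_TAGS if f"<{tag}>" in system_prompt]


if __name__ == "__main__":
    # Stand-in prompts: placeholders, not the real guidance text.
    fake_grok_prompt = "\n".join(f"<{tag}>placeholder</{tag}>" for tag in SECTION_TAGS)
    fake_claude_prompt = "You are a helpful agent."

    print(present_sections(fake_grok_prompt))
    print(present_sections(fake_claude_prompt))
```

For a Grok model the expectation is that every tag is present, and for the Claude negative control that none are.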

…odels

Grok models hit the same failure modes that OPENAI_MODEL_EXECUTION_GUIDANCE
addresses for GPT/Codex: claiming completion without tool calls
('to be honest, I didn't create the file yet'), suggesting workarounds
instead of using existing tools (proposing a folder-based memory system
when the memory tool exists), replying with plans instead of executing.

TOOL_USE_ENFORCEMENT_GUIDANCE was already injected for any model whose
name contains 'grok' (TOOL_USE_ENFORCEMENT_MODELS). This extends the
follow-on family-specific block — OPENAI_MODEL_EXECUTION_GUIDANCE
(tool_persistence / mandatory_tool_use / act_dont_ask / prerequisite_checks
/ verification / missing_context) — to grok-named models too.

The OPENAI_ prefix is retained for backwards compat with imports/tests;
docstring + inline comment now note that the body is family-agnostic and
the prefix reflects origin, not exclusivity.

Tests cover the OpenRouter slug (x-ai/grok-4.3) and the xai-oauth bare
name (grok-4.3), plus a negative control on claude.

E2E verified against a real AIAgent build of the system prompt for both
xai-oauth and openrouter grok models.
@teknium1 merged commit 9b91377 into main on May 18, 2026
16 of 17 checks passed
@teknium1 deleted the hermes/hermes-e3fd584d branch on May 18, 2026 at 06:00
@github-actions

Contributor

🔎 Lint report: hermes/hermes-e3fd584d vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 8749 on HEAD, 8749 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 4608 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

@alt-glitch added labels on May 18, 2026: type/feature (New feature or request); comp/agent (Core agent loop, run_agent.py, prompt builder); provider/xai (xAI, Grok); P3 (Low: cosmetic, nice to have)
Lillard01 pushed a commit to Lillard01/hermes-agent that referenced this pull request May 21, 2026
…odels (NousResearch#27797)

gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…odels (NousResearch#27797)

