feat(prompt): universal task-completion guidance + local Python toolchain probe by teknium1 · Pull Request #34340 · NousResearch/hermes-agent

teknium1 · 2026-05-29T03:56:05Z

Summary

Two cross-model failure modes get a single-line answer in the cached system prompt. Both gated by config (default on), both add zero overhead when not needed, both verified via real AIAgent prompt builds.

Originated from a user report: an external user couldn't reproduce a real-estate search task that worked on my machine. Their opus quit after 3 API calls with a stub; their deepseek pushed through a broken Python install and fabricated listings. Their environment had python3=3.11.15 with no pip module, pip on PATH bound to a different Python 3.12, PEP 668 enforced, no uv. The agent had no way to know that without hitting walls.

Changes

`agent/prompt_builder.py` — new constant `TASK_COMPLETION_GUIDANCE`

Short universal prompt block applied to all models (not gated by tool_use_enforcement). Targets two failure modes that aren't model-family specific:

Stopping after a stub (e.g. writing 85 bytes, running one terminal call, ending the turn with prose at finish_reason=stop despite having 90 iterations available)
Fabricating output when a real path is blocked (e.g. pip install fails, model synthesises plausible-looking data instead of admitting the blocker)

The block tells the model the deliverable is a working artifact with real tool output, and that reporting a blocker honestly is always better than inventing a result.

`tools/env_probe.py` — local Python toolchain probe

Detects python3 / pip / uv / PEP-668 state and emits ONE short line in the system prompt when something is non-default. Emits NOTHING when the environment is clean (zero token cost). Skipped entirely for remote terminal backends (docker, modal, ssh, daytona, singularity, managed_modal) — they have their own probe (_probe_remote_backend in agent/prompt_builder.py).

Example output on the broken-environment scenario:

Python toolchain: python3=3.11.15 (no pip module), python=missing (use python3), pip→python3.12 (mismatch), PEP 668=yes (use venv or uv).

Six signals checked:

python3 version
python3 -m pip availability (the module, not the CLI shim)
PEP-668 EXTERNALLY-MANAGED marker presence
pip ↔ python3 version match (catches the "bundled venv vs system Python" trap)
uv presence
python alias presence (common Debian/Ubuntu trap where only python3 exists)

Cached for the lifetime of the process; deterministic; never crashes the prompt build (all subprocess errors caught).

`agent/system_prompt.py` — wiring

Both new pieces land in the stable tier of the system prompt (alongside identity, tool guidance, environment hints). They participate in the same single-cache lifecycle — never rebuilt mid-session, never broken into separate API messages.

`agent/agent_init.py` — config attrs

agent._task_completion_guidance and agent._environment_probe read from config.yaml agent.* keys.

`hermes_cli/config.py` — defaults

agent:
  task_completion_guidance: true
  environment_probe: true

No _config_version bump required — deep-merge handles backfilling defaults for existing configs.

Validation

Test surface	Result
`tests/tools/test_env_probe.py` (new)	10/10 pass — silent-on-healthy, emits-on-real-problems, skips-remote-backends, caching, robustness
`tests/run_agent/test_run_agent.py::TestTaskCompletionGuidance` (new)	5/5 pass — claude/deepseek/gpt all get it; `False` disables; no-tools gate
`tests/run_agent/test_run_agent.py::TestEnvironmentProbeIntegration` (new)	3/3 pass — appears when problem detected, silent on clean env, config disable
`TestToolUseEnforcementConfig`	17/17 pass — no regression in existing enforcement gate
`TestBuildSystemPrompt`	9/9 pass — no regression in prompt assembly
`TestInvalidateSystemPrompt`	2/2 pass
`tests/agent/test_prompt_builder.py`	124/124 pass
`tests/hermes_cli/`	5662/5662 pass
E2E real `AIAgent` build, broken env simulated	Both blocks present, 2,178 chars total
E2E real `AIAgent` build, clean env	771-char net overhead (~192 tokens); env probe silent

Overhead

Clean environment: 771 chars (~192 tokens) from TASK_COMPLETION_GUIDANCE. Paid once per session, amortised across all turns via prefix cache.
Broken environment: add ~140 chars (~35 tokens) for the env probe line.

For context, a typical Hermes system prompt is 8-15k tokens; this is ~1-2% of that, with measurable expected behavioural lift on the failure modes it targets.

Why this is universal (not model-family-gated)

tool_use_enforcement and OPENAI_MODEL_EXECUTION_GUIDANCE already gate on model name (TOOL_USE_ENFORCEMENT_MODELS = ("gpt", "codex", "gemini", "gemma", "grok", "glm", "qwen", "deepseek")). Claude was deliberately excluded from that set.

But the Opus stub-and-stop failure that motivated this PR is exactly the failure that block exists to prevent — Claude should have been getting steered against it. Adding "claude" to TOOL_USE_ENFORCEMENT_MODELS is a heavier-handed change (the whole multi-paragraph block lands), and the cross-model failure-mode framing is cleaner. So this becomes its own short block, applied universally, gated only by a single user-facing toggle.

Infographic

When a Codex Responses turn ends with status=failed, the response carries the failure details under `response.error` as `{code, message, param, ...}`. The previous extractor pulled only `message`, so users seeing a rate-limit failure got a bare "Slow down" string indistinguishable from a generic stream truncation; an internal_error with empty message degraded to a dict dump ("{'code': 'internal_error', 'message': ''}"). Extract a `_format_responses_error()` helper that: - prefixes `code` when both code and message are present (e.g. 'rate_limit_exceeded: Slow down') - falls back to the bare `code` when message is empty - accepts both dict and attribute-style payloads (SDK and JSON-RPC paths) - preserves the prior status-only fallback when no error payload exists Apply the same helper at the sibling site in `codex_app_server_session.run_turn()` so codex-CLI subprocess turn failures get the same treatment. Tests: - 8 new unit tests for `_format_responses_error` covering both shapes, empty/missing fields, non-string fields, and the status-only fallback. - 2 regression tests on `_normalize_codex_response` for failed status with and without a code, asserting the exact RuntimeError message. - All 3603 tests in tests/agent/ pass. Adapted from anomalyco/opencode#28757.

…hain probe Two cross-model failure modes get a single-line answer in the cached system prompt. Both gated by config (default on), both add zero overhead when not needed, both verified via real AIAgent prompt builds. ## What changed `TASK_COMPLETION_GUIDANCE` — short prompt block applied to ALL models. Targets two failure modes observed on a real Sarasota real-estate build task: (1) Opus stopped after writing an 85-byte stub and gave a prose response with finish_reason=stop on call #3 of 90; (2) DeepSeek pushed through a PEP-668 wall, then returned fabricated listings instead of admitting the blocker. Both behaviors are model-family-agnostic, so the guidance lives outside the existing tool_use_enforcement gate (~192 tokens, paid once per session via prefix cache). `tools/env_probe.py` — local Python toolchain probe. Detects python3/pip/uv/PEP-668 state and emits ONE short line in the system prompt when something is non-default. Emits NOTHING when the env is clean (zero token cost for normal users). Skipped entirely for remote terminal backends (docker/modal/ssh) — they have their own probe. Example output on a broken environment (the actual case): Python toolchain: python3=3.11.15 (no pip module), python=missing (use python3), pip→python3.12 (mismatch), PEP 668=yes (use venv or uv). ## Config Both flags live under `agent.` in config.yaml, default True: agent: task_completion_guidance: true # universal "finish the job" block environment_probe: true # local Python toolchain hints Neither addition required a `_config_version` bump — deep-merge fills defaults in for existing user configs. ## Validation | Test surface | Result | |---|---| | tests/tools/test_env_probe.py | 10/10 pass (probe unit) | | tests/run_agent/test_run_agent.py — new classes | 8/8 pass (integration) | | TestToolUseEnforcementConfig | 17/17 pass (no regression) | | TestBuildSystemPrompt | 9/9 pass (no regression) | | TestInvalidateSystemPrompt | 2/2 pass (no regression) | | tests/agent/test_prompt_builder.py | 124/124 pass (no regression) | | tests/hermes_cli/ | 5662/5662 pass (config defaults) | | E2E AIAgent build (broken env) | Both blocks present, 2,178 chars | | E2E AIAgent build (clean env) | 771-char net overhead, env probe silent |

…hain probe (NousResearch#34340) * fix(codex): surface error code in Responses 'failed' status errors When a Codex Responses turn ends with status=failed, the response carries the failure details under `response.error` as `{code, message, param, ...}`. The previous extractor pulled only `message`, so users seeing a rate-limit failure got a bare "Slow down" string indistinguishable from a generic stream truncation; an internal_error with empty message degraded to a dict dump ("{'code': 'internal_error', 'message': ''}"). Extract a `_format_responses_error()` helper that: - prefixes `code` when both code and message are present (e.g. 'rate_limit_exceeded: Slow down') - falls back to the bare `code` when message is empty - accepts both dict and attribute-style payloads (SDK and JSON-RPC paths) - preserves the prior status-only fallback when no error payload exists Apply the same helper at the sibling site in `codex_app_server_session.run_turn()` so codex-CLI subprocess turn failures get the same treatment. Tests: - 8 new unit tests for `_format_responses_error` covering both shapes, empty/missing fields, non-string fields, and the status-only fallback. - 2 regression tests on `_normalize_codex_response` for failed status with and without a code, asserting the exact RuntimeError message. - All 3603 tests in tests/agent/ pass. Adapted from anomalyco/opencode#28757. * feat(prompt): universal task-completion guidance + local Python toolchain probe Two cross-model failure modes get a single-line answer in the cached system prompt. Both gated by config (default on), both add zero overhead when not needed, both verified via real AIAgent prompt builds. ## What changed `TASK_COMPLETION_GUIDANCE` — short prompt block applied to ALL models. Targets two failure modes observed on a real Sarasota real-estate build task: (1) Opus stopped after writing an 85-byte stub and gave a prose response with finish_reason=stop on call NousResearch#3 of 90; (2) DeepSeek pushed through a PEP-668 wall, then returned fabricated listings instead of admitting the blocker. Both behaviors are model-family-agnostic, so the guidance lives outside the existing tool_use_enforcement gate (~192 tokens, paid once per session via prefix cache). `tools/env_probe.py` — local Python toolchain probe. Detects python3/pip/uv/PEP-668 state and emits ONE short line in the system prompt when something is non-default. Emits NOTHING when the env is clean (zero token cost for normal users). Skipped entirely for remote terminal backends (docker/modal/ssh) — they have their own probe. Example output on a broken environment (the actual case): Python toolchain: python3=3.11.15 (no pip module), python=missing (use python3), pip→python3.12 (mismatch), PEP 668=yes (use venv or uv). ## Config Both flags live under `agent.` in config.yaml, default True: agent: task_completion_guidance: true # universal "finish the job" block environment_probe: true # local Python toolchain hints Neither addition required a `_config_version` bump — deep-merge fills defaults in for existing user configs. ## Validation | Test surface | Result | |---|---| | tests/tools/test_env_probe.py | 10/10 pass (probe unit) | | tests/run_agent/test_run_agent.py — new classes | 8/8 pass (integration) | | TestToolUseEnforcementConfig | 17/17 pass (no regression) | | TestBuildSystemPrompt | 9/9 pass (no regression) | | TestInvalidateSystemPrompt | 2/2 pass (no regression) | | tests/agent/test_prompt_builder.py | 124/124 pass (no regression) | | tests/hermes_cli/ | 5662/5662 pass (config defaults) | | E2E AIAgent build (broken env) | Both blocks present, 2,178 chars | | E2E AIAgent build (clean env) | 771-char net overhead, env probe silent |

teknium1 added 2 commits May 28, 2026 17:10

alt-glitch added type/feature New feature or request P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder labels May 29, 2026

teknium1 merged commit a4d8f0f into main May 29, 2026
20 of 24 checks passed

teknium1 deleted the hermes/hermes-6a4cf2d7 branch May 29, 2026 05:26

BrewTestBot mentioned this pull request Jun 6, 2026

hermes-agent 2026.6.5 Homebrew/homebrew-core#286569

Merged

1 task

github-actions Bot mentioned this pull request Jun 6, 2026

chore: bump NousResearch/hermes-agent version from v2026.5.29.2 to v2026.6.5 Docker-Hub-sirmark/docker-hermes-agent#9

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(prompt): universal task-completion guidance + local Python toolchain probe#34340

feat(prompt): universal task-completion guidance + local Python toolchain probe#34340
teknium1 merged 2 commits into
mainfrom
hermes/hermes-6a4cf2d7

teknium1 commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

teknium1 commented May 29, 2026

Summary

Changes

agent/prompt_builder.py — new constant TASK_COMPLETION_GUIDANCE

tools/env_probe.py — local Python toolchain probe

agent/system_prompt.py — wiring

agent/agent_init.py — config attrs

hermes_cli/config.py — defaults

Validation

Overhead

Why this is universal (not model-family-gated)

Infographic

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`agent/prompt_builder.py` — new constant `TASK_COMPLETION_GUIDANCE`

`tools/env_probe.py` — local Python toolchain probe

`agent/system_prompt.py` — wiring

`agent/agent_init.py` — config attrs

`hermes_cli/config.py` — defaults