feat(prompt): universal task-completion guidance + local Python toolchain probe #34340
Merged
When a Codex Responses turn ends with status=failed, the response carries
the failure details under `response.error` as
`{code, message, param, ...}`. The previous extractor pulled only
`message`, so users seeing a rate-limit failure got a bare "Slow down"
string indistinguishable from a generic stream truncation; an
internal_error with empty message degraded to a dict dump
("{'code': 'internal_error', 'message': ''}").
Extract a `_format_responses_error()` helper that:
- prefixes `code` when both code and message are present
(e.g. 'rate_limit_exceeded: Slow down')
- falls back to the bare `code` when message is empty
- accepts both dict and attribute-style payloads (SDK and JSON-RPC paths)
- preserves the prior status-only fallback when no error payload exists
Apply the same helper at the sibling site in
`codex_app_server_session.run_turn()` so codex-CLI subprocess turn
failures get the same treatment.
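
The formatting rules listed above can be pictured with a minimal sketch. The function name, signature, fallback wording, and the choice to ignore non-string fields are all assumptions made for illustration; the commit only states the behaviours in the bullet list, not how `_format_responses_error()` is actually written.

```python
from types import SimpleNamespace
from typing import Any, Optional


def _get_field(payload: Any, name: str) -> Optional[str]:
    """Read `name` from a dict or attribute-style payload; return a non-empty str or None."""
    if payload is None:
        return None
    value = payload.get(name) if isinstance(payload, dict) else getattr(payload, name, None)
    if not isinstance(value, str):
        return None  # assumption: non-string fields are treated as absent
    value = value.strip()
    return value or None


def format_responses_error_sketch(error: Any, status: str = "failed") -> str:
    """Hypothetical stand-in for the helper described in the commit message."""
    code = _get_field(error, "code")
    message = _get_field(error, "message")
    if code and message:
        return f"{code}: {message}"   # prefix the code when both are present
    if code:
        return code                   # bare code when the message is empty
    if message:
        return message                # message only, as before
    # No usable payload: status-only fallback (exact wording is assumed).
    return f"Responses API returned status '{status}'"


# Dict-shaped payload (JSON-RPC path) and attribute-shaped payload (SDK path).
print(format_responses_error_sketch({"code": "rate_limit_exceeded", "message": "Slow down"}))
print(format_responses_error_sketch(SimpleNamespace(code="internal_error", message="")))
```

Keeping the field lookup in one small function is what lets the same formatter serve both call sites regardless of payload shape.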
Tests:
- 8 new unit tests for `_format_responses_error` covering both shapes,
empty/missing fields, non-string fields, and the status-only fallback.
- 2 regression tests on `_normalize_codex_response` for failed status
with and without a code, asserting the exact RuntimeError message.
- All 3603 tests in tests/agent/ pass.
Adapted from anomalyco/opencode#28757.
…hain probe

Two cross-model failure modes get a single-line answer in the cached system prompt. Both gated by config (default on), both add zero overhead when not needed, both verified via real AIAgent prompt builds.

## What changed

`TASK_COMPLETION_GUIDANCE`: short prompt block applied to ALL models. Targets two failure modes observed on a real Sarasota real-estate build task: (1) Opus stopped after writing an 85-byte stub and gave a prose response with finish_reason=stop on call #3 of 90; (2) DeepSeek pushed through a PEP-668 wall, then returned fabricated listings instead of admitting the blocker. Both behaviors are model-family-agnostic, so the guidance lives outside the existing tool_use_enforcement gate (~192 tokens, paid once per session via prefix cache).

`tools/env_probe.py`: local Python toolchain probe. Detects python3/pip/uv/PEP-668 state and emits ONE short line in the system prompt when something is non-default. Emits NOTHING when the env is clean (zero token cost for normal users). Skipped entirely for remote terminal backends (docker/modal/ssh), which have their own probe.

Example output on a broken environment (the actual case):

    Python toolchain: python3=3.11.15 (no pip module), python=missing (use python3), pip→python3.12 (mismatch), PEP 668=yes (use venv or uv).

## Config

Both flags live under `agent.` in config.yaml, default True:

    agent:
      task_completion_guidance: true   # universal "finish the job" block
      environment_probe: true          # local Python toolchain hints

Neither addition required a `_config_version` bump: deep-merge fills defaults in for existing user configs.
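
Why no `_config_version` bump is needed can be illustrated with a small deep-merge sketch. The `deep_merge` function, the `DEFAULTS` dict, and the `max_turns` key are hypothetical; only the two `agent.*` flag names come from the text above, and the project's real merge routine may differ.

```python
from copy import deepcopy
from typing import Any, Dict

# Hypothetical defaults table; only the two flag names are taken from the commit message.
DEFAULTS: Dict[str, Any] = {
    "agent": {
        "task_completion_guidance": True,
        "environment_probe": True,
    }
}


def deep_merge(defaults: Dict[str, Any], user: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay user values on defaults, recursing into nested dicts."""
    merged = deepcopy(defaults)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# A user config written before the new keys existed (`max_turns` is an invented key).
old_config = {"agent": {"max_turns": 90}}
merged = deep_merge(DEFAULTS, old_config)
print(merged)

# A user who opts out keeps their explicit value.
opted_out = deep_merge(DEFAULTS, {"agent": {"environment_probe": False}})
print(opted_out["agent"]["environment_probe"])
```

Because missing keys are filled from the defaults at load time, an older config file picks up both flags without any migration step.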
## Validation

| Test surface | Result |
|---|---|
| tests/tools/test_env_probe.py | 10/10 pass (probe unit) |
| tests/run_agent/test_run_agent.py (new classes) | 8/8 pass (integration) |
| TestToolUseEnforcementConfig | 17/17 pass (no regression) |
| TestBuildSystemPrompt | 9/9 pass (no regression) |
| TestInvalidateSystemPrompt | 2/2 pass (no regression) |
| tests/agent/test_prompt_builder.py | 124/124 pass (no regression) |
| tests/hermes_cli/ | 5662/5662 pass (config defaults) |
| E2E AIAgent build (broken env) | Both blocks present, 2,178 chars |
| E2E AIAgent build (clean env) | 771-char net overhead, env probe silent |
KKT-OPT pushed a commit to KKT-OPT/hermes-agent that referenced this pull request May 31, 2026
…hain probe (NousResearch#34340)

* fix(codex): surface error code in Responses 'failed' status errors
* feat(prompt): universal task-completion guidance + local Python toolchain probe
## Summary

Two cross-model failure modes get a single-line answer in the cached system prompt. Both gated by config (default on), both add zero overhead when not needed, both verified via real `AIAgent` prompt builds.

Originated from a user report: an external user couldn't reproduce a real-estate search task that worked on my machine. Their `opus` quit after 3 API calls with a stub; their `deepseek` pushed through a broken Python install and fabricated listings. Their environment had `python3=3.11.15` with no pip module, `pip` on PATH bound to a different Python 3.12, PEP 668 enforced, no `uv`. The agent had no way to know that without hitting walls.

## Changes
### `agent/prompt_builder.py`: new constant `TASK_COMPLETION_GUIDANCE`

Short universal prompt block applied to all models (not gated by `tool_use_enforcement`). Targets two failure modes that aren't model-family specific:

- Stopping early: the model ends its turn with `finish_reason=stop` despite having 90 iterations available.
- Fabricating results: when `pip install` fails, the model synthesises plausible-looking data instead of admitting the blocker.

The block tells the model the deliverable is a working artifact with real tool output, and that reporting a blocker honestly is always better than inventing a result.
### `tools/env_probe.py`: local Python toolchain probe

Detects `python3` / `pip` / `uv` / PEP-668 state and emits ONE short line in the system prompt when something is non-default. Emits NOTHING when the environment is clean (zero token cost). Skipped entirely for remote terminal backends (docker, modal, ssh, daytona, singularity, managed_modal), which have their own probe (`_probe_remote_backend` in `agent/prompt_builder.py`).

Example output on the broken-environment scenario:

    Python toolchain: python3=3.11.15 (no pip module), python=missing (use python3), pip→python3.12 (mismatch), PEP 668=yes (use venv or uv).

Six signals checked:

- `python3` version
- `python3 -m pip` availability (the module, not the CLI shim)
- `EXTERNALLY-MANAGED` marker presence
- `pip` ↔ `python3` version match (catches the "bundled venv vs system Python" trap)
- `uv` presence
- `python` alias presence (common Debian/Ubuntu trap where only `python3` exists)

Cached for the lifetime of the process; deterministic; never crashes the prompt build (all subprocess errors caught).
### `agent/system_prompt.py`: wiring

Both new pieces land in the `stable` tier of the system prompt (alongside identity, tool guidance, environment hints). They participate in the same single-cache lifecycle: never rebuilt mid-session, never broken into separate API messages.

### `agent/agent_init.py`: config attrs

`agent._task_completion_guidance` and `agent._environment_probe` read from `config.yaml` `agent.*` keys.

### `hermes_cli/config.py`: defaults

Both keys default to true under `agent.`. No `_config_version` bump required: deep-merge handles backfilling defaults for existing configs.

## Validation
Test surfaces covered:

- `tests/tools/test_env_probe.py` (new)
- `tests/run_agent/test_run_agent.py::TestTaskCompletionGuidance` (new): `False` disables; no-tools gate
- `tests/run_agent/test_run_agent.py::TestEnvironmentProbeIntegration` (new)
- `TestToolUseEnforcementConfig`
- `TestBuildSystemPrompt`
- `TestInvalidateSystemPrompt`
- `tests/agent/test_prompt_builder.py`
- `tests/hermes_cli/`
- `AIAgent` build, broken env simulated
- `AIAgent` build, clean env

## Overhead

~192 tokens for `TASK_COMPLETION_GUIDANCE`. Paid once per session, amortised across all turns via prefix cache.

For context, a typical Hermes system prompt is 8-15k tokens; this is ~1-2% of that, with an expected, measurable behavioural lift on the failure modes it targets.
## Why this is universal (not model-family-gated)

`tool_use_enforcement` and `OPENAI_MODEL_EXECUTION_GUIDANCE` already gate on model name (`TOOL_USE_ENFORCEMENT_MODELS = ("gpt", "codex", "gemini", "gemma", "grok", "glm", "qwen", "deepseek")`). Claude was deliberately excluded from that set.

But the Opus stub-and-stop failure that motivated this PR is exactly the failure that block exists to prevent; Claude should have been getting steered against it. Adding "claude" to `TOOL_USE_ENFORCEMENT_MODELS` is a heavier-handed change (the whole multi-paragraph block lands), and the cross-model failure-mode framing is cleaner. So this becomes its own short block, applied universally, gated only by a single user-facing toggle.
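
The difference between the two gates can be sketched as follows. The tuple is quoted from the description; the substring-matching rule, the function names, the block labels, and the model names in the usage lines are assumptions for illustration only.

```python
from typing import List

# Quoted from the PR description.
TOOL_USE_ENFORCEMENT_MODELS = ("gpt", "codex", "gemini", "gemma", "grok", "glm", "qwen", "deepseek")


def is_enforcement_model(model_name: str) -> bool:
    """Assumed gate: case-insensitive substring match on the model name."""
    name = model_name.lower()
    return any(tag in name for tag in TOOL_USE_ENFORCEMENT_MODELS)


def select_blocks(model_name: str, task_completion_guidance: bool = True) -> List[str]:
    """Return labels of the prompt blocks that would be included (labels are illustrative)."""
    blocks: List[str] = []
    if is_enforcement_model(model_name):
        blocks.append("tool_use_enforcement")      # model-family gated
    if task_completion_guidance:
        blocks.append("task_completion_guidance")  # universal, user toggle only
    return blocks


print(select_blocks("claude-opus-4"))   # hypothetical model name
print(select_blocks("deepseek-chat"))   # hypothetical model name
```

Under these assumptions a Claude model skips the multi-paragraph enforcement block but still receives the short completion guidance, which is the outcome the section argues for.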