feat(prompt): universal task-completion guidance + local Python toolchain probe#34340

Merged
teknium1 merged 2 commits into main from hermes/hermes-6a4cf2d7
May 29, 2026

@teknium1

Contributor

Summary

Two cross-model failure modes each get a short answer in the cached system prompt: a brief guidance block and a one-line environment hint. Both are gated by config (default on), the environment hint adds zero overhead when not needed, and both were verified via real AIAgent prompt builds.

Originated from a user report: an external user couldn't reproduce a real-estate search task that worked on my machine. Their Opus run quit after 3 API calls with a stub; their DeepSeek run pushed through a broken Python install and fabricated listings. Their environment had python3=3.11.15 with no pip module, a pip on PATH bound to a different interpreter (Python 3.12), PEP 668 enforced, and no uv. The agent had no way to know any of that without hitting walls.

Changes

agent/prompt_builder.py — new constant TASK_COMPLETION_GUIDANCE

Short universal prompt block applied to all models (not gated by tool_use_enforcement). Targets two failure modes that aren't model-family specific:

  1. Stopping after a stub (e.g. writing 85 bytes, running one terminal call, ending the turn with prose at finish_reason=stop despite having 90 iterations available)
  2. Fabricating output when a real path is blocked (e.g. pip install fails, model synthesises plausible-looking data instead of admitting the blocker)

The block tells the model the deliverable is a working artifact with real tool output, and that reporting a blocker honestly is always better than inventing a result.

tools/env_probe.py — local Python toolchain probe

Detects python3 / pip / uv / PEP-668 state and emits ONE short line in the system prompt when something is non-default. Emits NOTHING when the environment is clean (zero token cost). Skipped entirely for remote terminal backends (docker, modal, ssh, daytona, singularity, managed_modal) — they have their own probe (_probe_remote_backend in agent/prompt_builder.py).

Example output on the broken-environment scenario:

Python toolchain: python3=3.11.15 (no pip module), python=missing (use python3), pip→python3.12 (mismatch), PEP 668=yes (use venv or uv).

Six signals checked:

  • python3 version
  • python3 -m pip availability (the module, not the CLI shim)
  • PEP-668 EXTERNALLY-MANAGED marker presence
  • pip↔python3 version match (catches the "bundled venv vs system Python" trap)
  • uv presence
  • python alias presence (common Debian/Ubuntu trap where only python3 exists)

Cached for the lifetime of the process; deterministic; never crashes the prompt build (all subprocess errors caught).

agent/system_prompt.py — wiring

Both new pieces land in the stable tier of the system prompt (alongside identity, tool guidance, environment hints). They participate in the same single-cache lifecycle — never rebuilt mid-session, never broken into separate API messages.
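
The single-cache lifecycle can be pictured with a small sketch. The class `StablePromptCache` and its methods are hypothetical and only illustrate the idea of assembling the stable tier once and reusing the same string for every turn; the real wiring in agent/system_prompt.py may look quite different.

```python
from typing import Callable, List, Optional


class StablePromptCache:
    """Hypothetical illustration: build the stable tier once per session."""

    def __init__(self, block_builders: List[Callable[[], str]]):
        self._block_builders = list(block_builders)
        self._cached: Optional[str] = None
        self.builds = 0  # counts real assemblies, for demonstration only

    def get(self) -> str:
        if self._cached is None:
            blocks = [build() for build in self._block_builders]
            # Empty blocks (e.g. a silent env probe) contribute nothing.
            self._cached = "\n\n".join(b for b in blocks if b)
            self.builds += 1
        return self._cached

    def invalidate(self) -> None:
        """Force a rebuild on the next get(), e.g. at the start of a new session."""
        self._cached = None


cache = StablePromptCache([
    lambda: "IDENTITY",
    lambda: "TASK_COMPLETION_GUIDANCE",
    lambda: "",  # env probe on a clean machine: emits nothing
])
first = cache.get()
second = cache.get()  # served from cache, identical text, no rebuild
```

Because the text is byte-identical across turns, a provider-side prefix cache can keep matching it, which is what lets the guidance cost be paid once per session.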

agent/agent_init.py — config attrs

agent._task_completion_guidance and agent._environment_probe read from config.yaml agent.* keys.

hermes_cli/config.py — defaults

agent:
  task_completion_guidance: true
  environment_probe: true

No _config_version bump required — deep-merge handles backfilling defaults for existing configs.
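
To show why no version bump is needed, here is a minimal sketch of a recursive deep-merge that backfills new default keys into an older user config. The function `deep_merge` and the `max_iterations` key are illustrative assumptions, not the actual hermes_cli/config.py code.

```python
from copy import deepcopy

DEFAULTS = {
    "agent": {
        "task_completion_guidance": True,
        "environment_probe": True,
    }
}


def deep_merge(defaults: dict, user: dict) -> dict:
    """Return defaults overlaid with user values, recursing into nested dicts."""
    merged = deepcopy(defaults)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# An existing config written before this PR: it knows nothing about the new keys.
old_user_config = {"agent": {"max_iterations": 90}}  # hypothetical key
effective = deep_merge(DEFAULTS, old_user_config)

# A user who explicitly opts out keeps their choice.
opted_out = deep_merge(DEFAULTS, {"agent": {"environment_probe": False}})
```

Under this scheme a missing key simply takes its default, so existing files stay valid without a migration step.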

Validation

| Test surface | Result |
|---|---|
| tests/tools/test_env_probe.py (new) | 10/10 pass: silent-on-healthy, emits-on-real-problems, skips-remote-backends, caching, robustness |
| tests/run_agent/test_run_agent.py::TestTaskCompletionGuidance (new) | 5/5 pass: claude/deepseek/gpt all get it; False disables; no-tools gate |
| tests/run_agent/test_run_agent.py::TestEnvironmentProbeIntegration (new) | 3/3 pass: appears when problem detected, silent on clean env, config disable |
| TestToolUseEnforcementConfig | 17/17 pass: no regression in existing enforcement gate |
| TestBuildSystemPrompt | 9/9 pass: no regression in prompt assembly |
| TestInvalidateSystemPrompt | 2/2 pass |
| tests/agent/test_prompt_builder.py | 124/124 pass |
| tests/hermes_cli/ | 5662/5662 pass |
| E2E real AIAgent build, broken env simulated | Both blocks present, 2,178 chars total |
| E2E real AIAgent build, clean env | 771-char net overhead (~192 tokens); env probe silent |

Overhead

  • Clean environment: 771 chars (~192 tokens) from TASK_COMPLETION_GUIDANCE. Paid once per session, amortised across all turns via prefix cache.
  • Broken environment: add ~140 chars (~35 tokens) for the env probe line.

For context, a typical Hermes system prompt is 8-15k tokens; this is roughly 1-3% of that (about 192 of 15k tokens at the low end, about 227 of 8k at the high end), with an expected behavioural lift on the failure modes it targets.
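
The overhead figures can be reproduced with a few lines of arithmetic. The 4-characters-per-token ratio is a common rule of thumb and is assumed here, not measured with a real tokenizer.

```python
CHARS_PER_TOKEN = 4  # rough heuristic (assumption), not a tokenizer measurement


def approx_tokens(chars: int) -> int:
    return chars // CHARS_PER_TOKEN


guidance_tokens = approx_tokens(771)  # TASK_COMPLETION_GUIDANCE
probe_tokens = approx_tokens(140)     # env probe line, broken environments only

for prompt_tokens in (8_000, 15_000):
    clean = 100 * guidance_tokens / prompt_tokens
    broken = 100 * (guidance_tokens + probe_tokens) / prompt_tokens
    print(f"{prompt_tokens} token prompt: clean {clean:.1f}%, broken {broken:.1f}%")
```

With these inputs the share ranges from about 1.3% (clean environment, 15k prompt) to about 2.8% (broken environment, 8k prompt).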

Why this is universal (not model-family-gated)

tool_use_enforcement and OPENAI_MODEL_EXECUTION_GUIDANCE already gate on model name (TOOL_USE_ENFORCEMENT_MODELS = ("gpt", "codex", "gemini", "gemma", "grok", "glm", "qwen", "deepseek")). Claude was deliberately excluded from that set.

But the Opus stub-and-stop failure that motivated this PR is exactly the failure that block exists to prevent — Claude should have been getting steered against it. Adding "claude" to TOOL_USE_ENFORCEMENT_MODELS is a heavier-handed change (the whole multi-paragraph block lands), and the cross-model failure-mode framing is cleaner. So this becomes its own short block, applied universally, gated only by a single user-facing toggle.
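
The difference between the two gates can be sketched as below. The tuple is copied from the description above; the substring match, the function names, and the example model names are assumptions, and the `has_tools` condition is only my reading of the "no-tools gate" test case.

```python
TOOL_USE_ENFORCEMENT_MODELS = (
    "gpt", "codex", "gemini", "gemma", "grok", "glm", "qwen", "deepseek",
)


def wants_tool_use_enforcement(model: str) -> bool:
    """Model-family gate: only listed families get the long enforcement block."""
    name = model.lower()
    return any(family in name for family in TOOL_USE_ENFORCEMENT_MODELS)


def wants_task_completion_guidance(enabled: bool, has_tools: bool) -> bool:
    """Universal gate: the model name is never consulted."""
    return enabled and has_tools


for model in ("claude-example", "deepseek-example", "gpt-example"):  # made-up names
    print(
        model,
        "enforcement:", wants_tool_use_enforcement(model),
        "guidance:", wants_task_completion_guidance(enabled=True, has_tools=True),
    )
```

Under these assumptions a Claude-family model skips the enforcement block but still receives the short guidance block, which is the behaviour this section argues for.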

Infographic: prompt-guidance-env-probe

teknium1 added 2 commits May 28, 2026 17:10
When a Codex Responses turn ends with status=failed, the response carries
the failure details under `response.error` as
`{code, message, param, ...}`. The previous extractor pulled only
`message`, so users seeing a rate-limit failure got a bare "Slow down"
string indistinguishable from a generic stream truncation; an
internal_error with empty message degraded to a dict dump
("{'code': 'internal_error', 'message': ''}").

Extract a `_format_responses_error()` helper that:
- prefixes `code` when both code and message are present
  (e.g. 'rate_limit_exceeded: Slow down')
- falls back to the bare `code` when message is empty
- accepts both dict and attribute-style payloads (SDK and JSON-RPC paths)
- preserves the prior status-only fallback when no error payload exists

Apply the same helper at the sibling site in
`codex_app_server_session.run_turn()` so codex-CLI subprocess turn
failures get the same treatment.

Tests:
- 8 new unit tests for `_format_responses_error` covering both shapes,
  empty/missing fields, non-string fields, and the status-only fallback.
- 2 regression tests on `_normalize_codex_response` for failed status
  with and without a code, asserting the exact RuntimeError message.
- All 3603 tests in tests/agent/ pass.

Adapted from anomalyco/opencode#28757.
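
As a reading aid, the four rules in the commit message above could be implemented along these lines. This is a sketch under assumptions, not the merged `_format_responses_error`: the function name is left unprefixed to mark it as illustrative, the fallback wording is invented, and treating non-string fields as empty is a guess.

```python
from types import SimpleNamespace
from typing import Any


def format_responses_error(error: Any, status: str = "failed") -> str:
    """Build a readable error string from a dict or attribute-style payload."""
    fallback = f"Response ended with status={status}"  # hypothetical wording
    if error is None:
        return fallback

    def field(name: str) -> str:
        if isinstance(error, dict):
            value = error.get(name)
        else:
            value = getattr(error, name, None)
        # Assumption: anything that is not a string counts as missing.
        return value.strip() if isinstance(value, str) else ""

    code, message = field("code"), field("message")
    if code and message:
        return f"{code}: {message}"
    return code or message or fallback


print(format_responses_error({"code": "rate_limit_exceeded", "message": "Slow down"}))
print(format_responses_error({"code": "internal_error", "message": ""}))
print(format_responses_error(SimpleNamespace(code="internal_error", message="boom")))
print(format_responses_error(None))
```

The point of the ordering is that an empty message can no longer degrade into a raw dict dump: the code alone is surfaced instead.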
…hain probe

Two cross-model failure modes get a single-line answer in the cached
system prompt. Both gated by config (default on), both add zero overhead
when not needed, both verified via real AIAgent prompt builds.

## What changed

`TASK_COMPLETION_GUIDANCE` — short prompt block applied to ALL models.
Targets two failure modes observed on a real Sarasota real-estate build
task: (1) Opus stopped after writing an 85-byte stub and gave a prose
response with finish_reason=stop on call #3 of 90; (2) DeepSeek pushed
through a PEP-668 wall, then returned fabricated listings instead of
admitting the blocker. Both behaviors are model-family-agnostic, so the
guidance lives outside the existing tool_use_enforcement gate (~192
tokens, paid once per session via prefix cache).

`tools/env_probe.py` — local Python toolchain probe. Detects
python3/pip/uv/PEP-668 state and emits ONE short line in the system
prompt when something is non-default. Emits NOTHING when the env is
clean (zero token cost for normal users). Skipped entirely for remote
terminal backends (docker/modal/ssh) — they have their own probe.

Example output on a broken environment (the actual case):

    Python toolchain: python3=3.11.15 (no pip module),
    python=missing (use python3), pip→python3.12 (mismatch),
    PEP 668=yes (use venv or uv).

## Config

Both flags live under `agent.` in config.yaml, default True:

    agent:
      task_completion_guidance: true   # universal "finish the job" block
      environment_probe: true          # local Python toolchain hints

Neither addition required a `_config_version` bump — deep-merge fills
defaults in for existing user configs.

## Validation

| Test surface | Result |
|---|---|
| tests/tools/test_env_probe.py | 10/10 pass (probe unit) |
| tests/run_agent/test_run_agent.py — new classes | 8/8 pass (integration) |
| TestToolUseEnforcementConfig | 17/17 pass (no regression) |
| TestBuildSystemPrompt | 9/9 pass (no regression) |
| TestInvalidateSystemPrompt | 2/2 pass (no regression) |
| tests/agent/test_prompt_builder.py | 124/124 pass (no regression) |
| tests/hermes_cli/ | 5662/5662 pass (config defaults) |
| E2E AIAgent build (broken env) | Both blocks present, 2,178 chars |
| E2E AIAgent build (clean env) | 771-char net overhead, env probe silent |
@alt-glitch added the type/feature (New feature or request), P2 (Medium: degraded but workaround exists), and comp/agent (Core agent loop, run_agent.py, prompt builder) labels May 29, 2026
@teknium1 merged commit a4d8f0f into main May 29, 2026
20 of 24 checks passed
@teknium1 deleted the hermes/hermes-6a4cf2d7 branch May 29, 2026 05:26
KKT-OPT pushed a commit to KKT-OPT/hermes-agent that referenced this pull request May 31, 2026
…hain probe (NousResearch#34340)

* fix(codex): surface error code in Responses 'failed' status errors

* feat(prompt): universal task-completion guidance + local Python toolchain probe
