fix(goals): pause loop when judge errors and response is terminal (#27585) by briandevans · Pull Request #27752 · NousResearch/hermes-agent

briandevans · 2026-05-18T04:31:18Z

What does this PR do?

When the goal judge raises an API/transport error (e.g. BadRequestError, ConnectionError, 4xx), judge_goal deliberately fails OPEN to ("continue", "judge error: ...", False) so a broken judge doesn't wedge progress. That assumption breaks when the assistant has already emitted a terminal response on the same turn — evaluate_after_turn re-queues a continuation prompt, the agent re-emits a near-identical terminal message, the judge keeps erroring, and the gateway channel (Discord/Telegram/etc.) gets spammed with the same completion message every few seconds until the turn budget runs out (16 of 20 turns in the reporter's case). This PR adds a safe-stop fallback: when the verdict is continue AND the reason starts with _JUDGE_ERROR_REASON_PREFIX ("judge error:") AND the assistant's response matches a small set of terminal-phrasing patterns ("goal is complete", "i'm/i am stopping", "nothing left to do"), pause the loop instead of queueing another turn. Fail-open is preserved for in-flight goals (non-terminal response → still continue), so a transient judge outage mid-task does not wedge legitimate progress.

Related Issue

Fixes #27585

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
🔒 Security fix
📝 Documentation update
✅ Tests (adding or improving test coverage)
♻️ Refactor (no behavior change)
🎯 New skill (bundled or hub)

Changes Made

hermes_cli/goals.py — judge_goal's error path now emits the _JUDGE_ERROR_REASON_PREFIX shared constant so the producer/consumer contract is explicit. GoalManager.evaluate_after_turn adds the safe-stop fallback: terminal-phrase regex match + judge-error prefix → pause with paused_reason="goal judge unreachable; assistant response appears terminal — pausing to avoid spam".
tests/hermes_cli/test_goals.py — new TestJudgeErrorSafeStop class (judge API error + terminal response → paused; judge API error + non-terminal → still continues; judge OK + terminal phrasing → still continues; judge done + terminal phrasing → status=done) and TestResponseLooksTerminal parametrized helper coverage.

How to Test

uv run --with pytest --with pytest-xdist --with pytest-asyncio python3 -m pytest tests/hermes_cli/test_goals.py -v — 65 passed.
Regression guard: with the production change stashed, test_judge_api_error_with_terminal_response_auto_pauses fails with assert True is False on the should_continue assertion — exactly matching the issue's reproduction. With the fix applied, it passes alongside the existing 61 tests.

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix/feature (no unrelated commits)
I've run focused tests for the touched code and all pass
I've added tests for my changes (required for bug fixes, strongly encouraged for features)
I've tested on my platform: macOS 15.x

Documentation & Housekeeping

I've updated relevant documentation (README, docs/, docstrings) — or N/A
I've updated cli-config.yaml.example if I added/changed config keys — or N/A
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Sibling Audit

The parse-failure auto-pause already guards the bad-judge-output case (weak models emitting non-JSON for N turns). This PR is the orthogonal guard for the unreachable-judge case — different failure mode, complementary backstop. Audited siblings: no other call sites consume the judge-error prefix. No widening needed.

Different failure mode from fix(goals): salvage PR #19160 + auto-pause on consecutive judge parse failures #21576 (auto-pause on consecutive parse failures); this guards the unreachable-judge case where the judge call itself errors out.
Different failure mode from /goal judge produces false positive when agent claims file created but write silently fails #18421 (judge false positive — claims done when not); this guards the inverse where the judge cannot evaluate and the assistant has explicitly stopped.

…usResearch#27585) When the goal judge (an auxiliary model call) raises an API/transport error, judge_goal fails OPEN to `("continue", "judge error: ...", False)` on the assumption that a broken judge must not wedge progress. The turn budget is the only backstop. That assumption breaks when the assistant has already produced a terminal response on the same turn — e.g. `"Goal is complete. I'm stopping here because the requested goal is complete."`. The fail-open re-queues a continuation prompt; the agent re-emits a near-identical terminal message; the judge keeps erroring; the gateway channel (Discord/Telegram/etc.) gets spammed every few seconds until the turn budget runs out. This adds a safe-stop fallback in `GoalManager.evaluate_after_turn`: when the verdict is `continue` AND the reason carries the `_JUDGE_ERROR_REASON_PREFIX` AND the response matches one of a small set of terminal phrasing patterns, pause the loop with a descriptive reason instead of queueing another turn. Parse-failure auto-pause already guards the bad-judge-output case; this guards the unreachable-judge case. `judge_goal`'s error path now emits the prefix via the `_JUDGE_ERROR_REASON_PREFIX` constant so the contract is explicit. Tests cover: - judge API error + terminal response → paused (regression for NousResearch#27585) - judge API error + non-terminal response → still continues (fail-open preserved on in-flight goals) - judge OK returns continue + response uses terminal phrasing → still continues (successful judge retains authority) - judge done + terminal response → status=done, safe-stop does not fire - Unit coverage of the terminal-phrase detector across positive and negative cases. Fixes NousResearch#27585 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

briandevans · 2026-06-02T00:18:16Z

Housekeeping: closing to keep my open-PR set focused on actively-reviewed work. This has been open ~14d without maintainer review and the surrounding code has continued to move, so it's unlikely to land as-is. The underlying fix still stands — happy to reopen and rebase if it would be useful. Thanks!

Copilot AI review requested due to automatic review settings May 18, 2026 04:31

zccyman mentioned this pull request May 18, 2026

fix(goals): add consecutive API error auto-pause to prevent goal spam (#27585) #27760

Open

alt-glitch added type/bug Something isn't working comp/cli CLI entry point, hermes_cli/, setup wizard P2 Medium — degraded but workaround exists labels May 18, 2026

briandevans force-pushed the fix/goal-judge-error-terminal-response-27585 branch 4 times, most recently from 508c9ac to 76fd236 Compare May 27, 2026 05:11

briandevans force-pushed the fix/goal-judge-error-terminal-response-27585 branch from 76fd236 to fb53bbd Compare May 29, 2026 09:15

briandevans force-pushed the fix/goal-judge-error-terminal-response-27585 branch from fb53bbd to 9db61bc Compare May 30, 2026 10:11

briandevans closed this Jun 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(goals): pause loop when judge errors and response is terminal (#27585)#27752

fix(goals): pause loop when judge errors and response is terminal (#27585)#27752
briandevans wants to merge 1 commit into
NousResearch:mainfrom
briandevans:fix/goal-judge-error-terminal-response-27585

briandevans commented May 18, 2026 •

edited

Loading

Uh oh!

briandevans commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

briandevans commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Related Issue

Type of Change

Changes Made

How to Test

Checklist

Code

Documentation & Housekeeping

Sibling Audit

Related

Uh oh!

briandevans commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

briandevans commented May 18, 2026 •

edited

Loading