
fix(goals): salvage PR #19160 + auto-pause on consecutive judge parse failures #21576

Merged: teknium1 merged 2 commits into main from hermes/hermes-1912be3d on May 8, 2026
@teknium1 teknium1 commented May 8, 2026

Salvages PR #19160 (@liquidchen) and adds an auto-pause guard so weak judge models can't silently burn the entire goal turn budget.

Changes

  • cherry-picked a9f5f81 from PR #19160 (contributor authorship preserved via rebase-merge): routes goal status notices through adapter.send() instead of a non-existent send_message(), defers the notice until after the main response is delivered, and cancels queued synthetic goal continuations on /goal pause and /goal clear
  • new: GoalState.consecutive_parse_failures counter + DEFAULT_MAX_CONSECUTIVE_PARSE_FAILURES=3 threshold
  • _parse_judge_response() / judge_goal() now return a 3-tuple (verdict, reason, parse_failed). parse_failed is True only on empty/non-JSON judge output and False on API/transport errors (those stay transient, fail-open)
  • evaluate_after_turn() auto-pauses after 3 consecutive parse failures with a config-pointer message:
    ⏸ Goal paused — the judge model (3 turns) isn't returning the required JSON verdict.
    Route the judge to a stricter model in ~/.hermes/config.yaml:
      auxiliary:
        goal_judge:
          provider: openrouter
          model: google/gemini-3-flash-preview
    Then /goal resume to continue.
    
  • test fix: test_goal_verdict_send.py was sharing a hardcoded session_id="goal-sess-1" across tests. It only worked before because _post_turn_goal_continuation was a never-awaited coroutine: the coroutine never ran, so cross-test state never leaked. PR #19160's async conversion made the coroutine actually run, surfacing the latent xdist test-leakage bug. Each test now gets a unique session_id via uuid.uuid4().
  • AUTHOR_MAP entry: ytchen0719@gmail.com → liquidchen
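
The 3-tuple contract described above can be illustrated with a minimal sketch. This is not the code from the PR: the function name `parse_judge_response_sketch`, the exact reason strings, and the choice to treat JSON without a `done` field as a parse failure are assumptions made for illustration only.

```python
import json
from typing import Tuple


def parse_judge_response_sketch(raw: str) -> Tuple[str, str, bool]:
    """Hypothetical stand-in for _parse_judge_response().

    Returns (verdict, reason, parse_failed). The verdict always fails open
    to "continue", but parse_failed tells the caller that the judge did not
    produce the required JSON at all.
    """
    text = (raw or "").strip()
    if not text:
        # Empty output counts as a parse failure.
        return "continue", "judge returned empty response", True
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Prose instead of JSON counts as a parse failure.
        return "continue", f"judge reply was not JSON: {text[:60]!r}", True
    if not isinstance(data, dict) or "done" not in data:
        # Assumption: valid JSON without a "done" field is also unusable.
        return "continue", "judge reply missing 'done' field", True
    verdict = "done" if data["done"] else "continue"
    return verdict, str(data.get("reason", "")), False
```

An API or transport error would be handled by the caller and reported with parse_failed=False, so it never feeds the auto-pause counter.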

Why

Discord user barteq reported an infinite /goal loop when using deepseek-v4-flash as the judge. The judge returned output like:

judge returned empty response
judge reply was not JSON: "Let me analyze whether the goal is fully satisfied..."

The loop failed open to "continue" on every such turn, burning the entire 20-turn budget.

Gille already has a config-side workaround (route auxiliary.goal_judge to google/gemini-3-flash-preview), which is the right answer long-term — but this PR makes the agent detect the failure mode and tell the user HOW to fix it, instead of running silently until the budget is exhausted.

PR #19160 (same session) was also needed because /goal clear didn't stop an already-queued synthetic continuation — related failure mode, same shape.
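
A rough sketch of the cancellation behaviour, assuming the gateway keeps pending messages in a queue and can tell synthetic goal continuations apart from real user messages. The `QueuedMessage` type and its `synthetic_goal_continuation` flag are hypothetical names, not the gateway's actual data model.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque


@dataclass
class QueuedMessage:
    text: str
    # Hypothetical flag marking messages injected by the goal loop itself.
    synthetic_goal_continuation: bool = False


def cancel_goal_continuations(queue: Deque[QueuedMessage]) -> int:
    """Drop queued synthetic continuations, keep user messages in order.

    Returns the number of messages dropped.
    """
    kept = [m for m in queue if not m.synthetic_goal_continuation]
    dropped = len(queue) - len(kept)
    queue.clear()
    queue.extend(kept)
    return dropped


# Usage: /goal pause and /goal clear would call this before updating state.
pending = deque([
    QueuedMessage("user: what's the status?"),
    QueuedMessage("continue working on the goal", synthetic_goal_continuation=True),
    QueuedMessage("user: also check the logs"),
])
cancel_goal_continuations(pending)
```

Filtering in place, rather than clearing the whole queue, is what keeps normal queued user messages intact.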

Validation

| Scenario | Before | After |
| --- | --- | --- |
| Targeted tests (tests/hermes_cli/test_goals.py + tests/gateway/test_goal_status_notice.py + tests/gateway/test_goal_verdict_send.py) | 3 failed (async conversion) | 43/43 passing |
| barteq scenario: judge returns prose for 3 turns | runs all 20 turns | pauses at turn 3 with config-pointer |
| Transient ConnectionError for 10 turns | | does NOT trip auto-pause (stays active) |
| 2 bad + 1 good + 3 bad | | pauses at turn 6 (counter reset by good reply) |
| Counter durability across GoalManager reload | | persisted via state_meta |
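
The counter scenarios in the validation results can be reproduced with a small simulation. This is a sketch under assumptions: `GoalStateSketch`, `record_turn`, and `turns_until_pause` are illustrative names, and persistence via state_meta is not modelled here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Mirrors DEFAULT_MAX_CONSECUTIVE_PARSE_FAILURES from the PR description.
MAX_CONSECUTIVE_PARSE_FAILURES = 3


@dataclass
class GoalStateSketch:
    status: str = "active"
    consecutive_parse_failures: int = 0


def record_turn(state: GoalStateSketch, parse_failed: bool) -> None:
    """Update the counter after one judged turn and pause at the threshold."""
    if parse_failed:
        state.consecutive_parse_failures += 1
        if state.consecutive_parse_failures >= MAX_CONSECUTIVE_PARSE_FAILURES:
            state.status = "paused"
    else:
        # Any non-parse-failure turn (verdict or API error) resets the streak.
        state.consecutive_parse_failures = 0


def turns_until_pause(outcomes: Iterable[bool]) -> Optional[int]:
    """Return the 1-based turn at which the goal pauses, or None if it never does."""
    state = GoalStateSketch()
    for turn, parse_failed in enumerate(outcomes, start=1):
        record_turn(state, parse_failed)
        if state.status == "paused":
            return turn
    return None


prose_judge = turns_until_pause([True, True, True])                      # 3
transient_errors = turns_until_pause([False] * 10)                       # None
mixed = turns_until_pause([True, True, False, True, True, True])         # 6
```

Counting only consecutive failures is what lets a single good reply buy the judge three more attempts, as in the "2 bad + 1 good + 3 bad" case.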

E2E tested via execute_code with real imports from the worktree, isolated HERMES_HOME, and mock judge clients returning the exact shapes barteq reported.

Credit

Closes #19160.

JC and others added 2 commits May 7, 2026 17:07
Route goal status notices through the platform adapter send API and register post-delivery callbacks so completed-goal notices appear after the final assistant response. Also cancel queued synthetic goal continuations on /goal pause and /goal clear while preserving normal queued user messages.
Weak judge models (e.g. deepseek-v4-flash) return empty strings or prose
when asked for the strict {done, reason} JSON verdict. The old code
failed-open to continue on every such turn, burning the entire turn
budget with log lines like

  judge returned empty response
  judge reply was not JSON: "Let me analyze whether the goal..."

and /goal clear could not stop it mid-loop without /stop.

After N=3 consecutive *parse* failures (transport/API errors don't
count — those are transient), the loop auto-pauses and prints:

  ⏸ Goal paused — the judge model (3 turns) isn't returning the
  required JSON verdict. Route the judge to a stricter model in
  ~/.hermes/config.yaml:
    auxiliary:
      goal_judge:
        provider: openrouter
        model: google/gemini-3-flash-preview
  Then /goal resume to continue.

The counter resets on any turn that is not a parse failure (a
"done"/"continue" verdict or an API error) and persists across
GoalManager reloads so cross-session resumes carry the correct state.

Also fixes test_goal_verdict_send.py sharing a hardcoded session_id
across tests — the shared id only worked because the previous
_post_turn_goal_continuation was a never-awaited coroutine. Now that
PR #19160 made it properly awaited, the xdist test-leakage bug
surfaced. Each test gets a unique session_id via uuid suffix.
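
One way to generate per-test session ids with a uuid suffix, shown as a minimal sketch. The helper name `make_session_id` and the `goal-sess` prefix are assumptions; the actual tests may build the id differently.

```python
import uuid


def make_session_id(prefix: str = "goal-sess") -> str:
    """Return a session id that is unique per call.

    Unique ids stop state written by one test from being visible to another
    test that happens to share an xdist worker.
    """
    return f"{prefix}-{uuid.uuid4().hex}"


first = make_session_id()
second = make_session_id()
```

In a pytest suite this helper could be wrapped in a fixture so each test receives a fresh id without calling it explicitly.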

github-actions Bot commented May 8, 2026

🔎 Lint report: hermes/hermes-1912be3d vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 7683 on HEAD, 7670 on base (🆕 +13)

🆕 New issues (5):

| Rule | Count |
| --- | --- |
| unresolved-attribute | 3 |
| invalid-parameter-default | 1 |
| unresolved-import | 1 |

First entries
tests/gateway/test_goal_status_notice.py:122: [unresolved-attribute] unresolved-attribute: Unresolved attribute `_pending_messages` on type `FakeAdapter`
tests/gateway/test_goal_verdict_send.py:64: [invalid-parameter-default] invalid-parameter-default: Default value of type `None` is not assignable to annotated parameter type `str`
tests/gateway/test_goal_status_notice.py:5: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/hermes_cli/test_goals.py:495: [unresolved-attribute] unresolved-attribute: Attribute `consecutive_parse_failures` is not defined on `None` in union `GoalState | None`
tests/gateway/test_goal_status_notice.py:146: [unresolved-attribute] unresolved-attribute: Object of type `FakeAdapter` has no attribute `_pending_messages`

✅ Fixed issues: none

Unchanged: 4028 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

@teknium1 teknium1 merged commit 307c85e into main May 8, 2026
11 of 12 checks passed
@teknium1 teknium1 deleted the hermes/hermes-1912be3d branch May 8, 2026 00:33
@alt-glitch alt-glitch added labels type/bug (Something isn't working), comp/agent (Core agent loop, run_agent.py, prompt builder), comp/gateway (Gateway runner, session dispatch, delivery), and P2 (Medium, degraded but workaround exists) May 8, 2026