test: green up the Tests suite (1/N) — timezone-independent flow-peak tests by Bartok9 · Pull Request #6 · Bartok9/hermes-agent

Bartok9 · 2026-05-30T05:32:28Z

Context

The Tests workflow has been red on main for 6+ weeks (push-to-main failures on 2026-04-18 and 2026-05-08). The current main run reports 41 failures. This is the first of a series of focused PRs to bring the suite back to green; it removes the only failures that were introduced by recent work rather than pre-existing.

What this fixes

tests/test_proactive_scheduler.py — three flow-peak tests were timezone/wall-clock dependent:

test_clear_morning_peak
test_clear_evening_peak
test_timezone_offset_shifts_peak

They seeded synthetic message history off an unaligned reference (time.time() - 15d), which carries the current wall-clock hour. Since analyze_flow() buckets activity by UTC hour, the detected peak drifted by the runtime hour — so the tests passed locally (EDT evening) but failed on CI's UTC runner (e.g. expected morning peak 8–12, got 15).

Fix: floor the reference to midnight UTC so each seeded hour * 3600 maps to exactly that UTC hour.

Verification

Verified green across multiple timezones (21/21 each):

TZ=UTC                  21 passed
TZ=America/New_York     21 passed
TZ=Asia/Kolkata         21 passed
TZ=Pacific/Kiritimati   21 passed

Remaining work (tracked separately)

The other ~38 failures are pre-existing and fall into:

Stale tests lagging intentional code changes (e.g. e9685a5c deliberately removed the 1M Anthropic beta from _COMMON_BETAS, but test_bedrock_1m_context.py still asserts the old behavior).
Python-version-sensitive (google_chat enum pseudo-member access regressed under 3.14; CI runs 3.11).
Test-isolation/xdist-ordering artifacts under -n auto.
Environment-binary dependent (chromium / agent-browser).

These will be addressed in follow-up PRs after confirming test-vs-code authority per cluster.

Made with Cursor

Note

Low Risk
Test-only timestamp fixture changes; no runtime or scheduler logic modified.

Overview
Makes proactive scheduler flow-peak tests deterministic on CI and in any TZ by aligning synthetic message timestamps with how analyze_flow() buckets hours.

_make_messages now anchors its reference day to midnight UTC ((int(time.time()) // 86400) * 86400 - 15d) instead of time.time() - 15d, so each hour * 3600 offset lands on the intended UTC hour. Docstrings note that unaligned refs smeared peaks with the runtime wall-clock hour and broke morning/evening peak assertions on UTC runners.

test_timezone_offset_shifts_peak uses the same UTC-midnight floor for its manual ref, so “14:00 UTC” seeds stay stable regardless of when the job runs.

No production changes—test-only fix for timezone/wall-clock flakiness.

^{Reviewed by Cursor Bugbot for commit 066fc56. Bugbot is set up for automated code reviews on this repo. Configure here.}

The flow-peak tests seeded synthetic message history off an unaligned reference (time.time() - 15d), which carries the current wall-clock hour. Since analyze_flow() buckets by UTC hour, the detected peak drifted by the runtime hour, so test_clear_morning_peak / test_clear_evening_peak / test_timezone_offset_shifts_peak passed locally (EDT evening) but failed on CI's UTC runner. Floor the reference to midnight UTC so each seeded "hour" maps to exactly that UTC hour. Verified across UTC, America/New_York, Asia/Kolkata, and Pacific/Kiritimati. Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions · 2026-05-30T05:33:13Z

🔎 Lint report: `fix/test-suite-green` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 7746 on HEAD, 7746 on base (➖ 0)

🆕 New issues (3):

Rule	Count
`invalid-argument-type`	3

First entries

run_agent.py:6649: [invalid-argument-type] invalid-argument-type: Argument to function `build_anthropic_client` is incorrect: Expected `str`, found `str | dict[Unknown | str, Unknown | str | dict[str, str]] | Any | ... omitted 3 union elements`
run_agent.py:12542: [invalid-argument-type] invalid-argument-type: Argument to function `len` is incorrect: Expected `Sized`, found `(str & ~AlwaysFalsy) | (dict[Unknown | str, Unknown | str | dict[str, str]] & ~AlwaysFalsy) | (Any & ~AlwaysFalsy) | ... omitted 3 union elements`
run_agent.py:12539: [invalid-argument-type] invalid-argument-type: Argument to function `_is_oauth_token` is incorrect: Expected `str`, found `str | dict[Unknown | str, Unknown | str | dict[str, str]] | Any | ... omitted 3 union elements`

✅ Fixed issues (3):

Rule	Count
`invalid-argument-type`	3

First entries

run_agent.py:6649: [invalid-argument-type] invalid-argument-type: Argument to function `build_anthropic_client` is incorrect: Expected `str`, found `str | dict[Unknown, Unknown] | Any | ... omitted 3 union elements`
run_agent.py:12542: [invalid-argument-type] invalid-argument-type: Argument to function `len` is incorrect: Expected `Sized`, found `(str & ~AlwaysFalsy) | (dict[Unknown, Unknown] & ~AlwaysFalsy) | (Any & ~AlwaysFalsy) | ... omitted 3 union elements`
run_agent.py:12539: [invalid-argument-type] invalid-argument-type: Argument to function `_is_oauth_token` is incorrect: Expected `str`, found `str | dict[Unknown, Unknown] | Any | ... omitted 3 union elements`

Unchanged: 4073 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

Bedrock 1M cluster (3 tests): add context-1m-2025-08-07 to _COMMON_BETAS so it reaches Bedrock + non-OAuth fast-mode requests, while keeping the client-level native/custom default header (and OAuth fast-mode) stripping it for subscriptions that 400 on the long-context beta. Fast-mode now re-includes the common betas even on models without speed=fast support. max_tokens cluster (2 tests): the unsupported-parameter retry branch popped max_tokens but never re-added max_completion_tokens, so providers that rename the param ("Unknown parameter: max_tokens") lost their cap. Re-add under the renamed key on retry (sync + async), preserving the ZAI 1210 strip-entirely behavior. Follows PR #6 (1/N: timezone-independent flow-peak scheduler tests).

…w goals (NousResearch#34196, NousResearch#34197) Two related /goal bugs: (review/reflect/suggest/analyze) unless the assistant uses a magic phrase like 'goal complete'. The synthetic continuation loop escalates reflection into producing concrete artifacts that the goal only listed as *examples* of possible help. untracked` even when the user did not ask for staging/commit/push. This races with preflight compression and survives session split, turning a scoped 'done' answer into out-of-scope artifact production. Both bugs converge on the goal-judge prompt machinery in `hermes_cli/goals.py`. The fix is layered, minimal, and reviewable: 1. Tighten JUDGE_SYSTEM_PROMPT with three new explicit guardrails: - EXPLORATORY goals (review/reflect/suggest/analyze) are completable by a substantive synthesis — do NOT require additional artifacts. - Do NOT infer incompletion from untracked / unstaged / uncommitted files unless the goal explicitly required staging/commit/push. - Do NOT require a magic phrase like 'goal complete'. - Treat 'for example' / 'maybe' / 'you could' items as illustrative, NOT as required deliverables. - Scope-narrow goals (one file, one section, one specific change) are DONE when that exact scope is confirmed done — do not expand. 2. Add a transparent keyword classifier `_classify_goal_shape(goal)` that returns 'exploratory', 'illustrative', or 'concrete'. Cheap, reviewer-friendly substring detection — the LLM judge still makes the final DONE/CONTINUE call, but it now sees what kind of goal it is. Kept intentionally simple so behaviour is easy to audit and tune from issue feedback. 3. Append a corresponding goal-shape hint to the user-prompt template when the goal is exploratory or illustrative. The hint reminds the judge that for those shapes, a high-quality synthesis IS the deliverable. Concrete goals get the original strict template unchanged. 4. The with-subgoals template (already enforces strict per-criterion evidence) deliberately does NOT receive the shape hint — the user's explicit /subgoal criteria take precedence over goal-shape heuristics. Why prompt-level vs adding new state: This intentionally avoids adding new GoalState fields, new gateway plumbing, or new compression-lifecycle coupling. The goal judge is the single point where 'should we continue?' is decided; teaching it to read goal shape correctly is the smallest change that addresses both issues' root cause without touching the compression race window described in NousResearch#34197 #2-#5. Those lifecycle concerns are real and documented in the issue's 'Proposed fixes #3-#6' — they belong in a separate gateway-side change. This PR fixes the judge's bad 'continue' verdict that triggers the lifecycle problem in the first place. Without the bad verdict, the race window in NousResearch#34197 has nothing to race over. Tests (17 new, all 67 in test_goals.py pass): - TestClassifyGoalShape (9 tests): exploratory/illustrative/concrete classification including the sanitized NousResearch#34196 repro goal text. - TestJudgeSystemPromptGuardrails (4 tests): system prompt mentions exploratory goals, warns about untracked files, warns about requiring magic phrases, warns about illustrative examples. - TestJudgePromptIncludesGoalShapeHint (4 tests): the user prompt receives the shape hint for exploratory/illustrative goals, does NOT for concrete goals, and the with-subgoals template skips the hint to preserve its strict per-criterion evidence rule. Refs: NousResearch#34196 NousResearch#34197 Closes: NousResearch#34196 NousResearch#34197 Co-authored-by: Cursor <cursoragent@cursor.com>

Bartok9 merged commit b156141 into main May 30, 2026
7 of 8 checks passed

Bartok9 deleted the fix/test-suite-green branch May 30, 2026 06:07

Bartok9 mentioned this pull request May 30, 2026

test(2/N): fix bedrock 1M-context beta + max_tokens retry hardening #8

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: green up the Tests suite (1/N) — timezone-independent flow-peak tests#6

test: green up the Tests suite (1/N) — timezone-independent flow-peak tests#6
Bartok9 merged 1 commit into
mainfrom
fix/test-suite-green

Bartok9 commented May 30, 2026 •

edited by cursor Bot

Loading

Uh oh!

github-actions Bot commented May 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Bartok9 commented May 30, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Context

What this fixes

Verification

Remaining work (tracked separately)

Uh oh!

github-actions Bot commented May 30, 2026

🔎 Lint report: fix/test-suite-green vs origin/main

ruff

ty (type checker)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Bartok9 commented May 30, 2026 •

edited by cursor Bot

Loading

🔎 Lint report: `fix/test-suite-green` vs `origin/main`