fix(ci): make parallel runner's exit-4 retry robust for newly-added test files by teknium1 · Pull Request #42994 · NousResearch/hermes-agent

teknium1 · 2026-06-09T18:01:44Z

Summary

A freshly-added test file no longer reds a CI shard on a transient "file or directory not found".

The per-file runner (scripts/run_tests_parallel.py) re-runs a file when pytest exits 4 while the file exists on disk — a transient on loaded shared runners where the planner collects a file (--collect-only counts its tests) but the per-file subprocess fails to stat it moments later. The old single-shot retry could land in the same high-load window and fail again, and it was gated on one Path.exists() check that can itself be a flaky stat under that load. Result: a new test file that LPT pins to one shard deterministically reds that shard — no real test failure, the file just never executes.

Changes

scripts/run_tests_parallel.py:
- Extract subprocess spawn / communicate / process-tree-kill into a shared _spawn_pytest_once() (removes ~90 lines duplicated between the primary run and the retry).
- Replace the single-shot retry with a bounded backoff loop (_EXIT4_RETRY_ATTEMPTS, escalating sleep) that re-runs while the file is present.
- Add _file_present() — re-checks existence across a few spaced stats so one flaky negative doesn't wrongly conclude "missing". A genuinely-missing file still fails fast (exit 4 not swallowed).
tests/test_run_tests_parallel.py: adds tests for transient-then-pass recovery, genuinely-missing fails fast (no retry), give-up after max attempts, and the _file_present transient/missing cases.
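
The retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from scripts/run_tests_parallel.py: the names `file_present`, `run_with_exit4_retry`, `run_once`, the delay values, and the number of stat checks are all assumptions made for the example. Only the idea of a bounded, escalating retry gated on a repeated existence check comes from the PR, and the attempt count of 3 is taken from the Validation table.

```python
import time
from pathlib import Path
from typing import Callable

# pytest exits with code 4 on usage errors, which includes
# "file or directory not found".
PYTEST_EXIT_USAGE_ERROR = 4

# Assumed value: the PR names _EXIT4_RETRY_ATTEMPTS, and the validation
# table mentions up to 3 retries.
EXIT4_RETRY_ATTEMPTS = 3


def file_present(
    path: Path,
    checks: int = 3,            # hypothetical number of spaced stats
    delay: float = 0.05,        # hypothetical spacing between stats
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if any of a few spaced existence checks succeeds."""
    for i in range(checks):
        if path.exists():
            return True
        if i < checks - 1:
            sleep(delay)
    return False


def run_with_exit4_retry(
    run_once: Callable[[], int],  # stand-in for one pytest subprocess run
    path: Path,
    attempts: int = EXIT4_RETRY_ATTEMPTS,
    base_delay: float = 0.5,      # hypothetical backoff unit
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run once, then retry on exit 4 while the file is present on disk."""
    code = run_once()
    for attempt in range(1, attempts + 1):
        if code != PYTEST_EXIT_USAGE_ERROR:
            break  # passed, or failed for a real reason
        if not file_present(path, sleep=sleep):
            break  # genuinely missing: fail fast, exit 4 is not swallowed
        sleep(base_delay * attempt)  # escalating sleep between retries
        code = run_once()
    return code
```

Passing `sleep` as a parameter is a choice made for this sketch so that tests can replace it with a no-op and exercise the transient-then-pass, missing-file, and give-up paths without waiting.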

Validation

| Scenario | Before | After |
|---|---|---|
| exit-4 then pass (file exists) | retried once; could re-fail in same window | retried up to 3× w/ backoff → recovers |
| exit-4, file truly missing | failed (correct) | failed fast, no retry (preserved) |
| `tests/test_run_tests_parallel.py` | 1 passed | 6 passed |

Context

Surfaced by PR #38199, whose new tests/tools/test_write_approval.py deterministically turned the test (1) shard red across multiple fresh workflow runs: the file collects and passes locally under the exact runner invocation, but the shard's per-file subprocess couldn't stat it. This fixes the runner so any test-adding PR is unaffected; #38199 will rebase onto this.

Infographic

…est files

The per-file test runner re-runs a file once when pytest exits 4 ("file or directory not found") while the file exists on disk. This is a transient seen on loaded shared CI runners where the planner collects a file (--collect-only counts its tests) but the per-file subprocess fails to stat it moments later. A single immediate retry could land in the same brief high-load window and fail again, and the retry was gated on one Path.exists() check that can itself be a flaky stat under that load, so a freshly-added test file that LPT pins to one shard would deterministically red that shard on every run (no actual test failure; the file just never executes).

- Extract the subprocess spawn/communicate/process-tree-kill logic into a shared _spawn_pytest_once() helper (removes ~90 lines of duplication between the primary run and the retry).
- Replace the single-shot retry with a bounded backoff loop (_EXIT4_RETRY_ATTEMPTS, escalating sleep) that re-runs while the file is present on disk.
- Add _file_present() which re-checks existence across a few spaced stats, so a single flaky negative stat doesn't wrongly conclude the file is missing. A genuinely-missing file (typo/deleted) still fails fast: exit 4 is not swallowed when the file truly does not exist.
- Tests: transient-then-pass recovery, genuinely-missing fails fast with no retry, give-up after max attempts, and _file_present transient/missing cases.

github-actions · 2026-06-09T18:02:36Z

🔎 Lint report: `fix/parallel-runner-exit4-retry` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 10612 on HEAD, 10609 on base (🆕 +3)

🆕 New issues (3):

Rule	Count
`unresolved-attribute`	2
`invalid-argument-type`	1

First entries

tests/test_run_tests_parallel.py:203: [unresolved-attribute] unresolved-attribute: Attribute `loader` is not defined on `None` in union `ModuleSpec | None`
tests/test_run_tests_parallel.py:202: [invalid-argument-type] invalid-argument-type: Argument to function `module_from_spec` is incorrect: Expected `ModuleSpec`, found `ModuleSpec | None`
tests/test_run_tests_parallel.py:203: [unresolved-attribute] unresolved-attribute: Attribute `exec_module` is not defined on `None` in union `Loader | None`

✅ Fixed issues: none

Unchanged: 5561 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.
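
The three new `ty` diagnostics above all stem from `importlib.util.spec_from_file_location` returning `ModuleSpec | None`, and from `spec.loader` being `Loader | None`. The test file's actual code is not shown on this page, so the following is only a sketch of one common way to narrow both types before use; the helper name `load_module_from_path` is hypothetical.

```python
import importlib.util
from pathlib import Path
from types import ModuleType


def load_module_from_path(name: str, path: Path) -> ModuleType:
    """Load a Python file as a module, guarding the Optional spec and loader."""
    spec = importlib.util.spec_from_file_location(name, path)
    # Both checks narrow the types for a checker: spec may be None when no
    # loader matches the path, and spec.loader is itself Optional.
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {name!r} from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

With the explicit guard, `module_from_spec` receives a `ModuleSpec` and `exec_module` is called on a `Loader`, which should address the `invalid-argument-type` and `unresolved-attribute` warnings for code shaped like this.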


alt-glitch added the type/bug (Something isn't working) and P3 (Low: cosmetic, nice to have) labels Jun 9, 2026

teknium1 merged commit f082b4e into main Jun 10, 2026
23 checks passed

teknium1 deleted the fix/parallel-runner-exit4-retry branch June 10, 2026 04:39
