fix(ci): make parallel runner's exit-4 retry robust for newly-added test files #42994

Merged
teknium1 merged 1 commit into main from fix/parallel-runner-exit4-retry
Jun 10, 2026
teknium1 commented Jun 9, 2026

Summary

A freshly added test file no longer turns a CI shard red on a transient "file or directory not found" error.

The per-file runner (scripts/run_tests_parallel.py) re-runs a file when pytest exits 4 while the file exists on disk. This is a transient failure seen on loaded shared runners: the planner collects the file (--collect-only counts its tests), but the per-file subprocess fails to stat it moments later. The old single-shot retry could land in the same high-load window and fail again, and it was gated on a single Path.exists() check that can itself be a flaky stat under that load. As a result, a new test file that LPT pins to one shard turns that shard red deterministically: there is no real test failure, the file just never executes.

Changes

  • scripts/run_tests_parallel.py:
    • Extract subprocess spawn / communicate / process-tree-kill into a shared _spawn_pytest_once() (removes ~90 lines duplicated between the primary run and the retry).
    • Replace the single-shot retry with a bounded backoff loop (_EXIT4_RETRY_ATTEMPTS, escalating sleep) that re-runs while the file is present.
    • Add _file_present() — re-checks existence across a few spaced stats so one flaky negative doesn't wrongly conclude "missing". A genuinely-missing file still fails fast (exit 4 not swallowed).
  • tests/test_run_tests_parallel.py: transient-then-pass recovery, genuinely-missing fails fast (no retry), give-up after max attempts, _file_present transient/missing cases.
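The existence re-check described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code merged in this PR: the name `file_present`, the `checks` and `delay` defaults, and the injectable `exists` and `sleep` hooks are all hypothetical.

```python
import time
from pathlib import Path
from typing import Callable, Optional, Union


def file_present(
    path: Union[str, Path],
    checks: int = 3,            # hypothetical default: number of probes
    delay: float = 0.05,        # hypothetical default: seconds between probes
    exists: Optional[Callable[[Union[str, Path]], bool]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True as soon as any of `checks` spaced existence probes succeeds.

    A single negative probe is not trusted; only `checks` consecutive
    negatives are treated as "the file is really missing".
    """
    probe = exists if exists is not None else (lambda p: Path(p).exists())
    for i in range(checks):
        if probe(path):
            return True
        if i < checks - 1:
            sleep(delay)  # space the probes out instead of hammering stat()
    return False
```

The `exists` and `sleep` parameters exist only so the behavior can be unit-tested without touching the filesystem or actually waiting.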

Validation

| Scenario | Before | After |
| --- | --- | --- |
| exit-4 then pass (file exists) | retried once; could re-fail in same window | retried up to 3× w/ backoff → recovers |
| exit-4, file truly missing | failed (correct) | failed fast, no retry (preserved) |
| tests/test_run_tests_parallel.py | 1 passed | 6 passed |
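The "After" column can be read as a bounded retry loop with escalating sleeps. The following is a minimal sketch of that behavior, not the runner's actual code: `run_with_exit4_retry`, `base_delay`, the linear backoff schedule, and the two injected callables are assumptions made for illustration, and the value 3 for the attempt count is taken from "up to 3×" in the table.

```python
import time
from typing import Callable

EXIT_USAGE_ERROR = 4       # pytest's exit code for usage errors such as a missing path
EXIT4_RETRY_ATTEMPTS = 3   # assumed value, based on "up to 3x" in the table


def run_with_exit4_retry(
    run_once: Callable[[], int],        # hypothetical: spawns pytest once, returns its exit code
    file_present: Callable[[], bool],   # hypothetical: robust "does the test file exist" check
    attempts: int = EXIT4_RETRY_ATTEMPTS,
    base_delay: float = 0.5,            # hypothetical backoff unit, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Re-run on exit 4 while the file is present, with escalating sleeps."""
    code = run_once()
    for attempt in range(1, attempts + 1):
        if code != EXIT_USAGE_ERROR:
            break                       # passed or failed for a real reason
        if not file_present():
            break                       # genuinely missing: fail fast, keep exit 4
        sleep(base_delay * attempt)     # escalate: 0.5s, 1.0s, 1.5s, ...
        code = run_once()
    return code
```

With this shape, the three rows of the table map to three cases: a 4 followed by a 0 returns 0 after one sleep, a 4 with the file absent returns 4 without sleeping, and a persistent 4 returns 4 after `attempts` retries.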

Context

Surfaced by PR #38199, whose new tests/tools/test_write_approval.py deterministically turned test (1) red across multiple fresh workflow runs: the file collects and passes locally under the exact runner invocation, but the shard's per-file subprocess couldn't stat it. This fixes the runner so any test-adding PR is unaffected; #38199 will rebase onto this.

…est files

The per-file test runner re-runs a file once when pytest exits 4 ("file or
directory not found") while the file exists on disk — a transient seen on
loaded shared CI runners where the planner collects a file (--collect-only
counts its tests) but the per-file subprocess fails to stat it moments later.

A single immediate retry could land in the same brief high-load window and
fail again, and the retry was gated on one Path.exists() check that can itself
be a flaky stat under that load — so a freshly-added test file that LPT pins to
one shard would deterministically red that shard on every run (no actual test
failure; the file just never executes).

- Extract the subprocess spawn/communicate/process-tree-kill logic into a
  shared _spawn_pytest_once() helper (removes ~90 lines of duplication between
  the primary run and the retry).
- Replace the single-shot retry with a bounded backoff loop
  (_EXIT4_RETRY_ATTEMPTS, escalating sleep) that re-runs while the file is
  present on disk.
- Add _file_present() which re-checks existence across a few spaced stats, so a
  single flaky negative stat doesn't wrongly conclude the file is missing. A
  genuinely-missing file (typo/deleted) still fails fast — exit 4 is not
  swallowed when the file truly does not exist.
- Tests: transient-then-pass recovery, genuinely-missing fails fast with no
  retry, give-up after max attempts, and _file_present transient/missing cases.
@github-actions

github-actions Bot commented Jun 9, 2026

🔎 Lint report: fix/parallel-runner-exit4-retry vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 10612 on HEAD, 10609 on base (🆕 +3)

🆕 New issues (3):

| Rule | Count |
| --- | --- |
| unresolved-attribute | 2 |
| invalid-argument-type | 1 |

First entries
tests/test_run_tests_parallel.py:203: [unresolved-attribute] unresolved-attribute: Attribute `loader` is not defined on `None` in union `ModuleSpec | None`
tests/test_run_tests_parallel.py:202: [invalid-argument-type] invalid-argument-type: Argument to function `module_from_spec` is incorrect: Expected `ModuleSpec`, found `ModuleSpec | None`
tests/test_run_tests_parallel.py:203: [unresolved-attribute] unresolved-attribute: Attribute `exec_module` is not defined on `None` in union `Loader | None`

✅ Fixed issues: none

Unchanged: 5561 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.
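The three new ty diagnostics all stem from `importlib.util.spec_from_file_location()` returning `ModuleSpec | None`, and from `spec.loader` being `Loader | None`. Assuming the test loads the runner script by path (the test file itself is not shown here, so this is a guess), one way to narrow both optionals is an explicit check before use. The helper name `load_module_from_path` is hypothetical.

```python
import importlib.util
from pathlib import Path
from types import ModuleType


def load_module_from_path(name: str, path: Path) -> ModuleType:
    """Import a module from a file path, narrowing the optional spec and loader."""
    spec = importlib.util.spec_from_file_location(name, path)
    # Both the spec and its loader may be None; check once so that a type
    # checker can treat them as non-optional below.
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {name!r} from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

Since the lint report only surfaces warnings, a change like this is optional; it would mainly keep the ty count from growing.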

alt-glitch added the labels type/bug (Something isn't working) and P3 (Low: cosmetic, nice to have) on Jun 9, 2026
teknium1 merged commit f082b4e into main on Jun 10, 2026
23 checks passed
teknium1 deleted the fix/parallel-runner-exit4-retry branch on June 10, 2026 04:39
changman pushed a commit to changman/hermes-agent that referenced this pull request Jun 10, 2026
…est files (NousResearch#42994)

alt-glitch pushed a commit that referenced this pull request Jun 14, 2026
…est files (#42994)
