fix(ci): make parallel runner's exit-4 retry robust for newly-added test files#42994
Merged
Conversation
…est files
The per-file test runner re-runs a file once when pytest exits 4 ("file or
directory not found") while the file exists on disk — a transient seen on
loaded shared CI runners where the planner collects a file (--collect-only
counts its tests) but the per-file subprocess fails to stat it moments later.
A single immediate retry could land in the same brief high-load window and
fail again, and the retry was gated on one Path.exists() check that can itself
be a flaky stat under that load — so a freshly-added test file that LPT pins to
one shard would deterministically red that shard on every run (no actual test
failure; the file just never executes).
- Extract the subprocess spawn/communicate/process-tree-kill logic into a
shared _spawn_pytest_once() helper (removes ~90 lines of duplication between
the primary run and the retry).
- Replace the single-shot retry with a bounded backoff loop
(_EXIT4_RETRY_ATTEMPTS, escalating sleep) that re-runs while the file is
present on disk.
- Add _file_present() which re-checks existence across a few spaced stats, so a
single flaky negative stat doesn't wrongly conclude the file is missing. A
genuinely-missing file (typo/deleted) still fails fast — exit 4 is not
swallowed when the file truly does not exist.
- Tests: transient-then-pass recovery, genuinely-missing fails fast with no
retry, give-up after max attempts, and _file_present transient/missing cases.
Contributor
🔎 Lint report:
|
| Rule | Count |
|---|---|
unresolved-attribute |
2 |
invalid-argument-type |
1 |
First entries
tests/test_run_tests_parallel.py:203: [unresolved-attribute] unresolved-attribute: Attribute `loader` is not defined on `None` in union `ModuleSpec | None`
tests/test_run_tests_parallel.py:202: [invalid-argument-type] invalid-argument-type: Argument to function `module_from_spec` is incorrect: Expected `ModuleSpec`, found `ModuleSpec | None`
tests/test_run_tests_parallel.py:203: [unresolved-attribute] unresolved-attribute: Attribute `exec_module` is not defined on `None` in union `Loader | None`
✅ Fixed issues: none
Unchanged: 5561 pre-existing issues carried over.
Diagnostics are surfaced as warnings — this check never fails the build.
changman
pushed a commit
to changman/hermes-agent
that referenced
this pull request
Jun 10, 2026
…est files (NousResearch#42994) The per-file test runner re-runs a file once when pytest exits 4 ("file or directory not found") while the file exists on disk — a transient seen on loaded shared CI runners where the planner collects a file (--collect-only counts its tests) but the per-file subprocess fails to stat it moments later. A single immediate retry could land in the same brief high-load window and fail again, and the retry was gated on one Path.exists() check that can itself be a flaky stat under that load — so a freshly-added test file that LPT pins to one shard would deterministically red that shard on every run (no actual test failure; the file just never executes). - Extract the subprocess spawn/communicate/process-tree-kill logic into a shared _spawn_pytest_once() helper (removes ~90 lines of duplication between the primary run and the retry). - Replace the single-shot retry with a bounded backoff loop (_EXIT4_RETRY_ATTEMPTS, escalating sleep) that re-runs while the file is present on disk. - Add _file_present() which re-checks existence across a few spaced stats, so a single flaky negative stat doesn't wrongly conclude the file is missing. A genuinely-missing file (typo/deleted) still fails fast — exit 4 is not swallowed when the file truly does not exist. - Tests: transient-then-pass recovery, genuinely-missing fails fast with no retry, give-up after max attempts, and _file_present transient/missing cases.
alt-glitch
pushed a commit
that referenced
this pull request
Jun 14, 2026
…est files (#42994) The per-file test runner re-runs a file once when pytest exits 4 ("file or directory not found") while the file exists on disk — a transient seen on loaded shared CI runners where the planner collects a file (--collect-only counts its tests) but the per-file subprocess fails to stat it moments later. A single immediate retry could land in the same brief high-load window and fail again, and the retry was gated on one Path.exists() check that can itself be a flaky stat under that load — so a freshly-added test file that LPT pins to one shard would deterministically red that shard on every run (no actual test failure; the file just never executes). - Extract the subprocess spawn/communicate/process-tree-kill logic into a shared _spawn_pytest_once() helper (removes ~90 lines of duplication between the primary run and the retry). - Replace the single-shot retry with a bounded backoff loop (_EXIT4_RETRY_ATTEMPTS, escalating sleep) that re-runs while the file is present on disk. - Add _file_present() which re-checks existence across a few spaced stats, so a single flaky negative stat doesn't wrongly conclude the file is missing. A genuinely-missing file (typo/deleted) still fails fast — exit 4 is not swallowed when the file truly does not exist. - Tests: transient-then-pass recovery, genuinely-missing fails fast with no retry, give-up after max attempts, and _file_present transient/missing cases.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A freshly-added test file no longer reds a CI shard on a transient "file or directory not found".
The per-file runner (
scripts/run_tests_parallel.py) re-runs a file when pytest exits 4 while the file exists on disk — a transient on loaded shared runners where the planner collects a file (--collect-onlycounts its tests) but the per-file subprocess fails tostatit moments later. The old single-shot retry could land in the same high-load window and fail again, and it was gated on onePath.exists()check that can itself be a flakystatunder that load. Result: a new test file that LPT pins to one shard deterministically reds that shard — no real test failure, the file just never executes.Changes
scripts/run_tests_parallel.py:_spawn_pytest_once()(removes ~90 lines duplicated between the primary run and the retry)._EXIT4_RETRY_ATTEMPTS, escalating sleep) that re-runs while the file is present._file_present()— re-checks existence across a few spaced stats so one flaky negative doesn't wrongly conclude "missing". A genuinely-missing file still fails fast (exit 4 not swallowed).tests/test_run_tests_parallel.py: transient-then-pass recovery, genuinely-missing fails fast (no retry), give-up after max attempts,_file_presenttransient/missing cases.Validation
tests/test_run_tests_parallel.pyContext
Surfaced by PR #38199, whose new
tests/tools/test_write_approval.pydeterministically redtest (1)across multiple fresh workflow runs — the file collects + passes locally under the exact runner invocation, but the shard's per-file subprocess couldn't stat it. This fixes the runner so any test-adding PR is unaffected; #38199 will rebase onto this.Infographic