fix(ci): exit-4 forensics for vanishing test files in run_tests_parallel.py#43646

Merged: teknium1 merged 2 commits into main from infra/exit4-forensics on Jun 10, 2026

@teknium1

Contributor

Summary

When a per-file pytest run exhausts the exit-4 retry loop, the runner now appends filesystem forensics to the captured output — so the recurring CI-only "file or directory not found" failures become attributable from the log instead of guessed at.

Why

PR #30179's new tests/test_iron_proxy.py failed exactly one shard (test (5)) with ERROR: file or directory not found across 4 consecutive CI runs — including a fresh merge SHA on fresh runners — while:

  • the planner's --co pass counted its 88 tests seconds earlier (file provably on disk),
  • the identical slice passes locally against the exact merge commit (237 files, 5755 tests, 0 failed),
  • a tree-integrity watcher running alongside the local slice confirms no sibling test mutates the repo,
  • three unrelated branches showed the same one-shard signature the same day (test (1), test (3), test (5) — one failing on a file that exists on main).
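
The tree-integrity check mentioned in the list above is not shown in this PR. As a rough illustration only, a watcher of that kind could hash every file before and after the slice runs and report differences. The function names `snapshot_tree` and `diff_snapshots` are hypothetical and not taken from the repository.

```python
import hashlib
from pathlib import Path


def snapshot_tree(root: Path) -> dict:
    """Map each file path (relative to root) to the SHA-256 of its bytes."""
    digests = {}
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if ".git" in rel.parts:
            continue  # ignore git metadata, which changes legitimately
        if path.is_file():
            digests[rel.as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def diff_snapshots(before: dict, after: dict) -> dict:
    """List paths that were removed, added, or changed between two snapshots."""
    common = before.keys() & after.keys()
    return {
        "removed": sorted(set(before) - set(after)),
        "added": sorted(set(after) - set(before)),
        "changed": sorted(p for p in common if before[p] != after[p]),
    }
```

Taking one snapshot before the slice and one after, an all-empty diff would support the claim that no sibling test mutates the tree. A polling variant would be needed to catch a file that disappears and reappears mid-run.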

One run also produced a stale-content mode: 8 tests failed importing _egress_proxy_args_for_docker from a docker.py that momentarily had main's content while the same process had already imported the PR's brand-new sibling module. That is impossible against a stable checkout.

The existing exit-4 backoff retry (3 spaced attempts) doesn't recover these, and the log carries nothing to diagnose them with. This PR adds the missing observability.

Changes

  • scripts/run_tests_parallel.py: when rc == 4 survives the retry loop, append a forensics block to the output: file-exists-now, retries used, parent-dir entry count + similarly-named entries, git status --porcelain dirty count + first 10 entries. Wrapped in broad try/except so forensics can never mask the rc=4. Zero behavior change otherwise.
  • tests/test_run_tests_parallel.py: 2 new tests (exhausted-retries path asserts forensics + exists=True + retry count; genuinely-missing path asserts fail-fast + exists=False).

Validation

Result
  • tests/test_run_tests_parallel.py: 8 / 8 (2 new)
  • E2E via importlib module load + mocked _spawn_pytest_once: forensics block present in both paths, rc preserved
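
The E2E check relies on loading a script by file path and replacing `_spawn_pytest_once` with a mock. The sketch below shows that general technique against a tiny stand-in script written to a temporary directory. The stand-in's `run_file` function, its behavior, and the `_spawn_pytest_once` signature are invented for illustration and should not be read as the real runner's API.

```python
import importlib.util
import tempfile
from pathlib import Path
from unittest import mock

# Stand-in for a script under scripts/ that is not importable as a package module.
SCRIPT_SOURCE = '''
def _spawn_pytest_once(path):
    raise RuntimeError("would launch a real pytest subprocess")

def run_file(path):
    rc, output = _spawn_pytest_once(path)
    if rc == 4:
        output += "\\n--- exit-4 forensics ---"
    return rc, output
'''


def load_script(path: Path, name: str = "runner_under_test"):
    """Import a standalone .py file as a module object via importlib."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "run_tests_parallel.py"
        script.write_text(SCRIPT_SOURCE)
        runner = load_script(script)
        # Patch the module attribute so no subprocess is ever started.
        fake = (4, "ERROR: file or directory not found")
        with mock.patch.object(runner, "_spawn_pytest_once", return_value=fake):
            return runner.run_file("tests/test_example.py")
```

Because `run_file` looks up `_spawn_pytest_once` in the module's globals at call time, patching the attribute on the loaded module is enough to intercept it.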

Infographic

exit4-forensics

…sts exit-4 retries

A PR-added test file (tests/test_iron_proxy.py, PR #30179) repeatedly
failed exactly one CI shard with 'ERROR: file or directory not found'
across 4 runs (including a fresh merge SHA on fresh runners), while the
identical slice passes locally against the same merge commit and a
tree-integrity watcher confirms no sibling test mutates the repo. Three
unrelated branches showed the same one-shard signature the same day.

We currently cannot attribute these because the log only carries
pytest's exit-4 line. This adds a forensics block to the captured
output when exit-4 survives the retry loop:

- does the file exist NOW (post-retries)
- parent dir entry count + similarly-named entries
- git status --porcelain dirty-entry count + first 10 entries

Zero behavior change: rc stays 4, retries unchanged, forensics wrapped
in a broad try/except so they can never mask the failure.

Two new tests cover the exhausted-retries and genuinely-missing paths.
@github-actions

github-actions Bot commented Jun 10, 2026

Contributor

🔎 Lint report: infra/exit4-forensics vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 10701 on HEAD, 10701 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 5598 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

@alt-glitch alt-glitch added type/test Test coverage or test infrastructure P3 Low — cosmetic, nice to have labels Jun 10, 2026
@teknium1 teknium1 merged commit 07ac185 into main Jun 10, 2026
28 checks passed
@teknium1 teknium1 deleted the infra/exit4-forensics branch June 10, 2026 17:04

3 participants