ci(bats): restore parallel jobs after #220 fixed the bootstrap deadlock#219
ci(bats): restore parallel jobs after #220 fixed the bootstrap deadlock#219
Conversation
Greptile SummaryThis PR removes the debug per-file
Confidence Score: 2/5Not safe to merge — the stated deadlock fix (--jobs=1) is missing from the code; CI will hang the same way as before. There is exactly one changed file and exactly one purpose: cap parallelism to unblock CI. That cap is not applied, so the primary objective of the PR is unmet. mise-tasks/ci/bats/nonserial — needs
|
| Filename | Overview |
|---|---|
| mise-tasks/ci/bats/nonserial | Restores bats invocation with junit reporting, but the --jobs=1 cap stated in the PR title/description is not implemented — jobs is still set to nproc, meaning the deadlock fix is not actually applied. |
Comments Outside Diff (1)
-
mise-tasks/ci/bats/nonserial, line 4-8 (link)--jobs=1cap not implemented — will reproduce the hangThe PR title says "force --jobs=1" and the description explicitly states
--jobs >= 2reproduces the deadlock on GHA runners. Butjobsis still computed as the full CPU count (likely 4 onubuntu-latest) and passed unmodified to--jobs "$jobs"on line 31. The script will deadlock the same way as before this PR.
Reviews (5): Last reviewed commit: "ci(bats): clean up debug-loop in nonseri..." | Re-trigger Greptile
429dbd5 to
e8cef96
Compare
| --output "$report_dir" \ | ||
| --jobs "$jobs" \ | ||
| --filter-tags "!serial" \ | ||
| "${test_files[@]}" || status=$? |
There was a problem hiding this comment.
Missing --no-parallelize-within-files flag described in PR fix
Medium Severity
The PR description calls --no-parallelize-within-files "the actual fix for the hang" and says it should mirror the local mise run test:bats task, which includes this flag at mise-tasks/test/bats line 28. The flag is absent from the bats invocation here. While harmless with jobs=1, if jobs is ever corrected to dynamic CPU count (as the PR intends), the missing flag would reintroduce the deadlock the PR aims to fix.
Reviewed by Cursor Bugbot for commit 4656ca1. Configure here.
#217 was squash-merged with a debug version of `mise-tasks/ci/bats/nonserial` (per-file `for` loop with 90s timeout, breaks on first failure, no junit reporting). The actual deadlock that the debug loop was trying to surface — node-gyp bootstrap recursing into the outer test workspace — was fixed in #220. Restore the original `bats --jobs N` parallel invocation with junit reporting now that the underlying deadlock is gone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4656ca1 to
cf4f390
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit cf4f390. Configure here.
| fi | ||
| echo "::endgroup::" | ||
| done | ||
| BATS_REPORT_FILENAME="$report_name" ./test/bats/bin/bats \ |
There was a problem hiding this comment.
--jobs not forced to 1, deadlock workaround ineffective
High Severity
The entire purpose of this PR is to force --jobs=1 to work around a deadlock at --jobs >= 2, but jobs is still set to $(nproc) / $(sysctl -n hw.ncpu) (the CPU core count) and is never overridden to 1. The deadlock workaround described in the PR title and description is not actually applied.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit cf4f390. Configure here.
endevco#220) ## Summary Found the actual root cause of the post-endevco#217 bats parallel deadlock: **endevco#217's workspace-root walk-up in `install::run` interacted catastrophically with the recursive `aube install` that `node-gyp` bootstrap spawns.** ### The deadlock 1. [`node_gyp_bootstrap::bootstrap_blocking`](crates/aube/src/commands/install/node_gyp_bootstrap.rs:152) puts its scratch project at `$XDG_CACHE_HOME/aube/tools/node-gyp/v12/` and spawns a recursive `aube install` there. 2. Under bats, `HOME=$TEST_TEMP_DIR` and `XDG_CACHE_HOME=$TEST_TEMP_DIR/.cache`. If the test has a `pnpm-workspace.yaml` at `$TEST_TEMP_DIR/` (very common — workspace tests, `allowBuilds`, lifecycle tests, etc.), then: 3. The recursive `aube install` starts at `tool_dir`, walks up via `find_workspace_root` (added in endevco#217), and finds the outer test's `pnpm-workspace.yaml`. 4. The recursive process tries to install the outer workspace and blocks on the outer project lock at `$TEST_TEMP_DIR/node_modules/` — the lock the parent `aube install` is already holding. 5. **Two-process deadlock. Both wedge forever.** ### Why it only reproduced in GHA - GHA runners hit the lifecycle-script path more densely under `--jobs >= 2` because multiple parallel tests concurrently trigger postinstall hooks. - Dev machines usually have `node-gyp` already on PATH, so [`ensure()`](crates/aube/src/commands/install/node_gyp_bootstrap.rs:88) short-circuits and the recursive install never runs. ### Fix Write an empty `aube-workspace.yaml` in `tool_dir` alongside the synthetic `package.json`. `find_workspace_root`'s first-ancestor check finds the marker at `tool_dir` and returns immediately — the walk no longer escapes. `workspace_packages` stays empty so `has_workspace` is false and the install runs as a single-package install. ## Test plan - [x] `cargo clippy --all-targets -- -D warnings` — clean - [x] Manual reasoning: the marker kills the walk at its first ancestor, so the recursive install can no longer target the outer workspace, breaking the lock-cycle - [ ] CI green on shards that previously hung (this PR restores `--jobs` parallelism in endevco#219) ## Follow-up Close endevco#219 (was a CI-only workaround); the parallel-hang it papered over is fixed here at the root. 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Low Risk** > Low risk: changes only the node-gyp bootstrap scratch directory setup by adding a local workspace marker to avoid workspace root walk-up and related lock deadlocks during recursive installs. > > **Overview** > Prevents the recursive `aube install` used for `node-gyp` bootstrapping from walking up into an outer workspace (e.g., test fixtures with `pnpm-workspace.yaml`) and deadlocking on the parent install’s project lock. > > This does so by writing an empty `aube-workspace.yaml` into the node-gyp tool cache directory before spawning the recursive install, ensuring workspace-root discovery stops at `tool_dir` and runs as a single-package install. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 93b9e1f. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY --> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Summary
#217 was squash-merged with a debug version of
mise-tasks/ci/bats/nonserial— a per-fileforloop with a 90s timeout, breaks-on-first-failure, no junit reporting. Main has been running on that debug loop ever since.#220 fixed the actual root cause of the bats parallel deadlock (node-gyp bootstrap recursing back into the outer test workspace via #217's walk-up). With that fix on main, the debug script is no longer needed.
This PR restores the original
bats --jobs Nparallel invocation with junit reporting.Test plan
--jobs >= 2. With cli(install): fix node-gyp bootstrap walkup causing bats parallel hang #220 merged, the deadlock is gone, so this PR's parallel invocation should land bats wall-time back at ~2 min/shard.🤖 Generated with Claude Code
Note
Medium Risk
Changes CI test execution from a per-file, timed loop back to a single parallel
batsrun, which may reintroduce hangs or alter failure/reporting behavior in CI.Overview
Replaces the temporary debug per-file/timeout loop in
mise-tasks/ci/bats/nonserialwith a singlebatsinvocation over the shard’s file list.The new run restores JUnit output (
--report-formatter junit,--output,BATS_REPORT_FILENAME) and uses--jobs/--parallel-binary-namefor parallel execution while still filtering outserialtests.Reviewed by Cursor Bugbot for commit cf4f390. Bugbot is set up for automated code reviews on this repo. Configure here.