cli(install): fix node-gyp bootstrap walkup causing bats parallel hang#220
cli(install): fix node-gyp bootstrap walkup causing bats parallel hang#220
Conversation
#217's workspace-root walk-up in install::run interacted catastrophically with the recursive `aube install` that node-gyp bootstrap spawns. The bootstrap puts its scratch project under `$XDG_CACHE_HOME/aube/tools/node-gyp/v12/`. Inside any HOME with a `pnpm-workspace.yaml` at the top — which is exactly the shape every bats test creates (HOME=$TEST_TEMP_DIR with the test workspace yaml at the root) — the recursive `aube install` walked out of `tool_dir`, discovered the outer workspace yaml, and tried to install the *outer* workspace while the outer process held its project lock. Two-process deadlock; both wedged forever. This was the actual cause of the post-#217 GHA bats hang on shards 1+2 ubuntu and both macOS shards (any test that triggered a node-gyp bootstrap inside a workspace got stuck). It only reproduced under `--jobs >= 2` because GHA runners hit the lifecycle-script path more densely under parallelism, and never reproduced locally because most dev machines have node-gyp on PATH (bootstrap short-circuits). Fix: write an empty `aube-workspace.yaml` in `tool_dir` alongside the synthetic `package.json`. `find_workspace_root`'s first-marker check now returns `tool_dir` immediately and the recursive install runs as a single-package install (`workspace_packages` is empty so `has_workspace` stays false). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Greptile SummaryThis PR writes an empty Confidence Score: 5/5Safe to merge — the single-line fix is minimal, well-reasoned, and confirmed correct against the actual The empty sentinel file is correctly handled by No files require special attention.
|
| Filename | Overview |
|---|---|
| crates/aube/src/commands/install/node_gyp_bootstrap.rs | Writes an empty aube-workspace.yaml sentinel in tool_dir before the recursive aube install, stopping find_workspace_root's walk at the tool directory and preventing the two-process deadlock in parallel bats tests. |
Reviews (1): Last reviewed commit: "cli(install): break workspace walk-up ou..." | Re-trigger Greptile
#217 was squash-merged with a debug version of `mise-tasks/ci/bats/nonserial` (per-file `for` loop with 90s timeout, breaks on first failure, no junit reporting). The actual deadlock that the debug loop was trying to surface — node-gyp bootstrap recursing into the outer test workspace — was fixed in #220. Restore the original `bats --jobs N` parallel invocation with junit reporting now that the underlying deadlock is gone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ck (#219) ## Summary #217 was squash-merged with a debug version of `mise-tasks/ci/bats/nonserial` — a per-file `for` loop with a 90s timeout, breaks-on-first-failure, no junit reporting. Main has been running on that debug loop ever since. #220 fixed the actual root cause of the bats parallel deadlock (node-gyp bootstrap recursing back into the outer test workspace via #217's walk-up). With that fix on main, the debug script is no longer needed. This PR restores the original `bats --jobs N` parallel invocation with junit reporting. ## Test plan - [x] CI green proves it: shards 1+2 ubuntu and both macOS shards previously hung indefinitely with `--jobs >= 2`. With #220 merged, the deadlock is gone, so this PR's parallel invocation should land bats wall-time back at ~2 min/shard. 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Changes CI test execution from a per-file, timed loop back to a single parallel `bats` run, which may reintroduce hangs or alter failure/reporting behavior in CI. > > **Overview** > Replaces the temporary debug per-file/timeout loop in `mise-tasks/ci/bats/nonserial` with a single `bats` invocation over the shard’s file list. > > The new run restores JUnit output (`--report-formatter junit`, `--output`, `BATS_REPORT_FILENAME`) and uses `--jobs`/`--parallel-binary-name` for parallel execution while still filtering out `serial` tests. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit cf4f390. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY --> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Found the actual root cause of the post-#217 bats parallel deadlock: #217's workspace-root walk-up in
install::runinteracted catastrophically with the recursiveaube installthatnode-gypbootstrap spawns.The deadlock
node_gyp_bootstrap::bootstrap_blockingputs its scratch project at$XDG_CACHE_HOME/aube/tools/node-gyp/v12/and spawns a recursiveaube installthere.HOME=$TEST_TEMP_DIRandXDG_CACHE_HOME=$TEST_TEMP_DIR/.cache. If the test has apnpm-workspace.yamlat$TEST_TEMP_DIR/(very common — workspace tests,allowBuilds, lifecycle tests, etc.), then:aube installstarts attool_dir, walks up viafind_workspace_root(added in cli: anchor aube install at workspace root from member subdir #217), and finds the outer test'spnpm-workspace.yaml.$TEST_TEMP_DIR/node_modules/— the lock the parentaube installis already holding.Why it only reproduced in GHA
--jobs >= 2because multiple parallel tests concurrently trigger postinstall hooks.node-gypalready on PATH, soensure()short-circuits and the recursive install never runs.Fix
Write an empty
aube-workspace.yamlintool_diralongside the syntheticpackage.json.find_workspace_root's first-ancestor check finds the marker attool_dirand returns immediately — the walk no longer escapes.workspace_packagesstays empty sohas_workspaceis false and the install runs as a single-package install.Test plan
cargo clippy --all-targets -- -D warnings— clean--jobsparallelism in ci(bats): restore parallel jobs after #220 fixed the bootstrap deadlock #219)Follow-up
Close #219 (was a CI-only workaround); the parallel-hang it papered over is fixed here at the root.
🤖 Generated with Claude Code
Note
Low Risk
Low risk: changes only the node-gyp bootstrap scratch directory setup by adding a local workspace marker to avoid workspace root walk-up and related lock deadlocks during recursive installs.
Overview
Prevents the recursive
aube installused fornode-gypbootstrapping from walking up into an outer workspace (e.g., test fixtures withpnpm-workspace.yaml) and deadlocking on the parent install’s project lock.This does so by writing an empty
aube-workspace.yamlinto the node-gyp tool cache directory before spawning the recursive install, ensuring workspace-root discovery stops attool_dirand runs as a single-package install.Reviewed by Cursor Bugbot for commit 93b9e1f. Bugbot is set up for automated code reviews on this repo. Configure here.