Skip to content

cli(install): fix node-gyp bootstrap walkup causing bats parallel hang#220

Merged
jdx merged 1 commit intomainfrom
fix/node-gyp-bootstrap-walkup
Apr 22, 2026
Merged

cli(install): fix node-gyp bootstrap walkup causing bats parallel hang#220
jdx merged 1 commit intomainfrom
fix/node-gyp-bootstrap-walkup

Conversation

@jdx
Copy link
Copy Markdown
Contributor

@jdx jdx commented Apr 22, 2026

Summary

Found the actual root cause of the post-#217 bats parallel deadlock: #217's workspace-root walk-up in install::run interacted catastrophically with the recursive aube install that node-gyp bootstrap spawns.

The deadlock

  1. node_gyp_bootstrap::bootstrap_blocking puts its scratch project at $XDG_CACHE_HOME/aube/tools/node-gyp/v12/ and spawns a recursive aube install there.
  2. Under bats, HOME=$TEST_TEMP_DIR and XDG_CACHE_HOME=$TEST_TEMP_DIR/.cache. If the test has a pnpm-workspace.yaml at $TEST_TEMP_DIR/ (very common — workspace tests, allowBuilds, lifecycle tests, etc.), then:
  3. The recursive aube install starts at tool_dir, walks up via find_workspace_root (added in cli: anchor aube install at workspace root from member subdir #217), and finds the outer test's pnpm-workspace.yaml.
  4. The recursive process tries to install the outer workspace and blocks on the outer project lock at $TEST_TEMP_DIR/node_modules/ — the lock the parent aube install is already holding.
  5. Two-process deadlock. Both wedge forever.

Why it only reproduced in GHA

  • GHA runners hit the lifecycle-script path more densely under --jobs >= 2 because multiple parallel tests concurrently trigger postinstall hooks.
  • Dev machines usually have node-gyp already on PATH, so ensure() short-circuits and the recursive install never runs.

Fix

Write an empty aube-workspace.yaml in tool_dir alongside the synthetic package.json. find_workspace_root's first-ancestor check finds the marker at tool_dir and returns immediately — the walk no longer escapes. workspace_packages stays empty so has_workspace is false and the install runs as a single-package install.

Test plan

Follow-up

Close #219 (was a CI-only workaround); the parallel-hang it papered over is fixed here at the root.

🤖 Generated with Claude Code


Note

Low Risk
Low risk: changes only the node-gyp bootstrap scratch directory setup by adding a local workspace marker to avoid workspace root walk-up and related lock deadlocks during recursive installs.

Overview
Prevents the recursive aube install used for node-gyp bootstrapping from walking up into an outer workspace (e.g., test fixtures with pnpm-workspace.yaml) and deadlocking on the parent install’s project lock.

This does so by writing an empty aube-workspace.yaml into the node-gyp tool cache directory before spawning the recursive install, ensuring workspace-root discovery stops at tool_dir and runs as a single-package install.

Reviewed by Cursor Bugbot for commit 93b9e1f. Bugbot is set up for automated code reviews on this repo. Configure here.

#217's workspace-root walk-up in install::run interacted catastrophically
with the recursive `aube install` that node-gyp bootstrap spawns. The
bootstrap puts its scratch project under
`$XDG_CACHE_HOME/aube/tools/node-gyp/v12/`. Inside any HOME with a
`pnpm-workspace.yaml` at the top — which is exactly the shape every
bats test creates (HOME=$TEST_TEMP_DIR with the test workspace yaml at
the root) — the recursive `aube install` walked out of `tool_dir`,
discovered the outer workspace yaml, and tried to install the *outer*
workspace while the outer process held its project lock. Two-process
deadlock; both wedged forever.

This was the actual cause of the post-#217 GHA bats hang on shards 1+2
ubuntu and both macOS shards (any test that triggered a node-gyp
bootstrap inside a workspace got stuck). It only reproduced under
`--jobs >= 2` because GHA runners hit the lifecycle-script path more
densely under parallelism, and never reproduced locally because most
dev machines have node-gyp on PATH (bootstrap short-circuits).

Fix: write an empty `aube-workspace.yaml` in `tool_dir` alongside the
synthetic `package.json`. `find_workspace_root`'s first-marker check
now returns `tool_dir` immediately and the recursive install runs as a
single-package install (`workspace_packages` is empty so
`has_workspace` stays false).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Apr 22, 2026

Greptile Summary

This PR writes an empty aube-workspace.yaml sentinel into the node-gyp bootstrap tool_dir before spawning the recursive aube install. Because find_workspace_root checks start itself first via Path::ancestors(), the walk stops immediately at tool_dir instead of climbing into the outer test's temp workspace and deadlocking on its project lock. WorkspaceConfig::load already handles an empty yaml file by returning Self::default() (empty packages list), so the recursive install correctly runs as a single-package install with has_workspace = false.

Confidence Score: 5/5

Safe to merge — the single-line fix is minimal, well-reasoned, and confirmed correct against the actual WorkspaceConfig::load and find_workspace_root implementations.

The empty sentinel file is correctly handled by WorkspaceConfig::load (returns Self::default() for empty content), find_workspace_root stops at tool_dir on its first iteration since Path::ancestors() includes start, and has_workspace stays false because workspace_packages is empty. No P0/P1 findings.

No files require special attention.

Important Files Changed

Filename Overview
crates/aube/src/commands/install/node_gyp_bootstrap.rs Writes an empty aube-workspace.yaml sentinel in tool_dir before the recursive aube install, stopping find_workspace_root's walk at the tool directory and preventing the two-process deadlock in parallel bats tests.

Reviews (1): Last reviewed commit: "cli(install): break workspace walk-up ou..." | Re-trigger Greptile

@jdx jdx merged commit 55fc657 into main Apr 22, 2026
16 checks passed
@jdx jdx deleted the fix/node-gyp-bootstrap-walkup branch April 22, 2026 23:24
jdx added a commit that referenced this pull request Apr 22, 2026
#217 was squash-merged with a debug version of `mise-tasks/ci/bats/nonserial`
(per-file `for` loop with 90s timeout, breaks on first failure, no
junit reporting). The actual deadlock that the debug loop was trying
to surface — node-gyp bootstrap recursing into the outer test
workspace — was fixed in #220. Restore the original `bats --jobs N`
parallel invocation with junit reporting now that the underlying
deadlock is gone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jdx added a commit that referenced this pull request Apr 22, 2026
…ck (#219)

## Summary

#217 was squash-merged with a debug version of
`mise-tasks/ci/bats/nonserial` — a per-file `for` loop with a 90s
timeout, breaks-on-first-failure, no junit reporting. Main has been
running on that debug loop ever since.

#220 fixed the actual root cause of the bats parallel deadlock (node-gyp
bootstrap recursing back into the outer test workspace via #217's
walk-up). With that fix on main, the debug script is no longer needed.

This PR restores the original `bats --jobs N` parallel invocation with
junit reporting.

## Test plan

- [x] CI green proves it: shards 1+2 ubuntu and both macOS shards
previously hung indefinitely with `--jobs >= 2`. With #220 merged, the
deadlock is gone, so this PR's parallel invocation should land bats
wall-time back at ~2 min/shard.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Changes CI test execution from a per-file, timed loop back to a single
parallel `bats` run, which may reintroduce hangs or alter
failure/reporting behavior in CI.
> 
> **Overview**
> Replaces the temporary debug per-file/timeout loop in
`mise-tasks/ci/bats/nonserial` with a single `bats` invocation over the
shard’s file list.
> 
> The new run restores JUnit output (`--report-formatter junit`,
`--output`, `BATS_REPORT_FILENAME`) and uses
`--jobs`/`--parallel-binary-name` for parallel execution while still
filtering out `serial` tests.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
cf4f390. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant