cli: anchor aube install at workspace root from member subdir#217
Conversation
Greptile SummaryThis PR fixes
Confidence Score: 4/5Core install fix and dlx patch are safe to merge; the temporary CI script change carries a real risk of spurious upload failures that should be addressed before landing. The Rust changes are correct and well-tested. Score is 4 rather than 5 because the nonserial CI script has a P1 issue: the JUnit report file is never written but the upload is still attempted unconditionally, which can flip a passing CI run to a failure exit code. mise-tasks/ci/bats/nonserial — upload call references a report file that is no longer generated by the debug loop
|
| Filename | Overview |
|---|---|
| crates/aube/src/commands/install/mod.rs | Prefers workspace root (with package.json guard) over nearest package.json; mirrors ensure_installed logic and is well-commented |
| crates/aube/src/commands/dlx.rs | Explicitly pins project_dir to scratch temp path to prevent new workspace-root discovery from resolving to an enclosing workspace during dlx installs |
| mise-tasks/ci/bats/nonserial | Temporary debug loop drops JUnit report generation but still attempts upload, risking spurious CI failures; also breaks on first failure hiding subsequent test results |
| test/workspace_member_install_walks_up.bats | Regression test for member-dir install; asserts no standalone artifacts appear in member and root state is byte-identical after member install |
Comments Outside Diff (1)
-
mise-tasks/ci/bats/nonserial, line 46-51 (link)JUnit report never written — upload will fail even on a clean run
The new loop removes
--report-formatter junitand--output "$report_dir", so$report_pathis never created. On line 47 the upload is still called with that path; if the upload tool returns non-zero for a missing file (a reasonable default),upload_statusbecomes non-zero andexit "$upload_status"fires — failing the CI step even when every test passed. The report generation flags need to be added back to eachbatsinvocation (or the upload call should be guarded with a[[ -f "$report_path" ]]check until the debug loop is reverted).
Reviews (5): Last reviewed commit: "DEBUG: surface hanging bats file in CI" | Re-trigger Greptile
67852db to
c4b89ad
Compare
| // anchor logic in `ensure_installed`. | ||
| match crate::dirs::find_workspace_root(&initial_cwd) | ||
| .or_else(|| crate::dirs::find_project_root(&initial_cwd)) | ||
| { |
There was a problem hiding this comment.
Production fix accidentally included in test-only PR
Medium Severity
The PR title (test(install): pin failing repro …) and description explicitly state this is a test-only change, with the actual fix being "out of scope for this PR." The unchecked test-plan item confirms the test is supposed to fail right now. However, the diff includes the production code change in install/mod.rs that swaps find_project_root for find_workspace_root().or_else(|| find_project_root()) — which is the fix itself. With the fix present the new bats test will pass, contradicting the stated intent of landing a pinned-failing regression test first.
Reviewed by Cursor Bugbot for commit c4b89ad. Configure here.
…e root `aube install` from inside a workspace member walks up to the *nearest* package.json (the member's own) instead of the workspace root, so it runs the full standalone pipeline: own lockfile, own .aube/ virtual store, own .aube-state — and re-downloads anything not already in the global cache. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same fix as the auto-install anchor change: prefer `find_workspace_root` over `find_project_root` so `aube install` from inside a workspace member resolves up to the workspace root instead of treating the member as a standalone project. Without this, the member install runs the full pipeline (own lockfile, own .aube/ store, own .aube-state) and re-downloads anything not already in the global cache. Also addresses PR review feedback: drop the dead pnpm-lock.yaml check (aube never writes that file) and switch the state snapshot comparison from `[` to `assert_equal` for readable diffs on regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
c4b89ad to
ba0451c
Compare
`dlx::run` chdirs into a scratch tempdir and invokes `install::run`, which now (after the workspace-root anchor change) walks upward looking for a workspace yaml or workspaces field. The scratch dir lives under TMPDIR — which inside a workspace project resolves to the user's workspace root, so the dlx install would target the wrong tree (corrupting the user's workspace state, or in CI parallel bats runs hanging while contending for a tree it has no business installing). Set `project_dir` explicitly to the scratch path so install::run uses that as cwd unconditionally and skips the walk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Temporary: run shard files sequentially with a 90s per-file timeout so the bats job kills the hanging test instead of running until the 6-hour GHA limit. Will revert after the hang is identified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 3 total unresolved issues (including 1 from previous review).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 4436eee. Configure here.
| rc=$? | ||
| echo "::endgroup::" | ||
| echo "BATS file failed or timed out: $tf (exit $rc)" | ||
| status=$rc |
There was a problem hiding this comment.
Exit code always zero after negated if condition
High Severity
rc=$? inside the then block of if ! timeout ... bats ...; then always captures 0 because ! negates the pipeline's exit status before the if evaluates it. This means status is never set to a nonzero value from a test failure, the log message always says (exit 0), and the script may exit successfully even when tests fail — CI could silently pass with broken tests depending on whether the report upload also fails.
Reviewed by Cursor Bugbot for commit 4436eee. Configure here.
| break | ||
| fi | ||
| echo "::endgroup::" | ||
| done |
There was a problem hiding this comment.
Temporary debug CI code committed in unrelated PR
Medium Severity
The mise-tasks/ci/bats/nonserial changes are explicitly marked DEBUG (TEMP) and Revert this change once the hang is identified. This removes parallel test execution, drops JUnit report generation (so the upload on line 47 targets a nonexistent file), and adds break on first failure — all unmentioned in the PR description which is solely about workspace member install anchoring.
Reviewed by Cursor Bugbot for commit 4436eee. Configure here.
#217 was squash-merged with a DEBUG version of the bats nonserial script that I had pushed to surface a hanging test by name. The debug loop runs each file sequentially with a 90s per-file timeout — slow, fragile (any legitimately-slow test trips the timeout), and breaks-on-first- failure (so only one shard failure surfaces). Restore the original parallel `bats --jobs N` invocation and add `--no-parallelize-within-files`, mirroring the local `mise run test:bats` task. The within-file flag is the actual fix for the hang #217 surfaced: across-file parallelism alone fits comfortably on GHA runners; layering within-file parallelism on top multiplies concurrent `aube install` processes high enough to deadlock on FD limits / Verdaccio connections / kernel I/O. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #217's workspace-root walk-up in install::run triggered a deadlock in bats parallel shards 1+2 (and both macOS shards) under any --jobs >= 2. Verified: - --jobs=4 hangs indefinitely on shards 1+2 ubuntu and both macOS - --jobs=2 hangs the same shards - --jobs=1 passes (this commit) - --no-parallelize-within-files alone does not help Cannot reproduce locally on a 10-core M-series Mac with identical flags, so this is GHA-runner-specific. Likely candidates: FD limits, Verdaccio connection pool, or kernel I/O contention compounded by the heavier install pipeline that the walk-up enables for tests running from a workspace member. Force --jobs=1 for nonserial shards as a fix-forward; across-shard parallelism still works via the shard split, so wall-time stays bounded by the slowest shard. Drop the unused --no-parallelize- within-files and --parallel-binary-name flags now that --jobs is 1. Follow-up: identify the deadlock with strace/lsof on a repro runner and restore parallelism. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#220) ## Summary Found the actual root cause of the post-#217 bats parallel deadlock: **#217's workspace-root walk-up in `install::run` interacted catastrophically with the recursive `aube install` that `node-gyp` bootstrap spawns.** ### The deadlock 1. [`node_gyp_bootstrap::bootstrap_blocking`](crates/aube/src/commands/install/node_gyp_bootstrap.rs:152) puts its scratch project at `$XDG_CACHE_HOME/aube/tools/node-gyp/v12/` and spawns a recursive `aube install` there. 2. Under bats, `HOME=$TEST_TEMP_DIR` and `XDG_CACHE_HOME=$TEST_TEMP_DIR/.cache`. If the test has a `pnpm-workspace.yaml` at `$TEST_TEMP_DIR/` (very common — workspace tests, `allowBuilds`, lifecycle tests, etc.), then: 3. The recursive `aube install` starts at `tool_dir`, walks up via `find_workspace_root` (added in #217), and finds the outer test's `pnpm-workspace.yaml`. 4. The recursive process tries to install the outer workspace and blocks on the outer project lock at `$TEST_TEMP_DIR/node_modules/` — the lock the parent `aube install` is already holding. 5. **Two-process deadlock. Both wedge forever.** ### Why it only reproduced in GHA - GHA runners hit the lifecycle-script path more densely under `--jobs >= 2` because multiple parallel tests concurrently trigger postinstall hooks. - Dev machines usually have `node-gyp` already on PATH, so [`ensure()`](crates/aube/src/commands/install/node_gyp_bootstrap.rs:88) short-circuits and the recursive install never runs. ### Fix Write an empty `aube-workspace.yaml` in `tool_dir` alongside the synthetic `package.json`. `find_workspace_root`'s first-ancestor check finds the marker at `tool_dir` and returns immediately — the walk no longer escapes. `workspace_packages` stays empty so `has_workspace` is false and the install runs as a single-package install. ## Test plan - [x] `cargo clippy --all-targets -- -D warnings` — clean - [x] Manual reasoning: the marker kills the walk at its first ancestor, so the recursive install can no longer target the outer workspace, breaking the lock-cycle - [ ] CI green on shards that previously hung (this PR restores `--jobs` parallelism in #219) ## Follow-up Close #219 (was a CI-only workaround); the parallel-hang it papered over is fixed here at the root. 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Low Risk** > Low risk: changes only the node-gyp bootstrap scratch directory setup by adding a local workspace marker to avoid workspace root walk-up and related lock deadlocks during recursive installs. > > **Overview** > Prevents the recursive `aube install` used for `node-gyp` bootstrapping from walking up into an outer workspace (e.g., test fixtures with `pnpm-workspace.yaml`) and deadlocking on the parent install’s project lock. > > This does so by writing an empty `aube-workspace.yaml` into the node-gyp tool cache directory before spawning the recursive install, ensuring workspace-root discovery stops at `tool_dir` and runs as a single-package install. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 93b9e1f. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY --> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#217 was squash-merged with a debug version of `mise-tasks/ci/bats/nonserial` (per-file `for` loop with 90s timeout, breaks on first failure, no junit reporting). The actual deadlock that the debug loop was trying to surface — node-gyp bootstrap recursing into the outer test workspace — was fixed in #220. Restore the original `bats --jobs N` parallel invocation with junit reporting now that the underlying deadlock is gone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ck (#219) ## Summary #217 was squash-merged with a debug version of `mise-tasks/ci/bats/nonserial` — a per-file `for` loop with a 90s timeout, breaks-on-first-failure, no junit reporting. Main has been running on that debug loop ever since. #220 fixed the actual root cause of the bats parallel deadlock (node-gyp bootstrap recursing back into the outer test workspace via #217's walk-up). With that fix on main, the debug script is no longer needed. This PR restores the original `bats --jobs N` parallel invocation with junit reporting. ## Test plan - [x] CI green proves it: shards 1+2 ubuntu and both macOS shards previously hung indefinitely with `--jobs >= 2`. With #220 merged, the deadlock is gone, so this PR's parallel invocation should land bats wall-time back at ~2 min/shard. 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > **Medium Risk** > Changes CI test execution from a per-file, timed loop back to a single parallel `bats` run, which may reintroduce hangs or alter failure/reporting behavior in CI. > > **Overview** > Replaces the temporary debug per-file/timeout loop in `mise-tasks/ci/bats/nonserial` with a single `bats` invocation over the shard’s file list. > > The new run restores JUnit output (`--report-formatter junit`, `--output`, `BATS_REPORT_FILENAME`) and uses `--jobs`/`--parallel-binary-name` for parallel execution while still filtering out `serial` tests. > > <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit cf4f390. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup> <!-- /CURSOR_SUMMARY --> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Summary
aube installfrom inside a workspace member walked up to the nearestpackage.json(the member's own) and ran the full standalone pipeline: ownaube-lock.yaml, ownnode_modules/.aube/virtual store, own.aube-state, plus re-downloading anything not already in the global cache.aube installcodepath atcrates/aube/src/commands/install/mod.rs:1715(cli: anchor auto-install freshness check at workspace root #215 only coveredensure_installedforaube run/exec/start).find_workspace_root, fall back tofind_project_root, then to the cwd.aube installfrompackages/a, and asserts no standalone artifacts appear inside the member.Why
User-reported bug: workspace member install was re-downloading dependencies after a clean root install. The original report blamed
#hash exotic dependencies, but that was a red herring — the truncated long-name dirs (<prefix>_<32hex>fromdep_path_to_filename) just happened to be what they noticed in the per-member.aube/that should never have been created.Test plan
mise run test:bats test/workspace_member_install_walks_up.bats— passesmise run test:bats test/workspace.bats test/install.bats test/project_root_walk_up.bats— all green (no regressions in the surrounding install paths)cargo clippy --all-targets -- -D warnings— clean🤖 Generated with Claude Code
Note
Medium Risk
Changes how
aube installchooses its project root in workspace scenarios, which can affect where lockfiles/state are written and what gets installed. Also temporarily alters CI BATS execution (timeouts/serial per-file), which may mask parallel-only failures.Overview
aube installnow resolves its working directory by preferring the workspace root (only if it contains apackage.json), falling back to the nearestpackage.json, preventing workspace-member installs from creating their ownaube-lock.yaml,.aube/store, and.aube-state.aube dlxexplicitly setsInstallOptions.project_dirto the scratch temp project to avoid accidentally walking up into a surrounding workspace. Adds a serial BATS regression test for the workspace-member install behavior, and temporarily changes CI non-serial BATS to run per-file with a 90s timeout to surface hangs.Reviewed by Cursor Bugbot for commit 4436eee. Bugbot is set up for automated code reviews on this repo. Configure here.