
cli: anchor aube install at workspace root from member subdir #217

Merged

jdx merged 4 commits into main from claude/hardcore-lumiere-a87670 on Apr 22, 2026

Conversation

jdx (Contributor) commented Apr 22, 2026

Summary

Why

User-reported bug: installing from a workspace member re-downloaded dependencies even after a clean root install. The original report blamed #hash exotic dependencies, but that was a red herring: the truncated long-name directories (`<prefix>_<32hex>` from `dep_path_to_filename`) were simply what the reporter noticed inside a per-member `.aube/` store that should never have been created.

Test plan

  • mise run test:bats test/workspace_member_install_walks_up.bats — passes
  • mise run test:bats test/workspace.bats test/install.bats test/project_root_walk_up.bats — all green (no regressions in the surrounding install paths)
  • cargo clippy --all-targets -- -D warnings — clean

🤖 Generated with Claude Code


Note

Medium Risk
Changes how aube install chooses its project root in workspace scenarios, which can affect where lockfiles/state are written and what gets installed. Also temporarily alters CI BATS execution (timeouts/serial per-file), which may mask parallel-only failures.

Overview
aube install now resolves its working directory by preferring the workspace root (but only when that root contains a package.json) and falling back to the nearest package.json otherwise. This prevents workspace-member installs from creating their own aube-lock.yaml, .aube/ store, and .aube-state.
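A minimal sketch of that resolution order, with a hypothetical walk-up helper standing in for the `crate::dirs` functions visible in the diff (marker filenames are illustrative; the real logic may check several, e.g. `aube-workspace.yaml`):

```rust
use std::path::{Path, PathBuf};

// Hypothetical walk-up helper; the real find_workspace_root /
// find_project_root live in the aube crate and are not shown in this PR.
fn find_up(start: &Path, marker: &str) -> Option<PathBuf> {
    start
        .ancestors()
        .find(|dir| dir.join(marker).is_file())
        .map(Path::to_path_buf)
}

/// Resolution order described above: prefer the workspace root, but only
/// when it actually hosts a package.json; otherwise anchor at the nearest
/// package.json.
fn resolve_install_root(initial_cwd: &Path) -> Option<PathBuf> {
    find_up(initial_cwd, "pnpm-workspace.yaml") // stand-in for find_workspace_root
        .filter(|root| root.join("package.json").is_file())
        .or_else(|| find_up(initial_cwd, "package.json")) // find_project_root
}
```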

aube dlx explicitly sets InstallOptions.project_dir to the scratch temp project to avoid accidentally walking up into a surrounding workspace. Adds a serial BATS regression test for the workspace-member install behavior, and temporarily changes CI non-serial BATS to run per-file with a 90s timeout to surface hangs.

Reviewed by Cursor Bugbot for commit 4436eee.

greptile-apps (Bot) commented Apr 22, 2026

Greptile Summary

This PR fixes aube install to anchor at the workspace root (when it has a package.json) rather than the nearest package.json, preventing workspace members from getting their own private lockfile, virtual store, and re-downloaded dependencies. The dlx command is also patched to pin its scratch dir as project_dir, preventing the new root-discovery logic from accidentally resolving upward into an enclosing workspace.

  • mise-tasks/ci/bats/nonserial: The debug loop drops --report-formatter junit / --output but still calls the upload with the now-unwritten report path; if the upload tool errors on a missing file, the CI step will fail even on a clean run. The break on first failure also silently skips all remaining test files in the shard.

Confidence Score: 4/5

Core install fix and dlx patch are safe to merge; the temporary CI script change carries a real risk of spurious upload failures that should be addressed before landing.

The Rust changes are correct and well-tested. Score is 4 rather than 5 because the nonserial CI script has a P1 issue: the JUnit report file is never written but the upload is still attempted unconditionally, which can flip a passing CI run to a failure exit code.

mise-tasks/ci/bats/nonserial — upload call references a report file that is no longer generated by the debug loop

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/aube/src/commands/install/mod.rs | Prefers workspace root (with a package.json guard) over the nearest package.json; mirrors the ensure_installed logic and is well commented |
| crates/aube/src/commands/dlx.rs | Explicitly pins project_dir to the scratch temp path so the new workspace-root discovery cannot resolve to an enclosing workspace during dlx installs |
| mise-tasks/ci/bats/nonserial | Temporary debug loop drops JUnit report generation but still attempts the upload, risking spurious CI failures; also breaks on first failure, hiding subsequent test results |
| test/workspace_member_install_walks_up.bats | Regression test for member-dir install; asserts no standalone artifacts appear in the member and that root state is byte-identical after a member install |

Comments Outside Diff (1)

  1. mise-tasks/ci/bats/nonserial, lines 46-51

    P1 JUnit report never written — upload will fail even on a clean run

    The new loop removes --report-formatter junit and --output "$report_dir", so $report_path is never created. On line 47 the upload is still called with that path; if the upload tool returns non-zero for a missing file (a reasonable default), upload_status becomes non-zero and exit "$upload_status" fires — failing the CI step even when every test passed. The report generation flags need to be added back to each bats invocation (or the upload call should be guarded with a [[ -f "$report_path" ]] check until the debug loop is reverted).
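    A minimal sketch of the suggested guard, reusing the `$report_path`/`$upload_status` names from the comment (the upload command itself is a hypothetical stand-in):

    ```bash
    # Guard the upload on report existence until the debug loop is reverted.
    if [[ -f "$report_path" ]]; then
      upload_junit "$report_path"   # hypothetical stand-in for the real upload
      upload_status=$?
    else
      echo "no JUnit report at $report_path; skipping upload" >&2
      upload_status=0
    fi
    exit "$upload_status"
    ```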


Reviews (5). Last reviewed commit: "DEBUG: surface hanging bats file in CI"

Two outdated comment threads on test/workspace_member_install_walks_up.bats.
jdx force-pushed the claude/hardcore-lumiere-a87670 branch from 67852db to c4b89ad on April 22, 2026 21:28
jdx changed the title from "test(install): pin failing repro for workspace-member install" to "cli: anchor aube install at workspace root from member subdir" on Apr 22, 2026
```rust
// anchor logic in `ensure_installed`.
match crate::dirs::find_workspace_root(&initial_cwd)
    .or_else(|| crate::dirs::find_project_root(&initial_cwd))
{
```

Production fix accidentally included in test-only PR

Medium Severity

The PR title (test(install): pin failing repro …) and description explicitly state this is a test-only change, with the actual fix being "out of scope for this PR." The unchecked test-plan item confirms the test is supposed to fail right now. However, the diff includes the production code change in install/mod.rs that swaps find_project_root for find_workspace_root().or_else(|| find_project_root()) — which is the fix itself. With the fix present the new bats test will pass, contradicting the stated intent of landing a pinned-failing regression test first.


Reviewed by Cursor Bugbot for commit c4b89ad.

jdx and others added 2 commits April 22, 2026 16:35
…e root

`aube install` from inside a workspace member walks up to the *nearest*
package.json (the member's own) instead of the workspace root, so it
runs the full standalone pipeline: own lockfile, own .aube/ virtual
store, own .aube-state — and re-downloads anything not already in the
global cache.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same fix as the auto-install anchor change: prefer `find_workspace_root`
over `find_project_root` so `aube install` from inside a workspace
member resolves up to the workspace root instead of treating the member
as a standalone project. Without this, the member install runs the full
pipeline (own lockfile, own .aube/ store, own .aube-state) and
re-downloads anything not already in the global cache.

Also addresses PR review feedback: drop the dead pnpm-lock.yaml check
(aube never writes that file) and switch the state snapshot comparison
from `[` to `assert_equal` for readable diffs on regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
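An illustrative bats-assert version of the comparison described in that commit; the test name, paths, and helper loading are assumptions, not copied from the actual test file:

```bash
#!/usr/bin/env bats
load test_helper # assumed helper that loads bats-support/bats-assert

@test "member install leaves root state byte-identical" {
  state_before="$(cat "$ROOT/.aube-state")"
  run bash -c "cd '$ROOT/packages/member' && aube install"
  assert_success
  state_after="$(cat "$ROOT/.aube-state")"
  # assert_equal prints both values on failure, unlike a bare [ x = y ].
  assert_equal "$state_after" "$state_before"
}
```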
jdx force-pushed the claude/hardcore-lumiere-a87670 branch from c4b89ad to ba0451c on April 22, 2026 21:37
jdx enabled auto-merge (squash) on April 22, 2026 21:46
jdx and others added 2 commits April 22, 2026 17:14
`dlx::run` chdirs into a scratch tempdir and invokes `install::run`,
which now (after the workspace-root anchor change) walks upward looking
for a workspace yaml or workspaces field. The scratch dir lives under
TMPDIR — which inside a workspace project resolves to the user's
workspace root, so the dlx install would target the wrong tree
(corrupting the user's workspace state, or in CI parallel bats runs
hanging while contending for a tree it has no business installing).

Set `project_dir` explicitly to the scratch path so install::run uses
that as cwd unconditionally and skips the walk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
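A sketch of the shape of that dlx change; everything beyond the `project_dir` field and the `install::run` call is an assumption:

```rust
use std::path::{Path, PathBuf};

// Assumed shape of InstallOptions; only the project_dir field is attested
// in the PR discussion.
struct InstallOptions {
    /// When set, install::run uses this directory as-is and skips the
    /// workspace/project walk-up entirely.
    project_dir: Option<PathBuf>,
}

// Stand-in for crate::commands::install::run.
fn install_run(_opts: InstallOptions) { /* elided */ }

fn dlx_install(scratch_dir: &Path) {
    // Pin the scratch temp project: without this, the walk-up could escape
    // TMPDIR and resolve into an enclosing user workspace.
    install_run(InstallOptions {
        project_dir: Some(scratch_dir.to_path_buf()),
    });
}
```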
Temporary: run shard files sequentially with a 90s per-file timeout so
the bats job kills the hanging test instead of running until the
6-hour GHA limit. Will revert after the hang is identified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jdx merged commit 98bf0e3 into main on Apr 22, 2026 (16 checks passed)
jdx deleted the claude/hardcore-lumiere-a87670 branch on April 22, 2026 22:29
cursor (Bot) left a comment
Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit 4436eee.

```bash
rc=$?
echo "::endgroup::"
echo "BATS file failed or timed out: $tf (exit $rc)"
status=$rc
```

Exit code always zero after negated if condition

High Severity

`rc=$?` inside the then block of `if ! timeout ... bats ...; then` always captures 0, because `!` negates the pipeline's exit status before `$?` is read. This means `status` is never set to a nonzero value on a test failure, the log message always says (exit 0), and the script may exit successfully even when tests fail; CI could silently pass with broken tests, depending on whether the report upload also fails.
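One correct capture pattern, sketched against the quoted hunk (the 90s timeout and variable names follow the loop; everything else is assumed):

```bash
# Capture $? from the command itself rather than from a negated condition.
timeout 90 bats "$tf"
rc=$?                 # real exit status of `timeout ... bats`
echo "::endgroup::"
if [ "$rc" -ne 0 ]; then
  echo "BATS file failed or timed out: $tf (exit $rc)"
  status=$rc
fi
```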


Reviewed by Cursor Bugbot for commit 4436eee.

```bash
    break
  fi
  echo "::endgroup::"
done
```

Temporary debug CI code committed in unrelated PR

Medium Severity

The mise-tasks/ci/bats/nonserial changes are explicitly marked "DEBUG (TEMP)" and "Revert this change once the hang is identified." This removes parallel test execution, drops JUnit report generation (so the upload on line 47 targets a nonexistent file), and adds a break on first failure. None of this is mentioned in the PR description, which is solely about workspace-member install anchoring.


Reviewed by Cursor Bugbot for commit 4436eee.

jdx added a commit that referenced this pull request Apr 22, 2026
#217 was squash-merged with a DEBUG version of the bats nonserial script
that I had pushed to surface a hanging test by name. The debug loop
runs each file sequentially with a 90s per-file timeout — slow, fragile
(any legitimately-slow test trips the timeout), and breaks-on-first-
failure (so only one shard failure surfaces).

Restore the original parallel `bats --jobs N` invocation and add
`--no-parallelize-within-files`, mirroring the local `mise run test:bats`
task. The within-file flag is the actual fix for the hang #217 surfaced:
across-file parallelism alone fits comfortably on GHA runners; layering
within-file parallelism on top multiplies concurrent `aube install`
processes high enough to deadlock on FD limits / Verdaccio connections /
kernel I/O.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
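A sketch of the restored invocation this commit message describes; the job count, file list, and report directory variables are illustrative:

```bash
# Parallel across files, serial within each file; JUnit report for upload.
bats \
  --jobs "$jobs" \
  --no-parallelize-within-files \
  --report-formatter junit \
  --output "$report_dir" \
  "${test_files[@]}"
```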
jdx added a commit that referenced this pull request Apr 22, 2026
PR #217's workspace-root walk-up in install::run triggered a deadlock
in bats parallel shards 1+2 (and both macOS shards) under any
--jobs >= 2. Verified:

- --jobs=4 hangs indefinitely on shards 1+2 ubuntu and both macOS
- --jobs=2 hangs the same shards
- --jobs=1 passes (this commit)
- --no-parallelize-within-files alone does not help

Cannot reproduce locally on a 10-core M-series Mac with identical
flags, so this is GHA-runner-specific. Likely candidates: FD limits,
Verdaccio connection pool, or kernel I/O contention compounded by
the heavier install pipeline that the walk-up enables for tests
running from a workspace member.

Force --jobs=1 for nonserial shards as a fix-forward; across-shard
parallelism still works via the shard split, so wall-time stays
bounded by the slowest shard. Drop the unused --no-parallelize-
within-files and --parallel-binary-name flags now that --jobs is 1.

Follow-up: identify the deadlock with strace/lsof on a repro runner
and restore parallelism.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jdx added a commit that referenced this pull request Apr 22, 2026
#220)

## Summary

Found the actual root cause of the post-#217 bats parallel deadlock:
**#217's workspace-root walk-up in `install::run` interacted
catastrophically with the recursive `aube install` that `node-gyp`
bootstrap spawns.**

### The deadlock

1. [`node_gyp_bootstrap::bootstrap_blocking`](crates/aube/src/commands/install/node_gyp_bootstrap.rs:152) puts its scratch project at `$XDG_CACHE_HOME/aube/tools/node-gyp/v12/` and spawns a recursive `aube install` there.
2. Under bats, `HOME=$TEST_TEMP_DIR` and
`XDG_CACHE_HOME=$TEST_TEMP_DIR/.cache`. If the test has a
`pnpm-workspace.yaml` at `$TEST_TEMP_DIR/` (very common — workspace
tests, `allowBuilds`, lifecycle tests, etc.), then:
3. The recursive `aube install` starts at `tool_dir`, walks up via
`find_workspace_root` (added in #217), and finds the outer test's
`pnpm-workspace.yaml`.
4. The recursive process tries to install the outer workspace and blocks
on the outer project lock at `$TEST_TEMP_DIR/node_modules/` — the lock
the parent `aube install` is already holding.
5. **Two-process deadlock. Both wedge forever.**
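
A hypothetical two-process reduction of this cycle; the lock path, `fs2` locking, and command shape are illustrative, not aube's real implementation:

```rust
use std::fs::File;
use std::process::Command;
use fs2::FileExt; // advisory file locks; aube's real locking is not shown

fn parent_install(root: &std::path::Path) -> std::io::Result<()> {
    let lock = File::create(root.join("node_modules/.lock"))?;
    lock.lock_exclusive()?; // parent takes the outer project lock...
    // ...then a lifecycle hook bootstraps node-gyp, which spawns a recursive
    // `aube install`. The child walks up to the same workspace root, blocks
    // on the lock held above, and the parent blocks here waiting on it.
    Command::new("aube").arg("install").status()?;
    Ok(())
}
```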

### Why it only reproduced in GHA

- GHA runners hit the lifecycle-script path more densely under `--jobs
>= 2` because multiple parallel tests concurrently trigger postinstall
hooks.
- Dev machines usually have `node-gyp` already on PATH, so
[`ensure()`](crates/aube/src/commands/install/node_gyp_bootstrap.rs:88)
short-circuits and the recursive install never runs.

### Fix

Write an empty `aube-workspace.yaml` in `tool_dir` alongside the
synthetic `package.json`. `find_workspace_root`'s first-ancestor check
finds the marker at `tool_dir` and returns immediately — the walk no
longer escapes. `workspace_packages` stays empty so `has_workspace` is
false and the install runs as a single-package install.
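
A minimal sketch of the marker write described above, assuming it lands in the bootstrap's scratch-dir setup (error handling elided):

```rust
use std::fs;
use std::io;
use std::path::Path;

// An empty marker is enough: find_workspace_root stops at the first ancestor
// containing aube-workspace.yaml, so the walk never escapes tool_dir, and
// with no member globs the recursive install stays single-package.
fn pin_tool_dir(tool_dir: &Path) -> io::Result<()> {
    fs::write(tool_dir.join("aube-workspace.yaml"), "")
}
```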

## Test plan

- [x] `cargo clippy --all-targets -- -D warnings` — clean
- [x] Manual reasoning: the marker kills the walk at its first ancestor,
so the recursive install can no longer target the outer workspace,
breaking the lock-cycle
- [ ] CI green on shards that previously hung (this PR restores `--jobs`
parallelism in #219)

## Follow-up

Close #219 (was a CI-only workaround); the parallel-hang it papered over
is fixed here at the root.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Changes only the node-gyp bootstrap scratch-directory setup, adding a local workspace marker to avoid the workspace-root walk-up and the related lock deadlocks during recursive installs.
>
> **Overview**
> Prevents the recursive `aube install` used for `node-gyp` bootstrapping from walking up into an outer workspace (e.g., test fixtures with `pnpm-workspace.yaml`) and deadlocking on the parent install’s project lock.
>
> It does so by writing an empty `aube-workspace.yaml` into the node-gyp tool cache directory before spawning the recursive install, ensuring workspace-root discovery stops at `tool_dir` and the install runs as a single-package install.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 93b9e1f.</sup>
<!-- /CURSOR_SUMMARY -->

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jdx added a commit that referenced this pull request Apr 22, 2026
#217 was squash-merged with a debug version of `mise-tasks/ci/bats/nonserial`
(per-file `for` loop with 90s timeout, breaks on first failure, no
junit reporting). The actual deadlock that the debug loop was trying
to surface — node-gyp bootstrap recursing into the outer test
workspace — was fixed in #220. Restore the original `bats --jobs N`
parallel invocation with junit reporting now that the underlying
deadlock is gone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jdx added a commit that referenced this pull request Apr 22, 2026
…ck (#219)

## Summary

#217 was squash-merged with a debug version of
`mise-tasks/ci/bats/nonserial` — a per-file `for` loop with a 90s
timeout, breaks-on-first-failure, no junit reporting. Main has been
running on that debug loop ever since.

#220 fixed the actual root cause of the bats parallel deadlock (node-gyp
bootstrap recursing back into the outer test workspace via #217's
walk-up). With that fix on main, the debug script is no longer needed.

This PR restores the original `bats --jobs N` parallel invocation with
junit reporting.

## Test plan

- [x] CI green proves it: shards 1+2 ubuntu and both macOS shards
previously hung indefinitely with `--jobs >= 2`. With #220 merged, the
deadlock is gone, so this PR's parallel invocation should land bats
wall-time back at ~2 min/shard.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Changes CI test execution from a per-file, timed loop back to a single parallel `bats` run, which may reintroduce hangs or alter failure/reporting behavior in CI.
>
> **Overview**
> Replaces the temporary debug per-file/timeout loop in `mise-tasks/ci/bats/nonserial` with a single `bats` invocation over the shard’s file list.
>
> The new run restores JUnit output (`--report-formatter junit`, `--output`, `BATS_REPORT_FILENAME`) and uses `--jobs`/`--parallel-binary-name` for parallel execution while still filtering out `serial` tests.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit cf4f390.</sup>
<!-- /CURSOR_SUMMARY -->

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
