Skip to content

ci(bats): restore parallel jobs after #220 fixed the bootstrap deadlock#219

Merged
jdx merged 1 commit intomainfrom
ci/restore-bats-parallel-after-debug
Apr 22, 2026
Merged

ci(bats): restore parallel jobs after #220 fixed the bootstrap deadlock#219
jdx merged 1 commit intomainfrom
ci/restore-bats-parallel-after-debug

Conversation

@jdx
Copy link
Copy Markdown
Contributor

@jdx jdx commented Apr 22, 2026

Summary

#217 was squash-merged with a debug version of mise-tasks/ci/bats/nonserial — a per-file for loop with a 90s timeout, breaks-on-first-failure, no junit reporting. Main has been running on that debug loop ever since.

#220 fixed the actual root cause of the bats parallel deadlock (node-gyp bootstrap recursing back into the outer test workspace via #217's walk-up). With that fix on main, the debug script is no longer needed.

This PR restores the original bats --jobs N parallel invocation with junit reporting.

Test plan

🤖 Generated with Claude Code


Note

Medium Risk
Changes CI test execution from a per-file, timed loop back to a single parallel bats run, which may reintroduce hangs or alter failure/reporting behavior in CI.

Overview
Replaces the temporary debug per-file/timeout loop in mise-tasks/ci/bats/nonserial with a single bats invocation over the shard’s file list.

The new run restores JUnit output (--report-formatter junit, --output, BATS_REPORT_FILENAME) and uses --jobs/--parallel-binary-name for parallel execution while still filtering out serial tests.

Reviewed by Cursor Bugbot for commit cf4f390. Bugbot is set up for automated code reviews on this repo. Configure here.

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Apr 22, 2026

Greptile Summary

This PR removes the debug per-file for loop from mise-tasks/ci/bats/nonserial and replaces it with a single bats --jobs N invocation with JUnit reporting — intended to restore a clean runner while capping parallelism at --jobs=1 to avoid the GHA deadlock identified in #217.

  • --jobs=1 not applied: The PR title and description say it forces --jobs=1, but jobs is computed as the full CPU count (nproc / hw.ncpu) and passed directly to --jobs \"$jobs\". On a 4-core ubuntu-latest runner this will deadlock exactly as described, making the fix ineffective.

Confidence Score: 2/5

Not safe to merge — the stated deadlock fix (--jobs=1) is missing from the code; CI will hang the same way as before.

There is exactly one changed file and exactly one purpose: cap parallelism to unblock CI. That cap is not applied, so the primary objective of the PR is unmet.

mise-tasks/ci/bats/nonserial — needs jobs=1 (or equivalent) before the bats invocation.

Important Files Changed

Filename Overview
mise-tasks/ci/bats/nonserial Restores bats invocation with junit reporting, but the --jobs=1 cap stated in the PR title/description is not implemented — jobs is still set to nproc, meaning the deadlock fix is not actually applied.

Comments Outside Diff (1)

  1. mise-tasks/ci/bats/nonserial, line 4-8 (link)

    P1 --jobs=1 cap not implemented — will reproduce the hang

    The PR title says "force --jobs=1" and the description explicitly states --jobs >= 2 reproduces the deadlock on GHA runners. But jobs is still computed as the full CPU count (likely 4 on ubuntu-latest) and passed unmodified to --jobs "$jobs" on line 31. The script will deadlock the same way as before this PR.

    Fix in Claude Code

Fix All in Claude Code

Reviews (5): Last reviewed commit: "ci(bats): clean up debug-loop in nonseri..." | Re-trigger Greptile

@jdx jdx enabled auto-merge (squash) April 22, 2026 22:46
@jdx jdx force-pushed the ci/restore-bats-parallel-after-debug branch from 429dbd5 to e8cef96 Compare April 22, 2026 23:01
Comment thread mise-tasks/ci/bats/nonserial Outdated
Comment thread mise-tasks/ci/bats/nonserial Outdated
@jdx jdx disabled auto-merge April 22, 2026 23:05
@jdx jdx changed the title ci(bats): restore parallel jobs after debug script merged into #217 ci(bats): clean up nonserial script + force --jobs=1 (post-#217 deadlock workaround) Apr 22, 2026
--output "$report_dir" \
--jobs "$jobs" \
--filter-tags "!serial" \
"${test_files[@]}" || status=$?
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing --no-parallelize-within-files flag described in PR fix

Medium Severity

The PR description calls --no-parallelize-within-files "the actual fix for the hang" and says it should mirror the local mise run test:bats task, which includes this flag at mise-tasks/test/bats line 28. The flag is absent from the bats invocation here. While harmless with jobs=1, if jobs is ever corrected to dynamic CPU count (as the PR intends), the missing flag would reintroduce the deadlock the PR aims to fix.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 4656ca1. Configure here.

#217 was squash-merged with a debug version of `mise-tasks/ci/bats/nonserial`
(per-file `for` loop with 90s timeout, breaks on first failure, no
junit reporting). The actual deadlock that the debug loop was trying
to surface — node-gyp bootstrap recursing into the outer test
workspace — was fixed in #220. Restore the original `bats --jobs N`
parallel invocation with junit reporting now that the underlying
deadlock is gone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jdx jdx force-pushed the ci/restore-bats-parallel-after-debug branch from 4656ca1 to cf4f390 Compare April 22, 2026 23:27
@jdx jdx changed the title ci(bats): clean up nonserial script + force --jobs=1 (post-#217 deadlock workaround) ci(bats): restore parallel jobs after #220 fixed the bootstrap deadlock Apr 22, 2026
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit cf4f390. Configure here.

fi
echo "::endgroup::"
done
BATS_REPORT_FILENAME="$report_name" ./test/bats/bin/bats \
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--jobs not forced to 1, deadlock workaround ineffective

High Severity

The entire purpose of this PR is to force --jobs=1 to work around a deadlock at --jobs >= 2, but jobs is still set to $(nproc) / $(sysctl -n hw.ncpu) (the CPU core count) and is never overridden to 1. The deadlock workaround described in the PR title and description is not actually applied.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit cf4f390. Configure here.

@jdx jdx merged commit 6e3410d into main Apr 22, 2026
16 checks passed
@jdx jdx deleted the ci/restore-bats-parallel-after-debug branch April 22, 2026 23:33
pull Bot pushed a commit to rozsazoltan-forks/aube that referenced this pull request Apr 23, 2026
endevco#220)

## Summary

Found the actual root cause of the post-endevco#217 bats parallel deadlock:
**endevco#217's workspace-root walk-up in `install::run` interacted
catastrophically with the recursive `aube install` that `node-gyp`
bootstrap spawns.**

### The deadlock

1.
[`node_gyp_bootstrap::bootstrap_blocking`](crates/aube/src/commands/install/node_gyp_bootstrap.rs:152)
puts its scratch project at `$XDG_CACHE_HOME/aube/tools/node-gyp/v12/`
and spawns a recursive `aube install` there.
2. Under bats, `HOME=$TEST_TEMP_DIR` and
`XDG_CACHE_HOME=$TEST_TEMP_DIR/.cache`. If the test has a
`pnpm-workspace.yaml` at `$TEST_TEMP_DIR/` (very common — workspace
tests, `allowBuilds`, lifecycle tests, etc.), then:
3. The recursive `aube install` starts at `tool_dir`, walks up via
`find_workspace_root` (added in endevco#217), and finds the outer test's
`pnpm-workspace.yaml`.
4. The recursive process tries to install the outer workspace and blocks
on the outer project lock at `$TEST_TEMP_DIR/node_modules/` — the lock
the parent `aube install` is already holding.
5. **Two-process deadlock. Both wedge forever.**

### Why it only reproduced in GHA

- GHA runners hit the lifecycle-script path more densely under `--jobs
>= 2` because multiple parallel tests concurrently trigger postinstall
hooks.
- Dev machines usually have `node-gyp` already on PATH, so
[`ensure()`](crates/aube/src/commands/install/node_gyp_bootstrap.rs:88)
short-circuits and the recursive install never runs.

### Fix

Write an empty `aube-workspace.yaml` in `tool_dir` alongside the
synthetic `package.json`. `find_workspace_root`'s first-ancestor check
finds the marker at `tool_dir` and returns immediately — the walk no
longer escapes. `workspace_packages` stays empty so `has_workspace` is
false and the install runs as a single-package install.

## Test plan

- [x] `cargo clippy --all-targets -- -D warnings` — clean
- [x] Manual reasoning: the marker kills the walk at its first ancestor,
so the recursive install can no longer target the outer workspace,
breaking the lock-cycle
- [ ] CI green on shards that previously hung (this PR restores `--jobs`
parallelism in endevco#219)

## Follow-up

Close endevco#219 (was a CI-only workaround); the parallel-hang it papered over
is fixed here at the root.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Low risk: changes only the node-gyp bootstrap scratch directory setup
by adding a local workspace marker to avoid workspace root walk-up and
related lock deadlocks during recursive installs.
> 
> **Overview**
> Prevents the recursive `aube install` used for `node-gyp`
bootstrapping from walking up into an outer workspace (e.g., test
fixtures with `pnpm-workspace.yaml`) and deadlocking on the parent
install’s project lock.
> 
> This does so by writing an empty `aube-workspace.yaml` into the
node-gyp tool cache directory before spawning the recursive install,
ensuring workspace-root discovery stops at `tool_dir` and runs as a
single-package install.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
93b9e1f. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant