Restructure setup phases to improve failure handling and fix bugs by DavidMiserak · Pull Request #1883 · garrytan/gstack

DavidMiserak · 2026-06-05T15:47:21Z

Summary

Restructures setup into four phases ordered by failure probability, so a bad network day or missing sudo never leaves users with zero skills installed
Makes Playwright/Chromium failures non-fatal — was exit 1, now a named warning with $_PW_FAIL_REASON; skills, hooks, and migrations are already in place before Phase D runs
Fixes two bugs found in code review

Phase structure

Phase	What runs	Fail risk
A	Filesystem ops: `~/.gstack/`, welcome message, GBrain detection	Always succeeds
B	Local compute: binary build, skill linking, migrations, GBrain regen	Low
C	`settings.json` mutation: team/plan-tune hooks	Medium, fully reversible
D	Network/sudo/package managers: coreutils, emoji font, Playwright	Highest — warn not exit

Previously, a Playwright install failure (network issue, bad mirror, missing sudo) would exit 1 mid-run, leaving skills unregistered. Now Phase D is best-effort; re-running ./setup retries only that phase.

Bug fixes

Windows non-ASCII bun path (lines 587, 792, 824): link_codex_skill_dirs, link_factory_skill_dirs, and link_opencode_skill_dirs were calling bun run directly instead of bun_cmd, bypassing the Windows non-ASCII path workaround. Silent failures masked by subsequent directory-existence checks.
GBrain regen exit code masked (line 1200): bun_cmd run gen:skill-docs:user ... | tail -3 caused tail's exit 0 to swallow bun's failure, so the || log "warning..." clause never fired. Removed the pipe.

Test plan

Run ./setup on a clean install — confirm skills link, hooks install, and Playwright failure (if any) prints a warning instead of aborting
Run ./setup --host codex on Windows with non-ASCII username — confirm Codex skill docs generate
Run ./setup with GBrain installed — confirm warning fires if gen:skill-docs:user fails

Restructures setup into four phases ordered by failure probability so a bad network day or missing sudo never leaves users with zero skills: Phase A — trivial filesystem ops (always succeed) Phase B — local compute: binary build, skill linking, migrations Phase C — settings.json mutation (reversible) Phase D — network/sudo/package managers (best-effort, warn not exit) Playwright/Chromium failures are now non-fatal warnings with a named $_PW_FAIL_REASON. Skills, hooks, and migrations are already installed before Phase D runs; re-running ./setup retries only the network step. Also fixes two bugs found in code review: - Use bun_cmd wrapper (not bare bun) in Codex/Factory/OpenCode skill generation helpers — fixes silent failures on Windows with non-ASCII bun paths - Remove | tail -3 pipe in GBrain regen so bun exit code is preserved and the error warning actually fires on failure

trunk-io · 2026-06-05T15:47:24Z

Merging to main in this repository is managed by Trunk.

To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

jbetala7 · 2026-06-07T10:49:38Z

Thanks for digging into the failure-handling here — the two bug fixes are real and worth landing.

Bug fixes (correct, please split out):

The bun run → bun_cmd run change in link_codex/factory/opencode_skill_dirs (origin/main setup:714, 919, 951) is a genuine bug: bun_cmd is the only path that honors the non-ASCII Windows BUN_CMD copy from prepare_bun_for_windows_compile, and the literal bun run bypasses it, so the regen silently fails on a non-ASCII Windows username (masked by the following dir-existence check).
Removing | tail -3 from the gen:skill-docs:user call (origin/main setup:1297) is also correct — the pipe makes the subshell exit status tail's, so the || log "warning…" guard right after it can never fire and bun failures get swallowed.

These two are tiny and independently testable — a focused PR landing just them (with a test that actually exercises the exit-code path, not only a source-string grep) would merge fast.

The phase restructure is hard to land as-is. It moves ~200 lines and renumbers nearly every step, which (a) can't be reviewed as a behavior-preserving move because it also folds in real behavior changes — Playwright exit 1 → best-effort warn, and the GBrain detect/regen split — and (b) structurally conflicts with every other open setup PR. In particular the Playwright "warn don't exit" change overlaps directly with #1838 (pins the Chromium install to the bundled Playwright version, closes #1829), which is already narrowly scoped and issue-linked for that same block.

Suggested split: (1) the two bug fixes, (2) the Playwright warn-don't-exit change as its own issue-linked PR coordinated with #1838, (3) the phase reorder last, against fresh main, with an issue describing the failure modes it fixes. Happy to re-review the bug-fix PR quickly.

DavidMiserak · 2026-06-07T17:59:20Z

Thanks @jbetala7 — agreed on the split. Carved into three PRs as you
suggested:

v1.57.10.0 fix(setup): route gen:skill-docs through bun_cmd; drop tail -3 pipe that masks bun failures #1898 — the two bug fixes (bun_cmd routing in
link_{codex,factory,opencode}_skill_dirs + dropping | tail -3
from gen:skill-docs:user). Four-line setup diff plus 5 tests that
exercise the exit-code path (stubs bun_cmd to fail, runs the actual
shell pattern, asserts the || log warning fires only on the fixed
shape and does NOT fire on the buggy ... | tail -3 shape). No
source-string grep on its own.
v1.57.4.0 fix(setup): Playwright Chromium failure becomes best-effort warn, not exit 1 #1900 — Playwright exit 1 → best-effort warn with a
$_PW_FAIL_REASON accumulator. Includes the coordination note with
fix(setup): pin Playwright browser install to bundled version so headed browse connect works (#1829) #1838 you mentioned; happy to rebase whichever lands second.
v1.57.4.0 refactor(setup): group steps into four phases ordered by failure probability #1903 — the phase reorder, against fresh main, closing setup: Playwright Chromium download failure aborts setup before skills/hooks/migrations land #1902
which describes the failure modes (skills/hooks/migrations not
landing when a Playwright download fails before them). The diff
intentionally does NOT include the bug fixes from v1.57.10.0 fix(setup): route gen:skill-docs through bun_cmd; drop tail -3 pipe that masks bun failures #1898 or the
Playwright behavior change from v1.57.4.0 fix(setup): Playwright Chromium failure becomes best-effort warn, not exit 1 #1900 — verified by grep so a
reviewer can read it as "a move + the gbrain detect/regen split that
the move requires + one safety guard for the split."

Closing this PR — superseded by the three above.

Recommended merge order: #1898 first (tiny, independent), then
#1838 + #1900 (either order), then #1903 last so it rebases on the
others rather than the others rebasing on it.

@jbetala7

…hat masks bun failures Two surgical fixes called out by @jbetala7 on garrytan#1883, carved out as the focused fast-merge PR they asked for. 1. link_codex_skill_dirs, link_factory_skill_dirs, and link_opencode_skill_dirs called literal `bun run gen:skill-docs ...` instead of routing through `bun_cmd`, the wrapper installed by prepare_bun_for_windows_compile. On a non-ASCII Windows username the literal call would exit non-zero; the next line checks if the output directory exists and prints a soft warning, masking the failure. 2. The gbrain-detected SKILL.md regen ran `bun_cmd run gen:skill-docs:user --host claude 2>&1 | tail -3`. The pipe makes the subshell's exit status `tail`'s (always 0), swallowing bun's failure so the `|| log "warning..."` clause right after never fires regardless of whether bun succeeded. Tests exercise the exit-code path, not source-grep: stub bun_cmd to fail, run the actual shell pattern, assert the `|| log` clause fires only on the fixed shape and does NOT fire on the buggy shape with the pipe.

@jbetala7

… exit 1 Replaces both `exit 1` paths in the Playwright install/verify block with a $_PW_FAIL_REASON accumulator that prints a named warning to stderr and returns control. Previously a Chromium download failure (network blip, locked-down corporate machine, missing Node.js on Windows) would abort ./setup mid-run with no skills registered, no hooks installed, no migrations applied — and a re-run would start over from the top. Now the failure only disables browser-driven features (/qa, /design-review, make-pdf, sidebar, /pair-agent); skills, hooks, and migrations land normally. The warning names which sub-step failed (chromium-install / windows-no-node / windows-node-modules / post-install-launch) so users and bug reports can be specific, and tells the user to re-run ./setup to retry just this step. Tests (5) exercise the actual block from setup with stubbed ensure_playwright_browser and bunx, simulate install + post-install failures, and assert the script exits 0 with the named warning. One side-by-side test inlines the OLD bug shape and confirms it returned non-zero, so the exit-code difference is proven, not just asserted. Coordinates with garrytan#1838 (pins Playwright install version): both touch the same code block, so whichever lands first needs the other to rebase. The changes are complementary — garrytan#1838 fixes which Chromium build gets installed; this PR fixes what happens when the install fails. Carved out from garrytan#1883 per @jbetala7's review request.

@jbetala7

…ability setup used to interleave trivial filesystem ops, expensive compute, settings.json mutations, and network/sudo steps in whatever order the script grew. A Playwright Chromium download failing at step 2 would abort before skills got registered at step 4, hooks landed at step 10, or migrations ran at step 8. This commit groups work by where failures actually come from: Phase A — filesystem (always succeeds): ~/.gstack/, welcome, GBrain detection Phase B — local compute (low fail): bun build, skill linking, migrations, regen Phase C — settings.json mutation (medium fail, reversible): team-mode + plan-tune hooks Phase D — external network/sudo/package managers (highest fail): coreutils, font, Playwright Each phase is bracketed with a header comment; steps within are renumbered (A1, A2, B1, B2a...). The biggest behavior shift is the GBrain detect/regen split that the new order requires: detection moves to A3 (filesystem scan only, no dependencies), regen moves to B6 (runs after the binary is built and skill docs are generated). Detection persists to ~/.gstack/gbrain-detection.json so B6 can read it. A safety guard for that split: when gstack-gbrain-detect exits non-zero, A3 removes both the temp file AND any pre-existing detection file from a prior successful run. Without the guard, a stale "ok" detection file would survive a failed detect and silently apply ~250 tokens of overhead per planning skill to a user whose gbrain may no longer be installed. Behavior preservation: - bun_cmd routing in link_{codex,factory,opencode}_skill_dirs is unchanged (still buggy here). The fix is in garrytan#1898. - gen:skill-docs:user still pipes through `tail -3` in B6 (still buggy here). The fix is in garrytan#1898. - Playwright Chromium install/verify in D3 still exits 1 on failure (unchanged). The warn-don't-exit conversion is in garrytan#1900 and coordinates with garrytan#1838. Splitting those keeps this PR reviewable as "a move plus the gbrain split that the move requires," per @jbetala7's request on garrytan#1883.

@jbetala7

…hat masks bun failures Two surgical fixes called out by @jbetala7 on garrytan#1883, carved out as the focused fast-merge PR they asked for. 1. link_codex_skill_dirs, link_factory_skill_dirs, and link_opencode_skill_dirs called literal `bun run gen:skill-docs ...` instead of routing through `bun_cmd`, the wrapper installed by prepare_bun_for_windows_compile. On a non-ASCII Windows username the literal call would exit non-zero; the next line checks if the output directory exists and prints a soft warning, masking the failure. 2. The gbrain-detected SKILL.md regen ran `bun_cmd run gen:skill-docs:user --host claude 2>&1 | tail -3`. The pipe makes the subshell's exit status `tail`'s (always 0), swallowing bun's failure so the `|| log "warning..."` clause right after never fires regardless of whether bun succeeded. Tests exercise the exit-code path, not source-grep: stub bun_cmd to fail, run the actual shell pattern, assert the `|| log` clause fires only on the fixed shape and does NOT fire on the buggy shape with the pipe.

@jbetala7

… exit 1 Replaces both `exit 1` paths in the Playwright install/verify block with a $_PW_FAIL_REASON accumulator that prints a named warning to stderr and returns control. Previously a Chromium download failure (network blip, locked-down corporate machine, missing Node.js on Windows) would abort ./setup mid-run with no skills registered, no hooks installed, no migrations applied — and a re-run would start over from the top. Now the failure only disables browser-driven features (/qa, /design-review, make-pdf, sidebar, /pair-agent); skills, hooks, and migrations land normally. The warning names which sub-step failed (chromium-install / windows-no-node / windows-node-modules / post-install-launch) so users and bug reports can be specific, and tells the user to re-run ./setup to retry just this step. Tests (5) exercise the actual block from setup with stubbed ensure_playwright_browser and bunx, simulate install + post-install failures, and assert the script exits 0 with the named warning. One side-by-side test inlines the OLD bug shape and confirms it returned non-zero, so the exit-code difference is proven, not just asserted. Coordinates with garrytan#1838 (pins Playwright install version): both touch the same code block, so whichever lands first needs the other to rebase. The changes are complementary — garrytan#1838 fixes which Chromium build gets installed; this PR fixes what happens when the install fails. Carved out from garrytan#1883 per @jbetala7's review request.

@jbetala7

… exit 1 Replaces both `exit 1` paths in the Playwright install/verify block with a $_PW_FAIL_REASON accumulator that prints a named warning to stderr and returns control. Previously a Chromium download failure (network blip, locked-down corporate machine, missing Node.js on Windows) would abort ./setup mid-run with no skills registered, no hooks installed, no migrations applied — and a re-run would start over from the top. Now the failure only disables browser-driven features (/qa, /design-review, make-pdf, sidebar, /pair-agent); skills, hooks, and migrations land normally. The warning names which sub-step failed (chromium-install / windows-no-node / windows-node-modules / post-install-launch) so users and bug reports can be specific, and tells the user to re-run ./setup to retry just this step. Tests (5) exercise the actual block from setup with stubbed ensure_playwright_browser and bunx, simulate install + post-install failures, and assert the script exits 0 with the named warning. One side-by-side test inlines the OLD bug shape and confirms it returned non-zero, so the exit-code difference is proven, not just asserted. Coordinates with garrytan#1838 (pins Playwright install version): both touch the same code block, so whichever lands first needs the other to rebase. The changes are complementary — garrytan#1838 fixes which Chromium build gets installed; this PR fixes what happens when the install fails. Carved out from garrytan#1883 per @jbetala7's review request.

@jbetala7

…hat masks bun failures Two surgical fixes called out by @jbetala7 on garrytan#1883, carved out as the focused fast-merge PR they asked for. 1. link_codex_skill_dirs, link_factory_skill_dirs, and link_opencode_skill_dirs called literal `bun run gen:skill-docs ...` instead of routing through `bun_cmd`, the wrapper installed by prepare_bun_for_windows_compile. On a non-ASCII Windows username the literal call would exit non-zero; the next line checks if the output directory exists and prints a soft warning, masking the failure. 2. The gbrain-detected SKILL.md regen ran `bun_cmd run gen:skill-docs:user --host claude 2>&1 | tail -3`. The pipe makes the subshell's exit status `tail`'s (always 0), swallowing bun's failure so the `|| log "warning..."` clause right after never fires regardless of whether bun succeeded. Tests exercise the exit-code path, not source-grep: stub bun_cmd to fail, run the actual shell pattern, assert the `|| log` clause fires only on the fixed shape and does NOT fire on the buggy shape with the pipe.

@jbetala7

…hat masks bun failures Two surgical fixes called out by @jbetala7 on garrytan#1883, carved out as the focused fast-merge PR they asked for. 1. link_codex_skill_dirs, link_factory_skill_dirs, and link_opencode_skill_dirs called literal `bun run gen:skill-docs ...` instead of routing through `bun_cmd`, the wrapper installed by prepare_bun_for_windows_compile. On a non-ASCII Windows username the literal call would exit non-zero; the next line checks if the output directory exists and prints a soft warning, masking the failure. 2. The gbrain-detected SKILL.md regen ran `bun_cmd run gen:skill-docs:user --host claude 2>&1 | tail -3`. The pipe makes the subshell's exit status `tail`'s (always 0), swallowing bun's failure so the `|| log "warning..."` clause right after never fires regardless of whether bun succeeded. Tests exercise the exit-code path, not source-grep: stub bun_cmd to fail, run the actual shell pattern, assert the `|| log` clause fires only on the fixed shape and does NOT fire on the buggy shape with the pipe.

@jbetala7

…hat masks bun failures Two surgical fixes called out by @jbetala7 on garrytan#1883, carved out as the focused fast-merge PR they asked for. 1. link_codex_skill_dirs, link_factory_skill_dirs, and link_opencode_skill_dirs called literal `bun run gen:skill-docs ...` instead of routing through `bun_cmd`, the wrapper installed by prepare_bun_for_windows_compile. On a non-ASCII Windows username the literal call would exit non-zero; the next line checks if the output directory exists and prints a soft warning, masking the failure. 2. The gbrain-detected SKILL.md regen ran `bun_cmd run gen:skill-docs:user --host claude 2>&1 | tail -3`. The pipe makes the subshell's exit status `tail`'s (always 0), swallowing bun's failure so the `|| log "warning..."` clause right after never fires regardless of whether bun succeeded. Tests exercise the exit-code path, not source-grep: stub bun_cmd to fail, run the actual shell pattern, assert the `|| log` clause fires only on the fixed shape and does NOT fire on the buggy shape with the pipe.

@jbetala7

…hat masks bun failures Two surgical fixes called out by @jbetala7 on garrytan#1883, carved out as the focused fast-merge PR they asked for. 1. link_codex_skill_dirs, link_factory_skill_dirs, and link_opencode_skill_dirs called literal `bun run gen:skill-docs ...` instead of routing through `bun_cmd`, the wrapper installed by prepare_bun_for_windows_compile. On a non-ASCII Windows username the literal call would exit non-zero; the next line checks if the output directory exists and prints a soft warning, masking the failure. 2. The gbrain-detected SKILL.md regen ran `bun_cmd run gen:skill-docs:user --host claude 2>&1 | tail -3`. The pipe makes the subshell's exit status `tail`'s (always 0), swallowing bun's failure so the `|| log "warning..."` clause right after never fires regardless of whether bun succeeded. Tests exercise the exit-code path, not source-grep: stub bun_cmd to fail, run the actual shell pattern, assert the `|| log` clause fires only on the fixed shape and does NOT fire on the buggy shape with the pipe.

@jbetala7

…hat masks bun failures Two surgical fixes called out by @jbetala7 on garrytan#1883, carved out as the focused fast-merge PR they asked for. 1. link_codex_skill_dirs, link_factory_skill_dirs, and link_opencode_skill_dirs called literal `bun run gen:skill-docs ...` instead of routing through `bun_cmd`, the wrapper installed by prepare_bun_for_windows_compile. On a non-ASCII Windows username the literal call would exit non-zero; the next line checks if the output directory exists and prints a soft warning, masking the failure. 2. The gbrain-detected SKILL.md regen ran `bun_cmd run gen:skill-docs:user --host claude 2>&1 | tail -3`. The pipe makes the subshell's exit status `tail`'s (always 0), swallowing bun's failure so the `|| log "warning..."` clause right after never fires regardless of whether bun succeeded. Tests exercise the exit-code path, not source-grep: stub bun_cmd to fail, run the actual shell pattern, assert the `|| log` clause fires only on the fixed shape and does NOT fire on the buggy shape with the pipe.

DavidMiserak and others added 2 commits June 5, 2026 11:36

fix setup review comments

befd3a4

DavidMiserak closed this Jun 7, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Restructure setup phases to improve failure handling and fix bugs#1883

Restructure setup phases to improve failure handling and fix bugs#1883
DavidMiserak wants to merge 2 commits into
garrytan:mainfrom
DavidMiserak:fix/reduce-risk

DavidMiserak commented Jun 5, 2026

Uh oh!

trunk-io Bot commented Jun 5, 2026

Uh oh!

jbetala7 commented Jun 7, 2026

Uh oh!

DavidMiserak commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

DavidMiserak commented Jun 5, 2026

Summary

Phase structure

Bug fixes

Test plan

Uh oh!

trunk-io Bot commented Jun 5, 2026

Uh oh!

jbetala7 commented Jun 7, 2026

Uh oh!

DavidMiserak commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants