Restructure setup phases to improve failure handling and fix bugs#1883
Restructure setup phases to improve failure handling and fix bugs#1883DavidMiserak wants to merge 2 commits into
Conversation
Restructures setup into four phases ordered by failure probability so a bad network day or missing sudo never leaves users with zero skills: Phase A — trivial filesystem ops (always succeed) Phase B — local compute: binary build, skill linking, migrations Phase C — settings.json mutation (reversible) Phase D — network/sudo/package managers (best-effort, warn not exit) Playwright/Chromium failures are now non-fatal warnings with a named $_PW_FAIL_REASON. Skills, hooks, and migrations are already installed before Phase D runs; re-running ./setup retries only the network step. Also fixes two bugs found in code review: - Use bun_cmd wrapper (not bare bun) in Codex/Factory/OpenCode skill generation helpers — fixes silent failures on Windows with non-ASCII bun paths - Remove | tail -3 pipe in GBrain regen so bun exit code is preserved and the error warning actually fires on failure
|
Merging to
After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here |
|
Thanks for digging into the failure-handling here — the two bug fixes are real and worth landing. Bug fixes (correct, please split out):
These two are tiny and independently testable — a focused PR landing just them (with a test that actually exercises the exit-code path, not only a source-string grep) would merge fast. The phase restructure is hard to land as-is. It moves ~200 lines and renumbers nearly every step, which (a) can't be reviewed as a behavior-preserving move because it also folds in real behavior changes — Playwright Suggested split: (1) the two bug fixes, (2) the Playwright warn-don't-exit change as its own issue-linked PR coordinated with #1838, (3) the phase reorder last, against fresh |
|
Thanks @jbetala7 — agreed on the split. Carved into three PRs as you
Closing this PR — superseded by the three above. Recommended merge order: #1898 first (tiny, independent), then |
…hat masks bun failures Two surgical fixes called out by @jbetala7 on garrytan#1883, carved out as the focused fast-merge PR they asked for. 1. link_codex_skill_dirs, link_factory_skill_dirs, and link_opencode_skill_dirs called literal `bun run gen:skill-docs ...` instead of routing through `bun_cmd`, the wrapper installed by prepare_bun_for_windows_compile. On a non-ASCII Windows username the literal call would exit non-zero; the next line checks if the output directory exists and prints a soft warning, masking the failure. 2. The gbrain-detected SKILL.md regen ran `bun_cmd run gen:skill-docs:user --host claude 2>&1 | tail -3`. The pipe makes the subshell's exit status `tail`'s (always 0), swallowing bun's failure so the `|| log "warning..."` clause right after never fires regardless of whether bun succeeded. Tests exercise the exit-code path, not source-grep: stub bun_cmd to fail, run the actual shell pattern, assert the `|| log` clause fires only on the fixed shape and does NOT fire on the buggy shape with the pipe.
… exit 1 Replaces both `exit 1` paths in the Playwright install/verify block with a $_PW_FAIL_REASON accumulator that prints a named warning to stderr and returns control. Previously a Chromium download failure (network blip, locked-down corporate machine, missing Node.js on Windows) would abort ./setup mid-run with no skills registered, no hooks installed, no migrations applied — and a re-run would start over from the top. Now the failure only disables browser-driven features (/qa, /design-review, make-pdf, sidebar, /pair-agent); skills, hooks, and migrations land normally. The warning names which sub-step failed (chromium-install / windows-no-node / windows-node-modules / post-install-launch) so users and bug reports can be specific, and tells the user to re-run ./setup to retry just this step. Tests (5) exercise the actual block from setup with stubbed ensure_playwright_browser and bunx, simulate install + post-install failures, and assert the script exits 0 with the named warning. One side-by-side test inlines the OLD bug shape and confirms it returned non-zero, so the exit-code difference is proven, not just asserted. Coordinates with garrytan#1838 (pins Playwright install version): both touch the same code block, so whichever lands first needs the other to rebase. The changes are complementary — garrytan#1838 fixes which Chromium build gets installed; this PR fixes what happens when the install fails. Carved out from garrytan#1883 per @jbetala7's review request.
…ability
setup used to interleave trivial filesystem ops, expensive compute,
settings.json mutations, and network/sudo steps in whatever order the
script grew. A Playwright Chromium download failing at step 2 would
abort before skills got registered at step 4, hooks landed at step 10,
or migrations ran at step 8.
This commit groups work by where failures actually come from:
Phase A — filesystem (always succeeds): ~/.gstack/, welcome, GBrain detection
Phase B — local compute (low fail): bun build, skill linking, migrations, regen
Phase C — settings.json mutation (medium fail, reversible): team-mode + plan-tune hooks
Phase D — external network/sudo/package managers (highest fail): coreutils, font, Playwright
Each phase is bracketed with a header comment; steps within are
renumbered (A1, A2, B1, B2a...).
The biggest behavior shift is the GBrain detect/regen split that the
new order requires: detection moves to A3 (filesystem scan only, no
dependencies), regen moves to B6 (runs after the binary is built and
skill docs are generated). Detection persists to
~/.gstack/gbrain-detection.json so B6 can read it.
A safety guard for that split: when gstack-gbrain-detect exits non-zero,
A3 removes both the temp file AND any pre-existing detection file from a
prior successful run. Without the guard, a stale "ok" detection file
would survive a failed detect and silently apply ~250 tokens of overhead
per planning skill to a user whose gbrain may no longer be installed.
Behavior preservation:
- bun_cmd routing in link_{codex,factory,opencode}_skill_dirs is
unchanged (still buggy here). The fix is in garrytan#1898.
- gen:skill-docs:user still pipes through `tail -3` in B6 (still buggy
here). The fix is in garrytan#1898.
- Playwright Chromium install/verify in D3 still exits 1 on failure
(unchanged). The warn-don't-exit conversion is in garrytan#1900 and
coordinates with garrytan#1838.
Splitting those keeps this PR reviewable as "a move plus the gbrain
split that the move requires," per @jbetala7's request on garrytan#1883.
…hat masks bun failures Two surgical fixes called out by @jbetala7 on garrytan#1883, carved out as the focused fast-merge PR they asked for. 1. link_codex_skill_dirs, link_factory_skill_dirs, and link_opencode_skill_dirs called literal `bun run gen:skill-docs ...` instead of routing through `bun_cmd`, the wrapper installed by prepare_bun_for_windows_compile. On a non-ASCII Windows username the literal call would exit non-zero; the next line checks if the output directory exists and prints a soft warning, masking the failure. 2. The gbrain-detected SKILL.md regen ran `bun_cmd run gen:skill-docs:user --host claude 2>&1 | tail -3`. The pipe makes the subshell's exit status `tail`'s (always 0), swallowing bun's failure so the `|| log "warning..."` clause right after never fires regardless of whether bun succeeded. Tests exercise the exit-code path, not source-grep: stub bun_cmd to fail, run the actual shell pattern, assert the `|| log` clause fires only on the fixed shape and does NOT fire on the buggy shape with the pipe.
… exit 1 Replaces both `exit 1` paths in the Playwright install/verify block with a $_PW_FAIL_REASON accumulator that prints a named warning to stderr and returns control. Previously a Chromium download failure (network blip, locked-down corporate machine, missing Node.js on Windows) would abort ./setup mid-run with no skills registered, no hooks installed, no migrations applied — and a re-run would start over from the top. Now the failure only disables browser-driven features (/qa, /design-review, make-pdf, sidebar, /pair-agent); skills, hooks, and migrations land normally. The warning names which sub-step failed (chromium-install / windows-no-node / windows-node-modules / post-install-launch) so users and bug reports can be specific, and tells the user to re-run ./setup to retry just this step. Tests (5) exercise the actual block from setup with stubbed ensure_playwright_browser and bunx, simulate install + post-install failures, and assert the script exits 0 with the named warning. One side-by-side test inlines the OLD bug shape and confirms it returned non-zero, so the exit-code difference is proven, not just asserted. Coordinates with garrytan#1838 (pins Playwright install version): both touch the same code block, so whichever lands first needs the other to rebase. The changes are complementary — garrytan#1838 fixes which Chromium build gets installed; this PR fixes what happens when the install fails. Carved out from garrytan#1883 per @jbetala7's review request.
… exit 1 Replaces both `exit 1` paths in the Playwright install/verify block with a $_PW_FAIL_REASON accumulator that prints a named warning to stderr and returns control. Previously a Chromium download failure (network blip, locked-down corporate machine, missing Node.js on Windows) would abort ./setup mid-run with no skills registered, no hooks installed, no migrations applied — and a re-run would start over from the top. Now the failure only disables browser-driven features (/qa, /design-review, make-pdf, sidebar, /pair-agent); skills, hooks, and migrations land normally. The warning names which sub-step failed (chromium-install / windows-no-node / windows-node-modules / post-install-launch) so users and bug reports can be specific, and tells the user to re-run ./setup to retry just this step. Tests (5) exercise the actual block from setup with stubbed ensure_playwright_browser and bunx, simulate install + post-install failures, and assert the script exits 0 with the named warning. One side-by-side test inlines the OLD bug shape and confirms it returned non-zero, so the exit-code difference is proven, not just asserted. Coordinates with garrytan#1838 (pins Playwright install version): both touch the same code block, so whichever lands first needs the other to rebase. The changes are complementary — garrytan#1838 fixes which Chromium build gets installed; this PR fixes what happens when the install fails. Carved out from garrytan#1883 per @jbetala7's review request.
…hat masks bun failures Two surgical fixes called out by @jbetala7 on garrytan#1883, carved out as the focused fast-merge PR they asked for. 1. link_codex_skill_dirs, link_factory_skill_dirs, and link_opencode_skill_dirs called literal `bun run gen:skill-docs ...` instead of routing through `bun_cmd`, the wrapper installed by prepare_bun_for_windows_compile. On a non-ASCII Windows username the literal call would exit non-zero; the next line checks if the output directory exists and prints a soft warning, masking the failure. 2. The gbrain-detected SKILL.md regen ran `bun_cmd run gen:skill-docs:user --host claude 2>&1 | tail -3`. The pipe makes the subshell's exit status `tail`'s (always 0), swallowing bun's failure so the `|| log "warning..."` clause right after never fires regardless of whether bun succeeded. Tests exercise the exit-code path, not source-grep: stub bun_cmd to fail, run the actual shell pattern, assert the `|| log` clause fires only on the fixed shape and does NOT fire on the buggy shape with the pipe.
…hat masks bun failures Two surgical fixes called out by @jbetala7 on garrytan#1883, carved out as the focused fast-merge PR they asked for. 1. link_codex_skill_dirs, link_factory_skill_dirs, and link_opencode_skill_dirs called literal `bun run gen:skill-docs ...` instead of routing through `bun_cmd`, the wrapper installed by prepare_bun_for_windows_compile. On a non-ASCII Windows username the literal call would exit non-zero; the next line checks if the output directory exists and prints a soft warning, masking the failure. 2. The gbrain-detected SKILL.md regen ran `bun_cmd run gen:skill-docs:user --host claude 2>&1 | tail -3`. The pipe makes the subshell's exit status `tail`'s (always 0), swallowing bun's failure so the `|| log "warning..."` clause right after never fires regardless of whether bun succeeded. Tests exercise the exit-code path, not source-grep: stub bun_cmd to fail, run the actual shell pattern, assert the `|| log` clause fires only on the fixed shape and does NOT fire on the buggy shape with the pipe.
…hat masks bun failures Two surgical fixes called out by @jbetala7 on garrytan#1883, carved out as the focused fast-merge PR they asked for. 1. link_codex_skill_dirs, link_factory_skill_dirs, and link_opencode_skill_dirs called literal `bun run gen:skill-docs ...` instead of routing through `bun_cmd`, the wrapper installed by prepare_bun_for_windows_compile. On a non-ASCII Windows username the literal call would exit non-zero; the next line checks if the output directory exists and prints a soft warning, masking the failure. 2. The gbrain-detected SKILL.md regen ran `bun_cmd run gen:skill-docs:user --host claude 2>&1 | tail -3`. The pipe makes the subshell's exit status `tail`'s (always 0), swallowing bun's failure so the `|| log "warning..."` clause right after never fires regardless of whether bun succeeded. Tests exercise the exit-code path, not source-grep: stub bun_cmd to fail, run the actual shell pattern, assert the `|| log` clause fires only on the fixed shape and does NOT fire on the buggy shape with the pipe.
…hat masks bun failures Two surgical fixes called out by @jbetala7 on garrytan#1883, carved out as the focused fast-merge PR they asked for. 1. link_codex_skill_dirs, link_factory_skill_dirs, and link_opencode_skill_dirs called literal `bun run gen:skill-docs ...` instead of routing through `bun_cmd`, the wrapper installed by prepare_bun_for_windows_compile. On a non-ASCII Windows username the literal call would exit non-zero; the next line checks if the output directory exists and prints a soft warning, masking the failure. 2. The gbrain-detected SKILL.md regen ran `bun_cmd run gen:skill-docs:user --host claude 2>&1 | tail -3`. The pipe makes the subshell's exit status `tail`'s (always 0), swallowing bun's failure so the `|| log "warning..."` clause right after never fires regardless of whether bun succeeded. Tests exercise the exit-code path, not source-grep: stub bun_cmd to fail, run the actual shell pattern, assert the `|| log` clause fires only on the fixed shape and does NOT fire on the buggy shape with the pipe.
…hat masks bun failures Two surgical fixes called out by @jbetala7 on garrytan#1883, carved out as the focused fast-merge PR they asked for. 1. link_codex_skill_dirs, link_factory_skill_dirs, and link_opencode_skill_dirs called literal `bun run gen:skill-docs ...` instead of routing through `bun_cmd`, the wrapper installed by prepare_bun_for_windows_compile. On a non-ASCII Windows username the literal call would exit non-zero; the next line checks if the output directory exists and prints a soft warning, masking the failure. 2. The gbrain-detected SKILL.md regen ran `bun_cmd run gen:skill-docs:user --host claude 2>&1 | tail -3`. The pipe makes the subshell's exit status `tail`'s (always 0), swallowing bun's failure so the `|| log "warning..."` clause right after never fires regardless of whether bun succeeded. Tests exercise the exit-code path, not source-grep: stub bun_cmd to fail, run the actual shell pattern, assert the `|| log` clause fires only on the fixed shape and does NOT fire on the buggy shape with the pipe.
Summary
setupinto four phases ordered by failure probability, so a bad network day or missingsudonever leaves users with zero skills installedexit 1, now a named warning with$_PW_FAIL_REASON; skills, hooks, and migrations are already in place before Phase D runsPhase structure
~/.gstack/, welcome message, GBrain detectionsettings.jsonmutation: team/plan-tune hooksPreviously, a Playwright install failure (network issue, bad mirror, missing sudo) would
exit 1mid-run, leaving skills unregistered. Now Phase D is best-effort; re-running./setupretries only that phase.Bug fixes
link_codex_skill_dirs,link_factory_skill_dirs, andlink_opencode_skill_dirswere callingbun rundirectly instead ofbun_cmd, bypassing the Windows non-ASCII path workaround. Silent failures masked by subsequent directory-existence checks.bun_cmd run gen:skill-docs:user ... | tail -3causedtail's exit 0 to swallow bun's failure, so the|| log "warning..."clause never fired. Removed the pipe.Test plan
./setupon a clean install — confirm skills link, hooks install, and Playwright failure (if any) prints a warning instead of aborting./setup --host codexon Windows with non-ASCII username — confirm Codex skill docs generate./setupwith GBrain installed — confirm warning fires ifgen:skill-docs:userfails