Skip to content

Restructure setup phases to improve failure handling and fix bugs#1883

Closed
DavidMiserak wants to merge 2 commits into
garrytan:mainfrom
DavidMiserak:fix/reduce-risk
Closed

Restructure setup phases to improve failure handling and fix bugs#1883
DavidMiserak wants to merge 2 commits into
garrytan:mainfrom
DavidMiserak:fix/reduce-risk

Conversation

@DavidMiserak

Copy link
Copy Markdown

Summary

  • Restructures setup into four phases ordered by failure probability, so a bad network day or missing sudo never leaves users with zero skills installed
  • Makes Playwright/Chromium failures non-fatal — was exit 1, now a named warning with $_PW_FAIL_REASON; skills, hooks, and migrations are already in place before Phase D runs
  • Fixes two bugs found in code review

Phase structure

Phase What runs Fail risk
A Filesystem ops: ~/.gstack/, welcome message, GBrain detection Always succeeds
B Local compute: binary build, skill linking, migrations, GBrain regen Low
C settings.json mutation: team/plan-tune hooks Medium, fully reversible
D Network/sudo/package managers: coreutils, emoji font, Playwright Highest — warn not exit

Previously, a Playwright install failure (network issue, bad mirror, missing sudo) would exit 1 mid-run, leaving skills unregistered. Now Phase D is best-effort; re-running ./setup retries only that phase.

Bug fixes

  • Windows non-ASCII bun path (lines 587, 792, 824): link_codex_skill_dirs, link_factory_skill_dirs, and link_opencode_skill_dirs were calling bun run directly instead of bun_cmd, bypassing the Windows non-ASCII path workaround. Silent failures masked by subsequent directory-existence checks.
  • GBrain regen exit code masked (line 1200): bun_cmd run gen:skill-docs:user ... | tail -3 caused tail's exit 0 to swallow bun's failure, so the || log "warning..." clause never fired. Removed the pipe.

Test plan

  • Run ./setup on a clean install — confirm skills link, hooks install, and Playwright failure (if any) prints a warning instead of aborting
  • Run ./setup --host codex on Windows with non-ASCII username — confirm Codex skill docs generate
  • Run ./setup with GBrain installed — confirm warning fires if gen:skill-docs:user fails

DavidMiserak and others added 2 commits June 5, 2026 11:36
Restructures setup into four phases ordered by failure probability so a
bad network day or missing sudo never leaves users with zero skills:

  Phase A — trivial filesystem ops (always succeed)
  Phase B — local compute: binary build, skill linking, migrations
  Phase C — settings.json mutation (reversible)
  Phase D — network/sudo/package managers (best-effort, warn not exit)

Playwright/Chromium failures are now non-fatal warnings with a named
$_PW_FAIL_REASON. Skills, hooks, and migrations are already installed
before Phase D runs; re-running ./setup retries only the network step.

Also fixes two bugs found in code review:
- Use bun_cmd wrapper (not bare bun) in Codex/Factory/OpenCode skill
  generation helpers — fixes silent failures on Windows with non-ASCII
  bun paths
- Remove | tail -3 pipe in GBrain regen so bun exit code is preserved
  and the error warning actually fires on failure
@trunk-io

trunk-io Bot commented Jun 5, 2026

Copy link
Copy Markdown

Merging to main in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

@jbetala7

jbetala7 commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

Thanks for digging into the failure-handling here — the two bug fixes are real and worth landing.

Bug fixes (correct, please split out):

  • The bun runbun_cmd run change in link_codex/factory/opencode_skill_dirs (origin/main setup:714, 919, 951) is a genuine bug: bun_cmd is the only path that honors the non-ASCII Windows BUN_CMD copy from prepare_bun_for_windows_compile, and the literal bun run bypasses it, so the regen silently fails on a non-ASCII Windows username (masked by the following dir-existence check).
  • Removing | tail -3 from the gen:skill-docs:user call (origin/main setup:1297) is also correct — the pipe makes the subshell exit status tail's, so the || log "warning…" guard right after it can never fire and bun failures get swallowed.

These two are tiny and independently testable — a focused PR landing just them (with a test that actually exercises the exit-code path, not only a source-string grep) would merge fast.

The phase restructure is hard to land as-is. It moves ~200 lines and renumbers nearly every step, which (a) can't be reviewed as a behavior-preserving move because it also folds in real behavior changes — Playwright exit 1 → best-effort warn, and the GBrain detect/regen split — and (b) structurally conflicts with every other open setup PR. In particular the Playwright "warn don't exit" change overlaps directly with #1838 (pins the Chromium install to the bundled Playwright version, closes #1829), which is already narrowly scoped and issue-linked for that same block.

Suggested split: (1) the two bug fixes, (2) the Playwright warn-don't-exit change as its own issue-linked PR coordinated with #1838, (3) the phase reorder last, against fresh main, with an issue describing the failure modes it fixes. Happy to re-review the bug-fix PR quickly.

@DavidMiserak

Copy link
Copy Markdown
Author

Thanks @jbetala7 — agreed on the split. Carved into three PRs as you
suggested:

  1. v1.57.10.0 fix(setup): route gen:skill-docs through bun_cmd; drop tail -3 pipe that masks bun failures #1898 — the two bug fixes (bun_cmd routing in
    link_{codex,factory,opencode}_skill_dirs + dropping | tail -3
    from gen:skill-docs:user). Four-line setup diff plus 5 tests that
    exercise the exit-code path (stubs bun_cmd to fail, runs the actual
    shell pattern, asserts the || log warning fires only on the fixed
    shape and does NOT fire on the buggy ... | tail -3 shape). No
    source-string grep on its own.

  2. v1.57.4.0 fix(setup): Playwright Chromium failure becomes best-effort warn, not exit 1 #1900 — Playwright exit 1 → best-effort warn with a
    $_PW_FAIL_REASON accumulator. Includes the coordination note with
    fix(setup): pin Playwright browser install to bundled version so headed browse connect works (#1829) #1838 you mentioned; happy to rebase whichever lands second.

  3. v1.57.4.0 refactor(setup): group steps into four phases ordered by failure probability #1903 — the phase reorder, against fresh main, closing setup: Playwright Chromium download failure aborts setup before skills/hooks/migrations land #1902
    which describes the failure modes (skills/hooks/migrations not
    landing when a Playwright download fails before them). The diff
    intentionally does NOT include the bug fixes from v1.57.10.0 fix(setup): route gen:skill-docs through bun_cmd; drop tail -3 pipe that masks bun failures #1898 or the
    Playwright behavior change from v1.57.4.0 fix(setup): Playwright Chromium failure becomes best-effort warn, not exit 1 #1900 — verified by grep so a
    reviewer can read it as "a move + the gbrain detect/regen split that
    the move requires + one safety guard for the split."

Closing this PR — superseded by the three above.

Recommended merge order: #1898 first (tiny, independent), then
#1838 + #1900 (either order), then #1903 last so it rebases on the
others rather than the others rebasing on it.

DavidMiserak added a commit to DavidMiserak/gstack that referenced this pull request Jun 8, 2026
…hat masks bun failures

Two surgical fixes called out by @jbetala7 on garrytan#1883, carved out as the
focused fast-merge PR they asked for.

1. link_codex_skill_dirs, link_factory_skill_dirs, and
   link_opencode_skill_dirs called literal `bun run gen:skill-docs ...`
   instead of routing through `bun_cmd`, the wrapper installed by
   prepare_bun_for_windows_compile. On a non-ASCII Windows username the
   literal call would exit non-zero; the next line checks if the output
   directory exists and prints a soft warning, masking the failure.

2. The gbrain-detected SKILL.md regen ran
   `bun_cmd run gen:skill-docs:user --host claude 2>&1 | tail -3`. The
   pipe makes the subshell's exit status `tail`'s (always 0), swallowing
   bun's failure so the `|| log "warning..."` clause right after never
   fires regardless of whether bun succeeded.

Tests exercise the exit-code path, not source-grep: stub bun_cmd to fail,
run the actual shell pattern, assert the `|| log` clause fires only on
the fixed shape and does NOT fire on the buggy shape with the pipe.
DavidMiserak added a commit to DavidMiserak/gstack that referenced this pull request Jun 8, 2026
… exit 1

Replaces both `exit 1` paths in the Playwright install/verify block with a
$_PW_FAIL_REASON accumulator that prints a named warning to stderr and
returns control. Previously a Chromium download failure (network blip,
locked-down corporate machine, missing Node.js on Windows) would abort
./setup mid-run with no skills registered, no hooks installed, no
migrations applied — and a re-run would start over from the top.

Now the failure only disables browser-driven features (/qa,
/design-review, make-pdf, sidebar, /pair-agent); skills, hooks, and
migrations land normally. The warning names which sub-step failed
(chromium-install / windows-no-node / windows-node-modules /
post-install-launch) so users and bug reports can be specific, and
tells the user to re-run ./setup to retry just this step.

Tests (5) exercise the actual block from setup with stubbed
ensure_playwright_browser and bunx, simulate install + post-install
failures, and assert the script exits 0 with the named warning. One
side-by-side test inlines the OLD bug shape and confirms it returned
non-zero, so the exit-code difference is proven, not just asserted.

Coordinates with garrytan#1838 (pins Playwright install version): both touch the
same code block, so whichever lands first needs the other to rebase. The
changes are complementary — garrytan#1838 fixes which Chromium build gets
installed; this PR fixes what happens when the install fails.

Carved out from garrytan#1883 per @jbetala7's review request.
DavidMiserak added a commit to DavidMiserak/gstack that referenced this pull request Jun 8, 2026
…ability

setup used to interleave trivial filesystem ops, expensive compute,
settings.json mutations, and network/sudo steps in whatever order the
script grew. A Playwright Chromium download failing at step 2 would
abort before skills got registered at step 4, hooks landed at step 10,
or migrations ran at step 8.

This commit groups work by where failures actually come from:

  Phase A — filesystem (always succeeds): ~/.gstack/, welcome, GBrain detection
  Phase B — local compute (low fail): bun build, skill linking, migrations, regen
  Phase C — settings.json mutation (medium fail, reversible): team-mode + plan-tune hooks
  Phase D — external network/sudo/package managers (highest fail): coreutils, font, Playwright

Each phase is bracketed with a header comment; steps within are
renumbered (A1, A2, B1, B2a...).

The biggest behavior shift is the GBrain detect/regen split that the
new order requires: detection moves to A3 (filesystem scan only, no
dependencies), regen moves to B6 (runs after the binary is built and
skill docs are generated). Detection persists to
~/.gstack/gbrain-detection.json so B6 can read it.

A safety guard for that split: when gstack-gbrain-detect exits non-zero,
A3 removes both the temp file AND any pre-existing detection file from a
prior successful run. Without the guard, a stale "ok" detection file
would survive a failed detect and silently apply ~250 tokens of overhead
per planning skill to a user whose gbrain may no longer be installed.

Behavior preservation:
- bun_cmd routing in link_{codex,factory,opencode}_skill_dirs is
  unchanged (still buggy here). The fix is in garrytan#1898.
- gen:skill-docs:user still pipes through `tail -3` in B6 (still buggy
  here). The fix is in garrytan#1898.
- Playwright Chromium install/verify in D3 still exits 1 on failure
  (unchanged). The warn-don't-exit conversion is in garrytan#1900 and
  coordinates with garrytan#1838.

Splitting those keeps this PR reviewable as "a move plus the gbrain
split that the move requires," per @jbetala7's request on garrytan#1883.
DavidMiserak added a commit to DavidMiserak/gstack that referenced this pull request Jun 8, 2026
…hat masks bun failures

Two surgical fixes called out by @jbetala7 on garrytan#1883, carved out as the
focused fast-merge PR they asked for.

1. link_codex_skill_dirs, link_factory_skill_dirs, and
   link_opencode_skill_dirs called literal `bun run gen:skill-docs ...`
   instead of routing through `bun_cmd`, the wrapper installed by
   prepare_bun_for_windows_compile. On a non-ASCII Windows username the
   literal call would exit non-zero; the next line checks if the output
   directory exists and prints a soft warning, masking the failure.

2. The gbrain-detected SKILL.md regen ran
   `bun_cmd run gen:skill-docs:user --host claude 2>&1 | tail -3`. The
   pipe makes the subshell's exit status `tail`'s (always 0), swallowing
   bun's failure so the `|| log "warning..."` clause right after never
   fires regardless of whether bun succeeded.

Tests exercise the exit-code path, not source-grep: stub bun_cmd to fail,
run the actual shell pattern, assert the `|| log` clause fires only on
the fixed shape and does NOT fire on the buggy shape with the pipe.
DavidMiserak added a commit to DavidMiserak/gstack that referenced this pull request Jun 8, 2026
… exit 1

Replaces both `exit 1` paths in the Playwright install/verify block with a
$_PW_FAIL_REASON accumulator that prints a named warning to stderr and
returns control. Previously a Chromium download failure (network blip,
locked-down corporate machine, missing Node.js on Windows) would abort
./setup mid-run with no skills registered, no hooks installed, no
migrations applied — and a re-run would start over from the top.

Now the failure only disables browser-driven features (/qa,
/design-review, make-pdf, sidebar, /pair-agent); skills, hooks, and
migrations land normally. The warning names which sub-step failed
(chromium-install / windows-no-node / windows-node-modules /
post-install-launch) so users and bug reports can be specific, and
tells the user to re-run ./setup to retry just this step.

Tests (5) exercise the actual block from setup with stubbed
ensure_playwright_browser and bunx, simulate install + post-install
failures, and assert the script exits 0 with the named warning. One
side-by-side test inlines the OLD bug shape and confirms it returned
non-zero, so the exit-code difference is proven, not just asserted.

Coordinates with garrytan#1838 (pins Playwright install version): both touch the
same code block, so whichever lands first needs the other to rebase. The
changes are complementary — garrytan#1838 fixes which Chromium build gets
installed; this PR fixes what happens when the install fails.

Carved out from garrytan#1883 per @jbetala7's review request.
DavidMiserak added a commit to DavidMiserak/gstack that referenced this pull request Jun 8, 2026
… exit 1

Replaces both `exit 1` paths in the Playwright install/verify block with a
$_PW_FAIL_REASON accumulator that prints a named warning to stderr and
returns control. Previously a Chromium download failure (network blip,
locked-down corporate machine, missing Node.js on Windows) would abort
./setup mid-run with no skills registered, no hooks installed, no
migrations applied — and a re-run would start over from the top.

Now the failure only disables browser-driven features (/qa,
/design-review, make-pdf, sidebar, /pair-agent); skills, hooks, and
migrations land normally. The warning names which sub-step failed
(chromium-install / windows-no-node / windows-node-modules /
post-install-launch) so users and bug reports can be specific, and
tells the user to re-run ./setup to retry just this step.

Tests (5) exercise the actual block from setup with stubbed
ensure_playwright_browser and bunx, simulate install + post-install
failures, and assert the script exits 0 with the named warning. One
side-by-side test inlines the OLD bug shape and confirms it returned
non-zero, so the exit-code difference is proven, not just asserted.

Coordinates with garrytan#1838 (pins Playwright install version): both touch the
same code block, so whichever lands first needs the other to rebase. The
changes are complementary — garrytan#1838 fixes which Chromium build gets
installed; this PR fixes what happens when the install fails.

Carved out from garrytan#1883 per @jbetala7's review request.
DavidMiserak added a commit to DavidMiserak/gstack that referenced this pull request Jun 8, 2026
…hat masks bun failures

Two surgical fixes called out by @jbetala7 on garrytan#1883, carved out as the
focused fast-merge PR they asked for.

1. link_codex_skill_dirs, link_factory_skill_dirs, and
   link_opencode_skill_dirs called literal `bun run gen:skill-docs ...`
   instead of routing through `bun_cmd`, the wrapper installed by
   prepare_bun_for_windows_compile. On a non-ASCII Windows username the
   literal call would exit non-zero; the next line checks if the output
   directory exists and prints a soft warning, masking the failure.

2. The gbrain-detected SKILL.md regen ran
   `bun_cmd run gen:skill-docs:user --host claude 2>&1 | tail -3`. The
   pipe makes the subshell's exit status `tail`'s (always 0), swallowing
   bun's failure so the `|| log "warning..."` clause right after never
   fires regardless of whether bun succeeded.

Tests exercise the exit-code path, not source-grep: stub bun_cmd to fail,
run the actual shell pattern, assert the `|| log` clause fires only on
the fixed shape and does NOT fire on the buggy shape with the pipe.
DavidMiserak added a commit to DavidMiserak/gstack that referenced this pull request Jun 8, 2026
…hat masks bun failures

Two surgical fixes called out by @jbetala7 on garrytan#1883, carved out as the
focused fast-merge PR they asked for.

1. link_codex_skill_dirs, link_factory_skill_dirs, and
   link_opencode_skill_dirs called literal `bun run gen:skill-docs ...`
   instead of routing through `bun_cmd`, the wrapper installed by
   prepare_bun_for_windows_compile. On a non-ASCII Windows username the
   literal call would exit non-zero; the next line checks if the output
   directory exists and prints a soft warning, masking the failure.

2. The gbrain-detected SKILL.md regen ran
   `bun_cmd run gen:skill-docs:user --host claude 2>&1 | tail -3`. The
   pipe makes the subshell's exit status `tail`'s (always 0), swallowing
   bun's failure so the `|| log "warning..."` clause right after never
   fires regardless of whether bun succeeded.

Tests exercise the exit-code path, not source-grep: stub bun_cmd to fail,
run the actual shell pattern, assert the `|| log` clause fires only on
the fixed shape and does NOT fire on the buggy shape with the pipe.
DavidMiserak added a commit to DavidMiserak/gstack that referenced this pull request Jun 9, 2026
…hat masks bun failures

Two surgical fixes called out by @jbetala7 on garrytan#1883, carved out as the
focused fast-merge PR they asked for.

1. link_codex_skill_dirs, link_factory_skill_dirs, and
   link_opencode_skill_dirs called literal `bun run gen:skill-docs ...`
   instead of routing through `bun_cmd`, the wrapper installed by
   prepare_bun_for_windows_compile. On a non-ASCII Windows username the
   literal call would exit non-zero; the next line checks if the output
   directory exists and prints a soft warning, masking the failure.

2. The gbrain-detected SKILL.md regen ran
   `bun_cmd run gen:skill-docs:user --host claude 2>&1 | tail -3`. The
   pipe makes the subshell's exit status `tail`'s (always 0), swallowing
   bun's failure so the `|| log "warning..."` clause right after never
   fires regardless of whether bun succeeded.

Tests exercise the exit-code path, not source-grep: stub bun_cmd to fail,
run the actual shell pattern, assert the `|| log` clause fires only on
the fixed shape and does NOT fire on the buggy shape with the pipe.
DavidMiserak added a commit to DavidMiserak/gstack that referenced this pull request Jun 10, 2026
…hat masks bun failures

Two surgical fixes called out by @jbetala7 on garrytan#1883, carved out as the
focused fast-merge PR they asked for.

1. link_codex_skill_dirs, link_factory_skill_dirs, and
   link_opencode_skill_dirs called literal `bun run gen:skill-docs ...`
   instead of routing through `bun_cmd`, the wrapper installed by
   prepare_bun_for_windows_compile. On a non-ASCII Windows username the
   literal call would exit non-zero; the next line checks if the output
   directory exists and prints a soft warning, masking the failure.

2. The gbrain-detected SKILL.md regen ran
   `bun_cmd run gen:skill-docs:user --host claude 2>&1 | tail -3`. The
   pipe makes the subshell's exit status `tail`'s (always 0), swallowing
   bun's failure so the `|| log "warning..."` clause right after never
   fires regardless of whether bun succeeded.

Tests exercise the exit-code path, not source-grep: stub bun_cmd to fail,
run the actual shell pattern, assert the `|| log` clause fires only on
the fixed shape and does NOT fire on the buggy shape with the pipe.
DavidMiserak added a commit to DavidMiserak/gstack that referenced this pull request Jun 11, 2026
…hat masks bun failures

Two surgical fixes called out by @jbetala7 on garrytan#1883, carved out as the
focused fast-merge PR they asked for.

1. link_codex_skill_dirs, link_factory_skill_dirs, and
   link_opencode_skill_dirs called literal `bun run gen:skill-docs ...`
   instead of routing through `bun_cmd`, the wrapper installed by
   prepare_bun_for_windows_compile. On a non-ASCII Windows username the
   literal call would exit non-zero; the next line checks if the output
   directory exists and prints a soft warning, masking the failure.

2. The gbrain-detected SKILL.md regen ran
   `bun_cmd run gen:skill-docs:user --host claude 2>&1 | tail -3`. The
   pipe makes the subshell's exit status `tail`'s (always 0), swallowing
   bun's failure so the `|| log "warning..."` clause right after never
   fires regardless of whether bun succeeded.

Tests exercise the exit-code path, not source-grep: stub bun_cmd to fail,
run the actual shell pattern, assert the `|| log` clause fires only on
the fixed shape and does NOT fire on the buggy shape with the pipe.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants