feat: worktree isolation for E2E tests + infrastructure elegance (v0.11.12.0) by garrytan · Pull Request #425 · garrytan/gstack

garrytan · 2026-03-24T05:29:56Z

Summary

E2E tests now run in git worktrees. Gemini and Codex tests no longer pollute the working tree. Each test suite gets an isolated worktree, and useful changes the AI agent makes are automatically harvested as patches for cherry-picking.
Harvest deduplication via SHA-256 — identical patches across runs are detected and skipped.
describeWithWorktree() helper — any E2E test can opt into worktree isolation with a one-line wrapper.
Gen-skill-docs modular resolver pipeline — the monolithic 1700-line generator is split into 8 focused resolver modules.
Project-scoped eval storage — results live in ~/.gstack/projects/$SLUG/evals/ instead of global ~/.gstack-dev/evals/.

Test Coverage

All new code paths have test coverage. 12 unit tests for WorktreeManager covering lifecycle, harvest, dedup, and error handling. Full E2E suite ran with EVALS_ALL=1 (completed in ~55 min).

Pre-Landing Review

No issues found. All code is test infrastructure — no SQL, no LLM output, no security surface.

Adversarial Review

Claude adversarial subagent found 14 findings:

Auto-fixed: copyDirSync symlink loop prevention (skip symlinks to avoid infinite recursion when .claude/skills/gstack is a symlink)
Pre-existing (not this PR): Codex timeout recording, setupBrowseShims EEXIST, find-browse path quoting, --yolo assertion
Edge cases (low probability): concurrent pruneStale, dedup locking, exit handler warning, git reset by agent, dedup index growth

No critical gaps. All error paths have try/catch with stderr logging.

TODOS

Added: "Extend worktree isolation to Claude E2E tests" (P3, deferred from CEO review)
No items completed in this PR.

Reviews

CEO Review: CLEAR (SELECTIVE EXPANSION — 3 expansions accepted)
Eng Review: CLEAR (FULL_REVIEW — 2 issues resolved)

Test plan

bun test passes (0 failures, <3s)
EVALS_ALL=1 bun run test:e2e completed (~55 min)
Worktree unit tests pass (12 tests)
Adversarial review findings addressed

🤖 Generated with Claude Code

Break the 3000-line monolith into 10 domain modules under scripts/resolvers/: types, constants, preamble, utility, browse, design, testing, review, codex-helpers, and index. Each module owns one domain of template generation. The preamble module introduces a 4-tier composition system (T1-T4) so skills only pay for the preamble sections they actually need, reducing token usage for lightweight skills by ~40%. Adds a token budget dashboard that prints after every generation run showing per-skill and total token counts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Tag all 23 templates with preamble-tier (T1-T4). Lightweight skills like /browse and /benchmark get a minimal preamble (~40% fewer tokens), while review skills get the full stack. Regenerate all SKILL.md files. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Move eval results and E2E run artifacts from ~/.gstack-dev/evals/ to ~/.gstack/projects/$SLUG/evals/ so each project's eval history lives alongside its other gstack data. Falls back to legacy path if slug detection fails. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

# Conflicts: # SKILL.md # cso/SKILL.md # cso/SKILL.md.tmpl # scripts/gen-skill-docs.ts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Reusable platform module (lib/worktree.ts) that creates git worktrees for test isolation and harvests useful changes as patches. Includes SHA-256 dedup, original SHA tracking for committed change detection, and automatic gitignored artifact copying (.agents/, browse/dist/). 12 unit tests covering lifecycle, harvest, dedup, and error handling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add createTestWorktree(), harvestAndCleanup(), and describeWithWorktree() helpers to e2e-helpers.ts. Add harvest field to EvalTestEntry for eval-store integration. Register lib/worktree.ts as a global touchfile. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Switch both test suites from cwd: ROOT to worktree isolation. Gemini (--yolo) no longer pollutes the working tree. Codex (read-only) gets worktree for consistency. Useful changes are harvested as patches for cherry-picking. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

# Conflicts: # package.json # scripts/gen-skill-docs.ts

Adversarial review caught that .claude/skills/gstack may be a symlink back to the repo root, causing copyDirSync to recurse infinitely when copying gitignored artifacts into worktrees. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

# Conflicts: # CHANGELOG.md

github-actions · 2026-03-24T05:35:15Z

E2E Evals: ❌ FAIL

74/90 tests passed | $15.48 total cost | 12 parallel runners

Suite	Result	Status	Cost
e2e-browse	7/7	✅	$0.33
e2e-deploy	4/4	✅	$0.54
e2e-design	7/7	✅	$2.04
e2e-plan	6/6	✅	$2.62
e2e-qa-bugs	3/3	✅	$1.62
e2e-qa-workflow	4/4	✅	$1.53
e2e-review	7/7	✅	$1.85
e2e-routing	8/19	❌	$3.67
e2e-workflow	4/9	❌	$0.8
llm-judge	24/24	✅	$0.48

12x ubicloud-standard-2 (Docker: pre-baked toolchain + deps) | wall clock ≈ slowest suite

Failures

❌ journey-ideation: success
❌ journey-qa: success
❌ journey-visual-qa: success
❌ journey-debug: success
❌ journey-qa: success
❌ journey-visual-qa: success
❌ journey-debug: success
❌ journey-qa: success
❌ journey-visual-qa: success
❌ journey-debug: success
❌ journey-design-system: success
❌ /ship local workflow: success
❌ /ship local workflow: success
❌ /ship local workflow: success
❌ /setup-browser-cookies detect: error_max_turns
❌ /setup-browser-cookies detect: error_max_turns

The LLM consistently presents well-formatted A/B choices with pros/cons but doesn't always use the exact string "RECOMMENDATION". Accept case-insensitive "recommend", "option a", "which do you want", or "which approach" as equivalent signals of a structured recommendation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Covers the resolver modules split in v0.11.12.0 (garrytan#425): - preamble: update check, session tracking, AskUserQuestion format, contributor mode - testing: test bootstrap, coverage audit (plan/ship/review modes) - utility: slug eval/setup, base branch detect, QA methodology - codex-helpers: frontmatter parsing, description condensing, skill naming, host transform - constants: AI slop blacklist, OpenAI rejections, error handling - browse: setup instructions, binary detection 34 tests, 57 assertions. No source code modified.

…11.12.0) (#425) * refactor: extract gen-skill-docs into modular resolver architecture Break the 3000-line monolith into 10 domain modules under scripts/resolvers/: types, constants, preamble, utility, browse, design, testing, review, codex-helpers, and index. Each module owns one domain of template generation. The preamble module introduces a 4-tier composition system (T1-T4) so skills only pay for the preamble sections they actually need, reducing token usage for lightweight skills by ~40%. Adds a token budget dashboard that prints after every generation run showing per-skill and total token counts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: tiered preamble — skills only pay for what they use Tag all 23 templates with preamble-tier (T1-T4). Lightweight skills like /browse and /benchmark get a minimal preamble (~40% fewer tokens), while review skills get the full stack. Regenerate all SKILL.md files. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: migrate eval storage to project-scoped paths Move eval results and E2E run artifacts from ~/.gstack-dev/evals/ to ~/.gstack/projects/$SLUG/evals/ so each project's eval history lives alongside its other gstack data. Falls back to legacy path if slug detection fails. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: sync package.json version with VERSION after merge Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add WorktreeManager for isolated test environments Reusable platform module (lib/worktree.ts) that creates git worktrees for test isolation and harvests useful changes as patches. Includes SHA-256 dedup, original SHA tracking for committed change detection, and automatic gitignored artifact copying (.agents/, browse/dist/). 12 unit tests covering lifecycle, harvest, dedup, and error handling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: integrate worktree isolation into E2E test infrastructure Add createTestWorktree(), harvestAndCleanup(), and describeWithWorktree() helpers to e2e-helpers.ts. Add harvest field to EvalTestEntry for eval-store integration. Register lib/worktree.ts as a global touchfile. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: run Gemini and Codex E2E tests in worktrees Switch both test suites from cwd: ROOT to worktree isolation. Gemini (--yolo) no longer pollutes the working tree. Codex (read-only) gets worktree for consistency. Useful changes are harvested as patches for cherry-picking. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: skip symlinks in copyDirSync to prevent infinite recursion Adversarial review caught that .claude/skills/gstack may be a symlink back to the repo root, causing copyDirSync to recurse infinitely when copying gitignored artifacts into worktrees. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: bump version and changelog (v0.11.12.0) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: relax session-awareness assertion to accept structured options The LLM consistently presents well-formatted A/B choices with pros/cons but doesn't always use the exact string "RECOMMENDATION". Accept case-insensitive "recommend", "option a", "which do you want", or "which approach" as equivalent signals of a structured recommendation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…11.12.0) (garrytan#425) * refactor: extract gen-skill-docs into modular resolver architecture Break the 3000-line monolith into 10 domain modules under scripts/resolvers/: types, constants, preamble, utility, browse, design, testing, review, codex-helpers, and index. Each module owns one domain of template generation. The preamble module introduces a 4-tier composition system (T1-T4) so skills only pay for the preamble sections they actually need, reducing token usage for lightweight skills by ~40%. Adds a token budget dashboard that prints after every generation run showing per-skill and total token counts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: tiered preamble — skills only pay for what they use Tag all 23 templates with preamble-tier (T1-T4). Lightweight skills like /browse and /benchmark get a minimal preamble (~40% fewer tokens), while review skills get the full stack. Regenerate all SKILL.md files. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: migrate eval storage to project-scoped paths Move eval results and E2E run artifacts from ~/.gstack-dev/evals/ to ~/.gstack/projects/$SLUG/evals/ so each project's eval history lives alongside its other gstack data. Falls back to legacy path if slug detection fails. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: sync package.json version with VERSION after merge Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add WorktreeManager for isolated test environments Reusable platform module (lib/worktree.ts) that creates git worktrees for test isolation and harvests useful changes as patches. Includes SHA-256 dedup, original SHA tracking for committed change detection, and automatic gitignored artifact copying (.agents/, browse/dist/). 12 unit tests covering lifecycle, harvest, dedup, and error handling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: integrate worktree isolation into E2E test infrastructure Add createTestWorktree(), harvestAndCleanup(), and describeWithWorktree() helpers to e2e-helpers.ts. Add harvest field to EvalTestEntry for eval-store integration. Register lib/worktree.ts as a global touchfile. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: run Gemini and Codex E2E tests in worktrees Switch both test suites from cwd: ROOT to worktree isolation. Gemini (--yolo) no longer pollutes the working tree. Codex (read-only) gets worktree for consistency. Useful changes are harvested as patches for cherry-picking. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: skip symlinks in copyDirSync to prevent infinite recursion Adversarial review caught that .claude/skills/gstack may be a symlink back to the repo root, causing copyDirSync to recurse infinitely when copying gitignored artifacts into worktrees. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * chore: bump version and changelog (v0.11.12.0) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: relax session-awareness assertion to accept structured options The LLM consistently presents well-formatted A/B choices with pros/cons but doesn't always use the exact string "RECOMMENDATION". Accept case-insensitive "recommend", "option a", "which do you want", or "which approach" as equivalent signals of a structured recommendation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

garrytan and others added 12 commits March 23, 2026 10:52

Merge remote-tracking branch 'origin/main' into garrytan/elegance

71a0c47

# Conflicts: # SKILL.md # cso/SKILL.md # cso/SKILL.md.tmpl # scripts/gen-skill-docs.ts

fix: sync package.json version with VERSION after merge

062b92f

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into garrytan/elegance

d5e857d

# Conflicts: # package.json # scripts/gen-skill-docs.ts

chore: bump version and changelog (v0.11.12.0)

4daecff

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into garrytan/elegance

90cf700

# Conflicts: # CHANGELOG.md

garrytan merged commit dc5e053 into main Mar 24, 2026
18 checks passed

HMAKT99 mentioned this pull request Mar 24, 2026

test: 34 tests for modular resolver pipeline (6 untested modules) #434

Closed

3 tasks

silviot mentioned this pull request Apr 3, 2026

bin/gstack-global-discover is a tracked generated binary — bloats clone size and causes perpetual dirty working tree #779

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: worktree isolation for E2E tests + infrastructure elegance (v0.11.12.0)#425

feat: worktree isolation for E2E tests + infrastructure elegance (v0.11.12.0)#425
garrytan merged 13 commits into
mainfrom
garrytan/elegance

garrytan commented Mar 24, 2026

Uh oh!

github-actions Bot commented Mar 24, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

garrytan commented Mar 24, 2026

Summary

Test Coverage

Pre-Landing Review

Adversarial Review

TODOS

Reviews

Test plan

Uh oh!

github-actions Bot commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Evals: ❌ FAIL

Failures

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented Mar 24, 2026 •

edited

Loading