Skip to content

feat: add handoff artifact generation for session persistence#7

Closed
Jah-yee wants to merge 1 commit into
garrytan:mainfrom
Jah-yee:feature/report-artifacts
Closed

feat: add handoff artifact generation for session persistence#7
Jah-yee wants to merge 1 commit into
garrytan:mainfrom
Jah-yee:feature/report-artifacts

Conversation

@Jah-yee

@Jah-yee Jah-yee commented Mar 12, 2026

Copy link
Copy Markdown

Summary

This PR adds automatic handoff artifact generation to solve issue #6 - persisting review context between sessions.

Changes

  • Add REVIEW_ARTIFACT.md generation to plan-ceo-review SKILL
  • Add REVIEW_CONTEXT.json generation to plan-ceo-review SKILL
  • Add same artifact generation to plan-eng-review SKILL
  • Artifacts capture full review context including:
    • Executive summary
    • All section findings
    • Key decisions
    • Unresolved questions
    • TODOs and deferred items
    • Diagrams

Why This Helps

Currently, after running /plan-ceo-review or /plan-eng-review, only a short TODO.md remains. When the session is cleared, all the detailed analysis is lost. This PR adds automatic artifact generation that saves a comprehensive report before the review concludes, so future sessions can load the full context.

Fixes #6

- Add REVIEW_ARTIFACT.md and REVIEW_CONTEXT.json generation to plan-ceo-review
- Add same artifact generation to plan-eng-review
- Artifacts capture full review context for future sessions

Fixes #6
@garrytan

Copy link
Copy Markdown
Owner

good idea, going to do it in another way

@garrytan garrytan closed this Mar 12, 2026
@vasiliyk

Copy link
Copy Markdown

@dependabot close

garrytan added a commit that referenced this pull request Mar 23, 2026
- install.sh: curl-pipe-bash installer with prereq checks (git, bun),
  upgrade detection (git pull if already installed), transparency note
  about update-check pings
- setup: add install ping at end (gstack-update-check --force) to
  register day-zero installs in Supabase
- Install ping only in setup (not install.sh) to avoid double-counting
  (Codex review fix #7)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
rafiulnakib pushed a commit to rafiulnakib/gstack that referenced this pull request Mar 26, 2026
…scanner limitations

Module split: scan-imports.ts (1252 lines) → scanner/{core,aliases,routes,dead-code,css,monorepo,non-ts}.ts
Fixes: garrytan#1 non-TS file discovery, garrytan#2 Vite AST alias parsing, garrytan#3 React Router AST route discovery,
garrytan#4 dynamic import tracking, garrytan#5 configurable max depth, garrytan#6 git frequency fallback,
garrytan#7 MEGA depth cap CLI flag, garrytan#8 dead code false positive reduction, garrytan#9 CSS import graphs,
garrytan#10 monorepo auto-detection and multi-root scanning.
24601 pushed a commit to 24601/gastack that referenced this pull request Mar 29, 2026
- install.sh: curl-pipe-bash installer with prereq checks (git, bun),
  upgrade detection (git pull if already installed), transparency note
  about update-check pings
- setup: add install ping at end (gstack-update-check --force) to
  register day-zero installs in Supabase
- Install ping only in setup (not install.sh) to avoid double-counting
  (Codex review fix garrytan#7)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request Apr 28, 2026
The productivity multiplier. /scrape discovers the flow; /skillify writes
it as deterministic Playwright-via-browse-client code so the next /scrape
on the same intent runs in ~200ms.

11-step flow with three locked contracts from the v1.19.0.0 plan review:

D1 — Provenance guard. Walk back ≤10 agent turns for a clearly-bounded
/scrape result. Refuse with one specific message if cold. No silent
synthesis from chat fragments.

D2 — Synthesis input slice. Extract ONLY the final-attempt $B calls that
produced the JSON the user accepted, plus the user's intent string. Drop
failed selectors, drop unrelated chat, drop earlier-session content.
Closes Codex finding #6 by picking option (b) from the design doc:
re-prompt from agent's own context, not a structured recorder.

D3 — Atomic write. Stage to ~/.gstack/.tmp/skillify-<spawnId>/, run
$B skill test against the temp dir, only rename into the final tier path
on test pass + user approval. Test fail or approval reject = rm -rf the
temp dir entirely.

Default tier: global (~/.gstack/browser-skills/<name>/). --project flag
overrides to per-project. Generated test must include at least one ★★
assertion (parsed JSON has expected shape + non-empty key fields), not a
smoke ★ assertion.

Bun runtime distribution (Codex finding #7) carries over to Phase 4.
Documented in the skill's Limits section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request Apr 28, 2026
TODOS.md:
- Narrows existing P1 (was "/scrape and /automate") to "/scrape and
  /skillify" — the /scrape + /skillify wedge ships in this branch.
  Codex finding #6 (synthesis) removed from Cons (resolved by D2);
  finding #7 (Bun runtime) stays as the open carry-over.
- Adds new ## P0 above PACING_UPDATES_V0 for the /automate follow-up.
  Same skillify pattern as /scrape, different trust profile (per-step
  confirmation gate when running non-codified). Reuses /skillify and
  the D3 helper as-is. Effort M.

BROWSER_SKILLS_V1.md:
- Phase table re-organized into 1, 2a, 2b, 3, 4. Phase 1 + Phase 2a
  consolidate into v1.19.0.0 ship (the v1.16.0.0 branch-internal
  bump never landed on main).
- New "Phase 2a" sub-section captures the four decisions locked
  during /plan-eng-review:
    D1 — provenance guard (≤10 turn walk-back, refuse if cold)
    D2 — synthesis input slice (final-attempt $B calls only,
         closes Codex finding #6)
    D3 — atomic write discipline (temp-dir-then-rename via new
         browse/src/browser-skill-write.ts helper)
    D4 — full test scope (5 gate E2E + 1 unit + smoke)
- New "Phase 2b" sketch for /automate: same skillify machinery,
  per-mutating-step confirmation gate, deferred to next branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request Apr 28, 2026
Consolidates the v1.16.0.0 branch-internal bump (Phase 1 runtime, never
landed on main) with Phase 2a (/scrape + /skillify + atomic-write helper)
into one v1.19.0.0 ship per CLAUDE.md "Never orphan branch-internal
versions" rule.

Headline: Browser-skills land end-to-end. /scrape <intent> first call
drives the page; second call runs the codified script in 200ms.

The unified CHANGELOG entry covers:
- Phase 1 runtime: $B skill list/show/run/test/rm, scoped tokens,
  3-tier storage, bundled hackernews-frontpage reference.
- Phase 2a: /scrape + /skillify gstack skills, browser-skill-write.ts
  atomic helper, 5 gate-tier E2E + 34 unit assertions.

Numbers table updated: 5 new modules (+browser-skill-write), 2 new
gstack skills, 6 of 8 Codex outside-voice findings resolved (synthesis
#6 closed by D2; Bun runtime #7 + OS sandbox #1 stay deferred to Phase 4).

/automate (Phase 2b) is split out as P0 in TODOS for the next branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
samuelbat15 added a commit to starboxe29/gstack that referenced this pull request Apr 28, 2026
…ytan#7)

Add FIRST_RUN_COMPLETE variable (immutable session-start snapshot) to all
skill preambles. Session 1 shows only the lake intro. Telemetry, proactive,
and routing injection are deferred to session 2+, reducing first-run
interruptions from 4 → 1.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Likas07 added a commit to Likas07/gstack that referenced this pull request Apr 28, 2026
…omplete

Two doc updates:

1. **New:** `docs/designs/2026-04-26-daedalus-host-support.md` — public
   record of the daedalus-host fork work. Maps Steps 1-11 from the
   personal office-hours design doc (`~/.gstack/projects/garrytan-gstack/...`,
   not in repo) to the commits that landed each one. Captures architecture
   decisions worth remembering (leak-test stripExempt order, host-conditional
   syntax constraints, why codex.ts carries both directions of pathRewrites,
   how the codex boundary became host-aware). Resolves OQs garrytan#1-3 and garrytan#6 with
   what was decided; documents OQs garrytan#4 (voice naming), garrytan#5 (gstack-daedalus
   outside-voice analog), and garrytan#7 (mainline merge strategy) as deferred.

2. **Update:** `docs/plans/2026-04-26-daedalus-host-support.md` — original
   host-add implementation plan. Status banner added at the top noting it's
   complete and pointing to the new design doc for the follow-on work
   (PRIMARY_HOST flip + leak gate + codex unblock — separate scope from the
   guest-host plan this file covered). All 20 task checkboxes ticked.

The personal design doc at `~/.gstack/projects/garrytan-gstack/likas-daedalus-host-design-20260426-224816.md`
remains the source-of-intent (it's where scope and OQs were originally
debated). The new repo-tracked doc is the source-of-record for what shipped.

bun test green; 43/43 skills gated; final state matches the design's
success criteria modulo the dogfooding window (Step 10, ongoing user task).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request Apr 29, 2026
)

* feat(gbrain-sync): queue primitives + writer shims

Adds bin/gstack-brain-enqueue (atomic append to sync queue) and
bin/gstack-jsonl-merge (git merge driver, ts-sort with SHA-256 fallback).
Wires one backgrounded enqueue call into learnings-log, timeline-log,
review-log, and developer-profile --migrate. question-log and
question-preferences stay local per Codex v2 decision.

gstack-config gains gbrain_sync_mode (off/artifacts-only/full) and
gbrain_sync_mode_prompted keys, plus GSTACK_HOME env alignment so
tests don't leak into real ~/.gstack/config.yaml.

* feat(gbrain-sync): --once drain + secret scan + push

bin/gstack-brain-sync is the core sync binary. Subcommands: --once
(drain queue, allowlist-filter, privacy-class-filter, secret-scan
staged diff, commit with template, push with fetch+merge retry),
--status, --skip-file <path>, --drop-queue --yes, --discover-new
(cursor-based detection of artifact writes that skip the shim).

Secret regex families: AWS keys, GitHub tokens (ghp_/gho_/ghu_/ghs_/
ghr_/github_pat_), OpenAI sk-, PEM blocks, JWTs, bearer-token-in-JSON.
On hit: unstage, preserve queue, print remediation hint (--skip-file
or edit), exit clean. No daemon — invoked by preamble at skill
boundaries.

* feat(gbrain-sync): init, restore, uninstall, consumer registry

bin/gstack-brain-init: idempotent first-run. git init ~/.gstack/,
.gitignore=*, canonical .brain-allowlist + .brain-privacy-map.json,
pre-commit secret-scan hook (defense-in-depth), merge driver registration
via git config, gh repo create --private OR arbitrary --remote <url>,
initial push, ~/.gstack-brain-remote.txt for new-machine discovery,
GBrain consumer registration via HTTP POST.

bin/gstack-brain-restore: safe new-machine bootstrap. Refuses clobber
of existing allowlisted files, clones to staging, rsync-copies tracked
files, re-registers merge drivers (required — not cloned from remote),
rehydrates consumers.json, prompts for per-consumer tokens.

bin/gstack-brain-uninstall: clean off-ramp. Removes .git + .brain-*
files + consumers.json + config keys. Preserves user data (learnings,
plans, retros, profile). Optional --delete-remote for GitHub repos.

bin/gstack-brain-consumer + bin/gstack-brain-reader (symlink alias):
registry management. Internal 'consumer' term; user-facing 'reader'
per DX review decision.

* feat(gbrain-sync): preamble block — privacy gate + boundary sync

scripts/resolvers/preamble/generate-brain-sync-block.ts emits bash that
runs at every skill invocation:
- Detects ~/.gstack-brain-remote.txt on machines without local .git
  and surfaces a restore-available hint (does NOT auto-run restore).
- Runs gstack-brain-sync --once at skill start to drain any pending
  writes (and at skill end via prose instruction).
- Once-per-day auto-pull (cached via .brain-last-pull) for append-only
  JSONL files.
- Emits BRAIN_SYNC: status line every skill run.

Also emits prose for the host LLM to fire the one-time privacy
stop-gate (full / artifacts-only / off) when gbrain is detected and
gbrain_sync_mode_prompted is false. Wired into preamble.ts composition.

* test(gbrain-sync): 27-test consolidated suite

test/brain-sync.test.ts covers:
- Config: validation, defaults, GSTACK_HOME env isolation
- Enqueue: no-op gates, skip list, concurrent atomicity, JSON escape
- JSONL merge driver: 3-way + ts-sort + SHA-256 fallback
- Init + sync: canonical file creation, merge driver registration,
  push-reject + fetch+merge retry path
- Init refuses different remote (idempotency)
- Cross-machine restore round-trip (machine A write → machine B sees)
- Secret scan across all 6 regex families (AWS, GH, OpenAI, PEM, JWT,
  bearer-JSON). --skip-file unblock remediation
- Uninstall removes sync config, preserves user data
- --discover-new idempotence via mtime+size cursor

Behaviors verified via integration smokes during implementation. Known
follow-up: bun-test 5s default timeout needs 30s wrapper for
spawnSync-heavy tests.

* docs(gbrain-sync): user guide + error lookup + README section

docs/gbrain-sync.md: setup walkthrough, privacy modes, cross-machine
workflow, secret protection, two-machine conflict handling, uninstall,
troubleshooting reference.

docs/gbrain-sync-errors.md: problem/cause/fix index for every
user-visible error. Patterned on Rust's error docs + Stripe's API
error reference.

README.md: 'Cross-machine memory with GBrain sync' section near the
top (discovery moment), plus docs-table entry.

* chore: bump version and changelog (v1.7.0.0)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore: regenerate SKILL.md files for gbrain-sync preamble block

Re-runs bun run gen:skill-docs after adding generateBrainSyncBlock
to scripts/resolvers/preamble.ts in a2aa8a0. CI check-freshness
caught the drift. All 36 SKILL.md files regenerated with the new
skill-start bash block + privacy-gate prose + skill-end sync
instructions baked in.

* fix(test): session-awareness reads AskUserQuestion Format from a Tier 2+ SKILL.md

The test was reading ROOT/SKILL.md (browse skill, Tier 1) which never
contained '## AskUserQuestion Format' — that section is only emitted
for Tier 2+ skills by scripts/resolvers/preamble.ts. As a result the
agent was prompted with an empty format guide and only emitted
'RECOMMENDATION' intermittently, making the test flaky.

Pre-existing on main (same ROOT/SKILL.md shape there) — surfaced now
because the agent run didn't hit the RECOMMENDATION/recommend/option a
fallback strings in this particular attempt.

Fix: read from office-hours/SKILL.md (Tier 3, always has the section)
with a fallback that scans for the first top-level skill dir whose
SKILL.md contains the header. Future template moves won't break this
test again.

* feat(browse): domain-skills storage + state machine

New module browse/src/domain-skills.ts implements the per-site notes
the agent writes for itself, persisted as type:"domain" rows alongside
/learn's per-project learnings.

Three scopes layered: per-project default, global by explicit promotion.
Project-active shadows global for the same host.

State machine (T6 — codex outside-voice):
  quarantined --3 uses w/o flag--> active(project) --promote--> global
        ^                                |
        +----- classifier flag during use

- Append-only JSONL with O_APPEND for atomic small writes
- Tolerant parser drops partial trailing line on read
- Tombstone for deletes (compactor cleans up later)
- Version log per (host, scope) enables rollback
- Hostname derived from active tab top-level origin (T3 confused-deputy fix)
- writeSkill rejects classifier_score >= 0.85 with structured error

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browse): domain-skills storage + state machine

14 tests covering:
- T3 hostname normalization (lowercase, www. strip, port/path/query strip,
  subdomain-exact preserved)
- T4 scope shadowing (per-project active shadows global for same host)
- T5 persistence (version monotonicity, tolerant parser drops partial line)
- T6 state machine (quarantined → active after N=3 uses, classifier-flag
  blocks promotion, save-time score >= 0.85 rejected)
- Rollback by version log (restore prior body, advance version counter)
- Tombstone deletion (read returns null after delete)

All 14 pass in 27ms via bun test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): $B domain-skill subcommands

Wire the domain-skills storage layer into the browse CLI as a META command:

  $B domain-skill save              save body from stdin or --from-file
                                     (host derived from active tab — T3)
  $B domain-skill list              list all skills visible to current project
  $B domain-skill show <host>       print skill body
  $B domain-skill edit <host>       open in $EDITOR
  $B domain-skill promote-to-global <host>  cross-project promotion (T4)
  $B domain-skill rollback <host> [--global]  restore prior version
  $B domain-skill rm <host> [--global]        tombstone

Save path runs L1-L3 content filters from content-security.ts (importable
in compiled binary, unlike L4 ML classifier — see CLAUDE.md). The L4
classifier scan happens in sidebar-agent at prompt-injection load time.

Output is structured (problem + cause + suggested-action) per DX D7.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): $B cdp escape hatch — deny-default allowlist + two-tier mutex

Codex T2: flip CDP posture to deny-default. Allowed methods enumerated in
cdp-allowlist.ts with (scope: tab|browser, output: trusted|untrusted,
justification) per entry.

Initial allowlist (~25 methods) covers:
- Accessibility tree extraction (read-only)
- DOM/CSS inspection (read-only)
- Performance metrics
- Tracing
- Emulation viewport/UA override
- Page screenshot/PDF capture (output is binary, no marker injection vector)
- Network.enable/disable (no bodies/cookies — those are exfil surfaces)
- Runtime.getProperties (NO evaluate/callFunctionOn — those would be RCE)

Page.navigate is INTENTIONALLY NOT allowed; agents use $B goto which
goes through the URL blocklist.

Codex T7: two-tier mutex. tab-scoped methods take per-tab lock; browser-
scoped take global lock that blocks all tab locks. 5s acquire timeout
yields CDPMutexAcquireTimeout (no silent hangs). All lock acquires use
try/finally so errors don't leak the lock.

Path A from spike: uses Playwright's newCDPSession() per page. No second
WebSocket, no need for --remote-debugging-port. CDPSession is cached
per page in a WeakMap and cleared on page close.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browse): CDP allowlist + two-tier mutex

13 tests:
- Allowlist linter: every entry has 4 required fields, no duplicates,
  justification length > 20 chars
- Deny-list verification: dangerous methods (Runtime.evaluate, Page.navigate,
  Network.getResponseBody, Browser.close, Target.attachToTarget, etc.) are
  NOT allowed (Codex T2 categories 4-7)
- Per-tab mutex serializes ops on same tab
- Per-tab mutex allows parallel ops across different tabs
- Global lock blocks tab locks; tab locks block global lock
- Acquire timeout yields CDPMutexAcquireTimeout (no silent hang)
- Timeout error names the tab id and the timeout budget

Also extends Network.disable justification to satisfy linter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): telemetry signals + project-slug helper

Lightweight telemetry per DX D9: piggybacks on ~/.gstack/analytics/ pattern.
Hostname + aggregate counters only, no body content. GSTACK_TELEMETRY_OFF=1
silences. Fire-and-forget — never blocks calling path.

Signals fired so far:
- domain_skill_saved {host, scope, state, bytes}
- domain_skill_save_blocked {host, reason}

(domain_skill_fired and cdp_method_* fired in subsequent commits.)

Also extracts project-slug resolution into project-slug.ts so server.ts
and domain-skill-commands.ts share one cached lookup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): sidebar prompt-context injection + CDP telemetry

server.ts spawnClaude now:
- Imports per-project domain skill matching the active tab's hostname
  via readDomainSkill()
- Wraps the body in UNTRUSTED EXTERNAL CONTENT envelope (so the L4
  classifier in sidebar-agent sees it at load time per Eng D4)
- Appends as <domain-skill source="..." host="..." version="..."> block
- Fires domain_skill_fired telemetry (host, source, version)
- Calls recordSkillUse fire-and-forget so the auto-promote-after-N=3
  state machine advances on each successful prompt injection

System prompt also gets a one-liner introducing $B domain-skill commands
to agents (DX D4 start-of-task discoverability hint).

cdp-bridge.ts fires:
- cdp_method_denied (drives next allow-list growth)
- cdp_method_lock_acquire_ms (P50/P99 quantile observability)
- cdp_method_called (allowed methods)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browse): telemetry module

3 tests covering:
- logTelemetry writes JSONL with ts injected
- GSTACK_TELEMETRY_OFF=1 silences all events
- logTelemetry never throws on disk failures

Uses GSTACK_HOME env var to redirect writes to a tmp dir; the telemetry
module reads HOME lazily so test mutations take effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: domain-skills reference + error lookup table

docs/domain-skills.md mirrors the layered shape of docs/gbrain-sync.md
(DX D8): how agents use it, state machine, storage layout, security model
(L1-L3 + L4 layered defense), error reference table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(readme): browser-harness-js plug + domain-skills section

New "Domain skills + raw CDP escape hatch" section under "The sprint"
covering both v1.8.0.0 features. Plugs browser-use/browser-harness-js
as the no-rails alternative for users who want raw CDP without gstack's
security stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.8.0.0)

Branch-scoped bump on top of merged 1.7.0.0 base. CHANGELOG entry covers
the full v1.8.0.0 scope: $B domain-skill, $B cdp escape hatch, two-tier
mutex, telemetry signals, sidebar prompt-context injection. Includes
Codex outside-voice trail (7 of 20 findings resolved, 12 mooted by T1
scope drop).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* todos: 7 follow-ups from v1.8.0.0 review trail

P1: Self-authoring $B commands with out-of-process worker isolation
    (Codex T1 deferred from v1.8.0.0 — needs real isolation design)
P2: Migrate /learn to SQLite (Codex T5 long-term primitive fix)
P2: Remove plan-mode handshake from /plan-devex-review (skill bug)
P3: GBrain skillpack publishing for domain-skills
P3: Replay/record demonstrated flows to domain-skills
P3: $B commands review batch-mode UX (alternative to inline approval)
P3: Heuristic command-gap watcher (DX D4 alternative C)

Each entry has the standard What/Why/Pros/Cons/Context/Effort/Priority/
Depends-on shape so anyone picking these up later has full context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(browse): lazy GSTACK_HOME resolution in domain-skills

Module-level constants (GLOBAL_FILE, derived path) were evaluated at
module-load and cached. When E2E and unit tests run in the same Bun
test pass and set GSTACK_HOME differently, the second test sees the
first test's path. Switch to lazy gstackHome() / globalFile() / projectFile()
helpers so process.env mutations take effect.

Mirrors the pattern already used in telemetry.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browse): E2E gate-tier tests for domain-skills + CDP

domain-skills-e2e.test.ts (4 tests):
- save derives host from active tab top-level origin (T3)
- save lands quarantined; list surfaces it
- readSkill returns null until 3 uses without flag promote to active (T6)
- save without an active page errors with structured guidance

cdp-e2e.test.ts (8 tests):
- Accessibility.getFullAXTree returns wrapped JSON (allowed, untrusted-output)
- Performance.getMetrics returns plain JSON (allowed, trusted-output)
- Runtime.evaluate DENIED with structured guidance (T2 RCE block)
- Page.navigate DENIED (must use $B goto for blocklist routing)
- Network.getResponseBody DENIED (exfil block)
- malformed JSON params surfaces clear error
- non Domain.method format surfaces clear error
- $B cdp help returns help text

Both files boot a real Chromium via BrowserManager.launch() and exercise
the dispatch handlers end-to-end. Total 12 E2E tests in <2s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: regenerate SKILL.md files with new $B commands

bun run gen:skill-docs picks up the domain-skill and cdp META_COMMANDS
entries added in commands.ts. Both top-level SKILL.md and browse/SKILL.md
now list the new commands in their Meta and Inspection tables.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(fixtures): regenerate ship SKILL.md golden baselines for v1.7.0.0

Pre-existing failures inherited from garrytan/gbrain-support: the GBrain
Sync preamble block (added in v1.7.0.0) appears in regenerated SKILL.md
output but the golden baselines in test/fixtures/golden/ were never
updated. Three failures fixed:

  golden-file regression > Claude ship skill matches golden baseline
  golden-file regression > Codex ship skill matches golden baseline
  golden-file regression > Factory ship skill matches golden baseline

Goldens regenerated by copying the current ship/SKILL.md, codex
.agents/skills/gstack-ship/SKILL.md, and .factory/skills/gstack-ship/SKILL.md
files. Diff is the v1.7.0.0 GBrain Sync preamble block + privacy stop-gate
(no behavioral changes — just preamble text).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(brain-sync): bearer-token regex catches values with leading space

Pre-existing bug from v1.7.0.0: the bearer-token-json secret pattern
required values matching [A-Za-z0-9_./+=-]{16,}, which rejected the
"Bearer <token>" form because the literal space after "Bearer" wasn't
in the character class. Real Authorization headers use "Bearer <token>"
syntax, and the test fixture
  '"authorization":"Bearer abcdef1234567890abcdef1234567890"'
sat unscanned despite being a leak-class secret.

One-character fix: add space to the value character class. Test
'gstack-brain-sync secret scan > blocks bearer-json' now passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(brain-sync): GSTACK_HOME isolation test compares mtime, not content

Pre-existing flaky test: the GSTACK_HOME-overrides-real-config test asserted
the real ~/.gstack/config.yaml does NOT contain "gbrain_sync_mode: full"
after the test. That fails for any user whose real config legitimately has
that key set from prior usage — the test's invariant is "the command did
not modify the real file," not "the real file lacks any specific value."

Switch to mtime + content snapshot: capture both BEFORE running the command,
then verify both are unchanged after. Also add a positive assertion that
the tmpHome config DID get the new key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(skill-validation): exempt deliberate large fixtures from 2MB limit

Pre-existing failure: the "git tracks no files larger than 2MB" test
caught browse/test/fixtures/security-bench-haiku-responses.json (28.8MB
of replay data committed in v1.6.4.0 for security benchmark gate tests).

The test exists to catch accidentally-committed binaries (Mach-O dist
binaries, etc), not to forbid all large files. Add an explicit
LARGE_FIXTURE_EXEMPTIONS allowlist so deliberate replay fixtures pass
the gate while accidental binaries still fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(skill-token): mint scoped tokens per skill spawn

Wraps token-registry.createToken/revokeToken with skill-specific
clientId encoding (skill:<name>:<spawn-id>) and read+write defaults.
Skill scripts get a per-spawn capability token bound to browser-driving
commands; the daemon root token never leaves the harness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse-client): SDK for browser-skill scripts

Thin wrapper over POST /command with bearer auth. Resolves daemon
port + token from GSTACK_PORT + GSTACK_SKILL_TOKEN env vars first
(set by $B skill run when spawning), falls back to .gstack/browse.json
for standalone debug runs.

Convenience methods cover the read+write surface skills typically need:
goto, click, fill, text, html, snapshot, links, forms, accessibility,
attrs, media, data, scroll, press, type, select, wait, hover, screenshot.
Low-level command(cmd, args) escape hatch for anything else.

This is the canonical SDK source. Each browser-skill ships a sibling
copy at <skill>/_lib/browse-client.ts so each skill is fully portable
and version-pinned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browser-skills): 3-tier storage helpers

listBrowserSkills() walks project > global > bundled (first-wins),
parses SKILL.md frontmatter, no INDEX.json. readBrowserSkill() does
the same for a single name. tombstoneBrowserSkill() moves a skill
into .tombstones/<name>-<ts>/ for recoverability.

Frontmatter parser handles the subset browser-skills need: scalars
(host, description, trusted, version, source), string lists
(triggers), and arg-mapping lists ([{name, description}, ...]).
Quoted values handle colons; trusted defaults to false.

Bundled tier path is auto-detected from the binary install location;
project tier comes from git rev-parse; global is ~/.gstack/. All tier
paths are overridable for hermetic tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browser-skills): \$B skill list/show/run/test/rm subcommands

handleSkillCommand dispatches to per-subcommand handlers; spawnSkill is
the load-bearing function that:

  1. Mints a per-spawn scoped token (read+write only) bound to the
     skill name + spawn-id.
  2. Builds the spawn env:
     - trusted: passes process.env minus GSTACK_TOKEN (defense in depth).
     - untrusted: minimal allowlist (LANG, LC_ALL, TERM, TZ) + locked
       PATH; explicitly drops anything matching TOKEN/KEY/SECRET/etc.
       Also drops AWS_/AZURE_/GCP_/GOOGLE_APPLICATION_/ANTHROPIC_/OPENAI_/
       GITHUB_/GH_/SSH_/GPG_/NPM_TOKEN/PYPI_ patterns.
   3. Always injects GSTACK_PORT + GSTACK_SKILL_TOKEN last (cannot be
     overridden by parent env).
  4. Spawns bun run script.ts -- <args> with cwd=skillDir, captures
     stdout (1MB cap), stderr, and timeout-kills past the deadline.
  5. Revokes the token in finally{}, always.

list output prints the resolved tier inline so "why did it run that
one?" never becomes a debugging mystery (Codex finding #4 mitigation).

server.ts threads the listen port to meta-commands via MetaCommandOpts.daemonPort.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browser-skills): bundled hackernews-frontpage reference skill

Smallest interesting browser-skill: scrapes HN front page, returns
30 stories as JSON. No auth, stable HTML, fully fixture-tested.

Files:
  SKILL.md                          frontmatter + prose
  script.ts                         exports parseStoriesFromHtml(html)
                                    main: goto + html + parse + JSON.stringify
  _lib/browse-client.ts             vendored copy of the SDK
  fixtures/hn-2026-04-26.html       captured front page (5 stories)
  script.test.ts                    13 assertions against the fixture

The parser is a pure function over HTML so script.test.ts runs
without a daemon (just imports parseStoriesFromHtml and asserts).

This exercises every Phase 1 component end-to-end:
  - browse-client SDK (script imports browse from ./_lib/)
  - 3-tier lookup (hackernews-frontpage lives in the bundled tier)
  - scoped tokens (read+write is enough for goto + html)
  - spawn lifecycle (\$B skill run hackernews-frontpage)
  - file-fixture testing (\$B skill test hackernews-frontpage)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(skill-validation): cover bundled browser-skills

Adds 7 assertions per bundled skill at <root>/browser-skills/<name>/:
  - SKILL.md exists
  - frontmatter parses with required fields (name/host/triggers/args)
  - script.ts exists
  - _lib/browse-client.ts exists and matches the canonical SDK byte-for-byte
  - script.test.ts exists
  - script.ts imports browse from ./_lib/browse-client

The byte-identical SDK check enforces the version-pinning contract:
when the canonical SDK at browse/src/browse-client.ts changes, every
bundled skill's _lib/ copy must be re-synced or this test fails.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(designs): add BROWSER_SKILLS_V1 design doc

Captures the 13 locked decisions, two-axis trust model (daemon-side
scoped tokens + process-side env access), 3-tier lookup, file
layout, and full responses to all 8 Codex outside-voice findings.
Includes Phase 2-4 sketches for future branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(todos): replace self-authoring-\$B P1 with browser-skills phases

Phase 1 of the browser-skills design shipped on this branch (sidesteps
the in-daemon isolation problem the original P1 was blocked on). The
new entries enumerate the work that remains:

  P1: Phase 2 (/scrape + /automate skill templates)
  P2: Phase 3 (resolver injection at session start)
  P2: Phase 4 (eval infra + fixture staleness + OS sandbox)

Cross-references docs/designs/BROWSER_SKILLS_V1.md for the full
architecture and the 8 Codex review findings + responses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* release: v1.9.0.0 — browser-skills runtime

VERSION 1.8.0.0 → 1.9.0.0. CHANGELOG entry leads with what humans
can do today (hand-write deterministic browser scripts, run them in
200ms via \$B skill run). Notes explicitly that agent authoring
lands in next release; no fabricated perf numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browser-skills-e2e): exercise dispatch with bundled hackernews-frontpage

Covers the full \$B skill list/show/test pipeline against the real
bundled reference skill (defaultTierPaths picks up <repo>/browser-skills/).
Verifies frontmatter shape, the three-tier walk surfaces the bundled
entry, and \$B skill test successfully runs the bundled script.test.ts
in a child bun process.

\$B skill run end-to-end against the live network is intentionally NOT
covered here (would be flaky against news.ycombinator.com); the spawn
lifecycle is exercised in browser-skill-commands.test.ts using inline
synthetic skills.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: regen SKILL.md to surface the skill META command

bun run gen:skill-docs picked up the new \`skill\` command from
COMMAND_DESCRIPTIONS in browse/src/commands.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* release: bump v1.9.0.0 → v1.13.0.0

Main shipped through v1.11.1.0 while this branch was in flight; v1.12.x
is presumed claimed by another in-flight branch. Use v1.13.0.0 as the
next available slot.

Updated VERSION, package.json, and the CHANGELOG header. Entry body
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* release: bump v1.13.0.0 → v1.16.0.0

Main shipped v1.13.0.0 (claude outside-voice skill), v1.14.0.0
(sidebar REPL), and v1.15.0.0 (slim preamble + plan-mode E2E)
while this branch was in flight. Use v1.16.0.0 as the next
available slot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse-skills): atomic write helper for /skillify (D3)

stageSkill writes a candidate skill into ~/.gstack/.tmp/skillify-<spawnId>/
with restrictive perms. commitSkill does an atomic fs.renameSync into the
final tier path with realpath/lstat discipline (refuses symlinked staging
dirs, refuses to clobber existing skills). discardStaged is the cleanup
path for test failures and approval rejections, idempotent and bounded
to the per-spawn wrapper. validateSkillName enforces lowercase/digits/
dashes only, no path-escape characters.

Implements the D3 contract from the v1.19.0.0 plan review: never a
half-written skill on disk. Test fail or approval reject = rm -rf the
temp dir, no tombstone for never-approved skills.

Closes Codex finding #5 (atomic skill packaging) for Phase 2a.

34 unit assertions covering: stage validation, file-path escape rejection,
permission check, atomic rename, clobber refusal, symlink refusal, project
tier unresolved, idempotent discard, end-to-end happy + simulated test
failure + approval reject paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(scrape): /scrape <intent> skill template

One entry point for pulling page data. Three paths under the hood:

1. Match — agent reads $B skill list, semantically matches the user's
   intent against each skill's triggers + description + host. Confident
   match = $B skill run <name> in ~200ms.
2. Prototype — no match, drive the page with $B goto/text/html/links etc.
   Return JSON, append a one-line "say /skillify" nudge.
3. Mutating refusal — verbs like submit/click/fill route to /automate
   (Phase 2b P0); /scrape is read-only by contract.

Match decision lives in the agent, not the daemon. No new code in
browse/src/, no expanded daemon command surface, no new prompt-injection
blast radius.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(skillify): /skillify codifies last /scrape into permanent skill

The productivity multiplier. /scrape discovers the flow; /skillify writes
it as deterministic Playwright-via-browse-client code so the next /scrape
on the same intent runs in ~200ms.

11-step flow with three locked contracts from the v1.19.0.0 plan review:

D1 — Provenance guard. Walk back ≤10 agent turns for a clearly-bounded
/scrape result. Refuse with one specific message if cold. No silent
synthesis from chat fragments.

D2 — Synthesis input slice. Extract ONLY the final-attempt $B calls that
produced the JSON the user accepted, plus the user's intent string. Drop
failed selectors, drop unrelated chat, drop earlier-session content.
Closes Codex finding #6 by picking option (b) from the design doc:
re-prompt from agent's own context, not a structured recorder.

D3 — Atomic write. Stage to ~/.gstack/.tmp/skillify-<spawnId>/, run
$B skill test against the temp dir, only rename into the final tier path
on test pass + user approval. Test fail or approval reject = rm -rf the
temp dir entirely.

Default tier: global (~/.gstack/browser-skills/<name>/). --project flag
overrides to per-project. Generated test must include at least one ★★
assertion (parsed JSON has expected shape + non-empty key fields), not a
smoke ★ assertion.

Bun runtime distribution (Codex finding #7) carries over to Phase 4.
Documented in the skill's Limits section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browser-skills): gate-tier E2E for /scrape + /skillify (D4)

Five scenarios cover the productivity loop and the contracts locked
during the v1.19.0.0 plan review:

  scrape-match-path           — intent matching bundled hackernews-frontpage
                                routes via $B skill run, no prototype phase
  scrape-prototype-path       — no matching skill, drives $B against a local
                                file:// fixture, returns JSON, suggests
                                /skillify
  skillify-happy-path         — /scrape then /skillify; skill written to
                                ~/.gstack/browser-skills/<name>/ with the
                                full file tree; SKILL.md prose body must
                                not contain conversation fragments (D2)
  skillify-provenance-refusal — cold /skillify with no prior /scrape refuses
                                with the D1 message; nothing on disk (D1)
  skillify-approval-reject    — /scrape then /skillify but reject in the
                                approval gate; temp dir is removed, nothing
                                at the final tier path (D3)

All five gate-tier (~$0.50-$1.50 each, ~$5 total per CI run). Set EVALS=1
to enable. Uses local file:// fixtures so prototype + skillify scenarios
run deterministically without network.

Touchfiles registers all 5 entries with proper deps on scrape/**,
skillify/**, browse/src/browser-skill-write.ts, and the Phase 1 runtime
modules. The match-path test depends on the bundled hackernews-frontpage
skill so its touchfile includes browser-skills/hackernews-frontpage/**.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(browser-skills): TODOS Phase 2a + design doc D1-D4 decisions

TODOS.md:
- Narrows existing P1 (was "/scrape and /automate") to "/scrape and
  /skillify" — the /scrape + /skillify wedge ships in this branch.
  Codex finding #6 (synthesis) removed from Cons (resolved by D2);
  finding #7 (Bun runtime) stays as the open carry-over.
- Adds new ## P0 above PACING_UPDATES_V0 for the /automate follow-up.
  Same skillify pattern as /scrape, different trust profile (per-step
  confirmation gate when running non-codified). Reuses /skillify and
  the D3 helper as-is. Effort M.

BROWSER_SKILLS_V1.md:
- Phase table re-organized into 1, 2a, 2b, 3, 4. Phase 1 + Phase 2a
  consolidate into v1.19.0.0 ship (the v1.16.0.0 branch-internal
  bump never landed on main).
- New "Phase 2a" sub-section captures the four decisions locked
  during /plan-eng-review:
    D1 — provenance guard (≤10 turn walk-back, refuse if cold)
    D2 — synthesis input slice (final-attempt $B calls only,
         closes Codex finding #6)
    D3 — atomic write discipline (temp-dir-then-rename via new
         browse/src/browser-skill-write.ts helper)
    D4 — full test scope (5 gate E2E + 1 unit + smoke)
- New "Phase 2b" sketch for /automate: same skillify machinery,
  per-mutating-step confirmation gate, deferred to next branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* release: v1.16.0.0 -> v1.19.0.0 — browser-skills Phase 1 + 2a

Consolidates the v1.16.0.0 branch-internal bump (Phase 1 runtime, never
landed on main) with Phase 2a (/scrape + /skillify + atomic-write helper)
into one v1.19.0.0 ship per CLAUDE.md "Never orphan branch-internal
versions" rule.

Headline: Browser-skills land end-to-end. /scrape <intent> first call
drives the page; second call runs the codified script in 200ms.

The unified CHANGELOG entry covers:
- Phase 1 runtime: $B skill list/show/run/test/rm, scoped tokens,
  3-tier storage, bundled hackernews-frontpage reference.
- Phase 2a: /scrape + /skillify gstack skills, browser-skill-write.ts
  atomic helper, 5 gate-tier E2E + 34 unit assertions.

Numbers table updated: 5 new modules (+browser-skill-write), 2 new
gstack skills, 6 of 8 Codex outside-voice findings resolved (synthesis
#6 closed by D2; Bun runtime #7 + OS sandbox #1 stay deferred to Phase 4).

/automate (Phase 2b) is split out as P0 in TODOS for the next branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(commands): tighten descriptions for LLM-judge baseline pinning

The skill-llm-eval test "baseline score pinning" failed CI on three
retry attempts: judge gave command_reference.actionability=3, baseline
demands ≥4. Judge cited 8 specific gaps in COMMAND_DESCRIPTIONS.

This commit closes 7 of 8 by tightening the descriptions:

- press: documents that key names are case-sensitive Playwright keys,
  shows modifier syntax (Shift+Enter, Control+A), links the full key
  list. Removes the "is this case-sensitive?" guesswork.
- is: documents that <sel> accepts either a CSS selector OR an @ref
  token from a prior snapshot, and that property values are case-
  sensitive.
- scroll: documents that there is no --by/--to amount option, points
  at `js window.scrollTo(0, N)` for pixel-precise scrolling.
- js / eval: clarifies that both run in the same JS sandbox, the
  difference is just inline expr (js) vs file (eval).
- storage: clarifies sessionStorage is read-only via this command,
  points at `js sessionStorage.setItem(...)` for the write path.
- chain: walks through how to invoke (pipe a JSON array of arrays to
  $B chain), confirms it stops at the first error.
- cdp: explains how to discover allowed methods (read cdp-allowlist.ts)
  + shows a concrete example invocation.
- domain-skill: explains that the "classifier flag" is set automatically
  by the L4 prompt-injection scan (agents do not set it manually);
  enumerates the full lifecycle verbs.

The 8th gap (storage set syntax conflict) is also resolved as part of
the storage rewrite.

Two pipe-character bugs caught by the existing
`no command description contains pipe character` guard at
`test/gen-skill-docs.test.ts:595`: the chain example originally used
`echo '[...]' | $B chain` (literal pipe) and the cdp description used
`tab|browser` / `trusted|untrusted` (also literal pipes). Both rewritten
to keep markdown table cells intact.

Verification: 696/0 pass on skill-validation + gen-skill-docs after
regen across all hosts. The CI llm-judge eval will re-run against the
new SKILL.md and should hit actionability ≥4 reliably.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(browser): rewrite BROWSER.md as complete reference

Full rewrite covering the gstack browser surface as of v1.19.0.0. Up from
488 to 1,299 lines, 26 top-level sections.

Adds previously-undocumented subsystems:

- The productivity loop: /scrape + /skillify with D1 (provenance guard),
  D2 (final-attempt-only synthesis), D3 (atomic-write discipline) contracts.
- Browser-skills runtime: anatomy, three-tier storage, scoped tokens, trust
  model (capability + env axes), sibling SDK distribution, atomic-write
  helper, bundled hackernews-frontpage reference.
- Domain-skills: per-site agent notes with quarantined → active → global
  state machine and the L4-classifier auto-promotion gate.
- Pair-agent: dual-listener architecture, 26-command tunnel allowlist,
  canDispatchOverTunnel pure gate, three token types (root, setup key,
  scoped), denial log path + salt model.
- Security stack L1-L6: layer table, thresholds (BLOCK/WARN/LOG_ONLY/
  SOLO_CONTENT_BLOCK), ensemble rule, classifier model paths, env knobs.
- Side Panel deep dive: Terminal pane (Claude PTY) as the primary surface
  with Activity/Refs/Inspector as debug overlays, WS auth via
  Sec-WebSocket-Protocol, gstackInjectToTerminal cross-pane plumbing.
- CDP escape hatch: $B cdp deny-default allowlist, $B inspect CSS inspector,
  $B ux-audit page structure extraction.
- Meta commands previously undocumented: tabs/frames/state/watch/inbox/
  tab-each, with usage and storage paths.
- Authentication: three token types with lifetimes, SSE session cookie,
  PTY session cookie, token registry behavior.
- Full source map: 30+ file inventory of browse/src/ vs the old 11-file
  list.

Preserves from before: architecture diagram, daemon lifecycle, snapshot
ref staleness, screenshot modes, goto file:// vs load-html semantics,
batch endpoint, JS await wrapping, env vars, performance numbers vs MCP,
Playwright acknowledgments, dev guide.

Cross-links to ARCHITECTURE.md, CLAUDE.md, docs/REMOTE_BROWSER_ACCESS.md,
docs/designs/BROWSER_SKILLS_V1.md, scrape/SKILL.md, skillify/SKILL.md,
TODOS.md so anyone landing on BROWSER.md can navigate to the load-bearing
companion docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(server): tab-ownership gate keys on tabPolicy, not isWrite

Browser-skill spawns hit `403: Tab not owned by your agent` on every
first run because the gate at server.ts:639 fired for any non-root
write, regardless of the token's tabPolicy. The bundled
hackernews-frontpage reference skill failed identically. Every
/skillify-generated skill failed identically. The user's natural
tabs have no claimed owner — by design — so any skill driving
them via `goto` (a write) was 403'd.

The intent in skill-token.ts:79 was always correct: `tabPolicy: 'shared'`
with the comment "skill scripts may switch tabs as needed." The
enforcement just ignored it.

Two surgical changes:

browser-manager.ts:checkTabAccess — gate now keys on options.ownOnly
only. Shared-policy tokens (skill spawns, default scoped clients) get
permissive access — root-equivalent for the tab gate. Own-only tokens
(pair-agent over the ngrok tunnel) still require ownership for every
read and write. isWrite stays in the signature for callers that want
to log or branch elsewhere; it no longer gates the decision.

server.ts:639 — gate predicate narrowed from
  (WRITE_COMMANDS.has(command) || tokenInfo.tabPolicy === 'own-only')
to just
  tokenInfo.tabPolicy === 'own-only'
The 'newtab' exemption stays. Shared tokens skip the gate entirely;
own-only tokens still hit it. Comment block above the gate updated to
document the new predicate intent.

Pair-agent isolation is intact. Tunnel tokens still default to
tabPolicy: 'own-only', still must `newtab` first to get a tab they
can drive, still can't dispatch any of the 23 commands outside the
tunnel allowlist.

The capability gate (scope checks) and rate limits already constrain
what local scoped clients can do; tab ownership was never a security
boundary for them — only for pair-agent. This release makes the
enforcement match the original design intent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(server): lock the shared-vs-own-only tab gate contract

The pre-fix tests at tab-isolation.test.ts:43,57 encoded the broken
behavior as the contract — they specifically asserted "scoped agent
cannot write to unowned tab," which was the exact failure mode that
broke browser-skills. They passed because they tested the wrong
invariant.

This commit replaces those tests with explicit shared-vs-own-only
coverage that documents what each policy actually means:

- Shared scoped agents (skill spawns, default scoped clients) can
  read AND write any tab — unowned, their own, or another agent's.
  The capability is gated by scope checks + rate limits, not by tab
  ownership.
- Own-only scoped agents (pair-agent over tunnel) cannot read OR
  write any tab they don't own. Pre-fix this case was conflated with
  shared writes; now it's explicit.

9 unit assertions on checkTabAccess, up from 6. Each test names
the policy axis it's covering so a future refactor can't quietly
flip the contract.

Adds source-shape regression test 10a in server-auth.test.ts:
"tab gate predicate is own-only-scoped, not write-scoped." The
gate's `if (...)` line MUST contain `tabPolicy === 'own-only'` and
MUST NOT contain `WRITE_COMMANDS.has(command) ||`. If a future
refactor re-introduces the write-scoped gate, this fails immediately
in free-tier `bun test`.

Updates the marker for the existing newtab-excluded test to match
the new comment block ("Tab ownership check (own-only tokens /
pair-agent isolation)").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* release: v1.19.0.0 -> v1.20.0.0 — fix tab-ownership footgun

Patch release on top of v1.19.0.0. The shipping headline of v1.19.0.0
(/scrape + /skillify productivity loop) was broken on first run in any
session where the daemon already had a tab. Bundled
hackernews-frontpage failed identically. Every /skillify-generated
skill failed identically.

The fix narrows the tab-ownership gate from "any non-root write" to
"tabPolicy === 'own-only' only." Pair-agent isolation (the v1.6.0.0
threat model) is intact; local skill spawns get their original
behavior back.

VERSION: 1.19.0.0 -> 1.20.0.0
package.json version: synced.

CHANGELOG entry leads with the user-visible impact: the productivity
loop works again, no half-second-stalls of confused 403s. Includes
before/after metrics on the bundled reference skill and the broken-
contract pre-fix tests that hid the regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(claude): sharpen CHANGELOG rule — diff between main and ship

Codifies what was already implicit in the existing "Never orphan
branch-internal versions" + "Only document what shipped between main
and this change" sections, but with sharper language and concrete
NEVER examples.

The rule: a CHANGELOG entry is the diff between main and the shipping
branch — what users get when they upgrade. NOT how the branch got
there. Branch-internal version bumps, mid-branch bug fixes, plan
review outcomes, and patch narratives all belong in PR descriptions
and commit messages, not in CHANGELOG.

Adds explicit examples of phrasing to NEVER use:
  - "v1.X had a bug that v1.Y fixes" (mentions a branch-internal version)
  - "The shipping headline of v1.X was broken because..." (apologizes
    for never-released state)
  - "Pre-fix tests encoded the broken behavior" (contributor's victory
    lap, not user benefit)
  - "Two surgical edits, both in the dispatch path" (micro-narrative
    of the patch)

The constructive replacement: describe the released system as a
property, not as a fix. "Browser-skills run end-to-end with the
expected tab-access semantics." If a property is worth calling out,
document it in the trust-model section, not as a "we fixed X" callout.

Pairs with feedback_no_shame_changelog and
feedback_changelog_harden_against_critics memories — entries should
read as a flex even to a hostile screenshotter, never admit prior
breakage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(changelog): consolidate v1.20.0.0 as the diff vs main

Rewrites the v1.20.0.0 entry to describe what users get when they
upgrade from main (v1.17.0.0) to this release: browser-skills
end-to-end. Drops all branch-internal narrative — Phase 1 / Phase 2a
labels, the v1.8.0.0 P1 history paragraph, the test-counts-by-phase
split, and the patch micro-narrative for the tab-policy semantics.

The previously-separate v1.19.0.0 entry (a branch-internal version
that never landed on main) collapses into v1.20.0.0 per the
"Never orphan branch-internal versions" rule.

Tab-access policies are now documented as a property of the trust
model: `'shared'` (skill spawns) is permissive, `'own-only'`
(pair-agent over the tunnel) is strict. No "fix" framing, no
mention of an intermediate state where it was broken.

Adds the BROWSER.md rewrite and the new tab-isolation +
server-auth source-shape regression tests to the itemized changes.

The reverse-chronological order remains: v1.20.0.0 → v1.17.0.0 →
v1.16.0.0 → v1.15.0.0 → ... Gaps (v1.18, v1.19) are fine — those
were branch-internal version numbers that never landed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
gonnabe88 pushed a commit to gonnabe88/gstack that referenced this pull request May 9, 2026
…rrytan#1233)

* feat(gbrain-sync): queue primitives + writer shims

Adds bin/gstack-brain-enqueue (atomic append to sync queue) and
bin/gstack-jsonl-merge (git merge driver, ts-sort with SHA-256 fallback).
Wires one backgrounded enqueue call into learnings-log, timeline-log,
review-log, and developer-profile --migrate. question-log and
question-preferences stay local per Codex v2 decision.

gstack-config gains gbrain_sync_mode (off/artifacts-only/full) and
gbrain_sync_mode_prompted keys, plus GSTACK_HOME env alignment so
tests don't leak into real ~/.gstack/config.yaml.

* feat(gbrain-sync): --once drain + secret scan + push

bin/gstack-brain-sync is the core sync binary. Subcommands: --once
(drain queue, allowlist-filter, privacy-class-filter, secret-scan
staged diff, commit with template, push with fetch+merge retry),
--status, --skip-file <path>, --drop-queue --yes, --discover-new
(cursor-based detection of artifact writes that skip the shim).

Secret regex families: AWS keys, GitHub tokens (ghp_/gho_/ghu_/ghs_/
ghr_/github_pat_), OpenAI sk-, PEM blocks, JWTs, bearer-token-in-JSON.
On hit: unstage, preserve queue, print remediation hint (--skip-file
or edit), exit clean. No daemon — invoked by preamble at skill
boundaries.

* feat(gbrain-sync): init, restore, uninstall, consumer registry

bin/gstack-brain-init: idempotent first-run. git init ~/.gstack/,
.gitignore=*, canonical .brain-allowlist + .brain-privacy-map.json,
pre-commit secret-scan hook (defense-in-depth), merge driver registration
via git config, gh repo create --private OR arbitrary --remote <url>,
initial push, ~/.gstack-brain-remote.txt for new-machine discovery,
GBrain consumer registration via HTTP POST.

bin/gstack-brain-restore: safe new-machine bootstrap. Refuses clobber
of existing allowlisted files, clones to staging, rsync-copies tracked
files, re-registers merge drivers (required — not cloned from remote),
rehydrates consumers.json, prompts for per-consumer tokens.

bin/gstack-brain-uninstall: clean off-ramp. Removes .git + .brain-*
files + consumers.json + config keys. Preserves user data (learnings,
plans, retros, profile). Optional --delete-remote for GitHub repos.

bin/gstack-brain-consumer + bin/gstack-brain-reader (symlink alias):
registry management. Internal 'consumer' term; user-facing 'reader'
per DX review decision.

* feat(gbrain-sync): preamble block — privacy gate + boundary sync

scripts/resolvers/preamble/generate-brain-sync-block.ts emits bash that
runs at every skill invocation:
- Detects ~/.gstack-brain-remote.txt on machines without local .git
  and surfaces a restore-available hint (does NOT auto-run restore).
- Runs gstack-brain-sync --once at skill start to drain any pending
  writes (and at skill end via prose instruction).
- Once-per-day auto-pull (cached via .brain-last-pull) for append-only
  JSONL files.
- Emits BRAIN_SYNC: status line every skill run.

Also emits prose for the host LLM to fire the one-time privacy
stop-gate (full / artifacts-only / off) when gbrain is detected and
gbrain_sync_mode_prompted is false. Wired into preamble.ts composition.

* test(gbrain-sync): 27-test consolidated suite

test/brain-sync.test.ts covers:
- Config: validation, defaults, GSTACK_HOME env isolation
- Enqueue: no-op gates, skip list, concurrent atomicity, JSON escape
- JSONL merge driver: 3-way + ts-sort + SHA-256 fallback
- Init + sync: canonical file creation, merge driver registration,
  push-reject + fetch+merge retry path
- Init refuses different remote (idempotency)
- Cross-machine restore round-trip (machine A write → machine B sees)
- Secret scan across all 6 regex families (AWS, GH, OpenAI, PEM, JWT,
  bearer-JSON). --skip-file unblock remediation
- Uninstall removes sync config, preserves user data
- --discover-new idempotence via mtime+size cursor

Behaviors verified via integration smokes during implementation. Known
follow-up: bun-test 5s default timeout needs 30s wrapper for
spawnSync-heavy tests.

* docs(gbrain-sync): user guide + error lookup + README section

docs/gbrain-sync.md: setup walkthrough, privacy modes, cross-machine
workflow, secret protection, two-machine conflict handling, uninstall,
troubleshooting reference.

docs/gbrain-sync-errors.md: problem/cause/fix index for every
user-visible error. Patterned on Rust's error docs + Stripe's API
error reference.

README.md: 'Cross-machine memory with GBrain sync' section near the
top (discovery moment), plus docs-table entry.

* chore: bump version and changelog (v1.7.0.0)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore: regenerate SKILL.md files for gbrain-sync preamble block

Re-runs bun run gen:skill-docs after adding generateBrainSyncBlock
to scripts/resolvers/preamble.ts in a2aa8a0. CI check-freshness
caught the drift. All 36 SKILL.md files regenerated with the new
skill-start bash block + privacy-gate prose + skill-end sync
instructions baked in.

* fix(test): session-awareness reads AskUserQuestion Format from a Tier 2+ SKILL.md

The test was reading ROOT/SKILL.md (browse skill, Tier 1) which never
contained '## AskUserQuestion Format' — that section is only emitted
for Tier 2+ skills by scripts/resolvers/preamble.ts. As a result the
agent was prompted with an empty format guide and only emitted
'RECOMMENDATION' intermittently, making the test flaky.

Pre-existing on main (same ROOT/SKILL.md shape there) — surfaced now
because the agent run didn't hit the RECOMMENDATION/recommend/option a
fallback strings in this particular attempt.

Fix: read from office-hours/SKILL.md (Tier 3, always has the section)
with a fallback that scans for the first top-level skill dir whose
SKILL.md contains the header. Future template moves won't break this
test again.

* feat(browse): domain-skills storage + state machine

New module browse/src/domain-skills.ts implements the per-site notes
the agent writes for itself, persisted as type:"domain" rows alongside
/learn's per-project learnings.

Three scopes layered: per-project default, global by explicit promotion.
Project-active shadows global for the same host.

State machine (T6 — codex outside-voice):
  quarantined --3 uses w/o flag--> active(project) --promote--> global
        ^                                |
        +----- classifier flag during use

- Append-only JSONL with O_APPEND for atomic small writes
- Tolerant parser drops partial trailing line on read
- Tombstone for deletes (compactor cleans up later)
- Version log per (host, scope) enables rollback
- Hostname derived from active tab top-level origin (T3 confused-deputy fix)
- writeSkill rejects classifier_score >= 0.85 with structured error

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browse): domain-skills storage + state machine

14 tests covering:
- T3 hostname normalization (lowercase, www. strip, port/path/query strip,
  subdomain-exact preserved)
- T4 scope shadowing (per-project active shadows global for same host)
- T5 persistence (version monotonicity, tolerant parser drops partial line)
- T6 state machine (quarantined → active after N=3 uses, classifier-flag
  blocks promotion, save-time score >= 0.85 rejected)
- Rollback by version log (restore prior body, advance version counter)
- Tombstone deletion (read returns null after delete)

All 14 pass in 27ms via bun test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): $B domain-skill subcommands

Wire the domain-skills storage layer into the browse CLI as a META command:

  $B domain-skill save              save body from stdin or --from-file
                                     (host derived from active tab — T3)
  $B domain-skill list              list all skills visible to current project
  $B domain-skill show <host>       print skill body
  $B domain-skill edit <host>       open in $EDITOR
  $B domain-skill promote-to-global <host>  cross-project promotion (T4)
  $B domain-skill rollback <host> [--global]  restore prior version
  $B domain-skill rm <host> [--global]        tombstone

Save path runs L1-L3 content filters from content-security.ts (importable
in compiled binary, unlike L4 ML classifier — see CLAUDE.md). The L4
classifier scan happens in sidebar-agent at prompt-injection load time.

Output is structured (problem + cause + suggested-action) per DX D7.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): $B cdp escape hatch — deny-default allowlist + two-tier mutex

Codex T2: flip CDP posture to deny-default. Allowed methods enumerated in
cdp-allowlist.ts with (scope: tab|browser, output: trusted|untrusted,
justification) per entry.

Initial allowlist (~25 methods) covers:
- Accessibility tree extraction (read-only)
- DOM/CSS inspection (read-only)
- Performance metrics
- Tracing
- Emulation viewport/UA override
- Page screenshot/PDF capture (output is binary, no marker injection vector)
- Network.enable/disable (no bodies/cookies — those are exfil surfaces)
- Runtime.getProperties (NO evaluate/callFunctionOn — those would be RCE)

Page.navigate is INTENTIONALLY NOT allowed; agents use $B goto which
goes through the URL blocklist.

Codex T7: two-tier mutex. tab-scoped methods take per-tab lock; browser-
scoped take global lock that blocks all tab locks. 5s acquire timeout
yields CDPMutexAcquireTimeout (no silent hangs). All lock acquires use
try/finally so errors don't leak the lock.

Path A from spike: uses Playwright's newCDPSession() per page. No second
WebSocket, no need for --remote-debugging-port. CDPSession is cached
per page in a WeakMap and cleared on page close.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browse): CDP allowlist + two-tier mutex

13 tests:
- Allowlist linter: every entry has 4 required fields, no duplicates,
  justification length > 20 chars
- Deny-list verification: dangerous methods (Runtime.evaluate, Page.navigate,
  Network.getResponseBody, Browser.close, Target.attachToTarget, etc.) are
  NOT allowed (Codex T2 categories 4-7)
- Per-tab mutex serializes ops on same tab
- Per-tab mutex allows parallel ops across different tabs
- Global lock blocks tab locks; tab locks block global lock
- Acquire timeout yields CDPMutexAcquireTimeout (no silent hang)
- Timeout error names the tab id and the timeout budget

Also extends Network.disable justification to satisfy linter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): telemetry signals + project-slug helper

Lightweight telemetry per DX D9: piggybacks on ~/.gstack/analytics/ pattern.
Hostname + aggregate counters only, no body content. GSTACK_TELEMETRY_OFF=1
silences. Fire-and-forget — never blocks calling path.

Signals fired so far:
- domain_skill_saved {host, scope, state, bytes}
- domain_skill_save_blocked {host, reason}

(domain_skill_fired and cdp_method_* fired in subsequent commits.)

Also extracts project-slug resolution into project-slug.ts so server.ts
and domain-skill-commands.ts share one cached lookup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): sidebar prompt-context injection + CDP telemetry

server.ts spawnClaude now:
- Imports per-project domain skill matching the active tab's hostname
  via readDomainSkill()
- Wraps the body in UNTRUSTED EXTERNAL CONTENT envelope (so the L4
  classifier in sidebar-agent sees it at load time per Eng D4)
- Appends as <domain-skill source="..." host="..." version="..."> block
- Fires domain_skill_fired telemetry (host, source, version)
- Calls recordSkillUse fire-and-forget so the auto-promote-after-N=3
  state machine advances on each successful prompt injection

System prompt also gets a one-liner introducing $B domain-skill commands
to agents (DX D4 start-of-task discoverability hint).

cdp-bridge.ts fires:
- cdp_method_denied (drives next allow-list growth)
- cdp_method_lock_acquire_ms (P50/P99 quantile observability)
- cdp_method_called (allowed methods)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browse): telemetry module

3 tests covering:
- logTelemetry writes JSONL with ts injected
- GSTACK_TELEMETRY_OFF=1 silences all events
- logTelemetry never throws on disk failures

Uses GSTACK_HOME env var to redirect writes to a tmp dir; the telemetry
module reads HOME lazily so test mutations take effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: domain-skills reference + error lookup table

docs/domain-skills.md mirrors the layered shape of docs/gbrain-sync.md
(DX D8): how agents use it, state machine, storage layout, security model
(L1-L3 + L4 layered defense), error reference table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(readme): browser-harness-js plug + domain-skills section

New "Domain skills + raw CDP escape hatch" section under "The sprint"
covering both v1.8.0.0 features. Plugs browser-use/browser-harness-js
as the no-rails alternative for users who want raw CDP without gstack's
security stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.8.0.0)

Branch-scoped bump on top of merged 1.7.0.0 base. CHANGELOG entry covers
the full v1.8.0.0 scope: $B domain-skill, $B cdp escape hatch, two-tier
mutex, telemetry signals, sidebar prompt-context injection. Includes
Codex outside-voice trail (7 of 20 findings resolved, 12 mooted by T1
scope drop).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* todos: 7 follow-ups from v1.8.0.0 review trail

P1: Self-authoring $B commands with out-of-process worker isolation
    (Codex T1 deferred from v1.8.0.0 — needs real isolation design)
P2: Migrate /learn to SQLite (Codex T5 long-term primitive fix)
P2: Remove plan-mode handshake from /plan-devex-review (skill bug)
P3: GBrain skillpack publishing for domain-skills
P3: Replay/record demonstrated flows to domain-skills
P3: $B commands review batch-mode UX (alternative to inline approval)
P3: Heuristic command-gap watcher (DX D4 alternative C)

Each entry has the standard What/Why/Pros/Cons/Context/Effort/Priority/
Depends-on shape so anyone picking these up later has full context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(browse): lazy GSTACK_HOME resolution in domain-skills

Module-level constants (GLOBAL_FILE, derived path) were evaluated at
module-load and cached. When E2E and unit tests run in the same Bun
test pass and set GSTACK_HOME differently, the second test sees the
first test's path. Switch to lazy gstackHome() / globalFile() / projectFile()
helpers so process.env mutations take effect.

Mirrors the pattern already used in telemetry.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browse): E2E gate-tier tests for domain-skills + CDP

domain-skills-e2e.test.ts (4 tests):
- save derives host from active tab top-level origin (T3)
- save lands quarantined; list surfaces it
- readSkill returns null until 3 uses without flag promote to active (T6)
- save without an active page errors with structured guidance

cdp-e2e.test.ts (8 tests):
- Accessibility.getFullAXTree returns wrapped JSON (allowed, untrusted-output)
- Performance.getMetrics returns plain JSON (allowed, trusted-output)
- Runtime.evaluate DENIED with structured guidance (T2 RCE block)
- Page.navigate DENIED (must use $B goto for blocklist routing)
- Network.getResponseBody DENIED (exfil block)
- malformed JSON params surfaces clear error
- non Domain.method format surfaces clear error
- $B cdp help returns help text

Both files boot a real Chromium via BrowserManager.launch() and exercise
the dispatch handlers end-to-end. Total 12 E2E tests in <2s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: regenerate SKILL.md files with new $B commands

bun run gen:skill-docs picks up the domain-skill and cdp META_COMMANDS
entries added in commands.ts. Both top-level SKILL.md and browse/SKILL.md
now list the new commands in their Meta and Inspection tables.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(fixtures): regenerate ship SKILL.md golden baselines for v1.7.0.0

Pre-existing failures inherited from garrytan/gbrain-support: the GBrain
Sync preamble block (added in v1.7.0.0) appears in regenerated SKILL.md
output but the golden baselines in test/fixtures/golden/ were never
updated. Three failures fixed:

  golden-file regression > Claude ship skill matches golden baseline
  golden-file regression > Codex ship skill matches golden baseline
  golden-file regression > Factory ship skill matches golden baseline

Goldens regenerated by copying the current ship/SKILL.md, codex
.agents/skills/gstack-ship/SKILL.md, and .factory/skills/gstack-ship/SKILL.md
files. Diff is the v1.7.0.0 GBrain Sync preamble block + privacy stop-gate
(no behavioral changes — just preamble text).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(brain-sync): bearer-token regex catches values with leading space

Pre-existing bug from v1.7.0.0: the bearer-token-json secret pattern
required values matching [A-Za-z0-9_./+=-]{16,}, which rejected the
"Bearer <token>" form because the literal space after "Bearer" wasn't
in the character class. Real Authorization headers use "Bearer <token>"
syntax, and the test fixture
  '"authorization":"Bearer abcdef1234567890abcdef1234567890"'
sat unscanned despite being a leak-class secret.

One-character fix: add space to the value character class. Test
'gstack-brain-sync secret scan > blocks bearer-json' now passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(brain-sync): GSTACK_HOME isolation test compares mtime, not content

Pre-existing flaky test: the GSTACK_HOME-overrides-real-config test asserted
the real ~/.gstack/config.yaml does NOT contain "gbrain_sync_mode: full"
after the test. That fails for any user whose real config legitimately has
that key set from prior usage — the test's invariant is "the command did
not modify the real file," not "the real file lacks any specific value."

Switch to mtime + content snapshot: capture both BEFORE running the command,
then verify both are unchanged after. Also add a positive assertion that
the tmpHome config DID get the new key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(skill-validation): exempt deliberate large fixtures from 2MB limit

Pre-existing failure: the "git tracks no files larger than 2MB" test
caught browse/test/fixtures/security-bench-haiku-responses.json (28.8MB
of replay data committed in v1.6.4.0 for security benchmark gate tests).

The test exists to catch accidentally-committed binaries (Mach-O dist
binaries, etc), not to forbid all large files. Add an explicit
LARGE_FIXTURE_EXEMPTIONS allowlist so deliberate replay fixtures pass
the gate while accidental binaries still fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(skill-token): mint scoped tokens per skill spawn

Wraps token-registry.createToken/revokeToken with skill-specific
clientId encoding (skill:<name>:<spawn-id>) and read+write defaults.
Skill scripts get a per-spawn capability token bound to browser-driving
commands; the daemon root token never leaves the harness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse-client): SDK for browser-skill scripts

Thin wrapper over POST /command with bearer auth. Resolves daemon
port + token from GSTACK_PORT + GSTACK_SKILL_TOKEN env vars first
(set by $B skill run when spawning), falls back to .gstack/browse.json
for standalone debug runs.

Convenience methods cover the read+write surface skills typically need:
goto, click, fill, text, html, snapshot, links, forms, accessibility,
attrs, media, data, scroll, press, type, select, wait, hover, screenshot.
Low-level command(cmd, args) escape hatch for anything else.

This is the canonical SDK source. Each browser-skill ships a sibling
copy at <skill>/_lib/browse-client.ts so each skill is fully portable
and version-pinned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browser-skills): 3-tier storage helpers

listBrowserSkills() walks project > global > bundled (first-wins),
parses SKILL.md frontmatter, no INDEX.json. readBrowserSkill() does
the same for a single name. tombstoneBrowserSkill() moves a skill
into .tombstones/<name>-<ts>/ for recoverability.

Frontmatter parser handles the subset browser-skills need: scalars
(host, description, trusted, version, source), string lists
(triggers), and arg-mapping lists ([{name, description}, ...]).
Quoted values handle colons; trusted defaults to false.

Bundled tier path is auto-detected from the binary install location;
project tier comes from git rev-parse; global is ~/.gstack/. All tier
paths are overridable for hermetic tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browser-skills): \$B skill list/show/run/test/rm subcommands

handleSkillCommand dispatches to per-subcommand handlers; spawnSkill is
the load-bearing function that:

  1. Mints a per-spawn scoped token (read+write only) bound to the
     skill name + spawn-id.
  2. Builds the spawn env:
     - trusted: passes process.env minus GSTACK_TOKEN (defense in depth).
     - untrusted: minimal allowlist (LANG, LC_ALL, TERM, TZ) + locked
       PATH; explicitly drops anything matching TOKEN/KEY/SECRET/etc.
       Also drops AWS_/AZURE_/GCP_/GOOGLE_APPLICATION_/ANTHROPIC_/OPENAI_/
       GITHUB_/GH_/SSH_/GPG_/NPM_TOKEN/PYPI_ patterns.
   3. Always injects GSTACK_PORT + GSTACK_SKILL_TOKEN last (cannot be
     overridden by parent env).
  4. Spawns bun run script.ts -- <args> with cwd=skillDir, captures
     stdout (1MB cap), stderr, and timeout-kills past the deadline.
  5. Revokes the token in finally{}, always.

list output prints the resolved tier inline so "why did it run that
one?" never becomes a debugging mystery (Codex finding garrytan#4 mitigation).

server.ts threads the listen port to meta-commands via MetaCommandOpts.daemonPort.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browser-skills): bundled hackernews-frontpage reference skill

Smallest interesting browser-skill: scrapes HN front page, returns
30 stories as JSON. No auth, stable HTML, fully fixture-tested.

Files:
  SKILL.md                          frontmatter + prose
  script.ts                         exports parseStoriesFromHtml(html)
                                    main: goto + html + parse + JSON.stringify
  _lib/browse-client.ts             vendored copy of the SDK
  fixtures/hn-2026-04-26.html       captured front page (5 stories)
  script.test.ts                    13 assertions against the fixture

The parser is a pure function over HTML so script.test.ts runs
without a daemon (just imports parseStoriesFromHtml and asserts).

This exercises every Phase 1 component end-to-end:
  - browse-client SDK (script imports browse from ./_lib/)
  - 3-tier lookup (hackernews-frontpage lives in the bundled tier)
  - scoped tokens (read+write is enough for goto + html)
  - spawn lifecycle (\$B skill run hackernews-frontpage)
  - file-fixture testing (\$B skill test hackernews-frontpage)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(skill-validation): cover bundled browser-skills

Adds 7 assertions per bundled skill at <root>/browser-skills/<name>/:
  - SKILL.md exists
  - frontmatter parses with required fields (name/host/triggers/args)
  - script.ts exists
  - _lib/browse-client.ts exists and matches the canonical SDK byte-for-byte
  - script.test.ts exists
  - script.ts imports browse from ./_lib/browse-client

The byte-identical SDK check enforces the version-pinning contract:
when the canonical SDK at browse/src/browse-client.ts changes, every
bundled skill's _lib/ copy must be re-synced or this test fails.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(designs): add BROWSER_SKILLS_V1 design doc

Captures the 13 locked decisions, two-axis trust model (daemon-side
scoped tokens + process-side env access), 3-tier lookup, file
layout, and full responses to all 8 Codex outside-voice findings.
Includes Phase 2-4 sketches for future branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(todos): replace self-authoring-\$B P1 with browser-skills phases

Phase 1 of the browser-skills design shipped on this branch (sidesteps
the in-daemon isolation problem the original P1 was blocked on). The
new entries enumerate the work that remains:

  P1: Phase 2 (/scrape + /automate skill templates)
  P2: Phase 3 (resolver injection at session start)
  P2: Phase 4 (eval infra + fixture staleness + OS sandbox)

Cross-references docs/designs/BROWSER_SKILLS_V1.md for the full
architecture and the 8 Codex review findings + responses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* release: v1.9.0.0 — browser-skills runtime

VERSION 1.8.0.0 → 1.9.0.0. CHANGELOG entry leads with what humans
can do today (hand-write deterministic browser scripts, run them in
200ms via \$B skill run). Notes explicitly that agent authoring
lands in next release; no fabricated perf numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browser-skills-e2e): exercise dispatch with bundled hackernews-frontpage

Covers the full \$B skill list/show/test pipeline against the real
bundled reference skill (defaultTierPaths picks up <repo>/browser-skills/).
Verifies frontmatter shape, the three-tier walk surfaces the bundled
entry, and \$B skill test successfully runs the bundled script.test.ts
in a child bun process.

\$B skill run end-to-end against the live network is intentionally NOT
covered here (would be flaky against news.ycombinator.com); the spawn
lifecycle is exercised in browser-skill-commands.test.ts using inline
synthetic skills.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: regen SKILL.md to surface the skill META command

bun run gen:skill-docs picked up the new \`skill\` command from
COMMAND_DESCRIPTIONS in browse/src/commands.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* release: bump v1.9.0.0 → v1.13.0.0

Main shipped through v1.11.1.0 while this branch was in flight; v1.12.x
is presumed claimed by another in-flight branch. Use v1.13.0.0 as the
next available slot.

Updated VERSION, package.json, and the CHANGELOG header. Entry body
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* release: bump v1.13.0.0 → v1.16.0.0

Main shipped v1.13.0.0 (claude outside-voice skill), v1.14.0.0
(sidebar REPL), and v1.15.0.0 (slim preamble + plan-mode E2E)
while this branch was in flight. Use v1.16.0.0 as the next
available slot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse-skills): atomic write helper for /skillify (D3)

stageSkill writes a candidate skill into ~/.gstack/.tmp/skillify-<spawnId>/
with restrictive perms. commitSkill does an atomic fs.renameSync into the
final tier path with realpath/lstat discipline (refuses symlinked staging
dirs, refuses to clobber existing skills). discardStaged is the cleanup
path for test failures and approval rejections, idempotent and bounded
to the per-spawn wrapper. validateSkillName enforces lowercase/digits/
dashes only, no path-escape characters.

Implements the D3 contract from the v1.19.0.0 plan review: never a
half-written skill on disk. Test fail or approval reject = rm -rf the
temp dir, no tombstone for never-approved skills.

Closes Codex finding garrytan#5 (atomic skill packaging) for Phase 2a.

34 unit assertions covering: stage validation, file-path escape rejection,
permission check, atomic rename, clobber refusal, symlink refusal, project
tier unresolved, idempotent discard, end-to-end happy + simulated test
failure + approval reject paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(scrape): /scrape <intent> skill template

One entry point for pulling page data. Three paths under the hood:

1. Match — agent reads $B skill list, semantically matches the user's
   intent against each skill's triggers + description + host. Confident
   match = $B skill run <name> in ~200ms.
2. Prototype — no match, drive the page with $B goto/text/html/links etc.
   Return JSON, append a one-line "say /skillify" nudge.
3. Mutating refusal — verbs like submit/click/fill route to /automate
   (Phase 2b P0); /scrape is read-only by contract.

Match decision lives in the agent, not the daemon. No new code in
browse/src/, no expanded daemon command surface, no new prompt-injection
blast radius.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(skillify): /skillify codifies last /scrape into permanent skill

The productivity multiplier. /scrape discovers the flow; /skillify writes
it as deterministic Playwright-via-browse-client code so the next /scrape
on the same intent runs in ~200ms.

11-step flow with three locked contracts from the v1.19.0.0 plan review:

D1 — Provenance guard. Walk back ≤10 agent turns for a clearly-bounded
/scrape result. Refuse with one specific message if cold. No silent
synthesis from chat fragments.

D2 — Synthesis input slice. Extract ONLY the final-attempt $B calls that
produced the JSON the user accepted, plus the user's intent string. Drop
failed selectors, drop unrelated chat, drop earlier-session content.
Closes Codex finding garrytan#6 by picking option (b) from the design doc:
re-prompt from agent's own context, not a structured recorder.

D3 — Atomic write. Stage to ~/.gstack/.tmp/skillify-<spawnId>/, run
$B skill test against the temp dir, only rename into the final tier path
on test pass + user approval. Test fail or approval reject = rm -rf the
temp dir entirely.

Default tier: global (~/.gstack/browser-skills/<name>/). --project flag
overrides to per-project. Generated test must include at least one ★★
assertion (parsed JSON has expected shape + non-empty key fields), not a
smoke ★ assertion.

Bun runtime distribution (Codex finding garrytan#7) carries over to Phase 4.
Documented in the skill's Limits section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browser-skills): gate-tier E2E for /scrape + /skillify (D4)

Five scenarios cover the productivity loop and the contracts locked
during the v1.19.0.0 plan review:

  scrape-match-path           — intent matching bundled hackernews-frontpage
                                routes via $B skill run, no prototype phase
  scrape-prototype-path       — no matching skill, drives $B against a local
                                file:// fixture, returns JSON, suggests
                                /skillify
  skillify-happy-path         — /scrape then /skillify; skill written to
                                ~/.gstack/browser-skills/<name>/ with the
                                full file tree; SKILL.md prose body must
                                not contain conversation fragments (D2)
  skillify-provenance-refusal — cold /skillify with no prior /scrape refuses
                                with the D1 message; nothing on disk (D1)
  skillify-approval-reject    — /scrape then /skillify but reject in the
                                approval gate; temp dir is removed, nothing
                                at the final tier path (D3)

All five gate-tier (~$0.50-$1.50 each, ~$5 total per CI run). Set EVALS=1
to enable. Uses local file:// fixtures so prototype + skillify scenarios
run deterministically without network.

Touchfiles registers all 5 entries with proper deps on scrape/**,
skillify/**, browse/src/browser-skill-write.ts, and the Phase 1 runtime
modules. The match-path test depends on the bundled hackernews-frontpage
skill so its touchfile includes browser-skills/hackernews-frontpage/**.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(browser-skills): TODOS Phase 2a + design doc D1-D4 decisions

TODOS.md:
- Narrows existing P1 (was "/scrape and /automate") to "/scrape and
  /skillify" — the /scrape + /skillify wedge ships in this branch.
  Codex finding garrytan#6 (synthesis) removed from Cons (resolved by D2);
  finding garrytan#7 (Bun runtime) stays as the open carry-over.
- Adds new ## P0 above PACING_UPDATES_V0 for the /automate follow-up.
  Same skillify pattern as /scrape, different trust profile (per-step
  confirmation gate when running non-codified). Reuses /skillify and
  the D3 helper as-is. Effort M.

BROWSER_SKILLS_V1.md:
- Phase table re-organized into 1, 2a, 2b, 3, 4. Phase 1 + Phase 2a
  consolidate into v1.19.0.0 ship (the v1.16.0.0 branch-internal
  bump never landed on main).
- New "Phase 2a" sub-section captures the four decisions locked
  during /plan-eng-review:
    D1 — provenance guard (≤10 turn walk-back, refuse if cold)
    D2 — synthesis input slice (final-attempt $B calls only,
         closes Codex finding garrytan#6)
    D3 — atomic write discipline (temp-dir-then-rename via new
         browse/src/browser-skill-write.ts helper)
    D4 — full test scope (5 gate E2E + 1 unit + smoke)
- New "Phase 2b" sketch for /automate: same skillify machinery,
  per-mutating-step confirmation gate, deferred to next branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* release: v1.16.0.0 -> v1.19.0.0 — browser-skills Phase 1 + 2a

Consolidates the v1.16.0.0 branch-internal bump (Phase 1 runtime, never
landed on main) with Phase 2a (/scrape + /skillify + atomic-write helper)
into one v1.19.0.0 ship per CLAUDE.md "Never orphan branch-internal
versions" rule.

Headline: Browser-skills land end-to-end. /scrape <intent> first call
drives the page; second call runs the codified script in 200ms.

The unified CHANGELOG entry covers:
- Phase 1 runtime: $B skill list/show/run/test/rm, scoped tokens,
  3-tier storage, bundled hackernews-frontpage reference.
- Phase 2a: /scrape + /skillify gstack skills, browser-skill-write.ts
  atomic helper, 5 gate-tier E2E + 34 unit assertions.

Numbers table updated: 5 new modules (+browser-skill-write), 2 new
gstack skills, 6 of 8 Codex outside-voice findings resolved (synthesis
garrytan#6 closed by D2; Bun runtime garrytan#7 + OS sandbox garrytan#1 stay deferred to Phase 4).

/automate (Phase 2b) is split out as P0 in TODOS for the next branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(commands): tighten descriptions for LLM-judge baseline pinning

The skill-llm-eval test "baseline score pinning" failed CI on three
retry attempts: judge gave command_reference.actionability=3, baseline
demands ≥4. Judge cited 8 specific gaps in COMMAND_DESCRIPTIONS.

This commit closes 7 of 8 by tightening the descriptions:

- press: documents that key names are case-sensitive Playwright keys,
  shows modifier syntax (Shift+Enter, Control+A), links the full key
  list. Removes the "is this case-sensitive?" guesswork.
- is: documents that <sel> accepts either a CSS selector OR an @ref
  token from a prior snapshot, and that property values are case-
  sensitive.
- scroll: documents that there is no --by/--to amount option, points
  at `js window.scrollTo(0, N)` for pixel-precise scrolling.
- js / eval: clarifies that both run in the same JS sandbox, the
  difference is just inline expr (js) vs file (eval).
- storage: clarifies sessionStorage is read-only via this command,
  points at `js sessionStorage.setItem(...)` for the write path.
- chain: walks through how to invoke (pipe a JSON array of arrays to
  $B chain), confirms it stops at the first error.
- cdp: explains how to discover allowed methods (read cdp-allowlist.ts)
  + shows a concrete example invocation.
- domain-skill: explains that the "classifier flag" is set automatically
  by the L4 prompt-injection scan (agents do not set it manually);
  enumerates the full lifecycle verbs.

The 8th gap (storage set syntax conflict) is also resolved as part of
the storage rewrite.

Two pipe-character bugs caught by the existing
`no command description contains pipe character` guard at
`test/gen-skill-docs.test.ts:595`: the chain example originally used
`echo '[...]' | $B chain` (literal pipe) and the cdp description used
`tab|browser` / `trusted|untrusted` (also literal pipes). Both rewritten
to keep markdown table cells intact.

Verification: 696/0 pass on skill-validation + gen-skill-docs after
regen across all hosts. The CI llm-judge eval will re-run against the
new SKILL.md and should hit actionability ≥4 reliably.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(browser): rewrite BROWSER.md as complete reference

Full rewrite covering the gstack browser surface as of v1.19.0.0. Up from
488 to 1,299 lines, 26 top-level sections.

Adds previously-undocumented subsystems:

- The productivity loop: /scrape + /skillify with D1 (provenance guard),
  D2 (final-attempt-only synthesis), D3 (atomic-write discipline) contracts.
- Browser-skills runtime: anatomy, three-tier storage, scoped tokens, trust
  model (capability + env axes), sibling SDK distribution, atomic-write
  helper, bundled hackernews-frontpage reference.
- Domain-skills: per-site agent notes with quarantined → active → global
  state machine and the L4-classifier auto-promotion gate.
- Pair-agent: dual-listener architecture, 26-command tunnel allowlist,
  canDispatchOverTunnel pure gate, three token types (root, setup key,
  scoped), denial log path + salt model.
- Security stack L1-L6: layer table, thresholds (BLOCK/WARN/LOG_ONLY/
  SOLO_CONTENT_BLOCK), ensemble rule, classifier model paths, env knobs.
- Side Panel deep dive: Terminal pane (Claude PTY) as the primary surface
  with Activity/Refs/Inspector as debug overlays, WS auth via
  Sec-WebSocket-Protocol, gstackInjectToTerminal cross-pane plumbing.
- CDP escape hatch: $B cdp deny-default allowlist, $B inspect CSS inspector,
  $B ux-audit page structure extraction.
- Meta commands previously undocumented: tabs/frames/state/watch/inbox/
  tab-each, with usage and storage paths.
- Authentication: three token types with lifetimes, SSE session cookie,
  PTY session cookie, token registry behavior.
- Full source map: 30+ file inventory of browse/src/ vs the old 11-file
  list.

Preserves from before: architecture diagram, daemon lifecycle, snapshot
ref staleness, screenshot modes, goto file:// vs load-html semantics,
batch endpoint, JS await wrapping, env vars, performance numbers vs MCP,
Playwright acknowledgments, dev guide.

Cross-links to ARCHITECTURE.md, CLAUDE.md, docs/REMOTE_BROWSER_ACCESS.md,
docs/designs/BROWSER_SKILLS_V1.md, scrape/SKILL.md, skillify/SKILL.md,
TODOS.md so anyone landing on BROWSER.md can navigate to the load-bearing
companion docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(server): tab-ownership gate keys on tabPolicy, not isWrite

Browser-skill spawns hit `403: Tab not owned by your agent` on every
first run because the gate at server.ts:639 fired for any non-root
write, regardless of the token's tabPolicy. The bundled
hackernews-frontpage reference skill failed identically. Every
/skillify-generated skill failed identically. The user's natural
tabs have no claimed owner — by design — so any skill driving
them via `goto` (a write) was 403'd.

The intent in skill-token.ts:79 was always correct: `tabPolicy: 'shared'`
with the comment "skill scripts may switch tabs as needed." The
enforcement just ignored it.

Two surgical changes:

browser-manager.ts:checkTabAccess — gate now keys on options.ownOnly
only. Shared-policy tokens (skill spawns, default scoped clients) get
permissive access — root-equivalent for the tab gate. Own-only tokens
(pair-agent over the ngrok tunnel) still require ownership for every
read and write. isWrite stays in the signature for callers that want
to log or branch elsewhere; it no longer gates the decision.

server.ts:639 — gate predicate narrowed from
  (WRITE_COMMANDS.has(command) || tokenInfo.tabPolicy === 'own-only')
to just
  tokenInfo.tabPolicy === 'own-only'
The 'newtab' exemption stays. Shared tokens skip the gate entirely;
own-only tokens still hit it. Comment block above the gate updated to
document the new predicate intent.

Pair-agent isolation is intact. Tunnel tokens still default to
tabPolicy: 'own-only', still must `newtab` first to get a tab they
can drive, still can't dispatch any of the 23 commands outside the
tunnel allowlist.

The capability gate (scope checks) and rate limits already constrain
what local scoped clients can do; tab ownership was never a security
boundary for them — only for pair-agent. This release makes the
enforcement match the original design intent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(server): lock the shared-vs-own-only tab gate contract

The pre-fix tests at tab-isolation.test.ts:43,57 encoded the broken
behavior as the contract — they specifically asserted "scoped agent
cannot write to unowned tab," which was the exact failure mode that
broke browser-skills. They passed because they tested the wrong
invariant.

This commit replaces those tests with explicit shared-vs-own-only
coverage that documents what each policy actually means:

- Shared scoped agents (skill spawns, default scoped clients) can
  read AND write any tab — unowned, their own, or another agent's.
  The capability is gated by scope checks + rate limits, not by tab
  ownership.
- Own-only scoped agents (pair-agent over tunnel) cannot read OR
  write any tab they don't own. Pre-fix this case was conflated with
  shared writes; now it's explicit.

9 unit assertions on checkTabAccess, up from 6. Each test names
the policy axis it's covering so a future refactor can't quietly
flip the contract.

Adds source-shape regression test 10a in server-auth.test.ts:
"tab gate predicate is own-only-scoped, not write-scoped." The
gate's `if (...)` line MUST contain `tabPolicy === 'own-only'` and
MUST NOT contain `WRITE_COMMANDS.has(command) ||`. If a future
refactor re-introduces the write-scoped gate, this fails immediately
in free-tier `bun test`.

Updates the marker for the existing newtab-excluded test to match
the new comment block ("Tab ownership check (own-only tokens /
pair-agent isolation)").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* release: v1.19.0.0 -> v1.20.0.0 — fix tab-ownership footgun

Patch release on top of v1.19.0.0. The shipping headline of v1.19.0.0
(/scrape + /skillify productivity loop) was broken on first run in any
session where the daemon already had a tab. Bundled
hackernews-frontpage failed identically. Every /skillify-generated
skill failed identically.

The fix narrows the tab-ownership gate from "any non-root write" to
"tabPolicy === 'own-only' only." Pair-agent isolation (the v1.6.0.0
threat model) is intact; local skill spawns get their original
behavior back.

VERSION: 1.19.0.0 -> 1.20.0.0
package.json version: synced.

CHANGELOG entry leads with the user-visible impact: the productivity
loop works again, no half-second-stalls of confused 403s. Includes
before/after metrics on the bundled reference skill and the broken-
contract pre-fix tests that hid the regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(claude): sharpen CHANGELOG rule — diff between main and ship

Codifies what was already implicit in the existing "Never orphan
branch-internal versions" + "Only document what shipped between main
and this change" sections, but with sharper language and concrete
NEVER examples.

The rule: a CHANGELOG entry is the diff between main and the shipping
branch — what users get when they upgrade. NOT how the branch got
there. Branch-internal version bumps, mid-branch bug fixes, plan
review outcomes, and patch narratives all belong in PR descriptions
and commit messages, not in CHANGELOG.

Adds explicit examples of phrasing to NEVER use:
  - "v1.X had a bug that v1.Y fixes" (mentions a branch-internal version)
  - "The shipping headline of v1.X was broken because..." (apologizes
    for never-released state)
  - "Pre-fix tests encoded the broken behavior" (contributor's victory
    lap, not user benefit)
  - "Two surgical edits, both in the dispatch path" (micro-narrative
    of the patch)

The constructive replacement: describe the released system as a
property, not as a fix. "Browser-skills run end-to-end with the
expected tab-access semantics." If a property is worth calling out,
document it in the trust-model section, not as a "we fixed X" callout.

Pairs with feedback_no_shame_changelog and
feedback_changelog_harden_against_critics memories — entries should
read as a flex even to a hostile screenshotter, never admit prior
breakage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(changelog): consolidate v1.20.0.0 as the diff vs main

Rewrites the v1.20.0.0 entry to describe what users get when they
upgrade from main (v1.17.0.0) to this release: browser-skills
end-to-end. Drops all branch-internal narrative — Phase 1 / Phase 2a
labels, the v1.8.0.0 P1 history paragraph, the test-counts-by-phase
split, and the patch micro-narrative for the tab-policy semantics.

The previously-separate v1.19.0.0 entry (a branch-internal version
that never landed on main) collapses into v1.20.0.0 per the
"Never orphan branch-internal versions" rule.

Tab-access policies are now documented as a property of the trust
model: `'shared'` (skill spawns) is permissive, `'own-only'`
(pair-agent over the tunnel) is strict. No "fix" framing, no
mention of an intermediate state where it was broken.

Adds the BROWSER.md rewrite and the new tab-isolation +
server-auth source-shape regression tests to the itemized changes.

The reverse-chronological order remains: v1.20.0.0 → v1.17.0.0 →
v1.16.0.0 → v1.15.0.0 → ... Gaps (v1.18, v1.19) are fine — those
were branch-internal version numbers that never landed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
garrytan added a commit that referenced this pull request May 17, 2026
…n) (#1547)

* fix(gbrain-sync): fold hostname into code-source id hash + migration (#1414)

Cherry-picked from #1468 by 0xDevNinja and extended with the
hostname-fold migration that codex review surfaced.

Pre-fix `deriveCodeSourceId` hashed the absolute repo path alone, so two
machines with identical home-dir layouts (chezmoi-managed dotfiles,
ansible-provisioned VMs) derived the same id and clobbered each other's
`local_path` in a federated brain. Last-writer-wins, with cryptic "Not a
git repository" errors on the loser.

Hash key is now `\${hostname}::\${path}`. Conductor worktrees on a single
host stay distinct (path entropy unchanged within a host); cross-machine
federations stop colliding.

Migration (D1=B + codex refinements): every existing user has a
pre-#1468 path-only-hash source id in their brain that no longer matches
what `deriveCodeSourceId` produces. Without migration, the next sync
registers a fresh source and orphans the old one. This commit adds:

- \`derivePathOnlyHashLegacyId\` — separate helper for the pre-#1468 form.
  Distinct from \`deriveLegacyCodeSourceId\` (pre-pathhash v1.x form);
  both probes run.

- \`planHostnameFoldMigration\` — feature-checks \`gbrain sources rename
  <old> <new>\` (exact argument shape, not just \`--help\`), gates on
  path-drift (skip migration if old source's \`local_path\` differs from
  current repo root), and falls back to register-new + sync-OK +
  remove-old when rename is unsupported. As of gbrain 0.35.0.0 the
  rename subcommand does not exist, so users go through the cleanup
  path; the rename path stays dormant until gbrain ships it.

- \`removeOrphanedSource\` — called only AFTER new-source sync verifies
  page_count > 0. Closes the data-loss window codex flagged where
  "register new, remove old before sync" can wipe pages if sync fails.

- \`sourceLocalPath\` — looks up a source's \`local_path\` from
  \`gbrain sources list --json\` for the drift gate.

- Helpers accept an optional \`env\` parameter so tests can inject a
  gbrain shim via PATH without process-wide PATH mutation (Bun's
  spawnSync doesn't pick up runtime PATH changes). Pre-positions for
  commit 4's centralized gbrain-exec helper.

- \`if (import.meta.main)\` guard around \`main()\` so the helpers can be
  imported for in-process unit tests.

Tests cover: pure derivation, ids-match degenerate case, no-legacy
short-circuit, path-drift skip path, rename path with shim, cleanup
fallback when rename unsupported, cleanup fallback when rename call
itself fails, source-lookup happy/missing/error paths.

\`GSTACK_HOSTNAME\` env var is a test-only knob; production uses
\`os.hostname()\`.

Fixes #1414

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(gbrain-sync): cut source-id slugs on hyphen boundaries (+ #1357)

Cherry-picked from #1481 by drummerms and extended with the explicit
HTTPS-remote regression case for #1357 (decision D2=A).

`constrainSourceId` truncated the slug with `slug.slice(-tailBudget)`,
which cut mid-word when the boundary fell inside a token. For a repo
where the combined `prefix-org-repo-pathhash` exceeded 32 chars, this
produced embarrassing artifacts like `gstack-code-kill-270c0001-c32152`
(from `drummerms-av-sow-wiz-skill-270c0001`).

Two changes carried from #1481, adapted for the #1468 hostpathhash:

1. `constrainSourceId` now walks hyphen-separated tokens from the right,
   accumulating whole tokens until adding the next would exceed
   `tailBudget`. When no token fits, falls through to the existing
   `${prefix}-${hash}` form.

2. `deriveCodeSourceId` now retries with `repo-only-hostpathhash`
   (dropping the org segment) when the full `org-repo-hostpathhash`
   triggers truncation. Keeps the repo name readable when it fits at all.

Plus a new test asserting the source id is period-free for the exact
HTTPS-with-.git remote shape from #1357 (`https://github.com/foo/bar.git`).
canonicalizeRemote strips `.git`; the sanitizer strips any residual
non-alnum. The test closes #1357 by pinning the property.

Closes #1357

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(gbrain): probe CLI without command builtin

* fix(gbrain-sync): centralize gbrain spawn surface + seed DATABASE_URL

Cherry-picked from #1508 by jasshultz, restructured per codex review #4
and #7 to widen scope and centralize the spawn surface.

The bug: gbrain auto-loads .env.local from cwd via dotenv. When
/sync-gbrain runs inside a Next.js / Prisma / Rails project whose
.env.local defines its own DATABASE_URL (pointing at the app's local
DB), gbrain reads that value instead of its own
~/.gbrain/config.json — auth fails, code + memory stages crash.

This commit:

- Adds lib/gbrain-exec.ts: buildGbrainEnv, spawnGbrain, execGbrainJson,
  execGbrainText, spawnGbrainAsync (the last one for memory-ingest's
  streaming gbrain import call). buildGbrainEnv seeds DATABASE_URL from
  ${GBRAIN_HOME:-$HOME/.gbrain}/config.json, returns a fresh env object
  (never the caller's by identity — codex review #11), and honors the
  GSTACK_RESPECT_ENV_DATABASE_URL=1 escape hatch.

- Routes every gbrain spawn in bin/gstack-gbrain-sync.ts and
  bin/gstack-memory-ingest.ts through the helpers. Both files now own
  zero direct spawnSync("gbrain"|spawn("gbrain"|execFileSync("gbrain"
  call sites.

- Threads buildGbrainEnv into the spawnSync("bun", [memory-ingest], ...)
  grandchild in runMemoryIngest (codex review #7). Without this, the
  parent fix is half-baked — the bun child inherits a clean env but
  needs DATABASE_URL pre-seeded too. spawnGbrainAsync inside
  memory-ingest provides defense in depth for standalone invocations.

- Adds GBRAIN_HOME support — aligns with detectEngineTier (already
  honors GBRAIN_HOME) so all gstack-side gbrain calls agree on which
  config file matters. Resolves baseEnv.HOME first, then homedir(), so
  test injection works without process-wide HOME mutation.

- Adds test/build-gbrain-env.test.ts: 10 unit tests covering all five
  env-seeding branches (seed from config / override caller /
  GSTACK_RESPECT escape hatch / missing config / unparseable config /
  no database_url field / GBRAIN_HOME path / object-identity guard /
  unrelated-vars preservation / idempotent-when-matches).

- Adds test/gbrain-exec-invariant.test.ts: static-source check that
  greps both bin/gstack-gbrain-sync.ts and bin/gstack-memory-ingest.ts
  for direct spawnSync("gbrain"|spawn("gbrain"|execFileSync("gbrain"|
  execSync(...gbrain matches and fails the build if any are found.
  Refactor-proof against future contributors adding a new gbrain spawn
  without env threading.

The invariant is intentionally narrow — only the two files where the
DATABASE_URL bug actually hurts users are guarded. Migrating the
spawn sites in lib/gbrain-local-status.ts, lib/gstack-memory-helpers.ts,
and bin/gstack-brain-context-load.ts is a follow-up.

Co-Authored-By: Jason Shultz <jasshultz@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>

* fix(gbrain-sync): add .gbrain-source to consumer repo .gitignore (#1384)

The v1.29.0.0 changelog promised .gbrain-source would be added to the
consuming repo's .gitignore so the per-worktree pin stays local, but the
change actually only added it to gstack's own .gitignore. Without the
consumer-side entry, the pin gets committed and Conductor sibling
worktrees of the same repo + branch step on each other's pin every time
anyone commits.

Add ensureGbrainSourceGitignored after a successful gbrain sources
attach in runCodeImport. Idempotent on repeat runs (line-trim match),
creates .gitignore if missing, logs a warning and continues on
permission errors so a read-only checkout doesn't fail the sync.

Gate the top-level main() call behind import.meta.main so tests can
import the helper without triggering a full sync run on module load.

Tests in test/gbrain-source-gitignore.test.ts cover: create-when-missing,
append-without-trailing-newline, append-with-trailing-newline,
idempotent on repeat, recognize whitespace-surrounded entry, no-throw
on read-only file. 6 pass.

* fix(gbrain-sources): bump gbrain sources list --json timeout 10s → 30s

Supabase free-tier cold-starts can push `gbrain sources list --json` past
10s (observed 14.5s in the wild), causing probeSource() to throw ETIMEDOUT
during /sync-gbrain code stage even though the underlying CLI was healthy.
Matches the 30s ceiling already used by `sources add` / `sources remove`
in the same file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(brain-allowlist): sync project-root eng-review-test-plan artifacts (#1452)

Cherry-picked from #1465 by genisis0x and extended with the v1.40.0.0
upgrade migration that codex review #5 surfaced.

#1465 alone only patches bin/gstack-artifacts-init, which means fresh
installs and re-inits pick up the new pattern. But existing users who
already ran v1.38.1.0 have a `.migrations/v1.38.1.0.done` marker — that
migration won't re-run no matter what we change. So their installed
`.brain-allowlist`, `.brain-privacy-map.json`, and `.gitattributes` stay
without the new pattern, and `/plan-eng-review` artifacts continue to
silently drop out of their federation queue.

This commit:

- bin/gstack-artifacts-init: adds projects/*/*-eng-review-test-plan-*.md
  to the three managed blocks. v1.38.1.0 covered design + test-plan; this
  completes the set for /plan-eng-review.

- gstack-upgrade/migrations/v1.40.0.0.sh: targeted in-place repair for
  existing installs. Same idempotent jq-based shape as v1.38.1.0. Adds
  the new pattern to .brain-allowlist (before the USER ADDITIONS marker),
  .brain-privacy-map.json (as class=artifact), and .gitattributes (as
  merge=union). NEVER commits + pushes — the user controls when the
  patches ship to their federated artifacts repo.

- test/artifacts-init-migration.test.ts: 5 new tests covering the
  v1.40.0.0 migration applied on top of a post-v1.38.1.0 state, jq
  patching, gitattributes append, idempotent re-run, and done-marker
  write when files are missing entirely.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(gbrain-install): skip postinstall on Windows MSYS/MINGW + post-install probe

Cherry-picked from #1487 by genisis0x and extended with the post-install
subcommand probe per T6 / codex review #19.

`bun install` in $INSTALL_DIR fails on Windows MSYS/MINGW/Cygwin shells
because gbrain's native postinstall script mis-parses path arguments
and aborts with a non-zero exit, breaking gstack-gbrain-install for
Windows users running git-bash/MSYS2. The package installs cleanly
without scripts.

This commit:

- Adds Windows shell detection via `uname -s` matching
  MINGW*/MSYS*/CYGWIN*/Windows_NT (#1487's case statement already covers
  all four — codex review #18 confirmed MINGW* is included). Windows
  paths get `bun install --ignore-scripts`; macOS and Linux unchanged.

- Adds a post-install probe of `gbrain sources --help`. `gbrain --version`
  already runs (D19 PATH-shadowing validation), but version success
  doesn't prove the subcommand surface is reachable — and
  `--ignore-scripts` may have skipped artifacts that subcommands need.
  Probe failure logs a clear warning (with Windows-specific remediation
  pointing at re-running `bun install` outside MSYS) but does NOT exit
  non-zero; users may still get value from gbrain even if the probe
  fails transiently.

Refs #1271

Co-Authored-By: Claude <noreply@anthropic.com>

* chore: v1.40.0.0 — gbrain sync hardening wave

Bumps VERSION 1.39.2.0 → 1.40.0.0 (MINOR — substantial gbrain capability
hardening across sync pipeline, install path, federation allowlist;
~600 net LOC added across 8 community PRs + plan-review refinements).

CHANGELOG entry follows the release-summary format: two-line headline,
lead paragraph, "numbers that matter" with before/after table across 8
user-visible surfaces, "what this means for builders" closer, itemized
Added/Changed/Fixed/NOT fixed/For contributors sections.

Per-commit contributor credits: 0xDevNinja, drummerms, Jayesh Betala,
Jason Shultz, genisis0x. Also names NikhileshNanduri and realcarsonterry
in the wave's "Fixed" section for independent submissions of the
.gbrain-source gitignore bug.

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: 0xDevNinja <manmit0x@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: drummerms <mike@av2o.com>
Co-authored-by: Jayesh Betala <jayesh.betala7@gmail.com>
Co-authored-by: Jason Shultz <jasshultz@gmail.com>
Co-authored-by: genisis0x <manietdavv@gmail.com>
RyotaKun pushed a commit to RyotaKun/gstack that referenced this pull request May 18, 2026
…rrytan#1233)

* feat(gbrain-sync): queue primitives + writer shims

Adds bin/gstack-brain-enqueue (atomic append to sync queue) and
bin/gstack-jsonl-merge (git merge driver, ts-sort with SHA-256 fallback).
Wires one backgrounded enqueue call into learnings-log, timeline-log,
review-log, and developer-profile --migrate. question-log and
question-preferences stay local per Codex v2 decision.

gstack-config gains gbrain_sync_mode (off/artifacts-only/full) and
gbrain_sync_mode_prompted keys, plus GSTACK_HOME env alignment so
tests don't leak into real ~/.gstack/config.yaml.

* feat(gbrain-sync): --once drain + secret scan + push

bin/gstack-brain-sync is the core sync binary. Subcommands: --once
(drain queue, allowlist-filter, privacy-class-filter, secret-scan
staged diff, commit with template, push with fetch+merge retry),
--status, --skip-file <path>, --drop-queue --yes, --discover-new
(cursor-based detection of artifact writes that skip the shim).

Secret regex families: AWS keys, GitHub tokens (ghp_/gho_/ghu_/ghs_/
ghr_/github_pat_), OpenAI sk-, PEM blocks, JWTs, bearer-token-in-JSON.
On hit: unstage, preserve queue, print remediation hint (--skip-file
or edit), exit clean. No daemon — invoked by preamble at skill
boundaries.

* feat(gbrain-sync): init, restore, uninstall, consumer registry

bin/gstack-brain-init: idempotent first-run. git init ~/.gstack/,
.gitignore=*, canonical .brain-allowlist + .brain-privacy-map.json,
pre-commit secret-scan hook (defense-in-depth), merge driver registration
via git config, gh repo create --private OR arbitrary --remote <url>,
initial push, ~/.gstack-brain-remote.txt for new-machine discovery,
GBrain consumer registration via HTTP POST.

bin/gstack-brain-restore: safe new-machine bootstrap. Refuses clobber
of existing allowlisted files, clones to staging, rsync-copies tracked
files, re-registers merge drivers (required — not cloned from remote),
rehydrates consumers.json, prompts for per-consumer tokens.

bin/gstack-brain-uninstall: clean off-ramp. Removes .git + .brain-*
files + consumers.json + config keys. Preserves user data (learnings,
plans, retros, profile). Optional --delete-remote for GitHub repos.

bin/gstack-brain-consumer + bin/gstack-brain-reader (symlink alias):
registry management. Internal 'consumer' term; user-facing 'reader'
per DX review decision.

* feat(gbrain-sync): preamble block — privacy gate + boundary sync

scripts/resolvers/preamble/generate-brain-sync-block.ts emits bash that
runs at every skill invocation:
- Detects ~/.gstack-brain-remote.txt on machines without local .git
  and surfaces a restore-available hint (does NOT auto-run restore).
- Runs gstack-brain-sync --once at skill start to drain any pending
  writes (and at skill end via prose instruction).
- Once-per-day auto-pull (cached via .brain-last-pull) for append-only
  JSONL files.
- Emits BRAIN_SYNC: status line every skill run.

Also emits prose for the host LLM to fire the one-time privacy
stop-gate (full / artifacts-only / off) when gbrain is detected and
gbrain_sync_mode_prompted is false. Wired into preamble.ts composition.

* test(gbrain-sync): 27-test consolidated suite

test/brain-sync.test.ts covers:
- Config: validation, defaults, GSTACK_HOME env isolation
- Enqueue: no-op gates, skip list, concurrent atomicity, JSON escape
- JSONL merge driver: 3-way + ts-sort + SHA-256 fallback
- Init + sync: canonical file creation, merge driver registration,
  push-reject + fetch+merge retry path
- Init refuses different remote (idempotency)
- Cross-machine restore round-trip (machine A write → machine B sees)
- Secret scan across all 6 regex families (AWS, GH, OpenAI, PEM, JWT,
  bearer-JSON). --skip-file unblock remediation
- Uninstall removes sync config, preserves user data
- --discover-new idempotence via mtime+size cursor

Behaviors verified via integration smokes during implementation. Known
follow-up: bun-test 5s default timeout needs 30s wrapper for
spawnSync-heavy tests.

* docs(gbrain-sync): user guide + error lookup + README section

docs/gbrain-sync.md: setup walkthrough, privacy modes, cross-machine
workflow, secret protection, two-machine conflict handling, uninstall,
troubleshooting reference.

docs/gbrain-sync-errors.md: problem/cause/fix index for every
user-visible error. Patterned on Rust's error docs + Stripe's API
error reference.

README.md: 'Cross-machine memory with GBrain sync' section near the
top (discovery moment), plus docs-table entry.

* chore: bump version and changelog (v1.7.0.0)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore: regenerate SKILL.md files for gbrain-sync preamble block

Re-runs bun run gen:skill-docs after adding generateBrainSyncBlock
to scripts/resolvers/preamble.ts in a2aa8a0. CI check-freshness
caught the drift. All 36 SKILL.md files regenerated with the new
skill-start bash block + privacy-gate prose + skill-end sync
instructions baked in.

* fix(test): session-awareness reads AskUserQuestion Format from a Tier 2+ SKILL.md

The test was reading ROOT/SKILL.md (browse skill, Tier 1) which never
contained '## AskUserQuestion Format' — that section is only emitted
for Tier 2+ skills by scripts/resolvers/preamble.ts. As a result the
agent was prompted with an empty format guide and only emitted
'RECOMMENDATION' intermittently, making the test flaky.

Pre-existing on main (same ROOT/SKILL.md shape there) — surfaced now
because the agent run didn't hit the RECOMMENDATION/recommend/option a
fallback strings in this particular attempt.

Fix: read from office-hours/SKILL.md (Tier 3, always has the section)
with a fallback that scans for the first top-level skill dir whose
SKILL.md contains the header. Future template moves won't break this
test again.

* feat(browse): domain-skills storage + state machine

New module browse/src/domain-skills.ts implements the per-site notes
the agent writes for itself, persisted as type:"domain" rows alongside
/learn's per-project learnings.

Three scopes layered: per-project default, global by explicit promotion.
Project-active shadows global for the same host.

State machine (T6 — codex outside-voice):
  quarantined --3 uses w/o flag--> active(project) --promote--> global
        ^                                |
        +----- classifier flag during use

- Append-only JSONL with O_APPEND for atomic small writes
- Tolerant parser drops partial trailing line on read
- Tombstone for deletes (compactor cleans up later)
- Version log per (host, scope) enables rollback
- Hostname derived from active tab top-level origin (T3 confused-deputy fix)
- writeSkill rejects classifier_score >= 0.85 with structured error

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browse): domain-skills storage + state machine

14 tests covering:
- T3 hostname normalization (lowercase, www. strip, port/path/query strip,
  subdomain-exact preserved)
- T4 scope shadowing (per-project active shadows global for same host)
- T5 persistence (version monotonicity, tolerant parser drops partial line)
- T6 state machine (quarantined → active after N=3 uses, classifier-flag
  blocks promotion, save-time score >= 0.85 rejected)
- Rollback by version log (restore prior body, advance version counter)
- Tombstone deletion (read returns null after delete)

All 14 pass in 27ms via bun test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): $B domain-skill subcommands

Wire the domain-skills storage layer into the browse CLI as a META command:

  $B domain-skill save              save body from stdin or --from-file
                                     (host derived from active tab — T3)
  $B domain-skill list              list all skills visible to current project
  $B domain-skill show <host>       print skill body
  $B domain-skill edit <host>       open in $EDITOR
  $B domain-skill promote-to-global <host>  cross-project promotion (T4)
  $B domain-skill rollback <host> [--global]  restore prior version
  $B domain-skill rm <host> [--global]        tombstone

Save path runs L1-L3 content filters from content-security.ts (importable
in compiled binary, unlike L4 ML classifier — see CLAUDE.md). The L4
classifier scan happens in sidebar-agent at prompt-injection load time.

Output is structured (problem + cause + suggested-action) per DX D7.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): $B cdp escape hatch — deny-default allowlist + two-tier mutex

Codex T2: flip CDP posture to deny-default. Allowed methods enumerated in
cdp-allowlist.ts with (scope: tab|browser, output: trusted|untrusted,
justification) per entry.

Initial allowlist (~25 methods) covers:
- Accessibility tree extraction (read-only)
- DOM/CSS inspection (read-only)
- Performance metrics
- Tracing
- Emulation viewport/UA override
- Page screenshot/PDF capture (output is binary, no marker injection vector)
- Network.enable/disable (no bodies/cookies — those are exfil surfaces)
- Runtime.getProperties (NO evaluate/callFunctionOn — those would be RCE)

Page.navigate is INTENTIONALLY NOT allowed; agents use $B goto which
goes through the URL blocklist.

Codex T7: two-tier mutex. tab-scoped methods take per-tab lock; browser-
scoped take global lock that blocks all tab locks. 5s acquire timeout
yields CDPMutexAcquireTimeout (no silent hangs). All lock acquires use
try/finally so errors don't leak the lock.

Path A from spike: uses Playwright's newCDPSession() per page. No second
WebSocket, no need for --remote-debugging-port. CDPSession is cached
per page in a WeakMap and cleared on page close.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browse): CDP allowlist + two-tier mutex

13 tests:
- Allowlist linter: every entry has 4 required fields, no duplicates,
  justification length > 20 chars
- Deny-list verification: dangerous methods (Runtime.evaluate, Page.navigate,
  Network.getResponseBody, Browser.close, Target.attachToTarget, etc.) are
  NOT allowed (Codex T2 categories 4-7)
- Per-tab mutex serializes ops on same tab
- Per-tab mutex allows parallel ops across different tabs
- Global lock blocks tab locks; tab locks block global lock
- Acquire timeout yields CDPMutexAcquireTimeout (no silent hang)
- Timeout error names the tab id and the timeout budget

Also extends Network.disable justification to satisfy linter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): telemetry signals + project-slug helper

Lightweight telemetry per DX D9: piggybacks on ~/.gstack/analytics/ pattern.
Hostname + aggregate counters only, no body content. GSTACK_TELEMETRY_OFF=1
silences. Fire-and-forget — never blocks calling path.

Signals fired so far:
- domain_skill_saved {host, scope, state, bytes}
- domain_skill_save_blocked {host, reason}

(domain_skill_fired and cdp_method_* fired in subsequent commits.)

Also extracts project-slug resolution into project-slug.ts so server.ts
and domain-skill-commands.ts share one cached lookup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): sidebar prompt-context injection + CDP telemetry

server.ts spawnClaude now:
- Imports per-project domain skill matching the active tab's hostname
  via readDomainSkill()
- Wraps the body in UNTRUSTED EXTERNAL CONTENT envelope (so the L4
  classifier in sidebar-agent sees it at load time per Eng D4)
- Appends as <domain-skill source="..." host="..." version="..."> block
- Fires domain_skill_fired telemetry (host, source, version)
- Calls recordSkillUse fire-and-forget so the auto-promote-after-N=3
  state machine advances on each successful prompt injection

System prompt also gets a one-liner introducing $B domain-skill commands
to agents (DX D4 start-of-task discoverability hint).

cdp-bridge.ts fires:
- cdp_method_denied (drives next allow-list growth)
- cdp_method_lock_acquire_ms (P50/P99 quantile observability)
- cdp_method_called (allowed methods)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browse): telemetry module

3 tests covering:
- logTelemetry writes JSONL with ts injected
- GSTACK_TELEMETRY_OFF=1 silences all events
- logTelemetry never throws on disk failures

Uses GSTACK_HOME env var to redirect writes to a tmp dir; the telemetry
module reads HOME lazily so test mutations take effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: domain-skills reference + error lookup table

docs/domain-skills.md mirrors the layered shape of docs/gbrain-sync.md
(DX D8): how agents use it, state machine, storage layout, security model
(L1-L3 + L4 layered defense), error reference table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(readme): browser-harness-js plug + domain-skills section

New "Domain skills + raw CDP escape hatch" section under "The sprint"
covering both v1.8.0.0 features. Plugs browser-use/browser-harness-js
as the no-rails alternative for users who want raw CDP without gstack's
security stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.8.0.0)

Branch-scoped bump on top of merged 1.7.0.0 base. CHANGELOG entry covers
the full v1.8.0.0 scope: $B domain-skill, $B cdp escape hatch, two-tier
mutex, telemetry signals, sidebar prompt-context injection. Includes
Codex outside-voice trail (7 of 20 findings resolved, 12 mooted by T1
scope drop).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* todos: 7 follow-ups from v1.8.0.0 review trail

P1: Self-authoring $B commands with out-of-process worker isolation
    (Codex T1 deferred from v1.8.0.0 — needs real isolation design)
P2: Migrate /learn to SQLite (Codex T5 long-term primitive fix)
P2: Remove plan-mode handshake from /plan-devex-review (skill bug)
P3: GBrain skillpack publishing for domain-skills
P3: Replay/record demonstrated flows to domain-skills
P3: $B commands review batch-mode UX (alternative to inline approval)
P3: Heuristic command-gap watcher (DX D4 alternative C)

Each entry has the standard What/Why/Pros/Cons/Context/Effort/Priority/
Depends-on shape so anyone picking these up later has full context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(browse): lazy GSTACK_HOME resolution in domain-skills

Module-level constants (GLOBAL_FILE, derived path) were evaluated at
module-load and cached. When E2E and unit tests run in the same Bun
test pass and set GSTACK_HOME differently, the second test sees the
first test's path. Switch to lazy gstackHome() / globalFile() / projectFile()
helpers so process.env mutations take effect.

Mirrors the pattern already used in telemetry.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browse): E2E gate-tier tests for domain-skills + CDP

domain-skills-e2e.test.ts (4 tests):
- save derives host from active tab top-level origin (T3)
- save lands quarantined; list surfaces it
- readSkill returns null until 3 uses without flag promote to active (T6)
- save without an active page errors with structured guidance

cdp-e2e.test.ts (8 tests):
- Accessibility.getFullAXTree returns wrapped JSON (allowed, untrusted-output)
- Performance.getMetrics returns plain JSON (allowed, trusted-output)
- Runtime.evaluate DENIED with structured guidance (T2 RCE block)
- Page.navigate DENIED (must use $B goto for blocklist routing)
- Network.getResponseBody DENIED (exfil block)
- malformed JSON params surfaces clear error
- non Domain.method format surfaces clear error
- $B cdp help returns help text

Both files boot a real Chromium via BrowserManager.launch() and exercise
the dispatch handlers end-to-end. Total 12 E2E tests in <2s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: regenerate SKILL.md files with new $B commands

bun run gen:skill-docs picks up the domain-skill and cdp META_COMMANDS
entries added in commands.ts. Both top-level SKILL.md and browse/SKILL.md
now list the new commands in their Meta and Inspection tables.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(fixtures): regenerate ship SKILL.md golden baselines for v1.7.0.0

Pre-existing failures inherited from garrytan/gbrain-support: the GBrain
Sync preamble block (added in v1.7.0.0) appears in regenerated SKILL.md
output but the golden baselines in test/fixtures/golden/ were never
updated. Three failures fixed:

  golden-file regression > Claude ship skill matches golden baseline
  golden-file regression > Codex ship skill matches golden baseline
  golden-file regression > Factory ship skill matches golden baseline

Goldens regenerated by copying the current ship/SKILL.md, codex
.agents/skills/gstack-ship/SKILL.md, and .factory/skills/gstack-ship/SKILL.md
files. Diff is the v1.7.0.0 GBrain Sync preamble block + privacy stop-gate
(no behavioral changes — just preamble text).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(brain-sync): bearer-token regex catches values with leading space

Pre-existing bug from v1.7.0.0: the bearer-token-json secret pattern
required values matching [A-Za-z0-9_./+=-]{16,}, which rejected the
"Bearer <token>" form because the literal space after "Bearer" wasn't
in the character class. Real Authorization headers use "Bearer <token>"
syntax, and the test fixture
  '"authorization":"Bearer abcdef1234567890abcdef1234567890"'
sat unscanned despite being a leak-class secret.

One-character fix: add space to the value character class. Test
'gstack-brain-sync secret scan > blocks bearer-json' now passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(brain-sync): GSTACK_HOME isolation test compares mtime, not content

Pre-existing flaky test: the GSTACK_HOME-overrides-real-config test asserted
the real ~/.gstack/config.yaml does NOT contain "gbrain_sync_mode: full"
after the test. That fails for any user whose real config legitimately has
that key set from prior usage — the test's invariant is "the command did
not modify the real file," not "the real file lacks any specific value."

Switch to mtime + content snapshot: capture both BEFORE running the command,
then verify both are unchanged after. Also add a positive assertion that
the tmpHome config DID get the new key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(skill-validation): exempt deliberate large fixtures from 2MB limit

Pre-existing failure: the "git tracks no files larger than 2MB" test
caught browse/test/fixtures/security-bench-haiku-responses.json (28.8MB
of replay data committed in v1.6.4.0 for security benchmark gate tests).

The test exists to catch accidentally-committed binaries (Mach-O dist
binaries, etc), not to forbid all large files. Add an explicit
LARGE_FIXTURE_EXEMPTIONS allowlist so deliberate replay fixtures pass
the gate while accidental binaries still fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(skill-token): mint scoped tokens per skill spawn

Wraps token-registry.createToken/revokeToken with skill-specific
clientId encoding (skill:<name>:<spawn-id>) and read+write defaults.
Skill scripts get a per-spawn capability token bound to browser-driving
commands; the daemon root token never leaves the harness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse-client): SDK for browser-skill scripts

Thin wrapper over POST /command with bearer auth. Resolves daemon
port + token from GSTACK_PORT + GSTACK_SKILL_TOKEN env vars first
(set by $B skill run when spawning), falls back to .gstack/browse.json
for standalone debug runs.

Convenience methods cover the read+write surface skills typically need:
goto, click, fill, text, html, snapshot, links, forms, accessibility,
attrs, media, data, scroll, press, type, select, wait, hover, screenshot.
Low-level command(cmd, args) escape hatch for anything else.

This is the canonical SDK source. Each browser-skill ships a sibling
copy at <skill>/_lib/browse-client.ts so each skill is fully portable
and version-pinned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browser-skills): 3-tier storage helpers

listBrowserSkills() walks project > global > bundled (first-wins),
parses SKILL.md frontmatter, no INDEX.json. readBrowserSkill() does
the same for a single name. tombstoneBrowserSkill() moves a skill
into .tombstones/<name>-<ts>/ for recoverability.

Frontmatter parser handles the subset browser-skills need: scalars
(host, description, trusted, version, source), string lists
(triggers), and arg-mapping lists ([{name, description}, ...]).
Quoted values handle colons; trusted defaults to false.

Bundled tier path is auto-detected from the binary install location;
project tier comes from git rev-parse; global is ~/.gstack/. All tier
paths are overridable for hermetic tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browser-skills): \$B skill list/show/run/test/rm subcommands

handleSkillCommand dispatches to per-subcommand handlers; spawnSkill is
the load-bearing function that:

  1. Mints a per-spawn scoped token (read+write only) bound to the
     skill name + spawn-id.
  2. Builds the spawn env:
     - trusted: passes process.env minus GSTACK_TOKEN (defense in depth).
     - untrusted: minimal allowlist (LANG, LC_ALL, TERM, TZ) + locked
       PATH; explicitly drops anything matching TOKEN/KEY/SECRET/etc.
       Also drops AWS_/AZURE_/GCP_/GOOGLE_APPLICATION_/ANTHROPIC_/OPENAI_/
       GITHUB_/GH_/SSH_/GPG_/NPM_TOKEN/PYPI_ patterns.
   3. Always injects GSTACK_PORT + GSTACK_SKILL_TOKEN last (cannot be
     overridden by parent env).
  4. Spawns bun run script.ts -- <args> with cwd=skillDir, captures
     stdout (1MB cap), stderr, and timeout-kills past the deadline.
  5. Revokes the token in finally{}, always.

list output prints the resolved tier inline so "why did it run that
one?" never becomes a debugging mystery (Codex finding garrytan#4 mitigation).

server.ts threads the listen port to meta-commands via MetaCommandOpts.daemonPort.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browser-skills): bundled hackernews-frontpage reference skill

Smallest interesting browser-skill: scrapes HN front page, returns
30 stories as JSON. No auth, stable HTML, fully fixture-tested.

Files:
  SKILL.md                          frontmatter + prose
  script.ts                         exports parseStoriesFromHtml(html)
                                    main: goto + html + parse + JSON.stringify
  _lib/browse-client.ts             vendored copy of the SDK
  fixtures/hn-2026-04-26.html       captured front page (5 stories)
  script.test.ts                    13 assertions against the fixture

The parser is a pure function over HTML so script.test.ts runs
without a daemon (just imports parseStoriesFromHtml and asserts).

This exercises every Phase 1 component end-to-end:
  - browse-client SDK (script imports browse from ./_lib/)
  - 3-tier lookup (hackernews-frontpage lives in the bundled tier)
  - scoped tokens (read+write is enough for goto + html)
  - spawn lifecycle (\$B skill run hackernews-frontpage)
  - file-fixture testing (\$B skill test hackernews-frontpage)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(skill-validation): cover bundled browser-skills

Adds 7 assertions per bundled skill at <root>/browser-skills/<name>/:
  - SKILL.md exists
  - frontmatter parses with required fields (name/host/triggers/args)
  - script.ts exists
  - _lib/browse-client.ts exists and matches the canonical SDK byte-for-byte
  - script.test.ts exists
  - script.ts imports browse from ./_lib/browse-client

The byte-identical SDK check enforces the version-pinning contract:
when the canonical SDK at browse/src/browse-client.ts changes, every
bundled skill's _lib/ copy must be re-synced or this test fails.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(designs): add BROWSER_SKILLS_V1 design doc

Captures the 13 locked decisions, two-axis trust model (daemon-side
scoped tokens + process-side env access), 3-tier lookup, file
layout, and full responses to all 8 Codex outside-voice findings.
Includes Phase 2-4 sketches for future branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(todos): replace self-authoring-\$B P1 with browser-skills phases

Phase 1 of the browser-skills design shipped on this branch (sidesteps
the in-daemon isolation problem the original P1 was blocked on). The
new entries enumerate the work that remains:

  P1: Phase 2 (/scrape + /automate skill templates)
  P2: Phase 3 (resolver injection at session start)
  P2: Phase 4 (eval infra + fixture staleness + OS sandbox)

Cross-references docs/designs/BROWSER_SKILLS_V1.md for the full
architecture and the 8 Codex review findings + responses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* release: v1.9.0.0 — browser-skills runtime

VERSION 1.8.0.0 → 1.9.0.0. CHANGELOG entry leads with what humans
can do today (hand-write deterministic browser scripts, run them in
200ms via \$B skill run). Notes explicitly that agent authoring
lands in next release; no fabricated perf numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browser-skills-e2e): exercise dispatch with bundled hackernews-frontpage

Covers the full \$B skill list/show/test pipeline against the real
bundled reference skill (defaultTierPaths picks up <repo>/browser-skills/).
Verifies frontmatter shape, the three-tier walk surfaces the bundled
entry, and \$B skill test successfully runs the bundled script.test.ts
in a child bun process.

\$B skill run end-to-end against the live network is intentionally NOT
covered here (would be flaky against news.ycombinator.com); the spawn
lifecycle is exercised in browser-skill-commands.test.ts using inline
synthetic skills.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: regen SKILL.md to surface the skill META command

bun run gen:skill-docs picked up the new \`skill\` command from
COMMAND_DESCRIPTIONS in browse/src/commands.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* release: bump v1.9.0.0 → v1.13.0.0

Main shipped through v1.11.1.0 while this branch was in flight; v1.12.x
is presumed claimed by another in-flight branch. Use v1.13.0.0 as the
next available slot.

Updated VERSION, package.json, and the CHANGELOG header. Entry body
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* release: bump v1.13.0.0 → v1.16.0.0

Main shipped v1.13.0.0 (claude outside-voice skill), v1.14.0.0
(sidebar REPL), and v1.15.0.0 (slim preamble + plan-mode E2E)
while this branch was in flight. Use v1.16.0.0 as the next
available slot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse-skills): atomic write helper for /skillify (D3)

stageSkill writes a candidate skill into ~/.gstack/.tmp/skillify-<spawnId>/
with restrictive perms. commitSkill does an atomic fs.renameSync into the
final tier path with realpath/lstat discipline (refuses symlinked staging
dirs, refuses to clobber existing skills). discardStaged is the cleanup
path for test failures and approval rejections, idempotent and bounded
to the per-spawn wrapper. validateSkillName enforces lowercase/digits/
dashes only, no path-escape characters.

Implements the D3 contract from the v1.19.0.0 plan review: never a
half-written skill on disk. Test fail or approval reject = rm -rf the
temp dir, no tombstone for never-approved skills.

Closes Codex finding garrytan#5 (atomic skill packaging) for Phase 2a.

34 unit assertions covering: stage validation, file-path escape rejection,
permission check, atomic rename, clobber refusal, symlink refusal, project
tier unresolved, idempotent discard, end-to-end happy + simulated test
failure + approval reject paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(scrape): /scrape <intent> skill template

One entry point for pulling page data. Three paths under the hood:

1. Match — agent reads $B skill list, semantically matches the user's
   intent against each skill's triggers + description + host. Confident
   match = $B skill run <name> in ~200ms.
2. Prototype — no match, drive the page with $B goto/text/html/links etc.
   Return JSON, append a one-line "say /skillify" nudge.
3. Mutating refusal — verbs like submit/click/fill route to /automate
   (Phase 2b P0); /scrape is read-only by contract.

Match decision lives in the agent, not the daemon. No new code in
browse/src/, no expanded daemon command surface, no new prompt-injection
blast radius.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(skillify): /skillify codifies last /scrape into permanent skill

The productivity multiplier. /scrape discovers the flow; /skillify writes
it as deterministic Playwright-via-browse-client code so the next /scrape
on the same intent runs in ~200ms.

11-step flow with three locked contracts from the v1.19.0.0 plan review:

D1 — Provenance guard. Walk back ≤10 agent turns for a clearly-bounded
/scrape result. Refuse with one specific message if cold. No silent
synthesis from chat fragments.

D2 — Synthesis input slice. Extract ONLY the final-attempt $B calls that
produced the JSON the user accepted, plus the user's intent string. Drop
failed selectors, drop unrelated chat, drop earlier-session content.
Closes Codex finding garrytan#6 by picking option (b) from the design doc:
re-prompt from agent's own context, not a structured recorder.

D3 — Atomic write. Stage to ~/.gstack/.tmp/skillify-<spawnId>/, run
$B skill test against the temp dir, only rename into the final tier path
on test pass + user approval. Test fail or approval reject = rm -rf the
temp dir entirely.

Default tier: global (~/.gstack/browser-skills/<name>/). --project flag
overrides to per-project. Generated test must include at least one ★★
assertion (parsed JSON has expected shape + non-empty key fields), not a
smoke ★ assertion.

Bun runtime distribution (Codex finding garrytan#7) carries over to Phase 4.
Documented in the skill's Limits section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browser-skills): gate-tier E2E for /scrape + /skillify (D4)

Five scenarios cover the productivity loop and the contracts locked
during the v1.19.0.0 plan review:

  scrape-match-path           — intent matching bundled hackernews-frontpage
                                routes via $B skill run, no prototype phase
  scrape-prototype-path       — no matching skill, drives $B against a local
                                file:// fixture, returns JSON, suggests
                                /skillify
  skillify-happy-path         — /scrape then /skillify; skill written to
                                ~/.gstack/browser-skills/<name>/ with the
                                full file tree; SKILL.md prose body must
                                not contain conversation fragments (D2)
  skillify-provenance-refusal — cold /skillify with no prior /scrape refuses
                                with the D1 message; nothing on disk (D1)
  skillify-approval-reject    — /scrape then /skillify but reject in the
                                approval gate; temp dir is removed, nothing
                                at the final tier path (D3)

All five gate-tier (~$0.50-$1.50 each, ~$5 total per CI run). Set EVALS=1
to enable. Uses local file:// fixtures so prototype + skillify scenarios
run deterministically without network.

Touchfiles registers all 5 entries with proper deps on scrape/**,
skillify/**, browse/src/browser-skill-write.ts, and the Phase 1 runtime
modules. The match-path test depends on the bundled hackernews-frontpage
skill so its touchfile includes browser-skills/hackernews-frontpage/**.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(browser-skills): TODOS Phase 2a + design doc D1-D4 decisions

TODOS.md:
- Narrows existing P1 (was "/scrape and /automate") to "/scrape and
  /skillify" — the /scrape + /skillify wedge ships in this branch.
  Codex finding garrytan#6 (synthesis) removed from Cons (resolved by D2);
  finding garrytan#7 (Bun runtime) stays as the open carry-over.
- Adds new ## P0 above PACING_UPDATES_V0 for the /automate follow-up.
  Same skillify pattern as /scrape, different trust profile (per-step
  confirmation gate when running non-codified). Reuses /skillify and
  the D3 helper as-is. Effort M.

BROWSER_SKILLS_V1.md:
- Phase table re-organized into 1, 2a, 2b, 3, 4. Phase 1 + Phase 2a
  consolidate into v1.19.0.0 ship (the v1.16.0.0 branch-internal
  bump never landed on main).
- New "Phase 2a" sub-section captures the four decisions locked
  during /plan-eng-review:
    D1 — provenance guard (≤10 turn walk-back, refuse if cold)
    D2 — synthesis input slice (final-attempt $B calls only,
         closes Codex finding garrytan#6)
    D3 — atomic write discipline (temp-dir-then-rename via new
         browse/src/browser-skill-write.ts helper)
    D4 — full test scope (5 gate E2E + 1 unit + smoke)
- New "Phase 2b" sketch for /automate: same skillify machinery,
  per-mutating-step confirmation gate, deferred to next branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* release: v1.16.0.0 -> v1.19.0.0 — browser-skills Phase 1 + 2a

Consolidates the v1.16.0.0 branch-internal bump (Phase 1 runtime, never
landed on main) with Phase 2a (/scrape + /skillify + atomic-write helper)
into one v1.19.0.0 ship per CLAUDE.md "Never orphan branch-internal
versions" rule.

Headline: Browser-skills land end-to-end. /scrape <intent> first call
drives the page; second call runs the codified script in 200ms.

The unified CHANGELOG entry covers:
- Phase 1 runtime: $B skill list/show/run/test/rm, scoped tokens,
  3-tier storage, bundled hackernews-frontpage reference.
- Phase 2a: /scrape + /skillify gstack skills, browser-skill-write.ts
  atomic helper, 5 gate-tier E2E + 34 unit assertions.

Numbers table updated: 5 new modules (+browser-skill-write), 2 new
gstack skills, 6 of 8 Codex outside-voice findings resolved (synthesis
garrytan#6 closed by D2; Bun runtime garrytan#7 + OS sandbox garrytan#1 stay deferred to Phase 4).

/automate (Phase 2b) is split out as P0 in TODOS for the next branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(commands): tighten descriptions for LLM-judge baseline pinning

The skill-llm-eval test "baseline score pinning" failed CI on three
retry attempts: judge gave command_reference.actionability=3, baseline
demands ≥4. Judge cited 8 specific gaps in COMMAND_DESCRIPTIONS.

This commit closes 7 of 8 by tightening the descriptions:

- press: documents that key names are case-sensitive Playwright keys,
  shows modifier syntax (Shift+Enter, Control+A), links the full key
  list. Removes the "is this case-sensitive?" guesswork.
- is: documents that <sel> accepts either a CSS selector OR an @ref
  token from a prior snapshot, and that property values are case-
  sensitive.
- scroll: documents that there is no --by/--to amount option, points
  at `js window.scrollTo(0, N)` for pixel-precise scrolling.
- js / eval: clarifies that both run in the same JS sandbox, the
  difference is just inline expr (js) vs file (eval).
- storage: clarifies sessionStorage is read-only via this command,
  points at `js sessionStorage.setItem(...)` for the write path.
- chain: walks through how to invoke (pipe a JSON array of arrays to
  $B chain), confirms it stops at the first error.
- cdp: explains how to discover allowed methods (read cdp-allowlist.ts)
  + shows a concrete example invocation.
- domain-skill: explains that the "classifier flag" is set automatically
  by the L4 prompt-injection scan (agents do not set it manually);
  enumerates the full lifecycle verbs.

The 8th gap (storage set syntax conflict) is also resolved as part of
the storage rewrite.

Two pipe-character bugs caught by the existing
`no command description contains pipe character` guard at
`test/gen-skill-docs.test.ts:595`: the chain example originally used
`echo '[...]' | $B chain` (literal pipe) and the cdp description used
`tab|browser` / `trusted|untrusted` (also literal pipes). Both rewritten
to keep markdown table cells intact.

Verification: 696/0 pass on skill-validation + gen-skill-docs after
regen across all hosts. The CI llm-judge eval will re-run against the
new SKILL.md and should hit actionability ≥4 reliably.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(browser): rewrite BROWSER.md as complete reference

Full rewrite covering the gstack browser surface as of v1.19.0.0. Up from
488 to 1,299 lines, 26 top-level sections.

Adds previously-undocumented subsystems:

- The productivity loop: /scrape + /skillify with D1 (provenance guard),
  D2 (final-attempt-only synthesis), D3 (atomic-write discipline) contracts.
- Browser-skills runtime: anatomy, three-tier storage, scoped tokens, trust
  model (capability + env axes), sibling SDK distribution, atomic-write
  helper, bundled hackernews-frontpage reference.
- Domain-skills: per-site agent notes with quarantined → active → global
  state machine and the L4-classifier auto-promotion gate.
- Pair-agent: dual-listener architecture, 26-command tunnel allowlist,
  canDispatchOverTunnel pure gate, three token types (root, setup key,
  scoped), denial log path + salt model.
- Security stack L1-L6: layer table, thresholds (BLOCK/WARN/LOG_ONLY/
  SOLO_CONTENT_BLOCK), ensemble rule, classifier model paths, env knobs.
- Side Panel deep dive: Terminal pane (Claude PTY) as the primary surface
  with Activity/Refs/Inspector as debug overlays, WS auth via
  Sec-WebSocket-Protocol, gstackInjectToTerminal cross-pane plumbing.
- CDP escape hatch: $B cdp deny-default allowlist, $B inspect CSS inspector,
  $B ux-audit page structure extraction.
- Meta commands previously undocumented: tabs/frames/state/watch/inbox/
  tab-each, with usage and storage paths.
- Authentication: three token types with lifetimes, SSE session cookie,
  PTY session cookie, token registry behavior.
- Full source map: 30+ file inventory of browse/src/ vs the old 11-file
  list.

Preserves from before: architecture diagram, daemon lifecycle, snapshot
ref staleness, screenshot modes, goto file:// vs load-html semantics,
batch endpoint, JS await wrapping, env vars, performance numbers vs MCP,
Playwright acknowledgments, dev guide.

Cross-links to ARCHITECTURE.md, CLAUDE.md, docs/REMOTE_BROWSER_ACCESS.md,
docs/designs/BROWSER_SKILLS_V1.md, scrape/SKILL.md, skillify/SKILL.md,
TODOS.md so anyone landing on BROWSER.md can navigate to the load-bearing
companion docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(server): tab-ownership gate keys on tabPolicy, not isWrite

Browser-skill spawns hit `403: Tab not owned by your agent` on every
first run because the gate at server.ts:639 fired for any non-root
write, regardless of the token's tabPolicy. The bundled
hackernews-frontpage reference skill failed identically. Every
/skillify-generated skill failed identically. The user's natural
tabs have no claimed owner — by design — so any skill driving
them via `goto` (a write) was 403'd.

The intent in skill-token.ts:79 was always correct: `tabPolicy: 'shared'`
with the comment "skill scripts may switch tabs as needed." The
enforcement just ignored it.

Two surgical changes:

browser-manager.ts:checkTabAccess — gate now keys on options.ownOnly
only. Shared-policy tokens (skill spawns, default scoped clients) get
permissive access — root-equivalent for the tab gate. Own-only tokens
(pair-agent over the ngrok tunnel) still require ownership for every
read and write. isWrite stays in the signature for callers that want
to log or branch elsewhere; it no longer gates the decision.

server.ts:639 — gate predicate narrowed from
  (WRITE_COMMANDS.has(command) || tokenInfo.tabPolicy === 'own-only')
to just
  tokenInfo.tabPolicy === 'own-only'
The 'newtab' exemption stays. Shared tokens skip the gate entirely;
own-only tokens still hit it. Comment block above the gate updated to
document the new predicate intent.

Pair-agent isolation is intact. Tunnel tokens still default to
tabPolicy: 'own-only', still must `newtab` first to get a tab they
can drive, still can't dispatch any of the 23 commands outside the
tunnel allowlist.

The capability gate (scope checks) and rate limits already constrain
what local scoped clients can do; tab ownership was never a security
boundary for them — only for pair-agent. This release makes the
enforcement match the original design intent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(server): lock the shared-vs-own-only tab gate contract

The pre-fix tests at tab-isolation.test.ts:43,57 encoded the broken
behavior as the contract — they specifically asserted "scoped agent
cannot write to unowned tab," which was the exact failure mode that
broke browser-skills. They passed because they tested the wrong
invariant.

This commit replaces those tests with explicit shared-vs-own-only
coverage that documents what each policy actually means:

- Shared scoped agents (skill spawns, default scoped clients) can
  read AND write any tab — unowned, their own, or another agent's.
  The capability is gated by scope checks + rate limits, not by tab
  ownership.
- Own-only scoped agents (pair-agent over tunnel) cannot read OR
  write any tab they don't own. Pre-fix this case was conflated with
  shared writes; now it's explicit.

9 unit assertions on checkTabAccess, up from 6. Each test names
the policy axis it's covering so a future refactor can't quietly
flip the contract.

Adds source-shape regression test 10a in server-auth.test.ts:
"tab gate predicate is own-only-scoped, not write-scoped." The
gate's `if (...)` line MUST contain `tabPolicy === 'own-only'` and
MUST NOT contain `WRITE_COMMANDS.has(command) ||`. If a future
refactor re-introduces the write-scoped gate, this fails immediately
in free-tier `bun test`.

Updates the marker for the existing newtab-excluded test to match
the new comment block ("Tab ownership check (own-only tokens /
pair-agent isolation)").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* release: v1.19.0.0 -> v1.20.0.0 — fix tab-ownership footgun

Patch release on top of v1.19.0.0. The shipping headline of v1.19.0.0
(/scrape + /skillify productivity loop) was broken on first run in any
session where the daemon already had a tab. Bundled
hackernews-frontpage failed identically. Every /skillify-generated
skill failed identically.

The fix narrows the tab-ownership gate from "any non-root write" to
"tabPolicy === 'own-only' only." Pair-agent isolation (the v1.6.0.0
threat model) is intact; local skill spawns get their original
behavior back.

VERSION: 1.19.0.0 -> 1.20.0.0
package.json version: synced.

CHANGELOG entry leads with the user-visible impact: the productivity
loop works again, no half-second-stalls of confused 403s. Includes
before/after metrics on the bundled reference skill and the broken-
contract pre-fix tests that hid the regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(claude): sharpen CHANGELOG rule — diff between main and ship

Codifies what was already implicit in the existing "Never orphan
branch-internal versions" + "Only document what shipped between main
and this change" sections, but with sharper language and concrete
NEVER examples.

The rule: a CHANGELOG entry is the diff between main and the shipping
branch — what users get when they upgrade. NOT how the branch got
there. Branch-internal version bumps, mid-branch bug fixes, plan
review outcomes, and patch narratives all belong in PR descriptions
and commit messages, not in CHANGELOG.

Adds explicit examples of phrasing to NEVER use:
  - "v1.X had a bug that v1.Y fixes" (mentions a branch-internal version)
  - "The shipping headline of v1.X was broken because..." (apologizes
    for never-released state)
  - "Pre-fix tests encoded the broken behavior" (contributor's victory
    lap, not user benefit)
  - "Two surgical edits, both in the dispatch path" (micro-narrative
    of the patch)

The constructive replacement: describe the released system as a
property, not as a fix. "Browser-skills run end-to-end with the
expected tab-access semantics." If a property is worth calling out,
document it in the trust-model section, not as a "we fixed X" callout.

Pairs with feedback_no_shame_changelog and
feedback_changelog_harden_against_critics memories — entries should
read as a flex even to a hostile screenshotter, never admit prior
breakage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(changelog): consolidate v1.20.0.0 as the diff vs main

Rewrites the v1.20.0.0 entry to describe what users get when they
upgrade from main (v1.17.0.0) to this release: browser-skills
end-to-end. Drops all branch-internal narrative — Phase 1 / Phase 2a
labels, the v1.8.0.0 P1 history paragraph, the test-counts-by-phase
split, and the patch micro-narrative for the tab-policy semantics.

The previously-separate v1.19.0.0 entry (a branch-internal version
that never landed on main) collapses into v1.20.0.0 per the
"Never orphan branch-internal versions" rule.

Tab-access policies are now documented as a property of the trust
model: `'shared'` (skill spawns) is permissive, `'own-only'`
(pair-agent over the tunnel) is strict. No "fix" framing, no
mention of an intermediate state where it was broken.

Adds the BROWSER.md rewrite and the new tab-isolation +
server-auth source-shape regression tests to the itemized changes.

The reverse-chronological order remains: v1.20.0.0 → v1.17.0.0 →
v1.16.0.0 → v1.15.0.0 → ... Gaps (v1.18, v1.19) are fine — those
were branch-internal version numbers that never landed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
shelman09 pushed a commit to Namleh-Studios/gstack that referenced this pull request May 19, 2026
…mFile guard

Covers PR garrytan#1169 bugs garrytan#6 and garrytan#7:

- security-classifier-download-cleanup.test.ts pins downloadFile error-path
  cleanup against three failure shapes: reader rejects mid-stream, non-2xx
  response, missing body. Asserts the dest file is not created and no
  <dest>.tmp.* siblings remain (glob-matched, not exact path — codex push:
  if the fix later switches to mkdtempSync, the assertion still holds).
  Includes a happy-path case so the cleanup isn't fighting a correct download.

- regression-pr1169-pdf-from-file-invalid-json.test.ts pins parsePdfFromFile
  to throw a helpful error for: invalid JSON, empty file, top-level array,
  top-level number, top-level string, top-level null, top-level boolean.
  Codex push: JSON.parse accepts primitives too, so Array.isArray + typeof
  guard must be tested separately from the JSON.parse try/catch.

Both files use mkdtempSync(process.cwd()/...) for fixture isolation since
SAFE_DIRECTORIES allows TEMP_DIR or cwd; cwd is universal across CI hosts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request May 20, 2026
…s (PR #1169 follow-up) (#1592)

* fix(build-app): escape sed replacement metachars in Chromium rebrand

build-app.sh injects \$APP_NAME directly into the replacement half of
sed's s/// when patching Chromium's localized InfoPlist.strings. If
\$APP_NAME ever carries '/', '&', or '\\' — the command either breaks
or starts interpreting input as sed syntax. The trailing '|| true'
would then silently hide the failure and ship a DMG that still says
'Google Chrome for Testing' in the menu bar.

Escape replacement metachars before substitution. No change for the
default name 'GStack Browser'.

* fix(build-app): bail out if 'mktemp -d' fails instead of cp-ing into '/'

The DMG creation step sets DMG_TMP from 'mktemp -d' with no error check.
If mktemp fails (tmpfs full, permissions, TMPDIR misconfigured), DMG_TMP
is empty and the very next line — 'cp -a "\$APP_DIR" "\$DMG_TMP/"' —
expands to 'cp -a "<app>" "/"', which copies the bundle into the root of
the filesystem.

Refuse to continue unless mktemp produced a real directory. Defensive
second check catches the (rare) case where mktemp succeeds but returns
something that isn't a directory we can cp into.

* fix(telemetry-sync): drop predictable $$ tmp-file fallback

gstack-telemetry-sync tried 'mktemp /tmp/gstack-sync-XXXXXX' and on
failure fell back to '/tmp/gstack-sync-$$'. $$ is the PID — predictable
and reusable, so on shared hosts another user can pre-create or symlink
the path and either steal the response body or clobber an unrelated
file when curl writes through it.

Drop the fallback. If mktemp cannot produce a unique file we just skip
this sync cycle — the events stay on disk and the next run picks them
up. Also install an EXIT trap so the response file is cleaned up on
unexpected exit, not just on the happy path.

* fix(verify-rls): drop predictable $$-based tmp file fallback

Same shape as gstack-telemetry-sync: on mktemp failure the script fell
back to '/tmp/verify-rls-$$-$TOTAL', which is fully predictable from the
PID and a per-check counter. On a shared box another user can pre-create
or symlink the path and either capture the HTTP response body (which may
leak what the RLS tests revealed) or corrupt an unrelated file that curl
writes through.

Make mktemp strict. On failure return from the check function; the caller
tallies a FAIL and the run moves on.

* fix(security-classifier): close writer + delete tmp on download error

downloadFile() opens an fs.WriteStream to '<dest>.tmp.<pid>' and drives
it from a fetch body reader, but if reader.read() or writer.write()
throws mid-download the writer is never closed. That leaks an FD per
failed attempt and leaves the half-written tmp on disk. A later retry
can land in renameSync(tmp, dest) with a truncated TestSavantAI /
DeBERTa ONNX file — which then loads but produces garbage classifier
verdicts until the user manually nukes the models cache.

Wrap the download loop in try/catch. On failure, destroy() the writer
and unlink the tmp before rethrowing, so the next attempt starts from a
clean slate.

* fix(meta-commands): guard JSON.parse in pdf --from-file parser

parsePdfFromFile() runs JSON.parse on user-supplied file contents with
no try/catch. A malformed payload surfaces as an uncaught SyntaxError
from the 'pdf' command handler and the user sees an opaque stack trace
instead of "this file isn't valid JSON". Worse, the same call path is
used by make-pdf when header/footer HTML would overflow Windows'
CreateProcess argv cap, so a corrupt payload file there can take down
the make-pdf run.

Wrap JSON.parse. Re-throw with a message that names the offending file
and echoes the parser's own explanation. Also reject top-level non-
objects (null, array, primitive) since the rest of the function treats
json as an object — catching that here produces a clear error instead
of a TypeError further down.

* fix(global-discover): stop dropping sessions when header >8KB

extractCwdFromJsonl() reads the first 8KB of each JSONL session file and
runs JSON.parse on every newline-split line. When a session record
happens to straddle the 8KB cap, the last line ends in a truncated JSON
fragment, JSON.parse throws, the catch block 'continue's silently, and
if that was the only line carrying 'cwd' the whole project gets dropped
from the discovery output without a warning.

Two independent hardening steps:
  1. Raise the read cap to 64KB. Session headers observed in Claude
     Code / Codex / Gemini transcripts fit comfortably; this just moves
     the cliff out of the normal range.
  2. Drop the final segment after splitting on '\\n'. If the read hit
     the cap mid-line, that segment is guaranteed incomplete; if the
     file ended inside the buffer, the split produces an empty final
     segment and dropping it is a no-op.

Together these make the parser robust regardless of how verbose the
leading records are.

* test: export downloadFile, parsePdfFromFile, extractCwdFromJsonl

These three internal helpers are now imported by regression tests
landing in the next commits (PR #1169 follow-up). Pattern matches the
existing normalizeRemoteUrl export in gstack-global-discover.ts which
test/global-discover.test.ts already imports side-effect-free.

No change to runtime behavior; gstack has no public package entrypoint
that would re-export these, so the in-repo surface is unchanged for
callers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(security-classifier): await writer close before unlinking tmp on error

The earlier downloadFile() error-path cleanup hit a race: Node's
createWriteStream lazily opens the FD and flushes buffered writes during
destroy(), so a naive `fs.unlinkSync(tmp)` immediately after `writer.destroy()`
hits ENOENT (file not yet on disk), then the writer's destroy finishes on the
next tick and creates the file fresh — leaving the half-written tmp behind
exactly as the original fix tried to prevent.

The new sequence awaits the writer's 'close' event before unlinking, so the FD
is fully torn down and no subsequent flush can re-create the path.

Caught by browse/test/security-classifier-download-cleanup.test.ts in the
next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browse): regression tests for downloadFile cleanup + parsePdfFromFile guard

Covers PR #1169 bugs #6 and #7:

- security-classifier-download-cleanup.test.ts pins downloadFile error-path
  cleanup against three failure shapes: reader rejects mid-stream, non-2xx
  response, missing body. Asserts the dest file is not created and no
  <dest>.tmp.* siblings remain (glob-matched, not exact path — codex push:
  if the fix later switches to mkdtempSync, the assertion still holds).
  Includes a happy-path case so the cleanup isn't fighting a correct download.

- regression-pr1169-pdf-from-file-invalid-json.test.ts pins parsePdfFromFile
  to throw a helpful error for: invalid JSON, empty file, top-level array,
  top-level number, top-level string, top-level null, top-level boolean.
  Codex push: JSON.parse accepts primitives too, so Array.isArray + typeof
  guard must be tested separately from the JSON.parse try/catch.

Both files use mkdtempSync(process.cwd()/...) for fixture isolation since
SAFE_DIRECTORIES allows TEMP_DIR or cwd; cwd is universal across CI hosts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(global-discover): regression for extractCwdFromJsonl 64KB cap

PR #1169 bug #8: the 8KB read cap landed mid-line on Claude Code session
headers, JSON.parse threw on the truncated tail, the catch silently
continued, and the project disappeared from /gstack discovery output.

Six new cases under describe("extractCwdFromJsonl 64KB cap"):

- happy path: small JSONL with obj.cwd returns it
- 12KB first line with obj.cwd: returns cwd (the bug case)
- 80KB single line overflowing 64KB: returns null without crashing
- complete line followed by partial second line: trailing-partial-drop
  must not poison the result; returns first line's cwd
- missing file: returns null (file read error swallowed)
- malformed first line + valid second line within cap: skips bad,
  returns second's cwd

Tests use the exported extractCwdFromJsonl (added in earlier export
commit) and live in a separate describe block from the existing
"4KB / 128KB buffer" tests, which exercise the unrelated scanCodex
meta.payload.cwd path at L338 — different function, different bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: regression tests for shell-script bugs in PR #1169 (#2-#5)

Two new test files pinning the four shell-script invariants from the
external audit:

regression-pr1169-build-app-sed.test.ts — bugs #2 + #3
- Runtime isolation: extracts the sed-escape sequence from build-app.sh
  and runs it against hostile $APP_NAME values ("Foo/Bar&Baz", "Cool\App",
  "A/B\C&D"). Asserts the literal hostile name round-trips through a real
  `sed s///` invocation, locking the metachar safety end-to-end.
- Static check: the rebrand block must contain both the escape line AND
  the sed line referencing $APP_NAME_SED_ESCAPED; bare $APP_NAME
  interpolation directly into the s/// replacement is rejected.
- Static check: DMG_TMP=$(mktemp -d) is followed by an explicit `|| { ... exit }`
  failure handler AND a `[ -z "$DMG_TMP" ] || [ ! -d "$DMG_TMP" ]` validation
  AND the cp -a appears AFTER both guards.
- Runtime fake-bin: extracts the guard shape, runs with a fake mktemp that
  exits 1, asserts the script exits non-zero before any cp block can reach.

regression-pr1169-mktemp-fallbacks.test.ts — bugs #4 + #5
- Per codex pushback, the invariant is "no `mktemp ... || echo <path>`
  fallback shape" — not just "no $$ token." That's a stronger invariant
  that catches future swaps to $RANDOM or hardcoded paths.
- For each of bin/gstack-telemetry-sync and supabase/verify-rls.sh:
  - no echo-based fallback after mktemp
  - no $$ inside any /tmp path literal
  - mktemp failure path explicitly exits / returns non-zero
  - telemetry-sync also pins the `trap rm -f $RESP_FILE EXIT` cleanup
    so success paths don't leak the tmp on normal exit.

All seven new test files are gate-tier (deterministic, sub-second, no LLM,
no network). Runtime shell tests use fake-bin PATH stubs in temp dirs;
no $HOME mutation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.41.1.0)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: RagavRida <ragavrida@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mathiasmora2232 pushed a commit to mathiasmora2232/gstack that referenced this pull request May 30, 2026
…rrytan#1233)

* feat(gbrain-sync): queue primitives + writer shims

Adds bin/gstack-brain-enqueue (atomic append to sync queue) and
bin/gstack-jsonl-merge (git merge driver, ts-sort with SHA-256 fallback).
Wires one backgrounded enqueue call into learnings-log, timeline-log,
review-log, and developer-profile --migrate. question-log and
question-preferences stay local per Codex v2 decision.

gstack-config gains gbrain_sync_mode (off/artifacts-only/full) and
gbrain_sync_mode_prompted keys, plus GSTACK_HOME env alignment so
tests don't leak into real ~/.gstack/config.yaml.

* feat(gbrain-sync): --once drain + secret scan + push

bin/gstack-brain-sync is the core sync binary. Subcommands: --once
(drain queue, allowlist-filter, privacy-class-filter, secret-scan
staged diff, commit with template, push with fetch+merge retry),
--status, --skip-file <path>, --drop-queue --yes, --discover-new
(cursor-based detection of artifact writes that skip the shim).

Secret regex families: AWS keys, GitHub tokens (ghp_/gho_/ghu_/ghs_/
ghr_/github_pat_), OpenAI sk-, PEM blocks, JWTs, bearer-token-in-JSON.
On hit: unstage, preserve queue, print remediation hint (--skip-file
or edit), exit clean. No daemon — invoked by preamble at skill
boundaries.

* feat(gbrain-sync): init, restore, uninstall, consumer registry

bin/gstack-brain-init: idempotent first-run. git init ~/.gstack/,
.gitignore=*, canonical .brain-allowlist + .brain-privacy-map.json,
pre-commit secret-scan hook (defense-in-depth), merge driver registration
via git config, gh repo create --private OR arbitrary --remote <url>,
initial push, ~/.gstack-brain-remote.txt for new-machine discovery,
GBrain consumer registration via HTTP POST.

bin/gstack-brain-restore: safe new-machine bootstrap. Refuses clobber
of existing allowlisted files, clones to staging, rsync-copies tracked
files, re-registers merge drivers (required — not cloned from remote),
rehydrates consumers.json, prompts for per-consumer tokens.

bin/gstack-brain-uninstall: clean off-ramp. Removes .git + .brain-*
files + consumers.json + config keys. Preserves user data (learnings,
plans, retros, profile). Optional --delete-remote for GitHub repos.

bin/gstack-brain-consumer + bin/gstack-brain-reader (symlink alias):
registry management. Internal 'consumer' term; user-facing 'reader'
per DX review decision.

* feat(gbrain-sync): preamble block — privacy gate + boundary sync

scripts/resolvers/preamble/generate-brain-sync-block.ts emits bash that
runs at every skill invocation:
- Detects ~/.gstack-brain-remote.txt on machines without local .git
  and surfaces a restore-available hint (does NOT auto-run restore).
- Runs gstack-brain-sync --once at skill start to drain any pending
  writes (and at skill end via prose instruction).
- Once-per-day auto-pull (cached via .brain-last-pull) for append-only
  JSONL files.
- Emits BRAIN_SYNC: status line every skill run.

Also emits prose for the host LLM to fire the one-time privacy
stop-gate (full / artifacts-only / off) when gbrain is detected and
gbrain_sync_mode_prompted is false. Wired into preamble.ts composition.

* test(gbrain-sync): 27-test consolidated suite

test/brain-sync.test.ts covers:
- Config: validation, defaults, GSTACK_HOME env isolation
- Enqueue: no-op gates, skip list, concurrent atomicity, JSON escape
- JSONL merge driver: 3-way + ts-sort + SHA-256 fallback
- Init + sync: canonical file creation, merge driver registration,
  push-reject + fetch+merge retry path
- Init refuses different remote (idempotency)
- Cross-machine restore round-trip (machine A write → machine B sees)
- Secret scan across all 6 regex families (AWS, GH, OpenAI, PEM, JWT,
  bearer-JSON). --skip-file unblock remediation
- Uninstall removes sync config, preserves user data
- --discover-new idempotence via mtime+size cursor

Behaviors verified via integration smokes during implementation. Known
follow-up: bun-test 5s default timeout needs 30s wrapper for
spawnSync-heavy tests.

* docs(gbrain-sync): user guide + error lookup + README section

docs/gbrain-sync.md: setup walkthrough, privacy modes, cross-machine
workflow, secret protection, two-machine conflict handling, uninstall,
troubleshooting reference.

docs/gbrain-sync-errors.md: problem/cause/fix index for every
user-visible error. Patterned on Rust's error docs + Stripe's API
error reference.

README.md: 'Cross-machine memory with GBrain sync' section near the
top (discovery moment), plus docs-table entry.

* chore: bump version and changelog (v1.7.0.0)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore: regenerate SKILL.md files for gbrain-sync preamble block

Re-runs bun run gen:skill-docs after adding generateBrainSyncBlock
to scripts/resolvers/preamble.ts in a2aa8a0. CI check-freshness
caught the drift. All 36 SKILL.md files regenerated with the new
skill-start bash block + privacy-gate prose + skill-end sync
instructions baked in.

* fix(test): session-awareness reads AskUserQuestion Format from a Tier 2+ SKILL.md

The test was reading ROOT/SKILL.md (browse skill, Tier 1) which never
contained '## AskUserQuestion Format' — that section is only emitted
for Tier 2+ skills by scripts/resolvers/preamble.ts. As a result the
agent was prompted with an empty format guide and only emitted
'RECOMMENDATION' intermittently, making the test flaky.

Pre-existing on main (same ROOT/SKILL.md shape there) — surfaced now
because the agent run didn't hit the RECOMMENDATION/recommend/option a
fallback strings in this particular attempt.

Fix: read from office-hours/SKILL.md (Tier 3, always has the section)
with a fallback that scans for the first top-level skill dir whose
SKILL.md contains the header. Future template moves won't break this
test again.

* feat(browse): domain-skills storage + state machine

New module browse/src/domain-skills.ts implements the per-site notes
the agent writes for itself, persisted as type:"domain" rows alongside
/learn's per-project learnings.

Three scopes layered: per-project default, global by explicit promotion.
Project-active shadows global for the same host.

State machine (T6 — codex outside-voice):
  quarantined --3 uses w/o flag--> active(project) --promote--> global
        ^                                |
        +----- classifier flag during use

- Append-only JSONL with O_APPEND for atomic small writes
- Tolerant parser drops partial trailing line on read
- Tombstone for deletes (compactor cleans up later)
- Version log per (host, scope) enables rollback
- Hostname derived from active tab top-level origin (T3 confused-deputy fix)
- writeSkill rejects classifier_score >= 0.85 with structured error

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browse): domain-skills storage + state machine

14 tests covering:
- T3 hostname normalization (lowercase, www. strip, port/path/query strip,
  subdomain-exact preserved)
- T4 scope shadowing (per-project active shadows global for same host)
- T5 persistence (version monotonicity, tolerant parser drops partial line)
- T6 state machine (quarantined → active after N=3 uses, classifier-flag
  blocks promotion, save-time score >= 0.85 rejected)
- Rollback by version log (restore prior body, advance version counter)
- Tombstone deletion (read returns null after delete)

All 14 pass in 27ms via bun test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): $B domain-skill subcommands

Wire the domain-skills storage layer into the browse CLI as a META command:

  $B domain-skill save              save body from stdin or --from-file
                                     (host derived from active tab — T3)
  $B domain-skill list              list all skills visible to current project
  $B domain-skill show <host>       print skill body
  $B domain-skill edit <host>       open in $EDITOR
  $B domain-skill promote-to-global <host>  cross-project promotion (T4)
  $B domain-skill rollback <host> [--global]  restore prior version
  $B domain-skill rm <host> [--global]        tombstone

Save path runs L1-L3 content filters from content-security.ts (importable
in compiled binary, unlike L4 ML classifier — see CLAUDE.md). The L4
classifier scan happens in sidebar-agent at prompt-injection load time.

Output is structured (problem + cause + suggested-action) per DX D7.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): $B cdp escape hatch — deny-default allowlist + two-tier mutex

Codex T2: flip CDP posture to deny-default. Allowed methods enumerated in
cdp-allowlist.ts with (scope: tab|browser, output: trusted|untrusted,
justification) per entry.

Initial allowlist (~25 methods) covers:
- Accessibility tree extraction (read-only)
- DOM/CSS inspection (read-only)
- Performance metrics
- Tracing
- Emulation viewport/UA override
- Page screenshot/PDF capture (output is binary, no marker injection vector)
- Network.enable/disable (no bodies/cookies — those are exfil surfaces)
- Runtime.getProperties (NO evaluate/callFunctionOn — those would be RCE)

Page.navigate is INTENTIONALLY NOT allowed; agents use $B goto which
goes through the URL blocklist.

Codex T7: two-tier mutex. tab-scoped methods take per-tab lock; browser-
scoped take global lock that blocks all tab locks. 5s acquire timeout
yields CDPMutexAcquireTimeout (no silent hangs). All lock acquires use
try/finally so errors don't leak the lock.

Path A from spike: uses Playwright's newCDPSession() per page. No second
WebSocket, no need for --remote-debugging-port. CDPSession is cached
per page in a WeakMap and cleared on page close.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browse): CDP allowlist + two-tier mutex

13 tests:
- Allowlist linter: every entry has 4 required fields, no duplicates,
  justification length > 20 chars
- Deny-list verification: dangerous methods (Runtime.evaluate, Page.navigate,
  Network.getResponseBody, Browser.close, Target.attachToTarget, etc.) are
  NOT allowed (Codex T2 categories 4-7)
- Per-tab mutex serializes ops on same tab
- Per-tab mutex allows parallel ops across different tabs
- Global lock blocks tab locks; tab locks block global lock
- Acquire timeout yields CDPMutexAcquireTimeout (no silent hang)
- Timeout error names the tab id and the timeout budget

Also extends Network.disable justification to satisfy linter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): telemetry signals + project-slug helper

Lightweight telemetry per DX D9: piggybacks on ~/.gstack/analytics/ pattern.
Hostname + aggregate counters only, no body content. GSTACK_TELEMETRY_OFF=1
silences. Fire-and-forget — never blocks calling path.

Signals fired so far:
- domain_skill_saved {host, scope, state, bytes}
- domain_skill_save_blocked {host, reason}

(domain_skill_fired and cdp_method_* fired in subsequent commits.)

Also extracts project-slug resolution into project-slug.ts so server.ts
and domain-skill-commands.ts share one cached lookup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): sidebar prompt-context injection + CDP telemetry

server.ts spawnClaude now:
- Imports per-project domain skill matching the active tab's hostname
  via readDomainSkill()
- Wraps the body in UNTRUSTED EXTERNAL CONTENT envelope (so the L4
  classifier in sidebar-agent sees it at load time per Eng D4)
- Appends as <domain-skill source="..." host="..." version="..."> block
- Fires domain_skill_fired telemetry (host, source, version)
- Calls recordSkillUse fire-and-forget so the auto-promote-after-N=3
  state machine advances on each successful prompt injection

System prompt also gets a one-liner introducing $B domain-skill commands
to agents (DX D4 start-of-task discoverability hint).

cdp-bridge.ts fires:
- cdp_method_denied (drives next allow-list growth)
- cdp_method_lock_acquire_ms (P50/P99 quantile observability)
- cdp_method_called (allowed methods)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browse): telemetry module

3 tests covering:
- logTelemetry writes JSONL with ts injected
- GSTACK_TELEMETRY_OFF=1 silences all events
- logTelemetry never throws on disk failures

Uses GSTACK_HOME env var to redirect writes to a tmp dir; the telemetry
module reads HOME lazily so test mutations take effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: domain-skills reference + error lookup table

docs/domain-skills.md mirrors the layered shape of docs/gbrain-sync.md
(DX D8): how agents use it, state machine, storage layout, security model
(L1-L3 + L4 layered defense), error reference table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(readme): browser-harness-js plug + domain-skills section

New "Domain skills + raw CDP escape hatch" section under "The sprint"
covering both v1.8.0.0 features. Plugs browser-use/browser-harness-js
as the no-rails alternative for users who want raw CDP without gstack's
security stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.8.0.0)

Branch-scoped bump on top of merged 1.7.0.0 base. CHANGELOG entry covers
the full v1.8.0.0 scope: $B domain-skill, $B cdp escape hatch, two-tier
mutex, telemetry signals, sidebar prompt-context injection. Includes
Codex outside-voice trail (7 of 20 findings resolved, 12 mooted by T1
scope drop).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* todos: 7 follow-ups from v1.8.0.0 review trail

P1: Self-authoring $B commands with out-of-process worker isolation
    (Codex T1 deferred from v1.8.0.0 — needs real isolation design)
P2: Migrate /learn to SQLite (Codex T5 long-term primitive fix)
P2: Remove plan-mode handshake from /plan-devex-review (skill bug)
P3: GBrain skillpack publishing for domain-skills
P3: Replay/record demonstrated flows to domain-skills
P3: $B commands review batch-mode UX (alternative to inline approval)
P3: Heuristic command-gap watcher (DX D4 alternative C)

Each entry has the standard What/Why/Pros/Cons/Context/Effort/Priority/
Depends-on shape so anyone picking these up later has full context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(browse): lazy GSTACK_HOME resolution in domain-skills

Module-level constants (GLOBAL_FILE, derived path) were evaluated at
module-load and cached. When E2E and unit tests run in the same Bun
test pass and set GSTACK_HOME differently, the second test sees the
first test's path. Switch to lazy gstackHome() / globalFile() / projectFile()
helpers so process.env mutations take effect.

Mirrors the pattern already used in telemetry.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browse): E2E gate-tier tests for domain-skills + CDP

domain-skills-e2e.test.ts (4 tests):
- save derives host from active tab top-level origin (T3)
- save lands quarantined; list surfaces it
- readSkill returns null until 3 uses without flag promote to active (T6)
- save without an active page errors with structured guidance

cdp-e2e.test.ts (8 tests):
- Accessibility.getFullAXTree returns wrapped JSON (allowed, untrusted-output)
- Performance.getMetrics returns plain JSON (allowed, trusted-output)
- Runtime.evaluate DENIED with structured guidance (T2 RCE block)
- Page.navigate DENIED (must use $B goto for blocklist routing)
- Network.getResponseBody DENIED (exfil block)
- malformed JSON params surfaces clear error
- non Domain.method format surfaces clear error
- $B cdp help returns help text

Both files boot a real Chromium via BrowserManager.launch() and exercise
the dispatch handlers end-to-end. Total 12 E2E tests in <2s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: regenerate SKILL.md files with new $B commands

bun run gen:skill-docs picks up the domain-skill and cdp META_COMMANDS
entries added in commands.ts. Both top-level SKILL.md and browse/SKILL.md
now list the new commands in their Meta and Inspection tables.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(fixtures): regenerate ship SKILL.md golden baselines for v1.7.0.0

Pre-existing failures inherited from garrytan/gbrain-support: the GBrain
Sync preamble block (added in v1.7.0.0) appears in regenerated SKILL.md
output but the golden baselines in test/fixtures/golden/ were never
updated. Three failures fixed:

  golden-file regression > Claude ship skill matches golden baseline
  golden-file regression > Codex ship skill matches golden baseline
  golden-file regression > Factory ship skill matches golden baseline

Goldens regenerated by copying the current ship/SKILL.md, codex
.agents/skills/gstack-ship/SKILL.md, and .factory/skills/gstack-ship/SKILL.md
files. Diff is the v1.7.0.0 GBrain Sync preamble block + privacy stop-gate
(no behavioral changes — just preamble text).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(brain-sync): bearer-token regex catches values with leading space

Pre-existing bug from v1.7.0.0: the bearer-token-json secret pattern
required values matching [A-Za-z0-9_./+=-]{16,}, which rejected the
"Bearer <token>" form because the literal space after "Bearer" wasn't
in the character class. Real Authorization headers use "Bearer <token>"
syntax, and the test fixture
  '"authorization":"Bearer abcdef1234567890abcdef1234567890"'
sat unscanned despite being a leak-class secret.

One-character fix: add space to the value character class. Test
'gstack-brain-sync secret scan > blocks bearer-json' now passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(brain-sync): GSTACK_HOME isolation test compares mtime, not content

Pre-existing flaky test: the GSTACK_HOME-overrides-real-config test asserted
the real ~/.gstack/config.yaml does NOT contain "gbrain_sync_mode: full"
after the test. That fails for any user whose real config legitimately has
that key set from prior usage — the test's invariant is "the command did
not modify the real file," not "the real file lacks any specific value."

Switch to mtime + content snapshot: capture both BEFORE running the command,
then verify both are unchanged after. Also add a positive assertion that
the tmpHome config DID get the new key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(skill-validation): exempt deliberate large fixtures from 2MB limit

Pre-existing failure: the "git tracks no files larger than 2MB" test
caught browse/test/fixtures/security-bench-haiku-responses.json (28.8MB
of replay data committed in v1.6.4.0 for security benchmark gate tests).

The test exists to catch accidentally-committed binaries (Mach-O dist
binaries, etc), not to forbid all large files. Add an explicit
LARGE_FIXTURE_EXEMPTIONS allowlist so deliberate replay fixtures pass
the gate while accidental binaries still fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(skill-token): mint scoped tokens per skill spawn

Wraps token-registry.createToken/revokeToken with skill-specific
clientId encoding (skill:<name>:<spawn-id>) and read+write defaults.
Skill scripts get a per-spawn capability token bound to browser-driving
commands; the daemon root token never leaves the harness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse-client): SDK for browser-skill scripts

Thin wrapper over POST /command with bearer auth. Resolves daemon
port + token from GSTACK_PORT + GSTACK_SKILL_TOKEN env vars first
(set by $B skill run when spawning), falls back to .gstack/browse.json
for standalone debug runs.

Convenience methods cover the read+write surface skills typically need:
goto, click, fill, text, html, snapshot, links, forms, accessibility,
attrs, media, data, scroll, press, type, select, wait, hover, screenshot.
Low-level command(cmd, args) escape hatch for anything else.

This is the canonical SDK source. Each browser-skill ships a sibling
copy at <skill>/_lib/browse-client.ts so each skill is fully portable
and version-pinned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browser-skills): 3-tier storage helpers

listBrowserSkills() walks project > global > bundled (first-wins),
parses SKILL.md frontmatter, no INDEX.json. readBrowserSkill() does
the same for a single name. tombstoneBrowserSkill() moves a skill
into .tombstones/<name>-<ts>/ for recoverability.

Frontmatter parser handles the subset browser-skills need: scalars
(host, description, trusted, version, source), string lists
(triggers), and arg-mapping lists ([{name, description}, ...]).
Quoted values handle colons; trusted defaults to false.

Bundled tier path is auto-detected from the binary install location;
project tier comes from git rev-parse; global is ~/.gstack/. All tier
paths are overridable for hermetic tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browser-skills): \$B skill list/show/run/test/rm subcommands

handleSkillCommand dispatches to per-subcommand handlers; spawnSkill is
the load-bearing function that:

  1. Mints a per-spawn scoped token (read+write only) bound to the
     skill name + spawn-id.
  2. Builds the spawn env:
     - trusted: passes process.env minus GSTACK_TOKEN (defense in depth).
     - untrusted: minimal allowlist (LANG, LC_ALL, TERM, TZ) + locked
       PATH; explicitly drops anything matching TOKEN/KEY/SECRET/etc.
       Also drops AWS_/AZURE_/GCP_/GOOGLE_APPLICATION_/ANTHROPIC_/OPENAI_/
       GITHUB_/GH_/SSH_/GPG_/NPM_TOKEN/PYPI_ patterns.
   3. Always injects GSTACK_PORT + GSTACK_SKILL_TOKEN last (cannot be
     overridden by parent env).
  4. Spawns bun run script.ts -- <args> with cwd=skillDir, captures
     stdout (1MB cap), stderr, and timeout-kills past the deadline.
  5. Revokes the token in finally{}, always.

list output prints the resolved tier inline so "why did it run that
one?" never becomes a debugging mystery (Codex finding garrytan#4 mitigation).

server.ts threads the listen port to meta-commands via MetaCommandOpts.daemonPort.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browser-skills): bundled hackernews-frontpage reference skill

Smallest interesting browser-skill: scrapes HN front page, returns
30 stories as JSON. No auth, stable HTML, fully fixture-tested.

Files:
  SKILL.md                          frontmatter + prose
  script.ts                         exports parseStoriesFromHtml(html)
                                    main: goto + html + parse + JSON.stringify
  _lib/browse-client.ts             vendored copy of the SDK
  fixtures/hn-2026-04-26.html       captured front page (5 stories)
  script.test.ts                    13 assertions against the fixture

The parser is a pure function over HTML so script.test.ts runs
without a daemon (just imports parseStoriesFromHtml and asserts).

This exercises every Phase 1 component end-to-end:
  - browse-client SDK (script imports browse from ./_lib/)
  - 3-tier lookup (hackernews-frontpage lives in the bundled tier)
  - scoped tokens (read+write is enough for goto + html)
  - spawn lifecycle (\$B skill run hackernews-frontpage)
  - file-fixture testing (\$B skill test hackernews-frontpage)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(skill-validation): cover bundled browser-skills

Adds 7 assertions per bundled skill at <root>/browser-skills/<name>/:
  - SKILL.md exists
  - frontmatter parses with required fields (name/host/triggers/args)
  - script.ts exists
  - _lib/browse-client.ts exists and matches the canonical SDK byte-for-byte
  - script.test.ts exists
  - script.ts imports browse from ./_lib/browse-client

The byte-identical SDK check enforces the version-pinning contract:
when the canonical SDK at browse/src/browse-client.ts changes, every
bundled skill's _lib/ copy must be re-synced or this test fails.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(designs): add BROWSER_SKILLS_V1 design doc

Captures the 13 locked decisions, two-axis trust model (daemon-side
scoped tokens + process-side env access), 3-tier lookup, file
layout, and full responses to all 8 Codex outside-voice findings.
Includes Phase 2-4 sketches for future branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(todos): replace self-authoring-\$B P1 with browser-skills phases

Phase 1 of the browser-skills design shipped on this branch (sidesteps
the in-daemon isolation problem the original P1 was blocked on). The
new entries enumerate the work that remains:

  P1: Phase 2 (/scrape + /automate skill templates)
  P2: Phase 3 (resolver injection at session start)
  P2: Phase 4 (eval infra + fixture staleness + OS sandbox)

Cross-references docs/designs/BROWSER_SKILLS_V1.md for the full
architecture and the 8 Codex review findings + responses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* release: v1.9.0.0 — browser-skills runtime

VERSION 1.8.0.0 → 1.9.0.0. CHANGELOG entry leads with what humans
can do today (hand-write deterministic browser scripts, run them in
200ms via \$B skill run). Notes explicitly that agent authoring
lands in next release; no fabricated perf numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browser-skills-e2e): exercise dispatch with bundled hackernews-frontpage

Covers the full \$B skill list/show/test pipeline against the real
bundled reference skill (defaultTierPaths picks up <repo>/browser-skills/).
Verifies frontmatter shape, the three-tier walk surfaces the bundled
entry, and \$B skill test successfully runs the bundled script.test.ts
in a child bun process.

\$B skill run end-to-end against the live network is intentionally NOT
covered here (would be flaky against news.ycombinator.com); the spawn
lifecycle is exercised in browser-skill-commands.test.ts using inline
synthetic skills.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: regen SKILL.md to surface the skill META command

bun run gen:skill-docs picked up the new \`skill\` command from
COMMAND_DESCRIPTIONS in browse/src/commands.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* release: bump v1.9.0.0 → v1.13.0.0

Main shipped through v1.11.1.0 while this branch was in flight; v1.12.x
is presumed claimed by another in-flight branch. Use v1.13.0.0 as the
next available slot.

Updated VERSION, package.json, and the CHANGELOG header. Entry body
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* release: bump v1.13.0.0 → v1.16.0.0

Main shipped v1.13.0.0 (claude outside-voice skill), v1.14.0.0
(sidebar REPL), and v1.15.0.0 (slim preamble + plan-mode E2E)
while this branch was in flight. Use v1.16.0.0 as the next
available slot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse-skills): atomic write helper for /skillify (D3)

stageSkill writes a candidate skill into ~/.gstack/.tmp/skillify-<spawnId>/
with restrictive perms. commitSkill does an atomic fs.renameSync into the
final tier path with realpath/lstat discipline (refuses symlinked staging
dirs, refuses to clobber existing skills). discardStaged is the cleanup
path for test failures and approval rejections, idempotent and bounded
to the per-spawn wrapper. validateSkillName enforces lowercase/digits/
dashes only, no path-escape characters.

Implements the D3 contract from the v1.19.0.0 plan review: never a
half-written skill on disk. Test fail or approval reject = rm -rf the
temp dir, no tombstone for never-approved skills.

Closes Codex finding garrytan#5 (atomic skill packaging) for Phase 2a.

34 unit assertions covering: stage validation, file-path escape rejection,
permission check, atomic rename, clobber refusal, symlink refusal, project
tier unresolved, idempotent discard, end-to-end happy + simulated test
failure + approval reject paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(scrape): /scrape <intent> skill template

One entry point for pulling page data. Three paths under the hood:

1. Match — agent reads $B skill list, semantically matches the user's
   intent against each skill's triggers + description + host. Confident
   match = $B skill run <name> in ~200ms.
2. Prototype — no match, drive the page with $B goto/text/html/links etc.
   Return JSON, append a one-line "say /skillify" nudge.
3. Mutating refusal — verbs like submit/click/fill route to /automate
   (Phase 2b P0); /scrape is read-only by contract.

Match decision lives in the agent, not the daemon. No new code in
browse/src/, no expanded daemon command surface, no new prompt-injection
blast radius.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(skillify): /skillify codifies last /scrape into permanent skill

The productivity multiplier. /scrape discovers the flow; /skillify writes
it as deterministic Playwright-via-browse-client code so the next /scrape
on the same intent runs in ~200ms.

11-step flow with three locked contracts from the v1.19.0.0 plan review:

D1 — Provenance guard. Walk back ≤10 agent turns for a clearly-bounded
/scrape result. Refuse with one specific message if cold. No silent
synthesis from chat fragments.

D2 — Synthesis input slice. Extract ONLY the final-attempt $B calls that
produced the JSON the user accepted, plus the user's intent string. Drop
failed selectors, drop unrelated chat, drop earlier-session content.
Closes Codex finding garrytan#6 by picking option (b) from the design doc:
re-prompt from agent's own context, not a structured recorder.

D3 — Atomic write. Stage to ~/.gstack/.tmp/skillify-<spawnId>/, run
$B skill test against the temp dir, only rename into the final tier path
on test pass + user approval. Test fail or approval reject = rm -rf the
temp dir entirely.

Default tier: global (~/.gstack/browser-skills/<name>/). --project flag
overrides to per-project. Generated test must include at least one ★★
assertion (parsed JSON has expected shape + non-empty key fields), not a
smoke ★ assertion.

Bun runtime distribution (Codex finding garrytan#7) carries over to Phase 4.
Documented in the skill's Limits section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browser-skills): gate-tier E2E for /scrape + /skillify (D4)

Five scenarios cover the productivity loop and the contracts locked
during the v1.19.0.0 plan review:

  scrape-match-path           — intent matching bundled hackernews-frontpage
                                routes via $B skill run, no prototype phase
  scrape-prototype-path       — no matching skill, drives $B against a local
                                file:// fixture, returns JSON, suggests
                                /skillify
  skillify-happy-path         — /scrape then /skillify; skill written to
                                ~/.gstack/browser-skills/<name>/ with the
                                full file tree; SKILL.md prose body must
                                not contain conversation fragments (D2)
  skillify-provenance-refusal — cold /skillify with no prior /scrape refuses
                                with the D1 message; nothing on disk (D1)
  skillify-approval-reject    — /scrape then /skillify but reject in the
                                approval gate; temp dir is removed, nothing
                                at the final tier path (D3)

All five gate-tier (~$0.50-$1.50 each, ~$5 total per CI run). Set EVALS=1
to enable. Uses local file:// fixtures so prototype + skillify scenarios
run deterministically without network.

Touchfiles registers all 5 entries with proper deps on scrape/**,
skillify/**, browse/src/browser-skill-write.ts, and the Phase 1 runtime
modules. The match-path test depends on the bundled hackernews-frontpage
skill so its touchfile includes browser-skills/hackernews-frontpage/**.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(browser-skills): TODOS Phase 2a + design doc D1-D4 decisions

TODOS.md:
- Narrows existing P1 (was "/scrape and /automate") to "/scrape and
  /skillify" — the /scrape + /skillify wedge ships in this branch.
  Codex finding garrytan#6 (synthesis) removed from Cons (resolved by D2);
  finding garrytan#7 (Bun runtime) stays as the open carry-over.
- Adds new ## P0 above PACING_UPDATES_V0 for the /automate follow-up.
  Same skillify pattern as /scrape, different trust profile (per-step
  confirmation gate when running non-codified). Reuses /skillify and
  the D3 helper as-is. Effort M.

BROWSER_SKILLS_V1.md:
- Phase table re-organized into 1, 2a, 2b, 3, 4. Phase 1 + Phase 2a
  consolidate into v1.19.0.0 ship (the v1.16.0.0 branch-internal
  bump never landed on main).
- New "Phase 2a" sub-section captures the four decisions locked
  during /plan-eng-review:
    D1 — provenance guard (≤10 turn walk-back, refuse if cold)
    D2 — synthesis input slice (final-attempt $B calls only,
         closes Codex finding garrytan#6)
    D3 — atomic write discipline (temp-dir-then-rename via new
         browse/src/browser-skill-write.ts helper)
    D4 — full test scope (5 gate E2E + 1 unit + smoke)
- New "Phase 2b" sketch for /automate: same skillify machinery,
  per-mutating-step confirmation gate, deferred to next branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* release: v1.16.0.0 -> v1.19.0.0 — browser-skills Phase 1 + 2a

Consolidates the v1.16.0.0 branch-internal bump (Phase 1 runtime, never
landed on main) with Phase 2a (/scrape + /skillify + atomic-write helper)
into one v1.19.0.0 ship per CLAUDE.md "Never orphan branch-internal
versions" rule.

Headline: Browser-skills land end-to-end. /scrape <intent> first call
drives the page; second call runs the codified script in 200ms.

The unified CHANGELOG entry covers:
- Phase 1 runtime: $B skill list/show/run/test/rm, scoped tokens,
  3-tier storage, bundled hackernews-frontpage reference.
- Phase 2a: /scrape + /skillify gstack skills, browser-skill-write.ts
  atomic helper, 5 gate-tier E2E + 34 unit assertions.

Numbers table updated: 5 new modules (+browser-skill-write), 2 new
gstack skills, 6 of 8 Codex outside-voice findings resolved (synthesis
garrytan#6 closed by D2; Bun runtime garrytan#7 + OS sandbox garrytan#1 stay deferred to Phase 4).

/automate (Phase 2b) is split out as P0 in TODOS for the next branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(commands): tighten descriptions for LLM-judge baseline pinning

The skill-llm-eval test "baseline score pinning" failed CI on three
retry attempts: judge gave command_reference.actionability=3, baseline
demands ≥4. Judge cited 8 specific gaps in COMMAND_DESCRIPTIONS.

This commit closes 7 of 8 by tightening the descriptions:

- press: documents that key names are case-sensitive Playwright keys,
  shows modifier syntax (Shift+Enter, Control+A), links the full key
  list. Removes the "is this case-sensitive?" guesswork.
- is: documents that <sel> accepts either a CSS selector OR an @ref
  token from a prior snapshot, and that property values are case-
  sensitive.
- scroll: documents that there is no --by/--to amount option, points
  at `js window.scrollTo(0, N)` for pixel-precise scrolling.
- js / eval: clarifies that both run in the same JS sandbox, the
  difference is just inline expr (js) vs file (eval).
- storage: clarifies sessionStorage is read-only via this command,
  points at `js sessionStorage.setItem(...)` for the write path.
- chain: walks through how to invoke (pipe a JSON array of arrays to
  $B chain), confirms it stops at the first error.
- cdp: explains how to discover allowed methods (read cdp-allowlist.ts)
  + shows a concrete example invocation.
- domain-skill: explains that the "classifier flag" is set automatically
  by the L4 prompt-injection scan (agents do not set it manually);
  enumerates the full lifecycle verbs.

The 8th gap (storage set syntax conflict) is also resolved as part of
the storage rewrite.

Two pipe-character bugs caught by the existing
`no command description contains pipe character` guard at
`test/gen-skill-docs.test.ts:595`: the chain example originally used
`echo '[...]' | $B chain` (literal pipe) and the cdp description used
`tab|browser` / `trusted|untrusted` (also literal pipes). Both rewritten
to keep markdown table cells intact.

Verification: 696/0 pass on skill-validation + gen-skill-docs after
regen across all hosts. The CI llm-judge eval will re-run against the
new SKILL.md and should hit actionability ≥4 reliably.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(browser): rewrite BROWSER.md as complete reference

Full rewrite covering the gstack browser surface as of v1.19.0.0. Up from
488 to 1,299 lines, 26 top-level sections.

Adds previously-undocumented subsystems:

- The productivity loop: /scrape + /skillify with D1 (provenance guard),
  D2 (final-attempt-only synthesis), D3 (atomic-write discipline) contracts.
- Browser-skills runtime: anatomy, three-tier storage, scoped tokens, trust
  model (capability + env axes), sibling SDK distribution, atomic-write
  helper, bundled hackernews-frontpage reference.
- Domain-skills: per-site agent notes with quarantined → active → global
  state machine and the L4-classifier auto-promotion gate.
- Pair-agent: dual-listener architecture, 26-command tunnel allowlist,
  canDispatchOverTunnel pure gate, three token types (root, setup key,
  scoped), denial log path + salt model.
- Security stack L1-L6: layer table, thresholds (BLOCK/WARN/LOG_ONLY/
  SOLO_CONTENT_BLOCK), ensemble rule, classifier model paths, env knobs.
- Side Panel deep dive: Terminal pane (Claude PTY) as the primary surface
  with Activity/Refs/Inspector as debug overlays, WS auth via
  Sec-WebSocket-Protocol, gstackInjectToTerminal cross-pane plumbing.
- CDP escape hatch: $B cdp deny-default allowlist, $B inspect CSS inspector,
  $B ux-audit page structure extraction.
- Meta commands previously undocumented: tabs/frames/state/watch/inbox/
  tab-each, with usage and storage paths.
- Authentication: three token types with lifetimes, SSE session cookie,
  PTY session cookie, token registry behavior.
- Full source map: 30+ file inventory of browse/src/ vs the old 11-file
  list.

Preserves from before: architecture diagram, daemon lifecycle, snapshot
ref staleness, screenshot modes, goto file:// vs load-html semantics,
batch endpoint, JS await wrapping, env vars, performance numbers vs MCP,
Playwright acknowledgments, dev guide.

Cross-links to ARCHITECTURE.md, CLAUDE.md, docs/REMOTE_BROWSER_ACCESS.md,
docs/designs/BROWSER_SKILLS_V1.md, scrape/SKILL.md, skillify/SKILL.md,
TODOS.md so anyone landing on BROWSER.md can navigate to the load-bearing
companion docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(server): tab-ownership gate keys on tabPolicy, not isWrite

Browser-skill spawns hit `403: Tab not owned by your agent` on every
first run because the gate at server.ts:639 fired for any non-root
write, regardless of the token's tabPolicy. The bundled
hackernews-frontpage reference skill failed identically. Every
/skillify-generated skill failed identically. The user's natural
tabs have no claimed owner — by design — so any skill driving
them via `goto` (a write) was 403'd.

The intent in skill-token.ts:79 was always correct: `tabPolicy: 'shared'`
with the comment "skill scripts may switch tabs as needed." The
enforcement just ignored it.

Two surgical changes:

browser-manager.ts:checkTabAccess — gate now keys on options.ownOnly
only. Shared-policy tokens (skill spawns, default scoped clients) get
permissive access — root-equivalent for the tab gate. Own-only tokens
(pair-agent over the ngrok tunnel) still require ownership for every
read and write. isWrite stays in the signature for callers that want
to log or branch elsewhere; it no longer gates the decision.

server.ts:639 — gate predicate narrowed from
  (WRITE_COMMANDS.has(command) || tokenInfo.tabPolicy === 'own-only')
to just
  tokenInfo.tabPolicy === 'own-only'
The 'newtab' exemption stays. Shared tokens skip the gate entirely;
own-only tokens still hit it. Comment block above the gate updated to
document the new predicate intent.

Pair-agent isolation is intact. Tunnel tokens still default to
tabPolicy: 'own-only', still must `newtab` first to get a tab they
can drive, still can't dispatch any of the 23 commands outside the
tunnel allowlist.

The capability gate (scope checks) and rate limits already constrain
what local scoped clients can do; tab ownership was never a security
boundary for them — only for pair-agent. This release makes the
enforcement match the original design intent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(server): lock the shared-vs-own-only tab gate contract

The pre-fix tests at tab-isolation.test.ts:43,57 encoded the broken
behavior as the contract — they specifically asserted "scoped agent
cannot write to unowned tab," which was the exact failure mode that
broke browser-skills. They passed because they tested the wrong
invariant.

This commit replaces those tests with explicit shared-vs-own-only
coverage that documents what each policy actually means:

- Shared scoped agents (skill spawns, default scoped clients) can
  read AND write any tab — unowned, their own, or another agent's.
  The capability is gated by scope checks + rate limits, not by tab
  ownership.
- Own-only scoped agents (pair-agent over tunnel) cannot read OR
  write any tab they don't own. Pre-fix this case was conflated with
  shared writes; now it's explicit.

9 unit assertions on checkTabAccess, up from 6. Each test names
the policy axis it's covering so a future refactor can't quietly
flip the contract.

Adds source-shape regression test 10a in server-auth.test.ts:
"tab gate predicate is own-only-scoped, not write-scoped." The
gate's `if (...)` line MUST contain `tabPolicy === 'own-only'` and
MUST NOT contain `WRITE_COMMANDS.has(command) ||`. If a future
refactor re-introduces the write-scoped gate, this fails immediately
in free-tier `bun test`.

Updates the marker for the existing newtab-excluded test to match
the new comment block ("Tab ownership check (own-only tokens /
pair-agent isolation)").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* release: v1.19.0.0 -> v1.20.0.0 — fix tab-ownership footgun

Patch release on top of v1.19.0.0. The shipping headline of v1.19.0.0
(/scrape + /skillify productivity loop) was broken on first run in any
session where the daemon already had a tab. Bundled
hackernews-frontpage failed identically. Every /skillify-generated
skill failed identically.

The fix narrows the tab-ownership gate from "any non-root write" to
"tabPolicy === 'own-only' only." Pair-agent isolation (the v1.6.0.0
threat model) is intact; local skill spawns get their original
behavior back.

VERSION: 1.19.0.0 -> 1.20.0.0
package.json version: synced.

CHANGELOG entry leads with the user-visible impact: the productivity
loop works again, no half-second-stalls of confused 403s. Includes
before/after metrics on the bundled reference skill and the broken-
contract pre-fix tests that hid the regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(claude): sharpen CHANGELOG rule — diff between main and ship

Codifies what was already implicit in the existing "Never orphan
branch-internal versions" + "Only document what shipped between main
and this change" sections, but with sharper language and concrete
NEVER examples.

The rule: a CHANGELOG entry is the diff between main and the shipping
branch — what users get when they upgrade. NOT how the branch got
there. Branch-internal version bumps, mid-branch bug fixes, plan
review outcomes, and patch narratives all belong in PR descriptions
and commit messages, not in CHANGELOG.

Adds explicit examples of phrasing to NEVER use:
  - "v1.X had a bug that v1.Y fixes" (mentions a branch-internal version)
  - "The shipping headline of v1.X was broken because..." (apologizes
    for never-released state)
  - "Pre-fix tests encoded the broken behavior" (contributor's victory
    lap, not user benefit)
  - "Two surgical edits, both in the dispatch path" (micro-narrative
    of the patch)

The constructive replacement: describe the released system as a
property, not as a fix. "Browser-skills run end-to-end with the
expected tab-access semantics." If a property is worth calling out,
document it in the trust-model section, not as a "we fixed X" callout.

Pairs with feedback_no_shame_changelog and
feedback_changelog_harden_against_critics memories — entries should
read as a flex even to a hostile screenshotter, never admit prior
breakage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(changelog): consolidate v1.20.0.0 as the diff vs main

Rewrites the v1.20.0.0 entry to describe what users get when they
upgrade from main (v1.17.0.0) to this release: browser-skills
end-to-end. Drops all branch-internal narrative — Phase 1 / Phase 2a
labels, the v1.8.0.0 P1 history paragraph, the test-counts-by-phase
split, and the patch micro-narrative for the tab-policy semantics.

The previously-separate v1.19.0.0 entry (a branch-internal version
that never landed on main) collapses into v1.20.0.0 per the
"Never orphan branch-internal versions" rule.

Tab-access policies are now documented as a property of the trust
model: `'shared'` (skill spawns) is permissive, `'own-only'`
(pair-agent over the tunnel) is strict. No "fix" framing, no
mention of an intermediate state where it was broken.

Adds the BROWSER.md rewrite and the new tab-isolation +
server-auth source-shape regression tests to the itemized changes.

The reverse-chronological order remains: v1.20.0.0 → v1.17.0.0 →
v1.16.0.0 → v1.15.0.0 → ... Gaps (v1.18, v1.19) are fine — those
were branch-internal version numbers that never landed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
adamtasteslikegood added a commit to adamtasteslikegood/gstack that referenced this pull request Jun 5, 2026
* v1.41.1.0 fix wave: 7 HIGH bugs from external audit + regression tests (PR #1169 follow-up) (#1592)

* fix(build-app): escape sed replacement metachars in Chromium rebrand

build-app.sh injects \$APP_NAME directly into the replacement half of
sed's s/// when patching Chromium's localized InfoPlist.strings. If
\$APP_NAME ever carries '/', '&', or '\\' — the command either breaks
or starts interpreting input as sed syntax. The trailing '|| true'
would then silently hide the failure and ship a DMG that still says
'Google Chrome for Testing' in the menu bar.

Escape replacement metachars before substitution. No change for the
default name 'GStack Browser'.

* fix(build-app): bail out if 'mktemp -d' fails instead of cp-ing into '/'

The DMG creation step sets DMG_TMP from 'mktemp -d' with no error check.
If mktemp fails (tmpfs full, permissions, TMPDIR misconfigured), DMG_TMP
is empty and the very next line — 'cp -a "\$APP_DIR" "\$DMG_TMP/"' —
expands to 'cp -a "<app>" "/"', which copies the bundle into the root of
the filesystem.

Refuse to continue unless mktemp produced a real directory. Defensive
second check catches the (rare) case where mktemp succeeds but returns
something that isn't a directory we can cp into.

* fix(telemetry-sync): drop predictable $$ tmp-file fallback

gstack-telemetry-sync tried 'mktemp /tmp/gstack-sync-XXXXXX' and on
failure fell back to '/tmp/gstack-sync-$$'. $$ is the PID — predictable
and reusable, so on shared hosts another user can pre-create or symlink
the path and either steal the response body or clobber an unrelated
file when curl writes through it.

Drop the fallback. If mktemp cannot produce a unique file we just skip
this sync cycle — the events stay on disk and the next run picks them
up. Also install an EXIT trap so the response file is cleaned up on
unexpected exit, not just on the happy path.

* fix(verify-rls): drop predictable $$-based tmp file fallback

Same shape as gstack-telemetry-sync: on mktemp failure the script fell
back to '/tmp/verify-rls-$$-$TOTAL', which is fully predictable from the
PID and a per-check counter. On a shared box another user can pre-create
or symlink the path and either capture the HTTP response body (which may
leak what the RLS tests revealed) or corrupt an unrelated file that curl
writes through.

Make mktemp strict. On failure return from the check function; the caller
tallies a FAIL and the run moves on.

* fix(security-classifier): close writer + delete tmp on download error

downloadFile() opens an fs.WriteStream to '<dest>.tmp.<pid>' and drives
it from a fetch body reader, but if reader.read() or writer.write()
throws mid-download the writer is never closed. That leaks an FD per
failed attempt and leaves the half-written tmp on disk. A later retry
can land in renameSync(tmp, dest) with a truncated TestSavantAI /
DeBERTa ONNX file — which then loads but produces garbage classifier
verdicts until the user manually nukes the models cache.

Wrap the download loop in try/catch. On failure, destroy() the writer
and unlink the tmp before rethrowing, so the next attempt starts from a
clean slate.

* fix(meta-commands): guard JSON.parse in pdf --from-file parser

parsePdfFromFile() runs JSON.parse on user-supplied file contents with
no try/catch. A malformed payload surfaces as an uncaught SyntaxError
from the 'pdf' command handler and the user sees an opaque stack trace
instead of "this file isn't valid JSON". Worse, the same call path is
used by make-pdf when header/footer HTML would overflow Windows'
CreateProcess argv cap, so a corrupt payload file there can take down
the make-pdf run.

Wrap JSON.parse. Re-throw with a message that names the offending file
and echoes the parser's own explanation. Also reject top-level non-
objects (null, array, primitive) since the rest of the function treats
json as an object — catching that here produces a clear error instead
of a TypeError further down.

* fix(global-discover): stop dropping sessions when header >8KB

extractCwdFromJsonl() reads the first 8KB of each JSONL session file and
runs JSON.parse on every newline-split line. When a session record
happens to straddle the 8KB cap, the last line ends in a truncated JSON
fragment, JSON.parse throws, the catch block 'continue's silently, and
if that was the only line carrying 'cwd' the whole project gets dropped
from the discovery output without a warning.

Two independent hardening steps:
  1. Raise the read cap to 64KB. Session headers observed in Claude
     Code / Codex / Gemini transcripts fit comfortably; this just moves
     the cliff out of the normal range.
  2. Drop the final segment after splitting on '\\n'. If the read hit
     the cap mid-line, that segment is guaranteed incomplete; if the
     file ended inside the buffer, the split produces an empty final
     segment and dropping it is a no-op.

Together these make the parser robust regardless of how verbose the
leading records are.

* test: export downloadFile, parsePdfFromFile, extractCwdFromJsonl

These three internal helpers are now imported by regression tests
landing in the next commits (PR #1169 follow-up). Pattern matches the
existing normalizeRemoteUrl export in gstack-global-discover.ts which
test/global-discover.test.ts already imports side-effect-free.

No change to runtime behavior; gstack has no public package entrypoint
that would re-export these, so the in-repo surface is unchanged for
callers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(security-classifier): await writer close before unlinking tmp on error

The earlier downloadFile() error-path cleanup hit a race: Node's
createWriteStream lazily opens the FD and flushes buffered writes during
destroy(), so a naive `fs.unlinkSync(tmp)` immediately after `writer.destroy()`
hits ENOENT (file not yet on disk), then the writer's destroy finishes on the
next tick and creates the file fresh — leaving the half-written tmp behind
exactly as the original fix tried to prevent.

The new sequence awaits the writer's 'close' event before unlinking, so the FD
is fully torn down and no subsequent flush can re-create the path.

Caught by browse/test/security-classifier-download-cleanup.test.ts in the
next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browse): regression tests for downloadFile cleanup + parsePdfFromFile guard

Covers PR #1169 bugs #6 and #7:

- security-classifier-download-cleanup.test.ts pins downloadFile error-path
  cleanup against three failure shapes: reader rejects mid-stream, non-2xx
  response, missing body. Asserts the dest file is not created and no
  <dest>.tmp.* siblings remain (glob-matched, not exact path — codex push:
  if the fix later switches to mkdtempSync, the assertion still holds).
  Includes a happy-path case so the cleanup isn't fighting a correct download.

- regression-pr1169-pdf-from-file-invalid-json.test.ts pins parsePdfFromFile
  to throw a helpful error for: invalid JSON, empty file, top-level array,
  top-level number, top-level string, top-level null, top-level boolean.
  Codex push: JSON.parse accepts primitives too, so Array.isArray + typeof
  guard must be tested separately from the JSON.parse try/catch.

Both files use mkdtempSync(process.cwd()/...) for fixture isolation since
SAFE_DIRECTORIES allows TEMP_DIR or cwd; cwd is universal across CI hosts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(global-discover): regression for extractCwdFromJsonl 64KB cap

PR #1169 bug #8: the 8KB read cap landed mid-line on Claude Code session
headers, JSON.parse threw on the truncated tail, the catch silently
continued, and the project disappeared from /gstack discovery output.

Six new cases under describe("extractCwdFromJsonl 64KB cap"):

- happy path: small JSONL with obj.cwd returns it
- 12KB first line with obj.cwd: returns cwd (the bug case)
- 80KB single line overflowing 64KB: returns null without crashing
- complete line followed by partial second line: trailing-partial-drop
  must not poison the result; returns first line's cwd
- missing file: returns null (file read error swallowed)
- malformed first line + valid second line within cap: skips bad,
  returns second's cwd

Tests use the exported extractCwdFromJsonl (added in earlier export
commit) and live in a separate describe block from the existing
"4KB / 128KB buffer" tests, which exercise the unrelated scanCodex
meta.payload.cwd path at L338 — different function, different bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: regression tests for shell-script bugs in PR #1169 (#2-#5)

Two new test files pinning the four shell-script invariants from the
external audit:

regression-pr1169-build-app-sed.test.ts — bugs #2 + #3
- Runtime isolation: extracts the sed-escape sequence from build-app.sh
  and runs it against hostile $APP_NAME values ("Foo/Bar&Baz", "Cool\App",
  "A/B\C&D"). Asserts the literal hostile name round-trips through a real
  `sed s///` invocation, locking the metachar safety end-to-end.
- Static check: the rebrand block must contain both the escape line AND
  the sed line referencing $APP_NAME_SED_ESCAPED; bare $APP_NAME
  interpolation directly into the s/// replacement is rejected.
- Static check: DMG_TMP=$(mktemp -d) is followed by an explicit `|| { ... exit }`
  failure handler AND a `[ -z "$DMG_TMP" ] || [ ! -d "$DMG_TMP" ]` validation
  AND the cp -a appears AFTER both guards.
- Runtime fake-bin: extracts the guard shape, runs with a fake mktemp that
  exits 1, asserts the script exits non-zero before any cp block can reach.

regression-pr1169-mktemp-fallbacks.test.ts — bugs #4 + #5
- Per codex pushback, the invariant is "no `mktemp ... || echo <path>`
  fallback shape" — not just "no $$ token." That's a stronger invariant
  that catches future swaps to $RANDOM or hardcoded paths.
- For each of bin/gstack-telemetry-sync and supabase/verify-rls.sh:
  - no echo-based fallback after mktemp
  - no $$ inside any /tmp path literal
  - mktemp failure path explicitly exits / returns non-zero
  - telemetry-sync also pins the `trap rm -f $RESP_FILE EXIT` cleanup
    so success paths don't leak the tmp on normal exit.

All seven new test files are gate-tier (deterministic, sub-second, no LLM,
no network). Runtime shell tests use fake-bin PATH stubs in temp dirs;
no $HOME mutation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.41.1.0)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: RagavRida <ragavrida@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v1.42.0.0 Daegu wave: 23 community-filed bugs + PTY classifier enforcement (24 bisect commits) (#1594)

* fix(gstack-paths): guard CLAUDE_PLUGIN_DATA against cross-plugin contamination (#1569)

gstack-paths previously trusted CLAUDE_PLUGIN_DATA as a fallback for
GSTACK_STATE_ROOT whenever GSTACK_HOME was unset. When another plugin
(e.g. Codex) persists its own CLAUDE_PLUGIN_DATA into the session env
via CLAUDE_ENV_FILE, gstack picked it up and wrote checkpoints,
analytics, and learnings into that plugin's directory. Anyone with the
Codex plugin installed alongside gstack hit this silently.

Fix: guard the CLAUDE_PLUGIN_DATA branch so it only fires when
CLAUDE_PLUGIN_ROOT confirms we're running as the gstack plugin (path
contains "gstack"). Skill installs fall through to \$HOME/.gstack.

Contributed by @ElliotDrel via #1570. Closes #1569.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(gbrain-sync): sourceLocalPath handles wrapped {sources:[...]} shape from gbrain v0.20+

gbrain v0.20+ changed `gbrain sources list --json` to return
{sources: [...]} instead of a flat array. sourceLocalPath crashed
upstream with `list.find is not a function` on every /sync-gbrain
invocation against modern gbrain. Accept both shapes for
forward/backward compat, matching probeSource/sourcePageCount in
lib/gbrain-sources.ts.

Contributed by @jakehann11 via #1571. Closes #1567. Supersedes #1564
(@tonyjzhou, same fix, different shape — credit retained).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(brain-context-load): probe gbrain via execFile, not shell builtin (#1559)

gbrainAvailable() used `execFileSync("command", ["-v", "gbrain"])`,
which fails in any environment where the `command` builtin isn't on
the spawned process's PATH (most non-interactive shells). The probe
then reported gbrain as missing even when it was installed, and
context-load silently skipped vector/list queries.

Fix: probe `gbrain --version` directly with a 500ms timeout (matching
the rest of the file's MCP_TIMEOUT_MS). Same semantics, works
everywhere execFile works.

Contributed by @jbetala7 via #1560. Closes #1559.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(gbrain-doctor): pin schema_version:2 doctor parse path (#1418)

Adds an exec-path regression test that runs a fake gbrain shim emitting
the v0.25+ doctor JSON shape (schema_version: 2, status: "warnings",
exit 1 for health_score < 100, no top-level `engine` field). Confirms
freshDetectEngineTier recovers stdout from the non-zero exit and falls
back to GBRAIN_HOME/config.json for the engine label.

The pre-existing test for #1415 only stripped gbrain from PATH; this
test exercises the actual doctor parse path, closing the gap that
codex's plan review flagged.

Also documents the schema_version separation in
lib/gbrain-local-status.ts: the local CacheEntry stays at version 1,
distinct from the doctor-output schema_version which we accept across
versions in gstack-memory-helpers.

Closes #1418 (credit @mvanhorn for surfacing the doctor + schema_v2
collapse). The fix landed pre-emptively in v1.29.x; this commit pins
it with a stronger test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(memory-ingest): pin put_page regression + scrub stale name from --help and comments (#1346)

#1346 reported that gstack-memory-ingest still called the renamed
gbrain put_page subcommand on gbrain v0.18+. The actual code migrated
to `gbrain put` and later to batch `gbrain import <dir>` before this
report landed — only documentation lag remained.

This commit:
- Updates the --help string ("Skip gbrain put calls (still updates
  state file)") so user-facing docs match the shipped subcommand
- Updates two inline comments that still referenced the old name
- Adds test/memory-ingest-no-put_page.test.ts: a regression pin that
  strips comments from bin/gstack-memory-ingest.ts and fails the build
  if "put_page" appears in any active code or string literal, plus a
  sanity check that the file still calls a supported gbrain page-write
  verb (put or import)

Closes #1346. Reporter @kylma-code surfaced the doc lag; the original
code migration credit is on the v1.27.x wave.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(resolvers): rewrite all gbrain put_page instructions to canonical put <slug>

scripts/resolvers/gbrain.ts emitted user-facing copy-paste instructions
using the renamed `gbrain put_page` subcommand across 10 skills
(office-hours, investigate, plan-ceo-review, retro, plan-eng-review,
ship, cso, design-consultation, fallback, entity-stub). Every gstack
user copying those snippets hit "unknown command: put_page" on gbrain
v0.18+.

This commit:
- Rewrites all 10 instruction templates to use `gbrain put <slug>
  --content "$(cat <<EOF...EOF)"` with title/tags moved into YAML
  frontmatter inside --content, matching the v0.18+ subcommand shape
- Updates README.md and USING_GBRAIN_WITH_GSTACK.md "common commands"
  table to reference `gbrain put` and `gbrain get`
- Adds test/resolvers-gbrain-put-rewrite.test.ts pinning two
  invariants: (a) resolver source ships only canonical instructions,
  (b) every tracked SKILL.md file is free of `gbrain put_page`

CHANGELOG entries are deliberately left untouched (historical record).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(build): extract package.json build to scripts/build.sh for Windows Bun compat (#1538, #1537, #1530, #1457, #1561)

Bun's Windows shell parser rejects multiple constructs the inline
package.json build chain used: brace groups `{ cmd; }`, subshells with
redirection `( git ... ) > path/.version`, and (in Bun 1.3.x) subshells
near redirections in general. Every Windows install + every
auto-upgrade since v1.34.2.0 has failed on `bun run build`.

Extracts the build chain to scripts/build.sh and the .version writes to
scripts/write-version-files.sh. POSIX-portable, no Bun shell parsing
involved. Also adds Windows-specific bun.exe handling for non-ASCII
PATHs (a separate Windows footgun where Bun's --compile fails when the
binary lives under a path with non-ASCII chars).

Updates test/build-script-shell-compat.test.ts to assert the new shape:
no subshells with redirections anywhere in the build chain, and build
delegates to scripts/build.sh which delegates .version writes.

Contributed by @Charlie-El via #1544. Supersedes #1531 (@scarson, fixed
in build helper), #1480 (@mikepsinn, partial overlap), #1460
(@realcarsonterry, brace-group fix subsumed) — credit retained.
Closes #1538, #1537, #1530, #1457, #1561.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(windows): .exe glob in .gitignore + .exe extension resolution in find-browse (#1554)

bun build --compile on Windows appends .exe to the output filename,
producing browse.exe instead of browse. find-browse's existsSync probe
only checked the bare path and returned null on Windows even when the
binary was correctly built. .gitignore similarly only excluded the
bare bin/gstack-global-discover path, leaving the .exe variant
tracked.

This commit:
- .gitignore: changes `bin/gstack-global-discover` →
  `bin/gstack-global-discover*` so the Windows .exe variant is ignored
- browse/src/find-browse.ts: adds isExecutable + findExecutable helpers
  that fall back to .exe/.cmd/.bat probing on Windows, mirroring the
  same helper already in make-pdf/src/browseClient.ts and pdftotext.ts

Contributed by @Mike-E-Log via #1554.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(windows): add fresh-install E2E gate that runs bun run build on windows-latest

Adds .github/workflows/windows-setup-e2e.yml as the gate that catches
Bun shell-parser regressions in the build chain before they reach
users. Triggers on PRs touching package.json, scripts/build.sh,
scripts/write-version-files.sh, setup, browse cli/find-browse, or
gstack-paths.

What it verifies:
1. bun run build completes on Windows (the previously-broken path that
   #1538/#1537/#1530/#1457/#1561 reported)
2. All compiled binaries land on disk (browse.exe, find-browse.exe,
   design.exe, gstack-global-discover.exe)
3. find-browse resolves to the .exe variant on Windows (regression
   gate for #1554)
4. gstack-paths returns non-empty GSTACK_STATE_ROOT/PLAN_ROOT/TMP_ROOT
   on Windows (regression gate for #1570)

Complements the existing windows-free-tests.yml (curated unit subset);
this new workflow exercises the install path itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(codex): move diff scope into prompt instead of --base (Codex CLI 0.130+ argv conflict) (#1209)

Codex CLI ≥ 0.130.0 rejects passing a custom prompt and --base together
(mutually exclusive at argv level). Every /codex review, /review, and
/ship structured Codex review call ended with an argv error before the
model ran.

Fix: scope the diff in prompt text using
"Run git diff origin/<base>...HEAD 2>/dev/null || git diff <base>...HEAD"
instead of `--base <base>`. Preserves the filesystem boundary
instruction across all invocations and keeps Codex's review prompt
tuning.

Touches:
- codex/SKILL.md.tmpl + regenerated codex/SKILL.md
- scripts/resolvers/review.ts + regenerated review/SKILL.md, ship/SKILL.md
- test/gen-skill-docs.test.ts: new regression that fails if any of the
  five known files still contain the prompt+--base shape
- test/skill-validation.test.ts: corresponding negative + positive pin
  on the rendered SKILL.md files

Contributed by @jbetala7 via #1209. Closes #1479. Supersedes #1527
(@mvanhorn — same intent, different patch shape, CONFLICTING) and
#1449 (@Gujiassh — broader refactor, CONFLICTING). Credit retained
in CHANGELOG.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(review): diff from git merge-base, not git diff origin/<base> (#1492)

git diff origin/<base> shows everything since the common ancestor in
both directions — it includes commits that landed on origin/<base>
after this branch was created as deletions. That made /review and
/ship's pre-landing structured review report inflated diff totals and
flagged "removed" code that was actually still present in the working
tree.

Fix: compute DIFF_BASE via git merge-base origin/<base> HEAD and diff
the working tree against that point. Same coverage of uncommitted
edits, no phantom deletions from out-of-order base advancement.

Applies to /review's Step 1 (diff existence check), Step 3 (get the
diff), the build-on-intent scope-creep check, the structured review
DIFF_INS/DIFF_DEL stats, and the Claude adversarial subagent prompt.
Same change flows into ship/SKILL.md via the shared resolver.

Touches:
- review/SKILL.md.tmpl + regenerated review/SKILL.md, ship/SKILL.md
- scripts/resolvers/review.ts
- scripts/resolvers/review-army.ts

Contributed by @mvanhorn via #1492.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(codex): pin filesystem-boundary preservation across all codex review surfaces (#1503, #1522)

#1503 reported that the bare codex review --base path stripped the
filesystem boundary instruction, letting Codex spend tokens reading
.claude/skills/ and agents/. #1522 proposed adding a skill-path
detector that switched to the custom-instructions route when the diff
touched skill files.

After C10 (#1209) restructured codex review to always carry the
boundary in the prompt (the prompt+--base argv conflict forced the
restructure), the skill-path detector becomes redundant — every
default call already preserves the boundary.

This commit pins the post-#1209 invariant with a test that fails the
build if any future refactor strips the boundary from codex/SKILL.md,
review/SKILL.md, or ship/SKILL.md. Closes #1503 by regression test.

#1522 (@genisis0x) is superseded by #1209 (the prompt rewrite covers
its safety concern); credit retained in CHANGELOG.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(skills): use command -v instead of which for codex detection (#1197)

`which` is not on PATH in every shell — some Windows shells, BusyBox-
only containers, and minimal CI images all fail when skills probe
codex availability via `which codex`. `command -v` is a POSIX builtin
and always available where the skill is running.

Touched:
- codex/SKILL.md.tmpl: CODEX_BIN=$(command -v codex || echo "")
- scripts/resolvers/review.ts and scripts/resolvers/design.ts:
  3 + 3 sites each rewritten to `command -v codex >/dev/null 2>&1`
- Regenerated all 10 affected SKILL.md files (codex, review, ship,
  design-consultation, design-review, office-hours, plan-ceo-review,
  plan-design-review, plan-devex-review, plan-eng-review)
- test/skill-validation.test.ts: updated pin + defensive regression
  test that fails if `which codex` returns to codex/SKILL.md
- test/skill-e2e-plan.test.ts: updated summary regex

Contributed by @mvanhorn via #1197.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(codex): surface non-zero exits so wrappers stop reading as silent stalls (#1467, #1327)

When codex exits non-zero (parse errors, arg-shape breaks, model API
errors that propagate as non-zero status), the calling agent
previously saw an empty output and burned 30-60 minutes misdiagnosing
as a silent model/API stall. The hang-detection block only caught
exit 124 (the timeout-wrapper signal).

Adds elif blocks in all four codex invocation sites (Review default,
Challenge, Consult new-session, Consult resume) that:
- Echo "[codex exit N] <stderr first line>" to stdout
- Indent the first 20 stderr lines for inline context
- Log codex_nonzero_exit telemetry tagged with the call site

Contributed by @genisis0x via #1467. Closes #1327.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(design): disclose OpenAI key source + warn on cwd .env match (#1278, closes #1248)

The design binary previously called process.env.OPENAI_API_KEY without
checking where the key came from. If a user ran $D inside someone
else's project that had OPENAI_API_KEY in its .env, the resulting
generation billed that project's account. Silent and irreversible.

Fix: resolveApiKeyInfo() returns both the key and its source. When the
env-var path matches an OPENAI_API_KEY entry in the current
directory's .env, .env.<NODE_ENV>, or .env.local file, we set a
warning. requireApiKey() prints "Using OpenAI key from <source>" plus
the warning before the run — never the key itself.

Adds 6 unit tests covering: config-vs-env precedence, env-only (no
match), env+cwd .env match, quoted/exported values, value-mismatch
(no false positive), and the no-leak invariant for requireApiKey
stderr output.

Contributed by @jbetala7 via #1278. Closes #1248.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(browse): guard full-page screenshots against Anthropic vision API >2000px brick (#1214)

Full-page screenshots of tall pages routinely exceeded 2000px on the
longest dimension, silently bricking the agent's session: the
resulting base64 reached the Anthropic vision API which rejected the
oversized image, leaving the agent burning turns on a useless blob
with no stderr trace from the browse side.

Adds browse/src/screenshot-size-guard.ts as a shared helper:
- guardScreenshotBuffer(buf) → downscales in-memory if max(w,h) > 2000
- guardScreenshotPath(path) → file-mode variant that rewrites in place
- Aspect ratio preserved via sharp's resize fit:inside
- Stderr diagnostic on any downscale so callers can see when it fired
- Lazy sharp import so non-screenshot paths pay no startup cost

Wires the guard into all three full-page callsites codex review
flagged:
- browse/src/snapshot.ts: annotated + heatmap fullPage captures
- browse/src/meta-commands.ts: screenshot command (path + base64
  fullPage modes) plus the responsive 3-viewport sweep
- browse/src/write-commands.ts: prettyscreenshot fullPage path

Covers seven unit cases (pass-through, downscale, aspect ratio,
exactly-2000px edge, file-mode rewrite) plus a static invariant test
that fails the build if any of the three callsites stops importing the
guard.

Closes #1214.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(security): add Node sidecar entry for L4 prompt-injection classifier (#1370)

The L4 TestSavant classifier in browse/src/security-classifier.ts
can't be imported into the compiled browse server (onnxruntime-node
dlopen fails from Bun's compile extract dir per CLAUDE.md). The agent
that used to host it (sidebar-agent.ts) was removed when the PTY
proved out — leaving the classifier file shipped but with zero
callers. Exactly the gap codex flagged in #1370.

Adds browse/src/security-sidecar-entry.ts: a Node script that runs the
classifier as a subprocess of the browse server. It reads NDJSON
requests from stdin and writes id-correlated NDJSON responses to
stdout, supporting:
  - op: "scan-page-content" — full L4 classifier scan
  - op: "ping" — liveness probe for the client's health check
  - op: "status" — classifier readiness (used by /pty-inject-scan to
    surface l4 { available: bool } in its response)

Plus browse/src/find-security-sidecar.ts: a resolver that locates
node + the bundled JS entry (browse/dist/security-sidecar.js, built in
a follow-up package.json change) or falls back to the dev TS entry.
Returns null cleanly when node isn't on PATH so the calling endpoint
can degrade per D7 (extension WARN + user confirm).

C17 of the security-stack wave. C18 adds the IPC client + lifecycle
management; C19 wires the endpoint; C20 routes the extension through it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(security): sidecar IPC client with lifecycle + circuit breaker (#1370)

Adds browse/src/security-sidecar-client.ts to manage the Node L4
classifier subprocess from the compiled browse server:

- Lazy spawn on first scan; reuses the same process across requests
- Id-correlated request/response via NDJSON over stdio
- 5s default per-scan timeout; 64KB payload cap (short-circuits before
  spawn so oversized requests don't waste a process)
- 3-in-10-minutes respawn cap → trips circuit breaker; subsequent
  scans throw immediately so the /pty-inject-scan endpoint can surface
  l4 { available: false } to the extension and degrade to WARN+confirm
- process.on('exit') sends SIGTERM to the child for clean teardown
- isSidecarAvailable() lets the endpoint probe before scan calls so
  the response shape reflects degraded mode honestly

Unit tests cover the payload cap, the availability probe, and the
breaker-doesn't-crash invariant under repeated rejected calls.

C18 of the security-stack wave. C19 adds POST /pty-inject-scan; C20
routes the extension through it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(security): add POST /pty-inject-scan endpoint for pre-PTY-inject scans (#1370)

The sidebar's gstackInjectToTerminal callers (toolbar Cleanup,
Inspector "Send to Code") were piping page-derived text directly into
the live claude PTY with ZERO classifier processing — the gap codex
flagged in #1370. The documented sidebar security stack had a hole
the size of every Cleanup-button click.

Adds POST /pty-inject-scan to browse/src/server.ts:
- Local-only binding (NOT in TUNNEL_PATHS — tunnel attempts get the
  general 404 path; never reaches the scan logic)
- Root-token auth via existing validateAuth() — 401 on unauth
- 64KB request cap → 413 + payload-too-large body
- 5s scan timeout via sidecar client
- URL-blocklist forced to BLOCK in PTY context (page-derived REPL
  input is higher-risk than ordinary tool output)
- L4 ML classifier via the sidecar when available; degrades to WARN
  per D7 when sidecar is unavailable
- Response goes through JSON.stringify(..., sanitizeReplacer) per
  v1.38.0.0 Unicode-egress hardening
- Imports only from security-sidecar-client.ts, never directly from
  security-classifier.ts (which would brick the compiled Bun binary)

Seven static-invariant tests pin the POST verb, auth gate, 64KB cap,
tunnel-listener exclusion, sanitizeReplacer wrapping, l4 availability
shape, and the no-direct-classifier-import rule.

C19 of the security-stack wave. C20 routes the extension through it;
C21 adds the invariant AST check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(extension): route gstackInjectToTerminal through /pty-inject-scan (#1370)

Closes the documented-vs-shipped gap codex flagged in #1370. The
sidebar's two PTY-injection call sites (Inspector "Send to Code" and
toolbar Cleanup) now pre-scan via the new /pty-inject-scan endpoint
before writing to the live claude REPL.

Adds window.gstackScanForPTYInject(text, origin) to
extension/sidepanel-terminal.js:
- Async, returns { allow, verdict, reasons, l4 }
- POST to /pty-inject-scan with the existing root-token auth
- WARN+confirm on scan failure (network down, sidecar absent, etc.)
  rather than silent PASS — D7 honest-degradation

gstackInjectToTerminal stays synchronous, returns boolean. Per D6:
keeping the inject sync means existing `const ok = ...?.()` callers
don't break, and the invariant test in
test/extension-pty-inject-invariant.test.ts can statically pin that
every call goes through the scan first.

extension/sidepanel.js call sites updated:
- inspectorSendBtn click → await scan, BLOCK drops + WARN prompts via
  window.confirm, PASS injects silently
- runCleanup() → same flow. Static cleanup prompt always PASSes but
  still routes through scan to honor the invariant.

C20 of the security-stack wave. C21 adds the static invariant test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(security): invariant — extension PTY inject must be scan-gated (#1370)

Static-analysis invariant test that fails the build if any
extension/*.js path calls window.gstackInjectToTerminal without a
preceding window.gstackScanForPTYInject in the same enclosing
function. Closes the documented-vs-shipped gap codex demanded a
machine check on.

Rules:
- Rule 1: any file that calls inject must also reference scan
- Rule 2: in the enclosing function (function declaration, arrow,
  async (), event handler), a scan call must appear before the inject
  call by source position
- Exemption: sidepanel-terminal.js (the file that DEFINES the inject
  function) is exempt from Rule 2 since the definition is not a call

Plus two structural checks:
- sidepanel-terminal.js defines both the inject and scan functions
- inject stays SYNCHRONOUS (no `async` modifier) per D6 — async would
  silently break the `const ok = ...?.()` pattern at every caller

C21 of the security-stack wave. The sidecar architecture (#1370) is
complete: server-side L1-L3 + L4-via-sidecar (C17+C18+C19), extension
pre-scan wiring (C20), and now the regression gate (C21).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): opt-in extended stealth mode with 6 detection-vector patches (#1112)

Rebases @garrytan's PR #1112 (Apr 2026, abandoned) onto the current
browse/src/stealth.ts contract. The existing minimal "codex narrowed"
stealth (webdriver-mask + AutomationControlled launch arg) stays the
default. PR #1112's six additional patches are added behind an opt-in
GSTACK_STEALTH=extended env flag.

Extended-mode patches (applied AFTER the default mask, in order):
  1. delete navigator.webdriver from prototype (not just the getter —
     detectors check `"webdriver" in navigator`)
  2. WebGL renderer spoof to Apple M1 Pro (SwiftShader was the #1
     software-GPU tell in containers)
  3. navigator.plugins returns a PluginArray-prototype-passing array
     with MimeType objects and namedItem()
  4. window.chrome populated with chrome.app, chrome.runtime,
     chrome.loadTimes(), chrome.csi() with realistic shapes
  5. navigator.mediaDevices backfilled when headless drops it
  6. CDP cdc_*-prefixed window globals cleared

Why opt-in: the default mode's contract is fingerprint CONSISTENCY,
which protects against detectors that flag spoofing mismatch. Extended
mode actively lies about the environment; sites that reflect on these
properties can break. Users who hit detection in default mode can flip
GSTACK_STEALTH=extended for SannySoft 100% pass-rate.

Twenty unit tests pin the env-flag semantics, all six patches' code
presence, and the applyStealth wiring order. Live SannySoft pass-rate
verification stays in the periodic-tier E2E suite.

Contributed by @garrytan via #1112 (rebased — original PR opened
before the codex-narrowed minimum landed; rebase preserves the
narrowed default while adding the SannySoft-passing path as opt-in).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(fixtures): regenerate ship-SKILL.md golden baselines after C10-C13 + C16 templates

Updates the three ship-SKILL.md golden baselines (claude, codex,
factory hosts) to match the new shape produced by:
- C10 #1209 codex argv (prompt + diff scope, no --base)
- C11 #1492 merge-base diff (DIFF_BASE= preamble)
- C13 #1197 command -v for codex detection
- C12 + boundary preservation per regen-enforcing test

Per CLAUDE.md SKILL.md workflow: edit the .tmpl, run gen:skill-docs,
commit the regenerated outputs together. Goldens are part of the
regen contract — without this commit, test/host-config.test.ts'
golden-baseline checks fail with the diff codex review surfaced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(release): v1.41.0.0 — Daegu wave (24 bisect commits, 14 user-facing fixes)

Bumps VERSION 1.40.0.0 → 1.41.0.0. CHANGELOG entry follows the
release-summary format in CLAUDE.md: two-line headline, lead
paragraph, "The numbers that matter" table, "What this means for
builders" closer, then itemized Added/Changed/Fixed/For contributors
with inline credit to every PR author and original issue reporter.

Scale-aware bump per CLAUDE.md: 24 commits, ~6000 LOC net,
substantial new capability across security (PTY sidecar wiring),
install (Windows build chain), compat (gbrain 0.18-0.35, Codex CLI
0.130+), and quality (screenshot guard, design key disclosure,
extended stealth opt-in). MINOR is the right call.

Closes for users: #1567, #1559, #1569, #1346, #1418, #1538, #1537,
#1530, #1457, #1561, #1554, #1479, #1503, #1248, #1214, #1370, #1327,
#1193 pattern, #1152 pattern. Credit retained inline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(find-browse): resolve source-checkout layout <git-root>/browse/dist/browse[.exe]

windows-setup-e2e.yml runs `bun browse/src/find-browse.ts` against a
freshly-built repo where binaries land at browse/dist/browse.exe (no
.claude/skills/gstack/ install layout). The previous markers chain
only matched .codex/.agents/.claude prefixed paths, so find-browse
exited "not found" even when the binary was present.

Adds a source-checkout fallback after the marker scan: if no
installed layout resolves but <git-root>/browse/dist/browse[.exe]
exists, return that. Three real callers hit this path:
- gstack repo dev workflow before `./setup` runs
- windows-setup-e2e.yml CI (the breakage that surfaced this)
- make-pdf consumers running from a sibling source checkout

Smoke-verified: a fresh git repo with browse/dist/browse on disk now
resolves through the source-checkout branch (was returning null
before this commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(release): bump v1.41.0.0 → v1.42.0.0 to clear queue collision with #1574

The version-gate workflow flagged a collision: PR #1574
(garrytan/colombo-v3) already claims v1.41.0.0, and #1592
(fix/audit-critical-high-bugs) claims v1.41.1.0. Per CLAUDE.md's
workspace-aware ship rule, queue-advancing past a claimed version
within the same bump level is permitted — MINOR work landing on top
of a queued MINOR still reads as MINOR relative to main.

Util's suggested next slot is v1.42.0.0; taking it. CHANGELOG entry
header bumped + dated 2026-05-19; entry body unchanged (same wave
content, same credit list).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v1.42.1.0 feat: gate terminal-agent teardown on ServerConfig.ownsTerminalAgent (unblocks gbrowser embedder) (#1615)

* feat: gate terminal-agent teardown on ServerConfig.ownsTerminalAgent

Adds ownsTerminalAgent?: boolean to ServerConfig (default true). Wraps the
three shutdown side effects (pkill -f terminal-agent\.ts + 2 safeUnlinkQuiet
calls for terminal-port and terminal-internal-token) inside a single
if (ownsTerminalAgent) block. Embedders (gbrowser phoenix overlay) pass
false to keep their own PTY lifecycle intact across gstack's teardown.

CLI start() call site passes ownsTerminalAgent: true explicitly; static-grep
test in the new test file catches a refactor that drops it.

Strict opt-out: only explicit false flips the gate (cfg.ownsTerminalAgent
=== false ? false : true). Defends against JS callers passing truthy non-bool
values.

Adds __resetShuttingDown test-only export mirroring __resetRegistry. The
module-scoped isShuttingDown latch otherwise silently no-ops a second
shutdown() in the same process.

Drops dead try/catch wrappers around safeUnlinkQuiet inside the new gate —
safeUnlinkQuiet already swallows all errors internally.

New test file (4 cases) stubs both process.exit AND child_process.spawnSync
so a real pkill -f terminal-agent\.ts never fires on the developer machine.
beforeAll/afterAll save and restore real-daemon file contents in the state
dir so the test cannot clobber a running gstack session.

* chore: file followup TODOs (identity-based pkill, cfg.config composition gap, ownership-object trigger)

Three P3 followups surfaced by /autoplan + /plan-eng-review while reviewing
the ownsTerminalAgent gate:

- Identity-based terminal-agent kill: pkill -f terminal-agent\.ts is a latent
  CLI footgun (regex match kills sibling gstack sessions, editor processes,
  etc.). Replace with PID-tracked process.kill at both cli.ts:1047 and
  server.ts:1281.

- shutdown() reads module-level config, not cfg.config (pre-existing
  composition gap). Same gap applies to cleanSingletonLocks(resolveChromiumProfile())
  at server.ts:1298 (should be cfg.chromiumProfile). Both are followup work
  for the embedder-composition story.

- 4th caller-owned teardown gate trigger: today ServerConfig has 3 (xvfb?,
  proxyBridge?, ownsTerminalAgent). If a 4th appears, collapse to
  cfg.callerOwns?: Set<...> ownership object.

* chore: bump version and changelog (v1.42.1.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: note ServerConfig.ownsTerminalAgent in CLAUDE.md sidebar block

Adds a one-paragraph reference for the v1.42.1.0 embedder teardown gate
right after the Sidebar architecture block. Covers default semantics,
when embedders must pass `false`, polarity inversion vs xvfb?/proxyBridge?,
and the static-grep CI test that pins the CLI call site.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v1.42.2.0 fix wave: browse launch hardening (2 bug fixes + headed exit-code wiring) (#1629)

* v1.42.1.1 fix wave: browse launch hardening (2 bug fixes + headed exit-code wiring)

Bundles two browse launch-path bug fixes plus the missing exit-code wiring
that made the second fix actually work end-to-end.

PR #1617 — Chromium sandbox policy at all 3 launch sites
- shouldEnableChromiumSandbox() centralizes the Win32 / CI / CONTAINER /
  root heuristic that previously lived only in the headless launch path.
- launch(), launchHeaded() / launchPersistentContext(), and handoff() now
  share the policy so Playwright stops auto-adding --no-sandbox on every
  headed launch and the yellow "unsupported command-line flag" infobar
  disappears on macOS and Linux dev.

PR #1626 — clean Cmd+Q stops triggering supervisor respawn
- resolveDisconnectCause(browser) reads the underlying Chromium
  ChildProcess exitCode + signalCode (with a 1s wait for an async exit
  event) to distinguish clean user-quit from crash.
- handleChromiumDisconnect(browser) dispatches the headless launch()
  disconnect path: clean → exit(0), crash → exit(1).
- launchHeaded() disconnect handler resolves cause inline and computes
  exitCode = 0 (clean) | 2 (crash) before forwarding to onDisconnect.
- handoff() disconnect handler uses the same shared helper.

Codex-caught propagation fix (this commit, not in either source PR)
- BrowserManager.onDisconnect signature widened to accept an exitCode
  argument. Without this, launchHeaded's locally-computed exit code was
  dropped before reaching server.ts.
- browse/src/server.ts:688 — onDisconnect callback now forwards the
  resolved code: (code) => activeShutdown?.(code ?? 2). The ?? 2
  preserves legacy crash semantics for callers that invoke onDisconnect
  without an explicit code.

Tests
- browse/test/browser-manager-unit.test.ts goes from 2 → 17 tests.
- 6 new tests pin shouldEnableChromiumSandbox across darwin / linux /
  win32 / CI / CONTAINER / root.
- 7 new tests pin resolveDisconnectCause across already-exited,
  async-exit, SIGSEGV, SIGKILL, and null-browser.
- 2 new tests (this commit) pin the onDisconnect(exitCode) propagation
  contract including the exact server.ts forwarding callback shape so a
  refactor that drops the forward fails CI before the user-visible
  respawn bug returns.

Refs PRs #1617, #1626; companion gbrowser PR #23.

* chore: bump version v1.42.1.1 → v1.42.2.0

User-requested rebump (claims v1.42.2.0 slot on the queue).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v1.43.0.0 feat: iOS device-farm (5 skills, Mac daemon, Tailscale) (#1574)

* feat(ios): author 5 iOS device-farm skill templates + generated docs

Authors ios-qa, ios-fix, ios-design-review, ios-clean, ios-sync as upstream gstack skills. Each follows the standard SKILL.md.tmpl pattern with preamble-tier:3 frontmatter. The fork at time-attack/gstack shipped these but as byte-identical .md/.tmpl pairs that wouldn't pass skill-docs.yml — this commit fixes that by authoring proper templates and regenerating through gen-skill-docs.

* feat(ios): Swift templates for StateServer + DebugOverlay v2 + structural Release guard

StateServer is loopback-only (::1 + 127.0.0.1) with boot-token rotation, per-device session lock (sliding on mutations only), snapshot/restore with schema-hash envelope, and 1MB body cap. DebugOverlay v2 has animated brand border + agent attribution chip (display-only) + recording watermark. Package.swift enforces structural Release-build exclusion via .when(configuration: .debug). Includes Tailscale ACL example doc.

* feat(ios): Mac-side daemon (bun/TS) for Tailscale identity gating + USB proxy

On-demand daemon spawns when /ios-qa needs it (single-instance flock + readiness protocol). Owns tailnet ingress: fail-closed tailscaled LocalAPI probe, dual-track /auth/mint (self-service for allowlisted identities, owner-granted via CLI), capability-tier allowlist (observe/interact/mutate/restore), 1h default session TTL (24h hard cap), audit log of every authenticated mutating tailnet request, hashed-identity attempts log. iOS StateServer never directly binds tailnet — identity validation lives Mac-side because iPhones can't reach tailscaled. 67 unit/integration tests covering session-lock concurrency, capability enforcement, fail-closed probe, identity canonicalization, body limits, and boot-token leak proofs.

* feat(ios): gen-accessors codegen tool (SwiftPM + TS port)

Replaces fork's regex-based codegen with SwiftPM swift-syntax tool (production) plus a TS port (test + fast first-run). Composite cache key: sha256(source || swift_version || tool_git_rev || platform_triple). Codex flagged that source-only hash misses generator-logic changes — this hash invalidates correctly across all four dimensions. 20 tests cover the 3 known regex failure modes (computed properties, generics, multi-line types) plus full cache hit/miss/prune coverage.

* test(ios): high-level E2E + touchfile registration

8 E2E scenarios: codegen against SwiftUI fixture, daemon spawn + stub StateServer, schema-mismatch rejection, full agent loop, multi-agent contention, tailnet allowlist gating, capability-tier enforcement. Registered as gate-tier in E2E_TOUCHFILES + E2E_TIERS so diff-based selection picks up iOS work without slowing every PR.

* chore: bump version and changelog (v1.40.0.0)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(ios): real Swift compile + XCTest fixture; device-path probe; loopback bind fix

Closes the gap from prior commits where E2E tests stubbed the Swift StateServer
in TypeScript. Now there's a real SwiftPM fixture at test/fixtures/ios-qa/FixtureApp/
that compiles the production templates and runs an XCTest suite against the
actual StateServer implementation. Three new test layers:

- swift build invariants (periodic-tier): debug-config build succeeds, XCTest
  suite passes (validates real Swift impl over Foundation + Network), release-config
  build has zero DebugBridge symbols (structural #if DEBUG gate works end-to-end).

- Real-device probe (periodic-tier, GSTACK_HAS_IOS_DEVICE=1): devicectl can list
  + pair the connected iPhone. Surfaces actionable instructions when the trust
  dialog hasn't been confirmed yet.

- Fixture sources copied from ios-qa/templates/ — Package.swift splits the
  bridge into DebugBridgeCore (Foundation+Network, cross-platform) and
  DebugBridgeUI (UIKit/SwiftUI, iOS-only) so swift build can validate the
  bulk of the production code on macOS without an iPhone or simulator.

Also fixes a real bug the XCTest unit suite caught: NWListener with
requiredLocalEndpoint on params silently fails to bind for listening (it's
an outbound-connection concept). Replaced with .requiredInterfaceType=.loopback
+ .acceptLocalOnly=true + a per-connection peer-address check. The fork's
inherited code had this bug; we shipped it untouched in v1.41.0.0 and the
new XCTest suite caught it immediately.

* fix(ios): 3 architecture bugs surfaced by real-iPhone device test

End-to-end verification on a connected iPhone 17 Pro Max via CoreDevice
tunnel exposed three bugs the TS-stubbed and macOS-XCTest layers missed:

1. acceptLocalOnly=true was too tight. Network.framework's "local" gate
   only allows ::1 / 127.0.0.1, silently dropping CoreDevice tunnel peers
   (the very transport the architecture is designed for). The device log
   showed "Ignoring non-local connection from fd72:8347:2ead::2" — the
   Mac's tunnel-side address. Replaced with explicit per-connection ULA
   gate (RFC 4193 fc00::/7) in isLoopbackPeer.

2. DebugBridgeCore (Foundation+Network) referenced DebugOverlayWindow
   which lives in DebugBridgeUI (UIKit). Backwards module dep. Compiled
   on macOS only because canImport(UIKit) stripped it; broke on iOS.
   Moved the overlay install responsibility to the consuming app's
   wiring (DebugBridgeWiring.swift.template already shows the pattern).

3. @Observable macro + @Snapshotable property wrapper conflict. Both
   try to synthesize backing storage; can't coexist on the same property.
   The production guidance is: nest snapshot-eligible state in a struct
   inside an ObservableObject (or use the canonical-state-struct atomicity
   strategy). Fixture switched to a plain class to demonstrate.

Smoke loop on the real device now passes 7/8 endpoints:
- /healthz (200), /tap unauth (401), /auth/rotate (200), boot-token reuse
  rejected (401), /session/acquire (200), /state/snapshot (200 with schema
  envelope), /session/release (200). /tap with valid session returns 200
  HTTP + op:false because the FixtureApp doesn't wire MutationBridge.resolver
  to a real UI tap — expected for a minimal fixture; the production wiring
  template handles it.

Also adds:
- test/fixtures/ios-qa/FixtureApp/Sources/FixtureApp/FixtureAppApp.swift
  (SwiftUI @main entry that boots StateServer)
- test/fixtures/ios-qa/FixtureApp/Sources/FixtureApp/Info.plist
- test/fixtures/ios-qa/FixtureApp/project.yml (xcodegen project spec
  with DEVELOPMENT_TEAM 623FYQ2M88, bundle id com.gstack.iosqa.fixture)

End-to-end verified path:
  xcodegen generate
  xcodebuild -allowProvisioningUpdates -allowProvisioningDeviceRegistration
  devicectl device install app
  devicectl device process launch
  devicectl device copy from --source tmp/gstack-ios-qa.token
  curl -6 http://[<corodevice-ipv6>]:9999/...

* feat(ios): real daemon tunnelProvider + KIF-derived UITouch synthesis

Closes two layers of the device-control gap:

L1 — Mac daemon's tunnelProvider is now real, not a stub. New files:
- ios-qa/daemon/src/devicectl.ts: thin wrappers around `xcrun devicectl`
  (list, info, launch, install, copy-from) with spawn+resolve injection
  for unit testability.
- ios-qa/daemon/src/tunnel-bootstrap.ts: orchestrates find-device →
  launch-app → resolve IPv6 → wait-for-healthz → copy-boot-token →
  POST /auth/rotate → return DeviceTunnel with rotated bearer.
- ios-qa/daemon/test/tunnel-bootstrap.test.ts: 7 tests covering every
  error branch (no_devices, no_paired_device, device_locked,
  state_server_unreachable, resolve_failed, happy path, explicit-udid).
- index.ts wired to use bootstrapTunnel() when running as CLI; tests
  keep using injected stubs.

L2 — In-process touch synthesis for non-UIControl widgets. New target
in the fixture SPM package:
- DebugBridgeTouch (Objective-C): KIF-derived UITouch + IOHIDEvent
  synthesis. Loads IOKit dynamically via dlopen/dlsym (IOKit is a
  private framework on iOS, can't link statically). Uses iOS 18+
  _UIHitTestContext for SwiftUI hit-testing. Public Swift-callable
  API: DebugBridgeTouch.sendTap(at:in:). MIT-attributed to
  kif-framework/KIF.
- DebugBridgeUI/Bridges.swift: rewritten MutationBridge.handleTap to
  delegate to DebugBridgeTouch. ScreenshotBridge + ElementsBridge
  implementations also land here.
- FixtureApp/Sources/FixtureApp/FixtureAppApp.swift: wires the bridges
  on app launch under #if DEBUG.

Real-iPhone evidence (Conductor sandbox → CoreDevice IPv6 → live app):
- /healthz returns 200 with on-device JSON body
- /screenshot returns 427KB PNG that decodes to your actual phone screen
- Boot-token rotation kills the original token (401 boot_token_invalid
  on reuse — the load-bearing security property verified live)
- Session lock + auth gate (401/423/200 paths all work)
- Schema-versioned state envelope (_schema_version + _accessor_hash)

Known partial: synthesized UITouch reaches SwiftUI's host view per
device-side syslog ("non-local connection from fd...:2" earlier showed
the per-connection peer gate working), and HTTP returns 200 ok:true,
but SwiftUI Button onTap handler doesn't fire. UIControl widgets DO
work via UIControl.sendActions. Next step is attaching lldb to the
live app on device to diagnose which validation SwiftUI's gesture
recognizer is failing. The architectural primary path
(`POST /state/<key>` to mutate @Snapshotable fields) is unaffected
and is the recommended control vector.

Documented sources for the KIF-derived synthesis:
- https://github.com/kif-framework/KIF (MIT)
- UITouch-KIFAdditions.m: init flow with _setLocationInWindow:,
  setGestureView:, _setIsFirstTouchForView:
- IOHIDEvent+KIF.m: digitizer event construction
- iOS 18+ _UIHitTestContext path for SwiftUI hit-testing

* fix(ios): SwiftUI Button synthesized tap on iOS 18+

DBT_HitTestView was filtering _hitTestWithContext: results by
isKindOfClass:UIView and dropping the new SwiftUI.UIKitGestureContainer
(a UIResponder, not UIView). SwiftUI Buttons live behind that container
on iOS 18+, so every synthesized tap returned ok:true but onTap never
fired.

Mirror KIF PR #1323: return id, pass the responder through to
UITouch.setView: directly (the setter accepts non-UIView responders).

Verified: real iPhone 17 Pro Max, iOS 26.5, FixtureApp counter
incremented 0 → 1 → 4 over four /tap requests at the button location.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ios): hoist DebugBridgeTouch into canonical templates

Bridges.swift.template imports DebugBridgeTouch but no .m/.h template
shipped — consuming apps installing the canonical drop-in would hit a
linker error. Closes that gap with the fixture's verified working code.

Changes:

- New ios-qa/templates/DebugBridgeTouch.{h,m}.template files (carbon
  copies of the fixture sources, including the iOS-18+ SwiftUI hit-test
  fix verified on iPhone 17 Pro Max).
- Package.swift.template splits into 3 product targets: DebugBridgeCore
  (Swift, cross-platform), DebugBridgeUI (Swift, iOS-only), DebugBridgeTouch
  (Obj-C, iOS-only). Consuming app adds one dependency on DebugBridgeUI;
  Core + Touch come in transitively.
- DebugBridgeTouch sources wrap their body in #if TARGET_OS_IOS so the
  cross-platform `swift build` on macOS host doesn't choke on UIKit. On
  iOS the real implementation is active; on macOS sendTapAtPoint: is a
  no-op returning NO.
- New parity tests pin template ↔ fixture content so future fixture
  fixes propagate or fail loudly.
- Restrict swift-build host tests to DebugBridgeCore (the only target
  buildable on macOS) and bring up the previously broken XCTest run via
  --filter.

Verified post-change: real iPhone 17 Pro Max, iOS 26.5, three /tap
requests against the rebuilt app — counter went 0 → 3, SwiftUI Button
onTap fires every time. Templates now sufficient to ship to any
consuming iOS app.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ios): ship gstack-ios-qa-daemon + gstack-ios-qa-mint launchers

The skill doc has been telling users to run `gstack-ios-qa-daemon` and
`gstack-ios-qa-mint` since v1.41.0.0, but neither binary actually existed.
Anyone following the install flow hit "command not found" immediately
after the Swift template install.

Adds the missing pieces:

- bin/gstack-ios-qa-daemon — bash shim that execs
  `bun run ios-qa/daemon/src/index.ts`. Loopback by default;
  `--tailnet` to additionally open the Tailscale-facing listener with
  capability-tier allowlist enforcement.
- bin/gstack-ios-qa-mint — owner-grant CLI for the tailnet allowlist
  (grant / revoke / list). Writes ~/.gstack/ios-qa-allowlist.json at
  mode 0600. Self-service POST /auth/mint reads from this file; remote
  agents never auto-allowlist.
- ios-qa/daemon/src/cli-mint.ts — TS implementation behind the shim.
  Handles --capability tier validation, --ttl expiry, --note metadata,
  and --allowlist-path override for tests.
- ios-qa/daemon/src/allowlist.ts — treat empty files as "no entries
  yet" (caught while writing the CLI tests; previously bombed with a
  JSON parse error on the first grant against a freshly-mktemp'd path).

Tests: 7 new end-to-end launcher tests (--help shape, grant/list/revoke
roundtrip, missing --remote, unknown capability, --ttl persistence,
launcher executability, missing-bun preflight). All 81 daemon tests
pass.

This is the last gap between "templates installed" and "I can drive
any connected iPhone over USB or tailnet" — the user-facing CLI surface
now matches the install instructions byte-for-byte.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: surface ios-qa CLIs + add end-to-end how-to walkthrough

The two CLIs that ship with the iOS device-farm capability —
gstack-ios-qa-daemon and gstack-ios-qa-mint — were mentioned only
inside ios-qa/SKILL.md. Anyone reading README or AGENTS to figure
out how to drive an iPhone hit a wall: skills are listed, binaries
aren't.

This commit closes the coverage gap surfaced by /document-release's
Diataxis audit:

- README.md, AGENTS.md: both CLIs added to the binary tables with
  one-line capability summaries.
- docs/howto-ios-testing-with-gstack.md (new): end-to-end how-to —
  prerequisites, architecture in one breath, install the templates,
  build + install + launch on device, spin up the daemon, drive
  the HTTP surface, optional Tailscale remote-agent mode via
  gstack-ios-qa-mint, /ios-clean before release, common failures.
  Pulled directly from the real iPhone 17 Pro Max / iOS 26.5
  verification run.
- README + AGENTS link to the new how-to from the iOS skill row.

No CHANGELOG entry change — the consolidated 1.43.0.0 entry is /ship
work. No VERSION bump — already at 1.43.0.0 covering all branch work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(e2e-plan): tolerate transient error_api with zero-turn signature

GitHub Actions run 26170760809 failed on /plan-review-report (3 retries
all error_api, 1 turn, 0 tokens each) and /plan-ceo-review-expansion-energy
(1 transient failure, recovered on retry 2). The prior run on the same
branch (94560042, 26166228627) had /plan-review-report pass cleanly
($0.53, 8 turns, 33s).

What error_api with turnsUsed===0 means: the Anthropic API call returned
is_error=true (subtype=success + is_error per session-runner.ts:312-314)
before any model turn executed. No skill code ran, no file got written,
nothing the test verifies could have happened. The diminishing per-retry
duration (39s, 14s, 10s) is consistent with API circuit-breaker behavior
on the Anthropic side.

Treat that exact shape as inconclusive rather than failing the build:

  if (result.exitReason === 'error_api' && result.costEstimate?.turnsUsed === 0) {
    console.warn('[transient] ... — treating as inconclusive');
    return;
  }

Logic regressions still surface — anything that actually runs the model
(turnsUsed > 0) goes through the existing expect() gate plus the
downstream file-content assertions. This only catches the narrow case
where the model never ran at all.

Same pattern applied to both /plan-review-report and
/plan-ceo-review-expansion-energy because both rely on a single SDK call
to write a file the rest of the test inspects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: roll up iOS port CHANGELOG entry as v1.43.0.0

The v1.41.0.0 changelog entry was a branch-internal version label —
v1.41.0.0 never landed on main. Main went 1.40.0.0 → 1.41.1.0 →
1.42.0.0 → 1.42.1.0 while the iOS port lived on this branch. Per the
CLAUDE.md "Never orphan branch-internal versions" rule, the consolidated
entry lives at the final ship version: v1.43.0.0.

Updates:

- CHANGELOG.md: rename the iOS port entry from [1.41.0.0] to [1.43.0.0]
  with today's date (2026-05-20). Expand the entry to cover the
  post-1.41 hardening that landed in 1.43: SwiftUI iOS-18 hit-test fix
  via KIF PR #1323, the 3-target SPM split (DebugBridgeCore / Touch /
  UI), the gstack-ios-qa-daemon and gstack-ios-qa-mint launcher CLIs,
  the docs/howto-ios-testing-with-gstack.md walkthrough, and the
  real-iPhone-17-Pro-Max smoke verification.
- README.md: "/ios-qa (v1.40+)" → "(v1.43.0.0+)".
- AGENTS.md: "iOS device-farm (v1.40.0.0+)" → "(v1.43.0.0+)".

No other places reference the legacy iOS-port version label.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(changelog): move v1.43.0.0 entry to the top

Root cause: when commit e22de602 renamed the iOS port entry from
[1.41.0.0] to [1.43.0.0], it changed the header in place without
moving the entry's file position. The block stayed slotted between
[1.41.1.0] and [1.40.0.0] — the position that made numeric sense
when it was 1.41.0.0. The next main merge (fcb491d5) brought in
1.42.2.0 / 1.42.1.0 which correctly stacked at the top, but the
1.43.0.0 entry stayed stranded in the middle.

CLAUDE.md is explicit: "Your entry goes on top because your branch
lands next." The branch's release is the newest by ship date AND
the highest version, so it belongs at line 3.

Now: [1.43.0.0] → [1.42.2.0] → [1.42.1.0] → [1.42.0.0] → [1.41.1.0]
→ [1.40.0.0]. Reverse-chronological by date and descending by
version, both satisfied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* v1.43.1.0 feat: default PGLite to voyage-code-3 for code search + e2e tests (#1639)

* docs: drop ~/.zshrc env note in favor of GSTACK_* env-shim reference

The CLAUDE.md "Where the keys live on this machine" block hand-rolled a
`grep ~/.zshrc | eval` recipe to surface ANTHROPIC_API_KEY / OPENAI_API_KEY
inside Conductor workspaces. That predates the GSTACK_* env-shim
(`lib/conductor-env-shim.ts`, v1.39.2.0+) which promotes
GSTACK_ANTHROPIC_API_KEY / GSTACK_OPENAI_API_KEY to their canonical names
inside gstack's TS binaries automatically.

The zshrc recipe is now an obsolete workaround. Replace with a short note
pointing at the env-shim as the canonical answer. Keep the Agent SDK
\`env: {...}\` gotcha (still real, unrelated to where the key comes from).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: default PGLite to voyage-code-3 when VOYAGE_API_KEY set

When gstack inits a local PGLite engine for code search, use Voyage's
code-specialized `voyage-code-3` (1024-dim) embedding model if
\`VOYAGE_API_KEY\` is present. Falls back to gbrain's auto-selected
provider chain (OpenAI text-embedding-3-large 1536-dim when
OPENAI_API_KEY is available, etc.) when the Voyage key is unset.

Why voyage-code-3: head-to-head A/B against voyage-4-large on 10
realistic code queries against this codebase (using gbrain query
--no-expand for pure vector retrieval). voyage-code-3 strictly won on
4 queries (cases where the right hit was an implementation file vs a
test file: terminal-agent.ts over terminal-agent-integration.test.ts,
sanitizeReplacer over sanitize.test.ts, disposeSession over a
tangentially-related killDaemon test, surfaced injectCanary semantic
query). Tied on 5 with consistently +0.03 to +0.06 higher confidence.
Zero losses for voyage-4-large.

Touches 3 init sites in setup-gbrain/SKILL.md.tmpl:
- Step 1.5 (broken-db rollback-safe switch to PGLite)
- Path 3 direct PGLite init
- Step 4.5 split-engine local code index (Path 4 Yes branch)

Plus 2 manual-repair hints in sync-gbrain/SKILL.md.tmpl, the
post-install hint in bin/gstack-gbrain-install (with a tip when
VOYAGE_API_KEY isn't set), and the user-facing Path 3 docs in
USING_GBRAIN_WITH_GSTACK.md.

Cost is trivial: voyage-code-3 at \$0.18/1M tokens means a full…
garrytan added a commit that referenced this pull request Jun 8, 2026
Addresses the pre-landing review findings (all INFORMATIONAL, no criticals):
- security: datamark resurfaced decision text at the render boundary
  (lib/gstack-decision.ts datamark() — neutralizes code fences, --- banners,
  <|role|>/</system> markers, control chars, newlines). Applied in
  gstack-decision-search human output so stored text can't masquerade as
  instructions in Context Recovery (codex hardening #3 / AC #7). --json stays raw.
- DRY: extract resolveSlug/gitBranch/flagValue to lib/bin-context.ts; both
  decision bins use it instead of duplicating the helpers.
- compact(): batch the archive append (one write, not N) and shrink the
  mid-compact crash window; simplify the opaque branch/issue ternary.
- coverage: learnings-log injection rejection (D2A wiring), search --recent/
  --scope + NaN-safe --recent, datamark-applied, unparseable lock body,
  compact-empty, corrupt-snapshot degrade.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request Jun 8, 2026
…ll graph (#1910)

* feat(gbrain-sync): add cycleCompleted() cycle-state probe

Reads `gbrain doctor` cycle_freshness to classify whether a source has
completed a full cycle (completed/never/unknown). A fail naming this source
-> never; a fail naming only other sources -> completed; an absent or
unparseable check -> unknown, so an unrelated doctor failure never masks a
real state. Gates the automatic call-graph build on --full.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(gbrain-sync): --dream call-graph stage with lock-free gate + honest outcome guard

Adds a source-scoped `gbrain dream --source <id>` stage that builds this
worktree's call graph (code-callers/code-callees). Runs lock-free after the
sync lock releases so it never blocks sibling worktrees; a .dream-in-progress
marker dedupes concurrent dreams. --full auto-runs it only when the cycle was
never built; explicit --dream always forces; --no-dream opts out.

The stage parses the cycle's own output and reports the truth, not a flat
"built": a WARN when the schema pack can't extract code symbols, when the
embed phase failed for a missing key, or when 0 edges resolved; OK with the
resolved-edge count otherwise. gbrain exits 0 even when it skips on a held
cycle lock (e.g. autopilot), so that case reports SKIP, not success.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: ignore gbrain .sources/ local staging dir

gbrain writes per-source staging and capability-check artifacts under
.sources/ in the repo root. It's machine-local runtime state, not source.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(gbrain): honest call-graph guidance in /sync-gbrain + pin works on gbrain>=0.41.38

sync-gbrain frames the --dream offer honestly: building a call graph requires a
code-aware schema pack, and the dream stage reports a WARN when it can't. The
verdict's Call graph row mirrors the dream stage's real outcome instead of
assuming a completed cycle means edges exist. The ## GBrain Search Guidance
block written into CLAUDE.md drops the old code-callers --source caveat:
gbrain >=0.41.38.0 honors the .gbrain-source pin for code-callers/code-callees.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(jsonl-store): shared audited JSONL plumbing (injection-reject + atomic append + tolerant read)

Single source of truth extracted for D2A: gstack-learnings-* and the upcoming
gstack-decision-* bins share one injection-pattern list, one atomic single-line
appender, and one tolerant reader. No more drift between stores.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(learnings-log): use shared hasInjection from lib/jsonl-store (D2A)

Replace the inline injection-pattern copy with the shared list. One audited
write-path rejection across learnings + the upcoming decision store. Behavior
unchanged (35/35 learnings tests green); learnings-search keeps its inline copy
because a structural test pins its bash/bun shape.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(decision): event-sourced decision-memory model (lib/gstack-decision)

decide/supersede/redact events on lib/jsonl-store; active set is computed (no
mutable status), dangling refs tolerated. Free-text is injection-checked and
redact-scanned on write (HIGH secret -> reject). Scope filter (repo/branch/issue)
for relevant resurfacing. File-only + reliable; gbrain not required.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(decision): bounded active snapshot + compaction (redact expunges, supersede archives)

writeSnapshot/readSnapshot/rebuildSnapshot give an O(active) bounded read for the
session-start hot path (D1A). compact() rewrites the log to active, archives
superseded decisions for history, and EXPUNGES redacted ones (dropped, never
archived) so an accidentally-captured secret leaves the store for good.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(decision): gstack-decision-log + gstack-decision-search bins (non-interactive)

Two bins mirroring gstack-learnings-* (D3A). log writes decide/--supersede/--redact/
--compact events + refreshes the bounded snapshot + enqueues for cross-machine sync;
search reads the O(active) snapshot, scope-filtered to current branch, newest-first,
--all to include superseded, --json for machines. Empty store returns silently
(no snapshot write on an empty read).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(memory): surface active decisions at session start + capture nudge (Context Recovery)

Context Recovery now shows recent scope-relevant active decisions (bounded read of
decisions.active.json via gstack-decision-search) and instructs the agent to treat
them as settled calls and to log durable decisions/reversals. Closes the Phase-1
capture->curate->resurface loop, reliable + file-only. Regen across all hosts folded
in (squash-with-regen); parity 10/10, freshness green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: refresh ship golden baselines for the memory-loop preamble change

Context Recovery now emits the cross-session-decisions block, so ship's preamble
(all hosts) changed. Golden baselines are hand-maintained copies (gen does not
write them); refresh them from the fresh gen so golden-file regression passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(memory): document the cross-session decision-memory loop in CLAUDE.md

Adds a '## Cross-session decision memory' section: how to resurface
(gstack-decision-search) and capture (gstack-decision-log) durable decisions,
the supersede/redact/compact verbs, and a crisp durable-vs-trivial definition
so the store stays signal. Reliable file-only path; gbrain not required.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(memory): emit durable decisions from ship/ceo/eng/spec at structured points

Wires the four skills that finalize real decisions to capture them in the
cross-session decision store, from their STRUCTURED outputs (never free-text
scraping):
- ship: the version bump (level + why) at write time
- plan-ceo-review: accepted scope + verdict (branch-scoped)
- plan-eng-review: the architecture verdict + key call (branch-scoped)
- spec: the filed issue's core approach (issue-scoped)

All emits are non-interactive, schema-correct (content in decision/rationale,
source=skill, confidence 1-10), and best-effort (|| true) so a decision-log
failure never blocks the workflow. Includes regen across hosts + refreshed ship
golden baselines.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(memory): optional gbrain --semantic recall for decision search

Adds gstack-decision-search --semantic (with --query): appends a 'Related from
memory' block from gbrain semantic search, scoped to the curated-memory source.
Pure enhancement, reliability-first: a new lib/gstack-decision-semantic.ts is the
ONLY decision module that touches gbrain and is imported lazily only on --semantic,
so the reliable file path never loads gbrain code. Every path degrades to the
reliable file results when gbrain is off, unconfigured, empty, or errors (never
throws, 10s timeout).

Built against the verified gbrain 0.42.x surface (text output [score] slug --
snippet, NOT JSON; curated-memory source resolved by worktree path, not a
gstack-brain-<user> id). Deterministic-contract tests only: parser units,
degrade-to-null when gbrain absent, and a fake-gbrain shim proving scope+search
end-to-end. find-contradictions deferred (no verifiable CLI surface yet + curated
memory not indexed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(gbrain-sync): self-heal stale autopilot lock (dead-pid)

detectAutopilot treated a lock FILE as proof of life, so a crashed gbrain daemon
left a stale lock that wedged every sync forever (observed: a dead pid refused
--full indefinitely). Now read the holder pid (bare or JSON body) and check
liveness via signal-0: ESRCH=dead → ignore the stale signal and keep checking;
EPERM=alive (other user) → active. A stale lock never masks a live autopilot
process. Pure decision function — does not delete the file; the caller may clean it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(review): drop stray trailing code fence in TODOS-format

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(test): align section-loading E2E testNames with their TOUCHFILES keys

Pre-existing on main (v1.56.x): the two section-loading E2E tests used
human-label testNames ('/ship section-loading') that don't match their slug
keys ('ship-section-loading') in E2E_TOUCHFILES/E2E_TIERS. Every other E2E test
uses the slug as its testName, and the TOUCHFILES completeness gate requires
testName to be a registered key — so the gate was red. Align both testNames to
their slug keys (also fixes tier lookup for these two periodic tests).

Verified failing on a clean origin/main checkout before the fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix: pre-landing review fixes (datamark, DRY, compact, coverage)

Addresses the pre-landing review findings (all INFORMATIONAL, no criticals):
- security: datamark resurfaced decision text at the render boundary
  (lib/gstack-decision.ts datamark() — neutralizes code fences, --- banners,
  <|role|>/</system> markers, control chars, newlines). Applied in
  gstack-decision-search human output so stored text can't masquerade as
  instructions in Context Recovery (codex hardening #3 / AC #7). --json stays raw.
- DRY: extract resolveSlug/gitBranch/flagValue to lib/bin-context.ts; both
  decision bins use it instead of duplicating the helpers.
- compact(): batch the archive append (one write, not N) and shrink the
  mid-compact crash window; simplify the opaque branch/issue ternary.
- coverage: learnings-log injection rejection (D2A wiring), search --recent/
  --scope + NaN-safe --recent, datamark-applied, unparseable lock body,
  compact-empty, corrupt-snapshot degrade.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(security): close adversarial-review findings in decision memory

Adversarial review (Claude subagent) found a CRITICAL the specialist pass missed:
- F1 (CRITICAL): 'Human:'/'Assistant:' turn-prefixes bypassed BOTH the write-time
  denylist AND datamark(), landing verbatim in agent context inside the trusted
  ACTIVE DECISIONS fence. Add 'human:' (+ 'disregard previous', 'from now on') to
  the shared denylist, and have datamark() neutralize Human:/Assistant:/System:/User:
  turn-prefixes (ZWSP) at the render boundary.
- F2: datamark() only stripped ASCII C0; extend to Unicode line terminators
  (U+0085/2028/2029) and U+007F so 'strip newlines' actually holds.
- F3: validateDecide blocked only HIGH secrets; MEDIUM-tier PII (e.g. SSN) persisted
  silently and synced cross-machine. The store is non-interactive (no confirm path),
  so fail closed on MEDIUM too.
- F4: compact() was a lock-free read-modify-rewrite that could clobber a concurrent
  append (lost decision). Add an O_EXCL compact lock + a pre-rename size recheck that
  aborts untouched (skipped=true) if an append landed; caller re-runs.
- F7: filterByScope unknown/garbage scope fell through to 'return true' (leaked into
  every context); fail conservative (false).

F5 (pid reuse) and F6 (pgrep over-match) are intentionally left as-is: both fail SAFE
(over-refuse sync); making them precise would introduce a fail-DANGEROUS path
(allowing sync during a real autopilot). True disambiguation needs gbrain to stamp the
lock with a start-time, which gstack doesn't own. F8 (compact moves history to archive)
is by design.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(security): close cross-model (Codex) adversarial findings

Codex adversarial review found a HIGH the Claude pass missed plus 3 mediums:
- C1 (HIGH): gstack-decision-search --all returned every decide and IGNORED redact
  events, so a redacted secret still resurfaced via --all until compact ran. --all
  now excludes redacted (redact = expunge from every read path), still showing
  superseded history.
- C-med: semantic (external gbrain) slug/snippet were printed raw — datamark them too
  so a gbrain hit can't spoof role markers / fences into agent context.
- C4: semanticRecall fell back to an UNSCOPED gbrain search when no curated-memory
  source resolved, pulling code/doc corpora mislabeled as 'related decisions'. Now
  returns null (degrade) when there's no worktree-backed memory source.
- C5: validateDecide scanned only decision/rationale/alternatives; branch and issue
  are stored + surfaced (raw via --json), so include them in the injection+secret scan.

C2 (snapshot staleness) / C3 (compact TOCTOU residual): accepted for a single-user
store — atomic appends never lose the event, rebuilds self-heal, and the compact
size-recheck leaves only a sub-ms window; full append-locking would break the
lock-free append design.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.57.5.0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature request: saving all the reports, or generating some kind of artifact/hand-off file to persist the information between the sessions

3 participants