v1.57.5.0 feat: cross-session decision memory + gbrain dream-stage call graph by garrytan · Pull Request #1910 · garrytan/gstack

garrytan · 2026-06-08T04:47:11Z

Summary

Cross-session decision memory (the headline). An institutional memory loop so durable decisions and their rationale survive across sessions: capture → curate → resurface.

Event-sourced store (decide/supersede/redact) at ~/.gstack/projects/<slug>/decisions.jsonl; "active" is computed, dangling-ref tolerant. Bounded active snapshot keeps session-start reads O(active), with explicit --compact.
Two non-interactive bins: gstack-decision-log (--supersede/--redact/--compact) and gstack-decision-search (--scope/--recent/--query/--all/--json/--semantic).
Context Recovery resurfaces scope-relevant active decisions at session start; a ## Cross-session decision memory section added to CLAUDE.md.
/plan-ceo-review, /plan-eng-review, /spec, /ship auto-capture their structured decisions.
Optional gbrain --semantic recall, lazily loaded; degrades to the reliable file results when gbrain is off/empty.
Shared lib/jsonl-store.ts (injection-reject + atomic append + tolerant read; learnings-log refactored onto it) and lib/bin-context.ts.
Security properties: resurfaced text is datamarked (fences, --- banners, <|role|>/</system>, chat turn-prefixes, Unicode line terminators), HIGH and MEDIUM secrets blocked on write, redact expunges from every read path (incl. --all), compact is lock + size-recheck guarded.
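The "active is computed" idea above can be sketched as a fold over the event log. This is a minimal illustration, not the code in lib/gstack-decision.ts: the event shapes, field names (`id`, `target`, `by`) and the function name `computeActive` are assumptions made for the example.

```typescript
// Hypothetical event shapes; the real schema in decisions.jsonl may differ.
type DecisionEvent =
  | { type: "decide"; id: string; decision: string; rationale: string }
  | { type: "supersede"; target: string; by: string }
  | { type: "redact"; target: string };

type Decide = Extract<DecisionEvent, { type: "decide" }>;

// Fold the log into the active set. No status field is ever mutated:
// a decision is active if it was decided and never superseded or redacted.
function computeActive(events: DecisionEvent[]): Decide[] {
  const decided = new Map<string, Decide>();
  const inactive = new Set<string>();
  for (const ev of events) {
    if (ev.type === "decide") {
      decided.set(ev.id, ev);
    } else {
      // A supersede/redact that points at an unknown id (a dangling ref)
      // is recorded and otherwise ignored; it never throws.
      inactive.add(ev.target);
    }
  }
  return [...decided.values()].filter((d) => !inactive.has(d.id));
}
```

Because the result depends only on the log, a snapshot of the active set can be rebuilt at any time from the events alone.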

gbrain dream-stage + reliability.

/sync-gbrain --dream builds the symbol cross-reference call graph behind a lock-free gate with an honest WARN-not-false-success guard; cycleCompleted() cycle-state probe.
Self-heal for a crashed autopilot daemon's stale lock (reads holder pid, signal-0 liveness, conservative when it can't tell) so a dead pid no longer wedges every sync.
Accurate pin/call-graph guidance in /sync-gbrain; ignore gbrain .sources/ staging dir.

Test Coverage

All new code paths have test coverage. 117 tests across the decision store + gbrain stages (jsonl-store, gstack-decision, gstack-decision-bins, gstack-decision-semantic, gbrain-dream-stage, gbrain-cycle-completed, gbrain-guards). Full bun test passes with 0 failures; gen:skill-docs clean (0 stale), parity 10/10, ship goldens refreshed.

Pre-Landing Review

4 findings, all INFORMATIONAL (no criticals), all auto-fixed: datamark resurfaced text, DRY-extract shared bin helpers, batch the compact archive append, close test-coverage gaps. PR quality score 9.0.

Adversarial review ran cross-model (Claude + Codex). Claude found a real injection bypass (chat turn-prefixes like Human: defeating both the write denylist and datamark) plus edge cases; Codex found --all ignoring redact and a few mediums. All actionable findings are fixed and tested. Two findings are intentionally left as-is because they fail SAFE (the autopilot pid-reuse and pgrep-substring cases over-refuse sync; tightening them would risk a fail-dangerous path); documented as such.

Design Review

No frontend files changed — design review skipped.

Eval Results

Diff-based selection pulls ~142 E2E + 20 LLM-judge tests (the changes touch the shared preamble/resolvers, which fan out to most skills). Paid evals deferred to CI gate-tier; the free suite (gen-skill-docs, parity, goldens, 117 decision/gbrain tests) is green locally.

Plan Completion

Plan: institutional decision-memory (capture → curate → resurface). All three phases delivered — reliable file-only core + bins + Context Recovery; CLAUDE.md section + structured skill emits; optional gbrain --semantic. find-contradictions deferred (the gbrain CLI surface for it can't be verified deterministically yet and the curated-memory source isn't indexed).

TODOS

No TODO items completed in this PR.

Test plan

bun test full suite passes (0 failures)
gen:skill-docs clean (0 stale), parity 10/10, ship goldens refreshed
Decision store + gbrain stages: 117 unit tests pass
CI runs gate-tier evals on this PR

🤖 Generated with Claude Code

Reads `gbrain doctor` cycle_freshness to classify whether a source has completed a full cycle (completed/never/unknown). A fail naming this source -> never; a fail naming only other sources -> completed; an absent or unparseable check -> unknown, so an unrelated doctor failure never masks a real state. Gates the automatic call-graph build on --full. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
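The three-way classification in this commit message can be sketched as a small pure function. The `FreshnessCheck` shape below is hypothetical (the real `gbrain doctor` output format is not shown in this PR), and treating a passing check as "completed" is an assumption the commit message does not state.

```typescript
type CycleState = "completed" | "never" | "unknown";

// Hypothetical, already-parsed view of the cycle_freshness check.
interface FreshnessCheck {
  status: "ok" | "fail";
  failingSources: string[]; // source ids named by a failing check
}

function classifyCycle(check: FreshnessCheck | null, sourceId: string): CycleState {
  if (check === null) return "unknown"; // absent or unparseable check
  if (check.status === "ok") return "completed"; // assumption: a pass implies a full cycle ran
  if (check.failingSources.length === 0) return "unknown"; // a fail that names nothing: cannot tell
  // A fail naming this source means it never cycled; a fail naming only
  // other sources says nothing bad about this one.
  return check.failingSources.includes(sourceId) ? "never" : "completed";
}
```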

…est outcome guard Adds a source-scoped `gbrain dream --source <id>` stage that builds this worktree's call graph (code-callers/code-callees). Runs lock-free after the sync lock releases so it never blocks sibling worktrees; a .dream-in-progress marker dedupes concurrent dreams. --full auto-runs it only when the cycle was never built; explicit --dream always forces; --no-dream opts out. The stage parses the cycle's own output and reports the truth, not a flat "built": a WARN when the schema pack can't extract code symbols, when the embed phase failed for a missing key, or when 0 edges resolved; OK with the resolved-edge count otherwise. gbrain exits 0 even when it skips on a held cycle lock (e.g. autopilot), so that case reports SKIP, not success. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
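The gating and reporting rules described here can be sketched as two pure functions. All names (`DreamFlags`, `DreamFacts`, `shouldRunDream`, `reportDream`) are hypothetical, the facts are assumed to be already parsed from the cycle's output, and letting `--dream` win over `--no-dream` when both are passed is an assumption based on "always forces".

```typescript
interface DreamFlags { full: boolean; dream: boolean; noDream: boolean }

function shouldRunDream(flags: DreamFlags, cycleNeverBuilt: boolean): boolean {
  if (flags.dream) return true;    // explicit --dream always forces
  if (flags.noDream) return false; // --no-dream opts out
  return flags.full && cycleNeverBuilt; // --full auto-runs only when never built
}

// Hypothetical facts extracted from the dream cycle's own output.
interface DreamFacts {
  lockHeld: boolean;              // gbrain skipped on a held cycle lock (still exits 0)
  symbolsExtractable: boolean;    // schema pack can extract code symbols
  embedFailedMissingKey: boolean; // embed phase failed for a missing key
  resolvedEdges: number;
}

type DreamReport =
  | { level: "OK"; edges: number }
  | { level: "WARN" | "SKIP"; reason: string };

function reportDream(f: DreamFacts): DreamReport {
  if (f.lockHeld) return { level: "SKIP", reason: "cycle lock held" };
  if (!f.symbolsExtractable) return { level: "WARN", reason: "schema pack cannot extract code symbols" };
  if (f.embedFailedMissingKey) return { level: "WARN", reason: "embed phase failed: missing key" };
  if (f.resolvedEdges === 0) return { level: "WARN", reason: "0 edges resolved" };
  return { level: "OK", edges: f.resolvedEdges };
}
```

The point of the second function is that a zero exit code alone never produces an OK: success is reported only with a non-zero resolved-edge count.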

gbrain writes per-source staging and capability-check artifacts under .sources/ in the repo root. It's machine-local runtime state, not source. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…on gbrain>=0.41.38 sync-gbrain frames the --dream offer honestly: building a call graph requires a code-aware schema pack, and the dream stage reports a WARN when it can't. The verdict's Call graph row mirrors the dream stage's real outcome instead of assuming a completed cycle means edges exist. The ## GBrain Search Guidance block written into CLAUDE.md drops the old code-callers --source caveat: gbrain >=0.41.38.0 honors the .gbrain-source pin for code-callers/code-callees. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…in-gstack # Conflicts: # bin/gstack-gbrain-sync.ts # lib/gbrain-sources.ts

…atomic append + tolerant read) Single source of truth extracted for D2A: gstack-learnings-* and the upcoming gstack-decision-* bins share one injection-pattern list, one atomic single-line appender, and one tolerant reader. No more drift between stores. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
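A rough sketch of the three responsibilities named here (injection-reject, single-line append, tolerant read), using only Node's `fs` module. The pattern list and the function names `appendRecord`/`readRecords` are illustrative assumptions; the shared list in lib/jsonl-store.ts is not shown in this PR.

```typescript
import { appendFileSync, existsSync, mkdtempSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Illustrative patterns only, not the audited list from the PR.
const INJECTION_PATTERNS: RegExp[] = [/ignore (all )?previous instructions/i, /<\/?system>/i];

function appendRecord(path: string, record: Record<string, unknown>): void {
  // JSON.stringify escapes newlines, so a record is always exactly one line.
  const line = JSON.stringify(record);
  if (INJECTION_PATTERNS.some((p) => p.test(line))) {
    throw new Error("rejected: injection pattern");
  }
  appendFileSync(path, line + "\n"); // one append call per record
}

function readRecords(path: string): unknown[] {
  if (!existsSync(path)) return [];
  const out: unknown[] = [];
  for (const raw of readFileSync(path, "utf8").split("\n")) {
    if (raw.trim() === "") continue;
    try {
      out.push(JSON.parse(raw));
    } catch {
      // Tolerant read: a torn or corrupt line is skipped, not fatal.
    }
  }
  return out;
}

// Usage against a throwaway temp directory.
const storeDir = mkdtempSync(join(tmpdir(), "jsonl-store-"));
const storeFile = join(storeDir, "decisions.jsonl");
appendRecord(storeFile, { type: "decide", id: "d1" });
appendFileSync(storeFile, "{not json\n"); // simulate a corrupt line
appendRecord(storeFile, { type: "decide", id: "d2" });
const storeRecords = readRecords(storeFile);
```

Routing every store through one appender and one reader is what lets a single pattern list cover both learnings and decisions.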

… (D2A) Replace the inline injection-pattern copy with the shared list. One audited write-path rejection across learnings + the upcoming decision store. Behavior unchanged (35/35 learnings tests green); learnings-search keeps its inline copy because a structural test pins its bash/bun shape. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ion) decide/supersede/redact events on lib/jsonl-store; active set is computed (no mutable status), dangling refs tolerated. Free-text is injection-checked and redact-scanned on write (HIGH secret -> reject). Scope filter (repo/branch/issue) for relevant resurfacing. File-only + reliable; gbrain not required. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…, supersede archives) writeSnapshot/readSnapshot/rebuildSnapshot give an O(active) bounded read for the session-start hot path (D1A). compact() rewrites the log to active, archives superseded decisions for history, and EXPUNGES redacted ones (dropped, never archived) so an accidentally-captured secret leaves the store for good. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
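The compaction rule (keep active, archive superseded, expunge redacted) can be sketched as a pure planning step that runs before any file is rewritten. The event shapes and `planCompact` are hypothetical; letting redact win over supersede and dropping the consumed supersede/redact markers are assumptions of this sketch.

```typescript
type StoreEvent =
  | { type: "decide"; id: string; decision: string }
  | { type: "supersede"; target: string }
  | { type: "redact"; target: string };

interface CompactPlan {
  log: StoreEvent[];     // rewritten log: active decisions only
  archive: StoreEvent[]; // superseded decisions, kept for history
  expunged: number;      // redacted decisions: dropped, never archived
}

function planCompact(events: StoreEvent[]): CompactPlan {
  const superseded = new Set<string>();
  const redacted = new Set<string>();
  for (const ev of events) {
    if (ev.type === "supersede") superseded.add(ev.target);
    if (ev.type === "redact") redacted.add(ev.target);
  }
  const plan: CompactPlan = { log: [], archive: [], expunged: 0 };
  for (const ev of events) {
    if (ev.type !== "decide") continue; // markers are consumed by the compaction
    if (redacted.has(ev.id)) plan.expunged += 1; // redact wins, so the text leaves the store
    else if (superseded.has(ev.id)) plan.archive.push(ev);
    else plan.log.push(ev);
  }
  return plan;
}
```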

…n-interactive) Two bins mirroring gstack-learnings-* (D3A). log writes decide/--supersede/--redact/--compact events + refreshes the bounded snapshot + enqueues for cross-machine sync; search reads the O(active) snapshot, scope-filtered to current branch, newest-first, --all to include superseded, --json for machines. Empty store returns silently (no snapshot write on an empty read). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ge (Context Recovery) Context Recovery now shows recent scope-relevant active decisions (bounded read of decisions.active.json via gstack-decision-search) and instructs the agent to treat them as settled calls and to log durable decisions/reversals. Closes the Phase-1 capture->curate->resurface loop, reliable + file-only. Regen across all hosts folded in (squash-with-regen); parity 10/10, freshness green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Context Recovery now emits the cross-session-decisions block, so ship's preamble (all hosts) changed. Golden baselines are hand-maintained copies (gen does not write them); refresh them from the fresh gen so golden-file regression passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…DE.md Adds a '## Cross-session decision memory' section: how to resurface (gstack-decision-search) and capture (gstack-decision-log) durable decisions, the supersede/redact/compact verbs, and a crisp durable-vs-trivial definition so the store stays signal. Reliable file-only path; gbrain not required. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ured points Wires the four skills that finalize real decisions to capture them in the cross-session decision store, from their STRUCTURED outputs (never free-text scraping):
- ship: the version bump (level + why) at write time
- plan-ceo-review: accepted scope + verdict (branch-scoped)
- plan-eng-review: the architecture verdict + key call (branch-scoped)
- spec: the filed issue's core approach (issue-scoped)
All emits are non-interactive, schema-correct (content in decision/rationale, source=skill, confidence 1-10), and best-effort (|| true) so a decision-log failure never blocks the workflow. Includes regen across hosts + refreshed ship golden baselines. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds gstack-decision-search --semantic (with --query): appends a 'Related from memory' block from gbrain semantic search, scoped to the curated-memory source. Pure enhancement, reliability-first: a new lib/gstack-decision-semantic.ts is the ONLY decision module that touches gbrain and is imported lazily only on --semantic, so the reliable file path never loads gbrain code. Every path degrades to the reliable file results when gbrain is off, unconfigured, empty, or errors (never throws, 10s timeout). Built against the verified gbrain 0.42.x surface (text output [score] slug -- snippet, NOT JSON; curated-memory source resolved by worktree path, not a gstack-brain-<user> id). Deterministic-contract tests only: parser units, degrade-to-null when gbrain absent, and a fake-gbrain shim proving scope+search end-to-end. find-contradictions deferred (no verifiable CLI surface yet + curated memory not indexed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
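The text contract mentioned here (`[score] slug -- snippet`, not JSON) suggests a line parser that returns null when nothing usable comes back, so the caller can fall through to the file results. A minimal sketch; `SemanticHit` and `parseSemanticHits` are hypothetical names, and the assumption that the score is a plain decimal number is mine.

```typescript
interface SemanticHit { score: number; slug: string; snippet: string }

// Parse gbrain-style text output, one hit per line. Lines that do not match
// the expected shape are ignored rather than treated as errors.
function parseSemanticHits(stdout: string): SemanticHit[] | null {
  const hits: SemanticHit[] = [];
  for (const line of stdout.split("\n")) {
    const m = /^\[(\d+(?:\.\d+)?)\]\s+(\S+)\s+--\s+(.*)$/.exec(line.trim());
    if (m === null) continue;
    const [, score = "", slug = "", snippet = ""] = m;
    hits.push({ score: Number(score), slug, snippet });
  }
  // null (not an empty array) signals "degrade to the file-only results".
  return hits.length > 0 ? hits : null;
}
```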

detectAutopilot treated a lock FILE as proof of life, so a crashed gbrain daemon left a stale lock that wedged every sync forever (observed: a dead pid refused --full indefinitely). Now read the holder pid (bare or JSON body) and check liveness via signal-0: ESRCH=dead → ignore the stale signal and keep checking; EPERM=alive (other user) → active. A stale lock never masks a live autopilot process. Pure decision function — does not delete the file; the caller may clean it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
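The liveness check described here maps directly onto Node's `process.kill(pid, 0)`, which delivers no signal and only reports whether the pid exists. A sketch under stated assumptions: the JSON field name `pid` and the function names are hypothetical, and anything other than a confirmed ESRCH is treated as "not stale" to stay conservative.

```typescript
type Liveness = "alive" | "dead" | "unknown";

// Accepts a bare pid ("1234") or a JSON body with a numeric pid field.
function parseHolderPid(body: string): number | null {
  const text = body.trim();
  if (/^\d+$/.test(text)) {
    const n = Number(text);
    return n > 0 ? n : null;
  }
  try {
    const parsed: unknown = JSON.parse(text);
    if (typeof parsed === "object" && parsed !== null && "pid" in parsed) {
      const pid = (parsed as { pid: unknown }).pid;
      if (typeof pid === "number" && Number.isInteger(pid) && pid > 0) return pid;
    }
  } catch {
    // Unparseable lock body: fall through to null.
  }
  return null;
}

function pidLiveness(pid: number): Liveness {
  try {
    process.kill(pid, 0); // signal 0: existence/permission check only
    return "alive";
  } catch (err) {
    const code = (err as { code?: string }).code;
    if (code === "ESRCH") return "dead";  // no such process
    if (code === "EPERM") return "alive"; // exists, owned by another user
    return "unknown";
  }
}

// Pure decision: a lock is stale only when its holder is provably dead.
function isStaleLock(body: string): boolean {
  const pid = parseHolderPid(body);
  return pid !== null && pidLiveness(pid) === "dead";
}
```

Like the function described in the commit, this only decides; deleting the lock file is left to the caller.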

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…in-use # Conflicts: # plan-ceo-review/SKILL.md # plan-ceo-review/SKILL.md.tmpl # plan-eng-review/SKILL.md # plan-eng-review/SKILL.md.tmpl

…keys Pre-existing on main (v1.56.x): the two section-loading E2E tests used human-label testNames ('/ship section-loading') that don't match their slug keys ('ship-section-loading') in E2E_TOUCHFILES/E2E_TIERS. Every other E2E test uses the slug as its testName, and the TOUCHFILES completeness gate requires testName to be a registered key — so the gate was red. Align both testNames to their slug keys (also fixes tier lookup for these two periodic tests). Verified failing on a clean origin/main checkout before the fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Addresses the pre-landing review findings (all INFORMATIONAL, no criticals):
- security: datamark resurfaced decision text at the render boundary (lib/gstack-decision.ts datamark() neutralizes code fences, --- banners, <|role|>/</system> markers, control chars, newlines). Applied in gstack-decision-search human output so stored text can't masquerade as instructions in Context Recovery (codex hardening #3 / AC #7). --json stays raw.
- DRY: extract resolveSlug/gitBranch/flagValue to lib/bin-context.ts; both decision bins use it instead of duplicating the helpers.
- compact(): batch the archive append (one write, not N) and shrink the mid-compact crash window; simplify the opaque branch/issue ternary.
- coverage: learnings-log injection rejection (D2A wiring), search --recent/--scope + NaN-safe --recent, datamark-applied, unparseable lock body, compact-empty, corrupt-snapshot degrade.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
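To make the render-boundary idea concrete, here is a rough sketch of a datamark-style sanitizer covering the categories the PR summary lists (fences, --- banners, role markers, turn-prefixes, control characters and Unicode line terminators). It is not the datamark() from lib/gstack-decision.ts: the PR only states that turn-prefixes are neutralized with a zero-width space, so using the same trick for fences, banners and role markers is an assumption.

```typescript
const ZWSP = "\u200b"; // zero-width space

function datamark(text: string): string {
  return text
    // Control chars, DEL, and Unicode line terminators collapse to one space,
    // so stored text can never start a new line in the rendered context.
    .replace(/[\u0000-\u001f\u007f\u0085\u2028\u2029]+/g, " ")
    // Break runs of 3+ backticks so they cannot open or close a code fence.
    .replace(/`{3,}/g, (m) => m.split("").join(ZWSP))
    // Break runs of 3+ dashes so they cannot form a --- banner.
    .replace(/-{3,}/g, (m) => m.split("").join(ZWSP))
    // Defang <|role|> style markers and <system>/</system> tags.
    .replace(/<(\|[^|>]*\||\/?system)>/gi, (_m, inner: string) => `<${ZWSP}${inner}>`)
    // Defang chat turn-prefixes such as "Human:" or "Assistant:".
    .replace(/\b(human|assistant|system|user):/gi, (_m, role: string) => `${role}${ZWSP}:`);
}
```

Applying this only on the human-readable path matches the note above that --json output stays raw for machine consumers.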

Adversarial review (Claude subagent) found a CRITICAL the specialist pass missed:
- F1 (CRITICAL): 'Human:'/'Assistant:' turn-prefixes bypassed BOTH the write-time denylist AND datamark(), landing verbatim in agent context inside the trusted ACTIVE DECISIONS fence. Add 'human:' (+ 'disregard previous', 'from now on') to the shared denylist, and have datamark() neutralize Human:/Assistant:/System:/User: turn-prefixes (ZWSP) at the render boundary.
- F2: datamark() only stripped ASCII C0; extend to Unicode line terminators (U+0085/2028/2029) and U+007F so 'strip newlines' actually holds.
- F3: validateDecide blocked only HIGH secrets; MEDIUM-tier PII (e.g. SSN) persisted silently and synced cross-machine. The store is non-interactive (no confirm path), so fail closed on MEDIUM too.
- F4: compact() was a lock-free read-modify-rewrite that could clobber a concurrent append (lost decision). Add an O_EXCL compact lock + a pre-rename size recheck that aborts untouched (skipped=true) if an append landed; caller re-runs.
- F7: filterByScope unknown/garbage scope fell through to 'return true' (leaked into every context); fail conservative (false).
F5 (pid reuse) and F6 (pgrep over-match) are intentionally left as-is: both fail SAFE (over-refuse sync); making them precise would introduce a fail-DANGEROUS path (allowing sync during a real autopilot). True disambiguation needs gbrain to stamp the lock with a start-time, which gstack doesn't own. F8 (compact moves history to archive) is by design. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
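The F4 fix (an exclusive compact lock plus a size recheck before the rename) can be sketched with Node's `fs` module. This is an illustration under assumptions, not the PR's compact(): the lock and temp file names, the `keep` predicate, and the `beforeRename` hook (present only so the race can be demonstrated) are all hypothetical.

```typescript
import {
  appendFileSync, closeSync, existsSync, mkdtempSync, openSync,
  readFileSync, renameSync, statSync, unlinkSync, writeFileSync,
} from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Returns a file descriptor if we created the lock, null otherwise.
function tryLock(lockPath: string): number | null {
  try {
    return openSync(lockPath, "wx"); // "wx": create exclusively, fail if it exists
  } catch {
    return null; // treat any failure as "could not lock"
  }
}

function compactLog(
  logPath: string,
  keep: (line: string) => boolean,
  beforeRename?: () => void,
): { skipped: boolean } {
  const lockPath = logPath + ".compact.lock";
  const fd = tryLock(lockPath);
  if (fd === null) return { skipped: true }; // another compact is running
  const tmpPath = logPath + ".tmp";
  try {
    const sizeAtRead = statSync(logPath).size;
    const kept = readFileSync(logPath, "utf8").split("\n").filter((l) => l !== "" && keep(l));
    writeFileSync(tmpPath, kept.map((l) => l + "\n").join(""));
    beforeRename?.();
    // Appends stay lock-free, so recheck: if the log grew since we read it,
    // abort and leave it untouched rather than clobber the new event.
    if (statSync(logPath).size !== sizeAtRead) {
      unlinkSync(tmpPath);
      return { skipped: true };
    }
    renameSync(tmpPath, logPath); // atomic replace on the same filesystem
    return { skipped: false };
  } finally {
    closeSync(fd);
    unlinkSync(lockPath);
  }
}

// Usage against a throwaway temp directory.
const compactDir = mkdtempSync(join(tmpdir(), "compact-"));
const logFile = join(compactDir, "decisions.jsonl");
writeFileSync(logFile, "keep-1\ndrop-2\nkeep-3\n");
const firstRun = compactLog(logFile, (l) => l.startsWith("keep"));
// Simulate an append landing mid-compact: the recheck aborts the rewrite.
const secondRun = compactLog(logFile, () => false, () => appendFileSync(logFile, "keep-4\n"));
const lockLeftBehind = existsSync(logFile + ".compact.lock");
```

A small window between the recheck and the rename remains, which is consistent with the residual TOCTOU the PR accepts for a single-user store.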

Codex adversarial review found a HIGH the Claude pass missed plus 3 mediums:
- C1 (HIGH): gstack-decision-search --all returned every decide and IGNORED redact events, so a redacted secret still resurfaced via --all until compact ran. --all now excludes redacted (redact = expunge from every read path), still showing superseded history.
- C-med: semantic (external gbrain) slug/snippet were printed raw; datamark them too so a gbrain hit can't spoof role markers / fences into agent context.
- C4: semanticRecall fell back to an UNSCOPED gbrain search when no curated-memory source resolved, pulling code/doc corpora mislabeled as 'related decisions'. Now returns null (degrade) when there's no worktree-backed memory source.
- C5: validateDecide scanned only decision/rationale/alternatives; branch and issue are stored + surfaced (raw via --json), so include them in the injection+secret scan.
C2 (snapshot staleness) / C3 (compact TOCTOU residual): accepted for a single-user store: atomic appends never lose the event, rebuilds self-heal, and the compact size-recheck leaves only a sub-ms window; full append-locking would break the lock-free append design. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…in-use

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

trunk-io · 2026-06-08T04:47:14Z

Merging to main in this repository is managed by Trunk.

To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here.

Resolves VERSION/CHANGELOG/package.json (branch keeps its higher v1.57.5.0; my CHANGELOG entry on top, main's v1.57.2.0 entry below). Regenerated SKILL.md across hosts from the merged templates. Bumped maxSizeRatio to 1.07 for investigate, cso, and design-consultation: the cross-cutting preamble growth (v1.57.2.0 AUQ-failure prose fallback + the decision-memory nudge) lands them just over the strict 1.05. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Resolves VERSION/CHANGELOG/package.json (branch keeps v1.57.5.0; my entry on top, main's v1.57.3.0 entry below). Regenerated SKILL.md across hosts from merged templates. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-08T06:03:48Z

E2E Evals: ❌ FAIL

53/59 tests passed | $11.56 total cost | 12 parallel runners

Suite	Result	Status	Cost
e2e-browse	0/3	❌	$0.21
e2e-deploy	5/5	✅	$1.2
e2e-design	4/4	✅	$0.67
e2e-plan	8/8	✅	$2.65
e2e-qa-workflow	3/3	✅	$1.23
e2e-review	6/6	✅	$1.9
e2e-workflow	3/3	✅	$0.67
llm-judge	16/19	❌	$0.38
e2e-plan	8/8	✅	$2.65

12x ubicloud-standard-8 (Docker: pre-baked toolchain + deps) | wall clock ≈ slowest suite

Failures

❌ operational learning: error_max_turns
❌ operational learning: error_max_turns
❌ operational learning: error_max_turns
❌ document-release/SKILL.md workflow: unknown
❌ document-release/SKILL.md workflow: unknown
❌ document-release/SKILL.md workflow: unknown

xg-gh-25 · 2026-06-08T10:33:17Z

This is a master class in institutional memory architecture — the "capture → curate → resurface" loop you've built solves the hardest problem in cross-session agent systems: how do decisions persist without becoming noise?

Why this matters beyond gstack:

Event-sourced decisions: The decide/supersede/redact triad is exactly the right semantic primitive. Decisions aren't mutable state — they're a timeline with revisions and explicit invalidations. The "active" computation from the log is how all durable agent memory should work.
Bounded active snapshot: O(active) session-start reads, not O(all-time), is the difference between "usable at scale" and "unusable past week 2." The explicit --compact is smart — garbage collection as an intentional human action, not automatic (which can destroy forensic value).
Context Recovery in CLAUDE.md: Injecting the ## Cross-session decision memory section at session start is how you bridge ephemeral agent context and durable institutional state. This is the pattern every agent framework needs but most don't have.

Two design notes that stand out:

Datamarking resurfaced text: The security properties section (fences, --- banners, <|role|>, turn-prefixes, Unicode line terminators) shows you've internalized the prompt injection risk. Resurfacing past decisions is a trust boundary — if an adversary can inject a fake decision into the log, they control future behavior. The multi-layer datamarking + HIGH/MEDIUM secret blocking on write is the right paranoia level.
Gbrain as optional enhancement, not dependency: The --semantic flag degrades gracefully to file-only results when gbrain is off/empty. This is how you make advanced features (semantic search over decisions) additive without creating fragility. The lazy-load + fail-safe architecture means the core (JSONL + bins) is always reliable.

The adversarial review rigor is rare:

Cross-model (Claude + Codex) adversarial review finding real injection bypasses (chat turn-prefixes like Human: defeating both write denylist and datamark) is exactly how agent systems should be stress-tested.
Leaving two findings as-is (autopilot pid-reuse and pgrep-substring over-refuse) because fixing them would create fail-dangerous paths is safety engineering: the "fail safe, not fail silent" principle, documented.

Comparison to the ecosystem: Most agent systems treat memory as append-only logs with no curation or explicit invalidation. The result is context pollution — old decisions resurface and confuse new sessions. Your supersede/redact semantics + active-set computation solves this cleanly. The gstack pattern here is a blueprint for durable agent memory.

Similar event-sourced memory patterns in SwarmAI. Discussion: T-MEM

Resolves VERSION/CHANGELOG/package.json (branch keeps v1.57.5.0; my entry on top, main's v1.57.4.0 Boil-the-Ocean rename entry below). Regenerated SKILL.md across hosts (picks up the Boil-the-Lake to Boil-the-Ocean rename). Bumped plan-eng-review maxSizeRatio to 1.06 for the cumulative cross-cutting preamble growth. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

garrytan and others added 24 commits May 31, 2026 08:54

chore: ignore gbrain .sources/ local staging dir

da7c4dc

gbrain writes per-source staging and capability-check artifacts under .sources/ in the repo root. It's machine-local runtime state, not source. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into garrytan/upgrade-gbra…

a2cceba

…in-gstack # Conflicts: # bin/gstack-gbrain-sync.ts # lib/gbrain-sources.ts

docs(review): drop stray trailing code fence in TODOS-format

71a56ce

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into garrytan/upgrade-gbra…

373bbd9

…in-use # Conflicts: # plan-ceo-review/SKILL.md # plan-ceo-review/SKILL.md.tmpl # plan-eng-review/SKILL.md # plan-eng-review/SKILL.md.tmpl

Merge remote-tracking branch 'origin/main' into garrytan/upgrade-gbra…

52d08b1

…in-use

chore: bump version and changelog (v1.57.5.0)

f5708ae

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

garrytan and others added 2 commits June 7, 2026 22:45

garrytan merged commit 45cc95d into main Jun 8, 2026
22 of 24 checks passed

garrytan merged 27 commits into main from garrytan/upgrade-gbrain-use
