Skip to content

Add governed memory model, provider boundaries, diagnostics, and CLI validation#1145

Merged
bradygaster merged 10 commits into
bradygaster:devfrom
tamirdresher:squad/memory-governance-provider
May 21, 2026
Merged

Add governed memory model, provider boundaries, diagnostics, and CLI validation#1145
bradygaster merged 10 commits into
bradygaster:devfrom
tamirdresher:squad/memory-governance-provider

Conversation

@tamirdresher

@tamirdresher tamirdresher commented May 20, 2026

Copy link
Copy Markdown
Collaborator

Summary

This PR represents the full memory governance/provider work on squad/memory-governance-provider, not just diagnostics config.

It adds:

  • Governed memory classes: TRANSIENT, LOCAL, DECISION, POLICY, COPILOT_MEMORY, FORBIDDEN
  • Load guidance: [ALWAYS], [ON-DEMAND], [ARCHIVE], [NEVER]
  • Local memory store with classification, write/search/audit, promotion, deletion, tombstones, and safe audit records
  • CLI bridge: squad memory classify|write|search|audit|provider
  • Honest provider boundaries: local default, provider=copilot reported unavailable without a concrete callable API, host-injected adapter opt-in and fail-closed
  • Safe diagnostics to stderr with .squad/config.json support
  • Real Copilot CLI A/B harness with isolated COPILOT_HOME
  • Deterministic memory-value benchmark for context size, retrieval quality, decision consistency, and stale/unsafe memory exclusion
  • Tests, docs, proposals, template guidance, and changeset coverage

This work is based on Tamir Dresher's blog post, The Ship's Computer Has a Memory Problem — Designing Memory for AI Agent Squads. The core idea is that agent memory is runtime state with lifecycle, governance, and context-budget implications — not just more Markdown in the repo or more text in the prompt.

Short guide: how to use it

Initialize or upgrade a Squad repo so local governance files are scaffolded:

squad init
# or, for an existing repo:
squad upgrade

By default, governance is local-only:

  • memory config lives under .squad/memory/
  • audit records are written locally
  • local .squad/ memory remains the source of truth
  • Copilot Memory provider integration is not assumed or faked

Use the CLI bridge:

squad memory classify "Always run tests before merge"

squad memory write \
  --content "Use Vitest for SDK regression tests" \
  --class DECISION \
  --author scribe \
  --approved

squad memory search --query "Vitest"
squad memory audit
squad memory provider

Diagnostics and logs

For one command:

squad memory search --query "Vitest" --log-level info
squad memory provider --verbose

For project-level diagnostics, add this to .squad/config.json:

{
  "memory": {
    "logLevel": "info"
  }
}

Supported levels:

none | error | info | debug

Precedence:

  1. CLI --log-level / --verbose
  2. SQUAD_MEMORY_LOG_LEVEL
  3. .squad/config.json memory.logLevel
  4. default none

Diagnostics go to stderr; JSON output remains on stdout. Diagnostics include safe metadata such as command name, provider, paths, result counts, load guidance, and timing. They intentionally do not print raw memory content or raw search text.

How to measure memory value

Run the deterministic benchmark:

npm run build -w packages/squad-sdk
npm run experiment:memory-value
npm run experiment:memory-value -- --json

The benchmark compares naive baseline behavior (load all memory into context) against governed retrieval (ALWAYS + relevant ON-DEMAND, excluding ARCHIVE, NEVER, deleted, or superseded facts). It measures context payload, estimated token pressure, precision, recall, decision consistency, and stale/unsafe-memory avoidance.

Latest local result from this branch:

Verdict: PASS
Context bytes: baseline=3540, governed=1601
Estimated tokens: baseline=885, governed=401
Context reduction: 54.8%
Precision: baseline=0.16, governed=0.41
Recall: baseline=1.00, governed=1.00
Decision consistency: baseline=0.50, governed=1.00
Stale/unsafe facts avoided: 12/12

Validation evidence

Latest branch validation:

npm run build -w packages/squad-sdk
npm run lint
npm run test -- test/memory-value-benchmark.test.ts test/memory-governance.test.ts test/real-cli-ab.test.ts test/bench-runner.test.ts
npm run experiment:memory-value
npm run experiment:real-cli-ab -- --repo <repo> --variants memory-governance --dry-run

Results:

  • SDK build: passed
  • TypeScript lint: passed
  • Related memory/harness tests: 41 passed
  • Deterministic memory-value benchmark: PASS
  • Real CLI A/B dry-run plan/artifact path generation: passed
  • Previous full GitHub Actions Squad CI before this benchmark commit: passed (run 26140594530)
  • Current GitHub Actions Squad CI for head 56bd3136: passed (run 26141373561)

Real CLI A/B evidence

The branch includes a real Copilot CLI paired A/B harness and prior reruns across two local Squad repos:

  • ADC Squad runner demo:
    • baseline: 0 memory diagnostic events
    • memory-governance variant: 10 memory diagnostic events
  • Squad repo:
    • baseline: 0 memory diagnostic events
    • memory-governance variant: 9 memory diagnostic events

This proves the harness, isolated COPILOT_HOME runs, memory command invocation diagnostics, and baseline-vs-governance signal separation.

Current verdict

Stronger than before, but still honest.

This PR now proves:

  • governed local memory operations work
  • classification/isolation behavior is enforced
  • forbidden/transient content is rejected or kept out of durable memory
  • provider status is honest
  • host-injected provider mode fails closed without a real host client
  • diagnostics are configurable and safe
  • the real CLI A/B harness can isolate runs and capture evidence
  • deterministic governed retrieval can reduce context pressure while preserving recall and improving precision/decision consistency in seeded memory tasks

This PR still does not prove:

  • real Copilot Memory provider write/search/delete integration
  • statistically significant live-agent quality gains from external semantic memory
  • long-run production improvement across many repositories and real issues

The right claim is: this makes the blog-post promise measurable and demonstrates the local governed-memory mechanism can reduce context size and improve memory selection quality in deterministic tests. Full live-agent longitudinal proof still requires repeated real-world paired runs once a callable provider/host adapter exists.

Copilot and others added 4 commits May 19, 2026 18:49
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add safe log-level diagnostics for memory commands and document how to enable them.
Sync memory governance guidance into the Squad agent template mirrors.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a bounded real CLI experiment runner that isolates COPILOT_HOME per repository and variant, captures command/stdio/log artifacts, and records measured memory diagnostics evidence without faking in-turn command use.

Co-authored-by: GitHub Copilot <copilot@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tamirdresher tamirdresher changed the title Add memory governance diagnostics config Add governed memory model, provider boundaries, diagnostics, and CLI validation May 20, 2026
Copilot and others added 5 commits May 20, 2026 06:57
Update compatibility expectations for the added memory governance tools and wrap the memory help entry so the terminal-width UX gate passes.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the extra memory action line so the help output remains within the speed-gate line budget while preserving the shorter memory command description.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a deterministic benchmark that compares baseline full-memory context loading against governed retrieval for context size, precision, recall, decision consistency, and stale/unsafe memory exclusion.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…prototype providers

Implements issue #3287: prove external/local provider seams can plug into
the memory governance model without impersonating Copilot Memory.

Changes:
- Add MemoryProviderStatus and MemoryProvider interfaces
- Add MemPalaceMemoryProvider: in-memory spatial store, loci-based,
  accepts LOCAL / DECISION / POLICY
- Add IndexServerMemoryProvider: in-memory knowledge catalog, topic-based,
  accepts LOCAL / DECISION / POLICY
- LocalMemoryStoreOptions.registeredProviders: optional provider list
- LocalMemoryStore.write(): routes governed writes to matching providers
  AFTER local write; FORBIDDEN/TRANSIENT/COPILOT_MEMORY never reach providers
- LocalMemoryStore.search(): merges and deduplicates provider results
- LocalMemoryStore.providerStatus(): includes registered provider statuses
- Audit records for provider replication contain only id/class/title/path
  (no raw content or query text)
- All new types exported from sdk barrel via existing export *

Tests (test/memory-providers.test.ts, 19 tests):
- MemPalaceMemoryProvider: status, write/search LOCAL and DECISION, delete, empty search
- IndexServerMemoryProvider: status, write/search POLICY and DECISION, delete, empty search
- LocalMemoryStore routing: LOCAL/POLICY writes reach both providers, search
  merges provider hits, FORBIDDEN/TRANSIENT rejected before providers,
  COPILOT_MEMORY and provider=copilot still fail closed, providerStatus
  reflects registered providers, audit records are safe
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@tamirdresher

Copy link
Copy Markdown
Collaborator Author

Addressed the review feedback from Brady's screenshot in commit 0139af50:

  • Removed raw secret-shaped fixture content from the deterministic memory-value benchmark and added a regression assertion.
  • Added bounded audit log rotation via policy.auditMaxBytes / policy.auditMaxArchives.
  • Replaced silent registered-provider write/search failures with safe provider-error audit entries that omit raw error text and query/content.
  • Preserved provider provenance/class/load guidance/score in registered provider search results instead of normalizing them back to local results.
  • Bounded the in-memory MemPalace/IndexServer prototype providers and sanitized provider path segments.
  • Hardened local index path resolution and tightened promote successor-id handling.

Validation: focused memory tests, SDK build, lint, memory-value benchmark, and PR CI are passing.

@bradygaster

Copy link
Copy Markdown
Owner

Code Review — Memory Governance Provider

Overall: Well-designed governance model with solid test coverage. The architecture cleanly separates classification, storage, audit, and provider concerns. A few issues worth addressing before merge:


🔴 Critical

Race condition in concurrent index updates (memory/index.ts)

The memory index uses read-modify-write without locking or atomic writes. Concurrent agents/CLI commands can cause lost entries when writing simultaneously. The proposal doc acknowledges this as future work, but it's a real data-loss risk in multi-agent scenarios.

Fix: Write to temp file + atomic rename at minimum; consider file-based locking.


🟡 Medium

  1. Partial failure in delete() leaves inconsistent state — Tombstone is written after the index is updated and source file deleted. If tombstone write fails, state is irrecoverable. Reorder: write tombstone → update index → delete source.

  2. Provider search hardcodes class: 'LOCAL' — Provider results lose their actual classification. The CopilotMemoryProviderSearchResult interface lacks a class field, so all provider results appear as LOCAL regardless of their true class.

  3. Silent provider write failures — Empty catch block on provider replication means auth/network failures are invisible. Suggest writing a safe audit entry (no content) on failure.

  4. Corrupted index silently resets to empty — If index.json is malformed, readIndex() returns [] with no backup or notification. Could silently lose an entire memory store.


🟢 Low / Informational

  • Unbounded audit log growthaudit.jsonl is append-only with no rotation. Long-running projects will see unbounded growth.
  • Promote path has weak type safetyMemoryWriteResult.id is typed as optional, but promote dereferences it unconditionally. Runtime is safe (implementation always sets it), but type should enforce the invariant.

✅ Strengths

  • Clean classification taxonomy (TRANSIENT → FORBIDDEN) with enforcement
  • Honest provider boundaries — Copilot Memory reported unavailable without a real API
  • Fail-closed host-injected adapter pattern
  • Deterministic benchmark proving 54.8% context reduction with maintained recall
  • Good test coverage (753 lines governance tests, 412 lines provider tests)
  • Safe diagnostics (no raw content in stderr)

…bstone ordering

- readIndex() now throws with a .corrupt backup instead of silently resetting
  to [] when index.json contains invalid JSON or a non-array root.
- writeIndex() now writes to index.json.tmp then renames into place, preventing
  partial writes from corrupting the index on crash.
- Added backupCorruptIndex() private helper used by readIndex().
- Regression tests added (30 total, was 23):
  * concurrent writes: 2 writers and 10 writers both produce all entries
  * corrupt index (invalid JSON, non-array): throws with message + .corrupt file
  * delete tombstone ordering: tombstone present after delete, source gone
  * tombstone survives source-delete failure via faulting proxy
@tamirdresher

Copy link
Copy Markdown
Collaborator Author

Fix summary — commit 7458b81

Picking up from commit 0139af50 (which addressed provider search class, silent write failures, audit log rotation, and promote id invariant), this commit resolves the remaining three open items from the Brady PR review:


1. Race condition in concurrent index updates (FIXED)

Added an async serialization mutex to LocalMemoryStore:

  • New private field indexLockTail: Promise<unknown> (initialized to Promise.resolve())
  • New private method withIndexLock<T>(fn: () => Promise<T>): Promise<T> — each caller chains onto the previous tail and awaits it before running its critical section
  • All readIndex() → mutate → writeIndex() sequences in write(), promote(), and delete() are now wrapped in withIndexLock

Regression tests (LocalMemoryStore — concurrent writes and index integrity):

  • 2 concurrent write() calls → both IDs in persisted index
  • 10 concurrent write() calls → all 10 IDs in persisted index

2. Corrupted index silently resets to [] (FIXED)

readIndex() previously caught any JSON parse error and returned [], causing silent data loss.
Now:

  • On invalid JSON or non-array root: writes the raw bytes to index.json.corrupt as a recovery aid, then throws a descriptive Error so the caller surfaces the problem instead of silently overwriting the index
  • backupCorruptIndex() private helper wraps the backup write; failure to write the backup is suppressed so callers always see the original error

Regression tests (LocalMemoryStore — corrupted index.json handling):

  • Invalid JSON → throws /Memory index is corrupt/; .corrupt backup file written with original content
  • Non-array root → same throw

3. Delete tombstone ordering (FIXED)

Original code deleted the source file before writing the tombstone and updating the index. A crash between those steps left the entry unrecoverable.

Fixed order for local entries: tombstone write → index update → source file delete
For COPILOT_MEMORY entries: external provider delete is gated first (returns false on failure); then tombstone → index (source is external).

Regression tests (LocalMemoryStore — delete tombstone ordering):

  • After a successful delete(): tombstone exists, source gone, audit contains delete record
  • When source delete fails (faulting storage proxy): call rejects, tombstone survives

4. Atomic writeIndex() (FIXED)

writeIndex() previously wrote directly to index.json, leaving a window where a crash mid-write produces a partial/corrupt file.
Now: writes to index.json.tmpstorage.rename(tmp, indexPath) (wraps fs.rename, atomic on POSIX, near-atomic on Windows same-volume).


Test results

All 30 tests pass (23 pre-existing + 7 new regression tests). SDK build clean (tsc zero errors).

@tamirdresher tamirdresher marked this pull request as ready for review May 21, 2026 17:57
Copilot AI review requested due to automatic review settings May 21, 2026 17:57
@bradygaster bradygaster merged commit c6148e7 into bradygaster:dev May 21, 2026
6 checks passed
@tamirdresher tamirdresher review requested due to automatic review settings May 21, 2026 18:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants