Skip to content

feat(memory): add persistent agent memory system (Phase 1)#179

Draft
fanhongy wants to merge 15 commits into
mainfrom
feat/memory-implementation-phase-1
Draft

feat(memory): add persistent agent memory system (Phase 1)#179
fanhongy wants to merge 15 commits into
mainfrom
feat/memory-implementation-phase-1

Conversation

@fanhongy

Copy link
Copy Markdown
Contributor

Summary

Adds a persistent, cross-session memory system to CAO. Agents can store facts, decisions, and preferences during a session; CAO automatically injects relevant memories as context when the agent's next session starts.

  • MCP toolsmemory_store, memory_recall, memory_forget exposed via cao-mcp-server
  • 4 scopesglobal (cross-project), project (cwd-hashed), session, agent
  • 4 typesuser, feedback, project, reference (classification label, no effect on storage)
  • Context injection<cao-memory> block prepended to first user message on session start
  • Auto-save hooks — Claude Code: Stop + PreCompact; Kiro CLI: agentSpawn + userPromptSubmit
  • CLI commandscao memory list/show/delete/clear
  • File-based storage — wiki markdown files under ~/.aws/cli-agent-orchestrator/memory/; no SQLite
  • Concurrent-write safetyfcntl.flock() on index writes
  • Tiered retention — global/agent: never expire; project: 90d; session: 14d
  • Docsdocs/memory.md full reference

Changed files

Area Files
Models models/memory.py
Service services/memory_service.py
MCP tools mcp_server/server.py
API api/main.py (+/terminals/{id}/memory-context endpoint)
CLI cli/commands/memory.py, cli/main.py
Hooks hooks/ (4 scripts + registration.py)
Cleanup services/cleanup_service.py
Terminal services/terminal_service.py (injection + hook registration)
Agent profiles agent_store/{code_supervisor,developer,reviewer}.md
Config constants.py
Docs docs/memory.md, README.md, CHANGELOG.md
Tests test/services/test_memory_service.py, test/providers/test_memory_injection.py, test/cli/commands/test_memory.py

Testing

# Unit tests (1296 passing, 7 pre-existing failures unrelated to this PR)
uv run pytest test/ --ignore=test/e2e --ignore=test/providers/test_q_cli_integration.py -v

# Manual test with local build
uv tool install --force --reinstall .
cao-server &
cao launch --agents developer --provider claude_code

Notes

- Phase 1 is file-based only — no SQLite, no vector search. Semantic search and backend abstraction are Phase 2.
- Kiro CLI hooks require cao-server to be running at terminal creation time to register correctly.
- async def on store/recall/forget is intentional for caller compatibility; proper aiofiles migration is a future improvement.

  - File-based wiki storage under ~/.aws/cli-agent-orchestrator/memory/
  - 4 scopes: global, project (cwd-hashed), session, agent
  - MCP tools: memory_store, memory_recall, memory_forget
  - CLI commands: cao memory list/show/delete/clear
  - Auto-inject <cao-memory> context block on first message
  - Auto-save hooks: Claude Code (Stop + PreCompact), Kiro CLI (agentSpawn + userPromptSubmit)
  - Tiered retention with background cleanup
  - fcntl.flock() for concurrent-write safety
  - 1296 tests passing
Comment thread src/cli_agent_orchestrator/hooks/registration.py Fixed
Comment thread src/cli_agent_orchestrator/hooks/registration.py Fixed
Comment thread src/cli_agent_orchestrator/hooks/registration.py Fixed
Comment thread src/cli_agent_orchestrator/hooks/registration.py Fixed
Comment thread src/cli_agent_orchestrator/hooks/registration.py Fixed
Comment thread src/cli_agent_orchestrator/hooks/registration.py Fixed
Comment thread src/cli_agent_orchestrator/hooks/registration.py Fixed
Comment thread src/cli_agent_orchestrator/hooks/registration.py Fixed
Comment thread src/cli_agent_orchestrator/hooks/registration.py Fixed
Comment thread src/cli_agent_orchestrator/hooks/registration.py Fixed
Comment thread src/cli_agent_orchestrator/hooks/registration.py Fixed
Comment thread src/cli_agent_orchestrator/hooks/registration.py Fixed
Comment thread src/cli_agent_orchestrator/hooks/registration.py Fixed
Comment thread docs/memory.md
@@ -0,0 +1,197 @@
# Memory System

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make this an opt-in feature? such as add a enableMemory config key in ~/.aws/cli-agent-orchestrator/settings.json.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this would shiped in phase 2/3

@haofeif haofeif added the enhancement New feature or request label Apr 17, 2026
Comment thread docs/memory.md
### Claude Code

Registered in `.claude/settings.local.json` (in the working directory) at terminal creation:

@patricka3125 patricka3125 Apr 17, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some pushback concern regarding this approach. Doing this would result in any provider-specific config to contain hook configuration for CAO memory usage even when outside of CAO orchestration flows.

At least for claude code, I would recommend

  1. Creating a Claude plugin and load it with claude --plugin-dir This CLI flag will load all the plugin related hooks & references scoped to the session
  2. Perhaps building on top of the hook system introduced in feat: Build support for external plugins #172 to inject context hints that can substitute stop hooks. PreCompact does not seem feasible though, so option 1 may be preferred.

Comment thread docs/memory.md Outdated
```markdown
## Memory

When you discover something worth remembering — user preferences, project conventions,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't seem to align with the diffs made to the agent_store profiles

Comment thread docs/memory.md
| Scope | Storage location | Use when |
|---|---|---|
| `global` | `memory/global/wiki/global/` | Cross-project facts: user preferences, coding standards |
| `project` | `memory/{cwd_hash}/wiki/project/` | Project-specific: architecture decisions, conventions |

@patricka3125 patricka3125 Apr 17, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think cwd hash has one major problem, which is that any change made to the filesystem (i.e. folder renaming, mv project to a different directory) would immediately break project memory reference

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this also makes any type of workflow involving worktrees incompatible

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point.

This should be indexed in to SQL in phase 2

@patricka3125

Copy link
Copy Markdown
Collaborator

Feedback on memory usage guidance for LLMs — the current agent-profile block and MCP docstrings will likely produce store-happy but low-signal behavior. Some gaps worth closing:

  1. No guidance on scope. The tool accepts global|project|session|agent but neither the docstrings nor the profile explain when to use which. Given the project default, an LLM told to persist "user preferences" will almost always land in project when most belong in global. This is the biggest likely failure mode.
  2. No guidance on memory_type. Four types (user|feedback|project|reference) with no decision criteria; the default is project, so the type dimension will collapse regardless of content.
  3. No "when NOT to store" list. Without explicit exclusions (code patterns derivable from the repo, git history, ephemeral task state, fix recipes), agents will accumulate redundant/decaying entries like "fixed bug in X" or "ran tests and they passed."
  4. No key-naming convention. The auto-slug from the first 6 words will collide and hurt search. Agents need a nudge to pick stable, topical slugs.
  5. recall triggers are too narrow. "Check before asking the user" misses the more useful trigger: before starting a task in an area, recall what we know about it. LLMs tend to skip recall entirely when a task feels self-contained.
  6. No post-injection feedback loop. The <cao-memory> block shows what's loaded on first message, but after that the agent has no visibility into the store unless it calls recall, and nothing nudges it to.

Net effect: the "ALWAYS store" rule is forceful enough that agents will store things, but scope/type selection will be near-random and recall will be underused mid-task. Consider expanding the profile Memory section with (a) a scope/type decision rubric with examples, (b) an explicit exclusion list, and (c) explicit save/recall triggers.

Comment thread docs/memory.md
**Kiro CLI:** Memory injection happens via the `agentSpawn` hook registered at `~/.kiro/agents/{profile}.json`, which fires before the first user message. CAO does not double-inject for Kiro.

## Auto-Save Hooks

@patricka3125 patricka3125 Apr 17, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would gently recommend de-scoping or deferring auto-save hook to future phases since
A. there isn't a clear path to provide this for all supported providers today
B. stale config point in the comment i made below

Comment thread docs/memory.md
- **Stop hook**: fires every 15 human messages, prompts agent to save key findings
- **PreCompact hook**: fires before context compression, prompts emergency save

### Kiro CLI

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing a section for codex

Comment on lines +182 to +197
# Step 3d: Register provider-specific hooks for memory self-save
try:
from cli_agent_orchestrator.hooks.registration import (
register_hooks_claude_code,
register_hooks_codex,
register_hooks_kiro,
)

if provider == ProviderType.CLAUDE_CODE.value and working_directory:
register_hooks_claude_code(working_directory)
elif provider == ProviderType.CODEX.value and working_directory:
register_hooks_codex(working_directory)
elif provider == ProviderType.KIRO_CLI.value:
register_hooks_kiro(agent_profile)
except Exception as e:
logger.warning(f"Failed to register hooks for terminal {terminal_id}: {e}")

@patricka3125 patricka3125 Apr 17, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: move provider-specific hook registration into providers/ — the current hooks/registration.py is feature-envy on the provider abstraction.

Why

hooks/registration.py encodes intimate knowledge of three different config file formats (Claude Code's nested settings.local.json, Codex's flat .codex/hooks.json, Kiro's ~/.kiro/agents/{name}.json with event keys), and the service layer dispatches on provider type to pick the right function:

# services/terminal_service.py:182-197 (current)
from cli_agent_orchestrator.hooks.registration import (
    register_hooks_claude_code,
    register_hooks_codex,
    register_hooks_kiro,
)
if provider == ProviderType.CLAUDE_CODE.value and working_directory:
    register_hooks_claude_code(working_directory)
elif provider == ProviderType.CODEX.value and working_directory:
    register_hooks_codex(working_directory)
elif provider == ProviderType.KIRO_CLI.value:
    register_hooks_kiro(agent_profile)

That provider == ... ladder is exactly what the provider interface is supposed to eliminate. Every new provider requires touching hooks/registration.py and terminal_service.py.

Proposed shape

1. Add a method to BaseProvider (providers/base.py) with a default no-op:

# providers/base.py
class BaseProvider(ABC):
    ...
    def register_memory_hooks(
        self,
        working_directory: Optional[str],
        agent_profile: str,
    ) -> None:
        """Register CAO memory self-save hooks in this provider's config files.

        Default: no-op for providers that don't support hooks.
        Overridden by providers that do (Claude Code, Codex, Kiro).
        """
        return None

2. Each provider overrides register_memory_hooks with its own config-file logic — Claude Code writes its nested settings.local.json entries, Codex writes the flat .codex/hooks.json, Kiro writes agentSpawn/userPromptSubmit into ~/.kiro/agents/{profile}.json. Providers that don't support hooks (QCliProvider, etc.) inherit the no-op default — no code needed.

3. The call site collapses to one line:

# services/terminal_service.py (proposed)
try:
    provider_instance.register_memory_hooks(working_directory, agent_profile)
except Exception as e:
    logger.warning(f"Failed to register hooks for terminal {terminal_id}: {e}")

Tradeoff worth naming

Dispersing registration across providers loses the single-file audit surface for "what hooks does CAO install?". Mitigate by keeping path constants centralized in hooks/paths.py and adding a short table in docs/memory.md.

Net

Three concrete wins:

  1. Adding a new provider stops requiring edits to hooks/registration.py and the terminal_service dispatch.
  2. _SAFE_PROFILE_RE and the agents-dir normalization in register_hooks_kiro move next to the rest of Kiro's agent-config handling, where they're easier to keep consistent.
  3. Provider-level unit tests already exist per provider — hook registration fits naturally there rather than as a cross-cutting test file.

@haofeif

haofeif commented Apr 18, 2026

Copy link
Copy Markdown
Contributor

@fanhongy thanks for the great work, it delivered majority of the features as discussed. Here are some feedbacks:

PR #179 — Phase 1 Scoped Review

TL;DR

  • Status: Request changes. ~70% of Phase 1 spec shipped (13 ✅ / 5 ⚠️ / 2 ❌).
  • Two merge blockers:
    1. C1memory_forget registered as a Phase 1 MCP tool; design explicitly labels it Phase 2. (Given the complexity of the PR, i suggest move this to next PR)
    2. C3 — SQLite metadata layer spec'd in four separate places in the doc; not shipped.
  • Both blockers are doc-vs-code mismatches — resolution can be "fix the code" or "fix the doc," but today they disagree.

Scope: This review compares PR #179 against only the Phase 1 (MVP) section of memory-management.md (L530-550) and Phase 1 Success Criteria (L587-594). Out-of-scope items (Phase 2–4) are not graded.

Method: item-by-item walk of each Phase 1 deliverable against the shipped code.


Phase 1 deliverables

§Phase 1 — Memory creation

# Phase 1 deliverable Verdict Evidence
1 MemoryMetadataModel database table + migration in database.py Not shipped clients/database.py has no memory model; memory_service.py comment L1 states "Phase 1 — file-based, no SQLite"
2 Memory pydantic model in models/memory.py ✅ Shipped models/memory.py defines Memory, MemoryScope, MemoryType
3 MemoryService (store, recall, forget, resolve_scope_id, get_memory_context_for_terminal) ✅ Shipped services/memory_service.py has all 5 methods
3a Phase 1 doc note: "forget is an internal method for cleanup; MCP exposure via memory_forget is Phase 2" ⚠️ Violated forget() is ALSO registered as memory_forget MCP tool at mcp_server/server.py:765-794
4 memory_store MCP tool in server.py ✅ Shipped mcp_server/server.py:653-704
5 memory_recall MCP tool in server.py ✅ Shipped mcp_server/server.py:707-762
6 REST endpoints: POST /memory, GET /memory, DELETE /memory, GET /terminals/{id}/memory ⚠️ Partial Only GET /terminals/{id}/memory-context exists (api/main.py:437-458). No CRUD endpoints.
7 Memory cleanup in cleanup_service.py (tiered retention per Rec #9) ✅ Shipped cleanup_expired_memories() + RETENTION_POLICY matches §Rec 9 table exactly

§Phase 1 — Hook-triggered self-save

# Phase 1 deliverable Verdict Evidence
8 Register Stop hook for Claude Code (.claude/settings.local.json) ✅ Shipped hooks/registration.py:60-122
9 Register Stop hook for Codex (.codex/hooks.json) ✅ Shipped hooks/registration.py:192-220
10 Register PreCompact hook for Claude Code ✅ Shipped hooks/registration.py:117-119
11 Other providers: instruction-based save reminder ⚠️ Partial Agent-profile instruction shipped (developer/reviewer/code_supervisor). Kiro additionally gets a userPromptSubmit hook counter — exceeds the Phase 1 spec. Gemini/Kimi/Copilot: instruction-only, which matches spec.

§Phase 1 — Context delivery

# Phase 1 deliverable Verdict Evidence
12 Rule-based injection: SQLite metadata query → read matching wiki files → fill token budget ⚠️ Adapted Fills budget correctly, but query path is _parse_index() regex scan (no SQLite).
13 Scope precedence session > project > global on recall with no scope filter ✅ Shipped memory_service.py:402-408
14 Hard budget ~2KB per §Rec 6 (or ~2-4KB per §Phase 1 L543) total injection ⚠️ Off-spec get_memory_context_for_terminal(budget_chars=3000) — 3KB sits between the §Rec 6 ceiling (~2KB) and the §Phase 1 ceiling (2-4KB). More importantly, no per-scope cap ("10 most recent per scope" per §Rec 6 L756) — a single scope can monopolize the whole budget.
15 Memory injection at terminal creation for all 6 providers (per Provider-Specific table) ⚠️ Simplified Per terminal_service.py:52-74,324-325: one universal path — prepend <cao-memory> to first user message. Kiro alone uses agentSpawn hook. Spec requires per-provider mechanism (GEMINI.md append for Gemini, -c developer_instructions for Codex, etc. — see §6a table L765-772). Phase 1 explicitly allows "instruction file/system prompt injection for all providers" as a Phase 1 simplification (L774) — so this is within spec if you read L774 as authoritative, off-spec if you read L544 + §6a table as authoritative. The design doc is internally ambiguous.
16 Memory instruction added to built-in agent profiles ✅ Shipped developer.md, reviewer.md, code_supervisor.md each have a ## Memory section

§Phase 1 — Storage setup

# Phase 1 deliverable Verdict Evidence
17 ~/.aws/cli-agent-orchestrator/memory/{project_hash}/wiki/ structure ✅ Shipped memory_service.py:112-123
18 index.md master map — created on first memory_store, updated on each new topic file ✅ Shipped _update_index() with fcntl.flock
19 Each memory_store writes content to topic file under wiki/{scope}/ ✅ Shipped get_wiki_path() resolves to wiki/{scope}/{key}.md
20 SQLite metadata (key, scope, tags, timestamps, file_path) for lookup + scope resolution Not shipped Same gap as #1.

§Phase 1 Success Criteria — verdict per item

# Success Criterion Verdict Notes
S1 Agent memory_store → wiki file created + SQLite upsert + index.md updated ⚠️ Partial Wiki + index OK; SQLite step absent
S2 Different agent can memory_recall the same fact recall() works cross-terminal via shared wiki
S3 Injection at terminal creation for all 6 providers ⚠️ Partial Universal first-message prepend; per-provider table in §6a not followed
S4 Hook-triggered self-save fires reliably for Claude Code + Codex Stop hooks installed; Claude also has PreCompact
S5 Memories survive server restart (wiki files + SQLite metadata) ⚠️ Partial Wiki files persist; no SQLite. Since metadata is in index.md, data survives but via a different mechanism than spec'd
S6 Tiered retention cleanup removes expired session/project/reference while preserving user/feedback RETENTION_POLICY matches §Rec 9 table; cleanup exercises all tiers

Summary scorecard

Counting only the 20 main deliverables (treating #3a as a sub-note on #3):

Outcome Count Items
✅ Shipped as spec'd 13 #2, #3, #4, #5, #7, #8, #9, #10, #13, #16, #17, #18, #19
⚠️ Partial / adapted / off-spec 5 #6 (REST), #11 (hook extras), #12 (no SQLite-backed query), #14 (no per-scope cap), #15 (universal injection)
❌ Not shipped 2 #1 (SQLite table), #20 (SQLite metadata layer) — one concern surfacing in two rows

Sub-note #3a (memory_forget exposed via MCP) is a doc-vs-code mismatch on top of shipped functionality and is tracked separately under C1 below.


Phase 1 comments — by severity

Must fix for Phase 1 merge

C1 — memory_forget is explicitly Phase 2; shipped as Phase 1 MCP tool.
The design doc is unambiguous at L169 (### memory_forget (Phase 2)) and L535 ("forget is an internal method for cleanup; MCP exposure via memory_forget is Phase 2"). The PR registers it anyway at mcp_server/server.py:765-794.-- Sugest remove this to a separate PR

Current state — shipped without doc update — means the design doc lies to anyone reading it.

C2 — REST CRUD endpoints missing.
§Phase 1 L537 calls for POST /memory, GET /memory, DELETE /memory, GET /terminals/{id}/memory. Only the last exists — and even that is shipped as GET /terminals/{terminal_id}/memory-context (api/main.py:437-458), with a different URL suffix than the spec.

Fix options (pick one):

  1. Implement the three missing endpoints — they're thin wrappers over MemoryService calls.
  2. Update §Phase 1 L537 to say "MCP + CLI only; REST CRUD deferred to Phase 4 alongside Web UI" and explain the rationale (web UI is Phase 4, no non-UI HTTP consumer exists yet).

Either works — the PR and doc just need to agree.

C3 — SQLite metadata layer absent.

One-line summary: The design describes a hybrid wiki + SQLite architecture in four separate places; the code ships file-only.

Design citations (all in memory-management.md): L533 (Phase 1 deliverable: MemoryMetadataModel table + migration), L543 (context delivery spec'd as "SQLite metadata query"), L593 (success criterion: "wiki files + SQLite metadata persisted"), L670-671 (§Rec 1 hybrid call-out), L743-749 (§Rec 5 full "Hybrid Storage" argument). The PR description acknowledges the divergence.

Why this still needs action:

  • cleanup_service._find_expired_entries (L164-167) and memory_service._parse_index (L461-464) are two hand-rolled regex variants of the same format — drift risk §Rec 5 explicitly warns about.
  • recall() and get_memory_context_for_terminal do O(N) regex-parse + file-read per call. Fine at Phase 1 scale; the design chose SQLite specifically to not worry about this.

Fix options (pick one):

  1. Add a MemoryMetadataModel SQLAlchemy table that mirrors the index.md bullet line. Write to both on store; read from SQLite on recall and _find_expired_entries. Keep index.md as the human-readable index.
  2. Update memory-management.md to re-scope Phase 1 to file-only — specifically: L533 (drop the MemoryMetadataModel line), L543 (replace with "index.md metadata scan"), L593 (drop "+ SQLite metadata"), and add a note to §Rec 5 explaining why Phase 1 ships without SQLite despite the hybrid pitch.

The doc and the code should agree. Right now they don't.

Should fix before merge

C4 — Per-scope injection cap (§Rec 6) not enforced.
get_memory_context_for_terminal(budget_chars=3000) is a single global budget (L615-628). §Rec 6 L756 specifies "Hard limit of 10 most recent memories per scope, max ~2KB total injection" — note per scope. A scope with 100 memories will monopolize the budget.

Fix: In get_memory_context_for_terminal, take at most N=10 memories per scope before applying the byte budget. Small change — one [:10] slice inside the scope loop. Keeps injection balanced across session/project/global.

C5 — Universal first-message injection vs per-provider table in §6a.

This is primarily a design-doc ambiguity, not a code failure. L543-544 and §6a L765-772 prescribe per-provider injection; L774 then licenses "Phase 1 simplification: Use instruction file/system prompt injection for all providers". The PR follows L774. Action is on the doc owner to pick one reading, then update Phase 1 Success Criterion S3 (L591) to match.

Shipped behavior: terminal_service.inject_memory_context (L52-74) prepends <cao-memory> to the first user message for all providers except Kiro. Kiro alone uses the agentSpawn hook.

If the doc settles on per-provider: push the injection logic behind a BaseProvider.format_memory_injection() method — matches @patricka3125 comment #4 and sets up Phase 2 cache-aware separation cleanly.

Nice to fix — low risk items worth noting

C6 — _memory_injected_terminals is module-level in-memory state.
terminal_service.py:49 — lost on server restart. Next user message after a restart mid-session will re-inject the memory block (harmless but surprising). Could be backed by a DB column or a per-terminal flag file. Low priority for Phase 1 — worth a comment at the definition site noting the known limitation.

C7 — Index regex drift risk (tied to C3).
memory_service._parse_index regex (L461-464) and cleanup_service._find_expired_entries regex (L164-167) are two separate hand-rolled patterns for the same format. Right now they agree — but there's no invariant enforcing that. Add a single test:

def test_parse_index_reads_what_update_index_writes():
    svc = MemoryService(base_dir=tmp_path)
    await svc.store(content="x", key="test-key", memory_type="project", scope="project", terminal_context={...})
    entries = svc._parse_index(svc.get_index_path("project", scope_id))
    assert entries[0]["key"] == "test-key"
    # Also parse with cleanup regex and assert it matches the same line

Or, if C3 is resolved by adding SQLite, this goes away entirely.

C8 — Hook flag file cleanup is time-based, not PID-based.
_cleanup_stale_hook_flags removes flag files older than 5 min (cleanup_service.py:218-242). If a Claude Code session runs a long operation inside a Stop hook (rare but possible during large memory_store calls), the flag could be deleted while still in use. PID-based cleanup (kill -0 $pid check) is more precise. Low priority — time-based is fine for Phase 1 if documented.

C9 — register_hooks_kiro always rewrites the agent config.
Every cao launch triggers register_hooks_kiro which self-heals the .json by removing all existing CAO entries and re-adding them (hooks/registration.py:174-186). Two concerns:

  1. If a user hand-edits ~/.kiro/agents/{profile}.json to tweak a CAO hook, their edit is silently erased on next launch.
  2. Writing to a user-managed config outside the CAO-owned filesystem is what @patricka3125's comment Feature: async delegate #3 flagged. The spec doesn't explicitly prohibit it, but Kiro's AgentSpawn hook is the only way CAO can inject memory into Kiro, so this is load-bearing. Worth either documenting the behavior in docs/memory.md or adding a "managed-by: cao" marker comment in the hook entries.

C10 — cao memory forget CLI + memory_forget MCP both delegate to MemoryService.forget().
If C1 is resolved by removing the MCP tool, the CLI command (cli/commands/memory.py delete/clear) still exposes the operation to the user — which is the intent of §User-control question (L641): "Should users be able to view, edit, and delete memories via CLI before Phase 4 web UI?". So the CLI path is correct Phase 1 behavior; only the MCP surface needs adjustment.


Testing — Phase 1 gaps

Phase 1 Success Criteria S1-S6 should each have at least one automated test. Coverage review:

Criterion Test coverage
S1 (store → wiki + index) test_memory_service.py covers this
S2 (cross-agent recall) ✅ Covered — any test that stores then calls recall() in a fresh MemoryService()
S3 (injection all 6 providers) ✅ for shipped behavior — test_memory_injection.py covers the universal prepend path + Kiro skip. Per-provider variants aren't tested because the code doesn't do them; if C5 is resolved in favor of per-provider, add coverage.
S4 (hooks fire reliably) ✅ for unit-level — test_memory_injection.py covers inject_memory_context. Shell-script end-to-end tests for Claude Code / Codex Stop hooks are an acceptable Phase 1 gap.
S5 (survive server restart) ⚠️ No test that restarts the service and recalls. Add: create MemoryService(), store(), discard, create fresh MemoryService(), recall() — passes because wiki is durable. Easy to add.
S6 (tiered retention) ✅ Cleanup tests likely exist — verify with rg "cleanup_expired_memories" src/tests

Recommended test additions (low effort, high signal):

  1. Writer/reader regex round-trip test (guards against C7 drift).
  2. Concurrent _update_index test — two threads calling store() simultaneously should produce two index entries, not one corrupted one. fcntl.flock is the safety net being tested.
  3. Restart durability test (S5).

Net — Phase 1

13 / 20 deliverables ✅, 5 / 20 ⚠️, 2 / 20 ❌ (plus sub-note #3a — doc/code mismatch). Call it ~70% spec-fidelity for Phase 1 — same number as the broader review arrived at, but this time with each item pinned to a specific design-doc line.

Merge-blocking: two items.

  1. C1 (memory_forget in Phase 1 MCP) — pick one of the two fixes above. Currently the code ships a tool the doc says is Phase 2, with no rationale.
  2. C3 (SQLite metadata layer absent) — pick one of the two fixes above. Currently the design doc describes a hybrid architecture and the code ships a file-only one. Either implement the hybrid or re-scope the doc.

Strongly recommended before merge:
3. C2 (REST CRUD) — implement or re-scope.
4. C4 (per-scope injection cap) — one-line fix, trivial.
5. Reply to @patricka3125's 7 inline comments — at least items 3, 4, 6, 7 have concrete code actions; 1, 2, 5 are design discussions the PR author should weigh in on.

Post-merge acceptable (Phase 2 rollover):

  • C5 (per-provider injection), C6 (in-memory tracker), C7 (regex drift if SQLite adopted), C8 (PID-based flag cleanup), C9 (Kiro config ownership)

The Phase 1 implementation is solid work — security posture is good, retention is faithful, hook registration is idempotent and self-healing, and the wiki+index pattern matches §Rec 1 exactly. The gaps are specific and fixable. Fix C1 and C3, reconcile the doc, and Phase 1 is mergeable.

haofeif added a commit that referenced this pull request Apr 22, 2026
- Remove "Persistent agent memory system" bullet — PR #179 is still open
  and was not part of this release
- Add missing #106 (stale spinners blocking Claude Code inbox delivery)
- Add missing #135 to cryptography version-bump coverage (46.0.5 → 46.0.7)
- Correct #137 and #157 references to "resolves" — they are issues, not PRs
- Fix install hint in #169 bullet to use real install path
  (uv tool install git+https://... rather than PyPI, which CAO isn't on)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
haofeif added a commit that referenced this pull request Apr 22, 2026
…efs (#197)

- Remove "Persistent agent memory system" bullet — PR #179 is still open
  and was not part of this release
- Add missing #106 (stale spinners blocking Claude Code inbox delivery)
- Add missing #135 to cryptography version-bump coverage (46.0.5 → 46.0.7)
- Correct #137 and #157 references to "resolves" — they are issues, not PRs
- Fix install hint in #169 bullet to use real install path
  (uv tool install git+https://... rather than PyPI, which CAO isn't on)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
fanhongy and others added 7 commits April 25, 2026 16:19
Pulls in PR #172 plugin system and PR #193 opencode_cli so later
phases can build plugins instead of provider hooks.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Injects the CAO memory context into <cwd>/.claude/CLAUDE.md on
post_create_terminal. Path-validates the target, replaces any prior
block, logs-not-raises on every error path. 7 tests.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Writes <cwd>/.kiro/steering/cao-memory.md on post_create_terminal for
kiro_cli. Coexists with U7's agent-identity.md (separate owners).
Rename CC plugin test file to avoid rootdir collision. 8 tests for
Kiro plugin.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Accepted regression matching CC + Kiro: periodic save-memory reminder
dropped. Q/Gemini/Kimi/Copilot/OpenCode never had hooks, so Codex
joins them on BaseProvider's no-op default.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Move CC/Kiro memory plugins to src/.../plugins/builtin/ with
cao.plugins entry points. Delete hooks/ dir and register_hooks
overrides; BaseProvider no-op stays.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Relabel memory_forget as Phase 1 MCP tool in MEMORY_SYSTEM_DESIGN.md
(closes haofeif C1). Add Codex + Other providers + Plugin architecture
subsections to docs/memory.md (closes patricka3125 Codex gap). Split
agent-profile guidance into Minimal vs Aggressive (closes patricka3125
L190). Ignore local MEMORY_TESTING.md runbook.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@fanhongy

fanhongy commented May 7, 2026

Copy link
Copy Markdown
Contributor Author

The work evolved through Phase 2 (cao-memory-feature), Phase 2.5 (feat/memory-implementation-phase-1), and a plugin-migration pass. Summary of what landed.

Storage + correctness

  • SQLite metadata layer — MemoryMetadataModel in clients/database.py:72 with WAL mode and UniqueConstraint(key, scope, scope_id). Wiki files remain the source of truth; SQLite is the fast-lookup + concurrency mechanism.
  • Per-scope injection cap — MEMORY_MAX_PER_SCOPE=10 and MEMORY_SCOPE_BUDGET_CHARS=1000 enforced in get_memory_context_for_terminal(). No single scope can monopolize the envelope.
  • Index round-trip invariant — writer/reader regex drift is locked by a test (test_memory_service_index_roundtrip.py); ISO-8601 Z-suffix strptime format is part of the contract.
  • Durability + concurrent writes — MemoryService reinstantiation recovers all prior memories via SQLite; concurrent writers to the same scope are serialized by per-connection SQLite locking + atomic os.replace(). fcntl.flock remains as defense-in-depth in the fallback path.

Architecture

  • Opt-in flag — memory.enabled settings key. When false, writes raise, get_memory_context_for_terminal() returns "", and no filesystem or SQLite side-effects occur. Default behavior unchanged.
  • Stable project identity — three-tier resolver (explicit config → git remote → cwd hash) plus an alias table. Memories survive directory rename and resolve identically across git worktrees. Dry-run migration planner (plan_project_dir_migration) classifies renames as none | rename | merge | conflict; live --apply is deferred to Phase 3.
  • Hook registration on BaseProvider — terminal_service.py no longer contains any if provider == ProviderType.* hook conditionals. BaseProvider.register_hooks() is a no-op default; providers that need hooks override it.
  • Plugin migration — memory hooks for Claude Code and Kiro CLI now ship as entry-point plugins under src/cli_agent_orchestrator/plugins/builtin/. The hooks/ directory has been retired entirely. Non-CAO Claude Code sessions are now hook-free.

Scope cleanup

  • REST CRUD endpoints — formally deferred to Phase 3. Surface is now 1 shipped (GET /terminals/{id}/memory-context) + 3 deferred write endpoints. Rationale and Phase 3 prerequisites in aidlc-docs/phase2.5/phase3-backlog.md.
  • Spec sync — design docs reconciled against shipped reality. Stale Phase-2 labels on memory_forget corrected to Phase 1 (already shipped as an MCP tool in this PR).

Documentation

  • docs/memory.md gained dedicated subsections for Codex, Gemini/Kimi/Copilot (all instruction-only; no native hooks), and the plugin architecture.
  • Agent-profile guidance is now split into Minimal (default template; matches built-in profiles) and Aggressive (test-harness template with concrete store/recall triggers plus session-start / session-end rituals). The Aggressive template is used by the -local variants under ~/.aws/cli-agent-orchestrator/agent-store/; built-ins will be upgraded once the memory feature exits dogfood.

Quality gates

  • All 9 Phase 2.5 units passed a two-gate review (challenger + security-reviewer) using a source-mutation teeth-test methodology. Audit trail: aidlc-docs/phase2.5/audit.md.
  • 1,504+ tests passing, zero critical findings across both gates.
  • 14 CodeQL path-expression alerts previously flagged on hooks/registration.py — the file is now deleted, so the alerts should auto-resolve on this push.

Known deferrals (Phase 3)

Non-blocking items rolled forward: 9 INFO-severity findings from security review + the 3 REST write endpoints. Full list in aidlc-docs/phase2.5/phase3-backlog.md.

Happy to walk any specific change in more detail.

fanhongy and others added 3 commits May 13, 2026 20:56
AIDLC working docs are internal team artefacts, not project deliverables.
Remove from index and add to .gitignore to keep them out of PR diffs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Non-git projects fall back to sha256(realpath(cwd)) for identity, which
breaks on folder rename or mv. Add an opt-in .cao/project_id marker
(JSON: project_id + nonce + created_at) so identity travels with the
folder. Nonce + recorded realpath in SQLite enable rename vs copy
detection: if recorded path is gone treat as rename, else mint a fresh
id. Marker logic only runs in the cwd-hash fallback branch — override
and git-remote paths untouched. Opt out with memory.project_marker=false
or CAO_PROJECT_MARKER=0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- test_send_input_success and the parametrized post_send_message test
  mocked terminal metadata without a "provider" key, but the first-
  message path in send_input reads metadata["provider"] for
  log_session_event. Add provider to both mock dicts.
- BM25 perf budget of 500ms was too tight; bump to 1s. The 100-file
  corpus still catches pathological regressions with 2x headroom.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@haofeif haofeif requested a review from Copilot May 19, 2026 00:15

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

@fanhongy fanhongy marked this pull request as draft May 19, 2026 04:33
@fanhongy

Copy link
Copy Markdown
Contributor Author

Converting this PR to Draft to split it into multiple for review purpose

fanhongy added a commit that referenced this pull request May 19, 2026
Phase 1 ships memory primitives (store/recall/forget) only. Hook-driven
auto-save is delivered via per-provider plugins in a subsequent PR
(per @patricka3125's recommendation in #179 to use Claude Code's plugin
system rather than in-line settings.local.json edits).

Removes:
- src/cli_agent_orchestrator/hooks/registration.py
- src/cli_agent_orchestrator/hooks/*.sh
- terminal_service hook registration call site
- api/main.py install_hooks() call
- cleanup_service stale-hook-flag sweep
- docs/memory.md auto-save section

Side effect: 14 CodeQL findings on hooks/registration.py vaporise with
the file. Resolves the path-injection warnings flagged on PR #179.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
haofeif added a commit that referenced this pull request May 22, 2026
…245)

* feat(memory): add persistent agent memory system (Phase 1)

  - File-based wiki storage under ~/.aws/cli-agent-orchestrator/memory/
  - 4 scopes: global, project (cwd-hashed), session, agent
  - MCP tools: memory_store, memory_recall, memory_forget
  - CLI commands: cao memory list/show/delete/clear
  - Auto-inject <cao-memory> context block on first message
  - Auto-save hooks: Claude Code (Stop + PreCompact), Kiro CLI (agentSpawn + userPromptSubmit)
  - Tiered retention with background cleanup
  - fcntl.flock() for concurrent-write safety
  - 1296 tests passing

* style: apply black formatter

CI black --check flagged 5 files. Run black to align with the
formatter version pinned in CI.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refactor(memory): remove hooks/registration.py from Phase 1

Phase 1 ships memory primitives (store/recall/forget) only. Hook-driven
auto-save is delivered via per-provider plugins in a subsequent PR
(per @patricka3125's recommendation in #179 to use Claude Code's plugin
system rather than in-line settings.local.json edits).

Removes:
- src/cli_agent_orchestrator/hooks/registration.py
- src/cli_agent_orchestrator/hooks/*.sh
- terminal_service hook registration call site
- api/main.py install_hooks() call
- cleanup_service stale-hook-flag sweep
- docs/memory.md auto-save section

Side effect: 14 CodeQL findings on hooks/registration.py vaporise with
the file. Resolves the path-injection warnings flagged on PR #179.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(memory): address P0/P1/P2/P3 findings from review

P0 — silent breakage:
- _get_terminal_context() now resolves cwd via the existing tmux
  working-directory helper, so project-scope storage works in
  production rather than only in tests that monkeypatch the helper.
- send_input() injects memory context for every provider, including
  Kiro (the default). The original Phase 1 commit skipped Kiro
  expecting the AgentSpawn hook to handle it; that hook lands in a
  later plugin-migration PR, so inline injection is the correct
  Phase 1 path for now.

P1 — correctness gaps:
- Session and agent scopes nest scope_id into wiki paths, so two
  sessions or two agent profiles can no longer collide on the same
  key. Index entries follow the same nested shape.
- Retention is keyed on scope (global/agent never expire; project 90d;
  session 14d) rather than memory_type, with user/feedback as a
  scope-independent override. Matches docs/memory.md.
- store(scope="project") raises when no working directory is
  resolvable, instead of silently mixing project memories into the
  global container.

P2 — cheap correctness:
- tags: regex no longer truncates hyphens, so tags like ci-cd parse
  intact.
- forget() passes the current timestamp into the index header so
  removals don't blank the <!-- Updated: --> line.
- updated_at serialises via strftime("%Y-%m-%dT%H:%M:%SZ"), avoiding
  the invalid "+00:00Z" form for tz-aware datetimes.
- store() returns the action ("created" or "updated") on the Memory
  model so the MCP layer no longer infers it from second-granularity
  timestamps.
- CLI key validation enforces the same 60-char ceiling as
  MemoryService._sanitize_key.
- cleanup_expired_memories() offloads sync glob/parse work to
  asyncio.to_thread so it doesn't block the FastAPI event loop on
  startup.

P3 — process:
- Added a production-path test for _get_terminal_context() that
  doesn't monkeypatch the function itself, so a P0-A regression
  surfaces immediately.
- docs/memory.md now matches the scope-keyed retention semantics.

67 tests pass; black and isort clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(memory): address Copilot follow-up comments on 5af8217

Six findings from the second Copilot pass: rewrite wiki header on
upsert so recall() sees fresh type/tags; normalize tag whitespace
so the index regex still matches; offload forget() in cleanup via
asyncio.to_thread; add success:True to memory_recall envelope;
fix misleading container comment and storage-layout doc diagram.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Potential fix for pull request finding

fix as suggested

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* fix(memory): address Copilot review findings

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* fix(memory): isolate recall by session/agent scope_id

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: haofeif <56006724+haofeif@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants