Overview
Letta AI's claude-subconscious (MIT, TypeScript, v1.5.1) introduces a fundamentally different approach to persistent memory: instead of the main agent managing its own memory (our current approach), a separate "subconscious" agent observes session transcripts asynchronously, extracts patterns, and injects guidance — without the conscious agent needing to think about memory at all.
This is inspired by MemGPT's stateful agent architecture and Letta's sleep-time compute concept, where agents process information during downtime to form new connections. The key architectural insight is separation of concerns: the conscious agent focuses on the task; the subconscious agent handles learning, pattern detection, and proactive guidance.
Hermes already has the foundational infrastructure to implement this natively — auxiliary_client for cheap background LLM calls, flush_memories for pre-compression processing, session_search for cross-session recall, and the memory tool for persistent storage. What's missing is the observer architecture that ties these together into an autonomous memory processing pipeline.
Research Findings
How claude-subconscious Works
The plugin registers 4 hook points in Claude Code's lifecycle:
SessionStart — Notifies the Letta agent, syncs initial memory blocks
UserPromptSubmit — Injects memory blocks + agent messages via stdout; sends user's prompt to Letta as "early notification" so it can start processing while Claude works
PreToolUse (checkpoint) — At natural pause points (when Claude is about to ask the user a question), sends the full transcript to Letta and blocks for 2-5 seconds waiting for guidance. The guidance is injected as additionalContext before the tool executes, giving the subconscious advisory power at decision points
Stop — Fire-and-forget background worker sends the full transcript to Letta asynchronously (detached Node process that survives the hook's exit)
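The blocking checkpoint could be approximated in Hermes as a bounded wait on a background request. This is a minimal sketch, not the plugin's code (which is TypeScript): `fetch_guidance` is a hypothetical stand-in for the call to the observer, and the 5-second default is the upper end of the 2-5 second window described above.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, Optional


def checkpoint(
    fetch_guidance: Callable[[str], str],  # hypothetical: sends transcript, returns advice
    transcript: str,
    timeout_s: float = 5.0,
) -> Optional[str]:
    """Wait up to timeout_s for guidance; return None if the observer is slow or fails."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fetch_guidance, transcript)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return None  # proceed without advice rather than stall the main agent
    except Exception:
        return None  # an observer error must not break the pending tool call
    finally:
        # Do not block on a slow worker; it finishes in the background.
        pool.shutdown(wait=False)
```

Returning `None` on timeout keeps the checkpoint advisory: the main agent continues with whatever context it already has.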
The Letta agent has 8 structured memory blocks (each 20K char limit):
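core_directives: Role and behavior rules
guidance: Active advice for the current session (the primary output channel)
user_preferences: Learned coding styles
project_context: Architecture decisions, key files
session_patterns: Recurring struggles, time-based patterns
pending_items: Unfinished work, TODOs
self_improvement: Meta-learning guidelines
tool_guidelines: How to use its own tools effectively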
Transcript processing keeps the payload small: thinking is truncated to 500 chars, tool results to 1500 chars, and tool inputs are reduced to their key argument (file operations → file path, bash → command, search → query). The formatted transcript is sent as XML:
<claude_code_session_update>
<session_id>...</session_id>
<transcript>
<message role="user">...</message>
<message role="claude_code">...</message>
</transcript>
<instructions>You may provide commentary or guidance...</instructions>
</claude_code_session_update>
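The truncation rules and XML envelope above can be sketched in Python as follows. The event dict schema (`role`, `kind`, `text`, `tool`, `args`) and the argument names `file_path`, `command`, and `query` are assumptions made for illustration; only the 500/1500 character limits and the tag names come from the description above.

```python
from xml.sax.saxutils import escape, quoteattr

THINKING_LIMIT = 500      # chars, per the limits described above
TOOL_RESULT_LIMIT = 1500  # chars


def truncate(text: str, limit: int) -> str:
    """Keep the first `limit` characters and mark the cut with '...'."""
    return text if len(text) <= limit else text[:limit] + "..."


def summarize_tool_input(tool: str, args: dict) -> str:
    """Reduce a tool call to its most informative argument."""
    for key in ("file_path", "command", "query"):  # assumed argument names
        if key in args:
            return f"{tool}: {args[key]}"
    return tool


def format_event(event: dict) -> str:
    """Render one transcript event (hypothetical schema) as compact text."""
    kind = event["kind"]
    if kind == "thinking":
        return truncate(event["text"], THINKING_LIMIT)
    if kind == "tool_result":
        return truncate(event["text"], TOOL_RESULT_LIMIT)
    if kind == "tool_use":
        return summarize_tool_input(event["tool"], event["args"])
    return event["text"]


def build_update(session_id: str, events: list[dict], instructions: str = "") -> str:
    """Wrap the formatted events in the session-update envelope."""
    lines = [
        "<claude_code_session_update>",
        f"<session_id>{escape(session_id)}</session_id>",
        "<transcript>",
    ]
    for event in events:
        body = escape(format_event(event))
        lines.append(f"<message role={quoteattr(event['role'])}>{body}</message>")
    lines.append("</transcript>")
    lines.append(f"<instructions>{escape(instructions)}</instructions>")
    lines.append("</claude_code_session_update>")
    return "\n".join(lines)
```

Escaping the message bodies matters here because transcripts routinely contain `<` and `&` from code and shell output.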
Key Design Decisions
Diff-based memory injection: First prompt gets full memory blocks. Subsequent prompts only show line-level diffs of changed blocks, minimizing context waste.
Conversation multiplexing: One Letta agent serves ALL projects. Each Claude Code session gets its own Letta "conversation" (thread), but memory blocks are shared globally. Learning from project-A automatically benefits project-B.
Three operating modes: whisper (messages only), full (blocks + messages), off. Default is whisper — minimal intrusion.
The subconscious has personality: System prompt establishes it as a "persistent presence that builds rapport" — not a logging service. It can "share partial thoughts, have opinions, express curiosity."
Checkpoint intervention: The blocking checkpoint at AskUserQuestion hooks is the most interesting pattern — the subconscious can modify Claude's behavior at decision points by injecting advisory context BEFORE Claude asks its question.
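Diff-based injection is straightforward to prototype with the standard library. The sketch below is one possible shape for it, not the plugin's implementation; the `MemoryInjector` class and the `[name]` header format are hypothetical.

```python
import difflib
from typing import Optional


def render_block(name: str, previous: Optional[str], current: str) -> str:
    """Full block on first sight, a line-level diff on change, '' if unchanged."""
    if previous is None:
        return f"[{name}]\n{current}"
    if previous == current:
        return ""
    diff = difflib.unified_diff(
        previous.splitlines(),
        current.splitlines(),
        fromfile=f"{name} (before)",
        tofile=f"{name} (after)",
        lineterm="",
        n=0,  # no context lines: only what changed
    )
    return "\n".join(diff)


class MemoryInjector:
    """Remembers what was last injected per block so later prompts carry only diffs."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}

    def inject(self, blocks: dict[str, str]) -> str:
        parts = []
        for name, text in blocks.items():
            rendered = render_block(name, self._seen.get(name), text)
            if rendered:
                parts.append(rendered)
            self._seen[name] = text
        return "\n\n".join(parts)
```

An unchanged block contributes nothing to the prompt, so a quiet observer costs no context at all after the first turn.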
Current State in Hermes Agent
What we have:
MEMORY.md + USER.md: Flat-file, free-form entries separated by §, character-limited (~3.5K chars total). The main agent manually decides what to save via the memory tool. (tools/memory_tool.py)
flush_memories: Pre-compression mechanism that injects a system message asking the agent to save anything worth remembering, then makes ONE LLM call with only the memory tool available. (run_agent.py)
Memory nudges: Every N turns, appends a system reminder to consider saving memories.
Frozen snapshot pattern: Memory loaded once at session start, writes go to disk but don't update the system prompt mid-session (preserves KV-cache prefix caching).
Session search: FTS5-indexed SQLite database of all past conversations, searchable via session_search tool with LLM-summarized results.
Skills: Procedural memory as SKILL.md files with templates and scripts.
auxiliary_client: Cheaper model (e.g., Gemini Flash) already available for background LLM tasks.
Honcho integration: Optional external cross-session user modeling.
What's missing (the gap):
No automatic memory processing — the agent must consciously decide to use the memory tool
No pattern detection across sessions — the agent must manually search past sessions
No structured memory blocks — everything is free-form text
No proactive guidance — no mechanism to prepare insights for the next session
No background processing — flush_memories runs synchronously in the main loop
No diff-based injection — full memory snapshot injected every time
The current flush_memories is a single rushed API call at compression time: a "last chance save", not a thoughtful observer
How this issue differs: All existing issues improve memory primitives but assume the same agent manages its own memory within the conversation loop. This issue proposes a separate observer process that runs asynchronously, processing transcripts with a dedicated LLM call and updating memory independently. It's an architectural pattern that layers on top of whatever memory storage and operations we build.
Implementation Plan
Skill vs. Tool Classification
This should be a core codebase change, not a skill or tool. Reasons:
It requires deep integration with the agent lifecycle (session start, compression, session end)
It needs access to the full conversation transcript and session database
It runs automatically in the background without explicit invocation
It modifies the system prompt injection (structured blocks, diffs)
It manages the auxiliary_client for background LLM calls
The observer would be a new module (e.g., agent/subconscious.py) integrated into run_agent.py's lifecycle hooks.
What We'd Need
New module: agent/subconscious.py — Observer agent logic
Modified prompt injection: Block-based memory with optional diff mode
New config section: subconscious with enable/disable, block definitions, processing triggers
Integration points in run_agent.py: post-session, pre-compression, session-start
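As a starting point for the config section, the `subconscious.*` settings this proposal names for its three phases could be modeled as a dataclass. This is a sketch only: the class name and the `from_mapping` helper are hypothetical, and how Hermes would actually load the section is not shown.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Mapping

DEFAULT_BLOCKS = [
    "user_preferences", "project_context", "pending_items",
    "session_patterns", "guidance",
]


@dataclass
class SubconsciousConfig:
    enabled: bool = True
    model: str = "auto"                # "auto" means: use the auxiliary_client model
    blocks: list[str] = field(default_factory=lambda: list(DEFAULT_BLOCKS))
    session_start_review: bool = True  # Phase 2
    review_depth: int = 3              # Phase 2: number of recent sessions to review
    realtime: bool = False             # Phase 3, opt-in
    sleep_compute: bool = False        # Phase 3, opt-in

    @classmethod
    def from_mapping(cls, raw: Mapping[str, Any]) -> "SubconsciousConfig":
        """Build a config from a parsed `subconscious` section, rejecting unknown keys."""
        known = {f.name for f in fields(cls)}
        unknown = set(raw) - known
        if unknown:
            raise ValueError(f"unknown subconscious keys: {sorted(unknown)}")
        return cls(**raw)
```

Rejecting unknown keys up front means a typo such as `sleep_time_compute` fails loudly instead of silently leaving a feature off.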
Phased Rollout
Phase 1: Post-Session Observer (MVP)
After session end or context compression, spawn an auxiliary LLM call (Gemini Flash / cheap model)
Feed it the full conversation transcript (truncated/summarized like claude-subconscious does — thinking to 500 chars, tool results to 1500 chars, tool inputs summarized)
Prompt it to extract and categorize information into structured blocks:
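user_preferences: coding style, tool preferences, communication patterns
project_context: architecture decisions, key files, tech stack
pending_items: unfinished work, TODOs, follow-ups
session_patterns: recurring topics, common errors, time patterns
guidance: proactive suggestions for the next session
[...] (~/.hermes/memories/)
[...] flush_memories with this richer processing pipeline
Config: subconscious.enabled: true, subconscious.model: auto (uses auxiliary_client), subconscious.blocks: [user_preferences, project_context, pending_items, session_patterns, guidance]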
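For the Phase 1 extraction step, one defensive way to apply the observer's reply to stored blocks is sketched below. It assumes the observer is prompted to answer with a JSON object mapping block names to new text; that reply format, the function name, and the 2000-character limit are illustrative assumptions, not part of any existing API.

```python
import json

BLOCK_LIMIT = 2000  # chars per block; an assumed value, well under Letta's 20K


def merge_observer_output(
    blocks: dict[str, str],
    raw: str,
    allowed: set[str],
    limit: int = BLOCK_LIMIT,
) -> dict[str, str]:
    """Apply the observer's JSON reply ({block_name: new_text}) to a copy of `blocks`.

    Invalid JSON, unknown block names, and non-string values are ignored, so a
    malformed reply cannot corrupt stored memory.
    """
    updated = dict(blocks)
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError:
        return updated
    if not isinstance(reply, dict):
        return updated
    for name, text in reply.items():
        if name in allowed and isinstance(text, str):
            updated[name] = text.strip()[:limit]  # hard cut; see note below
    return updated
```

The hard cut at `limit` is a placeholder. A real pipeline would more likely ask the observer to consolidate an oversized block than drop its tail.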
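Phase 2: Cross-Session Pattern Detection + Guidance Injection
[...] guidance block
[...] pending_items block highlights unfinished work from the previous session
Config: subconscious.session_start_review: true, subconscious.review_depth: 3 (sessions)
Phase 3: Real-time Advisory + Sleep-time Compute
During long sessions, periodically process the current transcript (every N turns or at compression)
Inject mid-session guidance: "Consider saving this helper function as a skill", "The user seems frustrated; the error is probably in X"
[...] clarify tool calls (like claude-subconscious's AskUserQuestion hook)
Sleep-time compute: Schedule a cron job that processes recent sessions during downtime, consolidates memory blocks, resolves contradictions, and prepares briefings
Config: subconscious.realtime: false (opt-in), subconscious.sleep_compute: false (opt-in)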
Pros & Cons
Pros
Zero cognitive overhead: The main agent focuses on tasks instead of memory management. No more "did I remember to save that?", because the observer handles it automatically
Richer memory: A dedicated observer with the full transcript extracts more information than the current rushed flush_memories single-call approach
Cross-session continuity: Pattern detection and guidance blocks help the agent pick up where it left off, which is especially valuable for gateway/messaging sessions that reset frequently
Cheap to run: Observer uses auxiliary_client (Gemini Flash or similar) — pennies per session. The transcript summarization keeps input small
Builds on existing infrastructure: No new external dependencies. Uses auxiliary_client, session DB, memory tool, and config system already in place
Proven pattern: Letta's implementation validates the architecture. Their blog reports meaningful improvements in cross-session task continuation
Cache-friendly: Structured blocks injected at the start of the system prompt (stable position) are more cache-friendly than free-form entries that change unpredictably
Cons / Risks
Extra API cost: Each session end triggers a background LLM call. For heavy users (50+ sessions/day on gateway), this adds up. Mitigation: configurable, use cheapest available model, skip short sessions
Latency at session start: Phase 2's session-start review adds an LLM call before the first response. Mitigation: pre-compute during sleep-time, cache results, make it async
Memory staleness: If the observer runs only at session end, its knowledge is always one session behind. Phase 3 addresses this with real-time processing but adds complexity
Contradictory guidance: The observer might provide guidance that contradicts the user's current intent (e.g., "you usually use npm" when the user has switched to pnpm). Mitigation: recency-weighted extraction, user can override via memory tool
Complexity: Another moving part in the agent loop. Must be well-tested and fail gracefully (observer failure should never block the main agent)
Block size management: Structured blocks need size limits and consolidation logic to prevent unbounded growth (claude-subconscious uses 20K per block — we'd want something smaller for context efficiency)
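Two of the mitigations above, skipping short sessions and never letting an observer failure reach the main agent, fit in a small wrapper. This is a sketch under stated assumptions: `observe` is a hypothetical callable that processes one transcript, and the 4-turn threshold is an arbitrary placeholder.

```python
import logging
import threading
from typing import Callable, Optional

log = logging.getLogger("subconscious")

MIN_TURNS = 4  # assumed threshold for "skip short sessions"


def run_observer_async(
    observe: Callable[[list[str]], None],  # hypothetical: processes one transcript
    transcript: list[str],
    min_turns: int = MIN_TURNS,
) -> Optional[threading.Thread]:
    """Run `observe` on a background thread; never raise into the caller."""
    if len(transcript) < min_turns:
        return None  # too short to be worth an API call

    def _guarded() -> None:
        try:
            observe(transcript)
        except Exception:
            # Log and drop: an observer failure must not affect the main agent.
            log.exception("observer failed")

    thread = threading.Thread(target=_guarded, name="subconscious-observer", daemon=True)
    thread.start()
    return thread
```

A daemon thread dies with the process, so this covers mid-session and compression-time runs only. Processing at session end would need something that outlives the agent, as claude-subconscious does with its detached Node worker.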
Open Questions
Block granularity: Should we start with the full 8-block structure from claude-subconscious, or begin with fewer blocks (e.g., just guidance, user_preferences, pending_items) and expand based on real usage?
Interaction with existing memory: Should the observer replace MEMORY.md/USER.md entirely, or coexist alongside them? The manual memory tool is useful for explicit "remember this" — perhaps keep USER.md manual and make other blocks observer-managed?
Gateway vs CLI: Gateway sessions are typically shorter and more frequent. Should the observer behave differently (e.g., batch-process recent sessions instead of processing each one)?
Multi-user: For gateway deployments with multiple users, should each user get their own observer state? (Probably yes — this aligns with existing per-user session isolation)
Relationship to Honcho: If Honcho integration is enabled, should the observer write to Honcho instead of/in addition to local blocks? Honcho already does some cross-session user modeling
Storage: [...] ~/.hermes/memories/? Or wait for the structured storage from #346 (Feature: Structured Memory System)?