Hermes Agent's current memory system stores flat text entries (MEMORY.md and USER.md) with manual add/replace/remove operations. Every memory operation is a passive read or write: the agent decides what to save and where, and there is no contradiction detection, no confidence-aware retrieval, no automatic extraction, and no forgetting mechanism. As a result, the agent can store "We use PostgreSQL" on Monday and "We switched to MySQL" on Friday, and both entries coexist until the agent happens to notice the conflict and fixes it manually.
CrewAI's Cognitive Memory (v1.10.1, MIT licensed) introduces a fundamentally different approach: memory as cognition rather than storage. Every memory operation — encode, consolidate, recall, extract, forget — is an active reasoning process powered by LLM analysis. When you store a memory, the system auto-classifies it, detects contradictions against existing knowledge, and resolves them. When you recall, the system evaluates its own confidence and searches deeper when unsure. The result is a self-maintaining knowledge base that compounds over time rather than accumulating contradictions.
This issue proposes adding a cognitive operations layer to Hermes Agent's memory system, building on the structured storage foundation proposed in #346. Where #346 defines the storage and retrieval infrastructure (typed nodes, vector search, LanceDB), this issue defines the intelligent operations that make memory behave like cognition.
    from pydantic import BaseModel

    class MemoryAnalysis(BaseModel):
        suggested_scope: str  # Hierarchical path, e.g. "/infrastructure/database"
        categories: list[str]  # Tags like ["postgresql", "migration"]
        importance: float  # 0.0 to 1.0
        extracted_metadata: dict  # Entities, dates, topics
This is the key insight: structure emerges from content. The agent doesn't need to specify types or categories because the system infers them, but it can override them when it wants control.
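The override behaviour described above can be sketched as follows. This is a minimal illustration, not CrewAI's implementation: `Analysis`, `encode`, and `fake_analyze` are hypothetical names, and the analyzer is a stub standing in for the LLM call. It assumes that caller-supplied fields always win and that analysis is skipped entirely when every field is provided.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Analysis:
    """Hypothetical stand-in for the MemoryAnalysis fields listed above."""
    suggested_scope: str = "/"
    categories: list[str] = field(default_factory=list)
    importance: float = 0.5


def encode(
    content: str,
    analyze: Callable[[str], Analysis],
    scope: Optional[str] = None,
    categories: Optional[list[str]] = None,
    importance: Optional[float] = None,
) -> Analysis:
    """Fill in missing fields from the analyzer; caller-supplied values win."""
    if scope is not None and categories is not None and importance is not None:
        # Everything was provided, so the costly analysis call is skipped.
        return Analysis(scope, categories, importance)
    inferred = analyze(content)
    return Analysis(
        suggested_scope=scope if scope is not None else inferred.suggested_scope,
        categories=categories if categories is not None else inferred.categories,
        importance=importance if importance is not None else inferred.importance,
    )


def fake_analyze(text: str) -> Analysis:
    # Stub for the LLM call; a real analyzer would inspect the text.
    return Analysis("/infrastructure/database", ["postgresql"], 0.8)


# The agent overrides importance only; scope and categories are inferred.
record = encode("We use PostgreSQL", fake_analyze, importance=1.0)
```

Keeping the analyzer as an injected callable also makes the path easy to test with a mock, which matters given that the real analysis is non-deterministic.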
2. Consolidate (triggered during encoding)
When similarity ≥0.85 is detected against existing memories, the LLM produces a ConsolidationPlan:
Example: "We use PostgreSQL for the user database" exists. New memory: "We migrated to MySQL last week." Consolidation detects the contradiction and produces: [{record_id: "abc", action: "update", updated_content: "We migrated from PostgreSQL to MySQL for the user database last week"}], insert_new: false. Result: one coherent memory instead of two contradictory ones.
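To make the example concrete, here is a small sketch of applying such a plan to a store. It assumes a plain dict as the store; `PlanAction` and `apply_plan` are hypothetical names, and only the `update` action and the `insert_new` flag come from the example above (the `delete` action is an assumption added for illustration).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlanAction:
    record_id: str
    action: str  # "update" is from the example; "delete" is an assumed extra
    updated_content: Optional[str] = None


def apply_plan(
    store: dict[str, str],
    actions: list[PlanAction],
    insert_new: bool,
    new_id: str,
    new_content: str,
) -> dict[str, str]:
    """Return a new store with the plan applied; the input is left untouched."""
    result = dict(store)
    for act in actions:
        if act.record_id not in result:
            continue  # stale plan entry: ignore rather than fail
        if act.action == "update" and act.updated_content is not None:
            result[act.record_id] = act.updated_content
        elif act.action == "delete":
            del result[act.record_id]
    if insert_new:
        result[new_id] = new_content
    return result


store = {"abc": "We use PostgreSQL for the user database"}
plan = [
    PlanAction(
        "abc",
        "update",
        "We migrated from PostgreSQL to MySQL for the user database last week",
    )
]
merged = apply_plan(
    store, plan, insert_new=False,
    new_id="def", new_content="We migrated to MySQL last week.",
)
```

Because `insert_new` is false, the incoming memory is folded into record `abc` and the store still holds a single entry.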
3. Recall (recall()) — Adaptive Depth
The RecallFlow implements confidence-based retrieval routing:
Analyze query — LLM distills the query into targeted sub-queries (skipped for queries <200 chars)
Filter and chunk — Select candidate scopes to search
Search chunks — Parallel vector search across (embeddings × scopes)
Confidence routing:
confidence ≥ 0.8 → synthesize and return results
confidence < 0.5 and budget > 0 → explore deeper (broader scopes, different strategies)
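The routing rule can be sketched as a small loop. This is an assumption-laden illustration: `search` is a hypothetical callable that returns results plus a confidence for a given depth, and the text does not say what happens between 0.5 and 0.8, so the sketch simply returns the results and flags them as not fully confident.

```python
from dataclasses import dataclass
from typing import Callable

HIGH_CONFIDENCE = 0.8  # threshold from the list above
LOW_CONFIDENCE = 0.5   # threshold from the list above


@dataclass
class RecallOutcome:
    results: list[str]
    confidence: float
    depth: int        # how many extra exploration rounds were used
    confident: bool   # True only when confidence >= HIGH_CONFIDENCE


def recall_with_routing(
    search: Callable[[int], tuple[list[str], float]],
    budget: int = 2,
) -> RecallOutcome:
    """Search deeper while confidence is low and exploration budget remains."""
    depth = 0
    results, confidence = search(depth)
    while confidence < LOW_CONFIDENCE and budget > 0:
        depth += 1
        budget -= 1
        results, confidence = search(depth)
    return RecallOutcome(results, confidence, depth, confidence >= HIGH_CONFIDENCE)


def fake_search(depth: int) -> tuple[list[str], float]:
    # Stub: confidence improves as the search widens.
    return [f"hit@{depth}"], 0.3 + 0.3 * depth


outcome = recall_with_routing(fake_search, budget=2)
```

With the stub above, the first pass is below 0.5, one deeper pass lifts it into the middle band, and the loop stops there with `confident` set to false.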
Defaults: semantic=0.5, recency=0.3, importance=0.2, half_life=30 days. With these weights, a critical architecture decision from 6 months ago can outrank a trivial note from yesterday that merely happens to mention the same keyword, provided the older memory is a clearly better semantic match.
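A worked version of the weighted score helps check that claim. The weights and half-life are the defaults quoted above; the exponential half-life decay for recency is an assumption (the exact decay curve is not given here), and the function names and sample inputs are made up for illustration.

```python
def recency_score(age_days: float, half_life_days: float = 30.0) -> float:
    """Assumed decay: the score halves every `half_life_days`."""
    return 0.5 ** (age_days / half_life_days)


def composite_score(
    similarity: float,
    age_days: float,
    importance: float,
    w_semantic: float = 0.5,
    w_recency: float = 0.3,
    w_importance: float = 0.2,
    half_life_days: float = 30.0,
) -> float:
    """Weighted sum of semantic similarity, recency, and importance."""
    return (
        w_semantic * similarity
        + w_recency * recency_score(age_days, half_life_days)
        + w_importance * importance
    )


# A strong, important match from six months ago...
old_decision = composite_score(similarity=0.9, age_days=180, importance=0.95)
# ...versus a weak, unimportant keyword match from yesterday.
fresh_note = composite_score(similarity=0.4, age_days=1, importance=0.1)
```

Under these assumptions `old_decision` scores higher than `fresh_note`. The ordering depends on the similarity gap, though: if the fresh note matched the query as well as the old decision did, the recency term would put the fresh note first.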
Each result includes match_reasons (why it scored high) and evidence_gaps (what information is still missing) — the system knows what it doesn't know.
4. Extract (extract_memories())
Decomposes raw text into atomic, self-contained facts using LLM:
    raw = """After reviewing options, the team recommends PostgreSQL for JSONB support.
    Estimated cost is $2,400/month on RDS. Compliance requires EU data residency.
    DevOps prefers managed services."""
    facts = memory.extract_memories(raw)
    # → ["Team recommends PostgreSQL for user database due to JSONB support",
    #    "Estimated database cost is $2,400/month on RDS",
    #    "Compliance requires all user data to remain in EU regions",
    #    "DevOps prefers managed services over self-hosted"]
Each extracted fact enters the full encoding pipeline independently. This is what powers automatic memory capture from task outputs — the agent doesn't need to decide what's worth remembering.
5. Forget (forget())
Targeted purging by scope, categories, age, or specific record IDs:
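A targeted purge along these four dimensions might look like the sketch below. It is not CrewAI's `forget()` signature: `Record`, the parameter names, and the semantics are assumptions. In particular, the sketch treats the filters as a conjunction (a record is purged only if it matches every filter supplied) and purges nothing when no filter is given.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Record:
    id: str
    scope: str
    categories: set[str]
    created_at: datetime


def _in_scope(path: str, scope: str) -> bool:
    """Match on path boundaries so "/infra" does not match "/infrastructure"."""
    root = scope.rstrip("/")
    return path == root or path.startswith(root + "/")


def forget(
    records: list[Record],
    *,
    scope: Optional[str] = None,
    categories: Optional[set[str]] = None,
    older_than: Optional[timedelta] = None,
    record_ids: Optional[set[str]] = None,
    now: Optional[datetime] = None,
) -> list[Record]:
    """Return the surviving records after purging those matching all filters."""
    filters = []
    if scope is not None:
        filters.append(lambda r: _in_scope(r.scope, scope))
    if categories:
        filters.append(lambda r: bool(r.categories & categories))
    if older_than is not None:
        cutoff = (now or datetime.now(timezone.utc)) - older_than
        filters.append(lambda r: r.created_at < cutoff)
    if record_ids:
        filters.append(lambda r: r.id in record_ids)
    if not filters:
        return list(records)  # no filter given: purge nothing
    return [r for r in records if not all(f(r) for f in filters)]


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
records = [
    Record("a", "/infrastructure/database", {"postgresql"}, now - timedelta(days=200)),
    Record("b", "/infrastructure/database", {"mysql"}, now - timedelta(days=5)),
    Record("c", "/compliance/eu", {"gdpr"}, now - timedelta(days=200)),
]
# Purge infrastructure memories older than 90 days.
kept = forget(records, scope="/infrastructure", older_than=timedelta(days=90), now=now)
```

Only record `a` is both inside the scope and old enough, so `b` and `c` survive.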
Memories are organized in filesystem-like paths: /infrastructure/database, /compliance/eu, /project/alpha/decisions. The LLM auto-assigns scopes during encoding, building a self-organizing hierarchy.
Two access patterns:
MemoryScope — Restricts an agent to a subtree (e.g., memory.scope("/agent/researcher"))
MemorySlice — Reads from multiple disjoint branches (e.g., memory.slice(["/compliance", "/security"]))
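The two access patterns reduce to prefix matching over scope paths. The sketch below is a simplified illustration with a hypothetical `View` class, not the real `MemoryScope` or `MemorySlice` types: a view with one root behaves like a scope, and a view with several roots behaves like a slice.

```python
from dataclasses import dataclass


def _normalize(path: str) -> str:
    return "/" + path.strip("/")


def _within(path: str, root: str) -> bool:
    """True if `path` is `root` or lies beneath it, on path boundaries."""
    path, root = _normalize(path), _normalize(root)
    return root == "/" or path == root or path.startswith(root + "/")


@dataclass(frozen=True)
class View:
    """Read-only view over (scope, content) records limited to some roots."""
    records: tuple[tuple[str, str], ...]
    roots: tuple[str, ...]

    def recall(self) -> list[str]:
        return [
            content
            for scope, content in self.records
            if any(_within(scope, root) for root in self.roots)
        ]


records = (
    ("/agent/researcher/notes", "Survey notes on vector stores"),
    ("/compliance/eu", "User data must stay in EU regions"),
    ("/security/keys", "Rotate API keys quarterly"),
    ("/compliance-archive/2019", "Superseded policy"),
)

researcher = View(records, ("/agent/researcher",))   # scope-like: one subtree
audit = View(records, ("/compliance", "/security"))  # slice-like: disjoint branches
```

Matching on path boundaries keeps `/compliance-archive` out of the `/compliance` branch, which a naive string prefix check would let through.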
Key Design Decisions
Structure emerges from content — No predefined schema. The LLM infers scope, categories, importance. This avoids the "8 fixed types" problem where the agent needs to learn a taxonomy.
Consolidation over contradiction edges — Instead of maintaining "Contradicts" graph edges (Spacebot), resolve conflicts at write time via LLM. Cleaner data, fewer stale edges.
Confidence-aware retrieval — The system tells you when it's not sure, rather than silently returning low-quality matches.
Non-blocking writes — remember_many() runs encoding in a background thread; recall() auto-drains pending writes first. Good for performance-sensitive paths.
Graceful degradation — Every LLM call in the pipeline has a fallback (safe defaults on failure). If the analysis LLM is down, memories still get stored with default importance/scope.
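The last two decisions can be illustrated together. The sketch below assumes a single worker thread and an injected `encode` callable; `BackgroundMemory`, the record layout, and the fallback values are hypothetical, not CrewAI's internals. Writes are queued in the background, recall drains them first, and an encoding failure falls back to default scope and importance instead of dropping the memory.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable


class BackgroundMemory:
    def __init__(self, encode: Callable[[str], dict]):
        self._encode = encode
        self._records: list[dict] = []
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending: list[Future] = []

    def remember_many(self, items: list[str]) -> None:
        """Queue encoding on the worker thread and return immediately."""
        self._pending.append(self._pool.submit(self._encode_all, list(items)))

    def _encode_all(self, items: list[str]) -> None:
        for item in items:
            try:
                record = self._encode(item)
            except Exception:
                # Analysis failed: store with assumed safe defaults.
                record = {"content": item, "scope": "/", "importance": 0.5}
            with self._lock:
                self._records.append(record)

    def recall(self, keyword: str) -> list[str]:
        """Drain pending writes, then do a naive keyword match."""
        pending, self._pending = self._pending, []
        for future in pending:
            future.result()  # block until the queued batch is stored
        with self._lock:
            return [
                r["content"]
                for r in self._records
                if keyword.lower() in r["content"].lower()
            ]

    def close(self) -> None:
        self._pool.shutdown(wait=True)


def flaky_encode(text: str) -> dict:
    # Stub analyzer that fails for some inputs, as an unavailable LLM would.
    if "MySQL" in text:
        raise RuntimeError("analysis LLM unavailable")
    return {"content": text, "scope": "/infrastructure/database", "importance": 0.8}


memory = BackgroundMemory(flaky_encode)
memory.remember_many(["We use PostgreSQL", "We migrated to MySQL"])
hits = memory.recall("mysql")
memory.close()
```

The keyword match stands in for vector search here; the point is only the ordering guarantee (recall never misses a queued write) and the fallback path.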
Current State in Hermes Agent
What We Have
| Component | Current Implementation | Gap |
| --- | --- | --- |
| Memory write | Manual add with flat text entry | No auto-classification, no contradiction detection |
It extends tools/memory_tool.py with new operations (recall, extract, forget)
It requires LLM integration inside the memory tool (not just storing text the agent provides)
It modifies the memory storage layer (new fields: scope, categories, importance, embeddings)
It integrates with context compression (automatic memory extraction)
It needs embedding infrastructure as a core dependency
It must work across all platforms (CLI, Telegram, Discord) uniformly
What We'd Need
Embedding infrastructure — Local (fastembed/sentence-transformers) or API-based (OpenAI). LanceDB as vector store (Apache 2.0, already validated by both Spacebot and CrewAI).
Auxiliary LLM calls — Use the existing auxiliary client (agent/auxiliary_client.py, same pattern as session_search) for encoding analysis, consolidation, extraction, and query analysis.
Memory tree visualization: tree command showing scope hierarchy with counts
Deliverable: Multi-agent, multi-user memory with access control
Pros & Cons
Pros
Self-maintaining knowledge — Contradictions auto-resolved instead of silently accumulating. The agent stops having to manually detect and fix conflicting memories.
No capacity ceiling — Importance decay + pruning replaces hard character limits (currently at 97% capacity). Knowledge compounds instead of hitting walls.
Semantic retrieval — Find relevant memories by meaning, not just keyword overlap. "What database are we using?" matches "We migrated to MySQL" even without shared keywords.
Automatic knowledge capture — Memory extraction from task outputs and context compression means less information is lost between sessions.
Confidence awareness — The system tells you when retrieval is uncertain, enabling better decision-making.
Backward compatible — Phase 1 maintains existing add/replace/remove interface. Cognitive features are additive.
MIT-licensed reference — CrewAI's Python implementation (3,500 lines) can be freely studied and adapted.
Proven at scale — CrewAI processes billions of agentic executions; this design is battle-tested.
Configurable — Weights, thresholds, half-life, and depth are all tunable per deployment.
Cons / Risks
LLM cost on every write — Encoding analysis adds 1-2 LLM calls per remember(). For high-volume scenarios, this could be expensive. Mitigations: use cheap models (gpt-4o-mini), skip analysis when all fields provided (Group A path), batch processing.
Latency on writes — LLM calls add ~200-500ms per memory write. Mitigation: non-blocking writes for batch operations.
Complexity jump — Moving from flat text files to LLM-powered cognitive pipelines is a major architectural change. Must be phased carefully to avoid breaking the current simple workflow.
Dependency on auxiliary LLM — Encoding/consolidation/extraction require a working LLM. If the auxiliary model is unavailable, cognitive features degrade to simple storage. CrewAI handles this with graceful fallbacks — we should too.
Embedding model dependency — Vector search requires either an API-based embedder (adds API cost) or a local model (adds ~100MB disk + memory). Decision needed.
Testing complexity — Cognitive operations are non-deterministic (LLM-dependent). Need mocked tests + integration tests with real models.
Risk of over-classification — LLM might assign incorrect scopes or importance. Bad classifications could make recall worse. Mitigated by configurable overrides and fallback defaults.
Open Questions
Embedding model: local vs. API? Local (fastembed with all-MiniLM-L6-v2, 384 dims, ~100MB) is private and fast. API (OpenAI text-embedding-3-small, 1536 dims) is higher quality but adds cost and latency. Could support both with a config flag.
Should cognitive encoding be opt-in or opt-out? Default enabled with memory.cognitive: true in config, or default disabled requiring explicit opt-in? Given the LLM cost, opt-in might be safer initially.
Consolidation threshold? CrewAI uses 0.85 cosine similarity. Too low = false positives (merging unrelated memories). Too high = missed contradictions. Should be configurable.
How should scoped memory interact with current MEMORY.md/USER.md distinction? Replace the two-file split with scopes (/agent/notes and /user/profile) or keep them as special-case scopes?
Integration with session_search? Should recall also search session transcripts, or keep structured memory and session recall as separate tools?
Storage backend? LanceDB (validated by both Spacebot and CrewAI) vs. extending the existing SQLite state.db with pgvector-like vector operations?
Overview
Research source: CrewAI memory source code (~3,500 lines across 8 files), Blog post. License: MIT.
Research Findings
CrewAI's Architecture: Five Cognitive Operations
CrewAI's Memory class exposes five cognitive operations, each backed by an LLM-powered pipeline:
1. Encode (remember())
When content is stored, an EncodingFlow runs a 5-step pipeline:
The LLM produces a MemoryAnalysis for each item:
This is the key insight: structure emerges from content. The agent doesn't need to specify types or categories because the system infers them, but it can override them when it wants control.
2. Consolidate (triggered during encoding)
When similarity ≥0.85 is detected against existing memories, the LLM produces a ConsolidationPlan:
Example: "We use PostgreSQL for the user database" exists. New memory: "We migrated to MySQL last week." Consolidation detects the contradiction and produces: [{record_id: "abc", action: "update", updated_content: "We migrated from PostgreSQL to MySQL for the user database last week"}], insert_new: false. Result: one coherent memory instead of two contradictory ones.
3. Recall (recall()) — Adaptive Depth
The RecallFlow implements confidence-based retrieval routing:
Composite scoring formula:
Defaults: semantic=0.5, recency=0.3, importance=0.2, half_life=30 days. This means a critical architecture decision from 6 months ago outranks a trivial note from yesterday that happens to mention the same keyword.
Each result includes match_reasons (why it scored high) and evidence_gaps (what information is still missing) — the system knows what it doesn't know.
4. Extract (extract_memories())
Decomposes raw text into atomic, self-contained facts using LLM:
Each extracted fact enters the full encoding pipeline independently. This is what powers automatic memory capture from task outputs — the agent doesn't need to decide what's worth remembering.
5. Forget (forget())
Targeted purging by scope, categories, age, or specific record IDs:
Hierarchical Scopes
Memories are organized in filesystem-like paths: /infrastructure/database, /compliance/eu, /project/alpha/decisions. The LLM auto-assigns scopes during encoding, building a self-organizing hierarchy.
Two access patterns:
MemoryScope — Restricts an agent to a subtree (e.g., memory.scope("/agent/researcher"))
MemorySlice — Reads from multiple disjoint branches (e.g., memory.slice(["/compliance", "/security"]))
Key Design Decisions
remember_many() runs encoding in a background thread; recall() auto-drains pending writes first. Good for performance-sensitive paths.
Current State in Hermes Agent
What We Have
Relevant Existing Issues
Implementation Plan
Skill vs. Tool Classification
This should be a core codebase change because:
It extends tools/memory_tool.py with new operations (recall, extract, forget)
What We'd Need
Auxiliary LLM calls — Use the existing auxiliary client (agent/auxiliary_client.py, same pattern as session_search) for encoding analysis, consolidation, extraction, and query analysis.
recall (semantic search), extract (decompose text), forget (targeted purge). Existing actions (add/replace/remove) remain backward compatible.
add, optionally run LLM analysis for auto-scope/categories/importance and check for contradictions.
extract_memories on the compressed content and store atomic facts.
Phased Rollout
Phase 1: Cognitive Encoding — Auto-Classification + Contradiction Resolution
memory_tool.py with optional LLM analysis on add (auto-infer scope, categories, importance if not provided by agent)
scope and importance parameters to the memory tool schema
memory.cognitive: true|false in config.yaml (default: true)
Phase 2: Semantic Recall + Composite Scoring
recall action: semantic search with composite scoring (similarity × w_sim + recency × w_rec + importance × w_imp)
match_reasons and evidence_gaps
Phase 3: Automatic Memory Extraction + Forgetting
extract action: decompose text blobs into atomic facts
forget action: targeted purge by scope, age, categories
Phase 4: Scoped Access + Multi-Agent Memory
tree command showing scope hierarchy with counts
Pros & Cons
Pros
Cons / Risks
remember(). For high-volume scenarios, this could be expensive. Mitigations: use cheap models (gpt-4o-mini), skip analysis when all fields provided (Group A path), batch processing.
Open Questions
memory.cognitive: true in config, or default disabled requiring explicit opt-in? Given the LLM cost, opt-in might be safer initially.
/agent/notes and /user/profile) or keep them as special-case scopes?
recall also search session transcripts, or keep structured memory and session recall as separate tools?