Your Agent Has Amnesia
Every agentic memory system is a vector database pretending to be a brain. Here’s what happens when you build one that actually works like one.
Your coding agent is brilliant for thirty minutes.
Then the context window closes, and it forgets everything.
Tomorrow morning, it doesn’t know your codebase uses yarn instead of npm. It doesn’t know you migrated from React to Svelte last month. It doesn’t know that the last time it tried to refactor the auth module, it introduced a race condition that took three hours to debug.
The industry’s answer to this has been, roughly: dump everything into a vector database and hope cosine similarity is enough.
LongMemEval --- an ICLR 2025 benchmark that tests chat assistants on long-term interactive memory --- found that even the best commercial systems achieve only 30-70% accuracy on memory tasks, with 30-60% performance drops versus oracle retrieval as conversation history scales. Your agent doesn’t just forget. It forgets badly, and it gets worse the longer you use it.
It isn’t enough. And I’ve spent the last several months building the thing I think should exist instead.
The Two Failure Modes
Current LLM memory falls into two buckets, and both are broken in ways that compound over time.
Failure Mode 1: Context Stuffing. Cram the entire conversation history --- or worse, the entire codebase --- into the context window. Tools like gitingest and repo2txt try this. It works until it doesn’t. Faros AI’s research on context engineering shows model correctness starts dropping around 32,000 tokens, even for models claiming much larger windows. You’re paying for millions of tokens the model mostly ignores, and you’re introducing noise that actively degrades performance. A hallucination in the conversation history gets referenced repeatedly --- “context poisoning” --- and the agent starts building on its own mistakes.
Failure Mode 2: Store-Everything RAG. The “serious” approach. Stand up a vector database --- Pinecone, Weaviate, Qdrant, whatever --- chunk your data, embed it, and retrieve by cosine similarity at query time. Some systems have gotten more sophisticated --- Mem0 can update and delete facts, Zep tracks temporal validity, Letta lets agents self-edit memory. But the fundamental model is the same: store first, sort later.
The problem: none of these systems filter at ingestion.
A routine “sounds good” gets the same processing as “my production database just corrupted.” The store grows monotonically --- nothing is ever consolidated into more abstract knowledge, nothing decays, nothing is forgotten by design. And the retrieval is one-dimensional: cosine similarity, with no awareness of your current cognitive state or the associative relationships between stored knowledge.
This is not how memory works.
Not human memory.
Not any memory system that scales.
What Brains Actually Do
Here’s the thing that caught my attention when I started reading the neuroscience literature: human memory is efficient precisely because it does the opposite of what we build in software.
Memory is reconstructive, not reproductive. Every act of recall doesn’t just read the memory --- it rewrites it. Karim Nader’s landmark 2000 paper demonstrated that recalled memories enter a “labile” state where they can be modified before being re-stored. This isn’t a bug in biological wetware. It’s the mechanism that keeps memories current. When you remember that your favorite restaurant moved locations, the old address doesn’t sit in a separate row contradicting the new one. The memory updates in place.
Memory is surprisal-gated. Karl Friston’s predictive coding framework --- one of the most influential theories in modern neuroscience --- proposes that the brain constantly generates predictions about incoming sensory input. Only prediction errors --- events that violate expectations --- get encoded into long-term memory. This is why you remember the car accident but not the ten thousand uneventful drives that preceded it. The hippocampus acts as a novelty detector, firing strongly in response to surprise and triggering consolidation only when something unexpected happens.
Memory is state-dependent. Gordon Bower’s 1981 experiments showed that emotional state at encoding affects retrieval: memories formed during fear are more easily recalled during fear. The amygdala tags memories with emotional salience, creating retrieval pathways that are invisible to pure semantic similarity. When you’re panicking about a production outage, your brain doesn’t surface that calm architectural discussion from last Tuesday. It surfaces the last time you fixed a production outage.
Memory is lossy by design. During sleep --- particularly slow-wave deep sleep --- the hippocampus “replays” the day’s episodic memories, allowing the neocortex to extract generalizable semantic knowledge from specific events. “I typed kubectl rollout restart at 3:47 PM on Tuesday after seeing OOMKilled errors in the staging pod” compresses to “restart deployments when you see OOM errors in Kubernetes.” The episodic detail is discarded. The useful abstraction survives.
Memory is associative. Collins and Loftus’s 1975 spreading activation theory describes semantic memory as a network where recalling one concept pre-activates related concepts. Hearing “Docker” primes “container,” “Kubernetes,” “nginx,” “deployment” --- before you consciously ask for them. This is why human recall has a “train of thought” quality that point-lookup vector search completely lacks.
Five mechanisms. Reconstructive recall. Surprisal-gated encoding. State-dependent retrieval. Lossy consolidation. Associative activation. Every modern memory system I’ve evaluated ignores all five.
So I built one that doesn’t.
mnemos: Five Modules, One Pipeline
mnemos --- from the Ancient Greek mnemos (μνῆμος), meaning “mindful, remembering” --- implements each of these five neuroscience mechanisms as a composable Python module. They can be used independently or composed together through a single orchestrating engine.
The architecture is a pipeline that mirrors the human memory cycle:
ENCODE:   Input → SurprisalGate (filter) → AffectiveRouter (tag) → Store
                                                  ↓
                                     SleepDaemon episodic buffer

RETRIEVE: Query → SpreadingActivation (associative context)
                  + AffectiveRouter (state-dependent re-ranking)
                  + MutableRAG (background reconsolidation)
                  → Ranked results

MAINTAIN: Idle trigger → SleepDaemon (episodic → semantic compression)
Let me walk through each module, because the design decisions are where the real differentiation lives.
1. SurprisalGate --- The Prediction Error Filter
Every input passes through a surprisal gate before it touches long-term storage. A lightweight LLM continuously predicts the user’s next intent from conversation history. When the actual input arrives, mnemos computes the cosine distance between the prediction embedding and the actual input embedding.
Low divergence? Expected input. Discarded. High divergence? Genuine surprise. Stored with a salience weight proportional to the prediction error.
from mnemos import MnemosEngine, MnemosConfig, SurprisalConfig, Interaction

engine = MnemosEngine(
    config=MnemosConfig(
        surprisal=SurprisalConfig(threshold=0.3)  # ~72 degrees in embedding space
    )
)

result = await engine.process(Interaction(role="user", content="sounds good"))
# result.stored = False (low surprisal, discarded)

result = await engine.process(Interaction(role="user", content="I'm migrating from Python to Rust"))
# result.stored = True (surprisal 0.87, stored with high salience)

The threshold of 0.3 corresponds to roughly 72 degrees in embedding space --- wide enough to catch meaningful shifts in topic or intent, narrow enough to filter the conversational noise that dominates most agent interactions.
Why this matters mechanically: in a typical hour-long coding session, maybe 15% of what you say to an agent is actually worth remembering. The rest is acknowledgments, clarifications, and routine instructions. Without surprisal gating, your memory store fills with noise, and every future retrieval has to wade through it. The gate eliminates context bloat at the ingestion layer.
No other agentic memory system I’ve evaluated does this. Mem0, Zep, LangMem, Letta --- they all process everything that comes in. Some have gotten smart about updating and deduplicating after storage, but none of them ask the fundamental question at the gate: is this worth remembering at all? That’s like filing every piece of mail you receive and then hiring a librarian to keep the filing cabinet organized. The better answer is to stop filing junk mail in the first place.
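The gating logic itself is small enough to sketch. This is an illustrative reconstruction, not the library's internals --- the function names and the exact distance formula are assumptions; the only details taken from the text are the 0.3 threshold and that salience is proportional to prediction error:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def gate(predicted_emb: list[float], actual_emb: list[float],
         threshold: float = 0.3) -> tuple[bool, float]:
    """Return (stored, salience): store only inputs that diverge from prediction."""
    surprisal = cosine_distance(predicted_emb, actual_emb)
    if surprisal < threshold:
        return False, 0.0       # expected input: discard
    return True, surprisal      # surprising input: salience proportional to error
```

The real system gets `predicted_emb` by asking a lightweight LLM to predict the next user intent and embedding that prediction; the sketch only shows the comparison step.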
2. MutableRAG --- Memory Reconsolidation
This is the module that solves the stale data problem, and it’s the one I’m most proud of architecturally.
Standard RAG is append-only. When you tell your agent “I use React” in January and “we’re migrating to Svelte” in March, both facts live forever in the vector store. Every retrieval that touches frontend frameworks forces the LLM to spend tokens resolving the contradiction. Multiply this by hundreds of evolving facts across months of interaction, and you have a memory system that actively degrades over time.
MutableRAG implements the reconsolidation cycle from neuroscience. When a memory is retrieved, it’s flagged as “labile” --- destabilized, open to modification. A background async task checks whether new conversational context contradicts or updates the stored fact. If it does, the chunk is physically overwritten with the synthesized update. A version counter tracks reconsolidations.
# After: "I use React for frontend work."
# Then later: "We're migrating to Svelte next quarter."
memories = await engine.retrieve("frontend framework", reconsolidate=True)
# The stored chunk is now:
# "User is migrating from React to Svelte." (version=2)
# The old "I use React" chunk no longer exists.

The cooldown mechanism prevents thrashing --- a chunk can only be reconsolidated once per configurable interval (default: 60 seconds). And the labile window is capped at 20 chunks per retrieval, so reconsolidation doesn't block the retrieval path.
Mem0 handles this differently --- its LLM decides whether to ADD, UPDATE, or DELETE extracted facts. Zep invalidates old facts with temporal metadata. Both solve the stale data problem to a degree. But neither implements reconsolidation as a first principle. mnemos treats every retrieval as an automatic opportunity to update stale information. The difference: reconsolidation isn’t a feature bolted onto storage. It’s the memory model. Memories are mutable by default, versioned, and subject to cooldown to prevent thrashing.
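The reconsolidation cycle reduces to a few moving parts: a labile chunk, a contradiction check, an in-place overwrite with a version bump, and a cooldown. A minimal sketch, assuming injected callables where the real system would make LLM calls (`contradicts` and `synthesize` are stand-ins, not mnemos APIs):

```python
import time
from dataclasses import dataclass

COOLDOWN_SECONDS = 60  # default cooldown from the article

@dataclass
class MemoryChunk:
    text: str
    version: int = 1
    last_reconsolidated: float = 0.0

def reconsolidate(chunk: MemoryChunk, new_context: str,
                  contradicts, synthesize) -> bool:
    """Update a retrieved (labile) chunk in place if new context supersedes it."""
    now = time.time()
    if now - chunk.last_reconsolidated < COOLDOWN_SECONDS:
        return False                                   # cooldown: prevent thrashing
    if not contradicts(chunk.text, new_context):
        return False                                   # nothing stale: leave as-is
    chunk.text = synthesize(chunk.text, new_context)   # physically overwrite
    chunk.version += 1
    chunk.last_reconsolidated = now
    return True
```

In the real pipeline this runs as a background async task after retrieval, so the labile check never blocks the retrieval path.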
3. AffectiveRouter --- State-Dependent Retrieval
Embedding models retrieve on semantic similarity alone. The query “server is down” and “what’s our server stack?” have nearly identical embeddings. But the first is a crisis and the second is an architecture question. The emotional context --- the urgency, the complexity, the valence --- is invisible to cosine similarity.
AffectiveRouter classifies every interaction on three axes:
Valence: -1.0 (negative) to +1.0 (positive)
Arousal: 0.0 (calm) to 1.0 (urgent)
Complexity: 0.0 (simple) to 1.0 (complex)
These dimensions come from Russell’s circumplex model of affect (1980), extended with a complexity axis for cognitive load. The classification is attached as metadata to every stored chunk.
During retrieval, the scoring formula blends semantic similarity with affective state match:
final_score = (cosine_similarity * 0.7) + (state_match * 0.3)

When you're panicking about a production outage (high arousal, negative valence), mnemos surfaces how you resolved previous crises --- not calm architectural documentation. When you're doing exploratory design work (low arousal, high complexity), it surfaces design discussions and architectural decisions, not incident responses.
No other memory system in this space does state-dependent retrieval. It’s a genuinely novel mechanism for agentic memory, and it solves a real problem: the context mismatch between what the user needs right now and what’s semantically closest in the store.
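The 70/30 blend can be sketched directly. The article specifies the three axes and the weights; the particular distance function used for `state_match` below is my assumption (normalized L1 distance over the three axes), not necessarily what mnemos implements:

```python
def state_match(query_state: tuple, chunk_state: tuple) -> float:
    """Similarity of two (valence, arousal, complexity) tuples, in [0, 1].

    Valence spans [-1, 1]; arousal and complexity span [0, 1].
    The exact metric here is illustrative.
    """
    dv = abs(query_state[0] - chunk_state[0]) / 2.0   # normalize valence to [0, 1]
    da = abs(query_state[1] - chunk_state[1])
    dc = abs(query_state[2] - chunk_state[2])
    return 1.0 - (dv + da + dc) / 3.0

def final_score(cosine_similarity: float, query_state: tuple, chunk_state: tuple) -> float:
    """Blend semantic similarity with affective state match, 70/30."""
    return cosine_similarity * 0.7 + state_match(query_state, chunk_state) * 0.3
```

Under this blend, an incident memory with a slightly worse embedding match can still outrank calm documentation when the query arrives in a crisis state.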
4. SleepDaemon --- Consolidation
Every interaction enters an episodic buffer regardless of whether it passed the surprisal gate. Periodically --- triggered by inactivity or on a configurable schedule --- the SleepDaemon runs a consolidation pass.
The process mirrors hippocampal-neocortical transfer:
The LLM reviews the episodic buffer and extracts permanent facts, preferences, and patterns
Semantic chunks are created from the extracted knowledge
These chunks are stored in long-term memory
The raw episodic buffer is pruned
“User typed kubectl rollout restart deployment/api after seeing OOMKilled errors, then checked pod memory limits, increased from 256Mi to 512Mi, confirmed resolution” becomes: “When Kubernetes pods hit OOMKilled, user increases memory limits and restarts deployments.”
The episodic detail is discarded. The generalizable knowledge survives. Your memory store gets smaller and more useful over time, not larger and noisier.
The consolidation interval defaults to one hour with a minimum of 10 episodes before triggering. For Claude Code users, consolidation fires automatically on the PreCompact and Stop hook events --- the agent consolidates its memories right before the context window gets compressed or the session ends.
5. SpreadingActivation --- Associative Graph Retrieval
Standard vector search is a point lookup: find the K nearest neighbors to the query embedding. It returns exact matches and misses everything laterally related.
SpreadingActivation builds a graph where memory chunks are nodes and edges connect semantically similar chunks (cosine similarity above a configurable threshold, default 0.6). When a query arrives:
Find the closest node by embedding similarity
Inject activation energy (1.0) at that node
Energy spreads BFS-style through edges, decaying 20% per hop
All nodes above the activation threshold (0.3) are included in results
Query “Docker networking issue” hits the Docker node (energy 1.0). Activation spreads to the Ubuntu node (0.8), the nginx node (0.64), and that old nginx reverse proxy config you set up three months ago (0.51). All of them are returned --- not because they matched the query embedding, but because they’re associatively connected to concepts that did.
This is the mechanism that gives mnemos its “train of thought” quality. The LLM doesn’t just get the exact answer. It gets the associative neighborhood --- the related concepts, the adjacent knowledge, the contextual web that a human expert would naturally bring to mind.
Scoped Memory --- The Real Feature
Everything above is the mechanism. This is the product.
mnemos isolates memory into three scopes:
Project scope: Facts about this repo. “This project uses Svelte 5 with runes.” “The auth module lives in src/lib/server/auth.” “Tests run with vitest, not jest.”
Workspace scope: Facts about a group of related repos. “The API convention across all services is REST with snake_case.” “Staging deploys go through the deploy-staging GitHub Action.”
Global scope: Facts about you. “I prefer explicit error handling over try/catch.” “I use yarn, not npm.” “Always add JSDoc to exported functions.”
Every memory operation --- store, retrieve, consolidate, forget --- is scoped. When you switch from your frontend repo to your backend repo, the agent doesn’t carry React knowledge into a Go codebase. When you start a new project, it starts with a clean project scope but retains your global preferences.
This sounds simple. It is the thing that actually matters.
The scoping system includes governance controls: configurable capture modes (auto, manual, or hook-triggered), retention TTL so project memories can expire after a configurable number of days, and per-scope chunk caps so no single project bloats the store. Every retrieval filters by scope before it touches the pipeline --- the surprisal gate, affective router, and spreading activation all operate within scope boundaries.
The competitors mostly don’t do this. Mem0 has user/agent/session metadata but not hierarchical scope isolation with governance. Zep tracks sessions but doesn’t enforce scope boundaries on retrieval. When your agent is working in repo A and retrieves a memory from repo B because the embeddings happened to be close --- that’s a scope leak. In practice, scope leaks are the most common failure mode I saw in real coding workflows. mnemos eliminates them by design.
MCP-Native From Day One
MCP is becoming the nervous system for agentic tools, and the developers getting the most leverage are the ones building the richest tool environments.
mnemos was designed as an MCP server from the beginning --- not retrofitted. Eight tools, two resources, zero external dependencies for basic operation:
{
  "mcpServers": {
    "mnemos": {
      "command": "mnemos-mcp",
      "env": {
        "MNEMOS_LLM_PROVIDER": "ollama",
        "MNEMOS_STORE_TYPE": "sqlite"
      }
    }
  }
}

That's the entire configuration for Claude Code. Drop it in your MCP config, and every agent session gets persistent, biomimetic memory. The same configuration works for Cursor, Windsurf, and any MCP-compatible client.
The tool surface is deliberately minimal:
mnemos_store: Process memory through the full pipeline (surprisal → affective → store)
mnemos_retrieve: Retrieve with spreading activation + emotional re-ranking
mnemos_consolidate: Trigger sleep consolidation (episodic → semantic)
mnemos_forget: Delete a specific memory
mnemos_stats: System-wide statistics
mnemos_health: Readiness diagnostics
mnemos_inspect: Full details on a specific chunk
mnemos_list: List stored memories
There’s also a Claude Code hook integration (hook_autostore) that automatically ingests user prompts and high-signal tool failures --- so the agent builds memory passively without explicit mnemos_store calls. Consolidation triggers on PreCompact and Stop events. The memory system runs in the background while you work.
For the OpenClaw users reading this: mnemos includes an OpenClawProvider for both LLM and embedding providers. Local-first, with no external API dependency required if you're running Ollama.
Codex support is available via MCP and AGENTS.md, and is included as a documented beta path. Run mnemos-cli antigravity codex for the setup guide.
The Honest Comparison
Let me be direct about positioning, because the HN crowd will check --- and they should.
The agentic memory space is more sophisticated than it was a year ago. Mem0 has $24M in funding, 41,000 GitHub stars, and ships an OpenMemory MCP server. Zep built Graphiti, a temporal knowledge graph with bi-temporal invalidation that’s genuinely interesting architecturally. Letta (the company behind MemGPT) raised $10M and their agent self-edits its own memory via tool calls. These are serious projects with serious teams.
So where does mnemos actually differentiate?
The comparison is more nuanced than “we do everything, they do nothing.” Mem0 and Letta both mutate memories --- but through different mechanisms. Mem0 uses an LLM to decide ADD/UPDATE/DELETE actions on extracted facts. Letta lets the agent directly edit its own core memory. Zep invalidates old facts temporally rather than overwriting them. All three solve the stale data problem to varying degrees.
What none of them do:
Filter at ingestion. Every competitor stores first, sorts later. mnemos decides at write time whether information is worth remembering, based on prediction error. This keeps the store clean rather than relying on retrieval-time ranking to wade through noise.
State-dependent retrieval. Nobody else blends emotional/cognitive context into retrieval scoring. The 70/30 similarity/state split means crisis memories surface during crises and design memories surface during design --- without explicit query engineering.
Sleep consolidation. No other system compresses episodic detail into semantic abstractions. Mem0 extracts facts from conversations, which is related but distinct --- it doesn’t have a consolidation daemon that periodically distills raw interaction logs into durable knowledge and prunes the originals. mnemos’s memory store gets smaller and more useful over time.
Spreading activation. Zep and Mem0 both offer graph-based storage, but neither implements energy-propagation retrieval. The difference: their graphs store relationships. mnemos’s graph activates relationships during retrieval, returning associative neighborhoods rather than point matches.
The composability is a genuine differentiator for adoption: each module works independently. Start with SurprisalGate alone. Add MutableRAG when stale data becomes a problem. Layer on SpreadingActivation when point-lookup retrieval isn’t enough. The composition is opt-in, not all-or-nothing.
Here’s what I want to be honest about: mnemos is v0.2.0. It’s beta.
Mem0 has 41,000 GitHub stars and exclusive integration with the AWS Agent SDK. Letta is backed by Felicis Ventures. I have 18 test files and a conviction that the architecture is right. I’m positioning on architectural merit --- the design decisions that the neuroscience literature says matter --- not market dominance. The question I’m betting on: is it better to build the right memory architecture from the ground up, or to bolt increasingly sophisticated features onto a foundation that wasn’t designed for them?
The Memory Safety Layer
One thing I want to highlight because it matters for production use: mnemos includes a memory write firewall that runs at every ingestion point.
The firewall detects:
Secrets: Private keys, AWS access keys, API tokens, GitHub PATs, credential assignments
PII: Email addresses, phone numbers, SSNs, credit card numbers
Each category has a configurable action: allow, redact (replace with [REDACTED_SECRET]), or block (reject the write entirely). Defaults are block for secrets and redact for PII.
This runs at three points: initial ingestion through SurprisalGate, reconsolidation through MutableRAG, and consolidation through SleepDaemon. Every write path is covered.
Why this matters: if you’re running an agentic memory system that automatically ingests conversation content --- which mnemos does via hooks --- you’re one export OPENAI_API_KEY=sk-... away from permanently storing a secret in your memory database. The firewall catches this before it hits storage.
How to Get Started
Zero to working memory in sixty seconds:
pip install mnemos-memory

import asyncio
from mnemos import MnemosEngine, MnemosConfig, Interaction

async def main():
    engine = MnemosEngine(config=MnemosConfig())

    # Store something surprising
    await engine.process(Interaction(role="user", content="We're migrating to Rust"))

    # Retrieve with associative context
    results = await engine.retrieve("programming language")

    # Consolidate episodic memories into semantic knowledge
    await engine.consolidate()

asyncio.run(main())

That's it. MockLLM + SimpleEmbedding + InMemoryStore. No API keys. No external services. No Docker containers.
For Claude Code / MCP users:
pip install mnemos-memory[mcp]

Add to your MCP config:
{
  "mcpServers": {
    "mnemos": {
      "command": "mnemos-mcp",
      "env": {
        "MNEMOS_STORE_TYPE": "sqlite",
        "MNEMOS_LLM_PROVIDER": "ollama",
        "MNEMOS_LLM_MODEL": "llama3"
      }
    }
  }
}

For production:

pip install mnemos-memory[all]
mnemos-cli profile local-performance --format dotenv --write .mnemos.profile.env

Three profiles: starter (SQLite, plugin-first), local-performance (embedded Qdrant), scale (Qdrant over network). Pick the one that matches your deployment.
The Biomimetic Bet
I want to address the question the skeptics will ask: is the neuroscience actually load-bearing, or is this just cosine-distance filtering with fancy names?
Fair question. Here’s my honest answer.
The neuroscience provides design principles, not implementation novelty. The algorithms underneath --- cosine distance, BFS graph traversal, LLM-based classification --- are standard ML and computer science. What the neuroscience gives you is the composition and the design decisions. It tells you what to build and why.
Predictive coding says: don’t store everything. Only store what violates expectations. That’s a design decision --- filter at ingestion, not at retrieval --- and it produces a fundamentally different system than the append-only alternative.
Reconsolidation says: memories should update on recall. That’s a design decision --- mutable chunks with version tracking and cooldown --- that solves the stale data problem that every append-only system creates.
State-dependent memory says: emotional context affects retrieval. That’s a design decision --- blend affective state into scoring --- that produces contextually appropriate results that pure semantic similarity misses.
Sleep consolidation says: compress episodic detail into semantic abstractions. That’s a design decision --- run a background daemon that extracts durable knowledge and prunes raw logs --- that makes the store shrink instead of grow.
Spreading activation says: retrieval should be associative, not point-based. That’s a design decision --- build a graph, propagate energy --- that returns contextual neighborhoods instead of isolated matches.
Each decision is defensible on engineering grounds alone. The neuroscience is what told me which decisions to make. The combination of all five is what makes mnemos architecturally distinct. And I’m not alone in thinking this direction is right --- a 2025 paper in Engineering proposed the M2I (Machine Memory Intelligence) framework, calling for “multilayered, distributed network storage” with “dynamic updates, spatiotemporal associations, and fuzzy hash access” inspired by biological memory. The field is converging on the conclusion that brains got memory right and databases got it wrong.
And yes --- the naming is memorable. Surprisal gates. Sleep daemons. Affective routers. Spreading activation. People remember these concepts because they map to intuitions everyone has about how their own memory works. That’s not an accident. Good architecture deserves good names.
Where This Goes
mnemos is v0.2.0. Here’s what’s coming:
Neo4j backend for SpreadingActivation at scale (the interface is already defined)
Head-to-head benchmarks against Mem0 and Zep on standardized retrieval datasets
Proceduralization improvements --- the ability for the SleepDaemon to extract repeated reasoning patterns and crystallize them into executable tools (currently disabled by default, and rightfully so --- generated code from an LLM needs more safety work before it auto-executes)
Multi-agent memory sharing --- scoped memory with project/workspace/global isolation is already implemented, but the collaboration patterns need more iteration
The project is MIT-licensed, open-source, and designed to be extended. If you’re building agentic tools and you’re tired of the append-only vector store that every other memory system gives you, I’d genuinely love your feedback.
GitHub: https://github.com/anthony-maio/mnemos
Website: https://mnemos.making-minds.ai
PyPI: pip install mnemos-memory
More of Me: https://making-minds.ai