Summary
Implement automated nightly indexing of agent memories to build a RAG (Retrieval-Augmented Generation) layer that enables efficient semantic search before LLM context injection, significantly reducing token consumption as agents accumulate user knowledge over time.
Problem to solve
As agents learn more about users through extended interactions, their memory context grows substantially, leading to:
- Token bloat: Full memory context must be loaded into every LLM call, so per-call cost rises with every stored memory
- Performance degradation: Large context windows slow response times
- Inefficient retrieval: Current linear memory loading lacks semantic ranking, pulling irrelevant memories into context
- Cost escalation: Token costs grow proportionally with memory size, making long-term agent usage increasingly expensive
For users with months of interaction history, a single query can consume thousands of tokens just loading memory context, even when only a small subset is relevant.
Proposed solution
Implement a scheduled memory indexing system that:
Off-Peak Indexing:
- Run memory embedding/indexing during configurable low-usage hours (default: 2-6 AM local time)
- Process only new/modified memory entries since last indexing cycle
- Generate vector embeddings for semantic search capabilities
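The "process only new/modified entries" step could be sketched roughly as below. This is an illustration under assumptions, not OpenClaw's implementation: the `MemoryEntry` shape, its `updatedAt` field, and `selectEntriesToIndex` are hypothetical names, and it assumes every memory carries a last-modified timestamp.

```typescript
// Hypothetical memory record; field names are assumptions for illustration.
interface MemoryEntry {
  id: string;
  text: string;
  updatedAt: number; // epoch milliseconds of the last modification
}

// Return only entries created or modified after the previous indexing run.
// Passing lastIndexedAt = 0 (no previous run) selects every entry.
function selectEntriesToIndex(
  entries: MemoryEntry[],
  lastIndexedAt: number
): MemoryEntry[] {
  return entries.filter((entry) => entry.updatedAt > lastIndexedAt);
}
```

A timestamp comparison does not notice deleted memories; a real indexer would also need to remove their vectors, for example by comparing stored IDs against the current memory set.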
RAG-First Retrieval Pipeline:
- Before LLM context injection, query the indexed memory with the user's current input
- Retrieve only the top-K most semantically relevant memory chunks
- Inject only filtered memories into LLM context, dramatically reducing token load
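The retrieval step can be illustrated with a brute-force top-K search by cosine similarity. This is a minimal sketch: `IndexedChunk`, `cosineSimilarity`, and `retrieveTopK` are hypothetical names, and it assumes the query has already been embedded by the same provider that produced the stored embeddings.

```typescript
// Hypothetical indexed chunk: memory text plus its precomputed embedding.
interface IndexedChunk {
  id: string;
  text: string;
  embedding: number[];
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid division by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the query embedding and keep the best topK.
function retrieveTopK(
  queryEmbedding: number[],
  chunks: IndexedChunk[],
  topK: number
): IndexedChunk[] {
  return chunks
    .map((chunk) => ({
      chunk,
      score: cosineSimilarity(queryEmbedding, chunk.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((scored) => scored.chunk);
}
```

A linear scan like this is adequate for a few thousand memories; a dedicated vector store would typically replace it with an approximate nearest-neighbor index.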
Configuration Options:
{
  agents: {
    defaults: {
      memory: {
        indexing: {
          enabled: true,
          schedule: "0 2 * * *", // cron format: 2 AM daily
          provider: "openai", // or local embeddings
          chunkSize: 512,
          retrievalTopK: 10 // max memories per query
        }
      }
    }
  }
}
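One way the options above might be typed and validated is sketched here. The `IndexingConfig` interface and `resolveIndexingConfig` helper are hypothetical and only mirror the field names from the proposed config; they are not part of any existing OpenClaw API.

```typescript
// Shape of the proposed indexing options (mirrors the example config).
interface IndexingConfig {
  enabled: boolean;
  schedule: string; // cron expression
  provider: string; // e.g. "openai" or a local embedding provider
  chunkSize: number;
  retrievalTopK: number;
}

// Values copied from the proposal's example config.
const PROPOSED_INDEXING: IndexingConfig = {
  enabled: true,
  schedule: "0 2 * * *",
  provider: "openai",
  chunkSize: 512,
  retrievalTopK: 10,
};

// Merge user overrides over the proposed values and reject invalid numbers.
function resolveIndexingConfig(
  overrides: Partial<IndexingConfig>
): IndexingConfig {
  const merged: IndexingConfig = { ...PROPOSED_INDEXING, ...overrides };
  if (!(merged.chunkSize > 0) || !(merged.retrievalTopK > 0)) {
    throw new Error("chunkSize and retrievalTopK must be positive");
  }
  return merged;
}
```

Validating at load time keeps a bad value, such as `retrievalTopK: 0`, from silently producing an empty memory context at query time.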
Backward Compatibility:
- Graceful fallback to current full-memory loading if indexing fails
- Optional feature that doesn't break existing workflows
Benefits
- Massive token savings: Reduce memory-related token consumption by 70-90% for users with extensive histories
- Improved relevance: Semantic search surfaces only contextually relevant memories
- Faster responses: Smaller context = faster LLM processing
- Scalability: Enables indefinite memory growth without linear cost increase
- Cost predictability: Token costs stabilize rather than scaling with agent tenure
- Resource efficiency: Off-peak processing minimizes impact on active usage hours
Technical Considerations
- Leverage existing memorySearch configuration infrastructure (already supports embeddings)
- Index storage in ~/.openclaw/agents//memory-index/
- Incremental updates to avoid re-indexing entire memory history nightly
- Monitoring via openclaw doctor to verify index health/freshness
Alternatives considered
Real-time embedding on every query
❌ Adds latency to every user interaction
❌ Multiplies embedding API costs (embedding on every query vs. once nightly)
❌ Doesn't solve the core problem of growing token consumption
Manual memory pruning/archiving
❌ Requires constant user intervention
❌ Risk of losing valuable context permanently
❌ Doesn't scale for non-technical users
Sliding window approach (keep only recent N memories)
❌ Loses valuable long-term context about user preferences
❌ Arbitrary cutoff ignores semantic relevance
❌ "Old but relevant" memories get dropped incorrectly
Compression-based approaches
❌ Lossy compression degrades memory quality
❌ Still requires loading full compressed context into LLM
❌ Adds computational overhead without semantic filtering
Why RAG indexing is superior: Preserves all memories permanently while loading only relevant subsets, combining the benefits of comprehensive memory with efficient retrieval.
Impact
For Users:
- Lower costs: 70-90% reduction in memory-related token consumption for long-term users
- Faster responses: Smaller context windows = quicker LLM processing
- Better quality: More relevant memories surfaced instead of information overload
- Future-proof: Agents can accumulate unlimited memories without degrading performance
For OpenClaw:
- Competitive advantage: Enables sustainable long-term agent relationships that competitors can't match economically
- Resource efficiency: Reduced token usage = lower infrastructure costs at scale
- User retention: Users won't abandon agents due to escalating costs
- Differentiator: "Memory that scales" becomes a key product feature
For Developers:
- Reuses existing infrastructure: Leverages the memorySearch provider system already in place
- Modular implementation: Can be developed incrementally without breaking changes
- Clear metrics: Easy to measure token savings and retrieval relevance
Evidence/examples
Real-world scenario:
User with 6 months of daily OpenClaw usage:
- Total memories accumulated: ~1,200 entries
- Average memory size: 150 tokens
- Full context load: 180,000 tokens per query
- Anthropic API cost: ~$0.54 per query (Claude Sonnet input tokens at $3 per million)
With RAG indexing (top-10 retrieval):
- Indexed memories: 1,200 (one-time embedding cost)
- Context load per query: ~1,500 tokens (10 memories)
- API cost per query: ~$0.0045
- Savings: ~99.2% reduction in memory-related token costs
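The arithmetic behind this scenario can be checked with a few lines. The price per million input tokens is a parameter here because provider pricing changes; the savings ratio depends only on how many memories are loaded, which gives 1 - 1,500/180,000, or about 99.2%. Function names are illustrative.

```typescript
// Input-token cost in dollars for one query at a given price per million tokens.
function queryCost(tokens: number, pricePerMillionTokens: number): number {
  return (tokens / 1e6) * pricePerMillionTokens;
}

// Fraction of memory tokens saved by loading only topK of totalMemories,
// assuming all memories are roughly the same size.
function tokenSavings(totalMemories: number, topK: number): number {
  const loaded = Math.min(topK, totalMemories);
  return 1 - loaded / totalMemories;
}

const avgTokensPerMemory = 150;
const fullLoadTokens = 1200 * avgTokensPerMemory; // 180,000 tokens
const ragLoadTokens = 10 * avgTokensPerMemory; // 1,500 tokens
const savingsRatio = tokenSavings(1200, 10); // about 0.9917
```

Because the ratio is independent of price, the relative saving holds for any model; only the absolute dollar amounts change.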
Existing patterns in the ecosystem:
- LangChain, LlamaIndex, and other frameworks use RAG for exactly this purpose
- Enterprise AI assistants (GitHub Copilot Workspace, Cursor) use vector stores for code context
- Anthropic's own Claude Projects feature likely uses similar retrieval mechanisms internally
Similar OpenClaw features that prove feasibility:
- Issue #4461 shows OpenClaw already has embedding provider infrastructure
- Existing memorySearch configuration supports custom embedding endpoints
- Cron system (#5452, Configuration docs) provides scheduling foundation
### Additional information
Implementation phases:
Phase 1 (MVP):
- Basic nightly indexing with OpenAI embeddings
- Simple top-K retrieval before context injection
- Opt-in configuration flag
Phase 2 (Enhancement):
- Local embedding models (no API dependency)
- Hybrid search (vector + keyword)
- Per-agent indexing schedules
Phase 3 (Advanced):
- Real-time incremental indexing for high-activity agents
- Memory clustering for topic-based retrieval
- Automatic relevance tuning based on user feedback
Storage estimates:
- 1,000 memories → ~6 MB of raw vectors (1536-dim float32 embeddings, 4 bytes per value)
- Negligible disk space impact compared to session logs
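The raw vector size follows directly from memories × dimensions × bytes per value. The sketch below assumes one embedding per memory and ignores vector-store overhead, so actual on-disk size will differ by backend; `rawIndexBytes` is an illustrative helper.

```typescript
// Raw vector storage in bytes, assuming one embedding per memory
// and no index overhead or compression.
function rawIndexBytes(
  memories: number,
  dimensions: number,
  bytesPerValue: number
): number {
  return memories * dimensions * bytesPerValue;
}

const float32Bytes = rawIndexBytes(1000, 1536, 4); // 6,144,000 bytes, about 6.1 MB
const float16Bytes = rawIndexBytes(1000, 1536, 2); // 3,072,000 bytes, about 3.1 MB
```

Memories longer than `chunkSize` would be split into several chunks, each with its own embedding, so the multiplier is really the chunk count rather than the memory count.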
Fallback behavior:
- If indexing service is unavailable → fall back to current full-memory loading
- If index is stale (>7 days) → warn user via openclaw doctor
- If retrieval fails → graceful degradation to recent memories only
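The three fallback rules could be expressed as a small decision function. This is a sketch under assumptions: `IndexStatus`, `MemoryLoadMode`, `isStale`, and `chooseLoadMode` are hypothetical names, and staleness is treated as a warning only, as the list above describes.

```typescript
// How memory gets loaded for a single query.
type MemoryLoadMode = "rag" | "full" | "recent";

// Hypothetical snapshot of index health.
interface IndexStatus {
  available: boolean; // is the index / indexing service reachable?
  lastIndexedAt: number; // epoch ms of the last successful indexing run
}

const STALE_AFTER_MS = 7 * 24 * 60 * 60 * 1000; // 7 days

// True when the index is older than the 7-day threshold (warn, do not block).
function isStale(status: IndexStatus, now: number): boolean {
  return now - status.lastIndexedAt > STALE_AFTER_MS;
}

// Pick the load mode following the fallback rules.
function chooseLoadMode(
  status: IndexStatus,
  retrievalFailed: boolean
): MemoryLoadMode {
  if (!status.available) return "full"; // indexing unavailable: load everything
  if (retrievalFailed) return "recent"; // retrieval error: recent memories only
  return "rag"; // normal path: top-K retrieval
}
```

Keeping the staleness check separate from the mode decision means a stale index still serves queries while `openclaw doctor` reports the problem.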
Related issues/features:
- Complements #9264 (Cross-Channel Context Sharing): indexed memories could be shared across channels
- Builds on existing memorySearch configuration framework
- Aligns with OpenClaw's philosophy of "personal AI that learns and scales"
Open questions for maintainers:
- Should local embedding models be bundled by default, or require separate installation?
- Preferred vector store backend (FAISS, Chroma, custom SQLite-based)?
- Should users be able to manually trigger re-indexing via openclaw memory reindex?