Feature Request: Smart Context Assembly — On-Demand RAG Filtering for Bootstrap Content
Problem Description
Currently, every time OpenClaw runs an agent turn, all bootstrap files are injected into the context window regardless of whether the current query is relevant:
MEMORY.md (~20 KB) — full content every turn
TOOLS.md (~20 KB) — full content every turn
AGENTS.md, SOUL.md, IDENTITY.md, USER.md (~5 KB combined) — all every turn
Even though the model can technically use memory_search / memory_get on demand, MEMORY.md is still included in the Project Context bootstrap by default. This means ~40–50 KB of fixed token overhead per turn, of which 70–80% is typically irrelevant to the current query.
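To put the overhead in rough numbers, the sketch below converts the file sizes listed above into an approximate token count. The 4-bytes-per-token ratio is a common rule of thumb for English text, not an OpenClaw measurement, and `estimateOverhead` is a hypothetical helper.

```javascript
// Rough per-turn overhead estimate from the file sizes quoted above.
// ASSUMPTION: ~4 bytes per token (rule of thumb; real tokenizers vary by language).
const BYTES_PER_TOKEN = 4;

const bootstrapFilesKB = {
  "MEMORY.md": 20,
  "TOOLS.md": 20,
  "AGENTS.md + SOUL.md + IDENTITY.md + USER.md": 5,
};

function estimateOverhead(filesKB, irrelevantFraction) {
  const totalKB = Object.values(filesKB).reduce((sum, kb) => sum + kb, 0);
  const totalTokens = Math.round((totalKB * 1024) / BYTES_PER_TOKEN);
  return {
    totalKB,
    totalTokens,
    // Tokens spent on content that does not help the current query.
    wastedTokens: Math.round(totalTokens * irrelevantFraction),
  };
}

console.log(estimateOverhead(bootstrapFilesKB, 0.7));
console.log(estimateOverhead(bootstrapFilesKB, 0.8));
```

Under these assumptions the fixed overhead is roughly 11,500 tokens per turn, most of which would not be relevant to a typical query.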
Desired Behavior
At the context assembly stage (assemble()), add a relevance filtering step that:
- RAG-filter MEMORY.md: Use the current query to retrieve only the top-K relevant passages from long-term memory, rather than injecting the entire file
- Optionally segment and filter TOOLS.md: Load only tool descriptions relevant to the current task domain
- Preserve static sections: Keep SOUL.md, AGENTS.md, USER.md fully loaded (they are small and cacheable via KV prefix reuse)
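One way to express the behavior described above is a per-file policy that the assembly step consults. This is only a sketch: `BOOTSTRAP_POLICY`, the mode names, and `planBootstrap` are hypothetical and do not correspond to existing OpenClaw settings.

```javascript
// Hypothetical per-file policy derived from the bullets above.
// The mode names and the topK value are illustrative, not OpenClaw settings.
const BOOTSTRAP_POLICY = {
  "MEMORY.md": { mode: "rag", topK: 3 }, // retrieve top-K passages per query
  "TOOLS.md": { mode: "segment" }, // load only relevant tool groups
  "SOUL.md": { mode: "static" }, // small and cache-friendly: load in full
  "AGENTS.md": { mode: "static" },
  "USER.md": { mode: "static" },
};

function planBootstrap(fileNames, policy = BOOTSTRAP_POLICY) {
  const plan = { static: [], rag: [], segment: [] };
  for (const name of fileNames) {
    // Files without an explicit policy default to being loaded in full.
    const mode = policy[name]?.mode ?? "static";
    plan[mode].push(name);
  }
  return plan;
}

console.log(planBootstrap(["SOUL.md", "MEMORY.md", "TOOLS.md", "IDENTITY.md"]));
```

Defaulting unknown files to `static` keeps the change conservative: a file is only filtered when someone has opted it in.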
Existing Infrastructure (Reusable)
1. Built-in Memory Search Engine
OpenClaw ships with a SQLite-based memory engine (vector + BM25 hybrid search) that already supports:
memory_search(query, corpus, topK) — exact same interface needed for bootstrap RAG
memory_get(corpus, path, from, lines) — for fetching specific segments
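The issue does not say how the engine combines its vector and BM25 results. For readers unfamiliar with hybrid search, the sketch below shows reciprocal rank fusion (RRF), one common way to merge two ranked lists. It is a generic illustration, not OpenClaw's implementation, and the document IDs are made up.

```javascript
// Reciprocal rank fusion: each ranked list contributes 1 / (k + rank) per document.
// ASSUMPTION: k = 60, the constant commonly used with RRF.
function reciprocalRankFusion(rankedLists, topK, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .slice(0, topK)
    .map(([id]) => id);
}

// Hypothetical rankings from the two retrieval paths.
const vectorRanking = ["deploy-notes", "api-keys", "team-prefs"];
const bm25Ranking = ["api-keys", "old-todo", "deploy-notes"];

console.log(reciprocalRankFusion([vectorRanking, bm25Ranking], 2));
```

Passages that rank well in both lists rise to the top, which is the property a bootstrap RAG step would rely on when asking for only the top-K results.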
2. Context Engine Plugin API
The pluggable context engine API (plugins.slots.contextEngine) provides clean hooks at exactly the right lifecycle point:
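async assemble({ sessionId, messages, tokenBudget }) {
  // Derive the retrieval query from the latest user message.
  const query = extractCurrentQuery(messages);
  // Retrieve only the top-K relevant MEMORY.md passages.
  const topK = 3;
  const memoryHits = await memorySearch(query, topK);
  // staticSections, bootstrapHits, and recentMessages are assumed to be prepared elsewhere in the engine.
  const assembled = [...staticSections, ...bootstrapHits, ...memoryHits, ...recentMessages];
  return {
    messages: assembled,
    estimatedTokens: countTokens(assembled),
  };
}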
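The snippet above leaves several helpers undefined and does not use `tokenBudget`. A self-contained variant might look like the following sketch. Only the `assemble({ sessionId, messages, tokenBudget })` shape comes from the issue; `staticSections`, `countTokens`, the stubbed `memorySearch`, and `extractCurrentQuery` are hypothetical stand-ins.

```javascript
// Self-contained sketch; every helper below is a hypothetical stand-in.
const staticSections = [{ role: "system", content: "You are a helpful agent." }];

// ASSUMPTION: ~4 characters per token; a real engine would use its tokenizer.
const countTokens = (msgs) =>
  msgs.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);

// Stub standing in for the built-in memory search; returns canned passages, best first.
async function memorySearch(query, topK) {
  const passages = [
    "User deploys with Docker Compose on a home server.",
    "User prefers TypeScript over JavaScript.",
    "User's cat is named Miso.",
  ];
  return passages.slice(0, topK).map((content) => ({ role: "system", content }));
}

// Use the most recent user message as the retrieval query.
const extractCurrentQuery = (messages) =>
  [...messages].reverse().find((m) => m.role === "user")?.content ?? "";

async function assemble({ sessionId, messages, tokenBudget }) {
  const query = extractCurrentQuery(messages);
  const hits = await memorySearch(query, 3);
  const selected = [];
  // Static sections and the conversation are always kept; hits fill what is left.
  let used = countTokens(staticSections) + countTokens(messages);
  for (const hit of hits) {
    const cost = countTokens([hit]);
    if (used + cost > tokenBudget) break; // stop once the budget is exhausted
    selected.push(hit);
    used += cost;
  }
  return {
    messages: [...staticSections, ...selected, ...messages],
    estimatedTokens: used,
  };
}

assemble({
  sessionId: "demo",
  messages: [{ role: "user", content: "How do I deploy?" }],
  tokenBudget: 25,
}).then((result) => console.log(result.messages.length, result.estimatedTokens));
```

Adding hits in rank order until the budget runs out means the least relevant passages are the first to be dropped.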
3. Bootstrap Truncation Controls
Existing mechanisms (bootstrapMaxChars 12K, bootstrapTotalMaxChars 60K) provide hard size caps but not semantic filtering.
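To illustrate why size caps are not a substitute for semantic filtering, here is a minimal sketch of cap-based truncation. It assumes `bootstrapMaxChars` limits each file and `bootstrapTotalMaxChars` limits the combined size; the actual truncation strategy in OpenClaw may differ, and `applyCharCaps` is a hypothetical helper.

```javascript
// ASSUMPTION: bootstrapMaxChars caps each file and bootstrapTotalMaxChars caps the
// combined size. The real truncation strategy in OpenClaw may differ.
function applyCharCaps(files, { bootstrapMaxChars, bootstrapTotalMaxChars }) {
  let remaining = bootstrapTotalMaxChars;
  return files.map(({ name, content }) => {
    const allowed = Math.max(0, Math.min(bootstrapMaxChars, remaining));
    // Keeps the head of the file regardless of what the current query is about.
    const kept = content.slice(0, allowed);
    remaining -= kept.length;
    return { name, content: kept, truncated: kept.length < content.length };
  });
}

const capped = applyCharCaps(
  [
    { name: "MEMORY.md", content: "m".repeat(20000) },
    { name: "TOOLS.md", content: "t".repeat(20000) },
  ],
  { bootstrapMaxChars: 12000, bootstrapTotalMaxChars: 60000 }
);

console.log(capped.map((f) => [f.name, f.content.length, f.truncated]));
```

The cap decides what to keep by position in the file, so a relevant passage near the end of MEMORY.md can be cut while irrelevant text at the top survives.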
4. KV Cache / Prompt Prefix Reuse
Static sections already sit above the prompt cache boundary, so they are reused across turns at little marginal cost.
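The ordering matters for this to keep working once retrieved passages vary per turn. The sketch below compares the shared prefix of two consecutive prompts under both orderings. It measures characters for simplicity, whereas real prompt caches match on tokens, and the section contents are placeholders.

```javascript
// Length of the shared leading substring of two prompts.
// NOTE: real prompt caches match on tokens, not characters; this is a simplification.
function commonPrefixLength(a, b) {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) i++;
  return i;
}

const staticPrefix = "[SOUL.md]...\n[AGENTS.md]...\n[USER.md]...\n"; // placeholder content
const hitsTurn1 = "[memory] deployment notes\n";
const hitsTurn2 = "[memory] pet names\n";

// Static first: the whole static block is shared between turns.
const staticFirst = commonPrefixLength(staticPrefix + hitsTurn1, staticPrefix + hitsTurn2);
// Dynamic first: the prompts diverge as soon as the retrieved passages differ.
const dynamicFirst = commonPrefixLength(hitsTurn1 + staticPrefix, hitsTurn2 + staticPrefix);

console.log({ staticFirst, dynamicFirst, staticLength: staticPrefix.length });
```

Placing query-dependent passages after the static sections keeps the reusable prefix intact from turn to turn.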
Proposed Implementation Path
Phase 1: Reference Smart Context Engine Plugin
Build @openclaw/smart-context-engine that wraps the legacy engine and adds MEMORY.md RAG filtering in assemble(), using the existing built-in memory search engine. No core changes required.
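A rough sketch of the Phase 1 wrapper idea follows. The engine interface, the `source` field on messages, the `"memory"` corpus name, and `createSmartContextEngine` are all assumptions made for illustration. Only the `assemble()` hook and the `memory_search(query, corpus, topK)` argument order come from this issue.

```javascript
// Hypothetical wrapper: delegates to a legacy engine, then swaps the full
// MEMORY.md message for retrieved passages. Message shapes are assumed.
function createSmartContextEngine(legacyEngine, memorySearch, topK = 3) {
  return {
    async assemble(params) {
      const base = await legacyEngine.assemble(params);
      const lastUser = [...params.messages].reverse().find((m) => m.role === "user");
      // "memory" is a placeholder corpus name.
      const hits = await memorySearch(lastUser?.content ?? "", "memory", topK);
      const messages = base.messages.flatMap((m) =>
        m.source === "MEMORY.md"
          ? hits.map((text) => ({ role: "system", source: "MEMORY.md", content: text }))
          : [m]
      );
      // ASSUMPTION: ~4 characters per token for the refreshed estimate.
      const chars = messages.reduce((n, m) => n + m.content.length, 0);
      return { ...base, messages, estimatedTokens: Math.ceil(chars / 4) };
    },
  };
}
```

Because the wrapper only post-processes the legacy result, it could ship as a plugin and fall back to the legacy output unchanged if no MEMORY.md message is present.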
Phase 2: Core RAG Bootstrap Filtering
Segment MEMORY.md and TOOLS.md at meaningful boundaries (heading/paragraph for MEMORY, tool category for TOOLS). Retrieve top-K relevant segments per query at assemble() time. Can be implemented as a new built-in smart mode alongside legacy.
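Heading-based segmentation could be sketched as below. The keyword-overlap scorer is a toy stand-in for the built-in memory search, and `segmentByHeading` and `topKSegments` are hypothetical names. Headings inside fenced code blocks are not handled.

```javascript
// Split markdown into segments, one per heading (text before the first heading
// becomes a segment with an empty heading).
function segmentByHeading(markdown) {
  const segments = [];
  let current = null;
  for (const line of markdown.split("\n")) {
    const match = /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      if (current) segments.push(current);
      current = { heading: match[2].trim(), body: [] };
    } else {
      if (!current) current = { heading: "", body: [] };
      current.body.push(line);
    }
  }
  if (current) segments.push(current);
  return segments.map((s) => ({ heading: s.heading, text: s.body.join("\n").trim() }));
}

// Toy relevance scorer: counts query words that appear in the segment.
function topKSegments(segments, query, k) {
  const terms = query.toLowerCase().split(/\W+/).filter(Boolean);
  return segments
    .map((seg) => {
      const words = new Set(`${seg.heading} ${seg.text}`.toLowerCase().split(/\W+/));
      return { seg, score: terms.filter((t) => words.has(t)).length };
    })
    .filter((x) => x.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((x) => x.seg);
}

const memory = [
  "# Deployment",
  "Uses Docker Compose on a home server.",
  "# Pets",
  "The cat is named Miso.",
].join("\n");

console.log(topKSegments(segmentByHeading(memory), "docker deployment steps", 1));
```

Segmenting once when the file changes and retrieving per query would keep the per-turn cost limited to the search itself.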
Phase 3: Local Small Model Intent Pre-filtering ⭐ User-Validated
Key insight: Leverage locally available compute (Apple Silicon M-series with 64–128 GB RAM) to run a lightweight local LLM (1–3B parameters, Q4 quantized) for intent classification — at zero API cost, with minimal latency, and no data leaving the machine.
| Option | Cost | Latency | Chinese Support | Privacy |
| --- | --- | --- | --- | --- |
| Local small model (Q4, Ollama) | Free (local RAM/GPU) | ~100–300 ms on M5 Max | ✅ Best (Qwen2.5-1.5B) | ✅ Full |
| API small model (GPT-4o-mini) | ~$0.001/turn | ~200–500 ms | ✅ Good | ❌ External API |
| No filtering (current) | Token cost | 0 ms | ✅ | ⚠️ Full context uploaded |
Recommended local models:
| Model | Size | Memory | Best For |
| --- | --- | --- | --- |
| qwen2.5:1.5b (Q4) | ~1 GB | 1 GB | Chinese + English bilingual |
| llama3.2:1b (Q4) | ~700 MB | 700 MB | English-only, fastest |
| phi3:mini (Q4) | ~2 GB | 2 GB | English reasoning |
On Apple M5 Max with 128 GB RAM, running qwen2.5:1.5b Q4 via Ollama with MPS (Metal GPU) backend achieves ~50–100 tokens/second — validated by the user in production.
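The throughput figure also bounds how much the classifier can say within a latency budget. The back-of-the-envelope sketch below ignores prompt processing time, which would reduce these numbers further; `maxOutputTokens` is a hypothetical helper.

```javascript
// Upper bound on generated tokens within a latency budget.
// ASSUMPTION: decode speed is constant and prompt processing time is ignored.
function maxOutputTokens(latencyBudgetMs, tokensPerSecond) {
  return Math.floor((latencyBudgetMs * tokensPerSecond) / 1000);
}

console.log(maxOutputTokens(300, 50)); // low end of the quoted throughput
console.log(maxOutputTokens(300, 100)); // high end of the quoted throughput
```

At 50–100 tokens/second, a 300 ms budget leaves room for roughly 15 to 30 output tokens, so the intent classifier should emit a short, structured answer rather than free-form text.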
Phase 3 integration architecture:
User Query
↓
[Local LLM: qwen2.5:1.5b Q4 @ Ollama]
Intent classification → needed_skills, needed_memory_domains
↓
[Context Engine assemble()]
→ memory_search(domain=X, topK=3)
→ bootstrap_segment_load(skills=Y)
↓
Filtered context → Main LLM
Zero API cost · No data leaves machine · Sub-300ms latency
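The classification step in the diagram could be sketched against Ollama's local `/api/generate` endpoint as follows. The prompt wording, the fallback behavior, and the function names are assumptions. The `needed_skills` and `needed_memory_domains` fields and the model tag are taken from this issue.

```javascript
const OLLAMA_URL = "http://localhost:11434/api/generate"; // Ollama's default local endpoint

// Build a non-streaming request that asks the model for JSON output.
function buildIntentRequest(query) {
  return {
    model: "qwen2.5:1.5b",
    stream: false,
    format: "json",
    // ASSUMPTION: illustrative prompt wording; a real prompt would list valid skills and domains.
    prompt:
      'Classify the user query. Reply with JSON of the form ' +
      '{"needed_skills": [], "needed_memory_domains": []}.\nQuery: ' + query,
  };
}

// Parse the model's reply defensively; small models do not always return valid JSON.
function parseIntent(responseText) {
  try {
    const parsed = JSON.parse(responseText);
    return {
      needed_skills: Array.isArray(parsed.needed_skills) ? parsed.needed_skills : [],
      needed_memory_domains: Array.isArray(parsed.needed_memory_domains)
        ? parsed.needed_memory_domains
        : [],
    };
  } catch {
    return { needed_skills: [], needed_memory_domains: [] };
  }
}

// fetchImpl is injectable so the function can be exercised without a running Ollama.
async function classifyIntent(query, fetchImpl = fetch) {
  const res = await fetchImpl(OLLAMA_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildIntentRequest(query)),
  });
  if (!res.ok) return parseIntent("");
  const data = await res.json();
  return parseIntent(data.response);
}
```

When classification fails or returns empty lists, a cautious engine would probably fall back to the unfiltered legacy context rather than dropping memory entirely.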
Reference Research
| Work | Approach | Key Finding |
| --- | --- | --- |
| Anthropic Contextual Retrieval (2024) | BM25 + embedding dual-path retrieval | Compresses context to 4% while retaining 85% accuracy |
| LLM Janitor (2024) | Small model pre-filters context | Reduces context by ~60% with minimal degradation |
| Gorilla (Berkeley, 2024) | Dynamic API/tool routing via retriever | Reduces tool-use failures |
| Ollama + Apple Silicon MPS | Local LLM inference on M-series GPU | 1B models run at 50–100 tok/s on M5 Max |
Related Documentation
/concepts/context-engine — Context Engine Plugin API
/concepts/system-prompt — Bootstrap injection mechanism
/concepts/memory-builtin — Built-in memory search engine
/concepts/context — Context assembly overview
Tags
enhancement context-engine memory token-optimization local-llm ollama
Submitted via OpenClaw agent on behalf of a production user. Phase 3 prototype validated on Apple M5 Max (128 GB RAM). Willing to contribute a reference implementation.