Feature: Smart Context Assembly — On-Demand RAG Filtering for Bootstrap Content

# Feature Request: Smart Context Assembly — On-Demand RAG Filtering for Bootstrap Content

## Problem Description

Currently, every time OpenClaw runs an agent turn, **all bootstrap files are injected into the context window** regardless of whether the current query is relevant:

- `MEMORY.md` (~20 KB) — full content every turn
- `TOOLS.md` (~20 KB) — full content every turn
- `AGENTS.md`, `SOUL.md`, `IDENTITY.md`, `USER.md` (~5 KB combined) — all every turn

Even though the model can technically use `memory_search` / `memory_get` on demand, `MEMORY.md` is still included in the **Project Context bootstrap** by default. This means ~40–50 KB of fixed token overhead per turn, of which **70–80% is typically irrelevant** to the current query.

## Desired Behavior

At the **context assembly stage** (`assemble()`), add a relevance filtering step that:

1. **RAG-filter `MEMORY.md`**: Use the current query to retrieve only the top-K relevant passages from long-term memory, rather than injecting the entire file
2. **Optionally segment and filter `TOOLS.md`**: Load only tool descriptions relevant to the current task domain
3. **Preserve static sections**: Keep `SOUL.md`, `AGENTS.md`, `USER.md` fully loaded (they are small and cacheable via KV prefix reuse)
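The top-K selection in step 1 can be sketched as follows. This is a minimal illustration, not OpenClaw code: the `Passage` type and the function names are hypothetical, and plain term overlap stands in for the built-in hybrid search so that the example stays self-contained.

```typescript
// Hypothetical types and names; not part of any confirmed OpenClaw API.
interface Passage {
  id: string;
  text: string;
}

// Lowercase word tokens; a stand-in for a real tokenizer or embedding model.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Score = number of distinct query terms that also appear in the passage.
function overlapScore(query: string, passage: Passage): number {
  const passageTerms = new Set(tokenize(passage.text));
  const queryTerms = Array.from(new Set(tokenize(query)));
  return queryTerms.filter((term) => passageTerms.has(term)).length;
}

// Keep the K best-scoring passages and drop those with no overlap at all.
function selectTopK(query: string, passages: Passage[], k: number): Passage[] {
  return passages
    .map((passage) => ({ passage, score: overlapScore(query, passage) }))
    .filter((entry) => entry.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((entry) => entry.passage);
}
```

Passages that share no terms with the query are dropped rather than padded in, so an off-topic turn injects nothing from `MEMORY.md`.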

## Existing Infrastructure (Reusable)

### 1. Built-in Memory Search Engine
OpenClaw ships with a **SQLite-based memory engine** (vector + BM25 hybrid search) that already supports:
- `memory_search(query, corpus, topK)` — exact same interface needed for bootstrap RAG
- `memory_get(corpus, path, from, lines)` — for fetching specific segments
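This request does not describe how the engine combines its vector and BM25 rankings. One common way to merge two ranked lists is reciprocal rank fusion (RRF), sketched below as a generic technique. It is not a description of OpenClaw's internals, and the function name and the constant `k = 60` are assumptions.

```typescript
// Reciprocal rank fusion (RRF): merge ranked id lists from several retrievers.
// Generic technique shown for illustration; OpenClaw's own fusion may differ.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      // Ranks are 1-based, so the top hit contributes 1 / (k + 1).
      const contribution = 1 / (k + index + 1);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

An item that ranks well in both lists beats one that tops a single list, which is the behaviour wanted when exact keywords and semantic similarity disagree.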

### 2. Context Engine Plugin API
The **pluggable context engine API** (`plugins.slots.contextEngine`) provides clean hooks at exactly the right lifecycle point:

```ts
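// Sketch only: extractCurrentQuery, memorySearch, staticSections and
// countTokens are placeholders, not confirmed OpenClaw exports.
async assemble({ sessionId, messages, tokenBudget }) {
  const query = extractCurrentQuery(messages);
  // Top-K relevant passages instead of the full MEMORY.md (corpus name assumed)
  const memoryHits = await memorySearch(query, "memory", 3);
  // Top-K relevant TOOLS.md segments (corpus name assumed)
  const bootstrapHits = await memorySearch(query, "tools", 3);
  const assembled = [...staticSections, ...bootstrapHits, ...memoryHits, ...messages];
  return {
    messages: assembled,
    estimatedTokens: countTokens(assembled),
  };
}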
```

### 3. Bootstrap Truncation Controls
Existing mechanisms (`bootstrapMaxChars`, 12K; `bootstrapTotalMaxChars`, 60K) provide hard size caps but no semantic filtering: they limit how much is injected, not which parts.
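To show why size caps alone do not solve the problem, here is a rough sketch of cap-based truncation. The semantics are an assumption: each file is cut to a per-file limit, then files are added in order until the total limit is used up. The real behaviour may differ, and `applyCaps` is a hypothetical name.

```typescript
interface BootstrapFile {
  name: string;
  content: string;
}

// Assumed semantics: cut each file to maxChars, then add files in order
// until totalMaxChars is exhausted. Relevance to the query plays no role.
function applyCaps(files: BootstrapFile[], maxChars: number, totalMaxChars: number): BootstrapFile[] {
  const result: BootstrapFile[] = [];
  let remaining = totalMaxChars;
  for (const file of files) {
    if (remaining <= 0) break;
    const limit = Math.min(maxChars, remaining);
    const content = file.content.slice(0, limit);
    result.push({ name: file.name, content });
    remaining -= content.length;
  }
  return result;
}
```

Under these assumptions a 20 KB `MEMORY.md` loses everything past the cap, even when the passage the query needs sits in the dropped part.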

### 4. KV Cache / Prompt Prefix Reuse
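Static sections already sit above the prompt cache boundary, so they are reused across turns at little or no marginal cost (on hosted APIs, cached prefix tokens are typically discounted rather than free).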
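Prefix reuse depends on ordering: a prefix cache can only reuse the leading part of the prompt that is identical from one turn to the next. The sketch below uses hypothetical helper names to show why per-turn retrieved content should come after the static sections.

```typescript
// Length of the shared leading part of two prompts. A prefix cache can
// only reuse work for this portion.
function commonPrefixLength(a: string, b: string): number {
  const max = Math.min(a.length, b.length);
  let i = 0;
  while (i < max && a.charCodeAt(i) === b.charCodeAt(i)) i += 1;
  return i;
}

// Join sections in the given order: `first` sections, then `second` sections.
function buildPrompt(first: string[], second: string[]): string {
  return [...first, ...second].join("\n\n");
}
```

With the static sections first, two turns with different retrieved passages still share the whole static prefix. With the retrieved passages first, the shared prefix ends at the first differing character and the static sections are reprocessed on every turn.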

## Proposed Implementation Path

### Phase 1: Reference Smart Context Engine Plugin

Build `@openclaw/smart-context-engine` that wraps the legacy engine and adds MEMORY.md RAG filtering in `assemble()`, using the existing built-in memory search engine. No core changes required.
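A minimal sketch of what such a wrapper might look like. Everything here is hypothetical: the `ContextEngine` interface, the `source` field used to spot the `MEMORY.md` entry, the `"memory"` corpus name, and the estimate of 4 characters per token are assumptions made for illustration, not the actual plugin API.

```typescript
// Hypothetical shapes; the real plugin interface may differ.
interface Message { role: string; content: string; source?: string }
interface AssembleInput { sessionId: string; messages: Message[]; tokenBudget: number }
interface AssembleResult { messages: Message[]; estimatedTokens: number }
interface ContextEngine { assemble(input: AssembleInput): Promise<AssembleResult> }
type MemorySearch = (query: string, corpus: string, topK: number) => Promise<string[]>;

// Swap the full MEMORY.md entry for the retrieved passages, keeping its position.
function replaceMemoryWithHits(messages: Message[], hits: string[]): Message[] {
  const result: Message[] = [];
  for (const message of messages) {
    if (message.source === "MEMORY.md") {
      for (const text of hits) {
        result.push({ role: message.role, content: text, source: "memory-hit" });
      }
    } else {
      result.push(message);
    }
  }
  return result;
}

// Rough size estimate, assuming about 4 characters per token.
function estimateTokens(messages: Message[]): number {
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / 4);
}

function wrapWithMemoryFilter(legacy: ContextEngine, search: MemorySearch, topK = 3): ContextEngine {
  return {
    async assemble(input: AssembleInput): Promise<AssembleResult> {
      const base = await legacy.assemble(input);
      // Use the most recent user message as the retrieval query.
      const userMessages = input.messages.filter((m) => m.role === "user");
      const last = userMessages[userMessages.length - 1];
      if (!last) return base; // nothing to search with: keep legacy output
      const hits = await search(last.content, "memory", topK);
      const messages = replaceMemoryWithHits(base.messages, hits);
      return { messages, estimatedTokens: estimateTokens(messages) };
    },
  };
}
```

The wrapper delegates to the legacy engine and only rewrites its output, which is what keeps this phase free of core changes.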

### Phase 2: Core RAG Bootstrap Filtering

Segment `MEMORY.md` and `TOOLS.md` at meaningful boundaries (heading/paragraph for MEMORY, tool category for TOOLS). Retrieve top-K relevant segments per query at `assemble()` time. Can be implemented as a new built-in `smart` mode alongside `legacy`.
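Heading-level segmentation could look roughly like this. It is a simplified sketch with hypothetical names: it splits on ATX headings (`#` to `######`) only, and it does not handle headings that appear inside fenced code blocks.

```typescript
interface Segment {
  heading: string;
  body: string;
}

// Split markdown into segments, starting a new one at every heading line.
// Text before the first heading is kept under an empty heading.
function segmentByHeading(markdown: string): Segment[] {
  const segments: Segment[] = [];
  let heading = "";
  let bodyLines: string[] = [];
  const flush = (): void => {
    const body = bodyLines.join("\n").trim();
    if (heading !== "" || body !== "") segments.push({ heading, body });
  };
  for (const line of markdown.split(/\r?\n/)) {
    const match = /^#{1,6}\s+(.*)$/.exec(line);
    if (match) {
      flush(); // close the previous segment before starting a new one
      heading = (match[1] ?? "").trim();
      bodyLines = [];
    } else {
      bodyLines.push(line);
    }
  }
  flush();
  return segments;
}
```

Each segment can then be indexed and retrieved on its own, so a query about one topic pulls in that section instead of the whole file.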

### Phase 3: Local Small Model Intent Pre-filtering ⭐ **User-Validated**


**Key insight**: Leverage locally available compute (Apple Silicon M-series with 64–128 GB RAM) to run a lightweight local LLM (1–3B parameters, Q4 quantized) for intent classification — at zero API cost, with minimal latency, and **no data leaving the machine**.


| Option | Cost | Latency | Chinese Support | Privacy |
|--------|------|---------|----------------|---------|
| **Local small model (Q4, Ollama)** | Free (local RAM/GPU) | ~100–300ms on M5 Max | ✅ Best (Qwen2.5-1.5B) | ✅ Full |
| API small model (GPT-4o-mini) | ~$0.001/turn | ~200–500ms | ✅ Good | ❌ External API |
| No filtering (current) | Token cost | 0ms | ✅ | ⚠️ Full context uploaded |

**Recommended local models:**

| Model | Size | Memory | Best For |
|-------|------|--------|----------|
| **qwen2.5:1.5b** (Q4) | ~1 GB | 1 GB | **Chinese + English bilingual** |
| **llama3.2:1b** (Q4) | ~700 MB | 700 MB | English-only, fastest |
| **phi3:mini** (Q4) | ~2 GB | 2 GB | English reasoning |


On **Apple M5 Max with 128 GB RAM**, running qwen2.5:1.5b Q4 via Ollama with the Metal GPU backend achieves **~50–100 tokens/second**, as validated by the user in production.


**Phase 3 integration architecture:**

```
User Query
    ↓
[Local LLM: qwen2.5:1.5b Q4 @ Ollama]
Intent classification → needed_skills, needed_memory_domains
    ↓
[Context Engine assemble()]
→ memory_search(domain=X, topK=3)
→ bootstrap_segment_load(skills=Y)
    ↓
Filtered context → Main LLM
```

**Zero API cost · No data leaves machine · Sub-300ms latency**
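The classification step could be wired up along these lines. The prompt wording, the function names, and the rule of falling back to the unfiltered context are assumptions. The request body follows Ollama's `POST /api/generate` endpoint, which listens on `http://localhost:11434` by default and returns the model's text in the `response` field. The HTTP call itself is left out so the sketch stays self-contained.

```typescript
interface Intent {
  neededSkills: string[];
  neededMemoryDomains: string[];
}

// Prompt asking the small model for a JSON-only answer (wording is illustrative).
function buildIntentPrompt(query: string, skills: string[], domains: string[]): string {
  return [
    "Classify the user query. Reply with JSON only, in the form",
    '{"needed_skills": [...], "needed_memory_domains": [...]}.',
    `Allowed skills: ${skills.join(", ")}`,
    `Allowed memory domains: ${domains.join(", ")}`,
    `Query: ${query}`,
  ].join("\n");
}

// Body for POST http://localhost:11434/api/generate; stream=false asks for
// one JSON reply, format="json" asks the model to emit valid JSON.
function buildOllamaRequest(model: string, prompt: string) {
  return { model, prompt, stream: false, format: "json" };
}

function toStringArray(value: unknown): string[] {
  return Array.isArray(value)
    ? value.filter((v): v is string => typeof v === "string")
    : [];
}

// Returns null when the reply is unusable, so the caller can fall back
// to the unfiltered context instead of dropping content by mistake.
function parseIntent(reply: string): Intent | null {
  try {
    const data: unknown = JSON.parse(reply);
    if (typeof data !== "object" || data === null) return null;
    const record = data as Record<string, unknown>;
    return {
      neededSkills: toStringArray(record["needed_skills"]),
      neededMemoryDomains: toStringArray(record["needed_memory_domains"]),
    };
  } catch {
    return null;
  }
}
```

Small models sometimes produce malformed output, so treating an unparseable reply as "no filtering" keeps a classifier failure from silently removing context the main model needs.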


## Reference Research

| Work | Approach | Key Finding |
|------|----------|-------------|
| Anthropic Contextual Retrieval (2024) | BM25 + embedding dual-path retrieval | Reported 49% fewer failed retrievals (67% with reranking) |
| LLM Janitor (2024) | Small model pre-filters context | Reduces context by ~60% with minimal degradation |
| Gorilla (Berkeley, 2024) | Dynamic API/tool routing via retriever | Reduces tool-use failures |
| Ollama + Apple Silicon (Metal) | Local LLM inference on M-series GPU | 1B models run at 50–100 tok/s on M5 Max |

## Related Documentation

- `/concepts/context-engine` — Context Engine Plugin API
- `/concepts/system-prompt` — Bootstrap injection mechanism
- `/concepts/memory-builtin` — Built-in memory search engine
- `/concepts/context` — Context assembly overview


## Tags

`enhancement` `context-engine` `memory` `token-optimization` `local-llm` `ollama`

---

*Submitted via OpenClaw agent on behalf of a production user. Phase 3 prototype validated on Apple M5 Max (128 GB RAM). Willing to contribute a reference implementation.*

Feature: Smart Context Assembly — On-Demand RAG Filtering for Bootstrap Content #80218
