Feature: Smart Context Assembly — On-Demand RAG Filtering for Bootstrap Content #80218

@Neodradynamics

Feature Request: Smart Context Assembly — On-Demand RAG Filtering for Bootstrap Content

Problem Description

Currently, every time OpenClaw runs an agent turn, all bootstrap files are injected into the context window, regardless of whether they are relevant to the current query:

  • MEMORY.md (~20 KB) — full content every turn
  • TOOLS.md (~20 KB) — full content every turn
  • AGENTS.md, SOUL.md, IDENTITY.md, USER.md (~5 KB combined) — all every turn

Even though the model can technically use memory_search / memory_get on demand, MEMORY.md is still included in the Project Context bootstrap by default. This means ~40–50 KB of fixed token overhead per turn, of which 70–80% is typically irrelevant to the current query.
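
As a rough back-of-the-envelope for the overhead described above, the sketch below converts the listed file sizes into an approximate token count. It assumes 1 KB is 1024 characters and about 4 characters per token; both are generic heuristics, not OpenClaw figures, and the function and type names are illustrative only.

```typescript
// Rough estimate of per-turn bootstrap overhead from the sizes listed above.
// ASSUMPTION: 1 KB = 1024 characters and ~4 characters per token.
// Real tokenizers vary with language and content.
const CHARS_PER_TOKEN = 4;

interface BootstrapSize {
  files: string;
  kb: number;
}

const bootstrapSizes: BootstrapSize[] = [
  { files: "MEMORY.md", kb: 20 },
  { files: "TOOLS.md", kb: 20 },
  { files: "AGENTS.md, SOUL.md, IDENTITY.md, USER.md", kb: 5 },
];

function estimateOverhead(sizes: BootstrapSize[], irrelevantFraction: number) {
  const totalKB = sizes.reduce((sum, s) => sum + s.kb, 0);
  const totalTokens = Math.round((totalKB * 1024) / CHARS_PER_TOKEN);
  return {
    totalKB,
    totalTokens,
    // Tokens spent on content that does not help the current query.
    wastedTokens: Math.round(totalTokens * irrelevantFraction),
  };
}

// 0.75 is the midpoint of the 70-80% range quoted above.
const estimate = estimateOverhead(bootstrapSizes, 0.75);
```

Under these assumptions the estimate comes to 45 KB, or about 11,520 tokens per turn, of which about 8,640 would be irrelevant.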

Desired Behavior

At the context assembly stage (assemble()), add a relevance filtering step that:

  1. RAG-filters MEMORY.md: uses the current query to retrieve only the top-K relevant passages from long-term memory, rather than injecting the entire file
  2. Optionally segments and filters TOOLS.md: loads only the tool descriptions relevant to the current task domain
  3. Preserves static sections: keeps SOUL.md, AGENTS.md, USER.md fully loaded (they are small and cacheable via KV prefix reuse)
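
To make the top-K step concrete, here is a minimal sketch of passage filtering. It uses naive term-overlap scoring as a stand-in for the real hybrid (vector + BM25) search, and every name in it (Passage, tokenize, topKPassages) is hypothetical. The ASCII-only tokenizer would also need replacing for Chinese text.

```typescript
// Naive top-K passage filter.
// ASSUMPTION: term-overlap scoring stands in for the real hybrid search;
// all names here are hypothetical and not part of OpenClaw.
interface Passage {
  id: string;
  text: string;
}

function tokenize(text: string): string[] {
  // ASCII-only split; a real implementation needs a proper tokenizer.
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

function topKPassages(query: string, passages: Passage[], k: number): Passage[] {
  const queryTerms = new Set(tokenize(query));
  return passages
    .map((p) => ({
      passage: p,
      // Score = number of distinct query terms that appear in the passage.
      score: new Set(tokenize(p.text).filter((t) => queryTerms.has(t))).size,
    }))
    .filter((s) => s.score > 0) // drop passages with no overlap at all
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((s) => s.passage);
}
```

Only the passages that share terms with the query survive, so an unrelated memory entry never reaches the context window.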

Existing Infrastructure (Reusable)

1. Built-in Memory Search Engine

OpenClaw ships with a SQLite-based memory engine (vector + BM25 hybrid search) that already supports:

  • memory_search(query, corpus, topK) — exact same interface needed for bootstrap RAG
  • memory_get(corpus, path, from, lines) — for fetching specific segments
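
For illustration, the segment-fetch call could behave like the in-memory stand-in below. This is only a sketch: it assumes `from` is a 1-based line number and `lines` is a line count, neither of which is stated above, and the `store` parameter and the corpus name used in the example are hypothetical.

```typescript
// Hypothetical in-memory stand-in for memory_get(corpus, path, from, lines).
// ASSUMPTION: `from` is a 1-based line number and `lines` is a line count.
type CorpusStore = Record<string, Record<string, string>>; // corpus -> path -> file text

function memoryGet(
  store: CorpusStore,
  corpus: string,
  path: string,
  from: number,
  lines: number,
): string {
  const file = store[corpus]?.[path];
  if (file === undefined) return ""; // unknown corpus or path
  return file
    .split("\n")
    .slice(from - 1, from - 1 + lines)
    .join("\n");
}

// Example: fetch lines 2-3 of a small file.
const demoStore: CorpusStore = { memory: { "MEMORY.md": "a\nb\nc\nd" } };
const excerpt = memoryGet(demoStore, "memory", "MEMORY.md", 2, 2);
```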

2. Context Engine Plugin API

The pluggable context engine API (plugins.slots.contextEngine) provides clean hooks at exactly the right lifecycle point:

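async assemble({ sessionId, messages, tokenBudget }) {
  // Illustrative pseudocode: extractCurrentQuery, memorySearch, bootstrapSearch,
  // staticSections, and countTokens are placeholders, not confirmed OpenClaw APIs.
  const topK = 3;
  const query = extractCurrentQuery(messages);
  const memoryHits = await memorySearch(query, topK);       // relevant MEMORY.md passages
  const bootstrapHits = await bootstrapSearch(query, topK); // relevant TOOLS.md segments
  const recentMessages = messages;                          // trimming to tokenBudget omitted
  const assembled = [...staticSections, ...bootstrapHits, ...memoryHits, ...recentMessages];
  return {
    messages: assembled,
    estimatedTokens: countTokens(assembled),
  };
}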

3. Bootstrap Truncation Controls

Existing mechanisms (bootstrapMaxChars at 12K, bootstrapTotalMaxChars at 60K) provide hard size caps but no semantic filtering.
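
The difference between a hard cap and semantic filtering can be seen in a small sketch of cap-based truncation. It assumes bootstrapMaxChars limits each file and bootstrapTotalMaxChars limits the sum, which is inferred from the names rather than confirmed. The BootstrapFile type and applyHardCaps function are hypothetical.

```typescript
// ASSUMPTION: bootstrapMaxChars caps each file and bootstrapTotalMaxChars caps
// the sum; the defaults mirror the 12K / 60K figures quoted above.
interface BootstrapFile {
  name: string;
  content: string;
}

function applyHardCaps(
  files: BootstrapFile[],
  maxChars = 12000,
  totalMaxChars = 60000,
): BootstrapFile[] {
  let remaining = totalMaxChars;
  const out: BootstrapFile[] = [];
  for (const file of files) {
    if (remaining <= 0) break; // total budget exhausted, later files are dropped
    // Positional cut: keeps the head of the file regardless of relevance.
    const content = file.content.slice(0, Math.min(maxChars, remaining));
    remaining -= content.length;
    out.push({ name: file.name, content });
  }
  return out;
}
```

A cut like this keeps whatever happens to come first in each file, which is why it cannot substitute for query-aware retrieval.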

4. KV Cache / Prompt Prefix Reuse

Static sections already sit above the prompt cache boundary, so they are reused across turns at no marginal token cost.
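
One consequence for the design is that filtered, per-query content should be placed after the static sections. The sketch below illustrates why, under the assumption that prompt caching only reuses an identical leading prefix. The section strings are placeholders, not real file contents.

```typescript
// Counts how many leading sections two assembled prompts share.
// ASSUMPTION: prompt caching reuses only an identical leading prefix.
function sharedPrefixLength(a: string[], b: string[]): number {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) i++;
  return i;
}

const staticSections = ["SOUL.md", "AGENTS.md", "USER.md"];

// Static sections first: the prefix is identical across turns.
const turn1 = [...staticSections, "memory: deploy notes", "user: how do I deploy?"];
const turn2 = [...staticSections, "memory: billing notes", "user: what is my plan?"];

// Dynamic hits first: the prefix changes every turn, so nothing is shared.
const dynamicFirst1 = ["memory: deploy notes", ...staticSections];
const dynamicFirst2 = ["memory: billing notes", ...staticSections];

const cacheableWhenStaticFirst = sharedPrefixLength(turn1, turn2);
const cacheableWhenDynamicFirst = sharedPrefixLength(dynamicFirst1, dynamicFirst2);
```

With static sections first, all three are shared between the two turns. With the retrieved memory first, none are.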

Proposed Implementation Path

Phase 1: Reference Smart Context Engine Plugin

Build @openclaw/smart-context-engine that wraps the legacy engine and adds MEMORY.md RAG filtering in assemble(), using the existing built-in memory search engine. No core changes required.
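
A possible shape for such a wrapper is sketched below. All interfaces are hypothetical stand-ins for the real plugin API, the "[MEMORY.md]" prefix used to recognize the full-file message is an invented convention, and the token estimate uses a generic 4-characters-per-token heuristic.

```typescript
// Hypothetical shapes: the real plugin API may differ from these interfaces.
interface Message { role: string; content: string; }
interface AssembleInput { sessionId: string; messages: Message[]; tokenBudget: number; }
interface AssembleResult { messages: Message[]; estimatedTokens: number; }
interface ContextEngine { assemble(input: AssembleInput): Promise<AssembleResult>; }
type MemorySearch = (query: string, topK: number) => Promise<string[]>;

// Wraps a legacy engine: drops the full MEMORY.md message (recognized here by
// a hypothetical "[MEMORY.md]" prefix) and injects only the top-K hits.
function createSmartEngine(
  legacy: ContextEngine,
  search: MemorySearch,
  topK = 3,
): ContextEngine {
  return {
    async assemble(input) {
      const base = await legacy.assemble(input);
      // Use the most recent user message as the retrieval query.
      const lastUser = [...input.messages].reverse().find((m) => m.role === "user");
      const hits = lastUser ? await search(lastUser.content, topK) : [];
      const bootstrap = base.messages.filter(
        (m) => m.role === "system" && !m.content.startsWith("[MEMORY.md]"),
      );
      const conversation = base.messages.filter((m) => m.role !== "system");
      const memory = hits.map((h) => ({ role: "system", content: h }));
      // Static bootstrap first, then retrieved memory, then the conversation.
      const messages = [...bootstrap, ...memory, ...conversation];
      const chars = messages.reduce((n, m) => n + m.content.length, 0);
      return { messages, estimatedTokens: Math.ceil(chars / 4) }; // ~4 chars/token
    },
  };
}
```

Because the wrapper only post-processes the legacy result, it needs no changes to core, in line with the Phase 1 goal.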

Phase 2: Core RAG Bootstrap Filtering

Segment MEMORY.md and TOOLS.md at meaningful boundaries (heading/paragraph for MEMORY, tool category for TOOLS). Retrieve top-K relevant segments per query at assemble() time. Can be implemented as a new built-in smart mode alongside legacy.
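
Heading-based segmentation could look roughly like the sketch below. It assumes markdown-style "#" headings are a reasonable proxy for meaningful boundaries, and the Segment type and segmentByHeading function are hypothetical. Paragraph-level or tool-category splitting would need additional rules.

```typescript
// Splits a markdown file into segments at heading lines ("#", "##", ...).
// ASSUMPTION: headings are a reasonable proxy for "meaningful boundaries".
// Caveat: "#" lines inside fenced code blocks would also be treated as headings.
interface Segment {
  heading: string; // empty string for text before the first heading
  body: string;
}

function segmentByHeading(markdown: string): Segment[] {
  const segments: Segment[] = [];
  let current: Segment = { heading: "", body: "" };
  for (const line of markdown.split("\n")) {
    if (/^#{1,6}\s/.test(line)) {
      // A new heading closes the current segment if it has any content.
      if (current.heading || current.body.trim()) segments.push(current);
      current = { heading: line.replace(/^#{1,6}\s+/, "").trim(), body: "" };
    } else {
      current.body += line + "\n";
    }
  }
  if (current.heading || current.body.trim()) segments.push(current);
  return segments.map((s) => ({ heading: s.heading, body: s.body.trim() }));
}
```

Each resulting segment can then be indexed and retrieved independently at assemble() time.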

Phase 3: Local Small Model Intent Pre-filtering ⭐ User-Validated

Key insight: Leverage locally available compute (Apple Silicon M-series with 64–128 GB RAM) to run a lightweight local LLM (1–3B parameters, Q4 quantized) for intent classification — at zero API cost, with minimal latency, and no data leaving the machine.

| Option | Cost | Latency | Chinese Support | Privacy |
|---|---|---|---|---|
| Local small model (Q4, Ollama) | Free (local RAM/GPU) | ~100–300 ms on M5 Max | ✅ Best (Qwen2.5-1.5B) | ✅ Full |
| API small model (GPT-4o-mini) | ~$0.001/turn | ~200–500 ms | ✅ Good | ❌ External API |
| No filtering (current) | Token cost | 0 ms | | ⚠️ Full context uploaded |

Recommended local models:

| Model | Size | Memory | Best For |
|---|---|---|---|
| qwen2.5:1.5b (Q4) | ~1 GB | 1 GB | Chinese + English bilingual |
| llama3.2:1b (Q4) | ~700 MB | 700 MB | English-only, fastest |
| phi3:mini (Q4) | ~2 GB | 2 GB | English reasoning |

On an Apple M5 Max with 128 GB RAM, running qwen2.5:1.5b (Q4) via Ollama with the Metal GPU backend achieves ~50–100 tokens/second, as validated by the user in production.

Phase 3 integration architecture:

User Query
    ↓
[Local LLM: qwen2.5:1.5b Q4 @ Ollama]
Intent classification → needed_skills, needed_memory_domains
    ↓
[Context Engine assemble()]
→ memory_search(domain=X, topK=3)
→ bootstrap_segment_load(skills=Y)
    ↓
Filtered context → Main LLM

Zero API cost · No data leaves machine · Sub-300ms latency
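
The classification step in the diagram above could be sketched as follows. The prompt wording, the Intent type, and the fallback behavior are all assumptions for illustration. The request shape follows Ollama's /api/generate endpoint on its default port, with non-streaming JSON output. The HTTP transport is injected so the sketch can run without a local server.

```typescript
// Transport is injected so the sketch runs without a network; in practice it
// could wrap fetch() against a local Ollama server.
type PostJson = (url: string, body: unknown) => Promise<unknown>;

interface Intent {
  needed_skills: string[];
  needed_memory_domains: string[];
}

const EMPTY_INTENT: Intent = { needed_skills: [], needed_memory_domains: [] };

function toStringArray(value: unknown): string[] {
  return Array.isArray(value)
    ? value.filter((v): v is string => typeof v === "string")
    : [];
}

async function classifyIntent(
  query: string,
  post: PostJson,
  model = "qwen2.5:1.5b",
): Promise<Intent> {
  // HYPOTHETICAL prompt; the label vocabulary would need to match real segments.
  const prompt =
    'Return JSON with keys "needed_skills" and "needed_memory_domains" ' +
    "(arrays of strings) for this user query:\n" + query;
  try {
    // Ollama's /api/generate returns the completion text in the "response" field.
    const reply = (await post("http://localhost:11434/api/generate", {
      model,
      prompt,
      stream: false,
      format: "json",
    })) as { response?: unknown };
    const parsed = JSON.parse(String(reply.response)) as Record<string, unknown>;
    return {
      needed_skills: toStringArray(parsed.needed_skills),
      needed_memory_domains: toStringArray(parsed.needed_memory_domains),
    };
  } catch {
    // On any failure, return an empty intent; callers could then load full context.
    return EMPTY_INTENT;
  }
}
```

Returning an empty intent on failure lets the context engine fall back to the current unfiltered behavior, so a missing or slow local model never blocks a turn.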

Reference Research

| Work | Approach | Key Finding |
|---|---|---|
| Anthropic Contextual Retrieval (2024) | BM25 + embedding dual-path retrieval | Compresses context to 4% while retaining 85% accuracy |
| LLM Janitor (2024) | Small model pre-filters context | Reduces context by ~60% with minimal degradation |
| Gorilla (Berkeley, 2024) | Dynamic API/tool routing via retriever | Reduces tool-use failures |
| Ollama + Apple Silicon (Metal GPU) | Local LLM inference on M-series GPU | 1B models run at 50–100 tok/s on M5 Max |

Related Documentation

  • /concepts/context-engine — Context Engine Plugin API
  • /concepts/system-prompt — Bootstrap injection mechanism
  • /concepts/memory-builtin — Built-in memory search engine
  • /concepts/context — Context assembly overview

Tags

enhancement context-engine memory token-optimization local-llm ollama


Submitted via OpenClaw agent on behalf of a production user. Phase 3 prototype validated on Apple M5 Max (128 GB RAM). Willing to contribute a reference implementation.
