# `memory_search`: hybrid scoring bugs reduce hit rate from 95% to 40%

## Summary

Three compounding bugs in the hybrid search scoring pipeline cause `memory_search` to filter out valid semantic matches when `hybrid.enabled` is `true` (the default). First, FTS queries are AND-joined, so BM25 returns nothing for most natural language queries. Second, the weighted-average fusion formula scales vector scores by 0.7× whenever BM25 returns nothing. Third, `bm25RankToScore` clamps FTS5's negative ranks to 0, so every BM25 match receives the same score. Together, the first two bugs raise the effective `minScore` threshold on vector similarity from 0.35 to 0.50 and filter out the majority of semantically relevant results; the third removes any ranking signal from the keyword matches that remain. In a 57-query test across 11 memory files, the hit rate is **40.4%** with the default hybrid config versus **94.7%** with the vector-only workaround (hybrid disabled, `minScore` 0.25, `maxResults` 10). The proposed fix switches FTS queries from AND-join to OR-join, replaces the weighted-average fusion with Reciprocal Rank Fusion (RRF), and normalizes BM25 scores relative to the result set rather than using absolute rank.

## Environment

| Component | Value |
|-----------|-------|
| OpenClaw version | `2026.2.13` |
| Node.js | `v25.5.0` |
| OS | macOS 15 (Darwin 25.2.0, arm64) |
| Embedding model | `text-embedding-3-large` (3072 dims) |
| Embedding provider | OpenAI-compatible API |
| Config | Default — no `query` or `hybrid` overrides |
| Memory corpus | 45 files, 564 chunks, avg ~1500 chars/chunk |

Relevant default config (from `src/agents/memory-search.ts`, lines 81–86):

```
DEFAULT_MAX_RESULTS = 6
DEFAULT_MIN_SCORE = 0.35
DEFAULT_HYBRID_ENABLED = true
DEFAULT_HYBRID_VECTOR_WEIGHT = 0.7
DEFAULT_HYBRID_TEXT_WEIGHT = 0.3
DEFAULT_HYBRID_CANDIDATE_MULTIPLIER = 4
```

## Reproduction

### Steps

1. Use default config (no `query.hybrid` overrides — hybrid search is enabled by default)
2. Index several markdown memory files containing varied content
3. Run `memory_search` with natural language queries

### Representative failing queries

These queries return **"No matches"** with default hybrid config but succeed with `hybrid.enabled: false`:

| Query | Target content | Vector similarity | Hybrid score (0.7×) | Result |
|-------|---------------|-------------------|---------------------|--------|
| `"how should I evaluate if a claim is actually true"` | Falsification-first reasoning protocol | 0.438 | 0.307 | ❌ Filtered (< 0.35) |
| `"how to think about problems and verify information"` | Reasoning & verification methodology | 0.505 | 0.353 | ❌ Filtered (< 0.35) |
| `"what to do when a task seems impossible"` | Tenacity & persistence protocol | 0.400 | 0.280 | ❌ Filtered (< 0.35) |
| `"tone adjustment strangers conversation style"` | Audience calibration rules | 0.484 | 0.339 | ❌ Filtered (< 0.35) |
| `"what does Michael value and how does he think"` | User personality profile | 0.456* | 0.319 | ❌ Filtered (< 0.35) |
| `"should I respond to this message or stay quiet"` | Group chat participation rules | 0.275* | 0.193 | ❌ Filtered (< 0.35) |
| `"where to send urgent notifications"` | 7-bot notification routing | 0.447* | 0.313 | ❌ Filtered (< 0.35) |
| `"how should I behave when talking to people who aren't Michael"` | Audience adaptation for non-Michael users | 0.497* | 0.348 | ❌ Filtered (< 0.35) |

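*Scores marked with \* are scores returned by `memory_search` in vector-only mode with the workaround config applied, not directly computed cosine similarities; the directly computed values follow the same pattern. The 0.275 row is below 0.35 even before the 0.7× factor, so it is recovered only by the workaround's lower `minScore` of 0.25.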

**Pattern:** With AND-joining, any natural language query whose tokens do not all appear in the target chunk gets a BM25 score of 0. The hybrid formula then reduces to 0.7 × the vector score, which drops these matches below the 0.35 threshold. Direct terminology queries (e.g., `"falsification-first reasoning protocol"` → 0.664) succeed because their tokens also match in FTS.

### Expected vs actual

| Config | Hit rate | Files passing (≥3/5 queries) |
|--------|----------|------------------------------|
| Default (hybrid enabled, minScore 0.35) | **40.4%** (23/57) | 4/11 |
| Hybrid disabled, minScore 0.25, maxResults 10 | **94.7%** (54/57) | 11/11 |

## Root Cause Analysis

Three bugs compound to systematically suppress natural language queries. Each one is modest in isolation; together they raise the effective threshold from 0.35 to approximately 0.50.

### Bug 1: AND-joined FTS queries kill recall

**File:** `src/memory/hybrid.ts`, line 23 — `buildFtsQuery()`

```typescript
export function buildFtsQuery(raw: string): string | null {
  const tokens = raw.match(/[A-Za-z0-9_]+/g)?.map((t) => t.trim()).filter(Boolean) ?? [];
  if (tokens.length === 0) return null;
  const quoted = tokens.map((t) => `"${t.replaceAll('"', "")}"`);
  return quoted.join(" AND ");  // ← AND requires ALL tokens in a single chunk
}
```

For the query `"how should I evaluate if a claim is actually true"`, this produces:

```
"how" AND "should" AND "I" AND "evaluate" AND "if" AND "a" AND "claim" AND "is" AND "actually" AND "true"
```

All 10 words must appear verbatim in a single ~1500-character chunk. For natural language queries, this almost never matches, so FTS returns zero results and every document gets `textScore = 0`.

**Impact:** The FTS component becomes effectively dead for natural language queries, guaranteeing Bug 2 fires on every search.
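
To make the recall difference concrete, the sketch below mimics the two joining strategies with plain set membership instead of a real FTS5 index. `tokenize`, `matchesAll`, and `matchesAny` are hypothetical helpers, and the chunk text is invented for illustration; real FTS5 tokenization and ranking are more involved.

```typescript
// Toy stand-in for FTS5 matching: AND needs every token, OR needs at least one.
// The helpers and the sample chunk are illustrative, not taken from the codebase.
function tokenize(text: string): string[] {
  return (text.match(/[A-Za-z0-9_]+/g) ?? []).map((t) => t.toLowerCase());
}

function matchesAll(query: string, chunk: string): boolean {
  const chunkTokens = new Set(tokenize(chunk));
  return tokenize(query).every((t) => chunkTokens.has(t));
}

function matchesAny(query: string, chunk: string): boolean {
  const chunkTokens = new Set(tokenize(chunk));
  return tokenize(query).some((t) => chunkTokens.has(t));
}

const chunk = "Falsification-first protocol: before accepting a claim, try to disprove it.";
const query = "how should I evaluate if a claim is actually true";

console.log(matchesAll(query, chunk)); // false: "how", "should", "evaluate" and others are absent
console.log(matchesAny(query, chunk)); // true: "a" and "claim" are present
```

Under AND semantics a single missing word is enough to drop an otherwise relevant chunk, which is why longer natural language queries are hit hardest.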

### Bug 2: Weighted-average fusion penalizes FTS misses

**File:** `src/memory/hybrid.ts`, line 41 — `mergeHybridResults()`

```typescript
const score = params.vectorWeight * entry.vectorScore + params.textWeight * entry.textScore;
// With defaults: score = 0.7 × vectorScore + 0.3 × 0 = 0.7 × vectorScore
```

When the FTS component returns nothing (due to Bug 1), `textScore = 0` for all results. The fusion formula then reduces to `score = 0.7 × vectorScore`. This means:

- A document needs a **vector similarity > 0.50** to pass the 0.35 minScore threshold (0.35 ÷ 0.7 = 0.50)
- Most natural language queries against technical content produce vector similarities in the **0.35–0.50 range** — good matches that get filtered out
- The 0.3 textWeight allocation is permanently "wasted," acting as a penalty rather than a boost

**Impact:** The effective minScore threshold rises from 0.35 to 0.50 for any query where FTS returns nothing.
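
The arithmetic can be checked directly. The sketch below is a standalone calculation, not the project's code: `fuse` is a hypothetical helper mirroring the weighted-average formula above, and the sample similarities are three values from the failing-query table plus one made-up value above 0.50.

```typescript
// Weighted-average fusion as described above; names here are illustrative.
const VECTOR_WEIGHT = 0.7;
const TEXT_WEIGHT = 0.3;
const MIN_SCORE = 0.35;

function fuse(vectorScore: number, textScore: number): number {
  return VECTOR_WEIGHT * vectorScore + TEXT_WEIGHT * textScore;
}

// Smallest vector similarity that still passes minScore when textScore is 0.
const effectiveThreshold = MIN_SCORE / VECTOR_WEIGHT;
console.log(effectiveThreshold.toFixed(2)); // "0.50"

// 0.438, 0.400 and 0.484 come from the table above; 0.55 is a made-up passing value.
for (const vectorScore of [0.438, 0.4, 0.484, 0.55]) {
  const score = fuse(vectorScore, 0); // FTS returned nothing
  console.log(vectorScore, score.toFixed(3), score >= MIN_SCORE ? "kept" : "filtered");
}
```

Only the made-up 0.55 value survives; the three table values land at roughly 0.307, 0.280, and 0.339.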

### Bug 3: `bm25RankToScore` produces binary scores

**File:** `src/memory/hybrid.ts`, line 36 — `bm25RankToScore()`

```typescript
export function bm25RankToScore(rank: number): number {
  const normalized = Number.isFinite(rank) ? Math.max(0, rank) : 999;
  return 1 / (1 + normalized);
}
```

SQLite FTS5's `bm25()` function returns **negative** values (more negative = better match). The `Math.max(0, rank)` call clamps all negative values to 0, so `1 / (1 + 0) = 1.0` for every match. BM25 becomes effectively binary: match = 1.0, no match = 0.0, with no gradient between them.

Even when FTS does return results (i.e., Bug 1 doesn't fire), all matches receive the same textScore of 1.0, providing no ranking signal. The `1 / (1 + rank)` formula using absolute rank also decays extremely rapidly — rank 1 → 0.5, rank 2 → 0.33 — even if all ranks are conceptually close in quality.

**Impact:** BM25 provides no useful ranking information even in the rare cases where AND-joined FTS matches succeed.
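
The clamp is easy to see by running the mapping on a few negative values. The function below is a copy of the snippet above so the example is self-contained; the sample ranks are hypothetical and only stand in for what FTS5 `bm25()` might return.

```typescript
// Copy of the current mapping, reproduced here so the snippet is self-contained.
function bm25RankToScore(rank: number): number {
  const normalized = Number.isFinite(rank) ? Math.max(0, rank) : 999;
  return 1 / (1 + normalized);
}

// Hypothetical FTS5 bm25() values: more negative means a better match.
const sampleRanks = [-8.2, -3.1, -0.4];
const scores = sampleRanks.map(bm25RankToScore);

// Every entry is 1: strong and weak matches collapse to the same score.
console.log(scores);
```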

### How these compound

```
Bug 1 (AND-join) → FTS returns nothing for natural language queries
       ↓
Bug 2 (weighted average) → score = 0.7 × vectorScore (30% penalty)
       ↓
Effective threshold: 0.35 / 0.7 = 0.50 (raised from 0.35)
       ↓
Most semantic matches (0.35–0.50 cosine sim) are filtered out
       ↓
Hit rate: 40% instead of 95%
```

When Bug 1 doesn't fire (exact terminology query), Bug 3 means the BM25 boost is always the maximum value, so the hybrid score is `0.7 × vectorScore + 0.3 × 1.0`. This inflates scores for keyword matches but provides no gradient. The system works for exact terminology but fails for semantic queries — exactly backwards from what hybrid search is supposed to achieve.

## Proposed Fix

### Fix 1: Switch FTS query joining from AND to OR

**What:** Change `buildFtsQuery()` to join tokens with `OR` instead of `AND`, with proper FTS5 special character escaping.

**Why:** OR-joining means any matching token surfaces the document. BM25 naturally ranks documents matching more terms higher, so multi-word matches still score better than single-word matches — no explicit AND required.

**Spec:**
- Tokenize the query string (current regex is fine: `/[A-Za-z0-9_]+/g`)
- Escape FTS5 special characters in tokens (double-quote escaping, etc.)
- Join with `OR` instead of `AND`
- Return `null` for empty token lists (same as current behavior)

**Edge cases:**
- Single-token queries: Behavior unchanged (no join operator needed)
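- Queries with FTS5 special characters (`"`, `*`, `-`): The tokenizer regex already strips these, so tokens contain only `[A-Za-z0-9_]`; quoting every token additionally keeps bare `AND`, `OR`, `NOT`, and `NEAR` from being parsed as FTS5 operators
- Very long queries: May return too-broad FTS results, but BM25 ranking handles this
- Stop words: FTS5's built-in tokenizers do not remove stop words, so with OR-joining common words such as `a` or `is` will match most chunks. BM25's inverse document frequency weighting gives such terms little weight, but the candidate set grows; filtering stop words before building the query may be worth considering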
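
If the broader candidate set from OR-joining turns out to be a problem, one option is to drop very common words before building the query. The sketch below is only an illustration: the `STOP_WORDS` list and the `buildOrQuery` name are hypothetical, and falling back to all tokens when everything is a stop word is a design choice, not something the codebase prescribes.

```typescript
// Hypothetical stop-word list; a real one would be tuned to the corpus.
const STOP_WORDS = new Set(["a", "an", "the", "i", "is", "if", "to", "of", "how", "should"]);

function buildOrQuery(raw: string): string | null {
  const tokens = raw.match(/[A-Za-z0-9_]+/g) ?? [];
  if (tokens.length === 0) return null;

  const content = tokens.filter((t) => !STOP_WORDS.has(t.toLowerCase()));
  // Fall back to the full token list so a query made only of stop words still searches.
  const kept = content.length > 0 ? content : tokens;

  // Quote every token so bare AND / OR / NOT are never parsed as FTS5 operators.
  return kept.map((t) => `"${t}"`).join(" OR ");
}

console.log(buildOrQuery("how should I evaluate if a claim is actually true"));
// "evaluate" OR "claim" OR "actually" OR "true"
```

Whether filtering is worth the extra list to maintain depends on how much the unfiltered OR query inflates the candidate set in practice.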

### Fix 2: Replace weighted-average fusion with Reciprocal Rank Fusion (RRF)

**What:** Change the `mergeHybridResults()` formula from `score = vectorWeight × vectorScore + textWeight × textScore` to RRF: `score = Σ(weight_i / (k + rank_i))`.

**Why:** RRF is rank-based, not score-based. A document missing from one list simply receives no contribution from that list, and its position relative to other documents depends only on ranks, so the fused ordering is unaffected when the keyword list is empty. The absolute score scale does change, which is covered in the note on `minScore` below. RRF is a widely used fusion method for hybrid search.

**Spec:**
- For each document, compute: `rrfScore = vectorWeight / (k + vectorRank) + textWeight / (k + keywordRank)`
- `k = 60` (standard constant from the [original RRF paper](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf))
- If a document appears only in the vector list, `keywordRank` contribution is 0 (not penalized)
- If a document appears only in the keyword list, `vectorRank` contribution is 0
- Ranks are 1-indexed positions within each sorted result list
- The `vectorWeight` and `textWeight` config values can still be used as multipliers on each component

**Note on minScore:** RRF scores fall in a much smaller range than the current weighted-average scores (0–1.0). The maximum possible score is `(vectorWeight + textWeight) / (k + 1)`, which is about 0.016 with the default weights and `k = 60` (about 0.033 for unweighted two-list RRF). The `minScore` threshold would need to be adjusted, or the RRF scores would need to be normalized. One approach: normalize RRF scores to 0–1 by dividing by that maximum. Another: use a separate `minScore` default when RRF is active. This is a design decision best left to the maintainers.
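
As a rough illustration of the normalization option, the sketch below divides a weighted RRF score by the maximum possible score `(vectorWeight + textWeight) / (k + 1)`. The function names `rrfScore` and `normalizeRrf` are hypothetical, and the default weights are the ones listed in the Environment section.

```typescript
// Normalizing weighted RRF scores into the 0 to 1 range; all names are illustrative.
const K = 60;

function rrfScore(
  vectorRank: number | undefined,
  keywordRank: number | undefined,
  vectorWeight = 0.7,
  textWeight = 0.3,
): number {
  const v = vectorRank !== undefined ? vectorWeight / (K + vectorRank) : 0;
  const t = keywordRank !== undefined ? textWeight / (K + keywordRank) : 0;
  return v + t;
}

function normalizeRrf(score: number, vectorWeight = 0.7, textWeight = 0.3): number {
  // Maximum possible score: rank 1 in both lists.
  const max = (vectorWeight + textWeight) / (K + 1);
  return score / max;
}

console.log(normalizeRrf(rrfScore(1, 1)).toFixed(3));          // "1.000" (top of both lists)
console.log(normalizeRrf(rrfScore(1, undefined)).toFixed(3));  // "0.700" (top vector hit, no keyword match)
console.log(normalizeRrf(rrfScore(10, undefined)).toFixed(3)); // "0.610"
```

One thing to keep in mind with this approach: a vector-only hit at rank `r` normalizes to `0.7 × 61 / (60 + r)` under the default weights, so a fixed `minScore` behaves like a rank cutoff rather than a similarity cutoff.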

**Edge cases:**
- All results from vector only (FTS returns nothing): Equivalent to vector-only ranking, no penalty
- All results from FTS only (vector returns nothing): Equivalent to keyword-only ranking
- Duplicate documents in both lists: Scores add, correctly boosting documents found by both methods
- `k` value sensitivity: 60 is well-tested in literature; could be made configurable

### Fix 3: Fix BM25 score normalization

**What:** Replace `bm25RankToScore()` to correctly handle FTS5's negative rank values and provide a gradient across results.

**Why:** FTS5 `bm25()` returns negative values where more negative = better. The current `Math.max(0, rank)` clamps all values to 0, making every match score 1.0.

**Spec:**
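- For the basic fix, treat the negated rank as relevance and map it with a function that increases with match quality, e.g. `relevance / (1 + relevance)` where `relevance = -rank` (simply swapping in `Math.abs(rank)` would invert the ordering, because `1 / (1 + |rank|)` gives better matches a lower score), or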
- Use min-max normalization across the result set: `normalizedScore = (maxRank - rank) / (maxRank - minRank)` to produce a meaningful 0–1 gradient
- Single-result sets should get score 1.0 (no range to normalize)

**Note:** If Fix 2 (RRF) is adopted, this fix becomes less critical since RRF uses rank positions rather than scores. However, the `bm25RankToScore` function is still called in `searchKeyword()` (in `manager-search.ts`, assigned to `score` and `textScore`), so fixing it would improve keyword-only search quality as well.

**Edge cases:**
- Single FTS result: Should score 1.0 (currently does, and normalization should preserve this)
- FTS5 returning 0.0 rank: Handle gracefully (edge case with no term frequency data)
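
The direction of the per-result mapping matters as much as handling the sign. The sketch below compares a `Math.abs`-based mapping with one assumed alternative that treats `-rank` as relevance; both function names and the sample ranks are hypothetical.

```typescript
// Two candidate mappings from FTS5 bm25() ranks to scores; both are sketches.
function absMapping(rank: number): number {
  return 1 / (1 + Math.abs(rank));
}

// Assumed alternative: treat -rank as relevance and squash it into [0, 1).
function monotonicMapping(rank: number): number {
  if (!Number.isFinite(rank)) return 0;
  const relevance = Math.max(0, -rank);
  return relevance / (1 + relevance);
}

const best = -8.0;  // hypothetical strong match
const worst = -0.5; // hypothetical weak match

console.log(absMapping(best) > absMapping(worst));             // false: ordering is inverted
console.log(monotonicMapping(best) > monotonicMapping(worst)); // true: better match scores higher
```

With the `Math.abs` variant the strong match scores about 0.11 and the weak one about 0.67, so any per-result fix should be checked for ordering as well as range.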

### Acceptance criteria

1. **Natural language queries return relevant results:** Queries like `"what to do when a task seems impossible"` should return documents about persistence/tenacity methodology with hybrid enabled.
2. **No regression on exact terminology:** Queries with exact document terms (e.g., `"falsification-first reasoning protocol"`) should continue to work and ideally rank higher (boosted by both vector and keyword signals).
3. **FTS misses don't suppress vector matches:** A document with a strong vector score (e.g., 0.45) should not be filtered out just because it didn't match the FTS query.
4. **BM25 provides ranking gradient:** When multiple documents match the FTS query, they should receive differentiated scores rather than all receiving 1.0.
5. **Hit rate ≥ 90%** on a test suite of natural language, contextual, and exact terminology queries against a representative memory corpus.

## Reference Implementation

> **Note:** These are reference implementations for the proposed approach. The maintainers know the codebase better and may choose a different implementation strategy. These are provided to show that the approach works, not to prescribe a specific solution.

### RRF fusion — replacement for `mergeHybridResults()`

<details>
<summary>Click to expand: <code>rrf-fusion.ts</code></summary>

```typescript
/**
 * Reciprocal Rank Fusion — drop-in replacement for mergeHybridResults()
 *
 * RRF Formula: RRF_score(d) = Σ(weight_i / (k + rank_i(d)))
 * k=60 (standard from original RRF paper)
 *
 * Key difference from weighted average:
 * - Weighted avg: score = 0.7 × vectorScore + 0.3 × 0 = 0.7 × vectorScore (suppressed!)
 * - RRF: score = vectorWeight/(k+vectorRank) + 0 = vectorWeight/(k+vectorRank) (no suppression)
 */

const DEFAULT_K = 60;

export function mergeHybridResults(params: {
  vector: HybridVectorResult[];
  keyword: HybridKeywordResult[];
  vectorWeight: number;
  textWeight: number;
  k?: number;
}): Array<{
  path: string;
  startLine: number;
  endLine: number;
  score: number;
  snippet: string;
  source: HybridSource;
}> {
  const k = params.k ?? DEFAULT_K;

  const byId = new Map<string, {
    id: string; path: string; startLine: number; endLine: number;
    source: HybridSource; snippet: string;
    vectorRank?: number; keywordRank?: number;
  }>();

  // Track rank positions (1-indexed); both input lists are assumed to be sorted best-first
  for (let i = 0; i < params.vector.length; i++) {
    const r = params.vector[i];
    byId.set(r.id, {
      id: r.id, path: r.path, startLine: r.startLine, endLine: r.endLine,
      source: r.source, snippet: r.snippet, vectorRank: i + 1,
    });
  }

  for (let i = 0; i < params.keyword.length; i++) {
    const r = params.keyword[i];
    const existing = byId.get(r.id);
    if (existing) {
      existing.keywordRank = i + 1;
      if (r.snippet?.length) existing.snippet = r.snippet;
    } else {
      byId.set(r.id, {
        id: r.id, path: r.path, startLine: r.startLine, endLine: r.endLine,
        source: r.source, snippet: r.snippet, keywordRank: i + 1,
      });
    }
  }

  const merged = Array.from(byId.values()).map((entry) => {
    const vectorContrib = entry.vectorRank != null
      ? params.vectorWeight / (k + entry.vectorRank) : 0;
    const keywordContrib = entry.keywordRank != null
      ? params.textWeight / (k + entry.keywordRank) : 0;
    return {
      path: entry.path, startLine: entry.startLine, endLine: entry.endLine,
      score: vectorContrib + keywordContrib,
      snippet: entry.snippet, source: entry.source,
    };
  });

  return merged.toSorted((a, b) => b.score - a.score);
}
```

</details>

### OR-joined FTS query builder — replacement for `buildFtsQuery()`

<details>
<summary>Click to expand: <code>fts-query.ts</code></summary>

```typescript
/**
 * OR-joined FTS5 query builder.
 *
 * OR-joining means any matching token surfaces the document.
 * BM25 naturally ranks multi-word matches higher.
 */
export function buildFtsQuery(raw: string): string | null {
  // The regex only yields [A-Za-z0-9_] runs, so tokens never contain quotes,
  // whitespace, or other FTS5 special characters.
  const tokens = raw.match(/[A-Za-z0-9_]+/g) ?? [];
  if (tokens.length === 0) return null;

  // Quote every token (as the current implementation does) so that bare
  // AND / OR / NOT / NEAR are treated as terms rather than FTS5 operators.
  return tokens.map((token) => `"${token}"`).join(" OR ");
}
```

</details>

### BM25 normalization — replacement for `bm25RankToScore()`

<details>
<summary>Click to expand: <code>bm25-normalize.ts</code></summary>

```typescript
/**
 * Normalize BM25 scores to 0–1 using min-max normalization.
 *
 * FTS5 bm25() returns negative values (more negative = better match).
 * Current implementation clamps negatives to 0, making all scores = 1.0.
 *
 * This normalizes relative to the result set:
 * best match → 1.0, worst match → 0.0
 */
export function normalizeBm25Scores<T extends { rank: number }>(
  results: T[]
): (T & { normalizedScore: number })[] {
  if (results.length === 0) return [];

  const ranks = results.map((r) => r.rank);
  const minRank = Math.min(...ranks); // Best (most negative)
  const maxRank = Math.max(...ranks); // Worst (least negative)
  const range = maxRank - minRank;

  return results.map((result) => ({
    ...result,
    normalizedScore: range === 0 ? 1.0 : (maxRank - result.rank) / range,
  }));
}

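/**
 * Minimal fix for bm25RankToScore (if batch normalization isn't desired).
 * FTS5 ranks are negative, with more negative meaning a better match, so the
 * negated rank is treated as relevance and squashed into [0, 1). Note that
 * 1 / (1 + Math.abs(rank)) would invert the ordering, since better matches
 * have a larger absolute rank.
 */
export function bm25RankToScore(rank: number): number {
  if (!Number.isFinite(rank)) return 1 / (1 + 999); // same fallback as before
  if (rank < 0) {
    const relevance = -rank;
    return relevance / (1 + relevance);
  }
  return 1 / (1 + rank); // non-negative input: keep the original mapping
}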
```

</details>

## Workaround

Until this is fixed upstream, the following config achieves **94.7% hit rate** by bypassing hybrid search entirely:

```json
{
  "agents": {
    "defaults": {
      "memorySearch": {
        "query": {
          "hybrid": {
            "enabled": false
          },
          "minScore": 0.25,
          "maxResults": 10
        }
      }
    }
  }
}
```

This disables the hybrid pipeline, restores the raw cosine similarity as the score, lowers the threshold to catch moderate-similarity matches, and increases the result count. The trade-off is losing keyword matching entirely — there's no BM25 boost for exact terminology matches.

## Validation

### Test methodology

- **57 queries** across **11 memory files** covering diverse content types (behavioral protocols, technical reference, personality profiles, routing rules)
- **5–6 queries per file** using varied phrasings:
  - **Direct terminology** — exact terms from the document
  - **Natural language** — how a user/agent would actually phrase the question
  - **Adjacent vocabulary** — synonyms and related terms
  - **Vague/abstract** — conceptual queries without specific terms
  - **Cross-domain** — terms from a different field that map to the same concept
- **Hit criteria:** Target file appears in results with score ≥ minScore
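
The hit criterion and the resulting hit rate can be expressed compactly. The sketch below is not the harness used for the numbers in this report; the `SearchResult` and `QueryCase` shapes, the function names, and the sample cases are all made up to show the calculation.

```typescript
// Hit-rate calculation under the stated criteria; types and names are illustrative.
interface SearchResult { path: string; score: number }
interface QueryCase { query: string; targetPath: string; results: SearchResult[] }

// A query is a hit when the target file appears in the results with score >= minScore.
function isHit(c: QueryCase, minScore: number): boolean {
  return c.results.some((r) => r.path === c.targetPath && r.score >= minScore);
}

function hitRate(cases: QueryCase[], minScore: number): number {
  if (cases.length === 0) return 0;
  return cases.filter((c) => isHit(c, minScore)).length / cases.length;
}

// Made-up cases, only to show the shape of the data.
const cases: QueryCase[] = [
  { query: "q1", targetPath: "memory/a.md", results: [{ path: "memory/a.md", score: 0.44 }] },
  { query: "q2", targetPath: "memory/b.md", results: [{ path: "memory/c.md", score: 0.61 }] },
  { query: "q3", targetPath: "memory/c.md", results: [{ path: "memory/c.md", score: 0.31 }] },
  { query: "q4", targetPath: "memory/d.md", results: [] },
];

console.log(hitRate(cases, 0.35)); // 0.25
console.log(hitRate(cases, 0.25)); // 0.5
```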

### Results

| Configuration | Hits | Hit Rate | Files passing |
|--------------|------|----------|---------------|
| Default hybrid (before) | 23/57 | 40.4% | 4/11 |
| Vector-only + lower threshold (after) | 54/57 | 94.7% | 11/11 |

### Failure pattern

The 34 queries that failed under the default hybrid config (57 − 23) all share the same characteristic: natural language phrasing with no exact keyword overlap with the target document. Direct terminology queries succeeded in every case. This is consistent with the root cause: AND-joined FTS returns nothing → weighted average applies the 0.7× penalty → score drops below threshold.

### Only 3 queries still fail after workaround

1. `"quality assurance checklist before making claims"` — too generic, no semantic overlap with target
2. `"responsive design mobile-first"` — completely different domain vocabulary
3. `"which telegram bot should I send this to"` — too conversational for vector similarity

These are genuine edge cases rather than systematic failures.

---

Thank you for building OpenClaw — the memory search architecture is well-designed and the hybrid approach is the right idea. These bugs are subtle (especially the compounding effect) and easy to miss. Happy to provide more details, run additional tests, or help in any way that's useful.


memory_search: hybrid scoring bugs reduce hit rate from 95% to 40% #16021
