memory_search: hybrid scoring bugs reduce hit rate from 95% to 40% #16021

@michael0903

Summary

Three compounding bugs in the hybrid search scoring pipeline cause memory_search to filter out valid semantic matches when hybrid.enabled is true (the default). The weighted-average fusion formula applies a 0.7× penalty to vector scores when BM25 returns nothing, AND-joined FTS queries ensure BM25 returns nothing for most natural language queries, and the bm25RankToScore function clamps all BM25 scores to a single value due to incorrect handling of negative FTS5 ranks. Together, these bugs raise the effective minScore threshold from 0.35 to ~0.50, filtering out the majority of semantically relevant results. In a 57-query test across 11 memory files, the hit rate drops from 94.7% (vector-only) to 40.4% (hybrid). The fix involves switching FTS queries from AND-join to OR-join, replacing the weighted-average fusion with Reciprocal Rank Fusion (RRF), and normalizing BM25 scores relative to the result set rather than using absolute rank.

Environment

| Component | Value |
|---|---|
| OpenClaw version | 2026.2.13 |
| Node.js | v25.5.0 |
| OS | macOS 15 (Darwin 25.2.0, arm64) |
| Embedding model | text-embedding-3-large (3072 dims) |
| Embedding provider | OpenAI-compatible API |
| Config | Default (no query or hybrid overrides) |
| Memory corpus | 45 files, 564 chunks, avg ~1500 chars/chunk |

Relevant default config (from src/agents/memory-search.ts, lines 81–86):

DEFAULT_MAX_RESULTS = 6
DEFAULT_MIN_SCORE = 0.35
DEFAULT_HYBRID_ENABLED = true
DEFAULT_HYBRID_VECTOR_WEIGHT = 0.7
DEFAULT_HYBRID_TEXT_WEIGHT = 0.3
DEFAULT_HYBRID_CANDIDATE_MULTIPLIER = 4

Reproduction

Steps

  1. Use default config (no query.hybrid overrides — hybrid search is enabled by default)
  2. Index several markdown memory files containing varied content
  3. Run memory_search with natural language queries

Representative failing queries

These queries return "No matches" with default hybrid config but succeed with hybrid.enabled: false:

| Query | Target content | Vector similarity | Hybrid score (0.7×) | Result |
|---|---|---|---|---|
| "how should I evaluate if a claim is actually true" | Falsification-first reasoning protocol | 0.438 | 0.307 | ❌ Filtered (< 0.35) |
| "how to think about problems and verify information" | Reasoning & verification methodology | 0.505 | 0.353 | ❌ Filtered (< 0.35) |
| "what to do when a task seems impossible" | Tenacity & persistence protocol | 0.400 | 0.280 | ❌ Filtered (< 0.35) |
| "tone adjustment strangers conversation style" | Audience calibration rules | 0.484 | 0.339 | ❌ Filtered (< 0.35) |
| "what does Michael value and how does he think" | User personality profile | 0.456* | 0.319 | ❌ Filtered (< 0.35) |
| "should I respond to this message or stay quiet" | Group chat participation rules | 0.275* | 0.193 | ❌ Filtered (< 0.35) |
| "where to send urgent notifications" | 7-bot notification routing | 0.447* | 0.313 | ❌ Filtered (< 0.35) |
| "how should I behave when talking to people who aren't Michael" | Audience adaptation for non-Michael users | 0.497* | 0.348 | ❌ Filtered (< 0.35) |

*Scores marked with * are from post-fix vector-only results; the direct cosine similarities follow the same pattern.

Pattern: Every natural language query that lacks exact keyword overlap with the target document gets a BM25 score of 0, which causes the hybrid formula to multiply the vector score by 0.7, dropping it below the 0.35 threshold. Direct terminology queries (e.g., "falsification-first reasoning protocol" → 0.664) succeed because they happen to match FTS keywords as well.

Expected vs actual

| Config | Hit rate | Files passing (≥3/5 queries) |
|---|---|---|
| Default (hybrid enabled, minScore 0.35) | 40.4% (23/57) | 4/11 |
| Hybrid disabled, minScore 0.25, maxResults 10 | 94.7% (54/57) | 11/11 |

Root Cause Analysis

Three bugs compound to systematically suppress natural language queries. Each one is modest in isolation; together they raise the effective threshold from 0.35 to approximately 0.50.

Bug 1: AND-joined FTS queries kill recall

File: src/memory/hybrid.ts, line 23 — buildFtsQuery()

export function buildFtsQuery(raw: string): string | null {
  const tokens = raw.match(/[A-Za-z0-9_]+/g)?.map((t) => t.trim()).filter(Boolean) ?? [];
  if (tokens.length === 0) return null;
  const quoted = tokens.map((t) => `"${t.replaceAll('"', "")}"`);
  return quoted.join(" AND ");  // ← AND requires ALL tokens in a single chunk
}

For the query "how should I evaluate if a claim is actually true", this produces:

"how" AND "should" AND "I" AND "evaluate" AND "if" AND "a" AND "claim" AND "is" AND "actually" AND "true"

All 10 words must appear verbatim in a single ~1500-character chunk. For natural language queries, this almost never matches, so FTS returns zero results and every document gets textScore = 0.

Impact: The FTS component becomes effectively dead for natural language queries, guaranteeing Bug 2 fires on every search.
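
The recall gap between AND and OR joining can be illustrated without SQLite. The sketch below is a simplified in-memory model, not FTS5 itself: `tokenize`, `matchesAll`, and `matchesAny` are hypothetical helpers that mimic AND and OR semantics over a chunk's token set, and the sample chunk text is invented for illustration.

```typescript
// Simplified model of AND vs OR token matching (not FTS5 itself).
// tokenize() mirrors the regex used by buildFtsQuery(); lowercasing is an
// assumption that approximates case-insensitive matching for ASCII text.
function tokenize(text: string): string[] {
  return (text.match(/[A-Za-z0-9_]+/g) ?? []).map((t) => t.toLowerCase());
}

// AND semantics: every query token must occur in the chunk.
function matchesAll(query: string, chunk: string): boolean {
  const chunkTokens = new Set(tokenize(chunk));
  return tokenize(query).every((t) => chunkTokens.has(t));
}

// OR semantics: at least one query token must occur in the chunk.
function matchesAny(query: string, chunk: string): boolean {
  const chunkTokens = new Set(tokenize(chunk));
  return tokenize(query).some((t) => chunkTokens.has(t));
}

// Hypothetical chunk text, invented for this example.
const chunk = "Falsification-first reasoning: try to disprove a claim before accepting it.";
const query = "how should I evaluate if a claim is actually true";

console.log(matchesAll(query, chunk)); // false: "how", "should", ... are absent
console.log(matchesAny(query, chunk)); // true: "a" and "claim" are present
```

Under OR semantics the chunk becomes a candidate, and ranking among candidates is then left to BM25.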

Bug 2: Weighted-average fusion penalizes FTS misses

File: src/memory/hybrid.ts, line 41 — mergeHybridResults()

const score = params.vectorWeight * entry.vectorScore + params.textWeight * entry.textScore;
// With defaults: score = 0.7 × vectorScore + 0.3 × 0 = 0.7 × vectorScore

When the FTS component returns nothing (due to Bug 1), textScore = 0 for all results. The fusion formula then reduces to score = 0.7 × vectorScore. This means:

  • A document needs a vector similarity > 0.50 to pass the 0.35 minScore threshold (0.35 ÷ 0.7 = 0.50)
  • Most natural language queries against technical content produce vector similarities in the 0.35–0.50 range — good matches that get filtered out
  • The 0.3 textWeight allocation is permanently "wasted," acting as a penalty rather than a boost

Impact: The effective minScore threshold rises from 0.35 to 0.50 for any query where FTS returns nothing.
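
The arithmetic behind the raised threshold can be checked with a short sketch. It assumes the default weights and minScore quoted earlier; `weightedScore` and `effectiveVectorThreshold` are illustrative names, not functions from the codebase.

```typescript
// Weighted-average fusion as described above, with the default weights.
const VECTOR_WEIGHT = 0.7;
const TEXT_WEIGHT = 0.3;
const MIN_SCORE = 0.35;

function weightedScore(vectorScore: number, textScore: number): number {
  return VECTOR_WEIGHT * vectorScore + TEXT_WEIGHT * textScore;
}

// Smallest vector similarity that still passes minScore when textScore is 0.
function effectiveVectorThreshold(minScore: number, vectorWeight: number): number {
  return minScore / vectorWeight;
}

const vectorScore = 0.438; // first row of the failing-query table
const hybrid = weightedScore(vectorScore, 0);

console.log(hybrid.toFixed(3)); // "0.307"
console.log(hybrid >= MIN_SCORE); // false: filtered even though 0.438 >= 0.35
console.log(effectiveVectorThreshold(MIN_SCORE, VECTOR_WEIGHT).toFixed(2)); // "0.50"
```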

Bug 3: bm25RankToScore produces binary scores

File: src/memory/hybrid.ts, line 36 — bm25RankToScore()

export function bm25RankToScore(rank: number): number {
  const normalized = Number.isFinite(rank) ? Math.max(0, rank) : 999;
  return 1 / (1 + normalized);
}

SQLite FTS5's bm25() function returns negative values (more negative = better match). The Math.max(0, rank) call clamps all negative values to 0, so 1 / (1 + 0) = 1.0 for every match. BM25 becomes effectively binary: match = 1.0, no match = 0.0, with no gradient between them.

Even when FTS does return results (i.e., Bug 1 doesn't fire), all matches receive the same textScore of 1.0, providing no ranking signal. The 1 / (1 + rank) formula using absolute rank also decays extremely rapidly — rank 1 → 0.5, rank 2 → 0.33 — even if all ranks are conceptually close in quality.

Impact: BM25 provides no useful ranking information even in the rare cases where AND-joined FTS matches succeed.
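
A quick check of the clamping behavior, using a copy of the function quoted above. The rank values are invented examples of the negative numbers FTS5's bm25() returns.

```typescript
// Copy of the current bm25RankToScore() quoted above.
function bm25RankToScore(rank: number): number {
  const normalized = Number.isFinite(rank) ? Math.max(0, rank) : 999;
  return 1 / (1 + normalized);
}

// Hypothetical FTS5 bm25() values: negative, more negative = better match.
const ranks = [-8.2, -3.1, -0.4];
const scores = ranks.map(bm25RankToScore);

console.log(scores); // [1, 1, 1]: the ordering information is lost
```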

How these compound

Bug 1 (AND-join) → FTS returns nothing for natural language queries
       ↓
Bug 2 (weighted average) → score = 0.7 × vectorScore (30% penalty)
       ↓
Effective threshold: 0.35 / 0.7 = 0.50 (raised from 0.35)
       ↓
Most semantic matches (0.35–0.50 cosine sim) are filtered out
       ↓
Hit rate: 40% instead of 95%

When Bug 1 doesn't fire (exact terminology query), Bug 3 means the BM25 boost is always the maximum value, so the hybrid score is 0.7 × vectorScore + 0.3 × 1.0. This inflates scores for keyword matches but provides no gradient. The system works for exact terminology but fails for semantic queries — exactly backwards from what hybrid search is supposed to achieve.

Proposed Fix

Fix 1: Switch FTS query joining from AND to OR

What: Change buildFtsQuery() to join tokens with OR instead of AND, with proper FTS5 special character escaping.

Why: OR-joining means any matching token surfaces the document. BM25 naturally ranks documents matching more terms higher, so multi-word matches still score better than single-word matches — no explicit AND required.

Spec:

  • Tokenize the query string (current regex is fine: /[A-Za-z0-9_]+/g)
  • Escape FTS5 special characters in tokens (double-quote escaping, etc.)
  • Join with OR instead of AND
  • Return null for empty token lists (same as current behavior)

Edge cases:

  • Single-token queries: Behavior unchanged (no join operator needed)
  • Queries with FTS5 special characters (", *, -): Must be properly escaped
  • Very long queries: May return too-broad FTS results, but BM25 ranking handles this
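  • Stop words: FTS5's built-in tokenizers do not remove stop words, but BM25's inverse-document-frequency weighting gives very common terms little weight, so explicit filtering is probably unnecessary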

Fix 2: Replace weighted-average fusion with Reciprocal Rank Fusion (RRF)

What: Replace the mergeHybridResults() formula from score = vectorWeight × vectorScore + textWeight × textScore to RRF: score = Σ(weight_i / (k + rank_i)).

Why: RRF is rank-based, not score-based. A document missing from one list simply receives no contribution from that list, and because scores depend only on rank positions, the relative order of vector matches is unaffected when FTS returns nothing. RRF is a widely used fusion method in hybrid search systems.

Spec:

  • For each document, compute: rrfScore = vectorWeight / (k + vectorRank) + textWeight / (k + keywordRank)
  • k = 60 (standard constant from the original RRF paper)
  • If a document appears only in the vector list, keywordRank contribution is 0 (not penalized)
  • If a document appears only in the keyword list, vectorRank contribution is 0
  • Ranks are 1-indexed positions within each sorted result list
  • The vectorWeight and textWeight config values can still be used as multipliers on each component

Note on minScore: RRF scores fall in a much smaller range than the current weighted-average scores (0–1.0). With k = 60 the maximum possible score is (vectorWeight + textWeight) / (k + 1), which is about 0.016 with the default weights of 0.7 and 0.3 (about 0.033 for unweighted RRF over two lists). The minScore threshold would need to be adjusted or the RRF scores would need to be normalized. One approach: normalize RRF scores to 0–1 by dividing by that maximum possible score. Another: use a separate minScore default when RRF is active. This is a design decision best left to the maintainers.
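
The first normalization option can be sketched as follows. This is a minimal illustration under the spec above (k = 60, 1-indexed ranks); `rrfScore` and `normalizeRrf` are hypothetical names, not existing functions.

```typescript
// Sketch of the first normalization option: divide each RRF score by the
// maximum possible score, reached by a document ranked 1st in both lists.
const K = 60;

function rrfScore(
  vectorRank: number | undefined,
  keywordRank: number | undefined,
  vectorWeight: number,
  textWeight: number,
): number {
  const v = vectorRank != null ? vectorWeight / (K + vectorRank) : 0;
  const t = keywordRank != null ? textWeight / (K + keywordRank) : 0;
  return v + t;
}

function normalizeRrf(score: number, vectorWeight: number, textWeight: number): number {
  const maxScore = (vectorWeight + textWeight) / (K + 1);
  return score / maxScore;
}

// Top vector hit with no keyword match, default weights 0.7 / 0.3.
const raw = rrfScore(1, undefined, 0.7, 0.3);
console.log(raw.toFixed(4)); // "0.0115"
console.log(normalizeRrf(raw, 0.7, 0.3).toFixed(2)); // "0.70"
```

One consequence worth weighing: a normalized RRF score depends only on rank, not on similarity. A vector-only document at rank 10 normalizes to about 0.61 whatever its cosine similarity, so a minScore applied to this value filters by position rather than by match quality.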

Edge cases:

  • All results from vector only (FTS returns nothing): Equivalent to vector-only ranking, no penalty
  • All results from FTS only (vector returns nothing): Equivalent to keyword-only ranking
  • Duplicate documents in both lists: Scores add, correctly boosting documents found by both methods
  • k value sensitivity: 60 is well-tested in literature; could be made configurable

Fix 3: Fix BM25 score normalization

What: Replace bm25RankToScore() to correctly handle FTS5's negative rank values and provide a gradient across results.

Why: FTS5 bm25() returns negative values where more negative = better. The current Math.max(0, rank) clamps all values to 0, making every match score 1.0.

Spec:

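  • For the basic fix, treat the negated rank as a relevance value and map it monotonically into 0–1, e.g. relevance / (1 + relevance) with relevance = -rank (simply swapping Math.max(0, rank) for Math.abs(rank) inside 1 / (1 + x) would give better matches lower scores), or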
  • Use min-max normalization across the result set: normalizedScore = (maxRank - rank) / (maxRank - minRank) to produce a meaningful 0–1 gradient
  • Single-result sets should get score 1.0 (no range to normalize)

Note: If Fix 2 (RRF) is adopted, this fix becomes less critical since RRF uses rank positions rather than scores. However, the bm25RankToScore function is still called in searchKeyword() (in manager-search.ts, assigned to score and textScore), so fixing it would improve keyword-only search quality as well.

Edge cases:

  • Single FTS result: Should score 1.0 (currently does, and normalization should preserve this)
  • FTS5 returning 0.0 rank: Handle gracefully (edge case with no term frequency data)

Acceptance criteria

  1. Natural language queries return relevant results: Queries like "what to do when a task seems impossible" should return documents about persistence/tenacity methodology with hybrid enabled.
  2. No regression on exact terminology: Queries with exact document terms (e.g., "falsification-first reasoning protocol") should continue to work and ideally rank higher (boosted by both vector and keyword signals).
  3. FTS misses don't suppress vector matches: A document with a strong vector score (e.g., 0.45) should not be filtered out just because it didn't match the FTS query.
  4. BM25 provides ranking gradient: When multiple documents match the FTS query, they should receive differentiated scores rather than all receiving 1.0.
  5. Hit rate ≥ 90% on a test suite of natural language, contextual, and exact terminology queries against a representative memory corpus.

Reference Implementation

Note: These are reference implementations for the proposed approach. The maintainers know the codebase better and may choose a different implementation strategy. These are provided to show that the approach works, not to prescribe a specific solution.

RRF fusion — replacement for mergeHybridResults()

rrf-fusion.ts
/**
 * Reciprocal Rank Fusion — drop-in replacement for mergeHybridResults()
 *
 * RRF Formula: RRF_score(d) = Σ(weight_i / (k + rank_i(d)))
 * k=60 (standard from original RRF paper)
 *
 * Key difference from weighted average:
 * - Weighted avg: score = 0.7 × vectorScore + 0.3 × 0 = 0.7 × vectorScore (suppressed!)
 * - RRF: score = vectorWeight/(k+vectorRank) + 0 = vectorWeight/(k+vectorRank) (no suppression)
 */

const DEFAULT_K = 60;

export function mergeHybridResults(params: {
  vector: HybridVectorResult[];
  keyword: HybridKeywordResult[];
  vectorWeight: number;
  textWeight: number;
  k?: number;
}): Array<{
  path: string;
  startLine: number;
  endLine: number;
  score: number;
  snippet: string;
  source: HybridSource;
}> {
  const k = params.k ?? DEFAULT_K;

  const byId = new Map<string, {
    id: string; path: string; startLine: number; endLine: number;
    source: HybridSource; snippet: string;
    vectorRank?: number; keywordRank?: number;
  }>();

  // Track rank positions (1-indexed)
  for (let i = 0; i < params.vector.length; i++) {
    const r = params.vector[i];
    byId.set(r.id, {
      id: r.id, path: r.path, startLine: r.startLine, endLine: r.endLine,
      source: r.source, snippet: r.snippet, vectorRank: i + 1,
    });
  }

  for (let i = 0; i < params.keyword.length; i++) {
    const r = params.keyword[i];
    const existing = byId.get(r.id);
    if (existing) {
      existing.keywordRank = i + 1;
      if (r.snippet?.length) existing.snippet = r.snippet;
    } else {
      byId.set(r.id, {
        id: r.id, path: r.path, startLine: r.startLine, endLine: r.endLine,
        source: r.source, snippet: r.snippet, keywordRank: i + 1,
      });
    }
  }

  const merged = Array.from(byId.values()).map((entry) => {
    const vectorContrib = entry.vectorRank != null
      ? params.vectorWeight / (k + entry.vectorRank) : 0;
    const keywordContrib = entry.keywordRank != null
      ? params.textWeight / (k + entry.keywordRank) : 0;
    return {
      path: entry.path, startLine: entry.startLine, endLine: entry.endLine,
      score: vectorContrib + keywordContrib,
      snippet: entry.snippet, source: entry.source,
    };
  });

  return merged.toSorted((a, b) => b.score - a.score);
}

OR-joined FTS query builder — replacement for buildFtsQuery()

fts-query.ts
/**
 * OR-joined FTS5 query builder with proper escaping.
 *
 * OR-joining means any matching token surfaces the document.
 * BM25 naturally ranks multi-word matches higher.
 */

export function buildFtsQuery(raw: string): string | null {
  const tokens = raw.match(/[A-Za-z0-9_]+/g) ?? [];
  if (tokens.length === 0) return null;

  // Always quote each token: a bare uppercase AND, OR, or NOT in the user's
  // query would otherwise be parsed as an FTS5 operator. Tokens matched by the
  // regex above cannot contain a double quote, but doubling is kept in case
  // the tokenizer is loosened later.
  return tokens.map((token) => `"${token.replaceAll('"', '""')}"`).join(" OR ");
}

BM25 normalization — replacement for bm25RankToScore()

bm25-normalize.ts
/**
 * Normalize BM25 scores to 0–1 using min-max normalization.
 *
 * FTS5 bm25() returns negative values (more negative = better match).
 * Current implementation clamps negatives to 0, making all scores = 1.0.
 *
 * This normalizes relative to the result set:
 * best match → 1.0, worst match → ~0.0
 */
export function normalizeBm25Scores<T extends { rank: number }>(
  results: T[]
): (T & { normalizedScore: number })[] {
  if (results.length === 0) return [];

  const ranks = results.map((r) => r.rank);
  const minRank = Math.min(...ranks); // Best (most negative)
  const maxRank = Math.max(...ranks); // Worst (least negative)
  const range = maxRank - minRank;

  return results.map((result) => ({
    ...result,
    normalizedScore: range === 0 ? 1.0 : (maxRank - result.rank) / range,
  }));
}

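/**
 * Minimal fix for bm25RankToScore (if batch normalization isn't desired):
 * treat -rank as relevance and map it into [0, 1) so better matches score higher.
 * Note: Math.abs(rank) inside 1 / (1 + x) would invert the order, because more
 * negative ranks (better matches) would receive lower scores.
 */
export function bm25RankToScore(rank: number): number {
  if (!Number.isFinite(rank)) return 1 / (1 + 999); // same fallback as before
  const relevance = Math.max(0, -rank); // FTS5 bm25(): more negative = better
  return relevance / (1 + relevance);
}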

Workaround

Until this is fixed upstream, the following config achieves 94.7% hit rate by bypassing hybrid search entirely:

{
  "agents": {
    "defaults": {
      "memorySearch": {
        "query": {
          "hybrid": {
            "enabled": false
          },
          "minScore": 0.25,
          "maxResults": 10
        }
      }
    }
  }
}

This disables the hybrid pipeline, restores the raw cosine similarity as the score, lowers the threshold to catch moderate-similarity matches, and increases the result count. The trade-off is losing keyword matching entirely — there's no BM25 boost for exact terminology matches.

Validation

Test methodology

  • 57 queries across 11 memory files covering diverse content types (behavioral protocols, technical reference, personality profiles, routing rules)
  • 5–6 queries per file using varied phrasings:
    • Direct terminology — exact terms from the document
    • Natural language — how a user/agent would actually phrase the question
    • Adjacent vocabulary — synonyms and related terms
    • Vague/abstract — conceptual queries without specific terms
    • Cross-domain — terms from a different field that map to the same concept
  • Hit criteria: Target file appears in results with score ≥ minScore
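
The hit criterion above can be expressed as a small harness. This is a sketch only: the `SearchResult`, `TestCase`, and `SearchFn` shapes are hypothetical rather than OpenClaw types, the search function is modeled as synchronous for brevity, and the stub data is invented.

```typescript
// Hypothetical shapes for a hit-rate harness; these are not OpenClaw types.
interface SearchResult { path: string; score: number }
interface TestCase { query: string; targetPath: string }
type SearchFn = (query: string) => SearchResult[];

// A query is a hit when the target file appears with score >= minScore.
function isHit(results: SearchResult[], targetPath: string, minScore: number): boolean {
  return results.some((r) => r.path === targetPath && r.score >= minScore);
}

function hitRate(cases: TestCase[], search: SearchFn, minScore: number): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter((c) => isHit(search(c.query), c.targetPath, minScore)).length;
  return hits / cases.length;
}

// Stub search function with invented data, for illustration only.
const stubSearch: SearchFn = (query) =>
  query.includes("protocol")
    ? [{ path: "memory/reasoning.md", score: 0.66 }]
    : [{ path: "memory/reasoning.md", score: 0.31 }];

const cases: TestCase[] = [
  { query: "falsification-first reasoning protocol", targetPath: "memory/reasoning.md" },
  { query: "how should I evaluate if a claim is actually true", targetPath: "memory/reasoning.md" },
];

console.log(hitRate(cases, stubSearch, 0.35)); // 0.5
```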

Results

| Configuration | Hits | Hit Rate | Files passing |
|---|---|---|---|
| Default hybrid (before) | 23/57 | 40.4% | 4/11 |
| Vector-only + lower threshold (after) | 54/57 | 94.7% | 11/11 |

Failure pattern

The 34 queries that failed under default hybrid config (57 − 23) all share the same characteristic: natural language phrasing with no exact keyword overlap with the target document. Direct terminology queries succeeded in every case. This is consistent with the root cause: AND-joined FTS returns nothing → weighted average applies 0.7× penalty → score drops below threshold.

Only 3 queries still fail after workaround

  1. "quality assurance checklist before making claims" — too generic, no semantic overlap with target
  2. "responsive design mobile-first" — completely different domain vocabulary
  3. "which telegram bot should I send this to" — too conversational for vector similarity

These are genuine edge cases rather than systematic failures.


Thank you for building OpenClaw — the memory search architecture is well-designed and the hybrid approach is the right idea. These bugs are subtle (especially the compounding effect) and easy to miss. Happy to provide more details, run additional tests, or help in any way that's useful.
