memory_search: hybrid scoring bugs reduce hit rate from 95% to 40%
Summary
Three compounding bugs in the hybrid search scoring pipeline cause memory_search to filter out valid semantic matches when hybrid.enabled is true (the default). The weighted-average fusion formula applies a 0.7× penalty to vector scores when BM25 returns nothing, AND-joined FTS queries ensure BM25 returns nothing for most natural language queries, and the bm25RankToScore function clamps all BM25 scores to a single value due to incorrect handling of negative FTS5 ranks. Together, these bugs raise the effective minScore threshold from 0.35 to ~0.50, filtering out the majority of semantically relevant results. In a 57-query test across 11 memory files, the hit rate drops from 94.7% (vector-only) to 40.4% (hybrid). The fix involves switching FTS queries from AND-join to OR-join, replacing the weighted-average fusion with Reciprocal Rank Fusion (RRF), and normalizing BM25 scores relative to the result set rather than using absolute rank.
Environment
| Component | Value |
| --- | --- |
| OpenClaw version | 2026.2.13 |
| Node.js | v25.5.0 |
| OS | macOS 15 (Darwin 25.2.0, arm64) |
| Embedding model | text-embedding-3-large (3072 dims) |
| Embedding provider | OpenAI-compatible API |
| Config | Default (no query or hybrid overrides) |
| Memory corpus | 45 files, 564 chunks, avg ~1500 chars/chunk |
Relevant default config (from src/agents/memory-search.ts, lines 81–86):
DEFAULT_MAX_RESULTS = 6
DEFAULT_MIN_SCORE = 0.35
DEFAULT_HYBRID_ENABLED = true
DEFAULT_HYBRID_VECTOR_WEIGHT = 0.7
DEFAULT_HYBRID_TEXT_WEIGHT = 0.3
DEFAULT_HYBRID_CANDIDATE_MULTIPLIER = 4
Reproduction
Steps
- Use default config (no query.hybrid overrides; hybrid search is enabled by default)
- Index several markdown memory files containing varied content
- Run memory_search with natural language queries
Representative failing queries
These queries return "No matches" with default hybrid config but succeed with hybrid.enabled: false:
| Query | Target content | Vector similarity | Hybrid score (0.7×) | Result |
| --- | --- | --- | --- | --- |
| "how should I evaluate if a claim is actually true" | Falsification-first reasoning protocol | 0.438 | 0.307 | ❌ Filtered (< 0.35) |
| "how to think about problems and verify information" | Reasoning & verification methodology | 0.505 | 0.353 | ❌ Filtered (< 0.35) |
| "what to do when a task seems impossible" | Tenacity & persistence protocol | 0.400 | 0.280 | ❌ Filtered (< 0.35) |
| "tone adjustment strangers conversation style" | Audience calibration rules | 0.484 | 0.339 | ❌ Filtered (< 0.35) |
| "what does Michael value and how does he think" | User personality profile | 0.456* | 0.319 | ❌ Filtered (< 0.35) |
| "should I respond to this message or stay quiet" | Group chat participation rules | 0.275* | 0.193 | ❌ Filtered (< 0.35) |
| "where to send urgent notifications" | 7-bot notification routing | 0.447* | 0.313 | ❌ Filtered (< 0.35) |
| "how should I behave when talking to people who aren't Michael" | Audience adaptation for non-Michael users | 0.497* | 0.348 | ❌ Filtered (< 0.35) |
*Scores marked with * are from post-fix vector-only results; the direct cosine similarities follow the same pattern.
Pattern: Every natural language query that lacks exact keyword overlap with the target document gets a BM25 score of 0, which causes the hybrid formula to multiply the vector score by 0.7, dropping it below the 0.35 threshold. Direct terminology queries (e.g., "falsification-first reasoning protocol" → 0.664) succeed because they happen to match FTS keywords as well.
Expected vs actual
| Config | Hit rate | Files passing (≥3/5 queries) |
| --- | --- | --- |
| Default (hybrid enabled, minScore 0.35) | 40.4% (23/57) | 4/11 |
| Hybrid disabled, minScore 0.25, maxResults 10 | 94.7% (54/57) | 11/11 |
Root Cause Analysis
Three bugs compound to systematically suppress natural language queries. Each one is modest in isolation; together they raise the effective threshold from 0.35 to approximately 0.50.
Bug 1: AND-joined FTS queries kill recall
File: src/memory/hybrid.ts, line 23 — buildFtsQuery()
export function buildFtsQuery(raw: string): string | null {
const tokens = raw.match(/[A-Za-z0-9_]+/g)?.map((t) => t.trim()).filter(Boolean) ?? [];
if (tokens.length === 0) return null;
const quoted = tokens.map((t) => `"${t.replaceAll('"', "")}"`);
return quoted.join(" AND "); // ← AND requires ALL tokens in a single chunk
}
For the query "how should I evaluate if a claim is actually true", this produces:
"how" AND "should" AND "I" AND "evaluate" AND "if" AND "a" AND "claim" AND "is" AND "actually" AND "true"
All 10 words must appear verbatim in a single ~1500-character chunk. For natural language queries, this almost never matches, so FTS returns zero results and every document gets textScore = 0.
Impact: The FTS component is effectively dead for natural language queries, so Bug 2 fires on almost every such search.
Bug 2: Weighted-average fusion penalizes FTS misses
File: src/memory/hybrid.ts, line 41 — mergeHybridResults()
const score = params.vectorWeight * entry.vectorScore + params.textWeight * entry.textScore;
// With defaults: score = 0.7 × vectorScore + 0.3 × 0 = 0.7 × vectorScore
When the FTS component returns nothing (due to Bug 1), textScore = 0 for all results. The fusion formula then reduces to score = 0.7 × vectorScore. This means:
- A document needs a vector similarity > 0.50 to pass the 0.35 minScore threshold (0.35 ÷ 0.7 = 0.50)
- Most natural language queries against technical content produce vector similarities in the 0.35–0.50 range — good matches that get filtered out
- The 0.3 textWeight allocation is permanently "wasted," acting as a penalty rather than a boost
Impact: The effective minScore threshold rises from 0.35 to 0.50 for any query where FTS returns nothing.
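The threshold arithmetic above can be checked with a short sketch. The function and constant names below are illustrative and not taken from the OpenClaw source; only the numeric values mirror the defaults quoted earlier.

```typescript
// Illustrative only: names are hypothetical; values mirror the quoted defaults.
const VECTOR_WEIGHT = 0.7;
const TEXT_WEIGHT = 0.3;
const MIN_SCORE = 0.35;

// Weighted-average fusion as described in Bug 2.
function fusedScore(vectorScore: number, textScore: number): number {
  return VECTOR_WEIGHT * vectorScore + TEXT_WEIGHT * textScore;
}

// Vector similarity required to reach MIN_SCORE when textScore is 0.
function effectiveVectorThreshold(): number {
  return MIN_SCORE / VECTOR_WEIGHT;
}

console.log(effectiveVectorThreshold().toFixed(2)); // "0.50"
console.log(fusedScore(0.438, 0) >= MIN_SCORE); // false: 0.307 is below 0.35
```

The second line uses the first row of the failing-query table: a 0.438 cosine similarity would pass the 0.35 threshold on its own but is filtered once the 0.7 multiplier is applied.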
Bug 3: bm25RankToScore produces binary scores
File: src/memory/hybrid.ts, line 36 — bm25RankToScore()
export function bm25RankToScore(rank: number): number {
const normalized = Number.isFinite(rank) ? Math.max(0, rank) : 999;
return 1 / (1 + normalized);
}
SQLite FTS5's bm25() function returns negative values (more negative = better match). The Math.max(0, rank) call clamps all negative values to 0, so 1 / (1 + 0) = 1.0 for every match. BM25 becomes effectively binary: match = 1.0, no match = 0.0, with no gradient between them.
Even when FTS does return results (i.e., Bug 1 doesn't fire), all matches receive the same textScore of 1.0, providing no ranking signal. The 1 / (1 + rank) formula using absolute rank also decays extremely rapidly — rank 1 → 0.5, rank 2 → 0.33 — even if all ranks are conceptually close in quality.
Impact: BM25 provides no useful ranking information even in the rare cases where AND-joined FTS matches succeed.
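The clamping behavior is easy to reproduce in isolation. The sketch below copies the function as quoted above and feeds it a few made-up negative ranks; the sample values are assumptions chosen for illustration, not measurements from the test corpus.

```typescript
// Copy of the current function as quoted in this report.
function bm25RankToScore(rank: number): number {
  const normalized = Number.isFinite(rank) ? Math.max(0, rank) : 999;
  return 1 / (1 + normalized);
}

// Hypothetical FTS5 bm25() values: more negative means a better match.
const sampleRanks = [-7.2, -3.1, -0.4];
const scores = sampleRanks.map(bm25RankToScore);

// Every negative rank is clamped to 0, so every score is 1 / (1 + 0) = 1.
console.log(scores); // [ 1, 1, 1 ]
```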
How these compound
Bug 1 (AND-join) → FTS returns nothing for natural language queries
↓
Bug 2 (weighted average) → score = 0.7 × vectorScore (30% penalty)
↓
Effective threshold: 0.35 / 0.7 = 0.50 (raised from 0.35)
↓
Most semantic matches (0.35–0.50 cosine sim) are filtered out
↓
Hit rate: 40% instead of 95%
When Bug 1 doesn't fire (exact terminology query), Bug 3 means the BM25 boost is always the maximum value, so the hybrid score is 0.7 × vectorScore + 0.3 × 1.0. This inflates scores for keyword matches but provides no gradient. The system works for exact terminology but fails for semantic queries — exactly backwards from what hybrid search is supposed to achieve.
Proposed Fix
Fix 1: Switch FTS query joining from AND to OR
What: Change buildFtsQuery() to join tokens with OR instead of AND, with proper FTS5 special character escaping.
Why: OR-joining means any matching token surfaces the document. BM25 naturally ranks documents matching more terms higher, so multi-word matches still score better than single-word matches — no explicit AND required.
Spec:
- Tokenize the query string (current regex is fine: /[A-Za-z0-9_]+/g)
- Escape FTS5 special characters in tokens (double-quote escaping, etc.)
- Join with OR instead of AND
- Return null for empty token lists (same as current behavior)
Edge cases:
- Single-token queries: Behavior unchanged (no join operator needed)
- Queries with FTS5 special characters (", *, -): Must be properly escaped
- Very long queries: May return too-broad FTS results, but BM25 ranking handles this
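- Stop words: FTS5 tokenizers do not filter stop words, so common tokens such as "a" or "is" will match most chunks under OR-joining. BM25's inverse document frequency weighting gives such tokens little influence on ranking, so explicit filtering is optional rather than required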
Fix 2: Replace weighted-average fusion with Reciprocal Rank Fusion (RRF)
What: Replace the mergeHybridResults() formula from score = vectorWeight × vectorScore + textWeight × textScore to RRF: score = Σ(weight_i / (k + rank_i)).
Why: RRF is rank-based, not score-based. When a document is missing from one list, its contribution is 0 (additive) rather than causing a multiplicative penalty. RRF is a widely used fusion method in hybrid search systems.
Spec:
- For each document, compute: rrfScore = vectorWeight / (k + vectorRank) + textWeight / (k + keywordRank)
- k = 60 (standard constant from the original RRF paper)
- If a document appears only in the vector list, keywordRank contribution is 0 (not penalized)
- If a document appears only in the keyword list, vectorRank contribution is 0
- Ranks are 1-indexed positions within each sorted result list
- The vectorWeight and textWeight config values can still be used as multipliers on each component
Note on minScore: RRF scores fall in a much smaller range than the current weighted-average scores (0–1.0). With k = 60, the maximum possible score is (vectorWeight + textWeight) / (k + 1), which is about 0.016 with the default weights (0.7 + 0.3) and about 0.033 with unit weights. The minScore threshold would therefore need to be adjusted, or the RRF scores would need to be normalized. One approach: normalize RRF scores to 0–1 by dividing by that maximum possible score. Another: use a separate minScore default when RRF is active. This is a design decision best left to the maintainers.
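As a rough illustration of the normalization option, the sketch below divides a weighted RRF score by the best possible score (rank 1 in both lists). Both helper names are hypothetical and do not exist in the OpenClaw source; k = 60 follows the proposal above.

```typescript
// Sketch only: these helpers are hypothetical, not part of the OpenClaw source.
const K = 60; // RRF constant from the proposal

// Weighted RRF score; a missing rank contributes 0.
function rrfScore(
  vectorWeight: number,
  textWeight: number,
  vectorRank?: number,
  keywordRank?: number,
): number {
  const v = vectorRank != null ? vectorWeight / (K + vectorRank) : 0;
  const t = keywordRank != null ? textWeight / (K + keywordRank) : 0;
  return v + t;
}

// Divide by the best possible score (rank 1 in both lists) to map into 0..1.
function normalizeRrfScore(raw: number, vectorWeight: number, textWeight: number): number {
  const maxPossible = (vectorWeight + textWeight) / (K + 1);
  return maxPossible > 0 ? raw / maxPossible : 0;
}

const both = normalizeRrfScore(rrfScore(0.7, 0.3, 1, 1), 0.7, 0.3);
const vectorOnly = normalizeRrfScore(rrfScore(0.7, 0.3, 1, undefined), 0.7, 0.3);
console.log(both.toFixed(2), vectorOnly.toFixed(2)); // prints: 1.00 0.70
```

One consequence worth weighing: under this normalization a vector-only hit at rank 1 scores about 0.70 whatever its cosine similarity, so minScore would act as a rank cutoff rather than a similarity cutoff.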
Edge cases:
- All results from vector only (FTS returns nothing): Equivalent to vector-only ranking, no penalty
- All results from FTS only (vector returns nothing): Equivalent to keyword-only ranking
- Duplicate documents in both lists: Scores add, correctly boosting documents found by both methods
- k value sensitivity: 60 is well-tested in literature; could be made configurable
Fix 3: Fix BM25 score normalization
What: Replace bm25RankToScore() to correctly handle FTS5's negative rank values and provide a gradient across results.
Why: FTS5 bm25() returns negative values where more negative = better. The current Math.max(0, rank) clamps every negative value to 0, making every match score 1.0.
Spec:
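- For the basic fix, convert negative ranks with relevance = -rank and score = relevance / (1 + relevance), so that more negative (better) ranks map to higher scores. Simply swapping Math.max(0, rank) for Math.abs(rank) inside 1 / (1 + x) would invert the ordering: a rank of -10 would score 0.09 while a rank of -1 would score 0.5. Or
- Use min-max normalization across the result set: normalizedScore = (maxRank - rank) / (maxRank - minRank) to produce a meaningful 0–1 gradient
- Single-result sets should get score 1.0 (no range to normalize)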
Note: If Fix 2 (RRF) is adopted, this fix becomes less critical since RRF uses rank positions rather than scores. However, the bm25RankToScore function is still called in searchKeyword() (in manager-search.ts, assigned to score and textScore), so fixing it would improve keyword-only search quality as well.
Edge cases:
- Single FTS result: Should score 1.0 (currently does, and normalization should preserve this)
- FTS5 returning 0.0 rank: Handle gracefully (edge case with no term frequency data)
Acceptance criteria
- Natural language queries return relevant results: Queries like "what to do when a task seems impossible" should return documents about persistence/tenacity methodology with hybrid enabled.
- No regression on exact terminology: Queries with exact document terms (e.g., "falsification-first reasoning protocol") should continue to work and ideally rank higher (boosted by both vector and keyword signals).
- FTS misses don't suppress vector matches: A document with a strong vector score (e.g., 0.45) should not be filtered out just because it didn't match the FTS query.
- BM25 provides ranking gradient: When multiple documents match the FTS query, they should receive differentiated scores rather than all receiving 1.0.
- Hit rate ≥ 90% on a test suite of natural language, contextual, and exact terminology queries against a representative memory corpus.
Reference Implementation
Note: These are reference implementations for the proposed approach. The maintainers know the codebase better and may choose a different implementation strategy. These are provided to show that the approach works, not to prescribe a specific solution.
RRF fusion — replacement for mergeHybridResults()
Click to expand: rrf-fusion.ts
/**
* Reciprocal Rank Fusion — drop-in replacement for mergeHybridResults()
*
* RRF Formula: RRF_score(d) = Σ(weight_i / (k + rank_i(d)))
* k=60 (standard from original RRF paper)
*
* Key difference from weighted average:
* - Weighted avg: score = 0.7 × vectorScore + 0.3 × 0 = 0.7 × vectorScore (suppressed!)
* - RRF: score = vectorWeight/(k+vectorRank) + 0 = vectorWeight/(k+vectorRank) (no suppression)
*/
const DEFAULT_K = 60;
export function mergeHybridResults(params: {
vector: HybridVectorResult[];
keyword: HybridKeywordResult[];
vectorWeight: number;
textWeight: number;
k?: number;
}): Array<{
path: string;
startLine: number;
endLine: number;
score: number;
snippet: string;
source: HybridSource;
}> {
const k = params.k ?? DEFAULT_K;
const byId = new Map<string, {
id: string; path: string; startLine: number; endLine: number;
source: HybridSource; snippet: string;
vectorRank?: number; keywordRank?: number;
}>();
// Track rank positions (1-indexed)
for (let i = 0; i < params.vector.length; i++) {
const r = params.vector[i];
byId.set(r.id, {
id: r.id, path: r.path, startLine: r.startLine, endLine: r.endLine,
source: r.source, snippet: r.snippet, vectorRank: i + 1,
});
}
for (let i = 0; i < params.keyword.length; i++) {
const r = params.keyword[i];
const existing = byId.get(r.id);
if (existing) {
existing.keywordRank = i + 1;
if (r.snippet?.length) existing.snippet = r.snippet;
} else {
byId.set(r.id, {
id: r.id, path: r.path, startLine: r.startLine, endLine: r.endLine,
source: r.source, snippet: r.snippet, keywordRank: i + 1,
});
}
}
const merged = Array.from(byId.values()).map((entry) => {
const vectorContrib = entry.vectorRank != null
? params.vectorWeight / (k + entry.vectorRank) : 0;
const keywordContrib = entry.keywordRank != null
? params.textWeight / (k + entry.keywordRank) : 0;
return {
path: entry.path, startLine: entry.startLine, endLine: entry.endLine,
score: vectorContrib + keywordContrib,
snippet: entry.snippet, source: entry.source,
};
});
return merged.toSorted((a, b) => b.score - a.score);
}
OR-joined FTS query builder — replacement for buildFtsQuery()
Click to expand: fts-query.ts
/**
* OR-joined FTS5 query builder with proper escaping.
*
* OR-joining means any matching token surfaces the document.
* BM25 naturally ranks multi-word matches higher.
*/
export function buildFtsQuery(raw: string): string | null {
  const tokens = raw.match(/[A-Za-z0-9_]+/g) ?? [];
  if (tokens.length === 0) return null;
  // Always quote each token: a bare uppercase AND, OR, or NOT would otherwise
  // be parsed as an FTS5 operator. Tokens match [A-Za-z0-9_]+, so they can
  // never contain a double quote and need no further escaping.
  return tokens.map((token) => `"${token}"`).join(" OR ");
}
BM25 normalization — replacement for bm25RankToScore()
Click to expand: bm25-normalize.ts
/**
* Normalize BM25 scores to 0–1 using min-max normalization.
*
* FTS5 bm25() returns negative values (more negative = better match).
* Current implementation clamps negatives to 0, making all scores = 1.0.
*
* This normalizes relative to the result set:
* best match → 1.0, worst match → ~0.0
*/
export function normalizeBm25Scores<T extends { rank: number }>(
results: T[]
): (T & { normalizedScore: number })[] {
if (results.length === 0) return [];
const ranks = results.map((r) => r.rank);
const minRank = Math.min(...ranks); // Best (most negative)
const maxRank = Math.max(...ranks); // Worst (least negative)
const range = maxRank - minRank;
return results.map((result) => ({
...result,
normalizedScore: range === 0 ? 1.0 : (maxRank - result.rank) / range,
}));
}
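/**
 * Minimal fix for bm25RankToScore (if batch normalization isn't desired):
 * FTS5 ranks are negative, with more negative meaning a better match, so negate
 * the rank to get a positive relevance and map it monotonically into (0, 1).
 * Math.abs(rank) inside 1 / (1 + x) would invert the order and score better
 * matches lower.
 */
export function bm25RankToScore(rank: number): number {
  if (!Number.isFinite(rank)) return 1 / (1 + 999);
  if (rank < 0) {
    const relevance = -rank;
    return relevance / (1 + relevance);
  }
  // Non-negative ranks are not expected from FTS5 bm25(); keep the old decay.
  return 1 / (1 + rank);
}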
Workaround
Until this is fixed upstream, the following config achieves 94.7% hit rate by bypassing hybrid search entirely:
{
"agents": {
"defaults": {
"memorySearch": {
"query": {
"hybrid": {
"enabled": false
},
"minScore": 0.25,
"maxResults": 10
}
}
}
}
}
This disables the hybrid pipeline, restores the raw cosine similarity as the score, lowers the threshold to catch moderate-similarity matches, and increases the result count. The trade-off is losing keyword matching entirely — there's no BM25 boost for exact terminology matches.
Validation
Test methodology
- 57 queries across 11 memory files covering diverse content types (behavioral protocols, technical reference, personality profiles, routing rules)
- 5–6 queries per file using varied phrasings:
- Direct terminology — exact terms from the document
- Natural language — how a user/agent would actually phrase the question
- Adjacent vocabulary — synonyms and related terms
- Vague/abstract — conceptual queries without specific terms
- Cross-domain — terms from a different field that map to the same concept
- Hit criteria: Target file appears in results with score ≥ minScore
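The hit criterion above can be expressed as a small evaluation loop. This is a minimal sketch: the SearchHit, QueryCase, and SearchFn types and the hitRate function are hypothetical names for illustration, and the search function is modeled as synchronous for brevity (a real memory_search call is probably asynchronous).

```typescript
// Sketch only: all names here are hypothetical, not OpenClaw APIs.
interface SearchHit {
  path: string;
  score: number;
}

interface QueryCase {
  query: string;
  targetPath: string; // file expected to appear in the results
}

type SearchFn = (query: string) => SearchHit[];

// A query counts as a hit when the target file appears with score >= minScore.
function hitRate(cases: QueryCase[], search: SearchFn, minScore: number): number {
  if (cases.length === 0) return 0;
  let hits = 0;
  for (const c of cases) {
    const results = search(c.query);
    if (results.some((r) => r.path === c.targetPath && r.score >= minScore)) {
      hits += 1;
    }
  }
  return hits / cases.length;
}
```

Running the same query set through this loop with hybrid enabled and disabled gives directly comparable hit rates.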
Results
| Configuration | Hits | Hit Rate | Files passing |
| --- | --- | --- | --- |
| Default hybrid (before) | 23/57 | 40.4% | 4/11 |
| Vector-only + lower threshold (after) | 54/57 | 94.7% | 11/11 |
Failure pattern
The 34 queries that failed under the default hybrid config (57 total minus 23 hits) all share the same characteristic: natural language phrasing with no exact keyword overlap with the target document. Direct terminology queries succeeded in every case. This is consistent with the root cause: AND-joined FTS returns nothing → weighted average applies 0.7× penalty → score drops below threshold.
Only 3 queries still fail after workaround
- "quality assurance checklist before making claims": too generic, no semantic overlap with target
- "responsive design mobile-first": completely different domain vocabulary
- "which telegram bot should I send this to": too conversational for vector similarity
These are genuine edge cases rather than systematic failures.
Thank you for building OpenClaw — the memory search architecture is well-designed and the hybrid approach is the right idea. These bugs are subtle (especially the compounding effect) and easy to miss. Happy to provide more details, run additional tests, or help in any way that's useful.