Active-memory embedded memory_search intermittently loses embedding provider and falls back to FTS-only #89691

@joeykrug

Summary

active-memory's embedded recall path can intermittently run memory_search with no visible memory embedding provider, even though the Gateway immediately reports the configured OpenAI memory embedding provider as loaded, active, and healthy.

When this happens, memory-core logs:

[memory] search: embeddings unavailable; using keyword-only results: Cannot embed query in FTS-only mode (no embedding provider)

The user-facing turn then shows an active-memory timeout, for example:

Active Memory: status=timeout elapsed=30.0s query=recent

This is related to, but not the same as, #89651 / #89652. That issue/PR addresses the startup gap where agents.defaults.memorySearch.provider = "openai" did not load the plugin owning contracts.memoryEmbeddingProviders: ["openai"]. The behavior here still appears after the OpenAI plugin is loaded and health checks are green, so it looks like a runtime provider hydration / provider registry visibility race in the embedded active-memory search path.

Why this is bad

active-memory is a blocking before_prompt_build hook. A single providerless memory_search can delay normal replies for the full active-memory watchdog window.

In the observed setup:

  • active-memory.timeoutMs = 30000
  • active-memory.circuitBreakerMaxTimeouts = 3
  • active-memory.circuitBreakerCooldownMs = 180000

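With these settings, and assuming the breaker trips after three timeouts, the Gateway can spend ~30 seconds on hidden recall in each of three turns (about 90 seconds in total) before it skips further active-memory attempts for the 180-second cooldown.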
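As a rough illustration of how those three settings interact, here is a toy model of a timeout-counting circuit breaker. The class, its method names, and the "N timeouts, then one cooldown" semantics are assumptions made for this sketch; it is not active-memory's actual implementation.

```typescript
// Toy model only: names and semantics are hypothetical, chosen to mirror
// the config keys timeoutMs, circuitBreakerMaxTimeouts, circuitBreakerCooldownMs.
class TimeoutBreaker {
  private timeouts = 0;
  private openUntilMs = 0;
  private readonly maxTimeouts: number;
  private readonly cooldownMs: number;

  constructor(maxTimeouts: number, cooldownMs: number) {
    this.maxTimeouts = maxTimeouts;
    this.cooldownMs = cooldownMs;
  }

  // True while the breaker is open and active-memory should be skipped.
  shouldSkip(nowMs: number): boolean {
    return nowMs < this.openUntilMs;
  }

  // Count one watchdog timeout; open the breaker once the limit is reached.
  recordTimeout(nowMs: number): void {
    this.timeouts += 1;
    if (this.timeouts >= this.maxTimeouts) {
      this.openUntilMs = nowMs + this.cooldownMs;
      this.timeouts = 0;
    }
  }
}

// Simulate back-to-back turns where every recall attempt hits the watchdog.
// Skipped turns are modelled as taking no time.
function simulateBlockedMs(turns: number, timeoutMs: number, breaker: TimeoutBreaker): number {
  let nowMs = 0;
  let blockedMs = 0;
  for (let turn = 0; turn < turns; turn++) {
    if (breaker.shouldSkip(nowMs)) continue;
    nowMs += timeoutMs;
    blockedMs += timeoutMs;
    breaker.recordTimeout(nowMs);
  }
  return blockedMs;
}

const blockedMs = simulateBlockedMs(5, 30000, new TimeoutBreaker(3, 180000));
console.log(blockedMs); // 90000
```

Under this model, five consecutive turns cost three full 30-second waits before the remaining two are skipped.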

Also, timeout_partial is not triggered in this case because the hidden active-memory subagent has not produced any assistant summary text yet. It is still waiting on memory_search, so the timeout result has no partial answer to recover.

Environment / config shape

Version:

OpenClaw 2026.5.28 (e932160)

Relevant config:

{
  "plugins": {
    "entries": {
      "active-memory": {
        "enabled": true,
        "config": {
          "enabled": true,
          "agents": ["main"],
          "allowedChatTypes": ["direct"],
          "queryMode": "recent",
          "promptStyle": "balanced",
          "timeoutMs": 30000,
          "maxSummaryChars": 800,
          "recentUserTurns": 3,
          "recentAssistantTurns": 2,
          "cacheTtlMs": 30000,
          "circuitBreakerMaxTimeouts": 3,
          "circuitBreakerCooldownMs": 180000,
          "model": "anthropic/claude-sonnet-4-6"
        }
      },
      "memory-core": { "enabled": true },
      "openai": { "enabled": true }
    }
  },
  "agents": {
    "defaults": {
      "memorySearch": {
        "enabled": true,
        "sources": ["memory", "sessions"],
        "experimental": { "sessionMemory": true },
        "provider": "openai",
        "model": "text-embedding-3-large",
        "query": {
          "hybrid": {
            "enabled": true,
            "vectorWeight": 0.7,
            "textWeight": 0.3,
            "candidateMultiplier": 4,
            "mmr": { "enabled": true, "lambda": 0.7 }
          }
        }
      }
    }
  }
}

Immediately after the failure, openclaw memory status --deep --json reported the main agent memory backend as healthy:

{
  "provider": "openai",
  "model": "text-embedding-3-large",
  "requestedProvider": "openai",
  "sources": ["memory", "sessions"],
  "fts": { "enabled": true, "available": true },
  "vector": {
    "enabled": true,
    "storeAvailable": true,
    "semanticAvailable": true,
    "available": true,
    "dims": 3072
  },
  "custom": {
    "searchMode": "hybrid",
    "providerState": { "mode": "active", "providerId": "openai" }
  },
  "embeddingProbe": { "ok": true }
}

And openclaw plugins inspect openai --json reported the OpenAI plugin as loaded/activated and advertising/registering the memory embedding contract:

{
  "plugin": {
    "id": "openai",
    "enabled": true,
    "activated": true,
    "status": "loaded",
    "memoryEmbeddingProviderIds": ["openai"],
    "contracts": {
      "memoryEmbeddingProviders": ["openai"]
    }
  }
}

Observed behavior

Timeline from a real run:

  1. active-memory begins an embedded recall for queryMode=recent.
  2. The hidden recall subagent calls memory_search.
  3. memory-core tries to run hybrid search, loads keyword candidates, then attempts to embed the query.
  4. In this embedded path, this.provider is null, so embedQueryWithRetry() throws:
    Cannot embed query in FTS-only mode (no embedding provider)
    
  5. The search path catches that and logs keyword-only fallback:
    memory search: embeddings unavailable; using keyword-only results: Cannot embed query in FTS-only mode (no embedding provider)
    
  6. The hidden recall does not produce assistant summary text before the active-memory watchdog fires.
  7. The user-facing turn gets only:
    Active Memory: status=timeout elapsed=30.0s query=recent
    
  8. Health checks immediately after show OpenAI memory embeddings active and probeable.

Source pointers

The failure string is thrown when the memory manager has no provider handle:

// extensions/memory-core/src/memory/manager-embedding-ops.ts
protected async embedQueryWithRetry(text: string): Promise<number[]> {
  const provider = this.provider;
  if (!provider) {
    throw new Error("Cannot embed query in FTS-only mode (no embedding provider)");
  }
  ...
}

The hybrid search path catches that and deliberately falls back to keyword-only results if FTS is available:

// extensions/memory-core/src/memory/manager.ts
try {
  queryVec = await this.embedQueryWithRetry(cleaned);
} catch (err) {
  ...
  if (activatedFallback) {
    ...
  } else if (!this.provider && this.fts.enabled && this.fts.available) {
    log.warn(`memory search: embeddings unavailable; using keyword-only results: ${message}`);
    return this.selectScoredResults(keywordResults, maxResults, minScore, 0);
  } else {
    throw err;
  }
}

That fallback is reasonable when the runtime is genuinely FTS-only. The bug is that this path is being entered while the configured provider is openai and the Gateway reports that provider as loaded/active/healthy before and after the failed active-memory turn.

active-memory then races the recall subagent against its watchdog:

// extensions/active-memory/index.ts
const controller = new AbortController();
const timeoutId = setTimeout(() => {
  controller.abort(new Error(`active-memory timeout after ${watchdogTimeoutMs}ms`));
}, watchdogTimeoutMs);
...
const raceResult = await Promise.race([
  subagentPromise,
  timeoutPromise,
  terminalMemorySearchWatch.promise,
]);

If no assistant text has been written yet, the timeout result is plain status: "timeout", not timeout_partial:

// extensions/active-memory/index.ts
const summary = truncateSummary(normalizeActiveSummary(rawReply ?? "") ?? "", params.maxSummaryChars);
if (summary.length === 0) {
  return { status: "timeout", elapsedMs: params.elapsedMs, summary: null, searchDebug };
}
return { status: "timeout_partial", elapsedMs: params.elapsedMs, summary, searchDebug };

Expected behavior

If agents.defaults.memorySearch.provider = "openai" and the OpenAI plugin is loaded/activated with memoryEmbeddingProviderIds: ["openai"] registered, every memory_search invocation from active-memory should see the same provider registry and use vector/hybrid search.

If the configured provider is temporarily not visible, the system should fail fast with a structured provider-unavailable result and enough diagnostics to identify the runtime/provider scope, rather than silently doing a slow FTS-only fallback inside a blocking pre-prompt hook.
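A minimal sketch of what such a fail-fast decision could look like. Every type and function name here is hypothetical, and treating an unset or "none" provider as the only intentional FTS-only case is an assumption, not confirmed memory-core behavior.

```typescript
// Hypothetical result shape for the "provider handle is null" branch.
type MissingProviderDecision =
  | { kind: "keyword-only"; reason: string }
  | {
      kind: "provider-unavailable";
      requestedProvider: string;
      registeredProviderIds: string[];
      managerFromCache: boolean;
    };

// Hypothetical inputs available at the point where the fallback is chosen.
interface MissingProviderContext {
  requestedProvider: string | undefined;
  registeredProviderIds: string[];
  managerFromCache: boolean;
}

function decideOnMissingProvider(ctx: MissingProviderContext): MissingProviderDecision {
  const requested = ctx.requestedProvider;
  // Assumption: no provider, or "none", means the runtime is FTS-only by design.
  if (requested === undefined || requested === "none") {
    return { kind: "keyword-only", reason: "no embedding provider configured" };
  }
  // A provider was configured but is not visible: report it instead of degrading silently.
  return {
    kind: "provider-unavailable",
    requestedProvider: requested,
    registeredProviderIds: ctx.registeredProviderIds.slice(),
    managerFromCache: ctx.managerFromCache,
  };
}

const decision = decideOnMissingProvider({
  requestedProvider: "openai",
  registeredProviderIds: ["openai"],
  managerFromCache: true,
});
console.log(decision.kind); // "provider-unavailable"
```

A caller such as the active-memory hook could then return early on "provider-unavailable" rather than waiting out the watchdog.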

Actual behavior

One embedded active-memory recall can see this.provider === null and enter FTS-only/keyword-only fallback, while separate live checks show:

  • provider=openai
  • model=text-embedding-3-large
  • providerState=active
  • vector semantic search available
  • embedding probe OK
  • OpenAI plugin loaded and activated
  • memoryEmbeddingProviderIds=["openai"]

Suspected root cause

Likely a provider hydration / runtime registry visibility race specific to embedded active-memory memory tool execution. Possibilities:

  • the memory search manager instance is created before plugin-owned memory embedding providers are visible and is later reused in providerless state;
  • the embedded recall runtime has a different runtime config / provider registry snapshot than the main Gateway runtime;
  • active-memory's hidden subagent/tool context gets memory_search before capability provider hydration completes;
  • experimental.sessionMemory=true / sources=["memory", "sessions"] increases the cost of the FTS-only fallback enough that the active-memory watchdog consistently wins the race;
  • parent abort propagation does not fully cancel or drain the in-flight memory_search, so the child operation can keep running until its own tool watchdog.
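The first possibility can be illustrated with a toy model. The registry and both manager classes below are hypothetical and are not taken from memory-core; the sketch only shows how a manager that resolves its provider once at construction stays providerless after the provider registers, while one that rechecks on use recovers.

```typescript
// Hypothetical stand-ins for a provider and a plugin-owned provider registry.
type EmbeddingProvider = { id: string; embed(text: string): number[] };

class ProviderRegistry {
  private readonly providers = new Map<string, EmbeddingProvider>();

  register(provider: EmbeddingProvider): void {
    this.providers.set(provider.id, provider);
  }

  get(id: string): EmbeddingProvider | null {
    return this.providers.get(id) ?? null;
  }
}

// Resolves the provider once, at construction time, and never looks again.
class SnapshotManager {
  private readonly provider: EmbeddingProvider | null;

  constructor(registry: ProviderRegistry, requestedId: string) {
    this.provider = registry.get(requestedId);
  }

  searchMode(): "hybrid" | "fts-only" {
    return this.provider ? "hybrid" : "fts-only";
  }
}

// Rechecks the registry on each search while the provider is still missing.
class RecheckingManager {
  private provider: EmbeddingProvider | null;
  private readonly registry: ProviderRegistry;
  private readonly requestedId: string;

  constructor(registry: ProviderRegistry, requestedId: string) {
    this.registry = registry;
    this.requestedId = requestedId;
    this.provider = registry.get(requestedId);
  }

  searchMode(): "hybrid" | "fts-only" {
    if (!this.provider) {
      this.provider = this.registry.get(this.requestedId);
    }
    return this.provider ? "hybrid" : "fts-only";
  }
}

const registry = new ProviderRegistry();
const snapshot = new SnapshotManager(registry, "openai");
const rechecking = new RecheckingManager(registry, "openai");

// The provider becomes visible only after both managers were created.
registry.register({ id: "openai", embed: () => [0] });

console.log(snapshot.searchMode());   // "fts-only"
console.log(rechecking.searchMode()); // "hybrid"
```

If the real manager is cached and reused in the snapshot style, a health check that builds a fresh manager would report the provider as active while the cached instance keeps falling back.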

Suggested fix shape

  1. Add regression coverage for active-memory invoking memory_search with:

    • agents.defaults.memorySearch.provider = "openai"
    • plugin-owned memoryEmbeddingProviders: ["openai"]
    • sources=["memory", "sessions"]
    • experimental.sessionMemory=true
    • the OpenAI memory embedding provider already loaded/registered.
  2. Ensure the memory manager used by dynamic/embedded tool calls rehydrates or rechecks the configured provider before accepting FTS-only fallback when the configured provider is non-local and non-none.

  3. Add diagnostics on the FTS-only fallback branch:

    • requested provider id
    • available registered memory embedding provider ids
    • plugin ids loaded in the active runtime
    • agent id / session key / embedded-active-memory marker
    • whether the manager was reused from cache and when its provider was resolved.
  4. Treat provider-null fallback differently from normal zero-hit FTS results in active-memory. A configured-provider-missing result should become a structured memory_search unavailable result quickly, so the active-memory hook can fast-fail instead of waiting for the full watchdog.

  5. Make active-memory timeout abort propagate down to in-flight memory_search and embedding/query work, or ensure late child work is cancelled/drained immediately after the parent timeout.
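For item 5, one possible shape is to derive a child AbortController from the watchdog's signal and have the search work check it between steps. This is a sketch under assumptions: linkAbort and runSearchSteps are hypothetical helpers, and the step list stands in for real memory_search and embedding work.

```typescript
// Hypothetical helper: returns a child controller that aborts when the parent signal does.
function linkAbort(parent: AbortSignal): AbortController {
  const child = new AbortController();
  if (parent.aborted) {
    child.abort();
  } else {
    parent.addEventListener("abort", () => child.abort(), { once: true });
  }
  return child;
}

// Stand-in for chunked search work that cooperatively checks the signal
// before starting each step. Returns how many steps actually ran.
function runSearchSteps(steps: Array<() => void>, signal: AbortSignal): number {
  let completed = 0;
  for (const step of steps) {
    if (signal.aborted) break;
    step();
    completed += 1;
  }
  return completed;
}

const watchdog = new AbortController();
const child = linkAbort(watchdog.signal);

// The second step simulates the watchdog firing while the search is in flight.
const completed = runSearchSteps(
  [() => {}, () => watchdog.abort(), () => {}],
  child.signal,
);
console.log(completed); // 2
```

The third step never runs because the parent abort reaches the child signal synchronously, which is the property the hook needs so that late child work does not outlive the parent timeout.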

Related work

This issue is the follow-up runtime case to #89651 / #89652: the provider can be loaded and healthy, but an embedded active-memory memory_search call still sometimes observes no provider and drops into FTS-only mode.

Metadata

Labels

  • P1: High-priority user-facing bug, regression, or broken workflow.
  • clawsweeper:needs-live-repro: ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
  • impact:auth-provider: Auth, provider routing, model choice, or SecretRef resolution may break.
  • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🐚 platinum hermit: Good issue quality with a plausible reproduction path needing some confirmation.
