Reliability: active-memory blocks replies and QMD boot initialization can overload multi-agent gateways

# Reliability suggestion: active-memory blocks replies and QMD boot initialization can overload gateway in 2026.4.24

## Summary

On a multi-agent OpenClaw gateway, enabling the official `active-memory` plugin can make normal replies slow or unreliable. The main issue is that `active-memory` currently runs a full embedded agent/model call inside `before_prompt_build`, using the active conversation model by default, and waits for it before the actual user reply is built.

In the same environment, QMD memory startup initialization arms memory managers for all configured agents on gateway boot. Each manager can start its own boot update. This is useful, but when many agents are configured it can create a burst of QMD work at startup. Combined with active-memory running per user message, the gateway can experience high CPU, long response latency, and timeout cascades.

This is not a transcript duplication issue. It is a separate reliability/defaults concern.

## Environment

- OpenClaw version: `2026.4.24`
- Platform: macOS ARM64
- Runtime: Gateway with multiple agents
- Active model observed: `openai-codex/gpt-5.5`
- QMD backend enabled for memory search
- `memory.qmd.update.onBoot = true`
- `memory.qmd.update.embedInterval = 30m`
- `memory.qmd.limits.timeoutMs = 40000`
- `active-memory.timeoutMs = 30000`
- `active-memory.queryMode = recent`
- `active-memory.maxSummaryChars = 220`

## Observed behavior

When `active-memory` was enabled, every eligible interactive user message triggered lines like:

```text
active-memory: agent=tino session=... activeProvider=openai-codex activeModel=gpt-5.5 start timeoutMs=30000 queryChars=1505
active-memory: agent=tino session=... activeProvider=openai-codex activeModel=gpt-5.5 done status=timeout elapsedMs=40808 summaryChars=0
active-memory: agent=tino session=... activeProvider=openai-codex activeModel=gpt-5.5 done status=timeout elapsedMs=63548 summaryChars=0
active-memory: agent=tino session=... activeProvider=openai-codex activeModel=gpt-5.5 done status=empty elapsedMs=29864 summaryChars=0
```

So a 30s active-memory timeout added roughly 30–64s of extra latency per message (two of the three completed runs above overran the configured timeout) and often returned no usable memory (`summaryChars=0`).

Separately, on gateway boot/restart, logs repeatedly showed:

```text
qmd memory startup initialization armed for 10 agents: "tino", "jonathan", "betalpha-social", "jccat", "analyst", "news", "reporter", "social", "travel-pm", "family-pm"
```

This means boot initialization is not limited to the currently active agent/session. It initializes QMD memory for every configured agent with memory search enabled.

## Current implementation notes

### active-memory blocks prompt construction

`extensions/active-memory/index.js` registers:

```js
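api.on("before_prompt_build", async (event, ctx) => {
  // ... (elided)
  // Awaited: prompt construction for the user reply waits until recall settles.
  const result = await maybeResolveActiveRecall(...);
  // No summary recalled: return without injecting any context.
  if (!result.summary) return;
  return { prependContext: promptPrefix };
});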
```

`maybeResolveActiveRecall(...)` calls `runRecallSubagent(...)`.

`runRecallSubagent(...)` uses:

```js
params.api.runtime.agent.runEmbeddedPiAgent({
  provider: modelRef.provider,
  model: modelRef.model,
  timeoutMs: params.config.timeoutMs,
  toolsAllow: ["memory_search", "memory_get"],
  bootstrapContextMode: "lightweight",
  silentExpected: true,
  ...
});
```

The model defaults to the current run model / agent primary model unless `active-memory.config.model` is explicitly set. In this environment that meant `openai-codex/gpt-5.5` was used as a per-message memory recall subagent.

### QMD onBoot initializes all agents

`server-startup-memory-kT6lKCrb.js` does:

```js
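// Walks every configured agent, not only the currently active one.
const agentIds = listAgentIds(params.cfg);
for (const agentId of agentIds) {
  // Skip agents without memory search or with a non-QMD backend.
  if (!resolveMemorySearchConfig(params.cfg, agentId)) continue;
  const resolved = resolveActiveMemoryBackendConfig({ cfg: params.cfg, agentId });
  if (resolved.backend !== "qmd") continue;
  // Creates (arms) the QMD manager for this agent during gateway boot.
  await getActiveMemorySearchManager({ cfg: params.cfg, agentId });
  armedAgentIds.push(agentId);
}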
```

Each QMD manager can then run a boot update because `qmd.update.onBoot` is true. In `qmd-manager-LLKxprVD.js`, `initialize(...)` starts:

```js
if (this.qmd.update.onBoot) {
  const bootRun = this.runUpdate("boot", true);
  ...
}
```

QMD update queueing is per `qmdDir`, so different agents can still start separate boot updates. The embed lock is global, but the update phase can still create a startup burst across agents.

## Impact

- User replies are delayed before the actual model run begins.
- A failed/empty memory lookup can add tens of seconds while providing no useful context.
- Gateway CPU can spike during boot or restart when many agents arm QMD memory managers.
- Active-memory and QMD startup work can overlap, creating timeout cascades.
- Operators may think normal chat or model latency is broken, when the delay is pre-prompt memory recall.

## Why this is surprising

The feature name suggests a lightweight memory retrieval layer, but the default behavior is closer to: run another full embedded LLM turn before each eligible user reply, using the same active model unless configured otherwise.

That may be powerful, but it is unsafe as a default for slow/expensive models or high-traffic agents.

## Suggested fixes for next release

### 1. Make active-memory fail-open and non-blocking by default

Do not block the actual user reply on active-memory unless explicitly configured.

Possible modes:

- `mode: "nonblocking"` default: start recall opportunistically; only inject if it returns very quickly.
- `mode: "blocking"` opt-in: current behavior for operators who want maximum recall.
- `deadlineMs`: hard budget for pre-prompt recall, default maybe 1000–3000ms.

If recall misses the deadline, skip injection and let the user reply proceed.
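
A minimal sketch of the two modes, not the plugin's actual code. `resolveRecallContext` is a hypothetical helper, `recall` stands in for `maybeResolveActiveRecall(...)`, and `mode` / `deadlineMs` are the option names proposed above rather than existing settings. For brevity the summary is injected directly as `prependContext`. In `nonblocking` mode the hook waits at most `deadlineMs` and treats a miss or an error as "no memory":

```js
// Hypothetical sketch: `recall` is any async function returning { summary }.
async function resolveRecallContext(recall, { mode = "nonblocking", deadlineMs = 2000 } = {}) {
  // Fail open: a thrown or rejected recall counts as "no memory found".
  const attempt = Promise.resolve()
    .then(recall)
    .catch(() => null);

  if (mode === "blocking") {
    // Opt-in behavior: wait however long the recall takes.
    const result = await attempt;
    return result && result.summary ? { prependContext: result.summary } : undefined;
  }

  let timer;
  const deadline = new Promise((resolve) => {
    timer = setTimeout(() => resolve(null), deadlineMs);
  });
  try {
    // Whichever settles first wins; a late recall result is simply ignored.
    const result = await Promise.race([attempt, deadline]);
    return result && result.summary ? { prependContext: result.summary } : undefined;
  } finally {
    clearTimeout(timer);
  }
}
```

Returning `undefined` mirrors the existing hook's early `return` when there is no summary, so the reply path would not need to distinguish "recall was slow" from "nothing relevant was found".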

### 2. Use a cheap/fast recall model by default, not the current conversation model

If no `active-memory.config.model` is set, default to a lightweight model profile rather than `ctx.modelProviderId/ctx.modelId` or the agent primary model.

At minimum, warn loudly when active-memory inherits a slow/high-cost model.
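
One possible resolution order, sketched under assumptions: `resolveRecallModel` and the `fastDefault` parameter are hypothetical, and the `provider/model` string form follows how `openai-codex/gpt-5.5` is written in this report. The point is only that inheriting the conversation model becomes the last resort and carries a warning:

```js
// Hypothetical sketch of the proposed default order for the recall model.
function resolveRecallModel(pluginConfig, ctx, fastDefault) {
  // 1. An explicit active-memory model always wins.
  if (pluginConfig.model) {
    return { model: pluginConfig.model, source: "configured" };
  }
  // 2. Otherwise prefer a lightweight default, if the gateway defines one.
  if (fastDefault) {
    return { model: fastDefault, source: "fast-default" };
  }
  // 3. Last resort: inherit the conversation model, but say so loudly.
  const inherited = `${ctx.modelProviderId}/${ctx.modelId}`;
  return {
    model: inherited,
    source: "inherited",
    warning: `active-memory inherits ${inherited}; every eligible message runs a recall turn on it`,
  };
}
```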

### 3. Enforce hard timeout cancellation

Observed elapsed time exceeded configured `timeoutMs` substantially (`30000` configured, `40808` / `63548` observed). The abort signal may not stop the embedded run promptly.

The active-memory timeout should be a hard wall-clock budget for the pre-prompt hook.
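
A sketch of what a hard budget could look like, assuming the embedded run can be wrapped in a function that accepts an `AbortSignal`. `runWithHardBudget` and the `task` callback are hypothetical names, and whether `runEmbeddedPiAgent` accepts a signal is not established by this report. The wrapper settles at the budget even when the task ignores the abort request, which is the property the observed `elapsedMs` values suggest is missing:

```js
// Hypothetical sketch: `task` receives an AbortSignal and returns a promise.
async function runWithHardBudget(task, budgetMs) {
  const controller = new AbortController();
  const startedAt = Date.now();
  let timer;
  const budget = new Promise((resolve) => {
    timer = setTimeout(() => {
      controller.abort(); // request cancellation of the embedded run
      resolve({ status: "timeout" });
    }, budgetMs);
  });
  // Normalize success and failure so the race itself never rejects.
  const run = Promise.resolve()
    .then(() => task(controller.signal))
    .then(
      (value) => ({ status: "ok", value }),
      (error) => ({ status: "error", error }),
    );
  try {
    // Settles at the budget even if `task` ignores the abort signal.
    const outcome = await Promise.race([run, budget]);
    return { ...outcome, elapsedMs: Date.now() - startedAt };
  } finally {
    clearTimeout(timer);
  }
}
```

With this shape, the reported `elapsedMs` is bounded by the budget plus timer scheduling delay, regardless of how long the underlying run takes to wind down.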

### 4. Add concurrency limits for active-memory recall

Per agent/session limits would prevent multiple simultaneous recall subagents from stacking during active chat bursts.

Suggested defaults:

- one active-memory recall per agent
- one active-memory recall per session
- drop or reuse cached result when a recall is already running
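
The "reuse when a recall is already running" option can be sketched as a single-flight map. `RecallSingleFlight` is a hypothetical helper, and the key format (for example `agentId:sessionId`) is an assumption; callers that arrive while a recall is in flight share its promise instead of starting another subagent:

```js
// Hypothetical sketch: at most one in-flight recall per key.
class RecallSingleFlight {
  constructor() {
    this.inFlight = new Map();
  }

  run(key, recall) {
    const existing = this.inFlight.get(key);
    if (existing) return existing; // reuse the recall that is already running

    const promise = Promise.resolve()
      .then(recall)
      // Clear the slot on success or failure so the next message can recall again.
      .finally(() => this.inFlight.delete(key));
    this.inFlight.set(key, promise);
    return promise;
  }
}
```

Keying by agent alone gives the "one recall per agent" limit; keying by agent and session gives the per-session variant.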

### 5. Add QMD boot concurrency control / startup jitter across agents

`qmd memory startup initialization` should avoid boot-time bursts across all configured agents.

Possible approaches:

- global max concurrent QMD boot updates, default 1
- jitter per agent
- lazy-initialize QMD manager on first memory search instead of arming every agent on boot
- separate `onBootAgents` allowlist or `onBoot: "activeOnly"`
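
The first two approaches can be combined in a small limiter. This is a sketch under assumptions: `createBootLimiter`, `scheduleBootUpdates`, and the `bootUpdate(agentId)` callback are hypothetical stand-ins for however the gateway triggers each manager's boot update, and `maxConcurrent` / `jitterMs` are illustrative option names:

```js
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical sketch: allow at most `maxConcurrent` tasks to run at once.
function createBootLimiter(maxConcurrent) {
  let active = 0;
  const waiting = [];
  const release = () => {
    const next = waiting.shift();
    if (next) next(); // hand the slot directly to the next waiter
    else active -= 1;
  };
  return async function limit(task) {
    if (active >= maxConcurrent) {
      // Wait until a finishing task hands over its slot.
      await new Promise((resolve) => waiting.push(resolve));
    } else {
      active += 1;
    }
    try {
      return await task();
    } finally {
      release();
    }
  };
}

// Hypothetical: `bootUpdate(agentId)` stands in for one manager's boot update.
function scheduleBootUpdates(agentIds, bootUpdate, { maxConcurrent = 1, jitterMs = 0 } = {}) {
  const limit = createBootLimiter(maxConcurrent);
  return Promise.all(
    agentIds.map((agentId) =>
      limit(async () => {
        // Random delay spreads updates out instead of running them back to back.
        if (jitterMs > 0) await sleep(Math.random() * jitterMs);
        return bootUpdate(agentId);
      }),
    ),
  );
}
```

With `maxConcurrent: 1`, the ten agents in the log above would update one after another rather than in a single burst, at the cost of a longer total warm-up.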

### 6. Make the operational cost visible in status/doctor

`openclaw status` / doctor could flag:

- active-memory is enabled and inherits a slow primary model
- active-memory timeout is high
- QMD onBoot is enabled for many agents
- active-memory is returning mostly timeout/empty results
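
As an illustration of the last check, the `done` log lines shown under "Observed behavior" already carry enough information to compute these signals. `summarizeRecallLogs` is a hypothetical helper, and the regular expression assumes the log format stays exactly as quoted in this report:

```js
// Matches lines like:
// active-memory: ... done status=timeout elapsedMs=40808 summaryChars=0
const DONE_LINE = /active-memory: .*\bdone status=(\w+) elapsedMs=(\d+) summaryChars=\d+/;

// Hypothetical sketch: summarize recall outcomes from gateway log lines.
function summarizeRecallLogs(lines, timeoutMs) {
  const stats = { total: 0, timeout: 0, empty: 0, overBudget: 0, maxElapsedMs: 0 };
  for (const line of lines) {
    const match = DONE_LINE.exec(line);
    if (!match) continue; // ignore "start" lines and unrelated output
    const status = match[1];
    const elapsedMs = Number(match[2]);
    stats.total += 1;
    if (status === "timeout") stats.timeout += 1;
    if (status === "empty") stats.empty += 1;
    if (elapsedMs > timeoutMs) stats.overBudget += 1; // ran past the configured timeout
    stats.maxElapsedMs = Math.max(stats.maxElapsedMs, elapsedMs);
  }
  // Share of recalls that added latency without contributing any context.
  stats.unusableRatio = stats.total === 0 ? 0 : (stats.timeout + stats.empty) / stats.total;
  return stats;
}
```

A status/doctor check could warn when `unusableRatio` stays high or when `overBudget` is non-zero; the thresholds are left open here.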

## Recommended safer default profile

For multi-agent gateways, a safer profile might be:

```json
{
  "plugins": {
    "entries": {
      "active-memory": {
        "enabled": true,
        "config": {
          "enabled": true,
          "mode": "nonblocking",
          "timeoutMs": 2000,
          "model": "<fast-cheap-default>",
          "queryMode": "message",
          "recentUserTurns": 1,
          "recentAssistantTurns": 0,
          "cacheTtlMs": 60000
        }
      }
    }
  },
  "memory": {
    "qmd": {
      "update": {
        "onBoot": "activeOnly",
        "embedInterval": "60m",
        "maxConcurrentBootUpdates": 1
      }
    }
  }
}
```

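This profile uses the options proposed above (`mode`, `onBoot: "activeOnly"`, `maxConcurrentBootUpdates`), so exact schema names can differ; the point is safer semantics.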

## Workaround used locally

Active-memory was disabled, and memory-core dreaming was left disabled. With both off, gateway CPU and reply latency returned to a stable baseline, while manual `memory_search` / QMD queries still worked.

This suggests the issue is not QMD search itself, but the combination of blocking per-message active-memory embedded LLM recall and broad QMD boot/update work.


Reliability: active-memory blocks replies and QMD boot initialization can overload multi-agent gateways #72015
