Skip to content

Bound active-memory recall latency and jitter QMD startup#73667

Closed
0xCNAI wants to merge 1 commit into
openclaw:mainfrom
0xCNAI:fix/active-memory-deadline-jitter
Closed

Bound active-memory recall latency and jitter QMD startup#73667
0xCNAI wants to merge 1 commit into
openclaw:mainfrom
0xCNAI:fix/active-memory-deadline-jitter

Conversation

@0xCNAI

@0xCNAI 0xCNAI commented Apr 28, 2026

Copy link
Copy Markdown

Summary

This patch applies the active-memory reliability approach proposed by @kinthaiofficial in #72015:

  • reduce Active Memory's default recall deadline from 15s to 3s
  • keep the before_prompt_build hook deadline-bounded so memory recall degrades instead of blocking normal replies
  • remove the default 30s setup grace from the critical path
  • stagger QMD startup initialization with a small jitter so multi-agent boot does not stampede memory backends
  • update Active Memory docs to recommend low-latency recall budgets

The existing QMD search manager cache is already keyed per agent, so this keeps the change small and focuses on the blocking failure mode plus startup herd behavior.

Tests

  • pnpm exec vitest run extensions/active-memory/index.test.ts src/gateway/server-startup-memory.test.ts
  • pnpm tsgo:extensions:test
  • pnpm tsgo:core:test

Fixes #72015

@openclaw-barnacle openclaw-barnacle Bot added docs Improvements or additions to documentation gateway Gateway runtime size: XS labels Apr 28, 2026
@greptile-apps

greptile-apps Bot commented Apr 28, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR reduces the Active Memory recall deadline from 15 s to 3 s, removes the 30 s setup grace from the critical path, and adds per-agent jitter to stagger QMD startup. The jitter and timeout-reduction changes are sound, but the new beforePromptBuildTimeoutMs formula hard-codes DEFAULT_TIMEOUT_MS (3 s) instead of the maximum allowed recall timeout (120 s).

  • Any agent configured with config.timeoutMs > ~3000 — including the "full" query mode that the docs in this same PR recommend at "5,000 ms or higher" — will have its before_prompt_build hook cancelled at 4 s, silently dropping memory recall, before the internal watchdog fires.
  • The test was updated to assert the new 4,000 ms value but does not cover custom timeoutMs values above the default, masking the regression.

Confidence Score: 3/5

Not safe to merge as-is — the hook envelope silently drops recall for any agent using a custom timeout above 3 s.

A single P1 defect in index.ts means the hook timeout regression will affect any user who follows the docs guidance for full query mode (5,000 ms+), making recall silently fail. The latency reduction and jitter logic are otherwise correct, but the P1 pulls the score below the 4/5 ceiling.

extensions/active-memory/index.ts line 2434 — beforePromptBuildTimeoutMs computation needs to account for the maximum configured timeoutMs.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: extensions/active-memory/index.ts
Line: 2434

Comment:
**Hook timeout uses `DEFAULT_TIMEOUT_MS` instead of the per-agent configured value**

`beforePromptBuildTimeoutMs` is now `DEFAULT_TIMEOUT_MS + setupGraceTimeoutMs + 1_000 = 4_000 ms`. The internal recall watchdog fires at `config.timeoutMs + setupGraceTimeoutMs` (line 2193), which uses the _per-agent_ `config.timeoutMs`, clamped up to 120,000 ms. Any agent configured with `timeoutMs > 3_000` (e.g., the docs update in this same PR recommends `5_000` or higher for `full` query mode) will have its hook cancelled at 4 s before the recall's own watchdog can fire, causing silent recall loss. The previous formula `120_000 + setupGraceTimeoutMs` was sized to always outlast the maximum allowed recall timeout.

To preserve the invariant, the hook envelope should wrap around the maximum possible recall duration, not the default:
```
// keep the hook envelope at the cap so per-agent timeoutMs up to 120 s still fits
const beforePromptBuildTimeoutMs = 120_000 + setupGraceTimeoutMs + 1_000;
```
If the goal is to reduce the _default_ latency without breaking custom high-timeout configurations, consider computing this per-session from the resolved `config.timeoutMs`, or at least use `Math.max(resolvedAgentTimeoutMs, DEFAULT_TIMEOUT_MS)` before adding the buffer.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/gateway/server-startup-memory.ts
Line: 51

Comment:
**`<= 0` guard is misleading for an array index**

`params.index` is the `.map()` callback index, which is always `>= 0`, so the `<= 0` condition is functionally identical to `=== 0` but implies negative indices are possible. Using strict equality makes the intent clearer.

```suggestion
  if (params.index === 0 || params.total <= 1) {
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(active-memory): bound recall latency..." | Re-trigger Greptile

});

const beforePromptBuildTimeoutMs = 120_000 + setupGraceTimeoutMs;
const beforePromptBuildTimeoutMs = DEFAULT_TIMEOUT_MS + setupGraceTimeoutMs + 1_000;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Hook timeout uses DEFAULT_TIMEOUT_MS instead of the per-agent configured value

beforePromptBuildTimeoutMs is now DEFAULT_TIMEOUT_MS + setupGraceTimeoutMs + 1_000 = 4_000 ms. The internal recall watchdog fires at config.timeoutMs + setupGraceTimeoutMs (line 2193), which uses the per-agent config.timeoutMs, clamped up to 120,000 ms. Any agent configured with timeoutMs > 3_000 (e.g., the docs update in this same PR recommends 5_000 or higher for full query mode) will have its hook cancelled at 4 s before the recall's own watchdog can fire, causing silent recall loss. The previous formula 120_000 + setupGraceTimeoutMs was sized to always outlast the maximum allowed recall timeout.

To preserve the invariant, the hook envelope should wrap around the maximum possible recall duration, not the default:

// keep the hook envelope at the cap so per-agent timeoutMs up to 120 s still fits
const beforePromptBuildTimeoutMs = 120_000 + setupGraceTimeoutMs + 1_000;

If the goal is to reduce the default latency without breaking custom high-timeout configurations, consider computing this per-session from the resolved config.timeoutMs, or at least use Math.max(resolvedAgentTimeoutMs, DEFAULT_TIMEOUT_MS) before adding the buffer.

Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/active-memory/index.ts
Line: 2434

Comment:
**Hook timeout uses `DEFAULT_TIMEOUT_MS` instead of the per-agent configured value**

`beforePromptBuildTimeoutMs` is now `DEFAULT_TIMEOUT_MS + setupGraceTimeoutMs + 1_000 = 4_000 ms`. The internal recall watchdog fires at `config.timeoutMs + setupGraceTimeoutMs` (line 2193), which uses the _per-agent_ `config.timeoutMs`, clamped up to 120,000 ms. Any agent configured with `timeoutMs > 3_000` (e.g., the docs update in this same PR recommends `5_000` or higher for `full` query mode) will have its hook cancelled at 4 s before the recall's own watchdog can fire, causing silent recall loss. The previous formula `120_000 + setupGraceTimeoutMs` was sized to always outlast the maximum allowed recall timeout.

To preserve the invariant, the hook envelope should wrap around the maximum possible recall duration, not the default:
```
// keep the hook envelope at the cap so per-agent timeoutMs up to 120 s still fits
const beforePromptBuildTimeoutMs = 120_000 + setupGraceTimeoutMs + 1_000;
```
If the goal is to reduce the _default_ latency without breaking custom high-timeout configurations, consider computing this per-session from the resolved `config.timeoutMs`, or at least use `Math.max(resolvedAgentTimeoutMs, DEFAULT_TIMEOUT_MS)` before adding the buffer.

How can I resolve this? If you propose a fix, please make it concise.

}

function resolveStartupJitterMs(params: { index: number; total: number }): number {
if (params.index <= 0 || params.total <= 1) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 <= 0 guard is misleading for an array index

params.index is the .map() callback index, which is always >= 0, so the <= 0 condition is functionally identical to === 0 but implies negative indices are possible. Using strict equality makes the intent clearer.

Suggested change
if (params.index <= 0 || params.total <= 1) {
if (params.index === 0 || params.total <= 1) {
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/gateway/server-startup-memory.ts
Line: 51

Comment:
**`<= 0` guard is misleading for an array index**

`params.index` is the `.map()` callback index, which is always `>= 0`, so the `<= 0` condition is functionally identical to `=== 0` but implies negative indices are possible. Using strict equality makes the intent clearer.

```suggestion
  if (params.index === 0 || params.total <= 1) {
```

How can I resolve this? If you propose a fix, please make it concise.

@clawsweeper

clawsweeper Bot commented Apr 28, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge.

Summary
The PR lowers Active Memory timeout/setup-grace defaults, resizes the before_prompt_build hook timeout, adds jittered parallel QMD startup initialization, and updates Active Memory docs/tests.

Reproducibility: yes. for the PR regression by source inspection: configure Active Memory on the branch with timeoutMs above 3000 ms and the fixed 4000 ms hook timeout can fire before the configured recall budget. I did not run a live gateway canary in this read-only review.

Real behavior proof
Needs real behavior proof before merge: The PR body lists tests/typechecks only; the contributor should add redacted terminal output, logs, live output, a linked artifact, or a recording from a real Active Memory/QMD setup before merge, then update the PR body to trigger re-review or ask for @clawsweeper re-review.

Next step before merge
Manual review is needed because the external PR is draft/conflicting, lacks real behavior proof, includes product/default timing policy, and has a concrete timeout regression.

Security
Cleared: The diff only changes Active Memory/Gateway timing behavior, docs, and tests; it adds no dependencies, workflows, scripts, permissions, downloads, package resolution, publishing, or secret handling.

Review findings

  • [P1] Size the hook timeout from the configured recall budget — extensions/active-memory/index.ts:2434
Review details

Best possible solution:

Rework or replace the branch so hook budgets preserve configured recall timeouts, build on current-main setup-grace and QMD lazy-start semantics, and leave broader nonblocking/default-policy work on the canonical reliability tracker.

Do we have a high-confidence way to reproduce the issue?

Yes, for the PR regression by source inspection: configure Active Memory on the branch with timeoutMs above 3000 ms and the fixed 4000 ms hook timeout can fire before the configured recall budget. I did not run a live gateway canary in this read-only review.

Is this the best way to solve the issue?

No, not as proposed. The reliability direction is plausible, but the hook envelope must be sized from the resolved configured timeout or cap, and the QMD/default-latency policy should be reconciled with current main before merge.

Full review comments:

  • [P1] Size the hook timeout from the configured recall budget — extensions/active-memory/index.ts:2434
    This branch registers the Active Memory before_prompt_build hook with a default-derived 4s timeout while config.timeoutMs still accepts larger values and the docs recommend budgets above the default. The hook runner enforces the registration timeout before the plugin watchdog can return, so custom high-timeout recall can be silently dropped; derive the envelope from the resolved configured timeout or cap and cover a custom timeout above the default.
    Confidence: 0.93

Overall correctness: patch is incorrect
Overall confidence: 0.91

What I checked:

  • PR diff timeout regression: The PR changes beforePromptBuildTimeoutMs to DEFAULT_TIMEOUT_MS + setupGraceTimeoutMs + 1_000, which becomes 4s after lowering the default to 3s while per-agent timeoutMs remains configurable above that. (extensions/active-memory/index.ts:2434, 0071f6ca13f7)
  • Current main preserves configured hook budget: Current main registers before_prompt_build with config.timeoutMs + config.setupGraceTimeoutMs, so the hook envelope tracks the resolved configured recall budget rather than a fixed default-derived value. (extensions/active-memory/index.ts:3026, 48b4e5b3614f)
  • Hook runner enforces registration timeout: Modifying hooks use withHookTimeout around the handler promise, so an undersized before_prompt_build registration timeout can skip the Active Memory result before the plugin's own watchdog resolves. (src/plugins/hooks.ts:590, 48b4e5b3614f)
  • Current tests cover high configured budgets: Current main asserts a 90,000 ms Active Memory timeout registers the hook with 90,000 ms and that setupGraceTimeoutMs is added when configured. (extensions/active-memory/index.test.ts:337, 48b4e5b3614f)
  • Current QMD startup behavior differs materially: Current main gates QMD boot sync behind startup opt-in, defers non-explicit agents in multi-agent configs, and initializes/syncs managers sequentially with purpose "cli" rather than the PR's Promise.all jitter path. (src/gateway/server-startup-memory.ts:56, 48b4e5b3614f)
  • Live PR state and proof gap: Live GitHub metadata shows the PR is open, draft, mergeable=CONFLICTING, references the still-open reliability issue, and the body lists tests/typechecks but no real Active Memory/QMD runtime proof. (0071f6ca13f7)

Likely related people:

  • Takhoffman: Recent merged history includes Active Memory custom recall tools and bundled recall-tool work on the same plugin/docs/test surface. (role: Active Memory recall contributor; confidence: high; commits: 2f2602508511, f256eeba431b; files: extensions/active-memory/index.ts, extensions/active-memory/index.test.ts, docs/concepts/active-memory.md)
  • steipete: Recent commits touch Active Memory timeout/setup-grace behavior, tests, and QMD gateway startup lazy/boot-sync behavior central to this PR. (role: recent Active Memory and QMD startup contributor; confidence: high; commits: 59449d7f19cc, 2b811fe6d9e3, 30e2654ac2c8; files: extensions/active-memory/index.ts, extensions/active-memory/index.test.ts, src/gateway/server-startup-memory.ts)
  • vincentkoc: History links this account to deferring implicit QMD startup and Active Memory docs/runtime-config work near the same reliability surface. (role: adjacent timeout and QMD startup contributor; confidence: medium; commits: 732a5842ee99, c357235fe6e6; files: src/gateway/server-startup-memory.ts, docs/concepts/active-memory.md)
  • vignesh07: Earlier QMD backend hardening overlaps the boot-storm side of the PR's startup behavior. (role: QMD backend and startup contributor; confidence: medium; commits: a305dfe62601; files: src/gateway/server-startup-memory.ts)

Remaining risk / open question:

  • The branch is draft and merge-conflicting, so conflict repair may materially change the reviewed diff.
  • The PR body has tests/typechecks only and no real Active Memory/QMD runtime proof.
  • The default timeout reduction and broad QMD startup scheduling behavior require maintainer product judgment beyond a mechanical bug fix.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 48b4e5b3614f.

@openclaw-barnacle

Copy link
Copy Markdown

This assigned pull request has been automatically marked as stale after being open for 27 days.
Please add updates or it will be closed.

@openclaw-barnacle openclaw-barnacle Bot added stale Marked as stale due to inactivity triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 29, 2026
@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. labels May 29, 2026
@barnacle-openclaw barnacle-openclaw Bot removed the stale Marked as stale due to inactivity label May 30, 2026
@barnacle-openclaw

Copy link
Copy Markdown

This assigned pull request has been automatically marked as stale after being open for 27 days.
Please add updates or it will be closed.

@barnacle-openclaw barnacle-openclaw Bot added the stale Marked as stale due to inactivity label May 30, 2026
@openclaw-barnacle

Copy link
Copy Markdown

Closing due to inactivity.
If you believe this PR should be revived, post in #clawtributors on Discord to talk to a maintainer.
That channel is the escape hatch for high-quality PRs that get auto-closed.

@openclaw-barnacle openclaw-barnacle Bot closed this Jun 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs Improvements or additions to documentation gateway Gateway runtime merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. P1 High-priority user-facing bug, regression, or broken workflow. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. size: XS stale Marked as stale due to inactivity status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reliability: active-memory blocks replies and QMD boot initialization can overload multi-agent gateways

2 participants