fix(memory): retry transient embedding transport failures by MrGeDiao · Pull Request #44167 · openclaw/openclaw

MrGeDiao · 2026-03-12T15:22:19Z

Summary

Problem: memory embedding batch sync could abort immediately when a remote embedding provider raised transient transport failures such as fetch failed, ECONNRESET, socket close, or undici socket/connect errors.
Why it matters: one temporary provider/network failure can prevent otherwise valid memory chunks from being indexed.
What changed: current extensions/memory-core now treats narrow transport failures as retryable, reuses the existing backoff loop, and recursively splits multi-item batches after retry exhaustion while preserving result order.
What did NOT change (scope boundary): no provider selection, config/schema, timeout constant, cache format, or non-retryable error behavior changes.

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Closes memory reindex aborts on transient embedding transport errors instead of retrying or splitting the batch #44166
Related fix(memory): retry transient embedding transport failures #44167
This PR fixes a bug or regression

Root Cause (if applicable)

Root cause: the embedding retry classifier covered rate-limit and service-side failures, but not common transient transport failures from fetch/undici/network sockets.
Missing detection / guardrail: unit coverage did not lock in transport retry classification or the exhausted-retry batch split fallback.
Contributing context (if known): remote embedding providers can fail a whole batch with transient transport errors even when individual inputs are valid.

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- Unit test
- Seam / integration test
- End-to-end test
- Existing coverage already sufficient
Target test or file: extensions/memory-core/src/memory/manager-embedding-policy.test.ts
Scenario the test should lock in: transient transport errors are retryable; exhausted transport failures split multi-item batches recursively; exhausted service retry errors such as 429 do not split.
Why this is the smallest reliable guardrail: the retry classifier and split policy are deterministic and can be validated without live provider/network calls.
Existing test that already covers this (if any): existing tests covered rate-limit, 5xx, and tokens-per-day retry behavior, but not transport failures.
If no new test is added, why not: N/A

User-visible / Behavior Changes

Memory indexing should be more resilient to transient embedding provider transport failures. No config or UI changes.

Diagram (if applicable)

Before:
[embedding batch] -> [transient transport failure] -> [sync aborts]

After:
[embedding batch] -> [retry with backoff] -> [split batch after retry exhaustion] -> [index valid chunks]

Security Impact (required)

New permissions/capabilities? (Yes/No): No
Secrets/tokens handling changed? (Yes/No): No
New/changed network calls? (Yes/No): No
Command/tool execution surface changed? (Yes/No): No
Data access scope changed? (Yes/No): No
If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

OS: macOS
Runtime/container: Node v22.18.0, pnpm 10.33.0
Model/provider: mocked embedding provider in unit tests
Integration/channel (if any): N/A
Relevant config (redacted): N/A

Steps

Run the focused memory embedding policy test.
Run the full extensions/memory-core test project.
Inspect the diff against current main to confirm only memory-core policy/ops/test files changed.

Expected

Transport failures are retried and, after retry exhaustion, multi-item batches split recursively.
Non-transport retryable service errors do not trigger batch splitting.
Single-item batches still fail normally after retry exhaustion.

Actual

Matches expected behavior in unit coverage.

Evidence

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

Focused verification passed:

pnpm test extensions/memory-core/src/memory/manager-embedding-policy.test.ts
pnpm test extensions/memory-core

Human Verification (required)

What you personally verified (not just CI), and how:

Verified scenarios: transport error classification, retry-before-split behavior, recursive split ordering, no split for exhausted 429/service retry errors.
Edge cases checked: empty batches still return []; single-item batch failures remain terminal; structured and text batch paths both use the same retry/split policy.
What you did not verify: a clean full pnpm check:changed run to completion. In this fork checkout it entered fail-safe all-lanes mode; typecheck, lint, and import-cycle checks passed, then the full test lane hit unrelated src/agents/tools/web-fetch.provider-fallback.test.ts SSRF failures outside the changed files.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? (Yes/No): Yes
Config/env changes? (Yes/No): No
Migration needed? (Yes/No): No
If yes, exact upgrade steps: N/A

Risks and Mitigations

Risk: broader retry classification could hide genuinely persistent provider/network problems for longer.
- Mitigation: only narrow transport patterns are classified as retryable; single-item batches still fail after retry exhaustion.
Risk: splitting batches increases provider calls after repeated transport failures.
- Mitigation: splitting only happens after the existing retry loop is exhausted and only for multi-item batches.

greptile-apps · 2026-03-12T15:27:04Z

Greptile Summary

This PR adds resilience to embedBatchWithRetry() by detecting transient transport-level errors (fetch failed, econnreset, etc.) and retrying them with the existing exponential-backoff loop; when retries are exhausted on a multi-item batch, the batch is recursively split in half until individual items can be retried independently or fail. The change is well-scoped, does not affect provider selection, configuration, or schema, and is covered by two new focused tests.

Key changes:

New private method isRetryableEmbeddingNetworkError with a regex matching common transport-level failure strings
New retry/split branch inserted into embedBatchWithRetry before the existing rate-limit retry path; single-item batches still throw after retry exhaustion
Two new test cases: single transient failure with successful retry, and persistent failures forcing full binary batch splitting

Issues found:

The terminated keyword in isRetryableEmbeddingNetworkError is broader than necessary and could match non-transient errors (e.g., worker/process termination); scoping it to socket terminated or documenting the targeted SDK error would be safer
The "splits batches" test only verifies the first and last batch sizes; asserting that the total count of single-item successful calls equals the number of chunks would more robustly validate that no items are silently dropped during splitting

Confidence Score: 4/5

This PR is safe to merge — the logic is correct, the change is narrow, and existing behavior for non-network errors is fully preserved.
The core retry/split logic is sound: the attempt counter is correctly shared, the split path only fires after retry exhaustion, single-item batches still propagate errors as intended, and recursive sub-batches each get a fresh attempt counter with proper backoff. The only concerns are a slightly over-broad regex pattern (terminated) and a test assertion that does not fully verify embedding completeness after splitting — both are minor and do not represent correctness bugs.
Pay attention to the isRetryableEmbeddingNetworkError regex in src/memory/manager-embedding-ops.ts — the terminated keyword may be too broad.

Prompt To Fix All With AI

This is a comment left during a code review.
Path: src/memory/manager-embedding-ops.ts
Line: 622-626

Comment:
**`terminated` pattern is overly broad**

The word `terminated` can appear in non-network error messages, e.g. "Worker was terminated", "Request terminated by the user", or process-shutdown errors. Retrying those would be noise at best and mask real failures at worst.

Consider tightening it to the network-specific forms that are actually emitted by Node's `undici` / `fetch`:

```suggestion
  private isRetryableEmbeddingNetworkError(message: string): boolean {
    return /(fetch failed|network error|econnreset|econnrefused|connection reset|socket hang up|other side closed|socket terminated)/i.test(
      message,
    );
  }
```

Alternatively, if the bare `terminated` is needed for a specific SDK, a comment documenting which SDK / error string it covers would make the intent clear and prevent accidental broadening in future patches.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/memory/manager.embedding-batches.test.ts
Line: 178-182

Comment:
**Test only weakly validates split completeness**

The assertion only checks the first and last two batch sizes, but doesn't confirm that *every* chunk was eventually embedded. If the recursive split somehow drops items (e.g., an off-by-one in `splitAt`), the test would still pass as long as the last two sizes happen to be `[1, 1]`.

Adding a check on the total number of single-item calls (which must equal the total chunk count for all items to be covered) would make the test significantly more robust:

```ts
// after the sync call
const singleItemCalls = batchSizes.filter((s) => s === 1).length;
const expectedChunks = managerSmall.status().chunks;
expect(singleItemCalls).toBe(expectedChunks);
```

This ensures the splitting truly accounts for every chunk, not just that the tail ends at size 1.

How can I resolve this? If you propose a fix, please make it concise.

_{Last reviewed commit: 6017933}

MrGeDiao · 2026-03-12T15:38:01Z

Addressing the Greptile suggestions in the latest push:

tightened the transport-error matcher by replacing bare terminated with socket terminated
strengthened the split test by asserting that the number of single-item calls matches managerSmall.status().chunks, so the test now verifies full chunk coverage instead of only the tail shape

Re-ran the focused tests after the update:

corepack pnpm exec vitest run --config vitest.unit.config.ts \
  src/memory/manager.embedding-batches.test.ts \
  src/memory/manager.sync-errors-do-not-crash.test.ts

MrGeDiao · 2026-04-26T14:05:46Z

Updated this PR onto current main and moved the fix to the current extensions/memory-core paths.

I also addressed the previous bot feedback by:

avoiding a broad bare terminated retry pattern; only narrow transport/socket/undici-style failures are retryable
adding regression coverage for transport retry classification, retry exhaustion batch splitting, result ordering, and no split for exhausted 429/service retry errors

Local focused memory-core tests pass, and the GitHub PR checks are green.

clawsweeper · 2026-04-26T14:10:24Z

Codex review: needs real behavior proof before merge.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The PR broadens memory-core embedding retry classification for transient transport errors and adds recursive split-after-exhaustion behavior for failed embedding batches.

Reproducibility: yes. at source level. Current main retries transient transport messages but still has no split-after-exhaustion helper, while the PR's policy test models the remaining multi-item batch behavior.

PR rating
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🧂 unranked krab
Summary: The PR is not merge-ready because real behavior proof is missing and the current patch has blocking timeout and cancellation regressions.

Rank-up moves:

Rebase and preserve runEmbeddingOperationWithTimeout plus AbortSignal propagation in the split helper.
Keep OpenClaw's own embedding timeout messages retryable after the regex refactor.
Add redacted terminal output, logs, live output, or a linked artifact from a real remote memory indexing run after the refreshed patch.

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Needs real behavior proof before merge: The PR body and comments show mocked/unit verification only; it still needs redacted terminal output, logs, live output, or a linked artifact from a real remote memory indexing run after the patch. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

The branch is conflicting with current main, so the April green checks do not prove the exact merge result.
The PR can stop retrying OpenClaw's own memory embeddings batch timed out after ... errors that current main treats as retryable.
The split path drops AbortSignal propagation, so timed-out remote provider requests may continue while retries or split sub-batches issue more calls.
The PR body and comments contain mocked/unit verification only; there is no redacted real remote memory indexing proof after the patch.

Maintainer options:

Refresh around abortable split retries (recommended)
Rebase the branch and implement the split helper around runEmbeddingOperationWithTimeout so timeout retry matching and AbortSignal propagation remain intact.
Require real remote-provider proof
Ask for redacted terminal output, logs, live output, or a linked artifact from a real remote memory indexing run after the refreshed patch.
Replace the stale branch if needed
If resolving the conflict is more churn than value, land an equivalent focused fix from a fresh branch and close this PR after the replacement exists.

Next step before merge
Contributor or maintainer action is needed because the branch is conflicting, has P1 code regressions, and lacks real-environment proof that automation cannot supply.

Security
Cleared: The diff is limited to memory-core retry policy and tests, with no dependency, workflow, secret, permission, package, or code-execution surface change.

Review findings

[P1] Preserve timeout retry matching — extensions/memory-core/src/memory/manager-embedding-policy.ts:83-86
[P1] Keep split retries abortable — extensions/memory-core/src/memory/manager-embedding-ops.ts:282-283

Review details

Best possible solution:

Refresh or replace the branch with a split-after-exhaustion implementation that preserves runEmbeddingOperationWithTimeout, AbortSignal propagation, and current timeout retry matching, then require redacted real remote-provider indexing proof before merge.

Do we have a high-confidence way to reproduce the issue?

Yes, at source level. Current main retries transient transport messages but still has no split-after-exhaustion helper, while the PR's policy test models the remaining multi-item batch behavior.

Is this the best way to solve the issue?

No, not in the current branch shape. The split idea is reasonable, but the patch needs to preserve current timeout retry matching and abortable provider calls before it is the maintainable fix.

Label justifications:

P2: This is a normal-priority memory indexing reliability fix with limited blast radius but real impact for remote embedding users.
merge-risk: 🚨 compatibility: The PR can regress current retry behavior for configured embedding timeouts that existing remote-provider setups rely on.
merge-risk: 🚨 availability: Dropping AbortSignal propagation can leave timed-out provider calls running while retry or split work continues, increasing load under stuck providers.
rating: 🧂 unranked krab: Current PR rating is 🧂 unranked krab because proof is 🧂 unranked krab, patch quality is 🧂 unranked krab, and The PR is not merge-ready because real behavior proof is missing and the current patch has blocking timeout and cancellation regressions.
status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body and comments show mocked/unit verification only; it still needs redacted terminal output, logs, live output, or a linked artifact from a real remote memory indexing run after the patch. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Full review comments:

[P1] Preserve timeout retry matching — extensions/memory-core/src/memory/manager-embedding-policy.ts:83-86
Current main retries OpenClaw's own wrapper errors such as memory embeddings batch timed out after ... because the predicate includes generic timed out. This PR splits the regex and only matches connection-specific timeout text, so slow remote batch calls that currently retry would fail after the first timeout.
Confidence: 0.88
[P1] Keep split retries abortable — extensions/memory-core/src/memory/manager-embedding-ops.ts:282-283
Current main wraps batch embedding with runEmbeddingOperationWithTimeout and passes an AbortSignal to the provider. The PR calls this.withTimeout(provider.embedBatch(...)) without a signal, so a timed-out request can keep running while retries or split sub-batches issue additional remote calls.
Confidence: 0.9

Overall correctness: patch is incorrect
Overall confidence: 0.9

Acceptance criteria:

node scripts/run-vitest.mjs extensions/memory-core/src/memory/manager-embedding-policy.test.ts extensions/memory-core/src/memory/manager-embedding-timeout.test.ts
Redacted real remote memory indexing run showing transient provider transport retry/split behavior after the refreshed patch

What I checked:

Current main already retries transport and timeout messages: Current main's retry predicate includes fetch/undici/socket transport failures and the generic timed out text used by OpenClaw's own embedding timeout wrapper. (extensions/memory-core/src/memory/manager-embedding-policy.ts:83, e399a92e6cc9)
Current main keeps batch embedding abortable: Batch and structured batch embedding calls run through runEmbeddingOperationWithTimeout, which passes an AbortSignal into provider calls. (extensions/memory-core/src/memory/manager-embedding-ops.ts:327, e399a92e6cc9)
Provider contract supports AbortSignal: The memory host SDK embedding provider contract accepts options.signal for embedBatch and embedBatchInputs, so dropping signal propagation changes the runtime contract current main uses. (packages/memory-host-sdk/src/host/embeddings.types.ts:4, e399a92e6cc9)
PR diff drops abortable timeout wrapper: The PR switches batch embedding to this.withTimeout(provider.embedBatch(...)) without passing an AbortSignal, so timed-out provider calls can continue while retries or split sub-batches start. (extensions/memory-core/src/memory/manager-embedding-ops.ts:282, 755ac6bb23e4)
PR diff narrows away current timeout retry text: The PR replaces the current single retry regex with service and transport regexes that do not match OpenClaw's own memory embeddings batch timed out after ... wrapper errors. (extensions/memory-core/src/memory/manager-embedding-policy.ts:83, 755ac6bb23e4)
Live PR state blocks merge confidence: GitHub reports the PR as CONFLICTING, and repeated Real behavior proof checks are failing; the PR body/comments show mocked or unit verification rather than a real remote memory indexing run. (755ac6bb23e4)

Likely related people:

steipete: Current blame for the retry predicate and abortable timeout helper points to Peter Steinberger, and shortlog shows him as the dominant recent contributor to the affected memory embedding files. (role: recent area contributor; confidence: high; commits: 48bb3b0a7476, d806682f782e, 3417dbabf43e; files: extensions/memory-core/src/memory/manager-embedding-ops.ts, extensions/memory-core/src/memory/manager-embedding-policy.ts)
vincentkoc: Vincent Koc recently changed the same embedding ops file for memory-core embedding cache typing, so he is a plausible adjacent reviewer for this surface. (role: adjacent area contributor; confidence: medium; commits: e92c2b63f9cc; files: extensions/memory-core/src/memory/manager-embedding-ops.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against e399a92e6cc9.

clawsweeper · 2026-05-20T10:56:46Z

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?

The egg game starts only after the PR passes the real-behavior proof check.
Before that, no creature or rarity is rolled. The treat waits for real proof.
This is still just collectible flavor: proof affects review readiness, not creature quality.

Retry live query embeddings on transient provider transport failures and split eligible batch embedding socket failures after bounded retries. Fixes #71784 Fixes #44166 Supersedes #44167 Co-authored-by: MrGeDiao <MrGeDiao@users.noreply.github.com>

steipete · 2026-05-31T11:31:06Z

Thanks for the focused fix path. This has landed via #88599 in 899dc5f.

I ported the useful split-after-retry behavior onto current main, kept timeout and AbortSignal behavior intact, narrowed splitting so endpoint outages and batch timeouts do not amplify into recursive retry storms, and preserved contributor credit in the squash commit.

Closing this PR as superseded by the landed maintainer PR.

Retry live query embeddings on transient provider transport failures and split eligible batch embedding socket failures after bounded retries. Fixes openclaw#71784 Fixes openclaw#44166 Supersedes openclaw#44167 Co-authored-by: MrGeDiao <MrGeDiao@users.noreply.github.com>

openclaw-barnacle Bot added the size: S label Mar 12, 2026

greptile-apps Bot reviewed Mar 12, 2026

View reviewed changes

Comment thread src/memory/manager-embedding-ops.ts Outdated

Comment thread src/memory/manager.embedding-batches.test.ts Outdated

MrGeDiao force-pushed the codex/memory-fetch-retry branch from 6017933 to 09d2bab Compare March 12, 2026 15:37

clawsweeper Bot mentioned this pull request Apr 26, 2026

Bug: memory search live embedding fails ~20–40% with fetch failed | other side closed (provider-agnostic; upstream healthy) #71784

Closed

fix(memory): retry transient embedding transport failures

755ac6b

MrGeDiao force-pushed the codex/memory-fetch-retry branch from 09d2bab to 755ac6b Compare April 26, 2026 13:59

openclaw-barnacle Bot added the extensions: memory-core Extension: memory-core label Apr 26, 2026

MrGeDiao changed the title ~~fix(memory): retry transient embedding fetch failures~~ fix(memory): retry transient embedding transport failures Apr 26, 2026

clawsweeper Bot added the P2 Normal backlog priority with limited blast radius. label May 16, 2026

openclaw-barnacle Bot added the triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. label May 16, 2026

steipete mentioned this pull request May 31, 2026

fix(memory): retry transient embedding failures #88599

Merged

steipete closed this May 31, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(memory): retry transient embedding transport failures#44167

fix(memory): retry transient embedding transport failures#44167
MrGeDiao wants to merge 1 commit into
openclaw:mainfrom
MrGeDiao:codex/memory-fetch-retry

MrGeDiao commented Mar 12, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Mar 12, 2026

Uh oh!

Uh oh!

Uh oh!

MrGeDiao commented Mar 12, 2026

Uh oh!

MrGeDiao commented Apr 26, 2026

Uh oh!

clawsweeper Bot commented Apr 26, 2026 •

edited

Loading

Uh oh!

clawsweeper Bot commented May 20, 2026

Uh oh!

steipete commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

MrGeDiao commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Root Cause (if applicable)

Regression Test Plan (if applicable)

User-visible / Behavior Changes

Diagram (if applicable)

Security Impact (required)

Repro + Verification

Environment

Steps

Expected

Actual

Evidence

Human Verification (required)

Review Conversations

Compatibility / Migration

Risks and Mitigations

Uh oh!

greptile-apps Bot commented Mar 12, 2026

Greptile Summary

Confidence Score: 4/5

Uh oh!

Uh oh!

Uh oh!

MrGeDiao commented Mar 12, 2026

Uh oh!

MrGeDiao commented Apr 26, 2026

Uh oh!

clawsweeper Bot commented Apr 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

clawsweeper Bot commented May 20, 2026

Uh oh!

steipete commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

MrGeDiao commented Mar 12, 2026 •

edited

Loading

clawsweeper Bot commented Apr 26, 2026 •

edited

Loading