Skip to content

fix(memory): retry transient embedding transport failures#44167

Closed
MrGeDiao wants to merge 1 commit into
openclaw:mainfrom
MrGeDiao:codex/memory-fetch-retry
Closed

fix(memory): retry transient embedding transport failures#44167
MrGeDiao wants to merge 1 commit into
openclaw:mainfrom
MrGeDiao:codex/memory-fetch-retry

Conversation

@MrGeDiao

@MrGeDiao MrGeDiao commented Mar 12, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Problem: memory embedding batch sync could abort immediately when a remote embedding provider raised transient transport failures such as fetch failed, ECONNRESET, socket close, or undici socket/connect errors.
  • Why it matters: one temporary provider/network failure can prevent otherwise valid memory chunks from being indexed.
  • What changed: current extensions/memory-core now treats narrow transport failures as retryable, reuses the existing backoff loop, and recursively splits multi-item batches after retry exhaustion while preserving result order.
  • What did NOT change (scope boundary): no provider selection, config/schema, timeout constant, cache format, or non-retryable error behavior changes.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Root Cause (if applicable)

  • Root cause: the embedding retry classifier covered rate-limit and service-side failures, but not common transient transport failures from fetch/undici/network sockets.
  • Missing detection / guardrail: unit coverage did not lock in transport retry classification or the exhausted-retry batch split fallback.
  • Contributing context (if known): remote embedding providers can fail a whole batch with transient transport errors even when individual inputs are valid.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: extensions/memory-core/src/memory/manager-embedding-policy.test.ts
  • Scenario the test should lock in: transient transport errors are retryable; exhausted transport failures split multi-item batches recursively; exhausted service retry errors such as 429 do not split.
  • Why this is the smallest reliable guardrail: the retry classifier and split policy are deterministic and can be validated without live provider/network calls.
  • Existing test that already covers this (if any): existing tests covered rate-limit, 5xx, and tokens-per-day retry behavior, but not transport failures.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

Memory indexing should be more resilient to transient embedding provider transport failures. No config or UI changes.

Diagram (if applicable)

Before:
[embedding batch] -> [transient transport failure] -> [sync aborts]

After:
[embedding batch] -> [retry with backoff] -> [split batch after retry exhaustion] -> [index valid chunks]

Security Impact (required)

  • New permissions/capabilities? (Yes/No): No
  • Secrets/tokens handling changed? (Yes/No): No
  • New/changed network calls? (Yes/No): No
  • Command/tool execution surface changed? (Yes/No): No
  • Data access scope changed? (Yes/No): No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node v22.18.0, pnpm 10.33.0
  • Model/provider: mocked embedding provider in unit tests
  • Integration/channel (if any): N/A
  • Relevant config (redacted): N/A

Steps

  1. Run the focused memory embedding policy test.
  2. Run the full extensions/memory-core test project.
  3. Inspect the diff against current main to confirm only memory-core policy/ops/test files changed.

Expected

  • Transport failures are retried and, after retry exhaustion, multi-item batches split recursively.
  • Non-transport retryable service errors do not trigger batch splitting.
  • Single-item batches still fail normally after retry exhaustion.

Actual

  • Matches expected behavior in unit coverage.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Focused verification passed:

pnpm test extensions/memory-core/src/memory/manager-embedding-policy.test.ts
pnpm test extensions/memory-core

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: transport error classification, retry-before-split behavior, recursive split ordering, no split for exhausted 429/service retry errors.
  • Edge cases checked: empty batches still return []; single-item batch failures remain terminal; structured and text batch paths both use the same retry/split policy.
  • What you did not verify: a clean full pnpm check:changed run to completion. In this fork checkout it entered fail-safe all-lanes mode; typecheck, lint, and import-cycle checks passed, then the full test lane hit unrelated src/agents/tools/web-fetch.provider-fallback.test.ts SSRF failures outside the changed files.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes/No): Yes
  • Config/env changes? (Yes/No): No
  • Migration needed? (Yes/No): No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: broader retry classification could hide genuinely persistent provider/network problems for longer.
    • Mitigation: only narrow transport patterns are classified as retryable; single-item batches still fail after retry exhaustion.
  • Risk: splitting batches increases provider calls after repeated transport failures.
    • Mitigation: splitting only happens after the existing retry loop is exhausted and only for multi-item batches.

@greptile-apps

greptile-apps Bot commented Mar 12, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds resilience to embedBatchWithRetry() by detecting transient transport-level errors (fetch failed, econnreset, etc.) and retrying them with the existing exponential-backoff loop; when retries are exhausted on a multi-item batch, the batch is recursively split in half until individual items can be retried independently or fail. The change is well-scoped, does not affect provider selection, configuration, or schema, and is covered by two new focused tests.

Key changes:

  • New private method isRetryableEmbeddingNetworkError with a regex matching common transport-level failure strings
  • New retry/split branch inserted into embedBatchWithRetry before the existing rate-limit retry path; single-item batches still throw after retry exhaustion
  • Two new test cases: single transient failure with successful retry, and persistent failures forcing full binary batch splitting

Issues found:

  • The terminated keyword in isRetryableEmbeddingNetworkError is broader than necessary and could match non-transient errors (e.g., worker/process termination); scoping it to socket terminated or documenting the targeted SDK error would be safer
  • The "splits batches" test only verifies the first and last batch sizes; asserting that the total count of single-item successful calls equals the number of chunks would more robustly validate that no items are silently dropped during splitting

Confidence Score: 4/5

  • This PR is safe to merge — the logic is correct, the change is narrow, and existing behavior for non-network errors is fully preserved.
  • The core retry/split logic is sound: the attempt counter is correctly shared, the split path only fires after retry exhaustion, single-item batches still propagate errors as intended, and recursive sub-batches each get a fresh attempt counter with proper backoff. The only concerns are a slightly over-broad regex pattern (terminated) and a test assertion that does not fully verify embedding completeness after splitting — both are minor and do not represent correctness bugs.
  • Pay attention to the isRetryableEmbeddingNetworkError regex in src/memory/manager-embedding-ops.ts — the terminated keyword may be too broad.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/memory/manager-embedding-ops.ts
Line: 622-626

Comment:
**`terminated` pattern is overly broad**

The word `terminated` can appear in non-network error messages, e.g. "Worker was terminated", "Request terminated by the user", or process-shutdown errors. Retrying those would be noise at best and mask real failures at worst.

Consider tightening it to the network-specific forms that are actually emitted by Node's `undici` / `fetch`:

```suggestion
  private isRetryableEmbeddingNetworkError(message: string): boolean {
    return /(fetch failed|network error|econnreset|econnrefused|connection reset|socket hang up|other side closed|socket terminated)/i.test(
      message,
    );
  }
```

Alternatively, if the bare `terminated` is needed for a specific SDK, a comment documenting which SDK / error string it covers would make the intent clear and prevent accidental broadening in future patches.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/memory/manager.embedding-batches.test.ts
Line: 178-182

Comment:
**Test only weakly validates split completeness**

The assertion only checks the first and last two batch sizes, but doesn't confirm that *every* chunk was eventually embedded. If the recursive split somehow drops items (e.g., an off-by-one in `splitAt`), the test would still pass as long as the last two sizes happen to be `[1, 1]`.

Adding a check on the total number of single-item calls (which must equal the total chunk count for all items to be covered) would make the test significantly more robust:

```ts
// after the sync call
const singleItemCalls = batchSizes.filter((s) => s === 1).length;
const expectedChunks = managerSmall.status().chunks;
expect(singleItemCalls).toBe(expectedChunks);
```

This ensures the splitting truly accounts for every chunk, not just that the tail ends at size 1.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: 6017933

Comment thread src/memory/manager-embedding-ops.ts Outdated
Comment thread src/memory/manager.embedding-batches.test.ts Outdated
@MrGeDiao MrGeDiao force-pushed the codex/memory-fetch-retry branch from 6017933 to 09d2bab Compare March 12, 2026 15:37
@MrGeDiao

Copy link
Copy Markdown
Contributor Author

Addressing the Greptile suggestions in the latest push:

  • tightened the transport-error matcher by replacing bare terminated with socket terminated
  • strengthened the split test by asserting that the number of single-item calls matches managerSmall.status().chunks, so the test now verifies full chunk coverage instead of only the tail shape

Re-ran the focused tests after the update:

corepack pnpm exec vitest run --config vitest.unit.config.ts \
  src/memory/manager.embedding-batches.test.ts \
  src/memory/manager.sync-errors-do-not-crash.test.ts

@MrGeDiao MrGeDiao force-pushed the codex/memory-fetch-retry branch from 09d2bab to 755ac6b Compare April 26, 2026 13:59
@openclaw-barnacle openclaw-barnacle Bot added the extensions: memory-core Extension: memory-core label Apr 26, 2026
@MrGeDiao MrGeDiao changed the title fix(memory): retry transient embedding fetch failures fix(memory): retry transient embedding transport failures Apr 26, 2026

Copy link
Copy Markdown
Contributor Author

Updated this PR onto current main and moved the fix to the current extensions/memory-core paths.

I also addressed the previous bot feedback by:

  • avoiding a broad bare terminated retry pattern; only narrow transport/socket/undici-style failures are retryable
  • adding regression coverage for transport retry classification, retry exhaustion batch splitting, result ordering, and no split for exhausted 429/service retry errors

Local focused memory-core tests pass, and the GitHub PR checks are green.

@clawsweeper

clawsweeper Bot commented Apr 26, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The PR broadens memory-core embedding retry classification for transient transport errors and adds recursive split-after-exhaustion behavior for failed embedding batches.

Reproducibility: yes. at source level. Current main retries transient transport messages but still has no split-after-exhaustion helper, while the PR's policy test models the remaining multi-item batch behavior.

PR rating
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🧂 unranked krab
Summary: The PR is not merge-ready because real behavior proof is missing and the current patch has blocking timeout and cancellation regressions.

Rank-up moves:

  • Rebase and preserve runEmbeddingOperationWithTimeout plus AbortSignal propagation in the split helper.
  • Keep OpenClaw's own embedding timeout messages retryable after the regex refactor.
  • Add redacted terminal output, logs, live output, or a linked artifact from a real remote memory indexing run after the refreshed patch.
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Needs real behavior proof before merge: The PR body and comments show mocked/unit verification only; it still needs redacted terminal output, logs, live output, or a linked artifact from a real remote memory indexing run after the patch. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

  • The branch is conflicting with current main, so the April green checks do not prove the exact merge result.
  • The PR can stop retrying OpenClaw's own memory embeddings batch timed out after ... errors that current main treats as retryable.
  • The split path drops AbortSignal propagation, so timed-out remote provider requests may continue while retries or split sub-batches issue more calls.
  • The PR body and comments contain mocked/unit verification only; there is no redacted real remote memory indexing proof after the patch.

Maintainer options:

  1. Refresh around abortable split retries (recommended)
    Rebase the branch and implement the split helper around runEmbeddingOperationWithTimeout so timeout retry matching and AbortSignal propagation remain intact.
  2. Require real remote-provider proof
    Ask for redacted terminal output, logs, live output, or a linked artifact from a real remote memory indexing run after the refreshed patch.
  3. Replace the stale branch if needed
    If resolving the conflict is more churn than value, land an equivalent focused fix from a fresh branch and close this PR after the replacement exists.

Next step before merge
Contributor or maintainer action is needed because the branch is conflicting, has P1 code regressions, and lacks real-environment proof that automation cannot supply.

Security
Cleared: The diff is limited to memory-core retry policy and tests, with no dependency, workflow, secret, permission, package, or code-execution surface change.

Review findings

  • [P1] Preserve timeout retry matching — extensions/memory-core/src/memory/manager-embedding-policy.ts:83-86
  • [P1] Keep split retries abortable — extensions/memory-core/src/memory/manager-embedding-ops.ts:282-283
Review details

Best possible solution:

Refresh or replace the branch with a split-after-exhaustion implementation that preserves runEmbeddingOperationWithTimeout, AbortSignal propagation, and current timeout retry matching, then require redacted real remote-provider indexing proof before merge.

Do we have a high-confidence way to reproduce the issue?

Yes, at source level. Current main retries transient transport messages but still has no split-after-exhaustion helper, while the PR's policy test models the remaining multi-item batch behavior.

Is this the best way to solve the issue?

No, not in the current branch shape. The split idea is reasonable, but the patch needs to preserve current timeout retry matching and abortable provider calls before it is the maintainable fix.

Label justifications:

  • P2: This is a normal-priority memory indexing reliability fix with limited blast radius but real impact for remote embedding users.
  • merge-risk: 🚨 compatibility: The PR can regress current retry behavior for configured embedding timeouts that existing remote-provider setups rely on.
  • merge-risk: 🚨 availability: Dropping AbortSignal propagation can leave timed-out provider calls running while retry or split work continues, increasing load under stuck providers.
  • rating: 🧂 unranked krab: Current PR rating is 🧂 unranked krab because proof is 🧂 unranked krab, patch quality is 🧂 unranked krab, and The PR is not merge-ready because real behavior proof is missing and the current patch has blocking timeout and cancellation regressions.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body and comments show mocked/unit verification only; it still needs redacted terminal output, logs, live output, or a linked artifact from a real remote memory indexing run after the patch. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Full review comments:

  • [P1] Preserve timeout retry matching — extensions/memory-core/src/memory/manager-embedding-policy.ts:83-86
    Current main retries OpenClaw's own wrapper errors such as memory embeddings batch timed out after ... because the predicate includes generic timed out. This PR splits the regex and only matches connection-specific timeout text, so slow remote batch calls that currently retry would fail after the first timeout.
    Confidence: 0.88
  • [P1] Keep split retries abortable — extensions/memory-core/src/memory/manager-embedding-ops.ts:282-283
    Current main wraps batch embedding with runEmbeddingOperationWithTimeout and passes an AbortSignal to the provider. The PR calls this.withTimeout(provider.embedBatch(...)) without a signal, so a timed-out request can keep running while retries or split sub-batches issue additional remote calls.
    Confidence: 0.9

Overall correctness: patch is incorrect
Overall confidence: 0.9

Acceptance criteria:

  • node scripts/run-vitest.mjs extensions/memory-core/src/memory/manager-embedding-policy.test.ts extensions/memory-core/src/memory/manager-embedding-timeout.test.ts
  • Redacted real remote memory indexing run showing transient provider transport retry/split behavior after the refreshed patch

What I checked:

Likely related people:

  • steipete: Current blame for the retry predicate and abortable timeout helper points to Peter Steinberger, and shortlog shows him as the dominant recent contributor to the affected memory embedding files. (role: recent area contributor; confidence: high; commits: 48bb3b0a7476, d806682f782e, 3417dbabf43e; files: extensions/memory-core/src/memory/manager-embedding-ops.ts, extensions/memory-core/src/memory/manager-embedding-policy.ts)
  • vincentkoc: Vincent Koc recently changed the same embedding ops file for memory-core embedding cache typing, so he is a plausible adjacent reviewer for this surface. (role: adjacent area contributor; confidence: medium; commits: e92c2b63f9cc; files: extensions/memory-core/src/memory/manager-embedding-ops.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against e399a92e6cc9.

@clawsweeper clawsweeper Bot added the P2 Normal backlog priority with limited blast radius. label May 16, 2026
@openclaw-barnacle openclaw-barnacle Bot added the triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. label May 16, 2026
@clawsweeper clawsweeper Bot added impact:session-state Session, memory, transcript, context, or agent state can drift or corrupt. rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. and removed impact:session-state Session, memory, transcript, context, or agent state can drift or corrupt. rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. labels May 17, 2026
@clawsweeper

clawsweeper Bot commented May 20, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?
  • The egg game starts only after the PR passes the real-behavior proof check.
  • Before that, no creature or rarity is rolled. The treat waits for real proof.
  • This is still just collectible flavor: proof affects review readiness, not creature quality.

steipete added a commit that referenced this pull request May 31, 2026
Retry live query embeddings on transient provider transport failures and split eligible batch embedding socket failures after bounded retries.

Fixes #71784
Fixes #44166
Supersedes #44167

Co-authored-by: MrGeDiao <MrGeDiao@users.noreply.github.com>
@steipete

Copy link
Copy Markdown
Contributor

Thanks for the focused fix path. This has landed via #88599 in 899dc5f.

I ported the useful split-after-retry behavior onto current main, kept timeout and AbortSignal behavior intact, narrowed splitting so endpoint outages and batch timeouts do not amplify into recursive retry storms, and preserved contributor credit in the squash commit.

Closing this PR as superseded by the landed maintainer PR.

@steipete steipete closed this May 31, 2026
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request Jun 1, 2026
Retry live query embeddings on transient provider transport failures and split eligible batch embedding socket failures after bounded retries.

Fixes openclaw#71784
Fixes openclaw#44166
Supersedes openclaw#44167

Co-authored-by: MrGeDiao <MrGeDiao@users.noreply.github.com>
SYU8384 pushed a commit to SYU8384/openclaw that referenced this pull request Jun 3, 2026
Retry live query embeddings on transient provider transport failures and split eligible batch embedding socket failures after bounded retries.

Fixes openclaw#71784
Fixes openclaw#44166
Supersedes openclaw#44167

Co-authored-by: MrGeDiao <MrGeDiao@users.noreply.github.com>
sablehead pushed a commit to sablehead/openclaw that referenced this pull request Jun 10, 2026
Retry live query embeddings on transient provider transport failures and split eligible batch embedding socket failures after bounded retries.

Fixes openclaw#71784
Fixes openclaw#44166
Supersedes openclaw#44167

Co-authored-by: MrGeDiao <MrGeDiao@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

extensions: memory-core Extension: memory-core merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. P2 Normal backlog priority with limited blast radius. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. size: S status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

memory reindex aborts on transient embedding transport errors instead of retrying or splitting the batch

2 participants