fix(memory): retry transient embedding failures #88599

Merged: steipete merged 1 commit into main from fix/memory-embedding-retries on May 31, 2026
@steipete

Contributor

Summary

  • Retry live memory-search query embeddings on transient provider transport failures instead of failing the search after one socket hiccup.
  • Split eligible text and structured embedding batches after bounded retries are exhausted, preserving result order so valid chunks can still index.
  • Keep service outages, endpoint-refused/unreachable failures, service throttles, and OpenClaw's own batch timeouts as retry-only failures so batch splitting does not amplify outages.
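The retry-then-split behavior described in the bullets above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: `embedWithRetryAndSplit`, `EmbedBatch`, `SplitRetryOptions`, and `isSplittable` are hypothetical names, `maxAttempts` is assumed to be at least 1, backoff delays are omitted, and the sketch retries every failure rather than only transient ones. It also rethrows when a single item still fails, which may differ from how memory-core handles that case.

```typescript
// Hypothetical types and names for illustration; not the memory-core API.
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

interface SplitRetryOptions {
  maxAttempts: number; // bounded retries per batch (assumed >= 1)
  isSplittable: (err: unknown) => boolean; // true for socket hiccups, false for outages/throttles
}

async function embedWithRetryAndSplit(
  texts: string[],
  embed: EmbedBatch,
  opts: SplitRetryOptions,
): Promise<number[][]> {
  let lastError: unknown;

  // Bounded retries on the batch as a whole.
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await embed(texts);
    } catch (err) {
      lastError = err;
    }
  }

  // Retries exhausted. Retry-only failures and single-item batches are not split,
  // so an outage is not amplified into many smaller failing requests.
  if (texts.length <= 1 || !opts.isSplittable(lastError)) {
    throw lastError;
  }

  // Split in half and recurse; concatenating left then right preserves input order.
  const mid = Math.ceil(texts.length / 2);
  const left = await embedWithRetryAndSplit(texts.slice(0, mid), embed, opts);
  const right = await embedWithRetryAndSplit(texts.slice(mid), embed, opts);
  return [...left, ...right];
}
```

Processing the halves sequentially keeps the request rate no higher than the original batch would have produced; a real implementation might choose differently.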

Fixes #71784
Fixes #44166
Supersedes #44167

Verification

  • node scripts/run-vitest.mjs extensions/memory-core/src/memory/manager-embedding-policy.test.ts extensions/memory-core/src/memory/index.test.ts passed, 34 tests.
  • env -u OPENCLAW_TESTBOX pnpm check:changed passed.
  • .agents/skills/autoreview/scripts/autoreview --mode local clean.

Real behavior proof

Behavior addressed: memory_search and memory reindex no longer fail immediately on transient embedding socket failures.
Real environment tested: local OpenClaw source checkout with mocked memory-core embedding providers.
Exact steps or command run after this patch: node scripts/run-vitest.mjs extensions/memory-core/src/memory/manager-embedding-policy.test.ts extensions/memory-core/src/memory/index.test.ts; env -u OPENCLAW_TESTBOX pnpm check:changed; .agents/skills/autoreview/scripts/autoreview --mode local.
Evidence after fix: focused Vitest passed 34 tests; changed gate passed; autoreview reported no accepted/actionable findings.
Observed result after fix: query embeddings retry after fetch failed | other side closed; exhausted splittable batch socket failures split recursively and preserve embedding order; ECONNREFUSED, unreachable hosts, service throttles, and manager batch timeouts do not split.
What was not tested: live external embedding provider outage behavior; coverage uses deterministic mocked provider failures.
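The split versus retry-only distinction reported under "Observed result" could be expressed as a small classifier along these lines. This is a hedged sketch, not the policy code from this PR: `classifyEmbeddingFailure`, `FailureKind`, and both pattern lists are hypothetical, and the concrete match strings for unreachable hosts, throttles, and batch timeouts are assumptions chosen for illustration. It also assumes "fetch failed | other side closed" denotes two alternative message fragments.

```typescript
// Hypothetical classifier; the names and match patterns are assumptions.
type FailureKind = "split-after-retries" | "retry-only" | "unclassified";

// Failures where splitting would only multiply requests against a failing service.
const RETRY_ONLY_PATTERNS: RegExp[] = [
  /ECONNREFUSED/i, // endpoint refused the connection
  /EHOSTUNREACH|ENETUNREACH/i, // host or network unreachable
  /\b429\b|rate limit/i, // assumed wording for service throttles
  /timed out/i, // assumed wording for batch timeouts
];

// Transient socket failures named in the PR description.
const SPLITTABLE_PATTERNS: RegExp[] = [/fetch failed/i, /other side closed/i];

function classifyEmbeddingFailure(err: unknown): FailureKind {
  const message = err instanceof Error ? err.message : String(err);
  // Check retry-only first so "fetch failed: ... ECONNREFUSED" is not treated as splittable.
  if (RETRY_ONLY_PATTERNS.some((p) => p.test(message))) return "retry-only";
  if (SPLITTABLE_PATTERNS.some((p) => p.test(message))) return "split-after-retries";
  return "unclassified";
}
```

Matching on message text is fragile; where the provider client exposes structured error codes, checking those would be the more robust choice.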

Retry live query embeddings on transient provider transport failures and split eligible batch embedding socket failures after bounded retries.

Fixes #71784

Fixes #44166

Supersedes #44167

Co-authored-by: MrGeDiao <MrGeDiao@users.noreply.github.com>
openclaw-barnacle (bot) added the labels extensions: memory-core, size: M, and maintainer on May 31, 2026
@clawsweeper

clawsweeper Bot commented May 31, 2026

Contributor

ClawSweeper status: review started.

I am starting a fresh review of this pull request: fix(memory): retry transient embedding failures. This is item 1/1 in the current shard (shard 0/1).

This placeholder means the worker is alive and reading the current context. I will edit this same comment with the actual review when the claws are done clicking.

Crustacean status: shell secured, claws on keyboard, evidence pebbles being sorted.

Labels

extensions: memory-core (Extension: memory-core), maintainer (Maintainer-authored PR), size: M
