memory reindex aborts on transient embedding transport errors instead of retrying or splitting the batch #44166
Closed
BingqingLyu/openclaw
#545
Labels
- P2: Normal backlog priority with limited blast radius.
- clawsweeper:linked-pr-open: ClawSweeper found an open linked pull request for this issue.
- clawsweeper:no-new-fix-pr: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
- clawsweeper:source-repro: ClawSweeper found a high-confidence source-level issue reproduction.
- impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
- issue-rating: 🦞 diamond lobster: Very strong issue quality with high-confidence source-level or clear reproduction.
Summary
MemoryManagerEmbeddingOps.embedBatchWithRetry() currently retries rate-limit style failures, but it does not treat transient transport failures as retryable.
In practice, longer remote memory reindex runs can fail with errors like:
- TypeError: fetch failed
- ECONNRESET
- socket hang up
- terminated
- other side closed
When that happens, the whole memory sync aborts even though retrying the same batch often succeeds.
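To make the distinction concrete, here is a minimal sketch of how such errors could be classified. The message list is taken from the errors quoted above; the helper name `isTransientTransportError`, the `cause`-walking logic, and the depth limit are illustrative assumptions, not the project's actual code.

```typescript
// Hypothetical helper: the patterns come from the error messages quoted in
// this issue; the function name and matching strategy are assumptions.
const TRANSIENT_TRANSPORT_PATTERNS: readonly string[] = [
  "fetch failed",
  "econnreset",
  "socket hang up",
  "terminated",
  "other side closed",
];

function isTransientTransportError(err: unknown, maxDepth = 5): boolean {
  let current: unknown = err;
  // Walk the `cause` chain, because fetch implementations often wrap the
  // underlying socket error inside a generic TypeError.
  for (let depth = 0; depth < maxDepth && current != null; depth++) {
    const record = current as { message?: unknown; code?: unknown; cause?: unknown };
    const message = typeof record.message === "string" ? record.message : String(current);
    const code = typeof record.code === "string" ? record.code : "";
    const haystack = `${message} ${code}`.toLowerCase();
    if (TRANSIENT_TRANSPORT_PATTERNS.some((pattern) => haystack.includes(pattern))) {
      return true;
    }
    current = typeof current === "object" ? record.cause : undefined;
  }
  return false;
}
```

Substring matching is deliberately loose here. A pattern such as "terminated" could also match unrelated messages, so a real fix may want stricter checks.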
Why this matters
This shows up during larger remote embedding runs, especially when indexing many documents over network-backed providers.
The current failure mode is costly:
So the system is already resilient to rate limits, but still brittle against transient transport failures.
Expected behavior
For transient transport errors during batch embedding:
Reproduction
A focused unit-test repro is straightforward:
- Mock embedBatch() to fail once with TypeError("fetch failed"), then succeed.
- Mock embedBatch() to keep failing with fetch failed for texts.length > 1, but succeed for single-item batches.
Scope of a safe fix
This can stay intentionally narrow:
- embedBatchWithRetry() needs to change
A small targeted retry + split fallback should make remote memory reindex much more resilient without changing the normal success path.
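The retry + split fallback could look roughly like the sketch below. It is a standalone illustration, not the real embedBatchWithRetry(): the function name `embedWithRetryAndSplit`, the `RetryOptions` shape, the exponential backoff, and the halving strategy are all assumptions made for the example.

```typescript
// Hypothetical types and names; only the retry-then-split idea comes from the issue.
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

interface RetryOptions {
  maxAttempts: number; // attempts per batch before splitting
  baseDelayMs: number; // backoff base; doubled after each failed attempt
  isRetryable: (err: unknown) => boolean;
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function embedWithRetryAndSplit(
  embedBatch: EmbedBatch,
  texts: string[],
  opts: RetryOptions,
): Promise<number[][]> {
  if (texts.length === 0) return [];
  const attempts = Math.max(1, opts.maxAttempts);
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      // Normal success path: one call, no extra work.
      return await embedBatch(texts);
    } catch (err) {
      // Non-transient errors propagate unchanged.
      if (!opts.isRetryable(err)) throw err;
      lastError = err;
      if (attempt < attempts - 1) {
        await sleep(opts.baseDelayMs * 2 ** attempt);
      }
    }
  }
  // Retries exhausted: a single item cannot be split further.
  if (texts.length === 1) throw lastError;
  // Split the batch in half and embed each half independently,
  // preserving the original order of the results.
  const mid = Math.ceil(texts.length / 2);
  const left = await embedWithRetryAndSplit(embedBatch, texts.slice(0, mid), opts);
  const right = await embedWithRetryAndSplit(embedBatch, texts.slice(mid), opts);
  return [...left, ...right];
}
```

Both repro cases above map onto this sketch: a provider that fails once and then succeeds is handled by the retry loop, and a provider that only accepts single-item batches is handled by the recursive split.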