memory reindex aborts on transient embedding transport errors instead of retrying or splitting the batch #44166

@MrGeDiao

Summary

MemoryManagerEmbeddingOps.embedBatchWithRetry() currently retries rate-limit-style failures, but it does not treat transient transport failures as retryable.

In practice, longer remote memory reindex runs can fail with errors like:

  • TypeError: fetch failed
  • ECONNRESET
  • socket hang up
  • terminated
  • other side closed

When that happens, the whole memory sync aborts even though retrying the same batch often succeeds.
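For illustration, the error strings above could be recognized with a small classifier like the one below. This is only a sketch: `isTransientTransportError` and `TRANSIENT_TRANSPORT_PATTERNS` are hypothetical names, not existing code, and the pattern list simply mirrors the messages listed in this issue. It also walks the `cause` chain, on the assumption that the runtime wraps the underlying socket error inside the `fetch failed` TypeError.

```typescript
// Hypothetical helper, not part of the existing codebase.
// The patterns mirror the error messages listed in this issue.
const TRANSIENT_TRANSPORT_PATTERNS: RegExp[] = [
  /fetch failed/i,
  /ECONNRESET/,
  /socket hang up/i,
  /\bterminated\b/i,
  /other side closed/i,
];

function isTransientTransportError(err: unknown): boolean {
  // Inspect the error and up to four nested causes, since the real
  // transport error may be attached as `cause` on a wrapper error.
  let current: unknown = err;
  for (let depth = 0; depth < 5 && current != null; depth++) {
    const message = current instanceof Error ? current.message : String(current);
    const code = (current as { code?: unknown }).code;
    // Match against both the error code (e.g. "ECONNRESET") and the message.
    const text = typeof code === "string" ? `${code} ${message}` : message;
    if (TRANSIENT_TRANSPORT_PATTERNS.some((pattern) => pattern.test(text))) {
      return true;
    }
    current = (current as { cause?: unknown }).cause;
  }
  return false;
}
```

Matching on message text is brittle by nature, so a real fix may prefer error codes where the provider exposes them.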

Why this matters

This shows up during larger remote embedding runs, especially when indexing many documents over network-backed providers.

The current failure mode is costly:

  • the whole reindex fails
  • already-processed chunks are wasted
  • rerunning often succeeds without any input change

So the system is already resilient to rate limits, but still brittle against transient transport failures.

Expected behavior

For transient transport errors during batch embedding:

  1. Retry a few times with the existing backoff behavior.
  2. If retries are exhausted and the batch has multiple items, split the batch and continue recursively.
  3. Fail immediately only for non-retryable errors; a single-item batch fails once its retries are exhausted.
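The three steps above can be sketched roughly as follows. This is not the actual embedBatchWithRetry() implementation: `embedWithRetryAndSplit`, `RetryOptions`, and its fields are hypothetical names, and the retry predicate and backoff are passed in so the sketch makes no assumption about the existing backoff code.

```typescript
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

// Hypothetical options bag; the real method would reuse its existing backoff.
interface RetryOptions {
  maxAttempts: number; // total attempts per batch, assumed >= 1
  isRetryable: (err: unknown) => boolean;
  backoff: (attempt: number) => Promise<void>;
}

async function embedWithRetryAndSplit(
  embedBatch: EmbedBatch,
  texts: string[],
  opts: RetryOptions,
): Promise<number[][]> {
  if (texts.length === 0) return [];

  let lastError: unknown;
  // Step 1: retry the whole batch with backoff between attempts.
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await embedBatch(texts);
    } catch (err) {
      // Step 3: non-retryable errors propagate immediately.
      if (!opts.isRetryable(err)) throw err;
      lastError = err;
      if (attempt < opts.maxAttempts) await opts.backoff(attempt);
    }
  }

  // Step 3: a single item that still fails after retries cannot be split.
  if (texts.length === 1) throw lastError;

  // Step 2: split in half and recurse; concatenation preserves input order.
  const mid = Math.ceil(texts.length / 2);
  const left = await embedWithRetryAndSplit(embedBatch, texts.slice(0, mid), opts);
  const right = await embedWithRetryAndSplit(embedBatch, texts.slice(mid), opts);
  return [...left, ...right];
}
```

The halves are processed sequentially rather than in parallel, on the assumption that a connection which just failed is better served by less concurrent load.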

Reproduction

A focused unit-test repro is straightforward:

  • mock embedBatch() to fail once with TypeError("fetch failed"), then succeed (exercises the retry path)
  • mock embedBatch() to keep failing with TypeError("fetch failed") for texts.length > 1, but succeed for single-item batches (exercises the split fallback)
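The two mocks above might look like this without any test framework. All names here (`MockEmbedBatch`, `failOnceThenSucceed`, `failUnlessSingleItem`, `fakeVectors`) are hypothetical, and the vectors are deterministic stand-ins rather than real embeddings. How the mock is injected into embedBatchWithRetry() depends on the actual provider wiring, which this sketch does not assume.

```typescript
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

interface MockEmbedBatch {
  embedBatch: EmbedBatch;
  calls: string[][]; // every batch the mock was asked to embed, in order
}

// Deterministic stand-in vectors so results can be compared exactly.
const fakeVectors = (texts: string[]): number[][] => texts.map((t) => [t.length]);

// Scenario 1: the first call fails with a transport error, later calls succeed.
function failOnceThenSucceed(): MockEmbedBatch {
  const calls: string[][] = [];
  const embedBatch: EmbedBatch = async (texts) => {
    calls.push([...texts]);
    if (calls.length === 1) throw new TypeError("fetch failed");
    return fakeVectors(texts);
  };
  return { embedBatch, calls };
}

// Scenario 2: any multi-item batch fails, single-item batches succeed.
function failUnlessSingleItem(): MockEmbedBatch {
  const calls: string[][] = [];
  const embedBatch: EmbedBatch = async (texts) => {
    calls.push([...texts]);
    if (texts.length > 1) throw new TypeError("fetch failed");
    return fakeVectors(texts);
  };
  return { embedBatch, calls };
}
```

With the first mock, a fixed implementation should return the vectors after exactly two calls. With the second, it should return one vector per input text in input order, and `calls` should end with single-item batches.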

Scope of a safe fix

This can stay intentionally narrow:

  • only embedBatchWithRetry() needs to change
  • no provider-specific branching
  • no config/schema changes
  • no timeout constant changes

A small targeted retry + split fallback should make remote memory reindex much more resilient without changing the normal success path.

Metadata

Labels

  • P2: Normal backlog priority with limited blast radius.
  • clawsweeper:linked-pr-open: ClawSweeper found an open linked pull request for this issue.
  • clawsweeper:no-new-fix-pr: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
  • clawsweeper:source-repro: ClawSweeper found a high-confidence source-level issue reproduction.
  • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🦞 diamond lobster: Very strong issue quality with high-confidence source-level or clear reproduction.