Skip to content

fix(memory): retry transient embedding transport failures#545

Open
BingqingLyu wants to merge 1 commit into
mainfrom
fork-pr-44167-codex-memory-fetch-retry
Open

fix(memory): retry transient embedding transport failures#545
BingqingLyu wants to merge 1 commit into
mainfrom
fork-pr-44167-codex-memory-fetch-retry

Conversation

@BingqingLyu

@BingqingLyu BingqingLyu commented Apr 27, 2026

Copy link
Copy Markdown
Owner

Summary

  • Problem: memory embedding batch sync could abort immediately when a remote embedding provider raised transient transport failures such as fetch failed, ECONNRESET, socket close, or undici socket/connect errors.
  • Why it matters: one temporary provider/network failure can prevent otherwise valid memory chunks from being indexed.
  • What changed: current extensions/memory-core now treats narrow transport failures as retryable, reuses the existing backoff loop, and recursively splits multi-item batches after retry exhaustion while preserving result order.
  • What did NOT change (scope boundary): no provider selection, config/schema, timeout constant, cache format, or non-retryable error behavior changes.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Root Cause (if applicable)

  • Root cause: the embedding retry classifier covered rate-limit and service-side failures, but not common transient transport failures from fetch/undici/network sockets.
  • Missing detection / guardrail: unit coverage did not lock in transport retry classification or the exhausted-retry batch split fallback.
  • Contributing context (if known): remote embedding providers can fail a whole batch with transient transport errors even when individual inputs are valid.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: extensions/memory-core/src/memory/manager-embedding-policy.test.ts
  • Scenario the test should lock in: transient transport errors are retryable; exhausted transport failures split multi-item batches recursively; exhausted service retry errors such as 429 do not split.
  • Why this is the smallest reliable guardrail: the retry classifier and split policy are deterministic and can be validated without live provider/network calls.
  • Existing test that already covers this (if any): existing tests covered rate-limit, 5xx, and tokens-per-day retry behavior, but not transport failures.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

Memory indexing should be more resilient to transient embedding provider transport failures. No config or UI changes.

Diagram (if applicable)

Before:
[embedding batch] -> [transient transport failure] -> [sync aborts]

After:
[embedding batch] -> [retry with backoff] -> [split batch after retry exhaustion] -> [index valid chunks]

Security Impact (required)

  • New permissions/capabilities? (Yes/No): No
  • Secrets/tokens handling changed? (Yes/No): No
  • New/changed network calls? (Yes/No): No
  • Command/tool execution surface changed? (Yes/No): No
  • Data access scope changed? (Yes/No): No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node v22.18.0, pnpm 10.33.0
  • Model/provider: mocked embedding provider in unit tests
  • Integration/channel (if any): N/A
  • Relevant config (redacted): N/A

Steps

  1. Run the focused memory embedding policy test.
  2. Run the full extensions/memory-core test project.
  3. Inspect the diff against current main to confirm only memory-core policy/ops/test files changed.

Expected

  • Transport failures are retried and, after retry exhaustion, multi-item batches split recursively.
  • Non-transport retryable service errors do not trigger batch splitting.
  • Single-item batches still fail normally after retry exhaustion.

Actual

  • Matches expected behavior in unit coverage.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Focused verification passed:

pnpm test extensions/memory-core/src/memory/manager-embedding-policy.test.ts
pnpm test extensions/memory-core

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: transport error classification, retry-before-split behavior, recursive split ordering, no split for exhausted 429/service retry errors.
  • Edge cases checked: empty batches still return []; single-item batch failures remain terminal; structured and text batch paths both use the same retry/split policy.
  • What you did not verify: a clean full pnpm check:changed run to completion. In this fork checkout it entered fail-safe all-lanes mode; typecheck, lint, and import-cycle checks passed, then the full test lane hit unrelated src/agents/tools/web-fetch.provider-fallback.test.ts SSRF failures outside the changed files.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes/No): Yes
  • Config/env changes? (Yes/No): No
  • Migration needed? (Yes/No): No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: broader retry classification could hide genuinely persistent provider/network problems for longer.
    • Mitigation: only narrow transport patterns are classified as retryable; single-item batches still fail after retry exhaustion.
  • Risk: splitting batches increases provider calls after repeated transport failures.
    • Mitigation: splitting only happens after the existing retry loop is exhausted and only for multi-item batches.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

memory reindex aborts on transient embedding transport errors instead of retrying or splitting the batch

2 participants