fix(memory): retry transient embedding transport failures#44167
fix(memory): retry transient embedding transport failures#44167MrGeDiao wants to merge 1 commit into
Conversation
Greptile SummaryThis PR adds resilience to Key changes:
Issues found:
Confidence Score: 4/5
Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/memory/manager-embedding-ops.ts
Line: 622-626
Comment:
**`terminated` pattern is overly broad**
The word `terminated` can appear in non-network error messages, e.g. "Worker was terminated", "Request terminated by the user", or process-shutdown errors. Retrying those would be noise at best and mask real failures at worst.
Consider tightening it to the network-specific forms that are actually emitted by Node's `undici` / `fetch`:
```suggestion
private isRetryableEmbeddingNetworkError(message: string): boolean {
return /(fetch failed|network error|econnreset|econnrefused|connection reset|socket hang up|other side closed|socket terminated)/i.test(
message,
);
}
```
Alternatively, if the bare `terminated` is needed for a specific SDK, a comment documenting which SDK / error string it covers would make the intent clear and prevent accidental broadening in future patches.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/memory/manager.embedding-batches.test.ts
Line: 178-182
Comment:
**Test only weakly validates split completeness**
The assertion only checks the first and last two batch sizes, but doesn't confirm that *every* chunk was eventually embedded. If the recursive split somehow drops items (e.g., an off-by-one in `splitAt`), the test would still pass as long as the last two sizes happen to be `[1, 1]`.
Adding a check on the total number of single-item calls (which must equal the total chunk count for all items to be covered) would make the test significantly more robust:
```ts
// after the sync call
const singleItemCalls = batchSizes.filter((s) => s === 1).length;
const expectedChunks = managerSmall.status().chunks;
expect(singleItemCalls).toBe(expectedChunks);
```
This ensures the splitting truly accounts for every chunk, not just that the tail ends at size 1.
How can I resolve this? If you propose a fix, please make it concise.Last reviewed commit: 6017933 |
6017933 to
09d2bab
Compare
|
Addressing the Greptile suggestions in the latest push:
Re-ran the focused tests after the update: corepack pnpm exec vitest run --config vitest.unit.config.ts \
src/memory/manager.embedding-batches.test.ts \
src/memory/manager.sync-errors-do-not-crash.test.ts |
09d2bab to
755ac6b
Compare
|
Updated this PR onto current I also addressed the previous bot feedback by:
Local focused memory-core tests pass, and the GitHub PR checks are green. |
|
Codex review: needs real behavior proof before merge. Workflow note: Future ClawSweeper reviews update this same comment in place. How this review workflow works
Summary Reproducibility: yes. at source level. Current main retries transient transport messages but still has no split-after-exhaustion helper, while the PR's policy test models the remaining multi-item batch behavior. PR rating Rank-up moves:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. Real behavior proof Risk before merge
Maintainer options:
Next step before merge Security Review findings
Review detailsBest possible solution: Refresh or replace the branch with a split-after-exhaustion implementation that preserves Do we have a high-confidence way to reproduce the issue? Yes, at source level. Current main retries transient transport messages but still has no split-after-exhaustion helper, while the PR's policy test models the remaining multi-item batch behavior. Is this the best way to solve the issue? No, not in the current branch shape. The split idea is reasonable, but the patch needs to preserve current timeout retry matching and abortable provider calls before it is the maintainable fix. Label justifications:
Full review comments:
Overall correctness: patch is incorrect Acceptance criteria:
What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against e399a92e6cc9. |
|
ClawSweeper PR egg 🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat. Where did the egg go?
|
|
Thanks for the focused fix path. This has landed via #88599 in 899dc5f. I ported the useful split-after-retry behavior onto current main, kept timeout and AbortSignal behavior intact, narrowed splitting so endpoint outages and batch timeouts do not amplify into recursive retry storms, and preserved contributor credit in the squash commit. Closing this PR as superseded by the landed maintainer PR. |
Retry live query embeddings on transient provider transport failures and split eligible batch embedding socket failures after bounded retries. Fixes openclaw#71784 Fixes openclaw#44166 Supersedes openclaw#44167 Co-authored-by: MrGeDiao <MrGeDiao@users.noreply.github.com>
Retry live query embeddings on transient provider transport failures and split eligible batch embedding socket failures after bounded retries. Fixes openclaw#71784 Fixes openclaw#44166 Supersedes openclaw#44167 Co-authored-by: MrGeDiao <MrGeDiao@users.noreply.github.com>
Retry live query embeddings on transient provider transport failures and split eligible batch embedding socket failures after bounded retries. Fixes openclaw#71784 Fixes openclaw#44166 Supersedes openclaw#44167 Co-authored-by: MrGeDiao <MrGeDiao@users.noreply.github.com>
Summary
fetch failed,ECONNRESET, socket close, or undici socket/connect errors.extensions/memory-corenow treats narrow transport failures as retryable, reuses the existing backoff loop, and recursively splits multi-item batches after retry exhaustion while preserving result order.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Root Cause (if applicable)
Regression Test Plan (if applicable)
extensions/memory-core/src/memory/manager-embedding-policy.test.tsUser-visible / Behavior Changes
Memory indexing should be more resilient to transient embedding provider transport failures. No config or UI changes.
Diagram (if applicable)
Security Impact (required)
Yes/No): NoYes/No): NoYes/No): NoYes/No): NoYes/No): NoYes, explain risk + mitigation: N/ARepro + Verification
Environment
Steps
extensions/memory-coretest project.mainto confirm only memory-core policy/ops/test files changed.Expected
Actual
Evidence
Focused verification passed:
Human Verification (required)
What you personally verified (not just CI), and how:
[]; single-item batch failures remain terminal; structured and text batch paths both use the same retry/split policy.pnpm check:changedrun to completion. In this fork checkout it entered fail-safe all-lanes mode; typecheck, lint, and import-cycle checks passed, then the full test lane hit unrelatedsrc/agents/tools/web-fetch.provider-fallback.test.tsSSRF failures outside the changed files.Review Conversations
Compatibility / Migration
Yes/No): YesYes/No): NoYes/No): NoRisks and Mitigations