Description
Memory search embedding reindex consistently fails with TypeError: fetch failed after successfully indexing a significant number of chunks (~41K out of estimated ~45K). The .tmp file is deleted on failure (runSafeReindex rollback), so all progress is lost and the next attempt starts from scratch — creating an infinite failure loop.
Environment
- OpenClaw version: v2026.3.24 (cff6dc9)
- Node.js: v25.8.2
- OS: Linux 6.8.0-88-generic (x64), 123GB RAM
- Embedding provider: SiliconFlow API (Pro/BAAI/bge-m3, 1024d, OpenAI-compatible endpoint at https://api.siliconflow.cn/v1/)
- Files: 272 .md files (~187MB) under workspace memory/ directory
- Config: memorySearch.remote.batch.concurrency: 2, default retry settings (3 attempts, 500ms/8000ms backoff)
- The main agent, using the same SiliconFlow config, successfully indexed 4 chunks, so the issue appears specific to large-scale reindexing
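To put the default retry settings in perspective, here is a rough sketch of the wait schedule they would produce. It assumes that 500ms is the base delay, 8000ms is the cap, and the delay doubles on each retry. That is my reading of the config, not confirmed from the OpenClaw source, and `backoffDelays` is a hypothetical helper. Jitter is ignored (the "retrying in 530ms" log line suggests some is applied).

```typescript
// Hypothetical helper: capped exponential backoff, no jitter.
// Assumption: "3 attempts" means 1 initial try plus 2 retries.
function backoffDelays(attempts: number, baseMs: number, maxMs: number): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < attempts - 1; retry++) {
    // The delay doubles on each retry and is clamped to maxMs.
    delays.push(Math.min(baseMs * Math.pow(2, retry), maxMs));
  }
  return delays;
}

const delays = backoffDelays(3, 500, 8000);
const totalWaitMs = delays.reduce((sum, d) => sum + d, 0);
console.log(delays, totalWaitMs);
```

If these assumptions hold, three attempts only cover about 1.5 seconds of waiting, so any network interruption longer than that during an hour-long run would exhaust the retries and abort the whole reindex.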
Reproduction Steps
- Configure an agent with memorySearch.enabled: true
- Place ~270 large .md files (100KB-1MB each) in the workspace memory/ directory
- Use a remote embedding provider (SiliconFlow, OpenAI-compatible)
- Trigger memory_search, which initiates runSafeReindex
- Observe: tmp file grows to ~2GB, ~41K chunks indexed
- After ~1 hour: memory sync failed: TypeError: fetch failed
- tmp is deleted, sqlite remains empty → next trigger restarts from scratch
Error Log
{"subsystem":"memory","level":"warn","msg":"memory embeddings rate limited; retrying in 530ms"} // once during indexing
{"subsystem":"memory","level":"warn","msg":"memory sync failed (session-start): TypeError: fetch failed"}
{"subsystem":"memory","level":"warn","msg":"memory sync failed (search): TypeError: fetch failed"}
No stack trace is included — TypeError: fetch failed is logged without the underlying cause (DNS, timeout, connection reset, etc.).
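For context: in Node.js, the built-in fetch reports network-level failures as a TypeError with the message "fetch failed", and the underlying error (for example one carrying a code such as ECONNRESET or ENOTFOUND) is attached as the `cause` property. A minimal sketch of a logging helper that walks this chain is below. `describeErrorChain` is a hypothetical name, not an existing OpenClaw function.

```typescript
// Hypothetical helper: flatten an error and its `cause` chain into one log-friendly string.
function describeErrorChain(err: unknown, maxDepth = 5): string {
  const parts: string[] = [];
  let current: unknown = err;
  for (let depth = 0; current != null && depth < maxDepth; depth++) {
    if (!(current instanceof Error)) {
      // Non-Error causes (strings, plain objects) are stringified and end the chain.
      parts.push(String(current));
      break;
    }
    // System errors from Node usually carry a string `code` such as "ECONNRESET".
    const code = (current as Error & { code?: unknown }).code;
    const label = `${current.name}: ${current.message}`;
    parts.push(typeof code === "string" ? `${label} [${code}]` : label);
    current = (current as Error & { cause?: unknown }).cause;
  }
  return parts.join(" <- caused by: ");
}
```

Logging `describeErrorChain(err)` (or `err.cause` together with `err.stack`) instead of only the message would show whether the failure is a connection reset, a DNS error, a timeout, or something else.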
Observations
- The embedding API itself is stable — manual test with 10 concurrent requests to SiliconFlow: 0 failures, ~300-400ms each
- Not a 429/rate-limit issue — only one rate-limit warning in the entire run
- Not an OOM issue — 123GB RAM, no swap pressure
- Not concurrency-dependent — fails with both concurrency=2 and concurrency=4
- Not specific to this provider — same failure pattern occurred with Alibaba DashScope (text-embedding-v4) before switching to SiliconFlow
- Progress loss is the critical issue: runSafeReindex deletes the .tmp on any failure, meaning ~1 hour of API calls is wasted every time
- No stack trace makes it impossible to determine if the root cause is: undici connection pool reuse of dead connections, TLS session timeout, DNS resolution failure, or something else
Suggested Improvements
- Include the full stack trace in the TypeError: fetch failed log so the root cause can be identified
- Partial progress preservation: instead of deleting .tmp on failure, consider checkpointing or resuming from the last successful batch
- Connection health checks: validate embedding API connectivity before starting a long reindex, or periodically during the process
- Graceful degradation: if one batch fails, skip it and continue instead of aborting the entire reindex
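To illustrate the checkpointing and graceful-degradation suggestions together, here is a minimal in-memory sketch. All names (`Batch`, `Checkpoint`, `reindexWithCheckpoint`, the `embed` and `store` callbacks) are hypothetical and do not reflect how runSafeReindex is actually structured. A real implementation would persist the checkpoint alongside the .tmp database rather than keep it in memory.

```typescript
type Batch = { id: number; chunks: string[] };
type Checkpoint = { completed: Set<number>; failed: Set<number> };

// Hypothetical sketch: process batches in order, skip ones already done,
// and record failures instead of aborting the whole run.
async function reindexWithCheckpoint(
  batches: Batch[],
  embed: (chunks: string[]) => Promise<number[][]>,
  store: (batchId: number, vectors: number[][]) => void,
  checkpoint: Checkpoint,
): Promise<Checkpoint> {
  for (const batch of batches) {
    // Resume: batches finished in an earlier run are not re-embedded.
    if (checkpoint.completed.has(batch.id)) continue;
    try {
      const vectors = await embed(batch.chunks);
      store(batch.id, vectors);
      checkpoint.completed.add(batch.id);
      checkpoint.failed.delete(batch.id);
    } catch {
      // Graceful degradation: remember the failed batch and keep going.
      checkpoint.failed.add(batch.id);
    }
  }
  return checkpoint;
}
```

With this shape, a single fetch failed near the end of the run would leave one batch marked as failed, and the next trigger would only need to retry that batch instead of repeating roughly an hour of API calls.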