memorySearch: embedding reindex fails with 'TypeError: fetch failed' after indexing ~40K chunks #56815

@dionysoslin615

Description

Memory search embedding reindex consistently fails with TypeError: fetch failed after successfully indexing a significant number of chunks (~41K out of estimated ~45K). The .tmp file is deleted on failure (runSafeReindex rollback), so all progress is lost and the next attempt starts from scratch — creating an infinite failure loop.

Environment

  • OpenClaw version: v2026.3.24 (cff6dc9)
  • Node.js: v25.8.2
  • OS: Linux 6.8.0-88-generic (x64), 123GB RAM
  • Embedding provider: SiliconFlow API (Pro/BAAI/bge-m3, 1024d, OpenAI-compatible endpoint at https://api.siliconflow.cn/v1/)
  • Files: 272 .md files (~187MB) under workspace memory/ directory
  • Config: memorySearch.remote.batch.concurrency: 2, default retry settings (3 attempts, 500ms/8000ms backoff)
  • The main agent, using the same SiliconFlow config, successfully indexed 4 chunks, so the issue appears specific to large-scale reindexing

Reproduction Steps

  1. Configure an agent with memorySearch.enabled: true
  2. Place ~270 large .md files (100KB-1MB each) in the workspace memory/ directory
  3. Use a remote embedding provider (SiliconFlow, OpenAI-compatible)
  4. Trigger memory_search which initiates runSafeReindex
  5. Observe: tmp file grows to ~2GB, ~41K chunks indexed
  6. After ~1 hour: memory sync failed: TypeError: fetch failed
  7. tmp is deleted, sqlite remains empty → next trigger restarts from scratch

Error Log

{"subsystem":"memory","level":"warn","msg":"memory embeddings rate limited; retrying in 530ms"}  // once during indexing
{"subsystem":"memory","level":"warn","msg":"memory sync failed (session-start): TypeError: fetch failed"}
{"subsystem":"memory","level":"warn","msg":"memory sync failed (search): TypeError: fetch failed"}

No stack trace is included — TypeError: fetch failed is logged without the underlying cause (DNS, timeout, connection reset, etc.).
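
For illustration, here is a minimal sketch of how the underlying cause could be surfaced in the log. In Node.js, the built-in fetch usually attaches the real network error (for example a socket reset or a DNS failure) to the `cause` property of the `TypeError: fetch failed` it throws. The helper name `describeErrorChain` is hypothetical and is not part of OpenClaw; how OpenClaw actually formats this log line is not known from this report.

```typescript
// Hypothetical helper (not an OpenClaw API): flatten an error and its
// `cause` chain into a single log-friendly string.
function describeErrorChain(err: unknown, maxDepth = 5): string {
  const parts: string[] = [];
  let current: unknown = err;
  for (let depth = 0; current != null && depth < maxDepth; depth++) {
    if (!(current instanceof Error)) {
      // Non-Error causes (strings, plain objects) end the chain.
      parts.push(String(current));
      break;
    }
    const e = current as Error & { code?: unknown; cause?: unknown };
    // Node system errors carry a string `code` such as ECONNRESET or ENOTFOUND.
    const code = typeof e.code === "string" ? ` [${e.code}]` : "";
    parts.push(`${e.name}: ${e.message}${code}`);
    current = e.cause;
  }
  return parts.join(" <- caused by: ");
}

// Usage: simulate the shape of a failed fetch() without touching the network.
const socketError = Object.assign(new Error("read ECONNRESET"), { code: "ECONNRESET" });
const fetchError = Object.assign(new TypeError("fetch failed"), { cause: socketError });
console.log(describeErrorChain(fetchError));
```

Logging this string (or the full `cause.stack`) next to "memory sync failed" would distinguish a timeout from a connection reset or a DNS failure.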

Observations

  1. The embedding API itself is stable — manual test with 10 concurrent requests to SiliconFlow: 0 failures, ~300-400ms each
  2. Not a 429/rate-limit issue — only one rate-limit warning in the entire run
  3. Not an OOM issue — 123GB RAM, no swap pressure
  4. Not concurrency-dependent — fails with both concurrency=2 and concurrency=4
  5. Not specific to this provider — same failure pattern occurred with Alibaba DashScope (text-embedding-v4) before switching to SiliconFlow
  6. Progress loss is the critical issue: runSafeReindex deletes the .tmp on any failure, meaning ~1 hour of API calls is wasted every time
  7. No stack trace makes it impossible to determine if the root cause is: undici connection pool reuse of dead connections, TLS session timeout, DNS resolution failure, or something else

Suggested Improvements

  1. Include full stack trace in the TypeError: fetch failed log so the root cause can be identified
  2. Partial progress preservation — instead of deleting .tmp on failure, consider checkpointing or resuming from the last successful batch
  3. Connection health checks — validate embedding API connectivity before starting a long reindex, or periodically during the process
  4. Graceful degradation — if one batch fails, skip it and continue instead of aborting the entire reindex
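
Suggestions 2 and 4 can be sketched together as follows. This is only an illustration under stated assumptions, not how runSafeReindex is implemented: `Chunk`, `EmbedFn`, `CheckpointStore`, and `reindexWithCheckpoint` are all hypothetical names, and the store stands in for whatever durable storage (for example the .tmp SQLite file) would hold the embeddings. Backoff between attempts is omitted to keep the sketch short.

```typescript
// All types and names below are hypothetical, for illustration only.
type Chunk = { id: string; text: string };
type EmbedFn = (texts: string[]) => Promise<number[][]>;

interface CheckpointStore {
  has(id: string): boolean;                 // was this chunk embedded in an earlier run?
  save(id: string, vector: number[]): void; // persist one embedding
}

async function reindexWithCheckpoint(
  chunks: Chunk[],
  embed: EmbedFn,
  store: CheckpointStore,
  batchSize = 32,
  maxAttempts = 3,
): Promise<{ indexed: number; skipped: number; failed: string[] }> {
  // Resume: only chunks that are not yet in the store need an API call.
  const pending = chunks.filter((c) => !store.has(c.id));
  const failed: string[] = [];
  let indexed = 0;

  for (let i = 0; i < pending.length; i += batchSize) {
    const batch = pending.slice(i, i + batchSize);
    let done = false;
    for (let attempt = 1; attempt <= maxAttempts && !done; attempt++) {
      try {
        const vectors = await embed(batch.map((c) => c.text));
        // Save right away so a later failure cannot discard this batch.
        batch.forEach((c, j) => store.save(c.id, vectors[j]));
        indexed += batch.length;
        done = true;
      } catch {
        // Swallow the error and retry; a real implementation would wait and log here.
      }
    }
    // Graceful degradation: record the failed batch and move on instead of aborting.
    if (!done) failed.push(...batch.map((c) => c.id));
  }
  return { indexed, skipped: chunks.length - pending.length, failed };
}
```

With this shape, a second call over the same chunks only re-requests the ids listed in `failed`, so a single `fetch failed` near the end of a one-hour run costs one batch rather than the whole reindex.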
