Bug: memory search live embedding fails ~20–40% with fetch failed | other side closed (provider-agnostic; upstream healthy) #71784

@kevinheinrichs

Bug: Memory search transient fetch failed | other side closed / Client network socket disconnected before secure TLS connection was established for live embedding queries (provider-agnostic)

Summary

Live memory search queries fail intermittently (~20–40% of calls) with one of two transient TLS/socket errors on both remote embedding providers tested (OpenAI, Gemini). The same endpoints work reliably via curl and via a plain Node.js fetch() from the same host, so the upstream APIs are healthy. The failure appears to originate inside OpenClaw's internal SSRF-guarded fetch path.

Bulk reindex via the batch endpoint is not affected. Only the per-query single-embed path used by openclaw memory search (and presumably the in-conversation memory recall path) shows the issue.

This makes semantic memory recall unreliable in interactive sessions even though openclaw memory status reports Embeddings: ready.


Environment

| Item | Value |
| --- | --- |
| OpenClaw version | 2026.4.24 (cbcfdf6) |
| Node.js | v24.14.1 |
| OS | Ubuntu 24.04 LTS, kernel 6.8.0-110-generic (x86_64) |
| Network | direct outbound, no proxy, IPv4+IPv6 both working |
| memory.backend | builtin |
| sqlite-vec | enabled, vec0.so loaded, Vector dims: 3072, FTS: ready |

Reproduced on two different remote embedding providers configured via openclaw config set:

  • openai / text-embedding-3-large (3072-dim, ~30 KB response)
  • gemini / gemini-embedding-2-preview (3072-dim, ~60 KB response)

Both fail with the same intermittent socket error. The Gemini case fails more often, consistent with a payload-size correlation, but OpenAI also fails repeatably.


Repro

1. Configure a remote embedding provider

openclaw config set memory.backend builtin
openclaw config set agents.defaults.memorySearch.provider openai
openclaw config set agents.defaults.memorySearch.model text-embedding-3-large
openclaw config set models.providers.openai \
  '{"baseUrl":"https://api.openai.com/v1","apiKey":"sk-...","models":[]}' --strict-json
openclaw gateway restart

2. Reindex (works fine, uses batch endpoint)

openclaw memory index --force --agent main
# → Memory index updated (main).

openclaw memory status --deep --agent main then reports:

Provider: openai (requested: openai)
Model: text-embedding-3-large
Vector: ready
Vector dims: 3072
FTS: ready
Embeddings: ready

3. Run live queries (fails ~20–40% of the time)

for i in 1 2 3 4 5 6 7 8 9 10; do
  result=$(openclaw memory search "pool stress test query $i" --agent main 2>&1 | tail -3)
  if echo "$result" | grep -qE "fetch failed|other side closed|socket disconnected"; then
    echo "Q$i: FAIL"
  else
    echo "Q$i: OK"
  fi
done

Observed output (idle gateway):

Q1: OK
Q2: OK
Q3: OK
Q4: OK
Q5: FAIL
Q6: FAIL
Q7: OK
Q8: OK
Q9: OK
Q10: OK
→ OK: 8 / FAIL: 2

Under concurrent load (background reindex of other agents running):

→ OK: 6 / FAIL: 4
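
To turn the per-query lines into a failure rate, the captured outputs can be tallied programmatically. The following is a minimal Node.js sketch, assuming each query's combined stdout/stderr has already been captured as a string. `tallySearchResults` is a hypothetical helper, not part of OpenClaw; it reuses the same pattern as the grep in the loop above.

```javascript
// Hypothetical helper (not part of OpenClaw): classify captured
// `openclaw memory search` outputs with the same pattern as the grep above.
const FAILURE_PATTERN = /fetch failed|other side closed|socket disconnected/;

function tallySearchResults(outputs) {
  let ok = 0;
  let fail = 0;
  for (const text of outputs) {
    if (FAILURE_PATTERN.test(text)) fail += 1;
    else ok += 1;
  }
  const total = ok + fail;
  // failureRate is a fraction in [0, 1]; 0 when there were no runs.
  return { ok, fail, failureRate: total === 0 ? 0 : fail / total };
}
```

Running it over a larger sample (for example 100 queries instead of 10) should give a tighter estimate than a single batch of ten.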

4. Two distinct error messages observed

From the gateway log (/tmp/openclaw/openclaw-<date>.log):

ERROR Memory search failed: fetch failed | other side closed
ERROR Memory search failed: fetch failed | Client network socket disconnected before secure TLS connection was established

Both originate from dist/subsystem-CWI_MDy_.js:161 (search subsystem) wrapping a lower-level error from dist/engine-embeddings-DVkdyn0v.js → withRemoteHttpResponse → fetchWithSsrFGuard → undici dispatcher.

The two strings correspond to undici error causes:

  • other side closed → server closed the keep-alive socket between requests, request reused a dead socket.
  • Client network socket disconnected before secure TLS connection was established → TLS handshake aborted on a fresh socket (typical for pinned-DNS + Agent reuse with broken keep-alive).

Both are classic symptoms of a misconfigured / overly aggressive HTTP keep-alive pool.
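
Since fetch() reports both cases as a generic "fetch failed" with the real reason in `cause`, telling them apart from auth or URL errors means walking the cause chain. A minimal sketch of such a check follows. `isTransientSocketError` is a hypothetical helper, not an OpenClaw function; the error codes are an assumption based on undici reporting socket errors as UND_ERR_SOCKET and Node reporting the aborted TLS handshake as ECONNRESET.

```javascript
// Hypothetical classifier: does this error (or anything in its `cause`
// chain) look like one of the two transient socket failures?
const TRANSIENT_CODES = new Set(["UND_ERR_SOCKET", "ECONNRESET"]);
const TRANSIENT_MESSAGES = [
  "other side closed",
  "Client network socket disconnected before secure TLS connection was established",
];

function isTransientSocketError(err) {
  const seen = new Set(); // guards against cyclic cause chains
  for (let e = err; e && typeof e === "object" && !seen.has(e); e = e.cause) {
    seen.add(e);
    if (TRANSIENT_CODES.has(e.code)) return true;
    const message = String(e.message ?? "");
    if (TRANSIENT_MESSAGES.some((m) => message.includes(m))) return true;
  }
  return false;
}
```

Matching on codes first and on message text only as a fallback keeps the check usable even if the wrapping layer drops the `code` property.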


Why this is not the upstream API

Same host, same network, same time:

# Direct curl to OpenAI: 100% success
curl -sS -o /dev/null -w "HTTP %{http_code} time=%{time_total}\n" -X POST \
  https://api.openai.com/v1/embeddings \
  -H "Authorization: Bearer sk-..." -H "Content-Type: application/json" \
  -d '{"input":"transient pool test","model":"text-embedding-3-large"}'
# → HTTP 200 time=0.477410

# Native Node.js fetch to Gemini: 100% success, full 3072-dim payload returned
node -e "
fetch('https://generativelanguage.googleapis.com/v1beta/models/gemini-embedding-2-preview:embedContent', {
  method:'POST',
  headers:{'Content-Type':'application/json','x-goog-api-key':'***'},
  body:JSON.stringify({content:{parts:[{text:'test'}]},taskType:'RETRIEVAL_QUERY',outputDimensionality:3072})
}).then(r=>r.json()).then(j=>console.log('OK dims=',(j.embedding?.values||[]).length))
  .catch(e=>console.error('FAIL:', e.message, e.cause?.message));
"
# → OK dims= 3072

Repeated curl and Node fetch runs against both endpoints from the same machine never reproduce the disconnect. The failure is specific to OpenClaw's internal fetch path.


Why it is not provider-specific

| Provider | Model | Response size | Failure rate observed |
| --- | --- | --- | --- |
| OpenAI | text-embedding-3-large (3072-dim) | ~30 KB | ~20–40% |
| Google | gemini-embedding-2-preview (3072-dim) | ~60 KB | ~80–100% |

Same host, same gateway version, same code path (withRemoteHttpResponse → fetchWithSsrFGuard). Switching provider does not eliminate the bug, only changes its frequency. Larger response bodies / longer-held sockets correlate with higher failure rates, which strongly suggests a connection-pool / keep-alive issue rather than a per-provider authentication or URL bug.


Suspected root cause

Looking at the bundled code paths in 2026.4.24 (cbcfdf6):

  • dist/extensions/google/embedding-provider.js and the corresponding OpenAI path both call withRemoteHttpResponse({ url, ssrfPolicy, init }).
  • dist/engine-embeddings-DVkdyn0v.js defines withRemoteHttpResponse → fetchWithSsrFGuard.
  • dist/fetch-guard-DKbwHPzH.js instantiates per-call undici dispatchers via:
    • createPolicyDispatcherWithoutPinnedDns(...) for direct mode, or
    • createPinnedDispatcher(await resolvePinnedHostnameWithPolicy(...)) for the SSRF-pinned path,
  • backed by createHttp1Agent / createHttp1EnvHttpProxyAgent / createHttp1ProxyAgent from dist/undici-runtime-x3fQiq5e.js, with a global stream timeout from dist/undici-global-dispatcher-KzKcGOUY.js.

The user-visible error patterns (other side closed, Client network socket disconnected before secure TLS connection was established) are the classic undici socket-reuse-on-dead-keepalive failure mode. The per-call dispatcher / pinned-DNS approach appears to either:

  1. share a connection pool across calls without reliably retiring sockets that the upstream has already half-closed,
  2. or interact badly with undici keep-alive defaults (keepAliveTimeout, keepAliveMaxTimeout, pipelining) for high-latency TLS endpoints like api.openai.com and generativelanguage.googleapis.com,
  3. or close/release the dispatcher (release(dispatcher) → closeDispatcher) in a way that leaves an in-flight socket reusable for the next call.

A single retry on UND_ERR_SOCKET / ECONNRESET / TLS-handshake-aborted errors at the withRemoteHttpResponse layer would mask this for users, but the underlying pool behavior likely deserves a fix.
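
A bounded retry of that kind could look roughly like the sketch below. `retryTransient`, `hasRetryableCause`, and the option names are hypothetical and not an existing OpenClaw API; the defaults (2 attempts, 250 ms) are illustrative assumptions. Only errors whose own or wrapped `code` is UND_ERR_SOCKET or ECONNRESET are retried, so auth or validation failures still surface immediately.

```javascript
// Hypothetical bounded retry for transient socket errors.
const RETRYABLE_CODES = new Set(["UND_ERR_SOCKET", "ECONNRESET"]);

function hasRetryableCause(err) {
  // fetch() wraps the low-level error in `cause`, so check both levels.
  return RETRYABLE_CODES.has(err?.code) || RETRYABLE_CODES.has(err?.cause?.code);
}

async function retryTransient(
  operation,
  { maxAttempts = 2, backoffMs = 250, isRetryable = hasRetryableCause } = {},
) {
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      return await operation(attempt);
    } catch (err) {
      // Give up on the last attempt or on errors that are not transient.
      if (attempt === maxAttempts || !isRetryable(err)) throw err;
      // Linear backoff before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, backoffMs * attempt));
    }
  }
  throw new RangeError("maxAttempts must be at least 1");
}
```

A caller would wrap the single-embed request, for example `await retryTransient(() => fetch(url, init))`, where `url` and `init` stand for whatever the embedding provider already builds.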


Impact

  • Semantic recall is unreliable in interactive sessions despite Embeddings: ready reporting healthy.
  • Users see "no matches" results or hard Memory search failed: fetch failed | … errors at a measurable rate (~20–40% in this environment, higher under concurrent load).
  • Active-memory plugin recall similarly degrades.
  • openclaw doctor does not surface this — memory status reports the provider as ready because the readiness probe happens to pass.

Workarounds tried

| Action | Result |
| --- | --- |
| Switch provider OpenAI ↔ Gemini | Same bug, different frequency. |
| Use gemini-embedding-001 instead of 2-preview | Same bug. |
| Reduce outputDimensionality (3072 → default) | Helps slightly (smaller payload) but does not eliminate. |
| gateway restart | No effect; reproduces immediately. |
| Direct curl / native Node fetch from same host | Always succeeds; confirms not a network/upstream issue. |

No workaround at the user-config level reliably eliminates the failures.


Suggested fixes

  1. Add a bounded retry (e.g. 1–2 retries with short backoff) around withRemoteHttpResponse for embedding calls, scoped to undici/TLS connection-reset error classes (UND_ERR_SOCKET, ECONNRESET, EPIPE, Client network socket disconnected before secure TLS connection was established, other side closed). This alone would make the user-visible behavior reliable.
  2. Tune the undici dispatcher for the embedding pool: explicit keepAliveTimeout / keepAliveMaxTimeout lower than the typical Google/OpenAI server-side keep-alive idle (e.g. 4 s), and pipelining: 0. Right now the symptoms are fully consistent with reusing a socket the server has already half-closed.
  3. Surface the error class better in openclaw memory status --deep so users can distinguish "auth misconfig" vs. "transient socket pool failure". Currently both look the same as Embeddings error: fetch failed | other side closed.
  4. Optional: a memorySearch.remote.retry config knob ({enabled: true, maxAttempts: 2, backoffMs: 250}) so users can opt in/out without code changes.
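
For fix 2, the dispatcher options could be derived from the server's idle timeout rather than hard-coded. The sketch below is illustrative only: `keepAliveOptions` is a hypothetical helper, and the 5 s server idle value in the usage comment is an assumption, not a measured figure for either provider. The option names (`keepAliveTimeout`, `keepAliveMaxTimeout`, `pipelining`) are undici's. As far as I understand the undici documentation, `pipelining: 0` disables keep-alive reuse entirely, so it is treated here as the fallback when no safe reuse window exists rather than combined with the timeouts.

```javascript
// Hypothetical helper: derive undici keep-alive options from an assumed
// server-side idle timeout, leaving a safety margin so the client retires
// idle sockets before the server does.
function keepAliveOptions(serverIdleMs, safetyMarginMs = 1000) {
  if (!(serverIdleMs > safetyMarginMs)) {
    // No safe reuse window: pipelining 0 turns off connection reuse.
    return { pipelining: 0 };
  }
  const clientIdleMs = serverIdleMs - safetyMarginMs;
  return {
    keepAliveTimeout: clientIdleMs,    // idle time before the client closes a socket
    keepAliveMaxTimeout: clientIdleMs, // cap on server-provided keep-alive hints
    pipelining: 1,                     // one in-flight request per connection
  };
}

// With the undici package installed, the result would be passed as:
//   const { Agent } = require("undici");
//   const dispatcher = new Agent(keepAliveOptions(5000)); // 4 s client idle
```

Disabling reuse costs a TLS handshake per query, so it is probably best kept as a diagnostic setting: if the failures disappear with `pipelining: 0`, that would point at stale-socket reuse.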

Additional notes

  • This affects memory.backend: builtin. QMD-backed workspaces are not affected because they do not exercise the same per-query fetch path.
  • The bundled embedding-provider.js already uses executeWithApiKeyRotation, but rotation only kicks in for API-key-level failures, not for transient network/socket errors, so it does not help here.
  • Happy to provide more detailed undici-level logs if a debug flag is available — please point me at the right env var (NODE_DEBUG=undici was tried but the gateway buffers its own logger).

TL;DR

openclaw memory search (live single-embed path) fails ~20–40% of the time with fetch failed | other side closed or … socket disconnected before secure TLS connection was established, while the exact same upstream endpoint works 100% via curl and native Node fetch from the same host. Affects all remote embedding providers, gets worse with bigger responses and concurrent load. Looks like a keep-alive / pool-reuse bug in the SSRF-guarded fetch path; a retry layer + dispatcher tuning should fix the user-visible symptom.

Metadata

Labels

  • P2: Normal backlog priority with limited blast radius.
  • clawsweeper:fix-shape-clear: ClawSweeper found a clear likely implementation shape for this issue.
  • clawsweeper:queueable-fix: ClawSweeper marked this issue as an existing queue_fix_pr work candidate.
  • clawsweeper:source-repro: ClawSweeper found a high-confidence source-level issue reproduction.
  • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🦞 diamond lobster: Very strong issue quality with high-confidence source-level or clear reproduction.
