Bug: Memory search transient fetch failed | other side closed / Client network socket disconnected before secure TLS connection was established for live embedding queries (provider-agnostic)
Summary
Live memory search queries fail intermittently with one of two transient TLS/socket errors on both remote embedding providers tested (OpenAI: ~20–40% of calls; Gemini: more often). The same endpoints work reliably via curl and via a plain Node.js fetch() from the same host, so the upstream APIs appear healthy. The failure appears to originate inside OpenClaw's internal SSRF-guarded fetch path.
Bulk reindex via the batch endpoint is not affected. Only the per-query single-embed path used by openclaw memory search (and presumably the in-conversation memory recall path) shows the issue.
This makes semantic memory recall unreliable in interactive sessions even though openclaw memory status reports Embeddings: ready.
Environment
| Item | Value |
| --- | --- |
| OpenClaw version | 2026.4.24 (cbcfdf6) |
| Node.js | v24.14.1 |
| OS | Ubuntu 24.04 LTS, kernel 6.8.0-110-generic (x86_64) |
| Network | direct outbound, no proxy, IPv4+IPv6 both working |
| memory.backend | builtin |
| sqlite-vec | enabled, vec0.so loaded, Vector dims: 3072, FTS: ready |
Reproduced on two different remote embedding providers configured via openclaw config set:
- openai / text-embedding-3-large (3072-dim, ~30 KB response)
- gemini / gemini-embedding-2-preview (3072-dim, ~60 KB response)
Both fail with the same intermittent socket error. The Gemini case fails more often, consistent with a payload-size correlation, but OpenAI also fails repeatably.
Repro
1. Configure a remote embedding provider
openclaw config set memory.backend builtin
openclaw config set agents.defaults.memorySearch.provider openai
openclaw config set agents.defaults.memorySearch.model text-embedding-3-large
openclaw config set models.providers.openai \
'{"baseUrl":"https://api.openai.com/v1","apiKey":"sk-...","models":[]}' --strict-json
openclaw gateway restart
2. Reindex (works fine, uses batch endpoint)
openclaw memory index --force --agent main
# → Memory index updated (main).
openclaw memory status --deep --agent main then reports:
Provider: openai (requested: openai)
Model: text-embedding-3-large
Vector: ready
Vector dims: 3072
FTS: ready
Embeddings: ready
3. Run live queries (fails ~20–40% of the time)
for i in 1 2 3 4 5 6 7 8 9 10; do
  result=$(openclaw memory search "pool stress test query $i" --agent main 2>&1 | tail -3)
  if echo "$result" | grep -qE "fetch failed|other side closed|socket disconnected"; then
    echo "Q$i: FAIL"
  else
    echo "Q$i: OK"
  fi
done
Observed output (idle gateway):
Q1: OK
Q2: OK
Q3: OK
Q4: OK
Q5: FAIL
Q6: FAIL
Q7: OK
Q8: OK
Q9: OK
Q10: OK
→ OK: 8 / FAIL: 2
Under concurrent load (background reindex of other agents running):
4. Two distinct error messages observed
From the gateway log (/tmp/openclaw/openclaw-<date>.log):
ERROR Memory search failed: fetch failed | other side closed
ERROR Memory search failed: fetch failed | Client network socket disconnected before secure TLS connection was established
Both originate from dist/subsystem-CWI_MDy_.js:161 (search subsystem) wrapping a lower-level error from dist/engine-embeddings-DVkdyn0v.js → withRemoteHttpResponse → fetchWithSsrFGuard → undici dispatcher.
The two strings correspond to undici error causes:
- other side closed → the server closed an idle keep-alive socket between requests, and the next request reused the dead socket.
- Client network socket disconnected before secure TLS connection was established → the TLS handshake was aborted on a fresh socket (typical for pinned-DNS + Agent reuse with broken keep-alive).
Both are classic symptoms of a misconfigured / overly aggressive HTTP keep-alive pool.
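To make that mapping concrete, here is a minimal sketch that walks the `cause` chain of a failed fetch and labels it. It assumes the errors arrive as a `TypeError: fetch failed` whose `cause` carries the undici or Node.js error. The helper names `flattenCauseChain` and `classifyFetchFailure` are hypothetical and not part of OpenClaw, and the `UND_ERR_SOCKET` code check is an assumption about what the first message normally carries, not something confirmed from the gateway log.

```javascript
// Hypothetical helpers (not OpenClaw APIs): inspect a fetch failure's cause chain.

// Collect every error in the `cause` chain, outermost first.
// `maxDepth` guards against accidental cycles.
function flattenCauseChain(err, maxDepth = 8) {
  const chain = [];
  let current = err;
  while (current && typeof current === 'object' && chain.length < maxDepth) {
    chain.push({ message: String(current.message ?? ''), code: current.code });
    current = current.cause;
  }
  return chain;
}

// Label the failure using the two signatures seen in the gateway log.
function classifyFetchFailure(err) {
  const chain = flattenCauseChain(err);
  const summary = chain.map((e) => e.message).join(' | ');
  let kind = 'other';
  for (const { message, code } of chain) {
    if (code === 'UND_ERR_SOCKET' || /other side closed/i.test(message)) {
      kind = 'stale-keepalive-socket';
      break;
    }
    if (/disconnected before secure TLS connection was established/i.test(message)) {
      kind = 'tls-handshake-aborted';
      break;
    }
  }
  return { kind, summary };
}

// Example: simulate the first logged error without touching the network.
const simulated = new TypeError('fetch failed', {
  cause: Object.assign(new Error('other side closed'), { code: 'UND_ERR_SOCKET' }),
});
console.log(classifyFetchFailure(simulated));
```

A label like this could also feed the status output, so that an authentication failure and a socket-pool failure no longer look identical.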
Why this is not the upstream API
Same host, same network, same time:
# Direct curl to OpenAI: 100% success
curl -sS -o /dev/null -w "HTTP %{http_code} time=%{time_total}\n" -X POST \
https://api.openai.com/v1/embeddings \
-H "Authorization: Bearer sk-..." -H "Content-Type: application/json" \
-d '{"input":"transient pool test","model":"text-embedding-3-large"}'
# → HTTP 200 time=0.477410
# Native Node.js fetch to Gemini: 100% success, full 3072-dim payload returned
node -e "
fetch('https://generativelanguage.googleapis.com/v1beta/models/gemini-embedding-2-preview:embedContent', {
  method: 'POST',
  headers: {'Content-Type':'application/json','x-goog-api-key':'***'},
  body: JSON.stringify({content:{parts:[{text:'test'}]},taskType:'RETRIEVAL_QUERY',outputDimensionality:3072})
}).then(r => r.json())
  .then(j => console.log('OK dims=', (j.embedding?.values || []).length))
  .catch(e => console.error('FAIL:', e.message, e.cause?.message));
"
# → OK dims= 3072
Repeated curl and Node fetch runs against both endpoints from the same machine never reproduce the disconnect. The failure is specific to OpenClaw's internal fetch path.
Why it is not provider-specific
| Provider | Model | Response size | Failure rate observed |
| --- | --- | --- | --- |
| OpenAI | text-embedding-3-large (3072-dim) | ~30 KB | ~20–40% |
| Google | gemini-embedding-2-preview (3072-dim) | ~60 KB | ~80–100% |
Same host, same gateway version, same code path (withRemoteHttpResponse → fetchWithSsrFGuard). Switching provider does not eliminate the bug, only changes its frequency. Larger response bodies / longer-held sockets correlate with higher failure rates, which strongly suggests a connection-pool / keep-alive issue rather than a per-provider authentication or URL bug.
Suspected root cause
Looking at the bundled code paths in 2026.4.24 (cbcfdf6):
- dist/extensions/google/embedding-provider.js and the corresponding OpenAI path both call withRemoteHttpResponse({ url, ssrfPolicy, init }).
- dist/engine-embeddings-DVkdyn0v.js defines withRemoteHttpResponse → fetchWithSsrFGuard.
- dist/fetch-guard-DKbwHPzH.js instantiates per-call undici dispatchers via:
  - createPolicyDispatcherWithoutPinnedDns(...) for direct mode, or
  - createPinnedDispatcher(await resolvePinnedHostnameWithPolicy(...)) for the SSRF-pinned path,
  - backed by createHttp1Agent / createHttp1EnvHttpProxyAgent / createHttp1ProxyAgent from dist/undici-runtime-x3fQiq5e.js, with a global stream timeout from dist/undici-global-dispatcher-KzKcGOUY.js.
The user-visible error patterns (other side closed, Client network socket disconnected before secure TLS connection was established) are the classic undici socket-reuse-on-dead-keepalive failure mode. The per-call dispatcher / pinned-DNS approach appears to either:
- share a connection pool across calls without reliably retiring sockets that the upstream has already half-closed,
- or interact badly with undici keep-alive defaults (keepAliveTimeout, keepAliveMaxTimeout, pipelining) for high-latency TLS endpoints like api.openai.com and generativelanguage.googleapis.com,
- or close/release the dispatcher (release(dispatcher) → closeDispatcher) in a way that leaves an in-flight socket reusable for the next call.
A single retry on UND_ERR_SOCKET / ECONNRESET / TLS-handshake-aborted errors at the withRemoteHttpResponse layer would mask this for users, but the underlying pool behavior likely deserves a fix.
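The single-retry idea could look roughly like the sketch below. It is not OpenClaw code: `withTransientRetry`, `isTransientSocketError`, and the option names are hypothetical, and the wrapped function stands in for whatever `withRemoteHttpResponse` actually does. It also assumes the operation is safe to repeat, which should hold for a read-only embedding request.

```javascript
// Hypothetical sketch: retry an async operation only on transient socket/TLS errors.

const TRANSIENT_CODES = new Set(['UND_ERR_SOCKET', 'ECONNRESET', 'EPIPE']);
const TRANSIENT_MESSAGES = [
  'other side closed',
  'Client network socket disconnected before secure TLS connection was established',
];

// True if any error in the `cause` chain matches a known transient signature.
function isTransientSocketError(err) {
  for (let e = err, depth = 0; e && typeof e === 'object' && depth < 8; e = e.cause, depth++) {
    if (TRANSIENT_CODES.has(e.code)) return true;
    const message = String(e.message ?? '');
    if (TRANSIENT_MESSAGES.some((m) => message.includes(m))) return true;
  }
  return false;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run `operation` up to `maxAttempts` times, waiting `backoffMs * attempt` between tries.
// Non-transient errors (e.g. auth failures) are rethrown immediately.
async function withTransientRetry(operation, { maxAttempts = 2, backoffMs = 250 } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      if (attempt >= maxAttempts || !isTransientSocketError(err)) throw err;
      await sleep(backoffMs * attempt);
    }
  }
}

// Example with a fake embed call that fails once with a stale-socket error.
async function demo() {
  let calls = 0;
  const fakeEmbed = async () => {
    calls += 1;
    if (calls === 1) {
      throw new TypeError('fetch failed', {
        cause: Object.assign(new Error('other side closed'), { code: 'UND_ERR_SOCKET' }),
      });
    }
    return { dims: 3072 };
  };
  const result = await withTransientRetry(fakeEmbed, { maxAttempts: 2, backoffMs: 10 });
  return { result, calls };
}

demo().then(({ result, calls }) => console.log('dims=%d after %d calls', result.dims, calls));
```

Keeping the predicate narrow matters here: retrying on every `fetch failed` would also retry genuine misconfiguration and hide it behind extra latency.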
Impact
- Semantic recall is unreliable in interactive sessions despite Embeddings: ready reporting healthy.
- Users see "no matches" results or hard Memory search failed: fetch failed | … errors at a measurable rate (~20–40% in this environment, higher under concurrent load).
- Active-memory plugin recall similarly degrades.
- openclaw doctor does not surface this; memory status reports the provider as ready because the readiness probe happens to pass.
Workarounds tried
| Action | Result |
| --- | --- |
| Switch provider OpenAI ↔ Gemini | Same bug, different frequency. |
| Use gemini-embedding-001 instead of 2-preview | Same bug. |
| Reduce outputDimensionality (3072 → default) | Helps slightly (smaller payload) but does not eliminate. |
| gateway restart | No effect; reproduces immediately. |
| Direct curl / native Node fetch from same host | Always succeeds, confirming this is not a network/upstream issue. |
No workaround at the user-config level reliably eliminates the failures.
Suggested fixes
- Add a bounded retry (e.g. 1–2 retries with short backoff) around withRemoteHttpResponse for embedding calls, scoped to undici/TLS connection-reset error classes (UND_ERR_SOCKET, ECONNRESET, EPIPE, Client network socket disconnected before secure TLS connection was established, other side closed). This alone would make the user-visible behavior reliable.
- Tune the undici dispatcher for the embedding pool: explicit keepAliveTimeout / keepAliveMaxTimeout lower than the typical Google/OpenAI server-side keep-alive idle (e.g. 4 s), and pipelining: 0. Right now the symptoms are fully consistent with reusing a socket the server has already half-closed.
- Surface the error class better in openclaw memory status --deep so users can distinguish "auth misconfig" vs. "transient socket pool failure". Currently both look the same as Embeddings error: fetch failed | other side closed.
- Optional: a memorySearch.remote.retry config knob ({enabled: true, maxAttempts: 2, backoffMs: 250}) so users can opt in/out without code changes.
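For the dispatcher-tuning suggestion, a minimal sketch of two option presets follows. The option names `keepAliveTimeout`, `keepAliveMaxTimeout`, and `pipelining` are undici's; the preset names and the `pickKeepAliveOptions` helper are hypothetical, and the 4 s figure is simply the example value from the list above, not a measured server idle timeout. One caveat worth checking against the undici documentation: as far as I can tell, `pipelining: 0` disables keep-alive altogether, so combined with it the two timeouts would have no effect. The presets therefore treat the two approaches as alternatives.

```javascript
// Hypothetical presets for the embedding dispatcher; values are illustrative only.
const KEEP_ALIVE_PRESETS = {
  // Short-lived reuse: retire idle sockets after 4 s and cap any longer server hint at 4 s.
  shortKeepAlive: { keepAliveTimeout: 4_000, keepAliveMaxTimeout: 4_000, pipelining: 1 },
  // No reuse at all: per the undici docs, pipelining 0 disables keep-alive connections.
  noKeepAlive: { pipelining: 0 },
};

// Return a copy of the chosen preset, failing loudly on unknown names.
function pickKeepAliveOptions(name) {
  if (!Object.prototype.hasOwnProperty.call(KEEP_ALIVE_PRESETS, name)) {
    throw new RangeError(`unknown keep-alive preset: ${name}`);
  }
  return { ...KEEP_ALIVE_PRESETS[name] };
}

console.log(pickKeepAliveOptions('shortKeepAlive'));

// With undici installed, the options would be passed to its Agent constructor, e.g.:
//   const { Agent } = require('undici');
//   const dispatcher = new Agent(pickKeepAliveOptions('noKeepAlive'));
```

The `noKeepAlive` preset trades one extra TLS handshake per query for never reusing a socket, which may be an acceptable cost on the low-volume per-query path.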
Additional notes
- This affects memory.backend: builtin. QMD-backed workspaces are not affected because they do not exercise the same per-query fetch path.
- The bundled embedding-provider.js already uses executeWithApiKeyRotation, but rotation only kicks in for API-key-level failures, not for transient network/socket errors, so it does not help here.
- Happy to provide more detailed undici-level logs if a debug flag is available; please point me at the right env var (NODE_DEBUG=undici was tried but the gateway buffers its own logger).
TL;DR
openclaw memory search (live single-embed path) fails ~20–40% of the time with fetch failed | other side closed or … socket disconnected before secure TLS connection was established, while the exact same upstream endpoint works 100% via curl and native Node fetch from the same host. Affects all remote embedding providers, gets worse with bigger responses and concurrent load. Looks like a keep-alive / pool-reuse bug in the SSRF-guarded fetch path; a retry layer + dispatcher tuning should fix the user-visible symptom.