Bug: memory search live embedding fails ~20–40% with `fetch failed | other side closed` (provider-agnostic; upstream healthy)

# Bug: Memory search transient `fetch failed | other side closed` / `Client network socket disconnected before secure TLS connection was established` for live embedding queries (provider-agnostic)

## Summary

Live memory search queries fail intermittently with one of two transient TLS/socket errors on both remote embedding providers tested (OpenAI, Gemini): roughly 20–40% of calls with OpenAI, and more often with Gemini. The same endpoints work reliably via `curl` and via a plain Node.js `fetch()` from the same host, so the upstream API is healthy. The failure originates inside OpenClaw's internal SSRF-guarded fetch path.

Bulk reindex via the batch endpoint is **not** affected. Only the per-query single-embed path used by `openclaw memory search` (and presumably the in-conversation memory recall path) shows the issue.

This makes semantic memory recall unreliable in interactive sessions even though `openclaw memory status` reports `Embeddings: ready`.

---

## Environment

| Item | Value |
| --- | --- |
| OpenClaw version | `2026.4.24 (cbcfdf6)` |
| Node.js | `v24.14.1` |
| OS | Ubuntu 24.04 LTS, kernel `6.8.0-110-generic` (x86_64) |
| Network | direct outbound, no proxy, IPv4+IPv6 both working |
| `memory.backend` | `builtin` |
| sqlite-vec | enabled, `vec0.so` loaded, `Vector dims: 3072`, `FTS: ready` |

Reproduced on two different remote embedding providers configured via `openclaw config set`:

- `openai` / `text-embedding-3-large` (3072-dim, ~30 KB response)
- `gemini` / `gemini-embedding-2-preview` (3072-dim, ~60 KB response)

Both fail with the same intermittent socket error. The Gemini case fails more often, consistent with a payload-size correlation, but OpenAI also fails repeatably.

---

## Repro

### 1. Configure a remote embedding provider

```bash
openclaw config set memory.backend builtin
openclaw config set agents.defaults.memorySearch.provider openai
openclaw config set agents.defaults.memorySearch.model text-embedding-3-large
openclaw config set models.providers.openai \
  '{"baseUrl":"https://api.openai.com/v1","apiKey":"sk-...","models":[]}' --strict-json
openclaw gateway restart
```

### 2. Reindex (works fine, uses batch endpoint)

```bash
openclaw memory index --force --agent main
# → Memory index updated (main).
```

`openclaw memory status --deep --agent main` then reports:

```
Provider: openai (requested: openai)
Model: text-embedding-3-large
Vector: ready
Vector dims: 3072
FTS: ready
Embeddings: ready
```

### 3. Run live queries (fails ~20–40% of the time)

```bash
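ok=0; fail=0
for i in 1 2 3 4 5 6 7 8 9 10; do
  # Keep only the last lines of output, where the error (if any) is printed
  result=$(openclaw memory search "pool stress test query $i" --agent main 2>&1 | tail -3)
  if echo "$result" | grep -qE "fetch failed|other side closed|socket disconnected"; then
    echo "Q$i: FAIL"; fail=$((fail + 1))
  else
    echo "Q$i: OK"; ok=$((ok + 1))
  fi
done
# Print the tally shown in the observed output
echo "→ OK: $ok / FAIL: $fail"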
```

Observed output (idle gateway):

```
Q1: OK
Q2: OK
Q3: OK
Q4: OK
Q5: FAIL
Q6: FAIL
Q7: OK
Q8: OK
Q9: OK
Q10: OK
→ OK: 8 / FAIL: 2
```

Under concurrent load (background reindex of other agents running):

```
→ OK: 6 / FAIL: 4
```

### 4. Two distinct error messages observed

From the gateway log (`/tmp/openclaw/openclaw-<date>.log`):

```
ERROR Memory search failed: fetch failed | other side closed
ERROR Memory search failed: fetch failed | Client network socket disconnected before secure TLS connection was established
```

Both originate from `dist/subsystem-CWI_MDy_.js:161` (search subsystem) wrapping a lower-level error from `dist/engine-embeddings-DVkdyn0v.js` → `withRemoteHttpResponse` → `fetchWithSsrFGuard` → undici dispatcher.

The two strings correspond to undici error causes:

- `other side closed` → the server closed the keep-alive socket between requests, and the next request reused the dead socket.
- `Client network socket disconnected before secure TLS connection was established` → TLS handshake aborted on a fresh socket (typical for pinned-DNS + `Agent` reuse with broken keep-alive).

Both are classic symptoms of a **misconfigured / overly aggressive HTTP keep-alive pool**.
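To make the two error strings above easier to tell apart from other failures, here is a minimal sketch of a classifier that walks the `cause` chain of a failed `fetch()`. The helper name `classifyFetchError` and the returned shape are hypothetical and not part of OpenClaw; the code list is an assumption based on the error codes named in this report.

```javascript
// Hypothetical helper, not OpenClaw code: classify a fetch failure by walking err.cause.
const TRANSIENT_CODES = new Set(['UND_ERR_SOCKET', 'ECONNRESET', 'EPIPE']);
const TRANSIENT_MESSAGES = [
  'other side closed',
  'Client network socket disconnected before secure TLS connection was established',
];

function classifyFetchError(err) {
  // Node's fetch rejects with TypeError("fetch failed"); the real error sits in `cause`.
  const seen = new Set(); // guards against a cyclic cause chain
  for (let e = err; e && typeof e === 'object' && !seen.has(e); e = e.cause) {
    seen.add(e);
    const message = typeof e.message === 'string' ? e.message : '';
    if (TRANSIENT_CODES.has(e.code) || TRANSIENT_MESSAGES.some((m) => message.includes(m))) {
      return { kind: 'transient-socket', code: e.code ?? null, message };
    }
  }
  return { kind: 'other', code: err?.code ?? null, message: String(err?.message ?? err) };
}

// Simulated failure shaped like the first log line above.
const failure = new TypeError('fetch failed', {
  cause: Object.assign(new Error('other side closed'), { code: 'UND_ERR_SOCKET' }),
});
console.log(classifyFetchError(failure).kind); // transient-socket
```

Matching on both the code and the message is deliberate: the message check still catches the failure when the wrapping layer drops the `code` property.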

---

## Why this is not the upstream API

Same host, same network, same time:

```bash
# Direct curl to OpenAI: 100% success
curl -sS -o /dev/null -w "HTTP %{http_code} time=%{time_total}\n" -X POST \
  https://api.openai.com/v1/embeddings \
  -H "Authorization: Bearer sk-..." -H "Content-Type: application/json" \
  -d '{"input":"transient pool test","model":"text-embedding-3-large"}'
# → HTTP 200 time=0.477410
```

```bash
# Native Node.js fetch to Gemini: 100% success, full 3072-dim payload returned
node -e "
fetch('https://generativelanguage.googleapis.com/v1beta/models/gemini-embedding-2-preview:embedContent', {
  method:'POST',
  headers:{'Content-Type':'application/json','x-goog-api-key':'***'},
  body:JSON.stringify({content:{parts:[{text:'test'}]},taskType:'RETRIEVAL_QUERY',outputDimensionality:3072})
}).then(r=>r.json()).then(j=>console.log('OK dims=',(j.embedding?.values||[]).length))
  .catch(e=>console.error('FAIL:', e.message, e.cause?.message));
"
# → OK dims= 3072
```

Repeated `curl` and Node `fetch` runs against both endpoints from the same machine never reproduce the disconnect. The failure is specific to OpenClaw's internal fetch path.
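The repeated runs can be scripted so that the tally and the underlying `cause` messages are recorded in one place. The sketch below is a hypothetical harness (the name `repeatProbe` and the `OPENAI_API_KEY` environment variable are assumptions); it takes the request as a function so the same loop can drive either endpoint.

```javascript
// Hypothetical harness: run the same request sequentially and tally outcomes by cause.
async function repeatProbe(doRequest, attempts = 20) {
  const tally = { ok: 0, fail: 0, causes: {} };
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await doRequest(i);
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      await res.arrayBuffer(); // drain the body so a mid-stream close is counted too
      tally.ok++;
    } catch (err) {
      tally.fail++;
      // fetch() wraps the socket error; prefer the inner message when present
      const key = err.cause?.message ?? err.message;
      tally.causes[key] = (tally.causes[key] ?? 0) + 1;
    }
  }
  return tally;
}

// Example against the OpenAI endpoint used above (needs network access and a real key):
// repeatProbe(() => fetch('https://api.openai.com/v1/embeddings', {
//   method: 'POST',
//   headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}`, 'Content-Type': 'application/json' },
//   body: JSON.stringify({ input: 'transient pool test', model: 'text-embedding-3-large' }),
// })).then(console.log);
```

Running the requests sequentially in a single process reuses Node's default connection pool, which is the closest plain-`fetch` comparison to the gateway's per-query path.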

---

## Why it is not provider-specific

| Provider | Model | Response size | Failure rate observed |
| --- | --- | --- | --- |
| OpenAI | `text-embedding-3-large` (3072-dim) | ~30 KB | ~20–40% |
| Google | `gemini-embedding-2-preview` (3072-dim) | ~60 KB | ~80–100% |

Same host, same gateway version, same code path (`withRemoteHttpResponse` → `fetchWithSsrFGuard`). Switching provider does not eliminate the bug, only changes its frequency. Larger response bodies / longer-held sockets correlate with higher failure rates, which strongly suggests a connection-pool / keep-alive issue rather than a per-provider authentication or URL bug.

---

## Suspected root cause

Looking at the bundled code paths in `2026.4.24 (cbcfdf6)`:

- `dist/extensions/google/embedding-provider.js` and the corresponding OpenAI path both call `withRemoteHttpResponse({ url, ssrfPolicy, init })`.
- `dist/engine-embeddings-DVkdyn0v.js` defines `withRemoteHttpResponse` → `fetchWithSsrFGuard`.
- `dist/fetch-guard-DKbwHPzH.js` instantiates per-call undici dispatchers via:
  - `createPolicyDispatcherWithoutPinnedDns(...)` for direct mode, or
  - `createPinnedDispatcher(await resolvePinnedHostnameWithPolicy(...))` for the SSRF-pinned path,
- backed by `createHttp1Agent` / `createHttp1EnvHttpProxyAgent` / `createHttp1ProxyAgent` from `dist/undici-runtime-x3fQiq5e.js`, with a global stream timeout from `dist/undici-global-dispatcher-KzKcGOUY.js`.

The user-visible error patterns (`other side closed`, `Client network socket disconnected before secure TLS connection was established`) are the classic undici socket-reuse-on-dead-keepalive failure mode. The per-call dispatcher / pinned-DNS approach appears to either:

1. share a connection pool across calls without reliably retiring sockets that the upstream has already half-closed,
2. or interact badly with undici keep-alive defaults (`keepAliveTimeout`, `keepAliveMaxTimeout`, `pipelining`) for high-latency TLS endpoints like `api.openai.com` and `generativelanguage.googleapis.com`,
3. or close/release the dispatcher (`release(dispatcher)` → `closeDispatcher`) in a way that leaves an in-flight socket reusable for the next call.

A single retry on `UND_ERR_SOCKET` / `ECONNRESET` / TLS-handshake-aborted errors at the `withRemoteHttpResponse` layer would mask this for users, but the underlying pool behavior likely deserves a fix.
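As a rough illustration of that retry, the sketch below wraps an arbitrary async call and retries only when a connection-reset code appears somewhere in the error's `cause` chain. The names `withTransientRetry` and `isRetryable` and the default values are hypothetical; `fn` stands in for whatever `withRemoteHttpResponse` does internally, which is not shown in this report.

```javascript
// Hypothetical wrapper, not OpenClaw code: bounded retry for connection-reset failures.
const RETRYABLE_CODES = new Set(['UND_ERR_SOCKET', 'ECONNRESET', 'EPIPE']);

function isRetryable(err) {
  // Look a few levels down the cause chain, since fetch() wraps the socket error.
  for (let e = err, depth = 0; e && depth < 5; e = e.cause, depth++) {
    if (RETRYABLE_CODES.has(e.code)) return true;
  }
  return false;
}

async function withTransientRetry(fn, { maxAttempts = 2, backoffMs = 250 } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      // Give up on the last attempt or on anything that is not a socket-level failure.
      if (attempt >= maxAttempts || !isRetryable(err)) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffMs * attempt));
    }
  }
}

// Simulated flaky call: fails once with a socket error, then succeeds.
let demoCalls = 0;
withTransientRetry(async () => {
  demoCalls++;
  if (demoCalls === 1) throw Object.assign(new Error('other side closed'), { code: 'UND_ERR_SOCKET' });
  return 'embedding';
}, { backoffMs: 10 }).then((value) => console.log(value, 'after', demoCalls, 'calls'));
```

Authentication and HTTP-status failures carry none of these codes, so they are rethrown immediately rather than retried.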

---

## Impact

- **Semantic recall is unreliable** in interactive sessions despite `Embeddings: ready` reporting healthy.
- Users see "no matches" results or hard `Memory search failed: fetch failed | …` errors at a measurable rate (~20–40% with OpenAI in this environment, higher with Gemini and under concurrent load).
- Active-memory plugin recall similarly degrades.
- `openclaw doctor` does not surface this — `memory status` reports the provider as ready because the readiness probe happens to pass.

---

## Workarounds tried

| Action | Result |
| --- | --- |
| Switch provider OpenAI ↔ Gemini | Same bug, different frequency. |
| Use `gemini-embedding-001` instead of `2-preview` | Same bug. |
| Reduce `outputDimensionality` (3072 → default) | Helps slightly (smaller payload) but does not eliminate. |
| `gateway restart` | No effect; reproduces immediately. |
| Direct `curl` / native Node `fetch` from same host | Always succeeds — confirms not a network/upstream issue. |

No workaround at the user-config level reliably eliminates the failures.

---

## Suggested fixes

1. **Add a bounded retry** (e.g. 1–2 retries with short backoff) around `withRemoteHttpResponse` for embedding calls, scoped to undici/TLS connection-reset error classes (`UND_ERR_SOCKET`, `ECONNRESET`, `EPIPE`, `Client network socket disconnected before secure TLS connection was established`, `other side closed`). This alone would make the user-visible behavior reliable.
2. **Tune the undici dispatcher** for the embedding pool: set explicit `keepAliveTimeout` / `keepAliveMaxTimeout` values below the typical Google/OpenAI server-side keep-alive idle window (e.g. 4 s), or set `pipelining: 0`, which undici documents as disabling keep-alive connections altogether. The symptoms are consistent with reusing a socket the server has already half-closed.
3. **Surface the error class better** in `openclaw memory status --deep` so users can distinguish "auth misconfig" vs. "transient socket pool failure". Currently both look the same as `Embeddings error: fetch failed | other side closed`.
4. **Optional**: a `memorySearch.remote.retry` config knob (`{enabled: true, maxAttempts: 2, backoffMs: 250}`) so users can opt in/out without code changes.
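For the dispatcher tuning in item 2, a minimal sketch of how the options could be derived is shown below. The helper name `embeddingAgentOptions`, the safety margin, and the 5000 ms server idle value are all hypothetical; the real server-side idle timeouts are not known from this report. The option names are undici's documented `Agent`/`Client` options.

```javascript
// Hypothetical helper: derive undici Agent options from an assumed server idle timeout.
function embeddingAgentOptions(serverIdleMs, safetyMarginMs = 1000) {
  // Retire idle sockets on the client before the server is expected to close them.
  const clientIdleMs = Math.max(1, serverIdleMs - safetyMarginMs);
  return {
    keepAliveTimeout: clientIdleMs,    // idle time before undici closes a pooled socket
    keepAliveMaxTimeout: clientIdleMs, // cap applied to server-sent keep-alive hints
    pipelining: 1,                     // one in-flight request per socket; 0 disables keep-alive
  };
}

// Usage, assuming the `undici` package is installed:
//   const { Agent, fetch } = require('undici');
//   const dispatcher = new Agent(embeddingAgentOptions(5000));
//   const res = await fetch(url, { dispatcher });
console.log(embeddingAgentOptions(5000));
```

Keeping the client-side idle window shorter than the server's narrows the race in which a request is written to a socket the server is already closing, though it cannot remove that race entirely, which is why the retry in item 1 is still worth having.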

---

## Additional notes

- This affects `memory.backend: builtin`. QMD-backed workspaces are not affected because they do not exercise the same per-query fetch path.
- The bundled `embedding-provider.js` already uses `executeWithApiKeyRotation`, but rotation only kicks in for API-key-level failures, not for transient network/socket errors, so it does not help here.
- Happy to provide more detailed undici-level logs if a debug flag is available — please point me at the right env var (`NODE_DEBUG=undici` was tried but the gateway buffers its own logger).

---

## TL;DR

`openclaw memory search` (live single-embed path) fails ~20–40% of the time with `fetch failed | other side closed` or `… socket disconnected before secure TLS connection was established`, while the exact same upstream endpoint works 100% via `curl` and native Node `fetch` from the same host. Affects both remote embedding providers tested (Gemini more often than OpenAI), and gets worse with bigger responses and concurrent load. Looks like a keep-alive / pool-reuse bug in the SSRF-guarded fetch path; a retry layer + dispatcher tuning should fix the user-visible symptom.


Bug: memory search live embedding fails ~20–40% with `fetch failed | other side closed` (provider-agnostic; upstream healthy) #71784
