[Feature]: bundled openai-compatible embedding provider for self-hosted servers (llama.cpp, Ollama, vLLM, TGI, LocalAI)

## Summary

Add a bundled memory embedding provider adapter named `openai-compatible` that targets any local OpenAI-compatible HTTP embedding server (llama.cpp's `llama-server`, Ollama via its `/v1` surface, vLLM, TGI, LocalAI, llamafile, or any reverse-proxied internal instance), without any vendor-specific warmup probe and without inheriting from any global `models.providers.*` config.

## Problem to solve

Operators running a self-hosted OpenAI-compatible embeddings server today have two unsatisfying choices, both of which produce real operator pain.

1. **Point the bundled `lmstudio` adapter at the local server.** The `/v1/embeddings` call works fine, but the adapter's `ensureLmstudioModelLoaded` warmup calls an LM Studio-only "load model" endpoint that hangs against generic servers. On my machine, running llama.cpp's `llama-server` with BGE-M3 on `localhost:8081`, this hang blocks the gateway event loop for ~30 seconds per memory-lancedb embedding-provider rebuild. The gateway's own liveness diagnostic reports it as `event_loop_delay = 29,091 ms`, and queued `sessions.list` / `config.get` / `cron.list` responses balloon to 40-60 seconds during the freeze. The gateway log floods with `lmstudio embeddings warmup failed; continuing without preload` warnings, with no operator-friendly indication that the actual cause is a vendor-specific preload endpoint mismatched against a perfectly good local server.

2. **Point the bundled `openai` adapter at the local server.** This works (the per-plugin `embedding.baseUrl` overrides the global `models.providers.openai.baseUrl`, and the openai adapter has no warmup), but it inherits the global openai config block's headers, attribution, and api-key resolution. If the per-plugin `embedding.baseUrl` line ever gets removed by mistake during a config edit, embedding requests silently fall back to api.openai.com, leaking embedded text to a cloud provider the operator may not have intended for memory.
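To make the fallback hazard in option 2 concrete, here is a minimal sketch of the kind of resolution chain that produces it. This is an illustration under assumptions, not the openai adapter's actual code: the function name `resolveBaseUrlWithFallback` and the constant are hypothetical, and the real precedence logic may differ in detail.

```typescript
// Hypothetical default, standing in for whatever the openai adapter falls back to.
const ASSUMED_OPENAI_DEFAULT_BASE_URL = "https://api.openai.com/v1";

// Illustrative resolution order: per-plugin override, then the global
// models.providers.openai block, then the vendor cloud default.
function resolveBaseUrlWithFallback(
  perPluginBaseUrl: string | undefined,
  globalBaseUrl: string | undefined,
): string {
  return perPluginBaseUrl ?? globalBaseUrl ?? ASSUMED_OPENAI_DEFAULT_BASE_URL;
}

// With the override present, traffic stays local.
console.log(resolveBaseUrlWithFallback("http://localhost:8081/v1", undefined));

// With the override deleted by a config edit, nothing fails; the URL just changes.
console.log(resolveBaseUrlWithFallback(undefined, undefined));
```

The second call returns the cloud URL without raising anything, which is the silent part of the trap: the only signal is where the traffic ends up.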

Neither option does what it says on the tin. Operators searching for "how do I use my local embedding server with openclaw" end up confused, sometimes filing follow-up issues like #72875 (`Unknown memory embedding provider: local`) thinking the existing `local` adapter is what they want, when in fact the existing `local` is for in-process node-llama-cpp on a `.gguf` file and not for HTTP-based local servers.

## Proposed solution

Add a new bundled extension `extensions/openai-compatible-embeddings/` that registers an `openai-compatible` memory embedding provider adapter.

Design:

- Provider id: `openai-compatible`. Matches the term llama.cpp, Ollama, vLLM, TGI, LocalAI, and llamafile all use to describe their HTTP API.
- `transport: "remote"`. Routed through the same SSRF + remote-fetch path as the cloud adapters.
- No `autoSelectPriority`. Operator must opt in explicitly via `embedding.provider: "openai-compatible"`. We do not want auto-selection, because every operator with another adapter's credentials configured would otherwise route embeddings to the cloud the moment they enabled memory-lancedb.
- No `authProviderId`. There is no centralized auth flow for arbitrary local servers; the optional `apiKey` lives directly in the per-plugin `embedding` config block.
- No warmup, preload, or model-load probe. The first /v1/embeddings call loads the model lazily, which every server in this family already does.
- Reads only from the per-plugin `embedding` config block. Does not consult any global `models.providers.*` block. Cannot accidentally route to a vendor cloud.
- Fails fast with a clear error message when `embedding.baseUrl` or `embedding.model` is missing.
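The last three design points (no warmup, per-plugin config only, fail fast) could look roughly like the sketch below. It is a sketch under assumptions rather than the prototype's code: `EmbeddingBlock`, `ResolvedEmbeddingConfig`, and `resolveOpenAiCompatibleConfig` are hypothetical names, and the field names are taken from the config example in this issue.

```typescript
// Hypothetical shape of the per-plugin `embedding` block.
interface EmbeddingBlock {
  provider?: string;
  baseUrl?: string;
  model?: string;
  apiKey?: string;
  dimensions?: number;
}

interface ResolvedEmbeddingConfig {
  baseUrl: string;
  model: string;
  apiKey: string | undefined;
  dimensions: number | undefined;
}

// Validates the per-plugin block only. There is no global models.providers.*
// lookup and no default URL, so a missing baseUrl cannot fall back to a cloud
// endpoint. No network call happens here, so there is nothing to hang on.
function resolveOpenAiCompatibleConfig(embedding: EmbeddingBlock): ResolvedEmbeddingConfig {
  const baseUrl = embedding.baseUrl?.trim();
  const model = embedding.model?.trim();
  if (!baseUrl) {
    throw new Error('openai-compatible: "embedding.baseUrl" is required, e.g. http://localhost:8081/v1');
  }
  if (!model) {
    throw new Error('openai-compatible: "embedding.model" is required');
  }
  return {
    baseUrl: baseUrl.replace(/\/+$/, ""), // drop trailing slashes
    model,
    apiKey: embedding.apiKey?.trim() || undefined, // empty string means "no key"
    dimensions: embedding.dimensions,
  };
}
```

Throwing at construction time means a broken config surfaces when the provider is built, not on the first embedding request.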

Config:

```json5
{
  plugins: {
    entries: {
      "memory-lancedb": {
        enabled: true,
        config: {
          embedding: {
            provider: "openai-compatible",
            baseUrl: "http://localhost:8081/v1",
            model: "text-embedding-bge-m3",
            apiKey: "${LLAMA_API_TOKEN}",
            dimensions: 1024,
          },
        },
      },
    },
  },
}
```
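For reference, the wire call this config implies is an OpenAI-style `POST {baseUrl}/embeddings` with a `model` and `input` body, answered by a `data` array of `{ index, embedding }` items. A minimal sketch of building that request and checking the response follows. The helper names are hypothetical, and treating `dimensions` as a client-side length check (rather than sending it in the request) is an assumption, since not every server in this family may accept that field.

```typescript
interface EmbeddingsTarget {
  baseUrl: string; // e.g. "http://localhost:8081/v1"
  model: string;
  apiKey?: string;
}

interface EmbeddingsHttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds the request without sending it, so it can be inspected or handed to any HTTP client.
function buildEmbeddingsRequest(target: EmbeddingsTarget, inputs: string[]): EmbeddingsHttpRequest {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (target.apiKey) {
    headers["Authorization"] = `Bearer ${target.apiKey}`; // only sent when a key is configured
  }
  return {
    url: `${target.baseUrl.replace(/\/+$/, "")}/embeddings`,
    method: "POST",
    headers,
    body: JSON.stringify({ model: target.model, input: inputs }),
  };
}

// Extracts vectors in input order and verifies count and (optionally) vector length.
function parseEmbeddingsResponse(payload: unknown, expectedCount: number, expectedDims?: number): number[][] {
  const data = (payload as { data?: { index: number; embedding: number[] }[] } | null)?.data ?? [];
  if (data.length !== expectedCount) {
    throw new Error(`expected ${expectedCount} embeddings, got ${data.length}`);
  }
  const rows = [...data].sort((a, b) => a.index - b.index);
  for (const row of rows) {
    if (expectedDims !== undefined && row.embedding.length !== expectedDims) {
      throw new Error(`expected ${expectedDims} dimensions, got ${row.embedding.length}`);
    }
  }
  return rows.map((row) => row.embedding);
}
```

Checking the vector length against the configured `dimensions` would catch a model/index mismatch (for example, a 768-dim model pointed at a 1024-dim table) at the first call instead of at query time.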

Distinct from the existing in-process `local` adapter (`extensions/memory-core/src/memory/provider-adapters.ts`), which loads a `.gguf` file via node-llama-cpp inside the gateway process. See "Alternatives considered" below for the full breakdown of why the two are complementary rather than redundant.

## Alternatives considered

I considered four other approaches.

1. **Use the existing `local` adapter (in-process node-llama-cpp).** The natural first question. The existing `local` adapter loads a `.gguf` file directly into the gateway Node process via node-llama-cpp; my proposed adapter talks HTTP to a separately-running server. They are not interchangeable.

   | | Existing `local` | Proposed `openai-compatible` |
   |---|---|---|
   | Where the model lives | inside the gateway process | separate HTTP server |
   | Wire | in-process Node bindings | HTTP /v1/embeddings |
   | Reload model | gateway restart | server restart only |
   | Share with other clients | no, gateway owns the model | yes, any HTTP client |
   | GPU tuning surface | node-llama-cpp options | the server's own CLI flags (e.g. `llama-server -ngl ...`) |
   | Works with Ollama / vLLM / TGI / LocalAI / llamafile | no (not Node libs) | yes (they all speak OpenAI /v1) |
   | Operator's existing tuned setup | must be ported to node-llama-cpp options | unchanged |

   Operators running a separately-managed embedding server (which is the common shape on Apple Silicon, on machines with a dedicated GPU, or on shared infrastructure) cannot use the existing `local` adapter without abandoning their existing tuned setup. And operators on Ollama / vLLM / TGI / LocalAI / llamafile cannot use it at all because those projects are not Node libraries. Both adapters stay supported; they target different deployment shapes.

2. **Fix the `lmstudio` adapter so its warmup gracefully no-ops against non-LMStudio servers.** Doable, but the operator is still using `provider: "lmstudio"` against an Ollama or llama.cpp server, which is semantically misleading and easy to mis-document. The fix lands the same wire behavior under a wrong name.

3. **Document the existing workaround harder** (set `provider: "openai"` plus `embedding.baseUrl`). Today the docs already mention this. The trap is silent: if the per-plugin `baseUrl` is removed during a config edit, traffic silently goes to api.openai.com. A safer adapter that fails fast on a missing `baseUrl` is preferable to documentation that depends on operator vigilance.

4. **Run a small reverse-proxy in front of the local server** that stubs the LMStudio-specific endpoints and forwards /v1/embeddings. Adds infra to a memory plugin, doesn't generalize across deployments, and still leaves the misleading `provider: "lmstudio"` in operator config.

The proposed bundled adapter is the simplest path that solves all the failure modes above: explicit name that matches what the upstream projects call themselves, no warmup, no global config inheritance, and complementary to the existing in-process `local` adapter without redundancy.

## Impact

- **Affected**: any operator running a self-hosted OpenAI-compatible embeddings server for memory-lancedb. The local-embeddings server ecosystem includes llama.cpp's `llama-server`, Ollama (via its /v1 surface), vLLM, TGI, LocalAI, and llamafile, all of which are popular alternatives to cloud embeddings for privacy, cost, or offline reasons. The user base overlaps heavily with operators of self-hosted openclaw stacks.
- **Severity**: high for operators in the trap. The lmstudio-warmup-against-non-LMStudio path actively stalls the gateway for ~30 seconds per memory-lancedb embedding-provider rebuild, which fires roughly every 24-30 minutes of channel activity. The dashboard goes unresponsive during the freeze, queued WebSocket calls back up, and operators spend hours diagnosing what is actually a missing-adapter UX gap. The openai-with-baseUrl-override path is medium severity: works correctly until a config edit accidentally removes the override.
- **Frequency**: every memory-lancedb embedding-provider rebuild for affected operators. On my machine (single openclaw instance, two channels, normal usage), the rebuild fires several times per hour.
- **Consequence**: silent gateway stalls (lmstudio path), or silent leak of embedded chat content to a cloud provider (openai path). Both are operator-trust-eroding outcomes. The proposed adapter eliminates both.

## Evidence / examples

Live evidence from my machine running this exact setup (llama.cpp `llama-server` serving `bge-m3-Q8_0.gguf` on `http://localhost:8081/v1`).

**Before** (with `provider: "lmstudio"` and the same baseUrl), gateway log during a single memory-lancedb embedding-provider rebuild:

```
2026-05-11T05:05:50  ⇄ res ✓ sessions.list 22416ms
2026-05-11T05:05:50  ⇄ res ✓ config.get   61301ms
2026-05-11T05:05:50  ⇄ res ✓ config.get   59181ms
WARN  liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu  eventLoopDelayMaxMs=29091.7
WARN  lmstudio embeddings warmup failed; continuing without preload
```

In one 7-minute window, when the issue compounded with another stall, 988 `channels/imessage` events queued behind the warmup hang.

**After** (with my prototype `openai-compatible` adapter), live invocation against the same llama.cpp server:

```
[proof] target  : http://localhost:8081/v1
[proof] model   : text-embedding-bge-m3
[proof] factory : 1ms (no warmup, just client construction)
[proof] embed   : 124ms, dims=1024
[proof] batch   : 25ms, count=4, dims=1024
[proof] OK. openai-compatible embeddings adapter wired end-to-end against llama.cpp.
```

The `factory: 1ms` line is the key evidence. The lmstudio adapter takes up to 120s on the same input.

**Prior art:**

- The existing `ollama` adapter (`extensions/ollama/src/memory-embedding-adapter.ts`) follows the same general shape: vendor-specific id, no warmup, self-contained config. It was added to fix the operator pain previously raised in #66163.
- The proposed `openai-compatible` adapter is the same pattern, generalized for the broader local-server ecosystem rather than scoped to one vendor's native API.

## Additional information

Backward compatible. Pure addition. Existing adapters (`openai`, `lmstudio`, `mistral`, `gemini`, `voyage`, `bedrock`, `deepinfra`, `ollama`, in-process `local`) all keep working unchanged. Operators currently working around the gap by misusing `lmstudio` or `openai` can switch to `openai-compatible` when convenient; their existing config keeps working in the meantime.

Accompanying PR drafted on branch `feat/openai-compatible-embeddings-provider` (will link the actual PR number once filed).

**Related:**

- #72875 (open). `provider: "local"` fails with "Unknown memory embedding provider: local". Operators land here after misunderstanding which adapter to use for HTTP-based local servers; the existing `local` adapter is for in-process node-llama-cpp, not HTTP. The new `openai-compatible` adapter gives them the right name.
- #72937 (open PR). Fix for #72875's registration timing. Adjacent.
- #66163 (closed). `Unknown memory embedding provider: ollama`, which led to the bundled ollama adapter. The proposed `openai-compatible` adapter follows the same pattern, generalized.
- #74204 (open). `memory.qmd.update.embedTimeoutMs` too low for local GGUF. Same operator profile (running local embedding server), different timeout surface.
- #74761 (open). Document oMLX (Apple Silicon MLX) as a memorySearch embedding provider. Same family of "add a local-server adapter" requests; oMLX exposes an OpenAI-compatible API and would work through the proposed `openai-compatible` adapter without further plugin code.
- #60994 (closed). Cannot reliably connect to remote Ollama / LM Studio instances via LAN IP. Adjacent operator pain in the same ecosystem.
- #42270 (closed). LM Studio backend regression. Related (lmstudio-adapter brittleness).
