Summary
Add a bundled memory embedding provider adapter named openai-compatible that targets any local OpenAI-compatible HTTP embedding server (llama.cpp's llama-server, Ollama via its /v1 surface, vLLM, TGI, LocalAI, llamafile, or any reverse-proxied internal instance), without any vendor-specific warmup probe and without inheriting from any global models.providers.* config.
Problem to solve
Operators running a self-hosted OpenAI-compatible embeddings server today have two unsatisfying choices, both of which produce real operator pain.
Point the bundled lmstudio adapter at the local server. The /v1/embeddings call works fine, but the adapter's ensureLmstudioModelLoaded warmup calls an LMStudio-only "load model" endpoint that hangs against generic servers. On my machine running llama.cpp's llama-server with BGE-M3 on localhost:8081, this hang blocks the gateway event loop for ~30 seconds per memory-lancedb embedding-provider rebuild. The gateway's own liveness diagnostic reports it as event_loop_delay = 29,091 ms, and queued sessions.list / config.get / cron.list responses balloon to 40-60 second response times during the freeze. The gateway log floods with lmstudio embeddings warmup failed; continuing without preload warnings with no operator-friendly indication that the actual cause is a vendor-specific preload endpoint mismatched against a perfectly good local server.
Point the bundled openai adapter at the local server. This works (the per-plugin embedding.baseUrl overrides the global models.providers.openai.baseUrl, and the openai adapter has no warmup), but it inherits the global openai config block's headers, attribution, and api-key resolution. If the per-plugin embedding.baseUrl line ever gets removed by mistake during a config edit, embedding requests silently fall back to api.openai.com, leaking embedded text to a cloud provider the operator may not have intended for memory.
Neither option does what it says on the tin. Operators searching for "how do I use my local embedding server with openclaw" end up confused, and some file follow-up issues like #72875 (Unknown memory embedding provider: local) on the assumption that the existing local adapter is what they want. It is not: the existing local adapter runs node-llama-cpp in-process on a .gguf file and does not talk to HTTP-based local servers.
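The silent fallback in the second option can be illustrated with a small sketch. The resolution order shown here (per-plugin value, then the global provider block, then the vendor default) is an assumption based on the behavior described above, and resolveBaseUrlWithFallback is a hypothetical name, not a function from the openclaw codebase.

```typescript
// Assumed vendor default that the `openai` adapter ends up using when nothing overrides it.
const OPENAI_DEFAULT_BASE_URL = "https://api.openai.com/v1";

// Hypothetical fallback chain: per-plugin value, then global provider block, then vendor default.
function resolveBaseUrlWithFallback(
  perPluginBaseUrl?: string,
  globalBaseUrl?: string,
): string {
  return perPluginBaseUrl ?? globalBaseUrl ?? OPENAI_DEFAULT_BASE_URL;
}

// With the per-plugin override present, traffic stays local.
const intended = resolveBaseUrlWithFallback("http://localhost:8081/v1");

// After a config edit drops the override, nothing fails; the destination just changes.
const afterBadEdit = resolveBaseUrlWithFallback(undefined);

console.log(intended); // http://localhost:8081/v1
console.log(afterBadEdit); // https://api.openai.com/v1
```

No error is raised at any point in this chain, which is why the misconfiguration is easy to miss.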
Proposed solution
Add a new bundled extension extensions/openai-compatible-embeddings/ that registers an openai-compatible memory embedding provider adapter.
Design:
Provider id: openai-compatible. Matches the term llama.cpp, Ollama, vLLM, TGI, LocalAI, and llamafile all use to describe their HTTP API.
transport: "remote". Routed through the same SSRF + remote-fetch path as the cloud adapters.
No autoSelectPriority. Operator must opt in explicitly via embedding.provider: "openai-compatible". We do not want auto-selection, because every operator with another adapter's credentials configured would otherwise route embeddings to the cloud the moment they enabled memory-lancedb.
No authProviderId. There is no centralized auth flow for arbitrary local servers; the optional apiKey lives directly in the per-plugin embedding config block.
No warmup, preload, or model-load probe. The first /v1/embeddings call loads the model lazily, which every server in this family already does.
Reads only from the per-plugin embedding config block. Does not consult any global models.providers.* block. Cannot accidentally route to a vendor cloud.
Fails fast with a clear error message when embedding.baseUrl or embedding.model is missing.
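The last two design points can be sketched as a small resolver. This is a minimal illustration under stated assumptions, not the adapter's actual code: the EmbeddingConfig shape, the resolveOpenAiCompatibleConfig name, and the error wording are hypothetical. Only the field names provider, baseUrl, model, and apiKey come from the proposal.

```typescript
// Hypothetical shape of the per-plugin `embedding` config block.
interface EmbeddingConfig {
  provider?: string;
  baseUrl?: string;
  model?: string;
  apiKey?: string;
}

interface ResolvedOpenAiCompatibleConfig {
  baseUrl: string;
  model: string;
  apiKey?: string;
}

// Resolve settings from the per-plugin block only. Nothing here reads a global
// models.providers.* block, so there is no cloud default to fall back to.
function resolveOpenAiCompatibleConfig(
  embedding: EmbeddingConfig,
): ResolvedOpenAiCompatibleConfig {
  const baseUrl = embedding.baseUrl?.trim();
  const model = embedding.model?.trim();
  if (!baseUrl) {
    throw new Error(
      'openai-compatible embeddings: embedding.baseUrl is required (for example "http://localhost:8081/v1")',
    );
  }
  if (!model) {
    throw new Error("openai-compatible embeddings: embedding.model is required");
  }
  return {
    // Strip trailing slashes so later path joins do not produce "//embeddings".
    baseUrl: baseUrl.replace(/\/+$/, ""),
    model,
    // apiKey is optional; include it only when the operator set one.
    ...(embedding.apiKey ? { apiKey: embedding.apiKey } : {}),
  };
}

const resolved = resolveOpenAiCompatibleConfig({
  provider: "openai-compatible",
  baseUrl: "http://localhost:8081/v1/",
  model: "text-embedding-bge-m3",
});
console.log(resolved.baseUrl); // http://localhost:8081/v1
```

Throwing at resolution time means a deleted baseUrl line surfaces as a configuration error on the next rebuild instead of as traffic to an unintended host.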
Distinct from the existing in-process local adapter (extensions/memory-core/src/memory/provider-adapters.ts), which loads a .gguf file via node-llama-cpp inside the gateway process. See "Alternatives considered" below for the full breakdown of why the two are complementary rather than redundant.
Alternatives considered
I considered four other approaches.
Use the existing local adapter (in-process node-llama-cpp). The natural first question. The existing local adapter loads a .gguf file directly into the gateway Node process via node-llama-cpp; my proposed adapter talks HTTP to a separately-running server. They are not interchangeable.
| | Existing local | Proposed openai-compatible |
| --- | --- | --- |
| Where the model lives | inside the gateway process | separate HTTP server |
| Wire | in-process Node bindings | HTTP /v1/embeddings |
| Reload model | gateway restart | server restart only |
| Share with other clients | no, gateway owns the model | yes, any HTTP client |
| GPU tuning surface | node-llama-cpp options | the server's own CLI flags (e.g. llama-server -ngl ...) |
| Works with Ollama / vLLM / TGI / LocalAI / llamafile | no (not Node libs) | yes (they all speak OpenAI /v1) |
| Operator's existing tuned setup | must be ported to node-llama-cpp options | unchanged |
Operators running a separately-managed embedding server (which is the common shape on Apple Silicon, on machines with a dedicated GPU, or on shared infrastructure) cannot use the existing local adapter without abandoning their existing tuned setup. And operators on Ollama / vLLM / TGI / LocalAI / llamafile cannot use it at all because those projects are not Node libraries. Both adapters stay supported; they target different deployment shapes.
Fix the lmstudio adapter so its warmup gracefully no-ops against non-LMStudio servers. Doable, but the operator is still using provider: "lmstudio" against an Ollama or llama.cpp server, which is semantically misleading and easy to mis-document. The fix lands the same wire behavior under a wrong name.
Document the existing workaround harder (set provider: "openai" plus embedding.baseUrl). The docs already mention this today. The trap is silent: if the per-plugin baseUrl is removed during a config edit, traffic goes to api.openai.com without any error. A safer adapter that fails fast on a missing baseUrl is preferable to documentation that depends on operator vigilance.
Run a small reverse-proxy in front of the local server that stubs the LMStudio-specific endpoints and forwards /v1/embeddings. Adds infra to a memory plugin, doesn't generalize across deployments, and still leaves the misleading provider: "lmstudio" in operator config.
The proposed bundled adapter is the simplest path that solves all the failure modes above: explicit name that matches what the upstream projects call themselves, no warmup, no global config inheritance, and complementary to the existing in-process local adapter without redundancy.
Impact
Affected: any operator running a self-hosted OpenAI-compatible embeddings server for memory-lancedb. The local-embeddings server ecosystem includes llama.cpp's llama-server, Ollama (via its /v1 surface), vLLM, TGI, LocalAI, and llamafile, all of which are popular alternatives to cloud embeddings for privacy, cost, or offline reasons. The user base overlaps heavily with operators of self-hosted openclaw stacks.
Severity: high for operators in the trap. The lmstudio-warmup-against-non-LMStudio path actively stalls the gateway for ~30 seconds per memory-lancedb embedding-provider rebuild, which fires roughly every 24-30 minutes of channel activity. The dashboard goes unresponsive during the freeze, queued WebSocket calls back up, and operators spend hours diagnosing what is actually a missing-adapter UX gap. The openai-with-baseUrl-override path is medium severity: works correctly until a config edit accidentally removes the override.
Frequency: every memory-lancedb embedding-provider rebuild for affected operators. On my machine (single openclaw instance, two channels, normal usage), the rebuild fires several times per hour.
Consequence: silent gateway stalls (lmstudio path), or silent leak of embedded chat content to a cloud provider (openai path). Both are operator-trust-eroding outcomes. The proposed adapter eliminates both.
Evidence / examples
Live evidence from my machine running this exact setup (llama.cpp llama-server serving bge-m3-Q8_0.gguf on http://localhost:8081/v1).
Before (with provider: "lmstudio" and the same baseUrl), gateway log during a single memory-lancedb embedding-provider rebuild:
2026-05-11T05:05:50 ⇄ res ✓ sessions.list 22416ms
2026-05-11T05:05:50 ⇄ res ✓ config.get 61301ms
2026-05-11T05:05:50 ⇄ res ✓ config.get 59181ms
WARN liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu eventLoopDelayMaxMs=29091.7
WARN lmstudio embeddings warmup failed; continuing without preload
988 channels/imessage events queued behind the warmup hang in a 7-minute window when the issue compounded with another stall.
After (with my prototype openai-compatible adapter), live invocation against the same llama.cpp server:
[proof] target : http://localhost:8081/v1
[proof] model : text-embedding-bge-m3
[proof] factory : 1ms (no warmup, just client construction)
[proof] embed : 124ms, dims=1024
[proof] batch : 25ms, count=4, dims=1024
[proof] OK. openai-compatible embeddings adapter wired end-to-end against llama.cpp.
The factory: 1ms line is the key evidence. The lmstudio adapter takes up to 120s on the same input.
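A factory of this kind can stay cheap because it only captures options and defers all network traffic to the first embed call. The sketch below illustrates that shape under assumptions: createOpenAiCompatibleEmbedder, FetchLike, and the stub server are hypothetical names for this example, and the request and response fields follow the OpenAI embeddings API (POST to the embeddings path with model and input, vectors returned under data). In a real process, Node 18+ global fetch could be passed as fetchImpl.

```typescript
// Minimal subset of the OpenAI embeddings response that this sketch reads.
interface EmbeddingsResponse {
  data: { index: number; embedding: number[] }[];
}

// Injectable fetch so the sketch runs without a live server.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

interface EmbedderOptions {
  baseUrl: string;
  model: string;
  apiKey?: string;
  fetchImpl: FetchLike;
}

// Hypothetical factory: it captures options and returns a function.
// No warmup, preload, or model-load request is sent here.
function createOpenAiCompatibleEmbedder(options: EmbedderOptions) {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (options.apiKey) headers["Authorization"] = `Bearer ${options.apiKey}`;

  return async function embedBatch(texts: string[]): Promise<number[][]> {
    const response = await options.fetchImpl(`${options.baseUrl}/embeddings`, {
      method: "POST",
      headers,
      body: JSON.stringify({ model: options.model, input: texts }),
    });
    if (!response.ok) {
      throw new Error(`embeddings request failed with HTTP ${response.status}`);
    }
    const payload = (await response.json()) as EmbeddingsResponse;
    // Sort by index so the output order matches the input order.
    return [...payload.data]
      .sort((a, b) => a.index - b.index)
      .map((item) => item.embedding);
  };
}

// Stub server for illustration: records each URL and returns items out of order.
const calls: string[] = [];
const stubFetch: FetchLike = async (url, init) => {
  calls.push(url);
  const inputs = (JSON.parse(init.body) as { input: string[] }).input;
  const data = inputs.map((_, i) => ({ index: i, embedding: [i, i + 1] })).reverse();
  return { ok: true, status: 200, json: async () => ({ data }) };
};

const embed = createOpenAiCompatibleEmbedder({
  baseUrl: "http://localhost:8081/v1",
  model: "text-embedding-bge-m3",
  fetchImpl: stubFetch,
});
const callsAfterFactory = calls.length; // 0: constructing the embedder sent nothing

embed(["hello", "world"]).then((vectors) => console.log(vectors.length, calls));
```

With a real server, the first embedBatch call is where any lazy model load would happen, so that cost lands on an awaited request and not on provider construction.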
Prior art: the bundled ollama adapter (extensions/ollama/src/memory-embedding-adapter.ts) follows the same general shape: vendor-specific id, no warmup, self-contained config. It was added to fix the operator pain previously raised in #66163 (CLI memory commands crash with 'Unknown memory embedding provider: ollama'). The proposed openai-compatible adapter is the same pattern, generalized for the broader local-server ecosystem rather than scoped to one vendor's native API.
Additional information
Backward compatible. Pure addition. Existing adapters (openai, lmstudio, mistral, gemini, voyage, bedrock, deepinfra, ollama, in-process local) all keep working unchanged. Operators currently working around the gap by misusing lmstudio or openai can switch to openai-compatible when convenient; their existing config keeps working in the meantime.
Accompanying PR drafted on branch feat/openai-compatible-embeddings-provider (will link the actual PR number once filed).
Related:
#72875 (open): memorySearch provider: "local" fails with "Unknown memory embedding provider: local" but capability embedding path works. Operators land here after misunderstanding which adapter to use for HTTP-based local servers; the existing local adapter is for in-process node-llama-cpp, not HTTP. The new openai-compatible adapter gives them the right name.
Adjacent: an issue about #72875's registration timing.
#66163: CLI memory commands crash with 'Unknown memory embedding provider: ollama', which led to the bundled ollama adapter. The proposed openai-compatible adapter follows the same pattern, generalized.
memory.qmd.update.embedTimeoutMs too low for local GGUF. Same operator profile (running a local embedding server), different timeout surface.
#74761 (open): docs: Document oMLX (Apple Silicon MLX) as memorySearch embedding provider. Same family of "add a local-server adapter" requests; oMLX exposes an OpenAI-compatible API and would work through the proposed openai-compatible adapter without further plugin code.