[Feature]: bundled openai-compatible embedding provider for self-hosted servers (llama.cpp, Ollama, vLLM, TGI, LocalAI) #80476

@yaanfpv

Summary

Add a bundled memory embedding provider adapter named openai-compatible that targets any local OpenAI-compatible HTTP embedding server (llama.cpp's llama-server, Ollama via its /v1 surface, vLLM, TGI, LocalAI, llamafile, or any reverse-proxied internal instance), without any vendor-specific warmup probe and without inheriting from any global models.providers.* config.

Problem to solve

Operators running a self-hosted OpenAI-compatible embeddings server today have two unsatisfying choices, both of which produce real operator pain.

  1. Point the bundled lmstudio adapter at the local server. The /v1/embeddings call works fine, but the adapter's ensureLmstudioModelLoaded warmup calls an LMStudio-only "load model" endpoint that hangs against generic servers. On my machine running llama.cpp's llama-server with BGE-M3 on localhost:8081, this hang blocks the gateway event loop for ~30 seconds per memory-lancedb embedding-provider rebuild. The gateway's own liveness diagnostic reports it as event_loop_delay = 29,091 ms, and queued sessions.list / config.get / cron.list responses balloon to 40-60 second response times during the freeze. The gateway log floods with lmstudio embeddings warmup failed; continuing without preload warnings with no operator-friendly indication that the actual cause is a vendor-specific preload endpoint mismatched against a perfectly good local server.

  2. Point the bundled openai adapter at the local server. This works (the per-plugin embedding.baseUrl overrides the global models.providers.openai.baseUrl, and the openai adapter has no warmup), but it inherits the global openai config block's headers, attribution, and api-key resolution. If the per-plugin embedding.baseUrl line ever gets removed by mistake during a config edit, embedding requests silently fall back to api.openai.com, leaking embedded text to a cloud provider the operator may not have intended for memory.
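
The difference between the two failure modes above and a fail-fast adapter comes down to how the base URL is resolved. The sketch below is illustrative only: `EmbeddingConfig`, `resolveWithFallback`, and `resolveStrict` are hypothetical names, not openclaw's actual resolver, and the fallback chain is an assumption based on the behavior described in item 2.

```typescript
// Illustrative only: hypothetical types and functions, not openclaw's real resolver.
interface EmbeddingConfig {
  baseUrl?: string;
  model?: string;
}

const OPENAI_DEFAULT_BASE_URL = "https://api.openai.com/v1";

// Fallback-chain style (assumed shape of the openai path): if the per-plugin
// baseUrl is missing, resolution silently lands on the cloud endpoint.
function resolveWithFallback(plugin: EmbeddingConfig, globalBaseUrl?: string): string {
  return plugin.baseUrl ?? globalBaseUrl ?? OPENAI_DEFAULT_BASE_URL;
}

// Strict style (what this issue proposes): a missing per-plugin baseUrl is a
// configuration error, and no global or default value is consulted.
function resolveStrict(plugin: EmbeddingConfig): string {
  if (!plugin.baseUrl) {
    throw new Error('embedding.baseUrl is required for provider "openai-compatible"');
  }
  return plugin.baseUrl;
}
```

With a config whose `baseUrl` line was deleted by mistake, `resolveWithFallback` returns the api.openai.com URL without complaint, while `resolveStrict` throws at construction time, before any text leaves the machine.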

Neither option's name describes what it actually does. Operators searching for "how do I use my local embedding server with openclaw" end up confused, sometimes filing follow-up issues like #72875 (Unknown memory embedding provider: local) on the assumption that the existing local adapter is what they want. In fact, the existing local adapter is for in-process node-llama-cpp on a .gguf file, not for HTTP-based local servers.

Proposed solution

Add a new bundled extension extensions/openai-compatible-embeddings/ that registers an openai-compatible memory embedding provider adapter.

Design:

  • Provider id: openai-compatible. Matches the term llama.cpp, Ollama, vLLM, TGI, LocalAI, and llamafile all use to describe their HTTP API.
  • transport: "remote". Routed through the same SSRF + remote-fetch path as the cloud adapters.
  • No autoSelectPriority. Operator must opt in explicitly via embedding.provider: "openai-compatible". We do not want auto-selection, because every operator with another adapter's credentials configured would otherwise route embeddings to the cloud the moment they enabled memory-lancedb.
  • No authProviderId. There is no centralized auth flow for arbitrary local servers; the optional apiKey lives directly in the per-plugin embedding config block.
  • No warmup, preload, or model-load probe. The first /v1/embeddings call loads the model lazily, which every server in this family already does.
  • Reads only from the per-plugin embedding config block. Does not consult any global models.providers.* block. Cannot accidentally route to a vendor cloud.
  • Fails fast with a clear error message when embedding.baseUrl or embedding.model is missing.
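
A minimal sketch of what the design points above could look like in code. Everything here is an assumption for illustration: `createOpenAICompatibleEmbedder`, `FetchLike`, and the config interface are hypothetical names, openclaw's real adapter interface and SSRF-guarded fetch path may differ, and treating `dimensions` as an expected-length check is my reading of that field rather than confirmed behavior. The request and response shapes follow the OpenAI /v1/embeddings format.

```typescript
// Hypothetical shapes: openclaw's real adapter interface may differ.
interface OpenAICompatibleEmbeddingConfig {
  baseUrl?: string;
  model?: string;
  apiKey?: string;
  dimensions?: number;
}

// Injected fetch, so the host can route calls through its own remote-fetch path.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

interface EmbeddingsResponse {
  data: { index: number; embedding: number[] }[];
}

function createOpenAICompatibleEmbedder(config: OpenAICompatibleEmbeddingConfig, fetchImpl: FetchLike) {
  // Fail fast: only the per-plugin block is read, and there is no cloud default.
  if (!config.baseUrl) throw new Error("openai-compatible: embedding.baseUrl is required");
  if (!config.model) throw new Error("openai-compatible: embedding.model is required");

  const url = `${config.baseUrl.replace(/\/+$/, "")}/embeddings`;
  const model = config.model;
  const expectedDims = config.dimensions;
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (config.apiKey) headers["Authorization"] = `Bearer ${config.apiKey}`;

  // No warmup or preload: construction performs no I/O.
  return {
    async embedBatch(texts: string[]): Promise<number[][]> {
      const res = await fetchImpl(url, {
        method: "POST",
        headers,
        body: JSON.stringify({ model, input: texts }),
      });
      if (!res.ok) {
        throw new Error(`openai-compatible: embeddings request failed with HTTP ${res.status}`);
      }
      const payload = (await res.json()) as EmbeddingsResponse;
      // Each vector carries the index of its input; sort so output order matches input order.
      const vectors = [...payload.data].sort((a, b) => a.index - b.index).map((d) => d.embedding);
      for (const v of vectors) {
        if (expectedDims !== undefined && v.length !== expectedDims) {
          throw new Error(`openai-compatible: expected ${expectedDims} dimensions, got ${v.length}`);
        }
      }
      return vectors;
    },
  };
}
```

In a plain Node 18+ script, the global `fetch` could be passed as `fetchImpl`; inside the gateway, the SSRF-checked fetch would take its place. Because the factory does no network I/O, a misconfigured or slow server cannot stall provider construction.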

Config:

{
  plugins: {
    entries: {
      "memory-lancedb": {
        enabled: true,
        config: {
          embedding: {
            provider: "openai-compatible",
            baseUrl: "http://localhost:8081/v1",
            model: "text-embedding-bge-m3",
            apiKey: "${LLAMA_API_TOKEN}",
            dimensions: 1024,
          },
        },
      },
    },
  },
}

Distinct from the existing in-process local adapter (extensions/memory-core/src/memory/provider-adapters.ts), which loads a .gguf file via node-llama-cpp inside the gateway process. See "Alternatives considered" below for the full breakdown of why the two are complementary rather than redundant.

Alternatives considered

I considered four other approaches.

  1. Use the existing local adapter (in-process node-llama-cpp). The natural first question. The existing local adapter loads a .gguf file directly into the gateway Node process via node-llama-cpp; my proposed adapter talks HTTP to a separately-running server. They are not interchangeable.

    | | Existing local | Proposed openai-compatible |
    |---|---|---|
    | Where the model lives | inside the gateway process | separate HTTP server |
    | Wire | in-process Node bindings | HTTP /v1/embeddings |
    | Reload model | gateway restart | server restart only |
    | Share with other clients | no, gateway owns the model | yes, any HTTP client |
    | GPU tuning surface | node-llama-cpp options | the server's own CLI flags (e.g. llama-server -ngl ...) |
    | Works with Ollama / vLLM / TGI / LocalAI / llamafile | no (not Node libs) | yes (they all speak OpenAI /v1) |
    | Operator's existing tuned setup | must be ported to node-llama-cpp options | unchanged |

    Operators running a separately-managed embedding server (which is the common shape on Apple Silicon, on machines with a dedicated GPU, or on shared infrastructure) cannot use the existing local adapter without abandoning their existing tuned setup. And operators on Ollama / vLLM / TGI / LocalAI / llamafile cannot use it at all because those projects are not Node libraries. Both adapters stay supported; they target different deployment shapes.

  2. Fix the lmstudio adapter so its warmup gracefully no-ops against non-LMStudio servers. Doable, but the operator is still using provider: "lmstudio" against an Ollama or llama.cpp server, which is semantically misleading and easy to mis-document. The fix lands the same wire behavior under a wrong name.

  3. Document the existing workaround more prominently (set provider: "openai" plus embedding.baseUrl). The docs already mention this today. The trap is silent: if the per-plugin baseUrl is removed during a config edit, traffic goes to api.openai.com without any warning. A safer adapter that fails fast on a missing baseUrl is preferable to documentation that depends on operator vigilance.

  4. Run a small reverse-proxy in front of the local server that stubs the LMStudio-specific endpoints and forwards /v1/embeddings. Adds infra to a memory plugin, doesn't generalize across deployments, and still leaves the misleading provider: "lmstudio" in operator config.

The proposed bundled adapter is the simplest path that solves all the failure modes above: explicit name that matches what the upstream projects call themselves, no warmup, no global config inheritance, and complementary to the existing in-process local adapter without redundancy.

Impact

  • Affected: any operator running a self-hosted OpenAI-compatible embeddings server for memory-lancedb. The local-embeddings server ecosystem includes llama.cpp's llama-server, Ollama (via its /v1 surface), vLLM, TGI, LocalAI, and llamafile, all of which are popular alternatives to cloud embeddings for privacy, cost, or offline reasons. The user base overlaps heavily with operators of self-hosted openclaw stacks.
  • Severity: high for operators in the trap. The lmstudio-warmup-against-non-LMStudio path actively stalls the gateway for ~30 seconds per memory-lancedb embedding-provider rebuild, which fires roughly every 24-30 minutes of channel activity. The dashboard goes unresponsive during the freeze, queued WebSocket calls back up, and operators spend hours diagnosing what is actually a missing-adapter UX gap. The openai-with-baseUrl-override path is medium severity: works correctly until a config edit accidentally removes the override.
  • Frequency: every memory-lancedb embedding-provider rebuild for affected operators. On my machine (single openclaw instance, two channels, normal usage), the rebuild fires several times per hour.
  • Consequence: silent gateway stalls (lmstudio path), or silent leak of embedded chat content to a cloud provider (openai path). Both are operator-trust-eroding outcomes. The proposed adapter eliminates both.

Evidence / examples

Live evidence from my machine running this exact setup (llama.cpp llama-server serving bge-m3-Q8_0.gguf on http://localhost:8081/v1).

Before (with provider: "lmstudio" and the same baseUrl), gateway log during a single memory-lancedb embedding-provider rebuild:

2026-05-11T05:05:50  ⇄ res ✓ sessions.list 22416ms
2026-05-11T05:05:50  ⇄ res ✓ config.get   61301ms
2026-05-11T05:05:50  ⇄ res ✓ config.get   59181ms
WARN  liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu  eventLoopDelayMaxMs=29091.7
WARN  lmstudio embeddings warmup failed; continuing without preload

In one 7-minute window, when the issue compounded with another stall, 988 channels/imessage events queued behind the warmup hang.

After (with my prototype openai-compatible adapter), live invocation against the same llama.cpp server:

[proof] target  : http://localhost:8081/v1
[proof] model   : text-embedding-bge-m3
[proof] factory : 1ms (no warmup, just client construction)
[proof] embed   : 124ms, dims=1024
[proof] batch   : 25ms, count=4, dims=1024
[proof] OK. openai-compatible embeddings adapter wired end-to-end against llama.cpp.

The factory: 1ms line is the key evidence. The lmstudio adapter takes up to 120s on the same input.

Prior art:

  • The existing ollama adapter (extensions/ollama/src/memory-embedding-adapter.ts) follows the same general shape: vendor-specific id, no warmup, self-contained config. It was added to fix the operator pain previously raised in #66163 (CLI memory commands crash with 'Unknown memory embedding provider: ollama').
  • The proposed openai-compatible adapter is the same pattern, generalized for the broader local-server ecosystem rather than scoped to one vendor's native API.

Additional information

Backward compatible. Pure addition. Existing adapters (openai, lmstudio, mistral, gemini, voyage, bedrock, deepinfra, ollama, in-process local) all keep working unchanged. Operators currently working around the gap by misusing lmstudio or openai can switch to openai-compatible when convenient; their existing config keeps working in the meantime.

Accompanying PR drafted on branch feat/openai-compatible-embeddings-provider (will link the actual PR number once filed).

Metadata

Labels

  • P2 - Normal backlog priority with limited blast radius.
  • clawsweeper:needs-product-decision - ClawSweeper marked this issue as needing a product or behavior decision.
  • clawsweeper:needs-security-review - ClawSweeper marked this issue as needing security-sensitive review.
  • clawsweeper:no-new-fix-pr - ClawSweeper does not recommend queueing a new automated fix PR for this issue.
  • impact:auth-provider - Auth, provider routing, model choice, or SecretRef resolution may break.
  • impact:crash-loop - Crash, hang, restart loop, or process-level availability failure.
  • impact:security - Security boundary, credential, authz, sandbox, or sensitive-data risk.
  • issue-rating: 🌊 off-meta tidepool - Issue quality rating does not apply to this item.
