feat(memory/embeddings): add openai-compatible provider for self-hosted servers (llama.cpp, Ollama, vLLM, TGI, LocalAI) #80479
Codex review: needs maintainer review before merge.

Workflow note: Future ClawSweeper reviews update this same comment in place.

Summary

Reproducibility: not applicable, as this is a feature PR rather than a current-main bug report. The PR body supplies credible after-change terminal proof against llama.cpp, and current main lacks an …

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Review details

Best possible solution: If maintainers want this as an official bundled provider, land the narrow no-warmup adapter with the labeler/docs/package metadata and require normal dependency/CI proof; otherwise ask for the same plugin to be published through ClawHub without adding core inventory.

Do we have a high-confidence way to reproduce the issue? Not applicable, as this is a feature PR rather than a current-main bug report. The PR body supplies credible after-change terminal proof against llama.cpp, and current main lacks an …

Is this the best way to solve the issue? Unclear as a product decision. The code uses the existing memory embedding provider seam cleanly, but the best final path depends on whether maintainers want this bundled or published through ClawHub.

Codex review notes: model gpt-5.5, reasoning high; reviewed against cbf72e5e26ee.
ab414a8 to 1fd293a
…al OpenAI-compatible HTTP server

What an operator hits today:

Operators running a self-hosted OpenAI-compatible embeddings server (llama.cpp's `llama-server`, Ollama via its `/v1` surface, vLLM, TGI, LocalAI, llamafile, or any reverse-proxied internal instance) have two inconvenient choices:

1. Point the bundled `lmstudio` adapter at it. Works for the `/v1/embeddings` call, but the adapter's `ensureLmstudioModelLoaded` warmup calls an LMStudio-only "load model" endpoint that hangs against generic servers. The hang blocks the gateway event loop for ~30s per memory-lancedb embedding-provider rebuild.
2. Point the bundled `openai` adapter at it. Works (per-plugin baseUrl overrides global), but the adapter inherits global OpenAI headers, attribution, and api-key resolution; if `embedding.baseUrl` ever gets removed the requests fall back to api.openai.com, leaking embedded text to the cloud.

What changes:

Adds a new bundled extension `extensions/openai-compatible-embeddings/` that registers an `openai-compatible` memory embedding provider. The adapter:

- Has no warmup / preload / model-load probe. The first /v1/embeddings call loads the model lazily, which every server in this family already does.
- Reads only from the per-plugin `embedding` config block. Does not consult any global `models.providers.*` block. Cannot accidentally route to a vendor cloud.
- Fails-fast with a clear error message when `embedding.baseUrl` or `embedding.model` is missing.
- Does not auto-select. Operators must opt in explicitly with `embedding.provider: "openai-compatible"`.

Naming: `openai-compatible` is the term llama.cpp, Ollama, vLLM, TGI, LocalAI, and llamafile all use to describe their HTTP API. Distinct from the existing `local` adapter (extensions/memory-core/src/memory/provider-adapters.ts), which is `transport: "local"` for in-process node-llama-cpp on a `.gguf` file. Both stay supported; they target different deployment shapes.

Tests: `extensions/openai-compatible-embeddings/memory-embedding-adapter.test.ts` covers the no-auto-select / no-auth-dependency posture, the no-warmup invariant during create, the per-plugin baseUrl/model in cache key, and the Authorization-header strip from the cache key. 4/4 pass.

Docs: `docs/plugins/memory-lancedb.md` updated with the new provider example, the safety note about why `openai-compatible` is preferred over `openai` when the operator also has cloud providers configured for chat models, and the disambiguation note about `local` vs `openai-compatible`.
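The cache-key behavior the tests describe (per-plugin baseUrl and model in the key, Authorization stripped from it) could look roughly like the sketch below. This is an illustration only: `EmbeddingCacheInput` and `embeddingCacheKey` are hypothetical names, and the real adapter may build its key differently.

```typescript
// Hypothetical input shape for illustration; not the adapter's actual type.
interface EmbeddingCacheInput {
  baseUrl: string;
  model: string;
  headers?: Record<string, string>;
}

// Build a stable cache key that includes baseUrl and model but never the
// Authorization header, so bearer tokens stay out of the key.
function embeddingCacheKey(input: EmbeddingCacheInput): string {
  const safeHeaders = Object.entries(input.headers ?? {})
    // Header names are case-insensitive, so compare in lower case.
    .filter(([name]) => name.toLowerCase() !== "authorization")
    .map(([name, value]): [string, string] => [name.toLowerCase(), value])
    // Sort so that header insertion order does not change the key.
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));

  return JSON.stringify({
    provider: "openai-compatible",
    baseUrl: input.baseUrl,
    model: input.model,
    headers: safeHeaders,
  });
}
```

With a key of this shape, two configs that differ only in their bearer token share a cache entry, while two different local servers (different `baseUrl` or `model`) do not.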
1fd293a to 29037e9
Dependency Changes Detected

This PR changes dependency-related files. Maintainers should confirm these changes are intentional.

Changed files:
Maintainer follow-up:
Rebased onto latest …

P2 fix (option b, strictly-scoped): the changelog now lists only … The adapter's internal …

On the 2 CI failures: both …

Scope question: clawsweeper flagged whether this belongs bundled vs published on ClawHub. The reasoning for bundling: every other open-weight serving stack we already bundle ( …
Addressed the ClawSweeper P2 and P3 from the latest review.

P2 ( …

P3 ( …

Diff is two files, four insertions, one deletion. Branch is now at …
Closing in favor of #84930, thank you! You will be credited
Summary
- Operators running a self-hosted OpenAI-compatible embeddings server (`llama-server`, Ollama via its `/v1` surface, vLLM, TGI, LocalAI, llamafile, or any reverse-proxied internal instance) have no clean adapter for it. Pointing the bundled `lmstudio` adapter at it triggers an LMStudio-only "load model" warmup that hangs against generic servers and stalls the gateway event loop for ~30 seconds per memory-lancedb embedding-provider rebuild. Pointing the bundled `openai` adapter at it works, but it inherits global OpenAI headers/attribution/api-key resolution, and a removed `embedding.baseUrl` line silently falls back to api.openai.com, which leaks embedded text to the cloud.
- … `sessions.list` backlogs and a flooded gateway log. Operators spend hours diagnosing what is actually a UX gap: the bundled adapters do not include a generic local-server option, and the existing in-process `local` adapter (node-llama-cpp on a `.gguf` file) does not cover operators who run their embeddings server as a separate HTTP process.
- Adds a new bundled extension `extensions/openai-compatible-embeddings/` that registers an `openai-compatible` memory embedding provider. The adapter has no warmup, no global config inheritance, fails fast on missing `embedding.baseUrl` / `embedding.model`, and does not auto-select (the operator must explicitly opt in with `embedding.provider: "openai-compatible"`).
- … `lmstudio`, `openai`, `mistral`, `gemini`, `voyage`, `bedrock`, `deepinfra`, `ollama`, in-process `local` adapters all behave byte-identically. The Plugin SDK surface is unchanged; the new adapter consumes the same public exports the other bundled adapters do. No protocol change, no schema change, no migration, no telemetry. The existing in-process `local` adapter stays as-is for operators who load `.gguf` files in-process via node-llama-cpp; the two adapters are complementary, not redundant.

Change Type
Scope
Linked Issue/PR
- … memorySearch provider: "local" fails with "Unknown memory embedding provider: local" but capability embedding path works #72875 (operator confusion: `provider: "local"` actually means in-process node-llama-cpp, not HTTP)
- … memorySearch provider: "local" fails with "Unknown memory embedding provider: local" but capability embedding path works #72875's registration timing for the in-process `local` adapter)
- … Unknown memory embedding provider: ollama, the precedent that led to the bundled ollama adapter; this PR follows the same pattern, generalized)
- … `memory.qmd.update.embedTimeoutMs` too low for local GGUF; same operator profile)

Real behavior proof
Behavior or issue addressed: an operator running `llama-server` (llama.cpp) with the BGE-M3 embedding model on `http://localhost:8081/v1` had memory-lancedb captures triggering ~30-second event-loop stalls every time the embedding provider rebuilt, because the `lmstudio` adapter's `ensureLmstudioModelLoaded` warmup hangs against llama.cpp's OpenAI-compatible server (which does not expose LMStudio's load-model endpoint). The new `openai-compatible` adapter routes through the same generic `createRemoteEmbeddingProvider` factory the other adapters use, just without the warmup phase. Embeddings work end-to-end on the first call, no preload required.

Real environment tested: macOS 26.4.1 on Apple Silicon (arm64). `llama-server` from llama.cpp serving `bge-m3-Q8_0.gguf` (605 MB, 1024 dimensions) on `http://127.0.0.1:8081`, with `--ngl 24 -c 32768 -np 4 -b 512 -ub 512 --mmap --mlock --cont-batching --api-key <set>`. Live `~/.openclaw/` with memory-lancedb enabled.

Exact steps or command run after this patch: ran `pnpm test extensions/openai-compatible-embeddings` to validate the adapter posture and the no-warmup invariant. Then invoked the new factory directly through `node --import tsx` against the live llama-server, capturing the round-trip latency for both `embedQuery` and `embedBatch`. Independently verified the same llama-server endpoint with `curl -H "Authorization: Bearer ..." http://localhost:8081/v1/embeddings` returns 1024-dim vectors with the same model name.

Evidence after fix:

Live invocation of the new adapter from a small Node script (`node --import tsx`):

Notice the factory took 1 ms (the lmstudio adapter would have taken up to 120 s here against the same server), and the actual embedding round-trip is 124 ms with the expected 1024-dim BGE-M3 output.

Independent confirmation of the same endpoint via curl, showing the local server answers OpenAI-shaped requests without any vendor-specific preamble:

Targeted regression test for the adapter posture and the no-warmup invariant:

Observed result after fix: provider construction takes 1 ms (no warmup network call). The adapter holds the per-plugin baseUrl/model exactly as configured, with no fallback to any global config block. Embeddings round-trip in well under 200 ms against the live local server. Existing adapters (`openai`, `lmstudio`, `mistral`, `gemini`, `voyage`, `bedrock`, `deepinfra`, `ollama`, in-process `local`) are untouched.

What was not tested: did not run the new adapter inside an actual openclaw gateway process end-to-end, because the dist bundle does not include the new extension yet (the source-only invocation above is the closest equivalent without a release-tagged build). Did not run `pnpm check:changed` in Testbox; targeted `pnpm test extensions/openai-compatible-embeddings` plus targeted `npx oxlint extensions/openai-compatible-embeddings/` plus targeted `pnpm tsgo:prod` are all clean on the touched files.

Before evidence: not applicable for a feature add. The "before" is "this provider did not exist," and the operator pain it addresses is documented in the Summary above and in linked issue [Feature]: bundled openai-compatible embedding provider for self-hosted servers (llama.cpp, Ollama, vLLM, TGI, LocalAI) #80476.
Root Cause
N/A. Feature addition, not a regression fix. (For the underlying operator pain that motivated the addition, see linked issue #80476.)
Regression Test Plan
- … `extensions/openai-compatible-embeddings/memory-embedding-adapter.test.ts` (new), with the factory in `extensions/openai-compatible-embeddings/embedding-provider.ts`.
- … `id: "openai-compatible"`, `transport: "remote"`, no `autoSelectPriority`, no `authProviderId`, `allowExplicitWhenConfiguredAuto: true`, no `shouldContinueAutoSelection`. This is what stops the adapter from accidentally being auto-selected over an unrelated cloud provider whose key happens to be configured.
- … `create`. The adapter must produce exactly one factory invocation per `create` call; nothing else.
- … `baseUrl` and `model` exactly as supplied, so two different local servers do not share a cache entry.
- … `createRemoteEmbeddingProvider`. The risk surface is the posture (auto-select / auth / fallback) and the absence of any pre-call side effect. Both are testable in pure-TS with a mocked factory; no live server needed for the unit tests.
- … `openai-compatible` needs to (no auth provider, no auto-select, fully self-contained config, no vendor-specific warmup).

User-visible / Behavior Changes
Operators who configure `embedding.provider: "openai-compatible"` plus `embedding.baseUrl` and `embedding.model` under `plugins.entries.memory-lancedb.config.embedding` get a working embeddings flow against any OpenAI-compatible local server. No behavior change for any operator who has not opted in. Existing `lmstudio`/`openai`/`local`/etc. adapters keep doing exactly what they do today.

Diagram
Security Impact
- … `apiKey` is a per-plugin config field, treated identically to existing adapters' apiKey handling. Cache key strips the Authorization header.
- … `models.providers.*` block, so it cannot leak embedded text to a cloud provider on a stale config.

Repro + Verification
Environment
- … `llama-server` on localhost:8081
- … `plugins.entries.memory-lancedb.config.embedding.provider: "openai-compatible"`, `baseUrl: "http://localhost:8081/v1"`, `model: "text-embedding-bge-m3"`, `apiKey: "<bearer>"`, `dimensions: 1024`

Steps
1. … `llama-server -m <bge-m3.gguf> -a text-embedding-bge-m3 --embedding --host 127.0.0.1 --port 8081 --api-key <bearer>`.
2. … `~/.openclaw/openclaw.json` set memory-lancedb's embedding block to `provider: "openai-compatible"` plus `baseUrl`, `model`, optional `apiKey`/`headers`.

Expected
`provider.embedQuery("hello")` returns a 1024-dim vector in well under 200 ms. No event-loop stalls. No warmup warnings in the gateway log.

Actual
Matches expected. Verified end-to-end against llama.cpp serving BGE-M3 (terminal output included in Real behavior proof).
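For readers unfamiliar with the wire format these steps exercise, here is a minimal sketch of an OpenAI-shaped `/embeddings` request and response handler. The names `buildEmbeddingsRequest` and `parseEmbeddingsResponse` are hypothetical and are not taken from this PR; only the endpoint path, the `model`/`input` body fields, and the `data[].embedding` response shape follow the public OpenAI embeddings API that these servers emulate.

```typescript
// Hypothetical config shape mirroring the per-plugin `embedding` block.
interface OpenAiCompatibleEmbeddingConfig {
  baseUrl: string; // e.g. "http://localhost:8081/v1"
  model: string;
  apiKey?: string;
  headers?: Record<string, string>;
}

interface EmbeddingsRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Subset of the OpenAI embeddings response that this sketch relies on.
interface EmbeddingsResponse {
  data: { index: number; embedding: number[] }[];
}

// Build the request without sending it, so the shape can be inspected or
// passed to fetch(url, init) by the caller.
function buildEmbeddingsRequest(
  config: OpenAiCompatibleEmbeddingConfig,
  input: string[],
): EmbeddingsRequest {
  const headers: Record<string, string> = {
    "Content-Type": "application/json",
    ...(config.headers ?? {}),
  };
  if (config.apiKey) {
    headers["Authorization"] = `Bearer ${config.apiKey}`;
  }
  return {
    // Tolerate a trailing slash on baseUrl.
    url: `${config.baseUrl.replace(/\/+$/, "")}/embeddings`,
    init: {
      method: "POST",
      headers,
      body: JSON.stringify({ model: config.model, input }),
    },
  };
}

// Return vectors in input order and optionally check their dimensionality.
function parseEmbeddingsResponse(
  json: EmbeddingsResponse,
  expectedDims?: number,
): number[][] {
  const vectors = [...json.data]
    .sort((a, b) => a.index - b.index)
    .map((item) => item.embedding);
  for (const vector of vectors) {
    if (expectedDims !== undefined && vector.length !== expectedDims) {
      throw new Error(
        `expected ${expectedDims}-dim embedding, got ${vector.length}`,
      );
    }
  }
  return vectors;
}
```

A caller would pass `url` and `init` to `fetch`, then hand the parsed JSON to `parseEmbeddingsResponse` with `expectedDims` set to the configured `dimensions` (1024 for the BGE-M3 setup above).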
Evidence
Human Verification
- … `embedQuery` and `embedBatch` return correct-dimensionality vectors. Confirmed factory construction completes in 1 ms with no network call (vs lmstudio adapter's ~30s warmup hang against the same server). Verified the cache key contains the per-plugin baseUrl and model, with Authorization stripped. Verified `pnpm test extensions/openai-compatible-embeddings` (4/4 pass), `npx oxlint extensions/openai-compatible-embeddings/` (0 errors), and `pnpm tsgo:prod` (clean on touched files).
- … `baseUrl` throws a clear error rather than silently falling back. Missing `model` does the same. Adapter has no `autoSelectPriority`, so it cannot be picked automatically when the operator has another adapter's credentials configured. Headers passed through `embedding.headers` get attached to every request alongside the Authorization Bearer.
- … `pnpm check:changed` in Testbox.

Review Conversations
Compatibility / Migration
- … `embedding.provider: "openai-compatible"` and `baseUrl`/`model`.
- … `lmstudio` or `openai` at a local server can switch when convenient. Their existing setup keeps working.

Risks and Mitigations
- … `openai-compatible` provider with the existing in-process `local` provider.
  - … `openai-compatible` reads as "any server that speaks the OpenAI HTTP API," which is the term llama.cpp / Ollama / vLLM / TGI / LocalAI all use to describe themselves; the existing `local` id keeps the semantic of "local in-process model file."
- … `embedding.baseUrl` line by mistake while the `openai-compatible` provider is configured.
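The PR states that the adapter fails fast when `embedding.baseUrl` or `embedding.model` is missing, which is what guards against the accidental-removal risk. A minimal sketch of such a check follows, assuming hypothetical names (`RawEmbeddingConfig`, `resolveOpenAiCompatibleConfig`) and error wording that are not taken from the actual extension.

```typescript
// Hypothetical shape of the per-plugin `embedding` block as read from config.
interface RawEmbeddingConfig {
  provider?: string;
  baseUrl?: string;
  model?: string;
  apiKey?: string;
  dimensions?: number;
}

interface ResolvedEmbeddingConfig {
  baseUrl: string;
  model: string;
  apiKey?: string;
  dimensions?: number;
}

// Validate the block up front. There is deliberately no default endpoint:
// a missing baseUrl is an error, never a silent fallback to a vendor cloud.
function resolveOpenAiCompatibleConfig(
  raw: RawEmbeddingConfig,
): ResolvedEmbeddingConfig {
  const baseUrl = raw.baseUrl?.trim();
  const model = raw.model?.trim();
  if (!baseUrl) {
    throw new Error(
      'openai-compatible embeddings: "embedding.baseUrl" is required (no default endpoint is used)',
    );
  }
  if (!model) {
    throw new Error(
      'openai-compatible embeddings: "embedding.model" is required',
    );
  }
  return { baseUrl, model, apiKey: raw.apiKey, dimensions: raw.dimensions };
}
```

Because the function reads only the object it is given, a stale or partially deleted config surfaces as an immediate, readable error at provider construction rather than as traffic to an unintended host.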