fix(embedding): Respect embedding context length#1694

Merged
jundot merged 1 commit into jundot:main from jackwh:embeddings-context on Jun 6, 2026

@jackwh jackwh commented Jun 5, 2026

Contributor

Fixes #1687.

The embeddings endpoint now uses the model’s effective context window by default instead of falling through to the mlx-embeddings 512-token default. It also adds optional max_length and truncation request controls for callers who want to override that behavior explicitly.

Details

  • Adds max_length and truncation to EmbeddingRequest.
  • Threads the effective embedding max length from the server into EmbeddingEngine.embed().
  • Resolves embedding max length from model/tokenizer metadata for direct model usage.
  • Keeps 512 only as a final fallback when no model/tokenizer limit is known.
  • Preserves explicit caller overrides, including intentionally passing max_length: 512.
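The precedence described in the list above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the function name `resolve_max_length`, its parameters, the order in which the model and tokenizer limits are consulted, and the sanity bound used to ignore placeholder tokenizer limits are all assumptions.

```python
from typing import Optional

# Final fallback named in the PR description.
FALLBACK_MAX_LENGTH = 512

# Hypothetical bound: some tokenizers report a huge sentinel value when no
# real limit is configured, so treat implausibly large values as "unknown".
_SANITY_BOUND = 10**9


def resolve_max_length(
    request_max_length: Optional[int] = None,
    model_max_positions: Optional[int] = None,
    tokenizer_max_length: Optional[int] = None,
) -> int:
    """Pick an effective max length (hypothetical helper, not the PR's API)."""
    # 1. An explicit caller override always wins, including an explicit 512.
    if request_max_length is not None:
        return request_max_length

    # 2. Otherwise use the first plausible limit from model/tokenizer metadata.
    #    Checking the model before the tokenizer is an assumption of this sketch.
    for candidate in (model_max_positions, tokenizer_max_length):
        if candidate is not None and 0 < candidate < _SANITY_BOUND:
            return candidate

    # 3. Only when nothing is known, fall back to 512.
    return FALLBACK_MAX_LENGTH
```

The point of the ordering is that 512 is reached only when every other source is missing, while a caller who deliberately sends 512 still gets 512.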

Testing

I ran the focused test coverage:

pytest tests/test_embedding.py tests/integration/test_server_endpoints.py::TestEmbeddingsEndpoint -q

Result:

87 passed, 1 deselected

I also tested locally with mlx-community/Qwen3-Embedding-8B-mxfp8. The server logs now show:

max_length=40960, truncation=True

and embeddings for long inputs with identical prefixes but different suffixes no longer collapse to the same vector/hash, confirming content beyond the old 512-token cutoff is being used.
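A check along these lines could look like the sketch below. It is not the script used for the local test: the helper names are hypothetical, and the caller is expected to supply the two vectors returned by the server for inputs that share a long prefix but end differently.

```python
import hashlib
import struct
from typing import Sequence


def embedding_fingerprint(vector: Sequence[float]) -> str:
    """Hash a vector's float32 bytes so two embeddings can be compared cheaply."""
    packed = struct.pack(f"<{len(vector)}f", *vector)
    return hashlib.sha256(packed).hexdigest()


def collapsed(a: Sequence[float], b: Sequence[float]) -> bool:
    """True when both embeddings hash identically, i.e. the suffix was ignored."""
    return embedding_fingerprint(a) == embedding_fingerprint(b)
```

With truncation at 512 tokens, two inputs that differ only past the cutoff would be expected to produce `collapsed(...) == True`; after this change they should not.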

Backwards Compatibility

Existing embedding requests continue to work unchanged. The main behavioral change is that long inputs may now use more compute/memory because they are no longer silently capped at 512 tokens. Callers that want the previous cap can pass max_length: 512 explicitly.
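For callers who want the old cap back, a request body might be built as in this sketch. The `max_length` and `truncation` fields come from this PR; the `model` and `input` field names are assumed to follow the usual OpenAI-style embeddings payload, and the helper `build_embedding_request` is hypothetical.

```python
import json
from typing import Any, Dict, Optional


def build_embedding_request(
    model: str,
    text: str,
    max_length: Optional[int] = None,
    truncation: Optional[bool] = None,
) -> Dict[str, Any]:
    """Build a /v1/embeddings payload, omitting the optional controls when unset."""
    payload: Dict[str, Any] = {"model": model, "input": text}
    if max_length is not None:
        payload["max_length"] = max_length
    if truncation is not None:
        payload["truncation"] = truncation
    return payload


# Explicitly restore the previous 512-token cap.
capped = build_embedding_request(
    "mlx-community/Qwen3-Embedding-8B-mxfp8",
    "some long document ...",
    max_length=512,
    truncation=True,
)
body = json.dumps(capped)
```

Leaving both optional fields out keeps the payload identical to what existing clients already send, in which case the server picks the effective limit itself.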

Use the effective model context window for /v1/embeddings instead of falling through to the mlx-embeddings 512-token default.

Adds optional max_length and truncation request controls, preserves an explicit 512 cap when callers ask for it, and keeps 512 only as the final fallback when no model or tokenizer limit is available.

Includes regression coverage for discovered context lengths, explicit request overrides, and direct embedding model defaults.

jundot commented Jun 6, 2026

Owner

Thanks for fixing this. I verified that the endpoint now threads the effective embedding context length into the engine, preserves explicit max_length overrides, and covers the regression with focused tests.

One follow-up I may fold in later: custom embedding processors still use their own prepare_embedding_inputs() limits, so the new max_length/truncation request controls are not universal there. This does not block the reported Qwen3 text embedding issue.

This looks good to me, and I'm going to merge it.

@jundot jundot merged commit 6f99272 into jundot:main Jun 6, 2026
JimStenstrom added a commit to JimStenstrom/omlx that referenced this pull request Jun 7, 2026
get_embedding_max_length() returned a hard 512 when neither the request nor
the server's max_context_window pinned a limit, re-truncating long-context
embedding models in exactly the no-config case jundot#1687 was about. Return None
instead so the engine/model embed() path resolves the model's own context
length (max_position_embeddings / tokenizer model_max_length, already in
MLXEmbeddingModel._resolve_max_length), keeping 512 only as that resolver's
final fallback. Follow-up to jundot#1694.
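The behavior described in that follow-up commit message can be sketched as below. Only the name `get_embedding_max_length` and the "return None instead of 512" idea come from the message; the signature, parameter names, and the choice to check the request before the server setting are assumptions made for illustration.

```python
from typing import Optional


def get_embedding_max_length(
    request_max_length: Optional[int],
    max_context_window: Optional[int],
) -> Optional[int]:
    """Server-side limit selection (hypothetical signature).

    Returns None when nothing pins a limit, so the downstream embed() path
    can resolve the model's own context length instead of a hard 512.
    """
    if request_max_length is not None:
        return request_max_length
    if max_context_window is not None:
        return max_context_window
    # Previously this case produced 512; deferring with None avoids
    # re-truncating long-context models when no limit is configured.
    return None
```

Returning `None` rather than a number keeps the 512 fallback in a single place, the model-level resolver, instead of duplicating it at the server layer.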