fix(embedding): Respect embedding context length#1694
Merged
Conversation
Use the effective model context window for /v1/embeddings instead of falling through to the mlx-embeddings 512-token default. Adds optional max_length and truncation request controls, preserves an explicit 512 cap when callers ask for it, and keeps 512 only as the final fallback when no model or tokenizer limit is available. Includes regression coverage for discovered context lengths, explicit request overrides, and direct embedding model defaults.
Owner
|
Thanks for fixing this. I verified that the endpoint now threads the effective embedding context length into the engine, preserves explicit max_length overrides, and covers the regression with focused tests. One follow-up I may fold in later: custom embedding processors still use their own prepare_embedding_inputs() limits, so the new max_length/truncation request controls are not universal there. This does not block the reported Qwen3 text embedding issue. This looks good to me, and I'm going to merge it. |
JimStenstrom
added a commit
to JimStenstrom/omlx
that referenced
this pull request
Jun 7, 2026
get_embedding_max_length() returned a hard 512 when neither the request nor the server's max_context_window pinned a limit, re-truncating long-context embedding models in exactly the no-config case jundot#1687 was about. Return None instead so the engine/model embed() path resolves the model's own context length (max_position_embeddings / tokenizer model_max_length, already in MLXEmbeddingModel._resolve_max_length), keeping 512 only as that resolver's final fallback. Follow-up to jundot#1694.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #1687.
The embeddings endpoint now uses the model’s effective context window by default instead of falling through to the
mlx-embeddings512-token default. It also adds optionalmax_lengthandtruncationrequest controls for callers who want to override that behavior explicitly.Details
max_lengthandtruncationtoEmbeddingRequest.EmbeddingEngine.embed().512only as a final fallback when no model/tokenizer limit is known.max_length: 512.Testing
I ran the focused test coverage:
Result:
I also tested locally with
mlx-community/Qwen3-Embedding-8B-mxfp8. The server logs now show:and embeddings for long inputs with identical prefixes but different suffixes no longer collapse to the same vector/hash, confirming content beyond the old 512-token cutoff is being used.
Backwards Compatibility
Existing embedding requests continue to work unchanged. The main behavioral change is that long inputs may now use more compute/memory because they are no longer silently capped at 512 tokens. Callers that want the previous cap can pass
max_length: 512explicitly.