fix(embedding): Respect embedding context length by jackwh · Pull Request #1694 · jundot/omlx

jackwh · 2026-06-05T22:28:13Z

The embeddings endpoint now uses the model’s effective context window by default instead of falling through to the mlx-embeddings 512-token default. It also adds optional max_length and truncation request controls for callers who want to override that behavior explicitly.

Details

Adds max_length and truncation to EmbeddingRequest.
Threads the effective embedding max length from the server into EmbeddingEngine.embed().
Resolves embedding max length from model/tokenizer metadata for direct model usage.
Keeps 512 only as a final fallback when no model/tokenizer limit is known.
Preserves explicit caller overrides, including intentionally passing max_length: 512.
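
The precedence described in the list above can be sketched roughly as follows. This is an illustration, not the code in this PR: `EmbeddingRequestSketch` and `effective_max_length` are hypothetical names, the dataclass only stands in for the real `EmbeddingRequest`, and the ordering (request, then server window, then model/tokenizer limit, then 512) is my reading of the description.

```python
from dataclasses import dataclass
from typing import Optional

FALLBACK_MAX_LENGTH = 512  # last-resort cap named in the PR description


@dataclass
class EmbeddingRequestSketch:
    """Hypothetical stand-in for EmbeddingRequest.

    Only max_length and truncation are taken from the PR; everything else
    here is illustrative.
    """
    input: str
    max_length: Optional[int] = None
    truncation: Optional[bool] = None


def effective_max_length(
    request: EmbeddingRequestSketch,
    server_context_window: Optional[int],
    model_limit: Optional[int],
) -> int:
    """Return the first known limit in priority order, else the 512 fallback."""
    for candidate in (request.max_length, server_context_window, model_limit):
        if candidate is not None:
            return candidate
    return FALLBACK_MAX_LENGTH
```

Under this sketch an explicit `max_length=512` is returned as-is even when the model reports a larger window, which matches the "preserves explicit caller overrides" point.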

Testing

I ran the focused test coverage:

pytest tests/test_embedding.py tests/integration/test_server_endpoints.py::TestEmbeddingsEndpoint -q

Result:

87 passed, 1 deselected

I also tested locally with mlx-community/Qwen3-Embedding-8B-mxfp8. The server logs now show:

max_length=40960, truncation=True

and embeddings for long inputs with identical prefixes but different suffixes no longer collapse to the same vector/hash, confirming content beyond the old 512-token cutoff is being used.
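
For anyone who wants to see why the prefix/suffix check is a useful signal, here is a toy sketch. It does not call the model: a SHA-256 hash over a truncated token list stands in for the embedding, and the token lists and helper names are made up for illustration.

```python
import hashlib
from typing import List


def truncate(tokens: List[str], max_length: int) -> List[str]:
    """Keep only the first max_length tokens, as a truncating tokenizer would."""
    return tokens[:max_length]


def fingerprint(tokens: List[str]) -> str:
    """Stand-in for an embedding: equal token lists give equal fingerprints."""
    return hashlib.sha256(" ".join(tokens).encode("utf-8")).hexdigest()


# Two documents that share a 600-token prefix and differ only in the suffix.
prefix = ["tok"] * 600
doc_a = prefix + ["alpha"]
doc_b = prefix + ["beta"]

# With the old 512-token cap the differing suffix is cut off entirely.
collapsed = fingerprint(truncate(doc_a, 512)) == fingerprint(truncate(doc_b, 512))

# With a 40960-token limit both documents fit, so the suffix is included.
distinct = fingerprint(truncate(doc_a, 40960)) != fingerprint(truncate(doc_b, 40960))
```

Both `collapsed` and `distinct` are true here, which mirrors the before/after behavior described above.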

Backwards Compatibility

Existing embedding requests continue to work unchanged. The main behavioral change is that long inputs may now use more compute/memory because they are no longer silently capped at 512 tokens. Callers that want the previous cap can pass max_length: 512 explicitly.
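
As a rough illustration of the two request shapes, the sketch below builds a default payload and one that opts back into the old cap. `build_embedding_payload` is a hypothetical helper, and the `model` and `input` fields assume an OpenAI-style request body; only `max_length` and `truncation` come from this PR.

```python
import json
from typing import Any, Dict, Optional


def build_embedding_payload(
    model: str,
    text: str,
    max_length: Optional[int] = None,
    truncation: Optional[bool] = None,
) -> Dict[str, Any]:
    """Build a request body, adding the optional controls only when set."""
    payload: Dict[str, Any] = {"model": model, "input": text}
    if max_length is not None:
        payload["max_length"] = max_length
    if truncation is not None:
        payload["truncation"] = truncation
    return payload


# Unchanged request: the server picks the effective context length.
default_body = build_embedding_payload(
    "mlx-community/Qwen3-Embedding-8B-mxfp8", "some long document"
)

# Request that asks for the previous 512-token cap explicitly.
capped_body = build_embedding_payload(
    "mlx-community/Qwen3-Embedding-8B-mxfp8",
    "some long document",
    max_length=512,
    truncation=True,
)

print(json.dumps(capped_body, indent=2))
```

Leaving the new fields out of the body entirely, rather than sending nulls, keeps existing clients byte-for-byte compatible in this sketch.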

Use the effective model context window for /v1/embeddings instead of falling through to the mlx-embeddings 512-token default. Adds optional max_length and truncation request controls, preserves an explicit 512 cap when callers ask for it, and keeps 512 only as the final fallback when no model or tokenizer limit is available. Includes regression coverage for discovered context lengths, explicit request overrides, and direct embedding model defaults.

jundot · 2026-06-06T16:02:28Z

Thanks for fixing this. I verified that the endpoint now threads the effective embedding context length into the engine, preserves explicit max_length overrides, and covers the regression with focused tests.

One follow-up I may fold in later: custom embedding processors still use their own prepare_embedding_inputs() limits, so the new max_length/truncation request controls are not universal there. This does not block the reported Qwen3 text embedding issue.

This looks good to me, and I'm going to merge it.

get_embedding_max_length() returned a hard 512 when neither the request nor the server's max_context_window pinned a limit, re-truncating long-context embedding models in exactly the no-config case jundot#1687 was about. Return None instead so the engine/model embed() path resolves the model's own context length (max_position_embeddings / tokenizer model_max_length, already in MLXEmbeddingModel._resolve_max_length), keeping 512 only as that resolver's final fallback. Follow-up to jundot#1694.
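
The split described above, where the server helper returns None and the model resolves its own limit, might look roughly like this. Both functions are hypothetical sketches rather than the real `get_embedding_max_length()` or `MLXEmbeddingModel._resolve_max_length`; the attribute names `max_position_embeddings` and `model_max_length` are from the commit message, while the order in which they are checked is an assumption.

```python
from types import SimpleNamespace
from typing import Any, Optional


def get_embedding_max_length_sketch(
    request_max_length: Optional[int],
    max_context_window: Optional[int],
) -> Optional[int]:
    """Server side: return a pinned limit, or None to defer to the model."""
    if request_max_length is not None:
        return request_max_length
    if max_context_window is not None:
        return max_context_window
    return None  # previously a hard 512, per the commit message


def resolve_model_max_length(
    requested: Optional[int],
    config: Any,
    tokenizer: Any,
    fallback: int = 512,
) -> int:
    """Model side: use the pinned limit, else model metadata, else fallback."""
    if requested is not None:
        return requested
    # Assumed order: model config first, then the tokenizer's limit.
    for limit in (
        getattr(config, "max_position_embeddings", None),
        getattr(tokenizer, "model_max_length", None),
    ):
        if isinstance(limit, int) and limit > 0:
            return limit
    return fallback


# No request override and no server window: the model's own limit is used.
config = SimpleNamespace(max_position_embeddings=40960)
tokenizer = SimpleNamespace(model_max_length=None)
pinned = get_embedding_max_length_sketch(None, None)
resolved = resolve_model_max_length(pinned, config, tokenizer)
```

With these stand-in objects `resolved` is 40960, whereas a server helper that returned 512 in the no-config case would have pinned the limit before the model metadata was ever consulted.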

jackwh mentioned this pull request Jun 5, 2026

Embeddings silently truncate beyond 512 tokens, and configured overrides are ignored; Embeddings effectively useless for content longer than a few sentences #1687

Closed

jundot merged commit 6f99272 into jundot:main Jun 6, 2026

JimStenstrom mentioned this pull request Jun 7, 2026

fix(embeddings): fall back to model context length when no window is set #1718

Merged

jundot merged 1 commit into jundot:main from jackwh:embeddings-context