feat(server): add max_context_window_policy for server-wide context cap by cfbraun · Pull Request #1628 · jundot/omlx

cfbraun · 2026-06-03T11:02:06Z

Summary

Reworked per the maintainer review. The original PR changed max_context_window semantics from "fallback" to "policy cap"; this rev instead adds a separate nullable field alongside the existing fallback, preserving all current behavior unless explicitly opted in.

Design

SamplingSettings.max_context_window_policy: int | None = None (unset by default). When set, get_max_context_window returns min(native, policy) for models whose native context length was discovered. Per-model overrides and the fallback default are not capped — those are explicit operator choices that the policy must not silently override.

Resolution

Per-model override	Native (config.json)	Policy field	Effective cap	Notes
—	262 144	unset (None)	262 144	No behavior change for existing installs
—	262 144	128 000	128 000	Policy clamps the native value
—	32 768	128 000	32 768	Native already below policy — native wins
200 000	262 144	128 000	200 000	Per-model override escapes the clamp
—	(none discovered)	16 000	32 768	Fallback path; policy does not apply here

Why a separate field

Preserves existing settings.json behavior — files carrying the historical max_context_window=32768 default keep working as fallback only; no model gets silently clamped after upgrade.
Avoids 1_000_000 as a pseudo "no policy" value, which would have failed for models whose native context is above that.
Discoverability — ..._policy reads as what it does, and the existing field's role is unchanged so the dashboard label can be clarified without breaking semantics.

Test plan

pytest tests/test_context_window.py tests/test_context_scaling.py tests/test_settings.py tests/test_server.py — 243 passed
New TestGetMaxContextWindow cases: policy unset, policy clamps, native below policy, per-model override escapes, fallback-only path unaffected
TestSamplingSettings::test_policy_field_round_trip covers None-semantics serialization

Dashboard label/help-text clarification for the existing field is intentionally out of scope for this PR.

jundot · 2026-06-03T14:26:06Z

Thanks for explaining the motivation. I agree the current dashboard wording can read like a policy cap, but I do not think changing the existing max_context_window semantics is the right fix.

To me, this setting is better treated as a fallback/default for models whose native context length cannot be discovered. If a model declares 262k in config.json, that native value is the better source of truth for the model capability exposed by /v1/models and used by validation.

A single global hard cap also feels too broad as an operational control. The memory cost of the same context length varies a lot by model size, architecture, quantization, KV cache behavior, and batching, so operators would likely have to tune it for the largest active model. That makes it less useful as a server-wide policy than either per-model settings or a separate explicit cap.

It would also change behavior for existing settings.json files that persisted the old 32768 default, causing long-context models to be clamped after upgrade.

I think the safer direction is to keep max_context_window as the fallback behavior, update the dashboard label/help text to make that clear, and add a separate explicit global policy cap later if that feature is needed.

cfbraun · 2026-06-03T15:16:36Z

@jundot thanks for the considered review. The migration risk is the load-bearing concern here, so pushing back specifically on that — the rest of the analysis follows from where it lands.

The persisted 32768 only ever mattered when a model's config.json didn't declare context length. Modern mlx model packs (Qwen, Llama, Mistral, etc.) all expose max_position_embeddings or equivalent, so for the typical upgrade user that value has been silently inert — migrating it to the new default is cosmetic, not behavior-changing for their workload. The narrow exception is an operator running a model without discoverable native context who deliberately wanted a 32k clamp; a per-model override on that specific model preserves their current behavior, and the rare model case is exactly where per-model tuning belongs anyway.

If the migration mechanic itself is the blocker, I can push a version-gated from_dict migration that bumps stale 32768 values to 1_000_000 on load with a warning log — a few lines, idempotent, reversible by re-setting the field in the dashboard. That would make the upgrade path observable for the rare edge case while leaving the typical user's effective behavior unchanged.

Briefly on the other points:

"Too broad as operational control" — that's what the priority-1 per-model max_context_window is for. It already handles per-model sizing concerns (model size, quantization, KV behavior). The global is the operator's worst-case "no model on this server should accept prompts above N" ceiling — useful for shared/multi-tenant deployments, instances with a hard memory budget, or anywhere the largest-active-model defines the policy. Different concern, not the same knob worn twice.
"Add a separate policy field later" — that field would carry the same migration cost (operators have already been setting the existing field expecting it to cap, as I did with 128000) plus a discoverability problem (two near-identical-sounding fields in the dashboard). Whatever name you pick for the new field would just be the semantics this PR ships, under a different label.
"Update the dashboard label/help text" — that surfaces "this is just a fallback, models can exceed it" without actually giving operators a way to enforce a policy through the UI. Documents the problem without fixing it.

Happy to push the version-gated migration if that closes the upgrade-safety concern. If you'd rather go with the label-only fix + separate field later, I'll close this and rework — let me know which path.

jundot · 2026-06-04T04:42:47Z

@cfbraun Thanks, that framing helps. I agree the operator need for a server-wide context ceiling is valid, but I do not want to change the semantics of the existing max_context_window field.

I want to keep the current field as the fallback value used when a model's native context length cannot be discovered, and make that meaning clearer in the dashboard text. For the hard policy behavior, I think the safer design is to add a separate nullable global context limit setting that is unset by default and only applies min(native_context, global_limit) when explicitly configured.

That preserves existing settings.json behavior, avoids treating stale 32768 defaults as new hard caps, and avoids using 1_000_000 as a pseudo "no policy" value for models with native context above that. If you are open to that direction, this PR can be reworked around the separate setting.

cfbraun · 2026-06-05T08:39:11Z

@jundot Reworked exactly around your proposal. The PR now adds SamplingSettings.max_context_window_policy: int | None = None as a separate nullable field, unset by default. get_max_context_window returns min(native, policy) only when policy is set and the model's native context length was discovered via its config.json. Per-model overrides and the fallback default are not capped — those are explicit operator choices and the policy must not override them.

The three properties you flagged are preserved:

Existing settings.json behavior unchanged — files carrying the historical max_context_window=32768 keep working as fallback only; no model gets silently clamped on upgrade.
No 1_000_000 pseudo "no policy" sentinel — the original PR's default change is fully reverted; the new field is nullable in the natural way.
Per-model override is the escape hatch — operators can still raise individual models above the policy when needed.

Title + body updated to match the new design. Force-pushed 725a9ff (now 511095c after this morning's sync rebase). 243 tests pass on the touched modules; new TestGetMaxContextWindow cases cover policy unset / policy clamps / native below policy / per-model override escapes / fallback path unaffected, plus a to_dict/from_dict round-trip for the new field.

Ready for another look.

@jundot

Replaces the original PR design (rework reusing max_context_window semantics) with @jundot's proposed shape after review: > I want to keep the current field as the fallback value used when a > model's native context length cannot be discovered, and make that > meaning clearer in the dashboard text. For the hard policy behavior, > I think the safer design is to add a separate nullable global > context limit setting that is unset by default and only applies > min(native_context, global_limit) when explicitly configured. What this PR adds: - ``SamplingSettings.max_context_window_policy: int | None = None`` (nullable, unset by default) in both ``omlx/server.py`` (``SamplingDefaults``) and ``omlx/settings.py`` (``SamplingSettings``). - ``get_max_context_window`` returns ``min(native, policy)`` only when policy is set and the model's native context length was discovered via its ``config.json``. Per-model overrides and the fallback default are not capped — those are explicit operator choices that the policy must not override. Why a separate field: - Preserves existing ``settings.json`` behavior. Files carrying the historical ``max_context_window=32768`` default keep working as fallback only; no model gets silently clamped after upgrade. - Avoids using a sentinel like ``1_000_000`` as a pseudo "no policy" value (which would have failed for models whose native context is above that). - Discoverability: the new field has a name that reads as what it does (``..._policy``), and the existing field's role is unchanged so its dashboard label can be clarified without breaking semantics. Tests under ``TestGetMaxContextWindow``: - Policy unset → native returned verbatim (zero behavior change). - Policy set + native > policy → ``min`` clamps to policy. - Policy set + native < policy → native wins (cap, not floor). - Per-model override escapes policy clamp. - Policy does not apply to the fallback-only path (no native, no per-model override) — fallback wins. Plus ``test_policy_field_round_trip`` in ``TestSamplingSettings`` covers serialization/deserialization keeping ``None`` semantics.

jundot · 2026-06-06T16:08:36Z

Thanks for reworking this around the separate nullable setting. The core server-side resolution now matches the design I was looking for: the existing max_context_window remains the fallback, and max_context_window_policy is unset by default and only caps discovered native context lengths.

I'm going to handle the remaining admin UI/settings polish as a follow-up: clarify the existing global setting so it reads as a fallback, and add the new optional "Global Context Window Cap" at the top of Generation Defaults with an empty/default None value.

This looks good to me, and I'm going to merge it.

Resolves conflicts in favor of the merged upstream shape for: - jundot#1628 (max_context_window_policy): upstream merged the reworked separate-nullable-field design, not the original "reuse max_context_window" design that still sat on this branch. Drop the 1_000_000 default and adopt the 32768 fallback + optional policy cap jundot ultimately wanted. - jundot#1451 (inline LRU unlinks): upstream's slice+reinsert variant of the burst cap landed (with observability counters from 021162b). Drop the local push-cap-into-evict_until_size variant. - Test suites realigned to the merged shapes (TestInlineLRUUnlinks, policy-cap test names, 32768 fallback assertions).

cfbraun force-pushed the pr/max-context-window-policy-cap branch from f292112 to 725a9ff Compare June 5, 2026 06:11

cfbraun changed the title ~~feat(server): global max_context_window acts as policy cap, not fallback~~ feat(server): add max_context_window_policy for server-wide context cap Jun 5, 2026

cfbraun force-pushed the pr/max-context-window-policy-cap branch from 725a9ff to 511095c Compare June 5, 2026 06:16

cfbraun force-pushed the pr/max-context-window-policy-cap branch from 511095c to 9d3dd05 Compare June 5, 2026 16:15

cfbraun force-pushed the pr/max-context-window-policy-cap branch from 9d3dd05 to 5beb395 Compare June 6, 2026 08:49

jundot merged commit c031c83 into jundot:main Jun 6, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(server): add max_context_window_policy for server-wide context cap#1628

feat(server): add max_context_window_policy for server-wide context cap#1628
jundot merged 1 commit into
jundot:mainfrom
cfbraun:pr/max-context-window-policy-cap

cfbraun commented Jun 3, 2026 •

edited

Loading

Uh oh!

jundot commented Jun 3, 2026 •

edited

Loading

Uh oh!

cfbraun commented Jun 3, 2026

Uh oh!

jundot commented Jun 4, 2026

Uh oh!

cfbraun commented Jun 5, 2026

Uh oh!

jundot commented Jun 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

cfbraun commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Design

Resolution

Why a separate field

Test plan

Uh oh!

jundot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cfbraun commented Jun 3, 2026

Uh oh!

jundot commented Jun 4, 2026

Uh oh!

cfbraun commented Jun 5, 2026

Uh oh!

jundot commented Jun 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cfbraun commented Jun 3, 2026 •

edited

Loading

jundot commented Jun 3, 2026 •

edited

Loading