Skip to content

feat(server): add max_context_window_policy for server-wide context cap#1628

Merged
jundot merged 1 commit into
jundot:mainfrom
cfbraun:pr/max-context-window-policy-cap
Jun 6, 2026
Merged

feat(server): add max_context_window_policy for server-wide context cap#1628
jundot merged 1 commit into
jundot:mainfrom
cfbraun:pr/max-context-window-policy-cap

Conversation

@cfbraun

@cfbraun cfbraun commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Summary

Reworked per the maintainer review. The original PR changed max_context_window semantics from "fallback" to "policy cap"; this rev instead adds a separate nullable field alongside the existing fallback, preserving all current behavior unless explicitly opted in.

Design

SamplingSettings.max_context_window_policy: int | None = None (unset by default). When set, get_max_context_window returns min(native, policy) for models whose native context length was discovered. Per-model overrides and the fallback default are not capped — those are explicit operator choices that the policy must not silently override.

Resolution

Per-model override Native (config.json) Policy field Effective cap Notes
262 144 unset (None) 262 144 No behavior change for existing installs
262 144 128 000 128 000 Policy clamps the native value
32 768 128 000 32 768 Native already below policy — native wins
200 000 262 144 128 000 200 000 Per-model override escapes the clamp
(none discovered) 16 000 32 768 Fallback path; policy does not apply here

Why a separate field

  • Preserves existing settings.json behavior — files carrying the historical max_context_window=32768 default keep working as fallback only; no model gets silently clamped after upgrade.
  • Avoids 1_000_000 as a pseudo "no policy" value, which would have failed for models whose native context is above that.
  • Discoverability..._policy reads as what it does, and the existing field's role is unchanged so the dashboard label can be clarified without breaking semantics.

Test plan

  • pytest tests/test_context_window.py tests/test_context_scaling.py tests/test_settings.py tests/test_server.py — 243 passed
  • New TestGetMaxContextWindow cases: policy unset, policy clamps, native below policy, per-model override escapes, fallback-only path unaffected
  • TestSamplingSettings::test_policy_field_round_trip covers None-semantics serialization

Dashboard label/help-text clarification for the existing field is intentionally out of scope for this PR.

@jundot

jundot commented Jun 3, 2026

Copy link
Copy Markdown
Owner

Thanks for explaining the motivation. I agree the current dashboard wording can read like a policy cap, but I do not think changing the existing max_context_window semantics is the right fix.

To me, this setting is better treated as a fallback/default for models whose native context length cannot be discovered. If a model declares 262k in config.json, that native value is the better source of truth for the model capability exposed by /v1/models and used by validation.

A single global hard cap also feels too broad as an operational control. The memory cost of the same context length varies a lot by model size, architecture, quantization, KV cache behavior, and batching, so operators would likely have to tune it for the largest active model. That makes it less useful as a server-wide policy than either per-model settings or a separate explicit cap.

It would also change behavior for existing settings.json files that persisted the old 32768 default, causing long-context models to be clamped after upgrade.

I think the safer direction is to keep max_context_window as the fallback behavior, update the dashboard label/help text to make that clear, and add a separate explicit global policy cap later if that feature is needed.

@cfbraun

cfbraun commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

@jundot thanks for the considered review. The migration risk is the load-bearing concern here, so pushing back specifically on that — the rest of the analysis follows from where it lands.

The persisted 32768 only ever mattered when a model's config.json didn't declare context length. Modern mlx model packs (Qwen, Llama, Mistral, etc.) all expose max_position_embeddings or equivalent, so for the typical upgrade user that value has been silently inert — migrating it to the new default is cosmetic, not behavior-changing for their workload. The narrow exception is an operator running a model without discoverable native context who deliberately wanted a 32k clamp; a per-model override on that specific model preserves their current behavior, and the rare model case is exactly where per-model tuning belongs anyway.

If the migration mechanic itself is the blocker, I can push a version-gated from_dict migration that bumps stale 32768 values to 1_000_000 on load with a warning log — a few lines, idempotent, reversible by re-setting the field in the dashboard. That would make the upgrade path observable for the rare edge case while leaving the typical user's effective behavior unchanged.

Briefly on the other points:

  • "Too broad as operational control" — that's what the priority-1 per-model max_context_window is for. It already handles per-model sizing concerns (model size, quantization, KV behavior). The global is the operator's worst-case "no model on this server should accept prompts above N" ceiling — useful for shared/multi-tenant deployments, instances with a hard memory budget, or anywhere the largest-active-model defines the policy. Different concern, not the same knob worn twice.

  • "Add a separate policy field later" — that field would carry the same migration cost (operators have already been setting the existing field expecting it to cap, as I did with 128000) plus a discoverability problem (two near-identical-sounding fields in the dashboard). Whatever name you pick for the new field would just be the semantics this PR ships, under a different label.

  • "Update the dashboard label/help text" — that surfaces "this is just a fallback, models can exceed it" without actually giving operators a way to enforce a policy through the UI. Documents the problem without fixing it.

Happy to push the version-gated migration if that closes the upgrade-safety concern. If you'd rather go with the label-only fix + separate field later, I'll close this and rework — let me know which path.

@jundot

jundot commented Jun 4, 2026

Copy link
Copy Markdown
Owner

@cfbraun Thanks, that framing helps. I agree the operator need for a server-wide context ceiling is valid, but I do not want to change the semantics of the existing max_context_window field.

I want to keep the current field as the fallback value used when a model's native context length cannot be discovered, and make that meaning clearer in the dashboard text. For the hard policy behavior, I think the safer design is to add a separate nullable global context limit setting that is unset by default and only applies min(native_context, global_limit) when explicitly configured.

That preserves existing settings.json behavior, avoids treating stale 32768 defaults as new hard caps, and avoids using 1_000_000 as a pseudo "no policy" value for models with native context above that. If you are open to that direction, this PR can be reworked around the separate setting.

@cfbraun cfbraun force-pushed the pr/max-context-window-policy-cap branch from f292112 to 725a9ff Compare June 5, 2026 06:11
@cfbraun cfbraun changed the title feat(server): global max_context_window acts as policy cap, not fallback feat(server): add max_context_window_policy for server-wide context cap Jun 5, 2026
@cfbraun cfbraun force-pushed the pr/max-context-window-policy-cap branch from 725a9ff to 511095c Compare June 5, 2026 06:16
@cfbraun

cfbraun commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

@jundot Reworked exactly around your proposal. The PR now adds SamplingSettings.max_context_window_policy: int | None = None as a separate nullable field, unset by default. get_max_context_window returns min(native, policy) only when policy is set and the model's native context length was discovered via its config.json. Per-model overrides and the fallback default are not capped — those are explicit operator choices and the policy must not override them.

The three properties you flagged are preserved:

  • Existing settings.json behavior unchanged — files carrying the historical max_context_window=32768 keep working as fallback only; no model gets silently clamped on upgrade.
  • No 1_000_000 pseudo "no policy" sentinel — the original PR's default change is fully reverted; the new field is nullable in the natural way.
  • Per-model override is the escape hatch — operators can still raise individual models above the policy when needed.

Title + body updated to match the new design. Force-pushed 725a9ff (now 511095c after this morning's sync rebase). 243 tests pass on the touched modules; new TestGetMaxContextWindow cases cover policy unset / policy clamps / native below policy / per-model override escapes / fallback path unaffected, plus a to_dict/from_dict round-trip for the new field.

Ready for another look.

@cfbraun cfbraun force-pushed the pr/max-context-window-policy-cap branch from 511095c to 9d3dd05 Compare June 5, 2026 16:15
Replaces the original PR design (rework reusing max_context_window
semantics) with @jundot's proposed shape after review:

> I want to keep the current field as the fallback value used when a
> model's native context length cannot be discovered, and make that
> meaning clearer in the dashboard text. For the hard policy behavior,
> I think the safer design is to add a separate nullable global
> context limit setting that is unset by default and only applies
> min(native_context, global_limit) when explicitly configured.

What this PR adds:

- ``SamplingSettings.max_context_window_policy: int | None = None``
  (nullable, unset by default) in both ``omlx/server.py``
  (``SamplingDefaults``) and ``omlx/settings.py``
  (``SamplingSettings``).
- ``get_max_context_window`` returns ``min(native, policy)`` only
  when policy is set and the model's native context length was
  discovered via its ``config.json``. Per-model overrides and the
  fallback default are not capped — those are explicit operator
  choices that the policy must not override.

Why a separate field:

- Preserves existing ``settings.json`` behavior. Files carrying the
  historical ``max_context_window=32768`` default keep working as
  fallback only; no model gets silently clamped after upgrade.
- Avoids using a sentinel like ``1_000_000`` as a pseudo "no policy"
  value (which would have failed for models whose native context is
  above that).
- Discoverability: the new field has a name that reads as what it
  does (``..._policy``), and the existing field's role is unchanged
  so its dashboard label can be clarified without breaking semantics.

Tests under ``TestGetMaxContextWindow``:

- Policy unset → native returned verbatim (zero behavior change).
- Policy set + native > policy → ``min`` clamps to policy.
- Policy set + native < policy → native wins (cap, not floor).
- Per-model override escapes policy clamp.
- Policy does not apply to the fallback-only path (no native, no
  per-model override) — fallback wins.

Plus ``test_policy_field_round_trip`` in ``TestSamplingSettings``
covers serialization/deserialization keeping ``None`` semantics.
@cfbraun cfbraun force-pushed the pr/max-context-window-policy-cap branch from 9d3dd05 to 5beb395 Compare June 6, 2026 08:49
@jundot

jundot commented Jun 6, 2026

Copy link
Copy Markdown
Owner

Thanks for reworking this around the separate nullable setting. The core server-side resolution now matches the design I was looking for: the existing max_context_window remains the fallback, and max_context_window_policy is unset by default and only caps discovered native context lengths.

I'm going to handle the remaining admin UI/settings polish as a follow-up: clarify the existing global setting so it reads as a fallback, and add the new optional "Global Context Window Cap" at the top of Generation Defaults with an empty/default None value.

This looks good to me, and I'm going to merge it.

@jundot jundot merged commit c031c83 into jundot:main Jun 6, 2026
4 checks passed
cfbraun added a commit to cfbraun/omlx that referenced this pull request Jun 7, 2026
Resolves conflicts in favor of the merged upstream shape for:

- jundot#1628 (max_context_window_policy): upstream merged the reworked
  separate-nullable-field design, not the original "reuse
  max_context_window" design that still sat on this branch. Drop the
  1_000_000 default and adopt the 32768 fallback + optional policy cap
  jundot ultimately wanted.
- jundot#1451 (inline LRU unlinks): upstream's slice+reinsert variant of
  the burst cap landed (with observability counters from 021162b).
  Drop the local push-cap-into-evict_until_size variant.
- Test suites realigned to the merged shapes (TestInlineLRUUnlinks,
  policy-cap test names, 32768 fallback assertions).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants