fix(dflash): enforce prefill memory guard on DFlash primary path by JimStenstrom · Pull Request #1770 · jundot/omlx

JimStenstrom · 2026-06-09T14:07:29Z

Builds on #1755 (which exposed DFlashEngine.scheduler for fallback mode) and the recent move of the prefill guard to HTTP 400. Those land the guard for DFlash's fallback path; this PR adds the missing primary-mode guard so DFlash is protected in both modes, reusing the same machinery rather than reshaping it.

Summary

- DFlash's primary (speculative) path bypasses the Scheduler, so it inherited BaseEngine's no-op preflight_chat/preflight_completion and ran prefills with no memory guard: a long prompt could OOM the process instead of returning a clean HTTP 400.
- Give DFlash a _DFlashPrefillGuard (built in start()) plus preflight_* overrides that estimate the prefill peak and raise PrefillMemoryExceededError; in fallback mode they delegate to the fallback engine, whose scheduler already runs the full guard.
- The ProcessMemoryEnforcer reaches the new guard through a small _prefill_guard arm in _resolve_scheduler.
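
To make the control flow in the summary concrete, here is a minimal sketch of the preflight override and its fallback delegation. It is not the code in this PR: the class names PrefillGuard, FallbackEngine and SketchDFlashEngine, the in_fallback_mode flag, and the token-count argument to preflight_completion are all assumptions made for illustration, and the peak estimate is reduced to a single bytes-per-token constant.

```python
from typing import Optional


class PrefillMemoryExceededError(Exception):
    """Stand-in for the project's error that the server turns into HTTP 400."""


class PrefillGuard:
    """Hypothetical stand-in for _DFlashPrefillGuard."""

    def __init__(self, bytes_per_token: int, ceiling_bytes: int) -> None:
        self.bytes_per_token = bytes_per_token
        self.ceiling_bytes = ceiling_bytes

    def check(self, prompt_tokens: int) -> None:
        # Simplified peak estimate: every prompt token costs a fixed amount.
        peak = prompt_tokens * self.bytes_per_token
        if peak > self.ceiling_bytes:
            raise PrefillMemoryExceededError(
                f"estimated prefill peak {peak} B exceeds ceiling {self.ceiling_bytes} B"
            )


class FallbackEngine:
    """Stand-in for the scheduler-backed engine; only counts preflight calls."""

    def __init__(self) -> None:
        self.preflight_calls = 0

    def preflight_completion(self, prompt_tokens: int) -> None:
        self.preflight_calls += 1  # the real scheduler would run its own guard here


class SketchDFlashEngine:
    """Hypothetical engine showing the two preflight routes."""

    def __init__(self, guard: Optional[PrefillGuard],
                 fallback: Optional[FallbackEngine] = None) -> None:
        self._prefill_guard = guard
        self._fallback = fallback
        self.in_fallback_mode = False  # assumed flag name

    def preflight_completion(self, prompt_tokens: int) -> None:
        # Fallback mode: the fallback engine's scheduler owns the guard.
        if self.in_fallback_mode and self._fallback is not None:
            self._fallback.preflight_completion(prompt_tokens)
            return
        # Primary (speculative) mode: no scheduler, so check at the front door.
        if self._prefill_guard is not None:
            self._prefill_guard.check(prompt_tokens)
```

In this sketch an oversized prompt raises before any prefill work starts, which is what lets the HTTP layer answer with a 400 instead of the process running out of memory.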

Why

After #1755, DFlashEngine.scheduler resolves the fallback engine's scheduler, so the enforcer can propagate watermarks once DFlash enters fallback mode. But in normal speculative mode DFlash has no scheduler at all — the enforcer had nowhere to push the ceiling, and the front-door preflight was still BaseEngine's no-op. An oversized prompt served on the primary path therefore skipped the guard entirely and could drive the process past the memory ceiling (observed end-to-end serving Qwen3-Coder-Next + DFlash with 56k-token prompts). This closes that primary-mode gap with the same pieces #1755 and the 400 handler already established.
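
The resolution gap described above can be sketched as follows. This is an illustration under stated assumptions, not the project's _resolve_scheduler: the function names, the ceiling_bytes attribute, the order in which the two attributes are consulted, and the idea that a primary-mode DFlash exposes scheduler as None are all hypothetical.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Optional


def resolve_guard_target(engine: Any) -> Optional[Any]:
    """Return the object that should receive the enforcer's ceiling."""
    # Standard engines (and, per #1755, DFlash in fallback mode) expose a scheduler.
    scheduler = getattr(engine, "scheduler", None)
    if scheduler is not None:
        return scheduler
    # Primary-mode DFlash has no scheduler, so fall through to its guard.
    # Engines without a guard yield None, leaving their resolution unchanged.
    return getattr(engine, "_prefill_guard", None)


def propagate_ceiling(engines: Iterable[Any], ceiling_bytes: int) -> int:
    """Push the ceiling to every resolvable target; return how many were reached."""
    reached = 0
    for engine in engines:
        target = resolve_guard_target(engine)
        if target is None:
            continue
        target.ceiling_bytes = ceiling_bytes  # assumed attribute name
        reached += 1
    return reached


# Three toy engines: scheduler-backed, primary-mode DFlash, and one with neither.
standard = SimpleNamespace(scheduler=SimpleNamespace(ceiling_bytes=None))
dflash_primary = SimpleNamespace(
    scheduler=None, _prefill_guard=SimpleNamespace(ceiling_bytes=None)
)
unguarded = SimpleNamespace()
reached = propagate_ceiling([standard, dflash_primary, unguarded], 8 * 1024**3)
```

Before this PR the second engine would have behaved like the third: nothing to resolve, so the ceiling never arrived.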

Changes

- omlx/memory_monitor.py: two additive, engine-agnostic helpers. set_model_info_from_model() populates a MemoryMonitor's KV/SDPA dims from an mlx-lm model, with CacheList-aware layer counting that matches Scheduler._set_model_info_for_monitor. raise_if_prefill_exceeds() is the shared peak-estimate-and-raise. Scheduler is intentionally left untouched: it keeps its own (now TurboQuant-coupled) copy, so this PR doesn't reshape the scheduler hot path or its preflight math.
- omlx/engine/dflash.py: _DFlashPrefillGuard holds a MemoryMonitor plus the enforcer's watermark attrs; preflight_chat/preflight_completion estimate the peak and raise, or delegate to the fallback engine in fallback mode. Coexists with the scheduler property added in #1755.
- omlx/process_memory_enforcer.py: _resolve_scheduler resolves a primary-mode DFlash to its _prefill_guard (None for every other engine, so standard resolution is unchanged). Fallback-mode resolution and the "could not resolve scheduler" warning suppression are already handled by #1755 and untouched here.
- The guard uses the uncompressed (base-dtype) KV estimate, the conservative, never-undercount choice for an OOM guard.
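
A rough sketch of why the base-dtype estimate is the conservative one. It covers only the KV-cache term (the real helpers also account for SDPA dims, which this omits), the function names are hypothetical, and the model dimensions are illustrative round numbers rather than those of any particular model.

```python
class PrefillMemoryExceededError(Exception):
    """Stand-in for the project's exception of the same name."""


def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: float) -> int:
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return int(2 * num_layers * num_kv_heads * head_dim * dtype_bytes)


def raise_if_over(prompt_tokens: int, current_bytes: int, ceiling_bytes: int,
                  *, num_layers: int, num_kv_heads: int, head_dim: int,
                  dtype_bytes: float = 2) -> int:
    """Estimate the prefill peak (KV term only) and raise if it passes the ceiling."""
    per_token = kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes)
    peak = current_bytes + prompt_tokens * per_token
    if peak > ceiling_bytes:
        raise PrefillMemoryExceededError(
            f"estimated peak {peak} B exceeds ceiling {ceiling_bytes} B"
        )
    return peak


# Illustrative dims: 48 layers, 8 KV heads, head_dim 128.
DIMS = dict(num_layers=48, num_kv_heads=8, head_dim=128)
base = kv_bytes_per_token(**DIMS, dtype_bytes=2)         # 16-bit KV
compressed = kv_bytes_per_token(**DIMS, dtype_bytes=0.5)  # e.g. 4-bit KV
```

With these numbers the 16-bit estimate is four times the 4-bit one, so a guard sized on the base dtype can reject a prompt that would have fit, but it cannot admit one that would not.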

Tests

New tests/test_dflash_prefill_memory_guard.py (12 cases: guard math mirroring test_scheduler_prefill_memory_guard, plus engine-level delegation including fallback-mode pass-through) and TestDFlashGuardPropagation in tests/test_process_memory_enforcer.py (enforcer reaches the primary guard; standard engines unchanged; no spurious warning).

.venv/bin/python -m pytest tests/test_dflash_prefill_memory_guard.py \
  tests/test_process_memory_enforcer.py tests/test_dflash_engine.py \
  tests/test_engine_preflight.py -q
# 198 passed

Also verified the new tests fail when the fix is reverted: neutralizing raise_if_prefill_exceeds makes test_preflight_raises_when_oversized fail with "DID NOT RAISE".

DFlash speculative decoding bypasses the scheduler, so its primary (non-fallback) path inherited BaseEngine's no-op preflight_chat/completion and ran prefills with no memory guard. A long prompt could OOM the process instead of getting a clean HTTP 400 (observed serving Qwen3-Coder-Next with 56k-token prompts).

Give DFlash's primary path a front-door preflight that reuses the existing estimation machinery:

- Add two engine-agnostic helpers to memory_monitor.py: set_model_info_from_model() (populate a MemoryMonitor's KV/SDPA dims from an mlx-lm model, CacheList-aware layer counting) and raise_if_prefill_exceeds() (the shared peak-estimate + raise). These are additive: Scheduler keeps its own now-TurboQuant-coupled copy, so this change leaves the scheduler hot path untouched.
- DFlashEngine builds a _DFlashPrefillGuard (a MemoryMonitor + the enforcer's watermarks) in start() and overrides preflight_chat/completion to estimate the prefill peak and raise PrefillMemoryExceededError; in fallback mode it delegates to the fallback engine, whose scheduler runs the full guard.
- ProcessMemoryEnforcer._resolve_scheduler resolves the primary-mode guard via the engine's _prefill_guard so the watermarks propagate. Fallback-mode scheduler resolution and the spurious "could not resolve scheduler" warning are already handled upstream by DFlashEngine.scheduler.

The guard uses the uncompressed (base-dtype) KV estimate, the conservative never-undercount choice for an OOM guard.

Tests: new tests/test_dflash_prefill_memory_guard.py plus enforcer _prefill_guard propagation cases.

A DFlash prefix-cache hit reconstructs the matched KV into active memory (dflash_mlx hydrate_target_cache clones every array), so the guard deliberately passes no cached_tokens — unlike the scheduler's resident paged cache, the full prompt's KV is allocated this request. Document this at both preflight call sites and in the raise_if_prefill_exceeds docstring so the prefix-hit count surfaced as cached_tokens in API usage is never wired into the guard.

Nothing passes the guard's cached_tokens parameter: both preflight call sites omit it, and a DFlash prefix-cache hit reconstructs the matched KV into active memory, so a nonzero value would always under-count the prefill peak. Removing the parameter makes that misuse unrepresentable rather than merely documented; raise_if_prefill_exceeds keeps cached_tokens for engines whose caches hold KV resident. A new test pins the narrowed signature.
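
The under-count is easy to see with a little arithmetic. The sketch below is illustrative only: both function names are hypothetical, the per-token cost is an assumed round figure, and the 40k cached tokens are an invented example of a prefix hit on a 56k-token prompt.

```python
def discounted_estimate(prompt_tokens: int, cached_tokens: int,
                        bytes_per_token: int) -> int:
    """Estimate that trusts cached_tokens; valid only when cached KV stays resident."""
    return max(prompt_tokens - cached_tokens, 0) * bytes_per_token


def full_estimate(prompt_tokens: int, bytes_per_token: int) -> int:
    """Estimate for a cache hit that is cloned into active memory for this request."""
    return prompt_tokens * bytes_per_token


BYTES_PER_TOKEN = 196_608  # assumed KV cost per token, for illustration
prompt_tokens, cached_tokens = 56_000, 40_000

optimistic = discounted_estimate(prompt_tokens, cached_tokens, BYTES_PER_TOKEN)
actual = full_estimate(prompt_tokens, BYTES_PER_TOKEN)
shortfall = actual - optimistic  # memory the discounted guard would not see
```

Here the discounted figure misses exactly the cached portion, cached_tokens * BYTES_PER_TOKEN, which is the memory a DFlash prefix hit still allocates.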

jundot · 2026-06-10T01:56:25Z

Thanks for adding this. This fixes a real DFlash primary-path gap: it bypasses the scheduler, so the prefill guard was not applied before admission.

This complements #1774 rather than replacing it. One small follow-up is needed: the new helper should avoid calling mx.get_active_memory() from the HTTP preflight path and use cached/physical memory instead, matching the current scheduler/enforcer policy.

JimStenstrom added 3 commits June 9, 2026 08:41

jundot merged commit c6fa02e into jundot:main Jun 10, 2026
4 checks passed

jundot added a commit that referenced this pull request Jun 10, 2026

fix(dflash): avoid MLX telemetry in preflight guard (#1770)

c43e600

JimStenstrom mentioned this pull request Jun 10, 2026

fix(engine_pool): skip the settle wait when other engines are serving #1785

Merged

jundot merged 3 commits into jundot:main from JimStenstrom:fix/dflash-prefill-memory-guard
