fix(dflash): enforce prefill memory guard on DFlash primary path #1770
Merged
jundot merged 3 commits on Jun 10, 2026
DFlash speculative decoding bypasses the scheduler, so its primary (non-fallback) path inherited BaseEngine's no-op preflight_chat/completion and ran prefills with no memory guard. A long prompt could OOM the process instead of getting a clean HTTP 400 (observed serving Qwen3-Coder-Next with 56k-token prompts).

Give DFlash's primary path a front-door preflight that reuses the existing estimation machinery:

- Add two engine-agnostic helpers to memory_monitor.py: set_model_info_from_model() (populate a MemoryMonitor's KV/SDPA dims from an mlx-lm model, CacheList-aware layer counting) and raise_if_prefill_exceeds() (the shared peak-estimate + raise). These are additive: Scheduler keeps its own now-TurboQuant-coupled copy, so this change leaves the scheduler hot path untouched.
- DFlashEngine builds a _DFlashPrefillGuard (a MemoryMonitor + the enforcer's watermarks) in start() and overrides preflight_chat/completion to estimate the prefill peak and raise PrefillMemoryExceededError; in fallback mode it delegates to the fallback engine, whose scheduler runs the full guard.
- ProcessMemoryEnforcer._resolve_scheduler resolves the primary-mode guard via the engine's _prefill_guard so the watermarks propagate. Fallback-mode scheduler resolution and the spurious "could not resolve scheduler" warning are already handled upstream by DFlashEngine.scheduler.

The guard uses the uncompressed (base-dtype) KV estimate, the conservative never-undercount choice for an OOM guard.

Tests: new tests/test_dflash_prefill_memory_guard.py plus enforcer _prefill_guard propagation cases.
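For readers who want the shape of the check, here is a minimal sketch of a peak-estimate-and-raise guard. It is not the code from this PR: `KVDims`, `estimate_prefill_kv_bytes`, and `check_prefill_fits` are hypothetical names, the error class is a local stand-in, and the estimate covers only uncompressed KV (the real helper also accounts for SDPA dims, which are omitted here).

```python
from dataclasses import dataclass


class PrefillMemoryExceededError(RuntimeError):
    """Local stand-in for the project's error type (assumption)."""


@dataclass
class KVDims:
    # Hypothetical container for the dimensions a monitor would hold.
    num_layers: int
    num_kv_heads: int
    head_dim: int
    dtype_bytes: int = 2  # base dtype such as fp16/bf16, i.e. uncompressed KV


def estimate_prefill_kv_bytes(dims: KVDims, prompt_tokens: int) -> int:
    # Keys and values (factor 2) for every layer, KV head, and token.
    per_token = 2 * dims.num_layers * dims.num_kv_heads * dims.head_dim * dims.dtype_bytes
    return per_token * prompt_tokens


def check_prefill_fits(dims: KVDims, prompt_tokens: int,
                       used_bytes: int, ceiling_bytes: int) -> int:
    # Estimated peak = memory already in use + KV for the whole prompt.
    peak = used_bytes + estimate_prefill_kv_bytes(dims, prompt_tokens)
    if peak > ceiling_bytes:
        raise PrefillMemoryExceededError(
            f"estimated prefill peak {peak} B exceeds ceiling {ceiling_bytes} B"
        )
    return peak
```

An HTTP layer would catch the error at preflight and translate it into a 400 before any prefill work starts. The model dimensions passed in are whatever the caller measured; none are taken from a specific model here.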
A DFlash prefix-cache hit reconstructs the matched KV into active memory (dflash_mlx hydrate_target_cache clones every array), so the guard deliberately passes no cached_tokens — unlike the scheduler's resident paged cache, the full prompt's KV is allocated this request. Document this at both preflight call sites and in the raise_if_prefill_exceeds docstring so the prefix-hit count surfaced as cached_tokens in API usage is never wired into the guard.
Nothing passes it: both preflight call sites omit it, and a DFlash prefix-cache hit reconstructs the matched KV into active memory, so a nonzero value would always under-count the prefill peak. Removing the parameter makes that misuse unrepresentable rather than merely documented; raise_if_prefill_exceeds keeps cached_tokens for engines whose caches hold KV resident. A new test pins the narrowed signature.
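A signature-pinning test of the kind described can be sketched with `inspect.signature`. The class and function below are hypothetical stand-ins, not the project's real guard or helper; they only illustrate the contrast between a guard that never accepts `cached_tokens` and a shared helper that still does.

```python
import inspect


class GuardSketch:
    """Hypothetical stand-in for a DFlash-style guard."""

    def check(self, prompt_tokens: int) -> int:
        # No cached_tokens parameter: a prefix hit is rebuilt in active
        # memory, so the full prompt always counts toward the peak.
        return prompt_tokens


def tokens_to_allocate(prompt_tokens: int, cached_tokens: int = 0) -> int:
    # Stand-in for a shared helper: engines whose caches keep KV resident
    # only allocate the uncached suffix.
    return max(prompt_tokens - cached_tokens, 0)


def test_guard_signature_has_no_cached_tokens():
    params = inspect.signature(GuardSketch.check).parameters
    # Fails as soon as someone re-adds the parameter to the guard.
    assert "cached_tokens" not in params
```

Pinning the signature means a later change that reintroduces the parameter fails a test instead of silently under-counting.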
Owner
Thanks for adding this. This fixes a real DFlash primary-path gap: it bypasses the scheduler, so the prefill guard was not applied before admission. This complements #1774 rather than replacing it. One small follow-up is needed: the new helper should avoid calling mx.get_active_memory() from the HTTP preflight path and use cached/physical memory instead, matching the current scheduler/enforcer policy.
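One possible reading of that follow-up, sketched under assumptions: the request path reads a cached measurement, and the cache is refreshed elsewhere (for example by the enforcer's polling loop). `CachedMemoryReading` and its `sampler` argument are hypothetical; the sampler is injected so the sketch does not commit to any particular measurement call.

```python
import threading
from typing import Callable


class CachedMemoryReading:
    """Holds the latest memory sample so request handlers never sample directly."""

    def __init__(self, sampler: Callable[[], int]):
        self._sampler = sampler          # hypothetical: returns bytes in use
        self._lock = threading.Lock()
        self._value = sampler()          # one sample at construction time

    def refresh(self) -> None:
        # Called off the request path, e.g. from a periodic monitor loop.
        value = self._sampler()
        with self._lock:
            self._value = value

    def get(self) -> int:
        # Called from the HTTP preflight path: no sampling, just a read.
        with self._lock:
            return self._value
```

The trade-off is staleness: the preflight decision uses the most recent sample rather than a live one, so the refresh interval bounds how far the reading can drift.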
jundot added a commit that referenced this pull request on Jun 10, 2026
Builds on #1755 (which exposed DFlashEngine.scheduler for fallback mode) and the recent move of the prefill guard to HTTP 400. Those land the guard for DFlash's fallback path; this PR adds the missing primary-mode guard so DFlash is protected in both modes, reusing the same machinery rather than reshaping it.

Summary

- DFlash speculative decoding bypasses the Scheduler, so it inherited BaseEngine's no-op preflight_chat/preflight_completion and ran prefills with no memory guard: a long prompt could OOM the process instead of returning a clean HTTP 400.
- Adds _DFlashPrefillGuard (built in start()) plus preflight_* overrides that estimate the prefill peak and raise PrefillMemoryExceededError; in fallback mode they delegate to the fallback engine, whose scheduler already runs the full guard.
- ProcessMemoryEnforcer reaches the new guard through a small _prefill_guard arm in _resolve_scheduler.

Why

After #1755, DFlashEngine.scheduler resolves the fallback engine's scheduler, so the enforcer can propagate watermarks once DFlash enters fallback mode. But in normal speculative mode DFlash has no scheduler at all, so the enforcer had nowhere to push the ceiling, and the front-door preflight was still BaseEngine's no-op. An oversized prompt served on the primary path therefore skipped the guard entirely and could drive the process past the memory ceiling (observed end-to-end serving Qwen3-Coder-Next + DFlash with 56k-token prompts). This closes that primary-mode gap with the same pieces #1755 and the 400 handler already established.

Changes

- omlx/memory_monitor.py: two additive, engine-agnostic helpers: set_model_info_from_model() (populate a MemoryMonitor's KV/SDPA dims from an mlx-lm model; CacheList-aware layer counting, matching Scheduler._set_model_info_for_monitor) and raise_if_prefill_exceeds() (the shared peak-estimate-and-raise). Scheduler is intentionally left untouched; it keeps its own (now TurboQuant-coupled) copy, so this PR doesn't reshape the scheduler hot path or its preflight math.
- omlx/engine/dflash.py: _DFlashPrefillGuard holds a MemoryMonitor + the enforcer's watermark attrs; preflight_chat/preflight_completion estimate the peak and raise, or delegate to the fallback engine in fallback mode. Coexists with the scheduler property added in #1755 (engine/dflash: expose scheduler property for ProcessMemoryEnforcer).
- omlx/process_memory_enforcer.py: _resolve_scheduler resolves a primary-mode DFlash to its _prefill_guard (None for every other engine, so standard resolution is unchanged). Fallback-mode resolution and the "could not resolve scheduler" warning suppression are already handled by #1755 and untouched here.

Tests

- New tests/test_dflash_prefill_memory_guard.py (12 cases: guard math mirroring test_scheduler_prefill_memory_guard, plus engine-level delegation including fallback-mode pass-through) and TestDFlashGuardPropagation in tests/test_process_memory_enforcer.py (enforcer reaches the primary guard; standard engines unchanged; no spurious warning).
- Also verified the new tests fail when the fix is reverted: neutralizing raise_if_prefill_exceeds makes test_preflight_raises_when_oversized fail with "DID NOT RAISE".
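The enforcer wiring described under Changes can be sketched roughly as follows. This is an illustration, not the code in omlx/process_memory_enforcer.py: the function names, the lookup order, and the `memory_ceiling_bytes` attribute are assumptions made for the example.

```python
from types import SimpleNamespace
from typing import Any, Optional


def resolve_watermark_target(engine: Any) -> Optional[Any]:
    """Return the object that should receive the enforcer's watermarks."""
    scheduler = getattr(engine, "scheduler", None)
    if scheduler is not None:
        # Standard engines, or DFlash once it has entered fallback mode.
        return scheduler
    # Primary-mode DFlash has no scheduler; fall through to its guard.
    # Engines without the attribute resolve to None, as before.
    return getattr(engine, "_prefill_guard", None)


def propagate_ceiling(engine: Any, ceiling_bytes: int) -> bool:
    target = resolve_watermark_target(engine)
    if target is None:
        return False
    target.memory_ceiling_bytes = ceiling_bytes  # hypothetical attribute name
    return True


if __name__ == "__main__":
    guard = SimpleNamespace()
    dflash_primary = SimpleNamespace(scheduler=None, _prefill_guard=guard)
    propagate_ceiling(dflash_primary, 32 * 1024**3)
    print(guard.memory_ceiling_bytes)
```

Because the guard arm is only consulted when no scheduler resolves, engines that already expose a scheduler take the same path they did before.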