fix(dflash): enforce prefill memory guard on DFlash primary path#1770

Merged
jundot merged 3 commits into jundot:main from JimStenstrom:fix/dflash-prefill-memory-guard
Jun 10, 2026
Conversation

@JimStenstrom

Contributor

Builds on #1755 (which exposed DFlashEngine.scheduler for fallback mode) and the recent move of the prefill guard to HTTP 400. Those land the guard for DFlash's fallback path; this PR adds the missing primary-mode guard so DFlash is protected in both modes, reusing the same machinery rather than reshaping it.

Summary

  • DFlash's primary (speculative) path bypasses the Scheduler, so it inherited BaseEngine's no-op preflight_chat/preflight_completion and ran prefills with no memory guard — a long prompt could OOM the process instead of returning a clean HTTP 400.
  • Give DFlash a _DFlashPrefillGuard (built in start()) plus preflight_* overrides that estimate the prefill peak and raise PrefillMemoryExceededError; in fallback mode they delegate to the fallback engine, whose scheduler already runs the full guard.
  • The ProcessMemoryEnforcer reaches the new guard through a small _prefill_guard arm in _resolve_scheduler.
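The preflight behaviour summarised above can be sketched roughly as follows. This is a minimal illustration, not the code in omlx/engine/dflash.py: `SketchDFlashEngine`, `SketchPrefillGuard`, `SketchFallbackEngine`, the `in_fallback_mode` flag, the flat `bytes_per_token` cost, and every method signature are assumptions made for the example. Only the names `PrefillMemoryExceededError`, `_prefill_guard`, and `preflight_completion` come from the PR.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class PrefillMemoryExceededError(Exception):
    """Stand-in for the project's error of the same name."""


@dataclass
class SketchPrefillGuard:
    """Hypothetical stand-in for _DFlashPrefillGuard."""

    ceiling_bytes: int    # watermark pushed in by the enforcer (assumed attribute)
    bytes_per_token: int  # assumed flat per-token prefill cost

    def check(self, prompt_tokens: int, in_use_bytes: int) -> None:
        # Estimate the prefill peak and refuse admission if it crosses the ceiling.
        peak = in_use_bytes + prompt_tokens * self.bytes_per_token
        if peak > self.ceiling_bytes:
            raise PrefillMemoryExceededError(
                f"estimated prefill peak {peak} B exceeds ceiling {self.ceiling_bytes} B"
            )


@dataclass
class SketchFallbackEngine:
    """Hypothetical fallback engine; its scheduler would run the full guard."""

    seen: List[int] = field(default_factory=list)

    def preflight_completion(self, prompt_tokens: int, in_use_bytes: int) -> None:
        self.seen.append(prompt_tokens)


class SketchDFlashEngine:
    def __init__(self, guard: SketchPrefillGuard,
                 fallback: Optional[SketchFallbackEngine] = None) -> None:
        self._prefill_guard = guard
        self._fallback = fallback
        self.in_fallback_mode = False  # assumed flag, for illustration only

    def preflight_completion(self, prompt_tokens: int, in_use_bytes: int) -> None:
        if self.in_fallback_mode and self._fallback is not None:
            # Fallback mode: delegate to the engine that owns a scheduler.
            self._fallback.preflight_completion(prompt_tokens, in_use_bytes)
            return
        # Primary (speculative) mode: no scheduler, so check the guard directly.
        self._prefill_guard.check(prompt_tokens, in_use_bytes)
```

The point of the shape is that the oversized request fails at the front door, where the server can turn the exception into an HTTP 400, rather than partway through a prefill.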

Why

After #1755, DFlashEngine.scheduler resolves the fallback engine's scheduler, so the enforcer can propagate watermarks once DFlash enters fallback mode. But in normal speculative mode DFlash has no scheduler at all — the enforcer had nowhere to push the ceiling, and the front-door preflight was still BaseEngine's no-op. An oversized prompt served on the primary path therefore skipped the guard entirely and could drive the process past the memory ceiling (observed end-to-end serving Qwen3-Coder-Next + DFlash with 56k-token prompts). This closes that primary-mode gap with the same pieces #1755 and the 400 handler already established.
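To make the "nowhere to push the ceiling" problem concrete, here is a rough sketch of the resolution order the enforcer needs. It is not the body of `_resolve_scheduler`: the function name `resolve_watermark_target`, the order of the two lookups, and the `memory_ceiling_bytes` attribute are assumptions for illustration. The PR only states that a primary-mode DFlash resolves to its `_prefill_guard` and that other engines see `None` there.

```python
from types import SimpleNamespace
from typing import Any, Optional


def resolve_watermark_target(engine: Any) -> Optional[Any]:
    """Return the object that should receive the memory ceiling, if any."""
    # Standard engines, and DFlash in fallback mode, expose a scheduler.
    scheduler = getattr(engine, "scheduler", None)
    if scheduler is not None:
        return scheduler
    # Primary-mode DFlash has no scheduler; fall through to its guard.
    # Engines without the attribute resolve to None, as before.
    return getattr(engine, "_prefill_guard", None)


def push_ceiling(engine: Any, ceiling_bytes: int) -> bool:
    """Propagate the ceiling; report whether anything received it."""
    target = resolve_watermark_target(engine)
    if target is None:
        return False
    target.memory_ceiling_bytes = ceiling_bytes  # hypothetical attribute name
    return True


# Usage: a primary-mode DFlash stand-in with a guard but no scheduler.
guard = SimpleNamespace(memory_ceiling_bytes=None)
dflash_primary = SimpleNamespace(scheduler=None, _prefill_guard=guard)
push_ceiling(dflash_primary, 48 * 1024**3)
```

Before this PR the second lookup did not exist, so in the sketch's terms `push_ceiling` would have returned `False` for a primary-mode DFlash.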

Changes

  • omlx/memory_monitor.py: two additive, engine-agnostic helpers — set_model_info_from_model() (populate a MemoryMonitor's KV/SDPA dims from an mlx-lm model; CacheList-aware layer counting, matching Scheduler._set_model_info_for_monitor) and raise_if_prefill_exceeds() (the shared peak-estimate-and-raise). Scheduler is intentionally left untouched — it keeps its own (now TurboQuant-coupled) copy, so this PR doesn't reshape the scheduler hot path or its preflight math.
  • omlx/engine/dflash.py: _DFlashPrefillGuard holds a MemoryMonitor + the enforcer's watermark attrs; preflight_chat/preflight_completion estimate the peak and raise, or delegate to the fallback engine in fallback mode. Coexists with the scheduler property added in #1755.
  • omlx/process_memory_enforcer.py: _resolve_scheduler resolves a primary-mode DFlash to its _prefill_guard (None for every other engine, so standard resolution is unchanged). Fallback-mode resolution and the "could not resolve scheduler" warning suppression are already handled by #1755 and untouched here.
  • The guard uses the uncompressed (base-dtype) KV estimate — the conservative, never-undercount choice for an OOM guard.
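A back-of-the-envelope calculation shows why the base-dtype estimate is the safe one. The sketch below uses the standard KV-cache size formula (keys plus values, per layer, per KV head, per token); it covers only the KV term, not the SDPA contribution the real estimate also accounts for. The model dimensions are made-up round numbers, not Qwen3-Coder-Next's, and `kv_cache_bytes` is a hypothetical helper, not a function in omlx/memory_monitor.py.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bits_per_element: int) -> int:
    """Bytes needed to hold K and V for num_tokens across all layers."""
    elements = 2 * num_layers * num_kv_heads * head_dim * num_tokens  # 2 = K and V
    return elements * bits_per_element // 8


# Illustrative dimensions only (assumed, not taken from any real model card).
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128
PROMPT_TOKENS = 56_000  # prompt length reported in the PR description

base_dtype = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, PROMPT_TOKENS, 16)
quantized_4bit = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, PROMPT_TOKENS, 4)

print(f"16-bit KV: {base_dtype / 1024**3:.2f} GiB")
print(f" 4-bit KV: {quantized_4bit / 1024**3:.2f} GiB (ignores quantization scales)")
```

If the guard budgeted for the compressed figure and the cache was in fact materialised at the base dtype, it would admit prompts that overshoot by a factor of four in this example. Budgeting for the larger number can only reject too early, never too late.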

Tests

New tests/test_dflash_prefill_memory_guard.py (12 cases: guard math mirroring test_scheduler_prefill_memory_guard, plus engine-level delegation including fallback-mode pass-through) and TestDFlashGuardPropagation in tests/test_process_memory_enforcer.py (enforcer reaches the primary guard; standard engines unchanged; no spurious warning).

.venv/bin/python -m pytest tests/test_dflash_prefill_memory_guard.py \
  tests/test_process_memory_enforcer.py tests/test_dflash_engine.py \
  tests/test_engine_preflight.py -q
# 198 passed

Also verified the new tests fail when the fix is reverted: neutralizing raise_if_prefill_exceeds makes test_preflight_raises_when_oversized fail with "DID NOT RAISE".

DFlash speculative decoding bypasses the scheduler, so its primary
(non-fallback) path inherited BaseEngine's no-op preflight_chat/
completion and ran prefills with no memory guard. A long prompt could
OOM the process instead of getting a clean HTTP 400 (observed serving
Qwen3-Coder-Next with 56k-token prompts).

Give DFlash's primary path a front-door preflight that reuses the
existing estimation machinery:

- Add two engine-agnostic helpers to memory_monitor.py:
  set_model_info_from_model() (populate a MemoryMonitor's KV/SDPA dims
  from an mlx-lm model, CacheList-aware layer counting) and
  raise_if_prefill_exceeds() (the shared peak-estimate + raise). These
  are additive: Scheduler keeps its own now-TurboQuant-coupled copy, so
  this change leaves the scheduler hot path untouched.
- DFlashEngine builds a _DFlashPrefillGuard (a MemoryMonitor + the
  enforcer's watermarks) in start() and overrides preflight_chat/
  completion to estimate the prefill peak and raise
  PrefillMemoryExceededError; in fallback mode it delegates to the
  fallback engine, whose scheduler runs the full guard.
- ProcessMemoryEnforcer._resolve_scheduler resolves the primary-mode
  guard via the engine's _prefill_guard so the watermarks propagate.
  Fallback-mode scheduler resolution and the spurious "could not
  resolve scheduler" warning are already handled upstream by
  DFlashEngine.scheduler.

The guard uses the uncompressed (base-dtype) KV estimate, the
conservative never-undercount choice for an OOM guard.

Tests: new tests/test_dflash_prefill_memory_guard.py plus enforcer
_prefill_guard propagation cases.

A DFlash prefix-cache hit reconstructs the matched KV into active
memory (dflash_mlx hydrate_target_cache clones every array), so the
guard deliberately passes no cached_tokens — unlike the scheduler's
resident paged cache, the full prompt's KV is allocated this request.
Document this at both preflight call sites and in the
raise_if_prefill_exceeds docstring so the prefix-hit count surfaced
as cached_tokens in API usage is never wired into the guard.

Nothing passes cached_tokens to the DFlash guard: both preflight call sites omit it, and a DFlash
prefix-cache hit reconstructs the matched KV into active memory, so a
nonzero value would always under-count the prefill peak. Removing the
parameter makes that misuse unrepresentable rather than merely
documented; raise_if_prefill_exceeds keeps cached_tokens for engines
whose caches hold KV resident. A new test pins the narrowed signature.
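
A test that pins a narrowed signature could look something like the sketch below. `SketchGuard`, its `check` method, and the stand-in `raise_if_prefill_exceeds` body are placeholders with assumed signatures; the actual test in tests/test_dflash_prefill_memory_guard.py may be written differently.

```python
import inspect


def raise_if_prefill_exceeds(prompt_tokens: int, cached_tokens: int = 0) -> None:
    """Stand-in for the shared helper, which keeps cached_tokens."""


class SketchGuard:
    """Hypothetical DFlash-side guard with the parameter removed."""

    def check(self, prompt_tokens: int) -> None:
        # A prefix hit is rehydrated into active memory, so the full prompt counts.
        raise_if_prefill_exceeds(prompt_tokens)


def test_guard_signature_has_no_cached_tokens() -> None:
    guard_params = inspect.signature(SketchGuard.check).parameters
    helper_params = inspect.signature(raise_if_prefill_exceeds).parameters
    assert "cached_tokens" not in guard_params  # narrowed on the DFlash guard
    assert "cached_tokens" in helper_params     # still available to other engines
```

With the parameter gone, a caller that tries `SketchGuard().check(10, cached_tokens=5)` gets a `TypeError` immediately instead of a silently under-counted peak.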
@jundot

jundot commented Jun 10, 2026

Owner

Thanks for adding this. This fixes a real DFlash primary-path gap: it bypasses the scheduler, so the prefill guard was not applied before admission.

This complements #1774 rather than replacing it. One small follow-up is needed: the new helper should avoid calling mx.get_active_memory() from the HTTP preflight path and use cached/physical memory instead, matching the current scheduler/enforcer policy.

@jundot jundot merged commit c6fa02e into jundot:main Jun 10, 2026
4 checks passed