fix(memory): walk VLM nested config for accurate KV+SDPA estimation#1448
Merged
Conversation
675402a to
2035d24
Compare
``_set_model_info_for_monitor`` reads ``model.config.num_hidden_layers`` directly. For multimodal models (Qwen3.6-VL, Gemma-4, some Llava / HF auto-wrappers) the top-level config holds the *vision tower* dimensions; the actual language model lives under ``text_config`` / ``language_config`` / ``llm_config``. Reading the wrong field under-estimates the LM's KV+SDPA peak by a constant factor (a 40-layer LM wrapped in a 33-layer vision tower undercounts by ~20%) and lets ``_preflight_memory_check`` approve prefills that go on to crash Metal mid-prefill. Fix: probe ``text_config`` / ``language_config`` / ``llm_config`` in priority order before extracting dimensions. If any sub-config exposes ``num_hidden_layers`` (or the legacy ``n_layer`` alias for GPT-style configs), prefer it — even when the top-level config also has a value, since the top-level value is the unreliable one on these wrappers. Falls back to the top-level config when no sub-config qualifies. Tests (tests/test_memory_monitor_vlm_config.py::TestSetModelInfoForMonitorVLMWalk): - text_config / language_config / llm_config each take priority over top-level vision-tower dims - plain LM (no sub-configs) still works - sub-config without a layer count is correctly skipped - ``n_layer`` (GPT-style) in the sub-config is recognized 6 new tests pass; 93 existing scheduler tests pass on this branch.
2035d24 to
0920a2c
Compare
Owner
|
Thanks for tracing this — VLM nested config is a real gap. I confirmed mlx-vlm Qwen3-VL/Gemma-4 configs (and the gemma-4-31B-it config.json on disk) and they all carry the LM layer count under Merging. |
cfbraun
added a commit
to cfbraun/omlx
that referenced
this pull request
Jun 1, 2026
The first CI run on this PR (https://github.com/jundot/omlx/actions/runs/26738412492) surfaced three failures introduced by the rebase against upstream's cb0904d (predictive prefill throttle) and our own change to ``Scheduler.__init__`` (constructs a MemoryMonitor in estimator-only mode so preflight checks work without prior set_model_info): 1. ``_set_model_info_for_monitor`` accepted MagicMock-style proxies as ``head_dim`` / ``num_kv_heads`` / ``num_layers`` because the ``if num_layers and num_kv_heads and head_dim:`` truthiness gate doesn't reject auto-attribute mocks. The estimator path (``estimate_chunk_transient_bytes`` → ``if hd > 128``) then blew up the upstream throttle tests with ``TypeError: '>' not supported between instances of 'MagicMock' and 'int'``. Tighten to ``isinstance(..., int) and ... > 0`` so the estimator stays disabled until set_model_info is called with genuine model dims. 2. ``test_schedule_waiting_preflight_rejection_releases`` patched ``_preflight_memory_check`` to return a plain ``str`` (matching the pre-PR contract). Our PR types the return as ``_PreflightRejection`` and the rejection-path code in ``_schedule_waiting`` reads ``.message`` on it. Update the mock to return an actual ``_PreflightRejection`` so the test exercises the real shape. 3. ``test_picks_text_config_over_top_level_vision_dims`` (added by our merged jundot#1448) asserted ``sched.memory_monitor is None`` at init. Our PR now constructs the monitor in estimator-only mode at init time; replace the assertion with the actual setup it needs (overwrite ``memory_monitor`` with a MagicMock to inspect ``set_model_info`` calls).
cfbraun
added a commit
to cfbraun/omlx
that referenced
this pull request
Jun 2, 2026
The first CI run on this PR (https://github.com/jundot/omlx/actions/runs/26738412492) surfaced three failures introduced by the rebase against upstream's cb0904d (predictive prefill throttle) and our own change to ``Scheduler.__init__`` (constructs a MemoryMonitor in estimator-only mode so preflight checks work without prior set_model_info): 1. ``_set_model_info_for_monitor`` accepted MagicMock-style proxies as ``head_dim`` / ``num_kv_heads`` / ``num_layers`` because the ``if num_layers and num_kv_heads and head_dim:`` truthiness gate doesn't reject auto-attribute mocks. The estimator path (``estimate_chunk_transient_bytes`` → ``if hd > 128``) then blew up the upstream throttle tests with ``TypeError: '>' not supported between instances of 'MagicMock' and 'int'``. Tighten to ``isinstance(..., int) and ... > 0`` so the estimator stays disabled until set_model_info is called with genuine model dims. 2. ``test_schedule_waiting_preflight_rejection_releases`` patched ``_preflight_memory_check`` to return a plain ``str`` (matching the pre-PR contract). Our PR types the return as ``_PreflightRejection`` and the rejection-path code in ``_schedule_waiting`` reads ``.message`` on it. Update the mock to return an actual ``_PreflightRejection`` so the test exercises the real shape. 3. ``test_picks_text_config_over_top_level_vision_dims`` (added by our merged jundot#1448) asserted ``sched.memory_monitor is None`` at init. Our PR now constructs the monitor in estimator-only mode at init time; replace the assertion with the actual setup it needs (overwrite ``memory_monitor`` with a MagicMock to inspect ``set_model_info`` calls).
cfbraun
added a commit
to cfbraun/omlx
that referenced
this pull request
Jun 2, 2026
The first CI run on this PR (https://github.com/jundot/omlx/actions/runs/26738412492) surfaced three failures introduced by the rebase against upstream's cb0904d (predictive prefill throttle) and our own change to ``Scheduler.__init__`` (constructs a MemoryMonitor in estimator-only mode so preflight checks work without prior set_model_info): 1. ``_set_model_info_for_monitor`` accepted MagicMock-style proxies as ``head_dim`` / ``num_kv_heads`` / ``num_layers`` because the ``if num_layers and num_kv_heads and head_dim:`` truthiness gate doesn't reject auto-attribute mocks. The estimator path (``estimate_chunk_transient_bytes`` → ``if hd > 128``) then blew up the upstream throttle tests with ``TypeError: '>' not supported between instances of 'MagicMock' and 'int'``. Tighten to ``isinstance(..., int) and ... > 0`` so the estimator stays disabled until set_model_info is called with genuine model dims. 2. ``test_schedule_waiting_preflight_rejection_releases`` patched ``_preflight_memory_check`` to return a plain ``str`` (matching the pre-PR contract). Our PR types the return as ``_PreflightRejection`` and the rejection-path code in ``_schedule_waiting`` reads ``.message`` on it. Update the mock to return an actual ``_PreflightRejection`` so the test exercises the real shape. 3. ``test_picks_text_config_over_top_level_vision_dims`` (added by our merged jundot#1448) asserted ``sched.memory_monitor is None`` at init. Our PR now constructs the monitor in estimator-only mode at init time; replace the assertion with the actual setup it needs (overwrite ``memory_monitor`` with a MagicMock to inspect ``set_model_info`` calls).
cfbraun
added a commit
to cfbraun/omlx
that referenced
this pull request
Jun 3, 2026
The first CI run on this PR (https://github.com/jundot/omlx/actions/runs/26738412492) surfaced three failures introduced by the rebase against upstream's cb0904d (predictive prefill throttle) and our own change to ``Scheduler.__init__`` (constructs a MemoryMonitor in estimator-only mode so preflight checks work without prior set_model_info): 1. ``_set_model_info_for_monitor`` accepted MagicMock-style proxies as ``head_dim`` / ``num_kv_heads`` / ``num_layers`` because the ``if num_layers and num_kv_heads and head_dim:`` truthiness gate doesn't reject auto-attribute mocks. The estimator path (``estimate_chunk_transient_bytes`` → ``if hd > 128``) then blew up the upstream throttle tests with ``TypeError: '>' not supported between instances of 'MagicMock' and 'int'``. Tighten to ``isinstance(..., int) and ... > 0`` so the estimator stays disabled until set_model_info is called with genuine model dims. 2. ``test_schedule_waiting_preflight_rejection_releases`` patched ``_preflight_memory_check`` to return a plain ``str`` (matching the pre-PR contract). Our PR types the return as ``_PreflightRejection`` and the rejection-path code in ``_schedule_waiting`` reads ``.message`` on it. Update the mock to return an actual ``_PreflightRejection`` so the test exercises the real shape. 3. ``test_picks_text_config_over_top_level_vision_dims`` (added by our merged jundot#1448) asserted ``sched.memory_monitor is None`` at init. Our PR now constructs the monitor in estimator-only mode at init time; replace the assertion with the actual setup it needs (overwrite ``memory_monitor`` with a MagicMock to inspect ``set_model_info`` calls).
cfbraun
added a commit
to cfbraun/omlx
that referenced
this pull request
Jun 3, 2026
The first CI run on this PR (https://github.com/jundot/omlx/actions/runs/26738412492) surfaced three failures introduced by the rebase against upstream's cb0904d (predictive prefill throttle) and our own change to ``Scheduler.__init__`` (constructs a MemoryMonitor in estimator-only mode so preflight checks work without prior set_model_info): 1. ``_set_model_info_for_monitor`` accepted MagicMock-style proxies as ``head_dim`` / ``num_kv_heads`` / ``num_layers`` because the ``if num_layers and num_kv_heads and head_dim:`` truthiness gate doesn't reject auto-attribute mocks. The estimator path (``estimate_chunk_transient_bytes`` → ``if hd > 128``) then blew up the upstream throttle tests with ``TypeError: '>' not supported between instances of 'MagicMock' and 'int'``. Tighten to ``isinstance(..., int) and ... > 0`` so the estimator stays disabled until set_model_info is called with genuine model dims. 2. ``test_schedule_waiting_preflight_rejection_releases`` patched ``_preflight_memory_check`` to return a plain ``str`` (matching the pre-PR contract). Our PR types the return as ``_PreflightRejection`` and the rejection-path code in ``_schedule_waiting`` reads ``.message`` on it. Update the mock to return an actual ``_PreflightRejection`` so the test exercises the real shape. 3. ``test_picks_text_config_over_top_level_vision_dims`` (added by our merged jundot#1448) asserted ``sched.memory_monitor is None`` at init. Our PR now constructs the monitor in estimator-only mode at init time; replace the assertion with the actual setup it needs (overwrite ``memory_monitor`` with a MagicMock to inspect ``set_model_info`` calls).
cfbraun
added a commit
to cfbraun/omlx
that referenced
this pull request
Jun 3, 2026
The first CI run on this PR (https://github.com/jundot/omlx/actions/runs/26738412492) surfaced three failures introduced by the rebase against upstream's cb0904d (predictive prefill throttle) and our own change to ``Scheduler.__init__`` (constructs a MemoryMonitor in estimator-only mode so preflight checks work without prior set_model_info): 1. ``_set_model_info_for_monitor`` accepted MagicMock-style proxies as ``head_dim`` / ``num_kv_heads`` / ``num_layers`` because the ``if num_layers and num_kv_heads and head_dim:`` truthiness gate doesn't reject auto-attribute mocks. The estimator path (``estimate_chunk_transient_bytes`` → ``if hd > 128``) then blew up the upstream throttle tests with ``TypeError: '>' not supported between instances of 'MagicMock' and 'int'``. Tighten to ``isinstance(..., int) and ... > 0`` so the estimator stays disabled until set_model_info is called with genuine model dims. 2. ``test_schedule_waiting_preflight_rejection_releases`` patched ``_preflight_memory_check`` to return a plain ``str`` (matching the pre-PR contract). Our PR types the return as ``_PreflightRejection`` and the rejection-path code in ``_schedule_waiting`` reads ``.message`` on it. Update the mock to return an actual ``_PreflightRejection`` so the test exercises the real shape. 3. ``test_picks_text_config_over_top_level_vision_dims`` (added by our merged jundot#1448) asserted ``sched.memory_monitor is None`` at init. Our PR now constructs the monitor in estimator-only mode at init time; replace the assertion with the actual setup it needs (overwrite ``memory_monitor`` with a MagicMock to inspect ``set_model_info`` calls).
cfbraun
added a commit
to cfbraun/omlx
that referenced
this pull request
Jun 5, 2026
The first CI run on this PR (https://github.com/jundot/omlx/actions/runs/26738412492) surfaced three failures introduced by the rebase against upstream's cb0904d (predictive prefill throttle) and our own change to ``Scheduler.__init__`` (constructs a MemoryMonitor in estimator-only mode so preflight checks work without prior set_model_info): 1. ``_set_model_info_for_monitor`` accepted MagicMock-style proxies as ``head_dim`` / ``num_kv_heads`` / ``num_layers`` because the ``if num_layers and num_kv_heads and head_dim:`` truthiness gate doesn't reject auto-attribute mocks. The estimator path (``estimate_chunk_transient_bytes`` → ``if hd > 128``) then blew up the upstream throttle tests with ``TypeError: '>' not supported between instances of 'MagicMock' and 'int'``. Tighten to ``isinstance(..., int) and ... > 0`` so the estimator stays disabled until set_model_info is called with genuine model dims. 2. ``test_schedule_waiting_preflight_rejection_releases`` patched ``_preflight_memory_check`` to return a plain ``str`` (matching the pre-PR contract). Our PR types the return as ``_PreflightRejection`` and the rejection-path code in ``_schedule_waiting`` reads ``.message`` on it. Update the mock to return an actual ``_PreflightRejection`` so the test exercises the real shape. 3. ``test_picks_text_config_over_top_level_vision_dims`` (added by our merged jundot#1448) asserted ``sched.memory_monitor is None`` at init. Our PR now constructs the monitor in estimator-only mode at init time; replace the assertion with the actual setup it needs (overwrite ``memory_monitor`` with a MagicMock to inspect ``set_model_info`` calls).
cfbraun
added a commit
to cfbraun/omlx
that referenced
this pull request
Jun 5, 2026
The first CI run on this PR (https://github.com/jundot/omlx/actions/runs/26738412492) surfaced three failures introduced by the rebase against upstream's cb0904d (predictive prefill throttle) and our own change to ``Scheduler.__init__`` (constructs a MemoryMonitor in estimator-only mode so preflight checks work without prior set_model_info): 1. ``_set_model_info_for_monitor`` accepted MagicMock-style proxies as ``head_dim`` / ``num_kv_heads`` / ``num_layers`` because the ``if num_layers and num_kv_heads and head_dim:`` truthiness gate doesn't reject auto-attribute mocks. The estimator path (``estimate_chunk_transient_bytes`` → ``if hd > 128``) then blew up the upstream throttle tests with ``TypeError: '>' not supported between instances of 'MagicMock' and 'int'``. Tighten to ``isinstance(..., int) and ... > 0`` so the estimator stays disabled until set_model_info is called with genuine model dims. 2. ``test_schedule_waiting_preflight_rejection_releases`` patched ``_preflight_memory_check`` to return a plain ``str`` (matching the pre-PR contract). Our PR types the return as ``_PreflightRejection`` and the rejection-path code in ``_schedule_waiting`` reads ``.message`` on it. Update the mock to return an actual ``_PreflightRejection`` so the test exercises the real shape. 3. ``test_picks_text_config_over_top_level_vision_dims`` (added by our merged jundot#1448) asserted ``sched.memory_monitor is None`` at init. Our PR now constructs the monitor in estimator-only mode at init time; replace the assertion with the actual setup it needs (overwrite ``memory_monitor`` with a MagicMock to inspect ``set_model_info`` calls).
cfbraun
added a commit
to cfbraun/omlx
that referenced
this pull request
Jun 6, 2026
The first CI run on this PR (https://github.com/jundot/omlx/actions/runs/26738412492) surfaced three failures introduced by the rebase against upstream's cb0904d (predictive prefill throttle) and our own change to ``Scheduler.__init__`` (constructs a MemoryMonitor in estimator-only mode so preflight checks work without prior set_model_info): 1. ``_set_model_info_for_monitor`` accepted MagicMock-style proxies as ``head_dim`` / ``num_kv_heads`` / ``num_layers`` because the ``if num_layers and num_kv_heads and head_dim:`` truthiness gate doesn't reject auto-attribute mocks. The estimator path (``estimate_chunk_transient_bytes`` → ``if hd > 128``) then blew up the upstream throttle tests with ``TypeError: '>' not supported between instances of 'MagicMock' and 'int'``. Tighten to ``isinstance(..., int) and ... > 0`` so the estimator stays disabled until set_model_info is called with genuine model dims. 2. ``test_schedule_waiting_preflight_rejection_releases`` patched ``_preflight_memory_check`` to return a plain ``str`` (matching the pre-PR contract). Our PR types the return as ``_PreflightRejection`` and the rejection-path code in ``_schedule_waiting`` reads ``.message`` on it. Update the mock to return an actual ``_PreflightRejection`` so the test exercises the real shape. 3. ``test_picks_text_config_over_top_level_vision_dims`` (added by our merged jundot#1448) asserted ``sched.memory_monitor is None`` at init. Our PR now constructs the monitor in estimator-only mode at init time; replace the assertion with the actual setup it needs (overwrite ``memory_monitor`` with a MagicMock to inspect ``set_model_info`` calls).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
_set_model_info_for_monitorreadsmodel.config.num_hidden_layersdirectly. For multimodal models (Qwen3.6-VL, Gemma-4, some Llava / HF auto-wrappers) the top-level config holds the vision tower dimensions; the actual language model lives undertext_config/language_config/llm_config. Reading the wrong field under-estimates the LM's KV+SDPA peak by a constant factor (a 40-layer LM wrapped in a 33-layer vision tower undercounts by ~20%) and lets_preflight_memory_checkapprove prefills that go on to crash Metal mid-prefill.Fix: probe
text_config/language_config/llm_configin priority order before extracting dimensions. If any sub-config exposesnum_hidden_layers(or the legacyn_layeralias for GPT-style configs), prefer it — even when the top-level config also has a value, since the top-level value is the unreliable one on these wrappers. Falls back to the top-level config when no sub-config qualifies.Test plan
pytest tests/test_memory_monitor_vlm_config.py::TestSetModelInfoForMonitorVLMWalk— 6 passedn_layerin the sub-config is recognized