Skip to content

fix(memory): walk VLM nested config for accurate KV+SDPA estimation#1448

Merged
jundot merged 1 commit into
jundot:mainfrom
cfbraun:pr/vlm-nested-config-estimation
May 28, 2026
Merged

fix(memory): walk VLM nested config for accurate KV+SDPA estimation#1448
jundot merged 1 commit into
jundot:mainfrom
cfbraun:pr/vlm-nested-config-estimation

Conversation

@cfbraun

@cfbraun cfbraun commented May 27, 2026

Copy link
Copy Markdown
Contributor

Summary

_set_model_info_for_monitor reads model.config.num_hidden_layers directly. For multimodal models (Qwen3.6-VL, Gemma-4, some Llava / HF auto-wrappers) the top-level config holds the vision tower dimensions; the actual language model lives under text_config / language_config / llm_config. Reading the wrong field under-estimates the LM's KV+SDPA peak by a constant factor (a 40-layer LM wrapped in a 33-layer vision tower undercounts by ~20%) and lets _preflight_memory_check approve prefills that go on to crash Metal mid-prefill.

Fix: probe text_config / language_config / llm_config in priority order before extracting dimensions. If any sub-config exposes num_hidden_layers (or the legacy n_layer alias for GPT-style configs), prefer it — even when the top-level config also has a value, since the top-level value is the unreliable one on these wrappers. Falls back to the top-level config when no sub-config qualifies.

Test plan

  • pytest tests/test_memory_monitor_vlm_config.py::TestSetModelInfoForMonitorVLMWalk — 6 passed
    • text_config / language_config / llm_config each take priority over top-level vision-tower dims
    • plain LM (no sub-configs) still works
    • sub-config without a layer count is correctly skipped
    • GPT-style n_layer in the sub-config is recognized

@cfbraun cfbraun force-pushed the pr/vlm-nested-config-estimation branch 3 times, most recently from 675402a to 2035d24 Compare May 28, 2026 05:48
``_set_model_info_for_monitor`` reads ``model.config.num_hidden_layers``
directly. For multimodal models (Qwen3.6-VL, Gemma-4, some Llava / HF
auto-wrappers) the top-level config holds the *vision tower*
dimensions; the actual language model lives under ``text_config`` /
``language_config`` / ``llm_config``. Reading the wrong field
under-estimates the LM's KV+SDPA peak by a constant factor (a 40-layer
LM wrapped in a 33-layer vision tower undercounts by ~20%) and lets
``_preflight_memory_check`` approve prefills that go on to crash Metal
mid-prefill.

Fix: probe ``text_config`` / ``language_config`` / ``llm_config`` in
priority order before extracting dimensions. If any sub-config exposes
``num_hidden_layers`` (or the legacy ``n_layer`` alias for GPT-style
configs), prefer it — even when the top-level config also has a value,
since the top-level value is the unreliable one on these wrappers.
Falls back to the top-level config when no sub-config qualifies.

Tests (tests/test_memory_monitor_vlm_config.py::TestSetModelInfoForMonitorVLMWalk):
- text_config / language_config / llm_config each take priority over
  top-level vision-tower dims
- plain LM (no sub-configs) still works
- sub-config without a layer count is correctly skipped
- ``n_layer`` (GPT-style) in the sub-config is recognized

6 new tests pass; 93 existing scheduler tests pass on this branch.
@cfbraun cfbraun force-pushed the pr/vlm-nested-config-estimation branch from 2035d24 to 0920a2c Compare May 28, 2026 08:01
@jundot

jundot commented May 28, 2026

Copy link
Copy Markdown
Owner

Thanks for tracing this — VLM nested config is a real gap. I confirmed mlx-vlm Qwen3-VL/Gemma-4 configs (and the gemma-4-31B-it config.json on disk) and they all carry the LM layer count under text_config only. omlx already walks text_config in models/vlm.py and models/embedding.py; scheduler was just missing it.

Merging.

@jundot jundot merged commit b5bfdd9 into jundot:main May 28, 2026
cfbraun added a commit to cfbraun/omlx that referenced this pull request Jun 1, 2026
The first CI run on this PR (https://github.com/jundot/omlx/actions/runs/26738412492)
surfaced three failures introduced by the rebase against upstream's
cb0904d (predictive prefill throttle) and our own change to
``Scheduler.__init__`` (constructs a MemoryMonitor in estimator-only
mode so preflight checks work without prior set_model_info):

1. ``_set_model_info_for_monitor`` accepted MagicMock-style proxies
   as ``head_dim`` / ``num_kv_heads`` / ``num_layers`` because the
   ``if num_layers and num_kv_heads and head_dim:`` truthiness gate
   doesn't reject auto-attribute mocks. The estimator path
   (``estimate_chunk_transient_bytes`` → ``if hd > 128``) then
   blew up the upstream throttle tests with
   ``TypeError: '>' not supported between instances of 'MagicMock'
   and 'int'``. Tighten to ``isinstance(..., int) and ... > 0`` so
   the estimator stays disabled until set_model_info is called with
   genuine model dims.

2. ``test_schedule_waiting_preflight_rejection_releases`` patched
   ``_preflight_memory_check`` to return a plain ``str`` (matching
   the pre-PR contract). Our PR types the return as
   ``_PreflightRejection`` and the rejection-path code in
   ``_schedule_waiting`` reads ``.message`` on it. Update the mock to
   return an actual ``_PreflightRejection`` so the test exercises the
   real shape.

3. ``test_picks_text_config_over_top_level_vision_dims`` (added by
   our merged jundot#1448) asserted ``sched.memory_monitor is None`` at
   init. Our PR now constructs the monitor in estimator-only mode at
   init time; replace the assertion with the actual setup it needs
   (overwrite ``memory_monitor`` with a MagicMock to inspect
   ``set_model_info`` calls).
cfbraun added a commit to cfbraun/omlx that referenced this pull request Jun 2, 2026
The first CI run on this PR (https://github.com/jundot/omlx/actions/runs/26738412492)
surfaced three failures introduced by the rebase against upstream's
cb0904d (predictive prefill throttle) and our own change to
``Scheduler.__init__`` (constructs a MemoryMonitor in estimator-only
mode so preflight checks work without prior set_model_info):

1. ``_set_model_info_for_monitor`` accepted MagicMock-style proxies
   as ``head_dim`` / ``num_kv_heads`` / ``num_layers`` because the
   ``if num_layers and num_kv_heads and head_dim:`` truthiness gate
   doesn't reject auto-attribute mocks. The estimator path
   (``estimate_chunk_transient_bytes`` → ``if hd > 128``) then
   blew up the upstream throttle tests with
   ``TypeError: '>' not supported between instances of 'MagicMock'
   and 'int'``. Tighten to ``isinstance(..., int) and ... > 0`` so
   the estimator stays disabled until set_model_info is called with
   genuine model dims.

2. ``test_schedule_waiting_preflight_rejection_releases`` patched
   ``_preflight_memory_check`` to return a plain ``str`` (matching
   the pre-PR contract). Our PR types the return as
   ``_PreflightRejection`` and the rejection-path code in
   ``_schedule_waiting`` reads ``.message`` on it. Update the mock to
   return an actual ``_PreflightRejection`` so the test exercises the
   real shape.

3. ``test_picks_text_config_over_top_level_vision_dims`` (added by
   our merged jundot#1448) asserted ``sched.memory_monitor is None`` at
   init. Our PR now constructs the monitor in estimator-only mode at
   init time; replace the assertion with the actual setup it needs
   (overwrite ``memory_monitor`` with a MagicMock to inspect
   ``set_model_info`` calls).
cfbraun added a commit to cfbraun/omlx that referenced this pull request Jun 2, 2026
The first CI run on this PR (https://github.com/jundot/omlx/actions/runs/26738412492)
surfaced three failures introduced by the rebase against upstream's
cb0904d (predictive prefill throttle) and our own change to
``Scheduler.__init__`` (constructs a MemoryMonitor in estimator-only
mode so preflight checks work without prior set_model_info):

1. ``_set_model_info_for_monitor`` accepted MagicMock-style proxies
   as ``head_dim`` / ``num_kv_heads`` / ``num_layers`` because the
   ``if num_layers and num_kv_heads and head_dim:`` truthiness gate
   doesn't reject auto-attribute mocks. The estimator path
   (``estimate_chunk_transient_bytes`` → ``if hd > 128``) then
   blew up the upstream throttle tests with
   ``TypeError: '>' not supported between instances of 'MagicMock'
   and 'int'``. Tighten to ``isinstance(..., int) and ... > 0`` so
   the estimator stays disabled until set_model_info is called with
   genuine model dims.

2. ``test_schedule_waiting_preflight_rejection_releases`` patched
   ``_preflight_memory_check`` to return a plain ``str`` (matching
   the pre-PR contract). Our PR types the return as
   ``_PreflightRejection`` and the rejection-path code in
   ``_schedule_waiting`` reads ``.message`` on it. Update the mock to
   return an actual ``_PreflightRejection`` so the test exercises the
   real shape.

3. ``test_picks_text_config_over_top_level_vision_dims`` (added by
   our merged jundot#1448) asserted ``sched.memory_monitor is None`` at
   init. Our PR now constructs the monitor in estimator-only mode at
   init time; replace the assertion with the actual setup it needs
   (overwrite ``memory_monitor`` with a MagicMock to inspect
   ``set_model_info`` calls).
cfbraun added a commit to cfbraun/omlx that referenced this pull request Jun 3, 2026
The first CI run on this PR (https://github.com/jundot/omlx/actions/runs/26738412492)
surfaced three failures introduced by the rebase against upstream's
cb0904d (predictive prefill throttle) and our own change to
``Scheduler.__init__`` (constructs a MemoryMonitor in estimator-only
mode so preflight checks work without prior set_model_info):

1. ``_set_model_info_for_monitor`` accepted MagicMock-style proxies
   as ``head_dim`` / ``num_kv_heads`` / ``num_layers`` because the
   ``if num_layers and num_kv_heads and head_dim:`` truthiness gate
   doesn't reject auto-attribute mocks. The estimator path
   (``estimate_chunk_transient_bytes`` → ``if hd > 128``) then
   blew up the upstream throttle tests with
   ``TypeError: '>' not supported between instances of 'MagicMock'
   and 'int'``. Tighten to ``isinstance(..., int) and ... > 0`` so
   the estimator stays disabled until set_model_info is called with
   genuine model dims.

2. ``test_schedule_waiting_preflight_rejection_releases`` patched
   ``_preflight_memory_check`` to return a plain ``str`` (matching
   the pre-PR contract). Our PR types the return as
   ``_PreflightRejection`` and the rejection-path code in
   ``_schedule_waiting`` reads ``.message`` on it. Update the mock to
   return an actual ``_PreflightRejection`` so the test exercises the
   real shape.

3. ``test_picks_text_config_over_top_level_vision_dims`` (added by
   our merged jundot#1448) asserted ``sched.memory_monitor is None`` at
   init. Our PR now constructs the monitor in estimator-only mode at
   init time; replace the assertion with the actual setup it needs
   (overwrite ``memory_monitor`` with a MagicMock to inspect
   ``set_model_info`` calls).
cfbraun added a commit to cfbraun/omlx that referenced this pull request Jun 3, 2026
The first CI run on this PR (https://github.com/jundot/omlx/actions/runs/26738412492)
surfaced three failures introduced by the rebase against upstream's
cb0904d (predictive prefill throttle) and our own change to
``Scheduler.__init__`` (constructs a MemoryMonitor in estimator-only
mode so preflight checks work without prior set_model_info):

1. ``_set_model_info_for_monitor`` accepted MagicMock-style proxies
   as ``head_dim`` / ``num_kv_heads`` / ``num_layers`` because the
   ``if num_layers and num_kv_heads and head_dim:`` truthiness gate
   doesn't reject auto-attribute mocks. The estimator path
   (``estimate_chunk_transient_bytes`` → ``if hd > 128``) then
   blew up the upstream throttle tests with
   ``TypeError: '>' not supported between instances of 'MagicMock'
   and 'int'``. Tighten to ``isinstance(..., int) and ... > 0`` so
   the estimator stays disabled until set_model_info is called with
   genuine model dims.

2. ``test_schedule_waiting_preflight_rejection_releases`` patched
   ``_preflight_memory_check`` to return a plain ``str`` (matching
   the pre-PR contract). Our PR types the return as
   ``_PreflightRejection`` and the rejection-path code in
   ``_schedule_waiting`` reads ``.message`` on it. Update the mock to
   return an actual ``_PreflightRejection`` so the test exercises the
   real shape.

3. ``test_picks_text_config_over_top_level_vision_dims`` (added by
   our merged jundot#1448) asserted ``sched.memory_monitor is None`` at
   init. Our PR now constructs the monitor in estimator-only mode at
   init time; replace the assertion with the actual setup it needs
   (overwrite ``memory_monitor`` with a MagicMock to inspect
   ``set_model_info`` calls).
cfbraun added a commit to cfbraun/omlx that referenced this pull request Jun 3, 2026
The first CI run on this PR (https://github.com/jundot/omlx/actions/runs/26738412492)
surfaced three failures introduced by the rebase against upstream's
cb0904d (predictive prefill throttle) and our own change to
``Scheduler.__init__`` (constructs a MemoryMonitor in estimator-only
mode so preflight checks work without prior set_model_info):

1. ``_set_model_info_for_monitor`` accepted MagicMock-style proxies
   as ``head_dim`` / ``num_kv_heads`` / ``num_layers`` because the
   ``if num_layers and num_kv_heads and head_dim:`` truthiness gate
   doesn't reject auto-attribute mocks. The estimator path
   (``estimate_chunk_transient_bytes`` → ``if hd > 128``) then
   blew up the upstream throttle tests with
   ``TypeError: '>' not supported between instances of 'MagicMock'
   and 'int'``. Tighten to ``isinstance(..., int) and ... > 0`` so
   the estimator stays disabled until set_model_info is called with
   genuine model dims.

2. ``test_schedule_waiting_preflight_rejection_releases`` patched
   ``_preflight_memory_check`` to return a plain ``str`` (matching
   the pre-PR contract). Our PR types the return as
   ``_PreflightRejection`` and the rejection-path code in
   ``_schedule_waiting`` reads ``.message`` on it. Update the mock to
   return an actual ``_PreflightRejection`` so the test exercises the
   real shape.

3. ``test_picks_text_config_over_top_level_vision_dims`` (added by
   our merged jundot#1448) asserted ``sched.memory_monitor is None`` at
   init. Our PR now constructs the monitor in estimator-only mode at
   init time; replace the assertion with the actual setup it needs
   (overwrite ``memory_monitor`` with a MagicMock to inspect
   ``set_model_info`` calls).
cfbraun added a commit to cfbraun/omlx that referenced this pull request Jun 5, 2026
The first CI run on this PR (https://github.com/jundot/omlx/actions/runs/26738412492)
surfaced three failures introduced by the rebase against upstream's
cb0904d (predictive prefill throttle) and our own change to
``Scheduler.__init__`` (constructs a MemoryMonitor in estimator-only
mode so preflight checks work without prior set_model_info):

1. ``_set_model_info_for_monitor`` accepted MagicMock-style proxies
   as ``head_dim`` / ``num_kv_heads`` / ``num_layers`` because the
   ``if num_layers and num_kv_heads and head_dim:`` truthiness gate
   doesn't reject auto-attribute mocks. The estimator path
   (``estimate_chunk_transient_bytes`` → ``if hd > 128``) then
   blew up the upstream throttle tests with
   ``TypeError: '>' not supported between instances of 'MagicMock'
   and 'int'``. Tighten to ``isinstance(..., int) and ... > 0`` so
   the estimator stays disabled until set_model_info is called with
   genuine model dims.

2. ``test_schedule_waiting_preflight_rejection_releases`` patched
   ``_preflight_memory_check`` to return a plain ``str`` (matching
   the pre-PR contract). Our PR types the return as
   ``_PreflightRejection`` and the rejection-path code in
   ``_schedule_waiting`` reads ``.message`` on it. Update the mock to
   return an actual ``_PreflightRejection`` so the test exercises the
   real shape.

3. ``test_picks_text_config_over_top_level_vision_dims`` (added by
   our merged jundot#1448) asserted ``sched.memory_monitor is None`` at
   init. Our PR now constructs the monitor in estimator-only mode at
   init time; replace the assertion with the actual setup it needs
   (overwrite ``memory_monitor`` with a MagicMock to inspect
   ``set_model_info`` calls).
cfbraun added a commit to cfbraun/omlx that referenced this pull request Jun 5, 2026
The first CI run on this PR (https://github.com/jundot/omlx/actions/runs/26738412492)
surfaced three failures introduced by the rebase against upstream's
cb0904d (predictive prefill throttle) and our own change to
``Scheduler.__init__`` (constructs a MemoryMonitor in estimator-only
mode so preflight checks work without prior set_model_info):

1. ``_set_model_info_for_monitor`` accepted MagicMock-style proxies
   as ``head_dim`` / ``num_kv_heads`` / ``num_layers`` because the
   ``if num_layers and num_kv_heads and head_dim:`` truthiness gate
   doesn't reject auto-attribute mocks. The estimator path
   (``estimate_chunk_transient_bytes`` → ``if hd > 128``) then
   blew up the upstream throttle tests with
   ``TypeError: '>' not supported between instances of 'MagicMock'
   and 'int'``. Tighten to ``isinstance(..., int) and ... > 0`` so
   the estimator stays disabled until set_model_info is called with
   genuine model dims.

2. ``test_schedule_waiting_preflight_rejection_releases`` patched
   ``_preflight_memory_check`` to return a plain ``str`` (matching
   the pre-PR contract). Our PR types the return as
   ``_PreflightRejection`` and the rejection-path code in
   ``_schedule_waiting`` reads ``.message`` on it. Update the mock to
   return an actual ``_PreflightRejection`` so the test exercises the
   real shape.

3. ``test_picks_text_config_over_top_level_vision_dims`` (added by
   our merged jundot#1448) asserted ``sched.memory_monitor is None`` at
   init. Our PR now constructs the monitor in estimator-only mode at
   init time; replace the assertion with the actual setup it needs
   (overwrite ``memory_monitor`` with a MagicMock to inspect
   ``set_model_info`` calls).
cfbraun added a commit to cfbraun/omlx that referenced this pull request Jun 6, 2026
The first CI run on this PR (https://github.com/jundot/omlx/actions/runs/26738412492)
surfaced three failures introduced by the rebase against upstream's
cb0904d (predictive prefill throttle) and our own change to
``Scheduler.__init__`` (constructs a MemoryMonitor in estimator-only
mode so preflight checks work without prior set_model_info):

1. ``_set_model_info_for_monitor`` accepted MagicMock-style proxies
   as ``head_dim`` / ``num_kv_heads`` / ``num_layers`` because the
   ``if num_layers and num_kv_heads and head_dim:`` truthiness gate
   doesn't reject auto-attribute mocks. The estimator path
   (``estimate_chunk_transient_bytes`` → ``if hd > 128``) then
   blew up the upstream throttle tests with
   ``TypeError: '>' not supported between instances of 'MagicMock'
   and 'int'``. Tighten to ``isinstance(..., int) and ... > 0`` so
   the estimator stays disabled until set_model_info is called with
   genuine model dims.

2. ``test_schedule_waiting_preflight_rejection_releases`` patched
   ``_preflight_memory_check`` to return a plain ``str`` (matching
   the pre-PR contract). Our PR types the return as
   ``_PreflightRejection`` and the rejection-path code in
   ``_schedule_waiting`` reads ``.message`` on it. Update the mock to
   return an actual ``_PreflightRejection`` so the test exercises the
   real shape.

3. ``test_picks_text_config_over_top_level_vision_dims`` (added by
   our merged jundot#1448) asserted ``sched.memory_monitor is None`` at
   init. Our PR now constructs the monitor in estimator-only mode at
   init time; replace the assertion with the actual setup it needs
   (overwrite ``memory_monitor`` with a MagicMock to inspect
   ``set_model_info`` calls).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants