Environment:
macOS, Apple M5 Max (applegpu_g17s), 36GB unified memory
mlx 0.31.2, mlx-lm 0.31.3, mlx-metal 0.31.2
Model: mlx-community/Qwen3.5-9B-4bit
What happens:
Validation (forward pass) completes successfully. The first training iteration crashes immediately with:
[METAL] Command buffer execution failed: Insufficient Memory
(00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
This happens regardless of batch size (1 or 2), sequence length (2048–8192), number of LoRA layers (4–16), or whether --grad-checkpoint is set. Total system memory usage is low at the time of crash.
Workaround:
Switching to mlx-community/Qwen3-8B-4bit (same architecture family, previous generation) trains successfully with identical settings. This suggests the issue is specific to Qwen3.5's architecture changes in this mlx version.
Reproduce:
mlx_lm lora \
  --model mlx-community/Qwen3.5-9B-4bit \
  --train --data data/ \
  --batch-size 1 --num-layers 4 \
  --max-seq-length 2048 \
  --grad-checkpoint --val-batches 0
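To retest the combinations listed under "What happens", a small dry-run script along these lines may help. It only prints the commands and does not run them. This is a sketch: the function name sweep_commands is hypothetical, and the intermediate values (4096 for sequence length, 8 for LoRA layers) are assumptions, since only the endpoints of those ranges are given above.

```shell
#!/usr/bin/env bash
# Dry run: prints one mlx_lm command per setting combination; nothing is executed.
sweep_commands() {
  local model="mlx-community/Qwen3.5-9B-4bit"
  local bs len layers ckpt
  for bs in 1 2; do
    for len in 2048 4096 8192; do          # 4096 is an assumed midpoint
      for layers in 4 8 16; do             # 8 is an assumed midpoint
        for ckpt in "--grad-checkpoint" ""; do
          # Same flags as the reproduce command, with the swept values filled in.
          echo "mlx_lm lora --model $model --train --data data/" \
               "--batch-size $bs --num-layers $layers" \
               "--max-seq-length $len --val-batches 0 $ckpt"
        done
      done
    done
  done
}

sweep_commands
```

Each printed line can be pasted into a terminal on its own, which makes it easy to record which combinations hit the Metal out-of-memory error and which (if any) do not.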