Problem
The current --fit implementation in llama_params_fit_impl() (src/llama.cpp) probes device memory by repeatedly calling llama_get_device_memory_data(), which does a full llama_model_load_from_file() + llama_init_from_model() (with no_alloc=true) each iteration.
For large MoE models this becomes impractically slow:
- Qwen3-Coder-Next (80B-A3B): 512 experts × 48 layers × 3 tensors (up/down/gate) = ~73,000 expert tensors
- The binary search over 2 GPUs runs ~15-20 probes
- Each probe re-parses all 73K tensor headers from a 45GB GGUF file
- Result: fit appears to "hang" for 30+ minutes before any weights are loaded
Dense models with ~100-200 tensors probe quickly, but MoE models with 10K-70K+ tensors make the current approach infeasible.
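To make the scale concrete, the arithmetic behind the figures above can be written out. This is only a back-of-the-envelope sketch that takes the counts quoted in this report at face value; the function names are made up for illustration and are not part of llama.cpp.

```cpp
#include <cstdint>

// Expert tensor count, assuming each expert contributes its own
// up/down/gate tensors in every layer (as described in this report).
constexpr std::int64_t expert_tensor_count(std::int64_t n_experts,
                                           std::int64_t n_layers,
                                           std::int64_t tensors_per_expert) {
    return n_experts * n_layers * tensors_per_expert;
}

// Total tensor headers parsed over one fit run when every probe
// re-parses the whole file.
constexpr std::int64_t headers_parsed(std::int64_t n_tensors,
                                      std::int64_t n_probes) {
    return n_tensors * n_probes;
}

// 512 experts x 48 layers x 3 tensors = 73,728 expert tensors.
static_assert(expert_tensor_count(512, 48, 3) == 73728, "tensor count");
```

With 15 to 20 probes, that works out to roughly 1.1 to 1.5 million header parses before any weights are loaded.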
Proposed Solution
Calculate memory requirements analytically from tensor metadata loaded once, rather than doing repeated full model loads:
- Load model metadata once (GGUF header + tensor info, no weights)
- Build a per-layer memory map: for each layer, compute the size of dense tensors vs MoE expert tensors based on tensor shapes and quantization types
- Use this map to analytically compute memory usage for any (n_gpu_layers, n_cpu_moe, tensor_split) configuration without re-loading
- Run the same binary search / false position optimization, but against the analytical model instead of repeated llama_model_load_from_file calls
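The steps above could be sketched roughly as follows. This is a minimal illustration, not the llama.cpp implementation. Every type and function name here (QuantType, LayerMem, MemEstimate, tensor_bytes, estimate_memory) is hypothetical. It assumes that the last n_gpu_layers layers are offloaded, that expert tensors of the first n_cpu_moe layers stay on the host, and that offloaded layers are assigned to devices in contiguous blocks proportional to tensor_split. KV cache, compute buffers, and non-layer tensors are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of a quantization type:
// elements per block and bytes per block (e.g. 256 and 144 for Q4_K).
struct QuantType {
    std::int64_t block_elems;
    std::size_t  block_bytes;
};

// Size of one tensor from its element count and quantization type.
// Assumes the element count is a whole number of blocks.
inline std::size_t tensor_bytes(std::int64_t n_elems, QuantType qt) {
    assert(n_elems % qt.block_elems == 0);
    return static_cast<std::size_t>(n_elems / qt.block_elems) * qt.block_bytes;
}

// Per-layer entry of the memory map, filled once from tensor metadata.
struct LayerMem {
    std::size_t dense_bytes;   // attention, norms, router, ...
    std::size_t expert_bytes;  // MoE expert weights
};

struct MemEstimate {
    std::size_t host_bytes = 0;
    std::vector<std::size_t> device_bytes;  // one entry per GPU
};

// Pure arithmetic over the map; no file access and no model load.
inline MemEstimate estimate_memory(const std::vector<LayerMem>& layers,
                                   int n_gpu_layers, int n_cpu_moe,
                                   const std::vector<float>& tensor_split) {
    assert(!tensor_split.empty());
    const int n_layers = static_cast<int>(layers.size());
    assert(n_gpu_layers >= 0 && n_gpu_layers <= n_layers);

    float split_sum = 0.0f;
    for (float s : tensor_split) split_sum += s;
    assert(split_sum > 0.0f);

    MemEstimate est;
    est.device_bytes.assign(tensor_split.size(), 0);

    const int first_gpu_layer = n_layers - n_gpu_layers;
    for (int il = 0; il < n_layers; ++il) {
        const LayerMem& lm = layers[il];
        if (il < first_gpu_layer) {  // layer is not offloaded at all
            est.host_bytes += lm.dense_bytes + lm.expert_bytes;
            continue;
        }
        // Map the layer's relative position in the offloaded range
        // onto the cumulative tensor_split fractions.
        const float pos = (il - first_gpu_layer + 0.5f) / n_gpu_layers;
        std::size_t dev = 0;
        float cum = 0.0f;
        for (; dev + 1 < tensor_split.size(); ++dev) {
            cum += tensor_split[dev] / split_sum;
            if (pos < cum) break;
        }
        est.device_bytes[dev] += lm.dense_bytes;
        if (il < n_cpu_moe) {
            est.host_bytes += lm.expert_bytes;  // experts kept on CPU
        } else {
            est.device_bytes[dev] += lm.expert_bytes;
        }
    }
    return est;
}
```

Because estimate_memory only touches the in-memory map, it can be called once per probe at negligible cost; the real placement rules in llama.cpp may differ from the simplified ones assumed here.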
This would reduce fit time from O(probes × model_load_time) to O(1 × model_load_time + probes × arithmetic), making it viable for even the largest MoE models.
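The search side of that cost can be illustrated with a plain bisection over a cheap predicate. This is a sketch under assumptions: max_fitting and its parameters are hypothetical names, the predicate stands in for a call into the analytical model, and it assumes fitting is monotone in the searched parameter. It uses simple bisection rather than the false position variant mentioned above.

```cpp
#include <cassert>
#include <functional>

// Returns the largest n in [0, n_max] for which fits(n) is true,
// assuming fits is true up to some threshold and false beyond it.
// Returns -1 if nothing fits. Optionally reports the probe count.
inline int max_fitting(int n_max, const std::function<bool(int)>& fits,
                       int* n_probes = nullptr) {
    int lo = -1;         // largest value known to fit (-1: none yet)
    int hi = n_max + 1;  // smallest value known not to fit (sentinel)
    int probes = 0;
    while (hi - lo > 1) {
        const int mid = lo + (hi - lo) / 2;
        ++probes;  // each probe is arithmetic only, not a model load
        if (fits(mid)) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    if (n_probes != nullptr) *n_probes = probes;
    return lo;
}
```

For a 48-layer model this needs at most 6 predicate evaluations per searched parameter, each of which is a handful of additions instead of a full metadata parse.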
Current Workaround
Set n-gpu-layers and n-cpu-moe explicitly in the --models-preset .ini file. This makes fit exit early without probing (lines 327-328: "n_gpu_layers already set by user").
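As a rough illustration, a preset entry for this workaround might look like the fragment below. The section name and both numeric values are placeholders rather than recommendations. Only the two key names come from this report, and the exact preset syntax should be checked against the server documentation.

```ini
; Hypothetical preset entry: section name and values are placeholders.
[qwen3-coder-next]
n-gpu-layers = 48
n-cpu-moe = 24
```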
Environment
- 2x RTX 3090 (48GB total VRAM)
- Model: Qwen3-Coder-Next Q4_K_M (45GB, 512 experts, 48 layers)
- ht branch
🤖 Generated with Claude Code