feat: analytical fit for large MoE models (avoid repeated model loads) #11

@marksverdhei

Problem

The current --fit implementation in llama_params_fit_impl() (src/llama.cpp) probes device memory by repeatedly calling llama_get_device_memory_data(), which does a full llama_model_load_from_file() + llama_init_from_model() (with no_alloc=true) each iteration.

For large MoE models this becomes impractically slow:

  • Qwen3-Coder-Next (80B-A3B): 512 experts × 48 layers × 3 tensors (up/down/gate) = 73,728 (~73K) expert tensors
  • The binary search over 2 GPUs runs ~15-20 probes
  • Each probe re-parses all 73K tensor headers from a 45GB GGUF file
  • Result: fit appears to "hang" for 30+ minutes before any weights are loaded

Dense models with ~100-200 tensors probe quickly, but MoE models with 10K-70K+ tensors make the current approach infeasible.
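To make the scale concrete, here is a back-of-the-envelope sketch of how many tensor headers get re-parsed across a full fit run. It uses only the figures quoted above (512 experts, 48 layers, 3 tensors per expert, 15-20 probes); the function names are hypothetical, and non-expert tensors are ignored because they are negligible by comparison.

```python
def expert_tensor_count(n_experts: int, n_layers: int, tensors_per_expert: int = 3) -> int:
    """Number of expert tensors, assuming one tensor per (expert, layer, projection)."""
    return n_experts * n_layers * tensors_per_expert


def headers_parsed(n_tensors: int, n_probes: int) -> int:
    """Total tensor headers parsed if every probe re-reads all of them."""
    return n_tensors * n_probes


n_tensors = expert_tensor_count(n_experts=512, n_layers=48)
low = headers_parsed(n_tensors, n_probes=15)
high = headers_parsed(n_tensors, n_probes=20)

print(f"expert tensors: {n_tensors}")
print(f"headers parsed over a fit run: {low} to {high}")
```

Under these assumptions a single fit run parses well over a million tensor headers, versus a few thousand for a dense model with ~100-200 tensors.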

Proposed Solution

Calculate memory requirements analytically from tensor metadata loaded once, rather than doing repeated full model loads:

  1. Load model metadata once (GGUF header + tensor info, no weights)
  2. Build a per-layer memory map: for each layer, compute the size of dense tensors vs MoE expert tensors based on tensor shapes and quantization types
  3. Use this map to analytically compute memory usage for any (n_gpu_layers, n_cpu_moe, tensor_split) configuration without re-loading
  4. Run the same binary search / false position optimization, but against the analytical model instead of repeated llama_model_load_from_file calls
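A minimal sketch of steps 2 and 3, written in Python for brevity rather than as a patch against src/llama.cpp. Everything here is an assumption for illustration: the `LayerSize` type, the `device_usage` function, the rule that the last `n_gpu_layers` layers are offloaded, the rule that `n_cpu_moe` keeps the expert tensors of the first N layers on the CPU, and the proportional mapping of layers to devices. KV cache, compute buffers, and non-layer tensors (embeddings, output) are left out.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class LayerSize:
    """Hypothetical per-layer entry of the memory map, built once from tensor metadata."""
    dense_bytes: int   # attention, norms, router, shared tensors
    expert_bytes: int  # all MoE expert tensors of this layer


def device_usage(layers: Sequence[LayerSize], n_gpu_layers: int,
                 n_cpu_moe: int, tensor_split: Sequence[float]) -> List[int]:
    """Estimate weight bytes per GPU for one configuration, with no model load."""
    n_layers = len(layers)
    n_gpu_layers = max(0, min(n_gpu_layers, n_layers))
    first_gpu_layer = n_layers - n_gpu_layers  # assumption: the last layers are offloaded

    # Cumulative upper bounds of each device's share, normalised to [0, 1].
    total = float(sum(tensor_split))
    bounds, acc = [], 0.0
    for share in tensor_split:
        acc += share / total
        bounds.append(acc)

    usage = [0] * len(tensor_split)
    for il in range(first_gpu_layer, n_layers):
        frac = (il - first_gpu_layer) / n_gpu_layers
        # First device whose cumulative share covers this layer; the default
        # guards against floating-point rounding in the last bound.
        dev = next((i for i, b in enumerate(bounds) if frac < b), len(bounds) - 1)
        usage[dev] += layers[il].dense_bytes
        if il >= n_cpu_moe:  # assumption: experts of the first n_cpu_moe layers stay on CPU
            usage[dev] += layers[il].expert_bytes
    return usage


# Toy example: 4 identical layers, 2 GPUs with an even split.
toy = [LayerSize(dense_bytes=10, expert_bytes=100)] * 4
print(device_usage(toy, n_gpu_layers=4, n_cpu_moe=1, tensor_split=[1, 1]))
```

Each call is a handful of additions, so the optimizer can evaluate as many configurations as it likes once the map exists.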

This would reduce fit time from O(probes × model_load_time) to O(1 × model_load_time + probes × arithmetic), making it viable for even the largest MoE models.
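The "probes × arithmetic" term can be sketched as follows. This is a simplified, single-device illustration using plain bisection (not the false-position variant, and not the existing llama_params_fit_impl() logic); the function name, the "last n layers are offloaded" rule, and the byte values are all hypothetical.

```python
from typing import List, Sequence, Tuple


def max_layers_that_fit(layer_bytes: Sequence[int], budget_bytes: int) -> Tuple[int, int]:
    """Return (largest n such that the last n layers fit in the budget, probe count)."""
    n_layers = len(layer_bytes)

    # suffix[n] = total bytes of the last n layers, computed once up front.
    suffix: List[int] = [0] * (n_layers + 1)
    for n in range(1, n_layers + 1):
        suffix[n] = suffix[n - 1] + layer_bytes[n_layers - n]

    lo, hi, probes = 0, n_layers, 0
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        probes += 1
        if suffix[mid] <= budget_bytes:  # one probe = one table lookup
            lo = mid
        else:
            hi = mid - 1
    return lo, probes


# Toy example: 48 layers of 100 bytes each against a 2450-byte budget.
n_fit, n_probes = max_layers_that_fit([100] * 48, budget_bytes=2450)
print(n_fit, n_probes)
```

The number of probes is unchanged from the current approach; what changes is that each probe costs a comparison instead of a full model load.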

Current Workaround

Set n-gpu-layers and n-cpu-moe explicitly in the --models-preset .ini file, which causes fit to abort early (lines 327-328: "n_gpu_layers already set by user").

Environment

  • 2x RTX 3090 (48GB total VRAM)
  • Model: Qwen3-Coder-Next Q4_K_M (45GB, 512 experts, 48 layers)
  • ht branch

🤖 Generated with Claude Code

Labels: enhancement