feat: analytical fit for large MoE models (avoid repeated model loads)

## Problem

The current `--fit` implementation in `llama_params_fit_impl()` (src/llama.cpp) probes device memory by repeatedly calling `llama_get_device_memory_data()`, which does a full `llama_model_load_from_file()` + `llama_init_from_model()` (with `no_alloc=true`) each iteration.

For large MoE models this becomes impractically slow:
- **Qwen3-Coder-Next (80B-A3B)**: 512 experts × 48 layers × 3 tensors (up/down/gate) = 73,728 (~73K) expert tensors
- The binary search over 2 GPUs runs ~15-20 probes
- Each probe re-parses all 73K tensor headers from a 45GB GGUF file
- Result: fit appears to "hang" for 30+ minutes before any weights are loaded

Dense models with ~100-200 tensors probe quickly, but MoE models with 10K-70K+ tensors make the current approach infeasible.
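To make the scale concrete, here is a quick back-of-the-envelope calculation using only the figures quoted above (expert count, layer count, and the 15-20 probe range). It counts expert tensor headers only and says nothing about actual parse time per header, so treat it as a rough illustration rather than a measurement.

```python
# Figures quoted in this issue for Qwen3-Coder-Next (80B-A3B).
N_EXPERTS = 512
N_LAYERS = 48
TENSORS_PER_EXPERT = 3  # up, down, gate

expert_tensors = N_EXPERTS * N_LAYERS * TENSORS_PER_EXPERT


def headers_parsed(n_probes: int) -> int:
    """Expert tensor headers re-parsed across all probes of one fit run."""
    return expert_tensors * n_probes


print(expert_tensors)      # 73728
print(headers_parsed(15))  # 1105920
print(headers_parsed(20))  # 1474560
```

With one load instead of 15-20, the same fit would touch each header once.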

## Proposed Solution

Calculate memory requirements analytically from tensor metadata loaded **once**, rather than doing repeated full model loads:

1. Load model metadata once (GGUF header + tensor info, no weights)
2. Build a per-layer memory map: for each layer, compute the size of dense tensors vs MoE expert tensors based on tensor shapes and quantization types
3. Use this map to analytically compute memory usage for any `(n_gpu_layers, n_cpu_moe, tensor_split)` configuration without re-loading
4. Run the same binary search / false position optimization, but against the analytical model instead of repeated `llama_model_load_from_file` calls
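Steps 2 and 3 could look roughly like the sketch below. It is written in Python for brevity and is not the proposed C++ implementation. All names (`tensor_bytes`, `LayerMem`, `build_layer_map`, `device_usage`) are hypothetical. It assumes that `n_gpu_layers` offloads the last N layers, that `n_cpu_moe` keeps the expert tensors of the first N layers on the CPU, and that `tensor_split` assigns offloaded layers to devices by cumulative proportion. KV cache, compute buffers, and the embedding/output tensors are left out; a real implementation would need to account for them.

```python
from dataclasses import dataclass

# (elements per block, bytes per block) for a few GGML types.
# Deliberately incomplete; a real implementation would query ggml.
GGML_BLOCK = {
    "F32": (1, 4),
    "F16": (1, 2),
    "Q8_0": (32, 34),
    "Q4_K": (256, 144),
    "Q6_K": (256, 210),
}


def tensor_bytes(shape, ggml_type):
    """Size of one tensor computed from its shape and quantization type."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    block_elems, block_bytes = GGML_BLOCK[ggml_type]
    return n_elements // block_elems * block_bytes


@dataclass
class LayerMem:
    dense_bytes: int = 0   # non-expert tensors of the layer
    expert_bytes: int = 0  # MoE expert tensors of the layer


def build_layer_map(tensors, n_layers):
    """tensors: iterable of (layer_index, is_expert, shape, ggml_type)."""
    layers = [LayerMem() for _ in range(n_layers)]
    for layer, is_expert, shape, ggml_type in tensors:
        size = tensor_bytes(shape, ggml_type)
        if is_expert:
            layers[layer].expert_bytes += size
        else:
            layers[layer].dense_bytes += size
    return layers


def device_usage(layers, n_gpu_layers, n_cpu_moe, tensor_split):
    """Weight bytes per device for one configuration, without any model load."""
    n_layers = len(layers)
    n_gpu_layers = min(n_gpu_layers, n_layers)
    first_gpu = n_layers - n_gpu_layers  # assumption: last N layers go to GPU
    total = sum(tensor_split)
    usage = [0] * len(tensor_split)
    for i in range(first_gpu, n_layers):
        # Pick the first device whose cumulative share covers this layer.
        cumulative, device = 0, len(tensor_split) - 1
        for d, share in enumerate(tensor_split):
            cumulative += share
            if (i - first_gpu) * total < cumulative * n_gpu_layers:
                device = d
                break
        usage[device] += layers[i].dense_bytes
        if i >= n_cpu_moe:  # assumption: first n_cpu_moe layers keep experts on CPU
            usage[device] += layers[i].expert_bytes
    return usage


# Toy model: 4 identical layers, 10 bytes dense and 100 bytes of experts each.
toy = [LayerMem(dense_bytes=10, expert_bytes=100) for _ in range(4)]
print(device_usage(toy, n_gpu_layers=4, n_cpu_moe=2, tensor_split=[1, 1]))  # [20, 220]
```

Once the layer map is built, evaluating a candidate configuration is a loop over at most `n_layers` entries, independent of how many tensors the file contains.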

This would reduce fit time from O(probes × model_load_time) to O(model_load_time + probes × arithmetic): the metadata is loaded once and each probe becomes a cheap calculation, making fit viable even for the largest MoE models.
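As an illustration of why the probe count stops mattering, the sketch below runs a plain bisection (standing in for the binary search / false position step; it is not the existing search code) against an arithmetic `fits` predicate. The function name `largest_fitting`, the 900 MiB per-layer size, and the 24 GiB budget are made-up values for the example, not measurements of the model in this issue.

```python
def largest_fitting(n_max, fits):
    """Largest n in [0, n_max] for which fits(n) is true.

    Assumes fits is monotone (true up to some threshold, false beyond it)
    and that fits(0) is true. Returns (n, number_of_probes).
    """
    lo, hi = 0, n_max
    probes = 0
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        probes += 1
        if fits(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo, probes


MIB = 1024 * 1024
LAYER_BYTES = 900 * MIB     # hypothetical per-layer weight size
BUDGET_BYTES = 24 * 1024 * MIB  # hypothetical single-GPU budget


def fits(n_gpu_layers):
    # Each probe is one multiplication and one comparison, no file access.
    return n_gpu_layers * LAYER_BYTES <= BUDGET_BYTES


print(largest_fitting(48, fits))  # (27, 5)
```

In a real implementation `fits` would evaluate the per-layer memory map for the candidate configuration against each device's free memory, but the cost per probe would stay in the same order of magnitude.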

## Current Workaround

Set `n-gpu-layers` and `n-cpu-moe` explicitly in the `--models-preset` .ini file. This makes fit abort early (lines 327-328: "n_gpu_layers already set by user"), so no probing takes place.

## Environment

- 2x RTX 3090 (48GB total VRAM)
- Model: Qwen3-Coder-Next Q4_K_M (45GB, 512 experts, 48 layers)
- ht branch

🤖 Generated with [Claude Code](https://claude.com/claude-code)
