feat(fit): expose per-device byte plan from common_fit_params (#66 step 1) by marksverdhei · Pull Request #69 · heiervang-technologies/ht-llama.cpp

marksverdhei · 2026-06-04T18:26:43Z

Issue #66 step 1 — foundation for per-GPU-aware router fit.

What

Adds an optional out_bytes_per_device output parameter to common_fit_params (defaults to nullptr so existing callers are unaffected). On SUCCESS, populates with the projected per-device byte demand for the resolved plan.

std::vector<int64_t> plan;
common_fit_params(path, &mparams, &cparams, ts, tbo, margins, n_ctx_min,
                  GGML_LOG_LEVEL_INFO, &plan);
// plan[0] = CPU bytes, plan[1..N] = GPU/accel bytes

Why

The router currently admits candidates against TOTAL VRAM via common_fit_params + --models-max. Two models pinned to CUDA0 (tensor-split=100,0) collide on GPU0's 24 GiB even though the 48 GiB pool says they fit — see issue #66 for the exact prod signature and snoop-kube's deterministic repro.

How

Both common_params_fit_impl exit paths (the early MoE-trivial path at the previous line 619, and the full-search path at line 763) populate the output from the mem vector that fit_impl already computes. No behavior change for nullptr callers.

 static void common_params_fit_impl(
         const char * path_model, struct llama_model_params * mparams, struct llama_context_params * cparams,
         float * tensor_split, struct llama_model_tensor_buft_override * tensor_buft_overrides,
-        size_t * margins_s, uint32_t n_ctx_min, enum ggml_log_level log_level) {
+        size_t * margins_s, uint32_t n_ctx_min, enum ggml_log_level log_level,
+        std::vector<int64_t> * out_bytes_per_device) {

What this is NOT

This PR doesn't wire the output anywhere — it's the foundation. Subsequent PRs (each separate, each independently shippable):

Step 2: track reserved_per_device[] in server_models, updated on load (+) and unload (-).
Step 3: replace count-only unload_lru() with unload_lru_for_devices(targeted, candidate_per_device).
Step 4: wire admit decision against free_device[d] - reserved[d] per targeted device.

Verified

✅ cmake -B build -DGGML_CPU=ON -DLLAMA_BUILD_APP=ON && cmake --build build --target llama-server succeeds.
✅ No call sites needed updates (default-nullptr parameter).
✅ Master sync friction: one new parameter, default nullptr — trivial to resolve if upstream changes the signature.

Test plan after merge

Snoop-kube has a deterministic repro: load devstral-24b (CUDA0), then request gemma-4-31b-iq4 (also CUDA0). Today this OOMs at the second admit. After step 2-4 land + a canary roll, the same sequence should: (a) succeed if there's room after evictions, or (b) cleanly 503 with model_unavailable_error if not — never OOM.

…ep 1) Adds an optional `out_bytes_per_device` output parameter to `common_fit_params` (defaulting to nullptr so all existing callers are unaffected). On SUCCESS, populates with the projected per-device byte demand for the resolved plan — index 0 is the CPU device, indices 1..N are GPU/accel devices in the same order as `tensor_split`. This is the foundation for issue #66 (per-GPU-aware router fit). The router currently admits candidates against the TOTAL VRAM pool via `common_fit_params` + `--models-max`, ignoring that two models pinned to CUDA0 collide on GPU0's 24 GiB even though the 48 GiB pool says fits. Subsequent steps (separate PRs): * (2) Track `reserved_per_device[]` in `server_models`, populated from this output on each load and decremented on unload. * (3) Replace count-only `unload_lru()` with `unload_lru_for_devices(targeted, candidate_per_device)` that LRU-evicts from the constrained device set. * (4) Wire admit decision against `free_device[d] - reserved[d]` per targeted device. No behavior change in this PR — the new output is opt-in and the existing planning logic is untouched. Both fit-impl return paths (the early MoE-trivial path and the full-search path) populate `out_bytes_per_device` identically with the final `mem` vector that fit_impl already computes internally.

…#66 step 2 prep) The router's per-GPU admit decision (#66) needs the per-device byte demand for a candidate model BEFORE spawning the child subprocess. PR #69 added the underlying `out_bytes_per_device` output to `common_fit_params`; this PR exposes it via the existing `tools/fit-params` CLI as a subprocess-friendly JSON output. * New CLI flag `--fit-print-plan` (env `LLAMA_ARG_FIT_PRINT_PLAN`). * On success, prints a single-line JSON object on stdout: {"per_device_bytes":[N0,N1,...],"n_devices":K,"total_bytes":T} plan[i] = i-th GPU/accel device, same order as tensor_split; CPU host memory NOT included. Empty plan for CPU-only builds. * On fit failure, emits an explicit JSON failure marker and exits 1: {"error":"fit_failed","status":N} * common/fit.cpp: populate `out_bytes_per_device` at the three early- return paths (the impl had three 'no changes needed' fast-paths that bypassed the main return point where PR #69 wrote the plan). Doc string in common/fit.h corrected — plan covers GPU devices only. Designed to be subprocessed from `server_models::compute_admit_plan(name)` (#66 step 2 — out-of-process approach per the architectural call on issue #66 / task ggml-org#123). The router parses this JSON, tracks `reserved[d]` for in-flight LOADING models, admits candidates against `live_cudaMemGetInfo(d) - reserved[d]`. Mutually exclusive with the existing `--fit-print` mode; if both set, `--fit-print-plan` wins. Local CPU build verified: `--help` renders the new flag, empty plan returned for CPU-only build as expected. GPU verification deferred to snoop-kube's canary-cycle.

…#66 step 2 prep) (#72) * feat(fit-params): --fit-print-plan emits per-device byte plan as JSON (#66 step 2 prep) The router's per-GPU admit decision (#66) needs the per-device byte demand for a candidate model BEFORE spawning the child subprocess. PR #69 added the underlying `out_bytes_per_device` output to `common_fit_params`; this PR exposes it via the existing `tools/fit-params` CLI as a subprocess-friendly JSON output. * New CLI flag `--fit-print-plan` (env `LLAMA_ARG_FIT_PRINT_PLAN`). * On success, prints a single-line JSON object on stdout: {"per_device_bytes":[N0,N1,...],"n_devices":K,"total_bytes":T} plan[i] = i-th GPU/accel device, same order as tensor_split; CPU host memory NOT included. Empty plan for CPU-only builds. * On fit failure, emits an explicit JSON failure marker and exits 1: {"error":"fit_failed","status":N} * common/fit.cpp: populate `out_bytes_per_device` at the three early- return paths (the impl had three 'no changes needed' fast-paths that bypassed the main return point where PR #69 wrote the plan). Doc string in common/fit.h corrected — plan covers GPU devices only. Designed to be subprocessed from `server_models::compute_admit_plan(name)` (#66 step 2 — out-of-process approach per the architectural call on issue #66 / task ggml-org#123). The router parses this JSON, tracks `reserved[d]` for in-flight LOADING models, admits candidates against `live_cudaMemGetInfo(d) - reserved[d]`. Mutually exclusive with the existing `--fit-print` mode; if both set, `--fit-print-plan` wins. Local CPU build verified: `--help` renders the new flag, empty plan returned for CPU-only build as expected. GPU verification deferred to snoop-kube's canary-cycle. * test(fit-params): smoke for --fit-print-plan JSON output (PR #72 coverage) (#75) PR #72 added the --fit-print-plan flag to llama-fit-params without test coverage. This adds a tools/fit-params/tests.sh (pattern lifted from tools/gguf-split/tests.sh) that downloads a small Qwen3-0.6B GGUF and verifies six invariants: 1. success-path emits single-line JSON 2. schema has per_device_bytes / n_devices / total_bytes with correct types 3. len(per_device_bytes) == n_devices 4. total_bytes == sum(per_device_bytes) 5. on CPU-only builds (n_devices==0): plan is empty, total is 0 6. fit-failure (nonexistent model) emits the documented "error":"fit_failed" JSON marker on stdout (not garbage) so subprocess callers can distinguish fit-failure from parse-failure Run with: tools/fit-params/tests.sh path/to/build/bin Verified locally on CPU-only build: ALL fit-params --fit-print-plan smoke tests PASSED.

marksverdhei merged commit 922ee31 into ht Jun 12, 2026
3 of 7 checks passed

marksverdhei deleted the feat/fit-per-device-plan branch June 12, 2026 18:36

marksverdhei mentioned this pull request Jun 12, 2026

docs(readme): complete HT Fork Changes inventory with per-change justifications #106

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(fit): expose per-device byte plan from common_fit_params (#66 step 1)#69

feat(fit): expose per-device byte plan from common_fit_params (#66 step 1)#69
marksverdhei merged 1 commit into
htfrom
feat/fit-per-device-plan

marksverdhei commented Jun 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

marksverdhei commented Jun 4, 2026

What

Why

How

What this is NOT

Verified

Test plan after merge

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant