Skip to content

feat(fit): expose per-device byte plan from common_fit_params (#66 step 1)#69

Merged
marksverdhei merged 1 commit into
htfrom
feat/fit-per-device-plan
Jun 12, 2026
Merged

feat(fit): expose per-device byte plan from common_fit_params (#66 step 1)#69
marksverdhei merged 1 commit into
htfrom
feat/fit-per-device-plan

Conversation

@marksverdhei

Copy link
Copy Markdown

Issue #66 step 1 — foundation for per-GPU-aware router fit.

What

Adds an optional out_bytes_per_device output parameter to common_fit_params (defaults to nullptr so existing callers are unaffected). On SUCCESS, populates with the projected per-device byte demand for the resolved plan.

std::vector<int64_t> plan;
common_fit_params(path, &mparams, &cparams, ts, tbo, margins, n_ctx_min,
                  GGML_LOG_LEVEL_INFO, &plan);
// plan[0] = CPU bytes, plan[1..N] = GPU/accel bytes

Why

The router currently admits candidates against TOTAL VRAM via common_fit_params + --models-max. Two models pinned to CUDA0 (tensor-split=100,0) collide on GPU0's 24 GiB even though the 48 GiB pool says they fit — see issue #66 for the exact prod signature and snoop-kube's deterministic repro.

How

Both common_params_fit_impl exit paths (the early MoE-trivial path at the previous line 619, and the full-search path at line 763) populate the output from the mem vector that fit_impl already computes. No behavior change for nullptr callers.

 static void common_params_fit_impl(
         const char * path_model, struct llama_model_params * mparams, struct llama_context_params * cparams,
         float * tensor_split, struct llama_model_tensor_buft_override * tensor_buft_overrides,
-        size_t * margins_s, uint32_t n_ctx_min, enum ggml_log_level log_level) {
+        size_t * margins_s, uint32_t n_ctx_min, enum ggml_log_level log_level,
+        std::vector<int64_t> * out_bytes_per_device) {

What this is NOT

This PR doesn't wire the output anywhere — it's the foundation. Subsequent PRs (each separate, each independently shippable):

  • Step 2: track reserved_per_device[] in server_models, updated on load (+) and unload (-).
  • Step 3: replace count-only unload_lru() with unload_lru_for_devices(targeted, candidate_per_device).
  • Step 4: wire admit decision against free_device[d] - reserved[d] per targeted device.

Verified

  • cmake -B build -DGGML_CPU=ON -DLLAMA_BUILD_APP=ON && cmake --build build --target llama-server succeeds.
  • ✅ No call sites needed updates (default-nullptr parameter).
  • ✅ Master sync friction: one new parameter, default nullptr — trivial to resolve if upstream changes the signature.

Test plan after merge

Snoop-kube has a deterministic repro: load devstral-24b (CUDA0), then request gemma-4-31b-iq4 (also CUDA0). Today this OOMs at the second admit. After step 2-4 land + a canary roll, the same sequence should: (a) succeed if there's room after evictions, or (b) cleanly 503 with model_unavailable_error if not — never OOM.

…ep 1)

Adds an optional `out_bytes_per_device` output parameter to
`common_fit_params` (defaulting to nullptr so all existing callers are
unaffected). On SUCCESS, populates with the projected per-device byte
demand for the resolved plan — index 0 is the CPU device, indices 1..N
are GPU/accel devices in the same order as `tensor_split`.

This is the foundation for issue #66 (per-GPU-aware router fit). The
router currently admits candidates against the TOTAL VRAM pool via
`common_fit_params` + `--models-max`, ignoring that two models pinned
to CUDA0 collide on GPU0's 24 GiB even though the 48 GiB pool says fits.

Subsequent steps (separate PRs):
* (2) Track `reserved_per_device[]` in `server_models`, populated from
      this output on each load and decremented on unload.
* (3) Replace count-only `unload_lru()` with
      `unload_lru_for_devices(targeted, candidate_per_device)` that
      LRU-evicts from the constrained device set.
* (4) Wire admit decision against `free_device[d] - reserved[d]` per
      targeted device.

No behavior change in this PR — the new output is opt-in and the
existing planning logic is untouched. Both fit-impl return paths
(the early MoE-trivial path and the full-search path) populate
`out_bytes_per_device` identically with the final `mem` vector that
fit_impl already computes internally.
@marksverdhei marksverdhei merged commit 922ee31 into ht Jun 12, 2026
3 of 7 checks passed
@marksverdhei marksverdhei deleted the feat/fit-per-device-plan branch June 12, 2026 18:36
marksverdhei added a commit that referenced this pull request Jun 12, 2026
…#66 step 2 prep)

The router's per-GPU admit decision (#66) needs the per-device byte
demand for a candidate model BEFORE spawning the child subprocess.
PR #69 added the underlying `out_bytes_per_device` output to
`common_fit_params`; this PR exposes it via the existing
`tools/fit-params` CLI as a subprocess-friendly JSON output.

* New CLI flag `--fit-print-plan` (env `LLAMA_ARG_FIT_PRINT_PLAN`).
* On success, prints a single-line JSON object on stdout:
    {"per_device_bytes":[N0,N1,...],"n_devices":K,"total_bytes":T}
  plan[i] = i-th GPU/accel device, same order as tensor_split; CPU
  host memory NOT included. Empty plan for CPU-only builds.
* On fit failure, emits an explicit JSON failure marker and exits 1:
    {"error":"fit_failed","status":N}
* common/fit.cpp: populate `out_bytes_per_device` at the three early-
  return paths (the impl had three 'no changes needed' fast-paths that
  bypassed the main return point where PR #69 wrote the plan). Doc
  string in common/fit.h corrected — plan covers GPU devices only.

Designed to be subprocessed from `server_models::compute_admit_plan(name)`
(#66 step 2 — out-of-process approach per the architectural call on
issue #66 / task ggml-org#123). The router parses this JSON, tracks
`reserved[d]` for in-flight LOADING models, admits candidates against
`live_cudaMemGetInfo(d) - reserved[d]`. Mutually exclusive with the
existing `--fit-print` mode; if both set, `--fit-print-plan` wins.

Local CPU build verified: `--help` renders the new flag, empty plan
returned for CPU-only build as expected. GPU verification deferred to
snoop-kube's canary-cycle.
marksverdhei added a commit that referenced this pull request Jun 12, 2026
…#66 step 2 prep) (#72)

* feat(fit-params): --fit-print-plan emits per-device byte plan as JSON (#66 step 2 prep)

The router's per-GPU admit decision (#66) needs the per-device byte
demand for a candidate model BEFORE spawning the child subprocess.
PR #69 added the underlying `out_bytes_per_device` output to
`common_fit_params`; this PR exposes it via the existing
`tools/fit-params` CLI as a subprocess-friendly JSON output.

* New CLI flag `--fit-print-plan` (env `LLAMA_ARG_FIT_PRINT_PLAN`).
* On success, prints a single-line JSON object on stdout:
    {"per_device_bytes":[N0,N1,...],"n_devices":K,"total_bytes":T}
  plan[i] = i-th GPU/accel device, same order as tensor_split; CPU
  host memory NOT included. Empty plan for CPU-only builds.
* On fit failure, emits an explicit JSON failure marker and exits 1:
    {"error":"fit_failed","status":N}
* common/fit.cpp: populate `out_bytes_per_device` at the three early-
  return paths (the impl had three 'no changes needed' fast-paths that
  bypassed the main return point where PR #69 wrote the plan). Doc
  string in common/fit.h corrected — plan covers GPU devices only.

Designed to be subprocessed from `server_models::compute_admit_plan(name)`
(#66 step 2 — out-of-process approach per the architectural call on
issue #66 / task ggml-org#123). The router parses this JSON, tracks
`reserved[d]` for in-flight LOADING models, admits candidates against
`live_cudaMemGetInfo(d) - reserved[d]`. Mutually exclusive with the
existing `--fit-print` mode; if both set, `--fit-print-plan` wins.

Local CPU build verified: `--help` renders the new flag, empty plan
returned for CPU-only build as expected. GPU verification deferred to
snoop-kube's canary-cycle.

* test(fit-params): smoke for --fit-print-plan JSON output (PR #72 coverage) (#75)

PR #72 added the --fit-print-plan flag to llama-fit-params without test
coverage. This adds a tools/fit-params/tests.sh (pattern lifted from
tools/gguf-split/tests.sh) that downloads a small Qwen3-0.6B GGUF and
verifies six invariants:

1. success-path emits single-line JSON
2. schema has per_device_bytes / n_devices / total_bytes with correct types
3. len(per_device_bytes) == n_devices
4. total_bytes == sum(per_device_bytes)
5. on CPU-only builds (n_devices==0): plan is empty, total is 0
6. fit-failure (nonexistent model) emits the documented "error":"fit_failed"
   JSON marker on stdout (not garbage) so subprocess callers can
   distinguish fit-failure from parse-failure

Run with: tools/fit-params/tests.sh path/to/build/bin

Verified locally on CPU-only build: ALL fit-params --fit-print-plan
smoke tests PASSED.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant