feat(fit): expose per-device byte plan from common_fit_params (#66 step 1)#69
Merged
Conversation
…ep 1) Adds an optional `out_bytes_per_device` output parameter to `common_fit_params` (defaulting to nullptr so all existing callers are unaffected). On SUCCESS, populates with the projected per-device byte demand for the resolved plan — index 0 is the CPU device, indices 1..N are GPU/accel devices in the same order as `tensor_split`. This is the foundation for issue #66 (per-GPU-aware router fit). The router currently admits candidates against the TOTAL VRAM pool via `common_fit_params` + `--models-max`, ignoring that two models pinned to CUDA0 collide on GPU0's 24 GiB even though the 48 GiB pool says fits. Subsequent steps (separate PRs): * (2) Track `reserved_per_device[]` in `server_models`, populated from this output on each load and decremented on unload. * (3) Replace count-only `unload_lru()` with `unload_lru_for_devices(targeted, candidate_per_device)` that LRU-evicts from the constrained device set. * (4) Wire admit decision against `free_device[d] - reserved[d]` per targeted device. No behavior change in this PR — the new output is opt-in and the existing planning logic is untouched. Both fit-impl return paths (the early MoE-trivial path and the full-search path) populate `out_bytes_per_device` identically with the final `mem` vector that fit_impl already computes internally.
This was referenced Jun 5, 2026
marksverdhei
added a commit
that referenced
this pull request
Jun 12, 2026
…#66 step 2 prep) The router's per-GPU admit decision (#66) needs the per-device byte demand for a candidate model BEFORE spawning the child subprocess. PR #69 added the underlying `out_bytes_per_device` output to `common_fit_params`; this PR exposes it via the existing `tools/fit-params` CLI as a subprocess-friendly JSON output. * New CLI flag `--fit-print-plan` (env `LLAMA_ARG_FIT_PRINT_PLAN`). * On success, prints a single-line JSON object on stdout: {"per_device_bytes":[N0,N1,...],"n_devices":K,"total_bytes":T} plan[i] = i-th GPU/accel device, same order as tensor_split; CPU host memory NOT included. Empty plan for CPU-only builds. * On fit failure, emits an explicit JSON failure marker and exits 1: {"error":"fit_failed","status":N} * common/fit.cpp: populate `out_bytes_per_device` at the three early- return paths (the impl had three 'no changes needed' fast-paths that bypassed the main return point where PR #69 wrote the plan). Doc string in common/fit.h corrected — plan covers GPU devices only. Designed to be subprocessed from `server_models::compute_admit_plan(name)` (#66 step 2 — out-of-process approach per the architectural call on issue #66 / task ggml-org#123). The router parses this JSON, tracks `reserved[d]` for in-flight LOADING models, admits candidates against `live_cudaMemGetInfo(d) - reserved[d]`. Mutually exclusive with the existing `--fit-print` mode; if both set, `--fit-print-plan` wins. Local CPU build verified: `--help` renders the new flag, empty plan returned for CPU-only build as expected. GPU verification deferred to snoop-kube's canary-cycle.
marksverdhei
added a commit
that referenced
this pull request
Jun 12, 2026
…#66 step 2 prep) (#72) * feat(fit-params): --fit-print-plan emits per-device byte plan as JSON (#66 step 2 prep) The router's per-GPU admit decision (#66) needs the per-device byte demand for a candidate model BEFORE spawning the child subprocess. PR #69 added the underlying `out_bytes_per_device` output to `common_fit_params`; this PR exposes it via the existing `tools/fit-params` CLI as a subprocess-friendly JSON output. * New CLI flag `--fit-print-plan` (env `LLAMA_ARG_FIT_PRINT_PLAN`). * On success, prints a single-line JSON object on stdout: {"per_device_bytes":[N0,N1,...],"n_devices":K,"total_bytes":T} plan[i] = i-th GPU/accel device, same order as tensor_split; CPU host memory NOT included. Empty plan for CPU-only builds. * On fit failure, emits an explicit JSON failure marker and exits 1: {"error":"fit_failed","status":N} * common/fit.cpp: populate `out_bytes_per_device` at the three early- return paths (the impl had three 'no changes needed' fast-paths that bypassed the main return point where PR #69 wrote the plan). Doc string in common/fit.h corrected — plan covers GPU devices only. Designed to be subprocessed from `server_models::compute_admit_plan(name)` (#66 step 2 — out-of-process approach per the architectural call on issue #66 / task ggml-org#123). The router parses this JSON, tracks `reserved[d]` for in-flight LOADING models, admits candidates against `live_cudaMemGetInfo(d) - reserved[d]`. Mutually exclusive with the existing `--fit-print` mode; if both set, `--fit-print-plan` wins. Local CPU build verified: `--help` renders the new flag, empty plan returned for CPU-only build as expected. GPU verification deferred to snoop-kube's canary-cycle. * test(fit-params): smoke for --fit-print-plan JSON output (PR #72 coverage) (#75) PR #72 added the --fit-print-plan flag to llama-fit-params without test coverage. This adds a tools/fit-params/tests.sh (pattern lifted from tools/gguf-split/tests.sh) that downloads a small Qwen3-0.6B GGUF and verifies six invariants: 1. success-path emits single-line JSON 2. schema has per_device_bytes / n_devices / total_bytes with correct types 3. len(per_device_bytes) == n_devices 4. total_bytes == sum(per_device_bytes) 5. on CPU-only builds (n_devices==0): plan is empty, total is 0 6. fit-failure (nonexistent model) emits the documented "error":"fit_failed" JSON marker on stdout (not garbage) so subprocess callers can distinguish fit-failure from parse-failure Run with: tools/fit-params/tests.sh path/to/build/bin Verified locally on CPU-only build: ALL fit-params --fit-print-plan smoke tests PASSED.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Issue #66 step 1 — foundation for per-GPU-aware router fit.
What
Adds an optional
out_bytes_per_deviceoutput parameter tocommon_fit_params(defaults tonullptrso existing callers are unaffected). On SUCCESS, populates with the projected per-device byte demand for the resolved plan.Why
The router currently admits candidates against TOTAL VRAM via
common_fit_params+--models-max. Two models pinned to CUDA0 (tensor-split=100,0) collide on GPU0's 24 GiB even though the 48 GiB pool says they fit — see issue #66 for the exact prod signature and snoop-kube's deterministic repro.How
Both
common_params_fit_implexit paths (the early MoE-trivial path at the previous line 619, and the full-search path at line 763) populate the output from thememvector that fit_impl already computes. No behavior change fornullptrcallers.static void common_params_fit_impl( const char * path_model, struct llama_model_params * mparams, struct llama_context_params * cparams, float * tensor_split, struct llama_model_tensor_buft_override * tensor_buft_overrides, - size_t * margins_s, uint32_t n_ctx_min, enum ggml_log_level log_level) { + size_t * margins_s, uint32_t n_ctx_min, enum ggml_log_level log_level, + std::vector<int64_t> * out_bytes_per_device) {What this is NOT
This PR doesn't wire the output anywhere — it's the foundation. Subsequent PRs (each separate, each independently shippable):
reserved_per_device[]inserver_models, updated on load (+) and unload (-).unload_lru()withunload_lru_for_devices(targeted, candidate_per_device).free_device[d] - reserved[d]per targeted device.Verified
cmake -B build -DGGML_CPU=ON -DLLAMA_BUILD_APP=ON && cmake --build build --target llama-serversucceeds.Test plan after merge
Snoop-kube has a deterministic repro: load devstral-24b (CUDA0), then request gemma-4-31b-iq4 (also CUDA0). Today this OOMs at the second admit. After step 2-4 land + a canary roll, the same sequence should: (a) succeed if there's room after evictions, or (b) cleanly 503 with
model_unavailable_errorif not — never OOM.