feat(fit-params): --fit-print-plan emits per-device byte plan as JSON (#66 step 2 prep) by marksverdhei · Pull Request #72 · heiervang-technologies/ht-llama.cpp

marksverdhei · 2026-06-05T02:31:06Z

Stacks on PR #69. Step 2 prep work for issue #66.

Why

The router-side admit decision (#66) needs the per-device byte demand for a candidate model BEFORE spawning the child subprocess. Architectural decision on issue #66 (captured in task ggml-org#123): use the out-of-process subprocess approach. tools/fit-params already runs the same fit logic in-process, so the only missing piece is a subprocess-friendly output format.

What

* New CLI flag --fit-print-plan (env LLAMA_ARG_FIT_PRINT_PLAN).
* Emits single-line JSON on success:
```
{"per_device_bytes":[N0,N1,...],"n_devices":K,"total_bytes":T}
```
* Emits a JSON failure marker on fit failure, so subprocess callers can distinguish a fit failure from a parse failure:
```
{"error":"fit_failed","status":N}
```
* common/fit.cpp: also populate out_bytes_per_device at the three "no changes needed" early-return paths. PR #69 (feat(fit): expose per-device byte plan from common_fit_params, #66 step 1) only populated the main return points, so for models that fit without adjustment (the common case) the plan was empty.
* common/fit.h: doc string corrected. The plan covers GPU/accel devices only, CPU host memory is NOT included, and the plan is empty for CPU-only builds.
* Mutually exclusive with the existing --fit-print mode; if both are set, --fit-print-plan wins.
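A caller has three outcomes to tell apart: a valid plan, the documented fit_failed marker, and output that is not parseable at all. The following is a minimal Python sketch of that distinction, not the router's implementation. The names `FitPlan`, `FitFailed`, and `parse_fit_plan_line` are hypothetical; only the two JSON shapes come from this PR.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class FitPlan:
    per_device_bytes: List[int]
    total_bytes: int


class FitFailed(Exception):
    """Raised for the documented {"error":"fit_failed","status":N} marker."""

    def __init__(self, status: int):
        super().__init__(f"fit failed with status {status}")
        self.status = status


def parse_fit_plan_line(line: str) -> FitPlan:
    """Parse one line of --fit-print-plan output (hypothetical helper).

    Returns a FitPlan on success, raises FitFailed for the failure marker,
    and raises ValueError for anything else (a parse failure).
    """
    try:
        obj = json.loads(line)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not JSON: {line!r}") from exc
    if not isinstance(obj, dict):
        raise ValueError(f"expected a JSON object, got: {line!r}")

    if obj.get("error") == "fit_failed":
        raise FitFailed(int(obj.get("status", -1)))

    try:
        plan = [int(n) for n in obj["per_device_bytes"]]
        n_devices = int(obj["n_devices"])
        total = int(obj["total_bytes"])
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"unexpected schema: {line!r}") from exc

    if len(plan) != n_devices:
        raise ValueError("n_devices does not match len(per_device_bytes)")
    return FitPlan(per_device_bytes=plan, total_bytes=total)
```

Keeping `FitFailed` separate from `ValueError` lets a caller treat "the model does not fit" as a normal admit rejection while treating malformed output as a tooling error.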

Verified

✅ Build clean (cmake --build build --target llama-fit-params)
✅ --help renders the new flag
✅ CPU-only smoke returns {"per_device_bytes":[],"n_devices":0,"total_bytes":0} (correct — no GPU demand for CPU build)
⏳ GPU verification deferred to snoop-kube's canary cycle (local GPU busy with centurion container)

Usage from the future router

When picking up step 2 in server_models::compute_admit_plan(name):

* subprocess: tools/fit-params --fit-print-plan <model args from preset>
* parse the single JSON line
* plan[i] = bytes for the i-th GPU/accel device (router uses tensor_split order)
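The steps above can be sketched in Python for illustration (the actual router is C++ and is not part of this PR). `query_fit_plan`, its parameters, and the timeout are hypothetical. Picking the last stdout line that starts with `{` is an assumption that guards against stray log output; the PR only states that the JSON is a single line.

```python
import json
import subprocess
from typing import List, Sequence


def query_fit_plan(
    fit_params_cmd: Sequence[str],
    model_args: Sequence[str],
    timeout_s: float = 120.0,
) -> List[int]:
    """Run the fit-params tool with --fit-print-plan and return plan[i] in bytes.

    fit_params_cmd is the command prefix for the tool (a hypothetical
    parameter; pass the path to your built binary).
    """
    cmd = [*fit_params_cmd, "--fit-print-plan", *model_args]
    proc = subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout_s, check=False
    )

    # Assumption: the plan is the last stdout line that looks like a JSON object.
    candidates = [
        ln for ln in proc.stdout.splitlines() if ln.lstrip().startswith("{")
    ]
    if not candidates:
        raise ValueError(f"no JSON line on stdout (exit code {proc.returncode})")
    obj = json.loads(candidates[-1])

    if "error" in obj:
        # Documented failure marker: {"error":"fit_failed","status":N}
        raise RuntimeError(f"{obj['error']} (status {obj.get('status')})")
    return [int(n) for n in obj["per_device_bytes"]]
```

The failure marker is checked on stdout rather than inferred from the exit code alone, so a crash that produces no JSON surfaces as a different error from a clean fit failure.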

Sequence

1. PR #69, feat(fit): expose per-device byte plan from common_fit_params (#66 step 1): out_bytes_per_device foundation (open)
2. This PR: fit-params CLI surface (depends on #69)
3. server_models::compute_admit_plan helper + reserved[] state (depends on this PR)
4. Shadow-mode admit logging (depends on 3)
5. Flip to enforcement + LRU evict (depends on 4, needs canary)

Each is small and individually shippable.

…rage) (#75)

PR #72 added the --fit-print-plan flag to llama-fit-params without test coverage. This adds a tools/fit-params/tests.sh (pattern lifted from tools/gguf-split/tests.sh) that downloads a small Qwen3-0.6B GGUF and verifies six invariants:

1. success-path emits single-line JSON
2. schema has per_device_bytes / n_devices / total_bytes with correct types
3. len(per_device_bytes) == n_devices
4. total_bytes == sum(per_device_bytes)
5. on CPU-only builds (n_devices==0): plan is empty, total is 0
6. fit-failure (nonexistent model) emits the documented "error":"fit_failed" JSON marker on stdout (not garbage) so subprocess callers can distinguish fit-failure from parse-failure

Run with:

    tools/fit-params/tests.sh path/to/build/bin

Verified locally on CPU-only build: ALL fit-params --fit-print-plan smoke tests PASSED.
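For illustration, invariants 1 through 5 can be expressed as a small Python check. This is a sketch of the checks as described in the commit message, not the contents of tests.sh. `plan_violations` is a hypothetical name, and the sketch assumes the captured stdout contains nothing but the JSON line.

```python
import json
from typing import List


def _is_int(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def plan_violations(stdout_text: str) -> List[str]:
    """Return a list of violated invariants; an empty list means all pass."""
    lines = [ln for ln in stdout_text.splitlines() if ln.strip()]
    if len(lines) != 1:  # invariant 1: single-line JSON
        return [f"expected exactly one non-empty line, got {len(lines)}"]
    try:
        obj = json.loads(lines[0])
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(obj, dict):
        return ["output is not a JSON object"]

    plan = obj.get("per_device_bytes")
    n_devices = obj.get("n_devices")
    total = obj.get("total_bytes")

    violations = []
    # Invariant 2: schema and types.
    if not (isinstance(plan, list) and all(_is_int(b) for b in plan)):
        violations.append("per_device_bytes must be a list of integers")
    if not _is_int(n_devices):
        violations.append("n_devices must be an integer")
    if not _is_int(total):
        violations.append("total_bytes must be an integer")
    if violations:
        return violations

    if len(plan) != n_devices:  # invariant 3
        violations.append("len(per_device_bytes) != n_devices")
    if total != sum(plan):  # invariant 4
        violations.append("total_bytes != sum(per_device_bytes)")
    # Invariant 5 (CPU-only: empty plan, zero total) follows from 3 and 4
    # when n_devices == 0, so it needs no separate check here.
    return violations
```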

…#66 step 2 prep)

The router's per-GPU admit decision (#66) needs the per-device byte demand for a candidate model BEFORE spawning the child subprocess. PR #69 added the underlying `out_bytes_per_device` output to `common_fit_params`; this PR exposes it via the existing `tools/fit-params` CLI as a subprocess-friendly JSON output.

* New CLI flag `--fit-print-plan` (env `LLAMA_ARG_FIT_PRINT_PLAN`).
* On success, prints a single-line JSON object on stdout:
      {"per_device_bytes":[N0,N1,...],"n_devices":K,"total_bytes":T}
  plan[i] = i-th GPU/accel device, same order as tensor_split; CPU host memory NOT included. Empty plan for CPU-only builds.
* On fit failure, emits an explicit JSON failure marker and exits 1:
      {"error":"fit_failed","status":N}
* common/fit.cpp: populate `out_bytes_per_device` at the three early-return paths (the impl had three 'no changes needed' fast-paths that bypassed the main return point where PR #69 wrote the plan). Doc string in common/fit.h corrected: plan covers GPU devices only.

Designed to be subprocessed from `server_models::compute_admit_plan(name)` (#66 step 2, out-of-process approach per the architectural call on issue #66 / task ggml-org#123). The router parses this JSON, tracks `reserved[d]` for in-flight LOADING models, admits candidates against `live_cudaMemGetInfo(d) - reserved[d]`.

Mutually exclusive with the existing `--fit-print` mode; if both set, `--fit-print-plan` wins.

Local CPU build verified: `--help` renders the new flag, empty plan returned for CPU-only build as expected. GPU verification deferred to snoop-kube's canary-cycle.
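The admit rule described in this commit message (a candidate is admitted on device d only if its planned bytes fit within free memory minus what is already reserved for loading models) can be sketched as follows. This is an illustrative Python model under stated assumptions, not the router code, which is planned for a later step. `AdmitTracker` and its methods are hypothetical, and `free_bytes` stands in for whatever the router reads live from the device (the message names `live_cudaMemGetInfo(d)`).

```python
from typing import Dict, List, Sequence


class AdmitTracker:
    """Tracks reserved[d] for models that are admitted but still LOADING."""

    def __init__(self, n_devices: int):
        self.reserved: List[int] = [0] * n_devices
        self.loading: Dict[str, List[int]] = {}

    def try_admit(
        self, name: str, plan: Sequence[int], free_bytes: Sequence[int]
    ) -> bool:
        """Admit if plan[d] <= free_bytes[d] - reserved[d] on every device."""
        if name in self.loading:
            return False  # already in flight
        if len(plan) > len(self.reserved) or len(free_bytes) < len(plan):
            return False  # plan refers to devices this tracker does not know
        for d, need in enumerate(plan):
            if need > free_bytes[d] - self.reserved[d]:
                return False
        # Reserve the planned bytes until the model has finished loading.
        for d, need in enumerate(plan):
            self.reserved[d] += need
        self.loading[name] = list(plan)
        return True

    def finish_loading(self, name: str) -> None:
        """Release the reservation; a loaded model's memory is then
        reflected in the live free-memory reading instead (assumption)."""
        plan = self.loading.pop(name)
        for d, need in enumerate(plan):
            self.reserved[d] -= need
```

The reservation exists because a model that is still loading has not yet consumed its memory, so the live free-memory reading alone would let two concurrent loads both claim the same bytes.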

This was referenced Jun 5, 2026

Hivemind Maintenance Tasks Epoch 1 #73

Closed

test(fit-params): smoke for --fit-print-plan JSON output #75

Merged

Hivemind Maintenance Tasks Epoch 2 #79

Closed

Hivemind Maintenance Tasks Epoch 3 #81

Closed

Hivemind Maintenance Tasks Epoch 4 #86

Closed

Base automatically changed from feat/fit-per-device-plan to ht June 12, 2026 18:36

marksverdhei and others added 2 commits June 12, 2026 20:40

marksverdhei force-pushed the feat/fit-params-print-plan branch from 62e2afe to 005d9b0 Compare June 12, 2026 18:40

marksverdhei merged commit 68e31e9 into ht Jun 12, 2026
3 of 7 checks passed

marksverdhei deleted the feat/fit-params-print-plan branch June 12, 2026 19:06

marksverdhei mentioned this pull request Jun 12, 2026

docs(readme): complete HT Fork Changes inventory with per-change justifications #106

Merged

marksverdhei merged 2 commits into ht from feat/fit-params-print-plan
