Skip to content

chore(sync): upstream master → ht (114-commit sync 2026-06-13)#107

Closed
marksverdhei wants to merge 39 commits into
htfrom
chore/upstream-sync-2026-06-13
Closed

chore(sync): upstream master → ht (114-commit sync 2026-06-13)#107
marksverdhei wants to merge 39 commits into
htfrom
chore/upstream-sync-2026-06-13

Conversation

@marksverdhei

Copy link
Copy Markdown

Upstream sync: ht ← upstream/master (114 commits)

Syncs the ht fork onto the latest upstream llama.cpp master.

  • Merge-base: 006640408 → upstream d8a24ccee (114 upstream commits)
  • Conflicts resolved: 96 (63 modify/delete, 32 content, 1 add/add)
  • After merge: merge-base advances to d8a24ccee, so the next sync measures its delta correctly (a vendored/squashed merge would not — see below).

Important

Merge this with a MERGE COMMIT (or fast-forward), NOT squash. This is a true 2-parent merge commit; squashing flattens it, the merge-base would not advance, and the next sync would again over-count its delta. This is the one PR where the repo's default squash-merge policy must be overridden.

Validation

  • ✅ Builds clean — CPU Release, 118 targets, no errors/-Werror.
  • ctest -L main47/47 green (incl. test-quantize-fns, test-llama-archs covering the DFlash+eagle3 arch merge, test-backend-ops).
  • ✅ Server smoke (merged binary) — test_basic 5/6, test_completion 38/40. The 2 failures are sandbox-environment artifacts, not regressions: webui index.html couldn't download from the llama-ui HF bucket offline (test_no_webui), and one test needs a model not in the offline cache (..._stream_..._stops).

Key resolution decisions

Area Decision Rationale
tools/ui/* (63 files) kept deleted Fork ships a prebuilt UI from the llama-ui HF bucket, not the upstream Svelte source tree.
tools/mtmd/* (6 files) converged to upstream ht's mtmd was vendored upstream commits now behind master (wrapper-struct bitmap API, video/ffmpeg, builder pattern). Verified no ht-unique features lost; server callers updated to the wrapper API.
Gemma4 MTP / DFlash / eagle3 merged both Kept ht's DFlash arch/hparams/speculative wiring; adopted upstream's now-merged eagle3 + masked-embedding (the vendored ggml-org#23398 converging upstream).
KV cache upstream v_cells_impl refactor Upstream's shared_ptr cells mainline the dual-context cell-sharing ht hacked in via other; the init-list shares the source cache's storage when other is set.
llama-graph kq-mask upstream buffer-guarded set Matches the shared-cells refactor (idxs may be unbacked while the mask stays live); dropped the now-duplicate unconditional call.
Embedding scale kept ht's ubatch.token guard Gemma4 vision must not scale raw image embeddings; upstream's deepstack-aware variant would scale them for non-deepstack models. Documented inline.
Server ht features + upstream API Kept router glue, #94 benign-warning downgrade, #87 api-prefix stripping, DFlash/MTMD-draft wiring; adopted upstream's expanded public endpoints, mtmd wrapper API, int64 token counts.

Post-merge fixes (auto-merge artifacts the build surfaced, not conflicts): removed a duplicate llama_model::fc member; restored n_embd in output_reserve for upstream's embeddings-layer-inp path.

⚠️ Needs a real-model smoke before fleet rollout

ctest + the small-model server smoke don't exercise the deployment-shaped paths most affected by the merge. Before deploying to titan/crystal:

  • Dual-context speculative KV sharing (DFlash/MTP) — the v_cells_impl refactor changed the cell-sharing mechanism.
  • Gemma4 vision — the embedding-scale guard + the mtmd convergence (wrapper API + video).
  • mtmd video — new upstream feature, untested here.

Rollback: tag ht-pre-sync-2026-06-12 (immutable).

marksverdhei and others added 30 commits June 4, 2026 16:28
The full webui source tree (tools/ui/.storybook, src/, static/, tests/,
docs/, configs) is now maintained at heiervang-technologies/heierchat
as a standalone repo. Drop it from ht-llama.cpp along with the
upstream ui.yml CI workflow that builds it.

The minimal build glue stays: tools/ui/CMakeLists.txt, tools/ui/embed.cpp,
tools/ui/sources.cmake, scripts/ui-assets.cmake — enough to wire up
llama-ui as a static lib so tools/server can keep linking it. With no
source files present, scripts/ui-assets.cmake hits its 'no assets
available' fallback and emits empty ui.cpp/ui.h; llama_ui_find_asset()
returns nullptr and the server simply 404s for embedded UI routes.

Heierchat is served separately (its own build pipeline), and the
pre-built snapshot in tools/server/public/ is committed for any
deployment that wants to wire its own filesystem-serving route.
…nels

Adds two new ftypes — LLAMA_FTYPE_MOSTLY_TBQ3_0 (3.125 bpw) and
LLAMA_FTYPE_MOSTLY_TBQ4_0 (4.125 bpw) — built on 128-element blocks
(half of QK_K, finer quantization granularity at low bpw).

CPU + CUDA implementations include native dot-products and dequant
kernels. The CUDA flash-attention path consumes K/V directly in
quantized format, eliminating the dequantize-to-f16 intermediate.

Plumbed through include/llama.h ftypes, src/llama-quant.cpp,
src/llama-graph.cpp, src/llama-kv-cache.cpp, tools/quantize,
tools/llama-bench, and tests/test-backend-ops + test-quantize-fns.
CLI and completion docs updated with the new cache-type values.
common/preset.cpp: scan the models directory for GGUF files whose
general.type=adapter, read their architecture / name / version, and
expose them via common_preset_context. Lets the server attach
LoRA adapters automatically instead of requiring per-preset wiring.

common/arg.cpp + common.h:
  - Register TBQ3_0 / TBQ4_0 in the KV cache type table (used by
    --cache-type-k / --cache-type-v alongside the TurboQ feature).
  - Add --remap-developer-role / LLAMA_ARG_REMAP_DEVELOPER_ROLE so
    requests with the OpenAI "developer" role rewrite to "system"
    before chat-template application — needed for Qwen3.5 et al
    whose templates reject unknown roles.
Carries ht's server-side customizations on top of upstream router mode:

* server-models: discovered_adapters tracking + LoRA discovery wiring,
  pick_any_resident() for the "any" model sentinel, custom stop-timeout
  handling, sync load/unload via wait_until_ready / wait_until_unloaded,
  child-to-router CMD_CHILD_TO_ROUTER_INFO support, plus LLAMA_APP_CMD
  env-var injection in update_args() for child supervision.
* server-chat: developer-role remap, image queue and unbundled-tool
  endpoint helpers, downstream chat-template fallbacks.
* server-context / server-task / server.cpp: glue and routing
  enhancements for the above.
* server-common: shared helpers (env injection, json utils).
* tests/unit/test_router.py: regression coverage for the router.
* README + README-dev: heierchat pointers (webui is no longer in-tree)
  and ht-side server architecture notes.
Rust service (tools/termd/) that runs in a hardened container alongside
llama-server and exposes a websocket terminal for tool-call execution.
Used by heierchat to run user-approved shell snippets with sandbox
guardrails (docker isolation, allowlisted commands, per-session state).

* HTTP control plane + WS streaming endpoint
* per-session shell state, async I/O over tokio
* sandbox_guard: command policy enforcement
* docker.rs: container lifecycle (start, exec, kill)
* ht-termd.service: systemd unit for managed deploys
The base model splits kv_b_proj into k_b_proj (transposed) and v_b_proj.
LoraTorchTensor can't handle the required split+transpose on 3D
tensors, so decompose the LoRA and apply the split/transpose to the
raw A/B tensors directly before yielding both decomposed adapter
tensors.
Ht-fork housekeeping:

* README.md, CONTRIBUTING.md, AGENTS.md, media/ht-llama-banner.png:
  fork branding and contributor pointers.
* .gitignore: downstream-only artifact ignores (deploy bundles etc.).
* docs/research/: design notes carried in-tree (diff-edit tool,
  file-editing research, LLM diff-edit literature).
* .github/workflows/:
  - aioc.yml — heiercloud compatibility probe
  - fork-sync.yml — automated upstream-master sync
  - python-lint.yml — fork-specific lint config
  - release.yml — fork release pipeline
* tools/server/public/: pre-built heierchat bundle snapshot embedded
  so the default server experience works without a separate UI deploy.
* scripts/snapdragon/qdc/requirements.txt: Dependabot security bumps
  (idna 3.15, urllib3 2.7.0, pytest 9.0.3).
scripts/ui-assets.cmake gets a new Priority 1 that copies the committed
heierchat snapshot in <repo>/tools/server/public/ into the build's
DIST_DIR, so llama-ui-embed picks it up and llama_ui_find_asset() can
serve it. Existing tools/ui/dist priority is demoted to Priority 2 as a
legacy / manual-override path.

Add tools/server/public/loading.html (carried from upstream master's
tools/ui/static/loading.html) so the 4-asset set is complete and the
server has a loading screen during model load.

README.md: collapse the eight WebUI / Desktop-shell subsections into a
single section pointing at heierchat. Drop the stale internal links to
tools/server/webui/* and tools/server/webui-tauri/* that no longer
resolve in this repo.
)

`tools/server/server-models.cpp:923` used `child_proc->return_status`,
which is a POSIX-only field of `subprocess_s` — the Windows variant of
the struct has `hProcess` instead, so this fails to compile on every
Windows CI runner. Replace the direct field access with a portable
`subprocess_join(child_proc.get(), &child_exit)` call (same pattern
already used at line 908 of the same file). `subprocess_alive()` has
already reaped the zombie on POSIX, so `subprocess_join()` returns
immediately with the cached status.
PR #52 switched TBQ3_0/TBQ4_0 from 256-element to 128-element blocks,
but tests/test-quantize-fns.cpp wasn't updated:

* `test_tbq3_norm_scaling` allocated a single `block_tbq3_0` (128
  elements) on the stack but passed `QK_K` (256) to
  `quantize_row_tbq3_0_ref`. The ref function writes `k / TBQ_BLK_SIZE`
  = 2 blocks, overrunning the single-block buffer. x86 silently scribbled
  past the local; arm64 stack canaries caught it as
  '*** stack smashing detected ***' and aborted the whole test binary.
  Fix: pass `TBQ_BLK_SIZE` and assert against `sqrtf(TBQ_BLK_SIZE)`.

* Bumped tolerances slightly:
  - `MAX_QUANTIZATION_TOTAL_ERROR_TBQ4` 0.0025 → 0.0035
  - `MAX_DOT_PRODUCT_ERROR_TBQ3` 0.05 → 0.06
  - Added `MAX_DOT_PRODUCT_ERROR_TBQ4` = 0.03 (TBQ4 was falling through
    to the default 0.02, which the 128-block path now exceeds).

The threshold bumps are tight (~20%) — worth a follow-up to confirm the
128-block migration isn't masking a real quality regression on uniform
random data. Real-model evals (perplexity, MMLU) should govern accept/
reject of the migration; these tests are just smoke.
Adds DFlash speculative decoding as a per-arch model class:

* src/models/dflash.cpp (new): `llama_model_dflash` — block-diffusion
  drafter that proposes N tokens per step against the target model's
  context-conditional embeddings. SWA-aware attention mask, n_block
  noise tokens layered against ctx tokens, per-layer is_swa_impl
  routing.
* src/llama-arch.{cpp,h}: LLM_ARCH_DFLASH registered (NeoX rope).
* src/llama-graph.{cpp,h}: `llm_graph_input_dflash` carries target
  hidden + masks, sets them on the graph each ubatch.
* src/llama-model{.cpp,.h,-loader.cpp}: arch dispatch, tensor loader
  hooks, std::array<int,5> instantiation for the noise schedule.
* src/llama-cparams.h + llama-hparams.h + llama-context.h +
  llama.cpp: cparams flag + ctx wiring + arch-dispatch entry.
* src/models/{gemma4,llama,models}.* : hooks so the existing target
  archs cooperate with the dflash drafter.
* common/speculative.cpp + speculative.h: COMMON_SPECULATIVE_TYPE_DFLASH
  config; auto-enable logic that gates 'draft' type on actual draft-
  model path presence while leaving room for dflash / eagle3 / mtp.
* tools/server: router-aware integration (last_used_ms field,
  pick_any_resident wired with dflash flag), server-task.h/cpp + queue
  + server.cpp glue, --parallel 1 gate left in place per current
  Round-11 status.
* tests/test-dflash.cpp + scripts/* + HANDOFF.md: smoke harness,
  bench, weight-compare, regression artifacts.

Squashed from 42 commits on the feat/dflash-integration branch (the
previous Round-11 lifecycle iterations). Rebased onto the post-rewrite
ht baseline; conflicts resolved against the per-arch class hierarchy
that's now upstream-stock (renamed swa_layers → is_swa_impl, single
remap_developer_role definition in server_chat_params, has_draft_simple
auto-enable kept dflash-aware).
…65)

Per Markus 2026-06-04: DFlash quality measurement should use a Q8_0
target rather than Q4_K_M, since Q4_K_M introduces enough target-side
quantization noise to confound DFlash's own accept-rate signal. Q8_0
fits in 38 GB total, well within titan A100 80 GB.

* Default `TARGET` is now `gemma-4-31B-it-Q8_0.gguf`. Override via
  `--target PATH` or `DFLASH_BENCH_TARGET` env var.
* Also added `DFLASH_BENCH_DRAFTER_DIR` env var for consistency.
* Comment block documents VRAM math for Q4_K_M / Q8_0 / BF16 targets
  so future runs can pick the right card.
…l-org#23398 vendor (#93)

* build : use umbrella Headers directory for XCFramework module map (ggml-org#23974)

The XCFramework generated by build-xcframework.sh creates a module map
that manually lists public headers.

That list can fall out of sync with the framework's Headers directory.
The module map is currently missing ggml-opt.h, which is present in the
framework headers. This can cause downstream Apple builds to fail with:

    Include of non-modular header inside framework module 'llama'

Use the framework's Headers directory itself as the module map umbrella
instead of maintaining a manual header list. This makes all public headers
under the generated framework's Headers directory part of the llama module.

* webui: fix tool selector toggle/counter, key tools by stable identity (ggml-org#24065)

* webui: fix tool selector toggle/counter, key tools by stable identity

Key the disabled set, counts and toggles by a stable per-tool key
instead of bare function name, deduped from one canonical list. Per-tool
checkboxes become presentational (single row handler, no nested button),
category checkboxes drop the tristate (n/total carries partial). One
getEnabledToolsForLLM keeps normalized MCP schemas and dedupes by name.

* ui: use SvelteSet and SvelteMap for local tool collections to satisfy svelte/prefer-svelte-reactivity

* agents: refactor, include more guidelines (ggml-org#24111)

* agents: refactor, include more guidelines

* better example

* rephrase a bit

* add more examples

* nits

* server: avoid unnecessary checkpoint restore when new tokens are present (ggml-org#24110)

* server: avoid unnecessary checkpoint restore when new tokens are present

The pos_min_thold calculation unconditionally subtracts 1 to ensure at
least one token is evaluated for logits when no new tokens exist.
However, when the request contains new tokens beyond the cached prefix,
this -1 is overly conservative and may trigger an unnecessary checkpoint
restore.

Conditionally apply the -1 only when n_past >= task.n_tokens() (no new
tokens), avoiding redundant KV state restoration when there is actual
work to do.

* cont : add ref

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* ggml: vectorize ggml_vec_dot_q4_1_q8_1 with WASM SIMD128 (ggml-org#22209)

* ggml: vectorize ggml_vec_dot_q4_1_q8_1 with WASM SIMD128

Optimize the inner loop of ggml_vec_dot_q4_1_q8_1_generic using
WASM SIMD128 intrinsics, gated behind #ifdef __wasm_simd128__ so
non-wasm builds are completely unaffected.

Approach:
- single wasm_v128_load covers all 32 packed 4-bit weights
- nibbles unpacked via AND/SHR into two u8x16 registers
- widened to i16 before multiply (WASM SIMD has no i8*i8 instruction)
- 4x wasm_i32x4_dot_i16x8 calls accumulate all 32 element pairs
- horizontal reduce via 4x wasm_i32x4_extract_lane

Benchmark (node v25, emcc -O3 -msimd128, 64 blocks x QK8_1=32,
200k iterations):

| impl   | ns/call | speedup |
|--------|---------|---------|
| scalar |   880.7 |   1.00x |
| simd   |   257.8 |   3.42x |

Correctness verified against scalar reference across 10 random seeds
with exact output match.

* ggml: move q4_1_q8_1 WASM SIMD implementation to wasm backend

Relocate the SIMD128 implementation of ggml_vec_dot_q4_1_q8_1 to ggml/src/ggml-cpu/arch/wasm/quants.c to follow architecture-specific layout. Restore the generic implementation in ggml/src/ggml-cpu/quants.c.
Move for loop in the else block.

* ggml: use generic q4_1_q8_1 fallback in wasm backend

* convert: Fix Gemma 4 Unified conversion (ggml-org#24118)

* Fix Gemma 4 Unified conversion

* Set audio hidden size to audio_embed_dim

* return filter to save memory (ggml-org#24125)

Co-authored-by: lvyichen <lvyichen@stepfun.com>

* ui: added single line reasoning preview (ggml-org#23601)

* webui: added single line reasoning preview.

* patch: reduce width slightly for the previewing section

* refactor: move formatter constants to the right file

* feat: reimplement reasoning preview with throttled dynamic per-line rendering

* chore: fix spacing

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* chore: refactor to requested changes

* refactor: grouped by capture pattern instead of block-level + inline

* ui: fax interrupt state only trigger for 1st reasoning message

* chore: make reasoning preview respects showThoughtInProgress setting

* chore; newline at EOF

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* fix: thread rawContent so collapsible content can handle compute preview

* patch: showThoughtInProgress accidentally blocks rawContent being passed

* chore: fix lint

* chore: change smoke test

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* ui: Fixed packages (ggml-org#24119)

* chore(ui): pin package versions to currently installed

- Update all dependencies and devDependencies to match exactly what's in package-lock.json
- This ensures reproducible builds by locking to specific versions rather than semver ranges

* chore: Update packages

* chore: Move remaining dependencies to devDependencies

* fix: Add missing `mermaid` package

* chore: Update `cookie` package to `v1.1.1`

* chore: Formatting

* test: Update test configs

* Move duplicated imatrix code into single common imatrix-loader.cpp (ggml-org#22445)

* Deduplicate imatrix loading code

* Add back LLAMA_TRACE, early exit on quantize missing metadata

* webui: [a11y] fix keyboard navigation issues in chat interface and sidebar (ggml-org#23132)

* use child snippets for landing and chat message elements

* make ... icon visible in conversation history menu

* conversation history forward tab fix

* add snippet fix for fork icon in conversation history

* focus/keyboard fix for attachment x icon and scroll left/right

* formatting

* fix scroll down issue

* simply Statistics and pointer events in scrolldown

* create storybook tests and move to folder

* improve tests to actually assert on element

* arg: fix double mtp downloads (ggml-org#24128)

* server : disable on-device spec checkpoints (ggml-org#24108)

* sycl : port multi-column MMVQ from CUDA backend (ggml-org#21845)

mmvq:

Port the ncols_dst optimization from ggml-cuda/mmvq.cu to SYCL.
Read weights once per dispatch instead of once per column.
Covers all standard quant types + reorder paths for Q4_0, Q8_0,
Q3_K, Q4_K, Q5_K, Q6_K. IQ types (except IQ4_XS) excluded due to
incompatible vec_dot signatures.

ggml-sycl:

The weight reorder was only bootstrapped on single-token mat-vec
(ne[1] == 1). Speculative / MTP verify issues only multi-column mat-vec,
so it never triggered the reorder and ran on the slower non-reorder
kernel. Bootstrap it on small multi-column batches (ne[1] <= 8) too.

* ci : build-msys job slimming [no ci] (ggml-org#24157)

This PR attempts to slim down the dependencies for build-msys jobs
making the same changes that we applied in whisper.cpp to reduce the
size of the github actions cache, and should also improve the run time
due to fewer dependencies that need to be installed.

I realize this is a scheduled job but I think it would still make sense
to apply these changes.

Refs: ggml-org/whisper.cpp#3858

* CUDA: enroll mul_mat_vec_q_moe into pdl (ggml-org#24087)

* Enroll mul_mat_vec_q_moe into PDL, boosting MTP performance on BW

Data collected on a B4500:

Before
```
(llama.cpp) ➜  llama.cpp git:(master) ✗ python mtp-bench.py
  code_python        pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=202.8
  code_cpp           pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=212.8
  explain_concept    pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=196.4
  summarize          pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=226.6
  qa_factual         pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=225.1
  translation        pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=201.5
  creative_short     pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=197.2
  stepwise_math      pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=209.2
  long_code_review   pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=208.9
```
After
```
(llama.cpp) ➜  llama.cpp git:(master) ✗ python mtp-bench.py
  code_python        pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=211.9
  code_cpp           pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=224.6
  explain_concept    pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=207.8
  summarize          pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=240.2
  qa_factual         pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=238.5
  translation        pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=213.4
  creative_short     pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=208.8
  stepwise_math      pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=221.7
  long_code_review   pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=220.7
```

Server launched with:
```
➜  llama.cpp git:(osimons/enroll_mul_mat_vec_q_moe_into_PDL) ✗ ./build-x64-linux-gcc-reldbg/bin/llama-server \
    -m /mnt/share/gguf/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -dio \
    --spec-type draft-mtp \
    --spec-draft-n-max 2 \
    -ngl all \
    -fa on \
    --host 0.0.0.0 \
    --port 8080 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"
```

* LC to overlap with following kernels

* kleidiai : dynamic chunck-based scheduling for hybrid execution (ggml-org#23819)

* hparams : refactor `hparams.n_layer` (ggml-org#24060)

* hparams : refactor hparams.n_layer

* cont : remove `n_layer_kv()`, use n_layer_all instead

* cont : type consistency

* pi : update SYSTEM.md

* models : fix Step3.5 MTP

* cont : remove duplicate switch cases

* cont : explicitly set `false` to extra layers for `is_swa` and `is_recr`

* cont : fix nextn layer count handling

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* minor : fix lint issues (ggml-org#24165)

* docs: Update quantization readme (ggml-org#24133)

* Update quantization readme

* install requirements

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* dos2unix suggestions

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* ui: add ignore-scripts=true to npmrc (ggml-org#24149)

* Fix link to available UI settings (ggml-org#24169)

The current link is to a non-existent file. I had a look at the repo, spotted the file containing the UI configuration key and updated the link

* ui: run npm install when package-lock.json is newer than node_modules (ggml-org#24171)

* model : fix llama_model::n_gpu_layers() (ggml-org#24188)

* cli: fix model params not propagated (ggml-org#23893)

Fixes ggml-org#23847

* TP: round up granularity to 128 (ggml-org#24180)

* TP: round up granularity to 128

* remove assert

* model, mtmd: Granite4 Vision (ggml-org#23545)

* feat(convert): Get language model conversion working for 4.1 vision

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(convert): Skip multimodal tensors for GraniteMoeHybrid (vision 4.0)

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Disable vocab padding for non-hybrid models that use GraniteMoeHybrid

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Plumb python-side vision projector names and mappings

There are several awkward things here:

1. Most of these are essentially identical to the audio qformer tensors. On
the c++ side, that's mapped using the prefix, so the rest of the GGUF
name needs to align, but on the python side there's no prefix notion, so
they all get duplicated.
2. There are a couple of net-new tensors for vision, in particular
PROJ_NORM. In both speech and vision, the QF_PROJ_NORM is qualified as
belonging to the qformer portion, but the GGUF name is simply proj_norm
which conflicts with the ideal name for this new PROJ_NORM that is not
qualified as part of the qformer. To get around this, I used
"proj_layernorm" as the GGUF name.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add python side architecture name

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add python-side plumbing for setting FEATURE_LAYERS hparam

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add c++ side tensor naming defines

NOTE: Usage of these hasn't been updated to include prefix yet

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(mtmd): Convert vision_feature_layer to an ordered vector

We need to preserve the ordering of these feature index values so that they
can be mapped to the sub-tensors within the stacked projectors.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(mtmd): Add architecture label plumbing

Branch: Granite4Vision
AI-usage: full (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(wip): Add partial conversion for mmproj

This handles stacking the projector tensors and setting the new harams

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add gguf_writer and constant support for new hparams and deepstack layer arr

Branch: Granite4Vision
AI-usage: draft (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Full conversion for mmproj w/ tensor mappings

Branch: Granite4Vision
AI-usage: full (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add lm_head skip for mmproj for 4.0

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: De-alias text_config architecture in convert_lora_to_gguf.py

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add --trust-remote-code arg to convert_lora_to_gguf.py

This defaults to False, but allows a user to enable it programmaticly
instead of using the interactive prompt.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: De-alias model.language_model. -> model. for lora adapters

Branch: Granite4Vision
AI-usage: full (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Extend language model tensor dealiasing in adapters

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unnecessary registration for GraniteSpeech in language model

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Plumb through mm prefix formatting for qformer tensors

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Refactor vision projector tensors to use predictor ID as the block

This is cleaner than stacking them. The modeling file hard-codes
single-layer qformers, so we can punt on the multiipule multi-layer
projectors problem.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add spatial offests array hparam conversion

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add stub plumbing for granite vision in mtmd

Branch: Granite4Vision
AI-usage: draft (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add new hparam and tensor naming in clip-impl.h

New hparams:
- KEY_PROJ_SAMPLE_QUERY_SIDE
- KEY_PROJ_SAMPLE_WINDOW_SIDE
- KEY_PROJ_SPATIAL_OFFSETS

New tensors:
- TN_MULTI_PROJ_IMG_POS
- TN_MULTI_PROJ_QUERY
- TN_MULTI_PROJ_LAYERNORM
- TN_MULTI_PROJ_LINEAR
- TN_MULTI_PROJ_NORM

Branch: Granite4Vision
AI-usage: none

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Move deepstack_layer_arr to llm hparam instead of mmproj

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove IS_DEEPSTACK_LAYERS

This appears to have been added during Qwen3 VL
(ggml-org#16780), but it was never
actually used.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: n_deepstack_layers -> deepstack_layer_arr

The old logic hard coded a correspondence between the first N layers of the
LLM and the 1->N entries in the input embeddings. Now, that relationship is
maintained at loading time if the GGUF value is single-valued. If it is
multi-valued, it loads directly allowing for deepstack layers to be spaced
out throughout the model.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use try/catch for single/multi valued deepstack info

The alternative would be to use get_key_or_arr, but then the single value
would be populated through the entire array and we'd need to detect that
and update it with the right correspondence.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add deepstack injection point for granite LLM

The use of ggml_add here assumes that the elements of inp_embd will be pre-
arranged to be the full embedding length with only the vision-mask'ed
portions non-zero from the projector. This matches how Qwen3VL does it.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: add missing vision attn layernorm eps

Branch: Granite4Vision
AI-usage: full (OpenCode + Qwen 3.6-35B)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Hoist qformer tensors into qf_block and hold a vector for multi-proj

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix missing prefix template for TN_QF_PROJ_LINEAR

It's not strictly necessary since vision uses the blockwise version, but it
makes the loading consistent.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add embedding scale and image grid pinpoints hparams in conversion

Also remove dead parsing for self._deepstack_layer_arr

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add mtmd KEY_ section for hparams shared with the LLM

In this case, we need the EMBEDDING_SCALE so we can unscale the image
embeddings to compensate for applying embedding scale to the input
embeddings

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Implement c++ hparam parsing

Branch: Granite4Vision
AI-usage: draft (Claude Code)
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Flatten pinpoints in conversion

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add missing break

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: No reason to have modality prefix for img_pos

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add tensor loading

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(convert): Fix confusion between proj.norm and proj.qformer.layernorm

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use the right portion of speech for tensor loading!

Also plumb through the layernorm -> post_norm naming change

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add logging of deepstack_layers_arr if set

I also changed the print_f output type to int32_t to avoid printing
overflow values for -1. This could cause overflows on the other side, but
I can't imagine a value for any of the current array hparams that would
trigger that.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Make sure input embeddings are cont before f_embedding_scale

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add init and mmproj_embd cases for g4v

The n_mmproj_embd is 1+ to make space for the text embedding and all 8
projectors

Branch: Granite4Vision
AI-usage: draft (Bob)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Invert (h, w) -> (w, h) pinpoints

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Reorder projectors based on llm index and skip the first injection

The multi-projector stack has a strange asymmetry based on how it's
currently implemented for qwen3vl: on the mmproj side, it's all N
projectors, but the output of the "first" (by inp_embd index) projector is
automatically consumed as if it were a standard single-projector mmproj,
so the deepstack portion needs to only contain the 1-N entries.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>

* fix: Fix mmproj hparams in conversion

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>

* fix: Fix ordering/logic for deepstack injection in granite

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>

* fix: Fix preprocessing config to match what the model needs

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>

* wip: Partial port of Eli's implementation

This is still pretty broken, but it's getting closer. It now happily
generates tokens, but the values are quite incorrect still. I suspect it's
caused by the mapping of projectors from safetensors to their respective
orders here.

Also, this implementation breaks encapsulation pretty badly in mtmd_encode.
This will need a big refactor to put the G4V-specific encoding logic
somewhere more appropriate.

Branch: Granite4Vision
AI-usage: draft (Claude Code, Bob)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>

* fix: Fix the pre-scaling on the input embeddings to correctly invert the scale

We've got tokens! They still don't line up quite right, so something's a
little off, but we're getting much closer now.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: invert embedding multiplier -> base_scale at load

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix setting image_resize_pad after new enum introduced

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add G4V to mmproj mapping in conversion

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Re-add padding disable for non-hybrid hybrid models

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Simplify G4V n_tokens computation

This is slightly more efficient and flexible for when we implement the
unpad cropping. IMO, it's also clearer that it is adding the number of
image_newline tokens (embeddings) to the grid, rather than recomputing the
entire count.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add new clip APIs for post-tile-encoding assembly

Granite 4 Vision uses llava-next style pack-and-unpad which requires
injecting the learned newline after each row of the tile grid. A row here
is a single row of the grid which is composed of (grid_x * cols_per_tile) *
(grid_y * rows_per_tile), so the result is newlines injected in between
individual tile rows, thus not something that can be handled with the
standard llava-uhd block-wise endcoding.

Branch: Granite4Vision
AI-usage: draft (Claude Code + Opus 4.7)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add model interfaces for granite 4 vision assembler

I'm on the fence about the best organization of this. These free functions
allow the per-architecture logic in clip.cpp to access the model-specific
graph building, but they still require a fair bit of model-specific logic
in clip.cpp which is not ideal.

I think a better approach may be to replicate what is done with the
graph builders themselves (and possibly even make the assembler part of the
model's existing graph builder).

Branch: Granite4Vision
AI-usage: full (Claude Code + Opus 4.7)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove all g4v-specific branching from mtmd.cpp in favor of clip assembler

Branch: Granite4Vision
AI-usage: full (Claude Code + Opus 4.7)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor(mtmd): Consolidate assembler logic into clip_assembler class family

Just like `clip_graph` is the base class for building the model-specific
encoder graphs, `clip_assembler` will be the base class for building the
model-specific assembler graphs. This allows the assembly pattern to follow
how the encoder pattern is implemented where the model-specific logic lives
in a subclass co-located with the encoder graph builder that gets
constructed by a simple factory method.

Branch: Granite4Vision
AI-usage: full (Claude Code + Opus 4.7)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: Comment improvement

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: granite_vision -> granite4_vision

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove dead codepath for Qwen3VL add_vision_is_deepstack

These pieces were never used on the c++ side (removed there in an earlier
commit), so this is just cleanup that I missed before.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Oops! I did not mean to commit one of my prompt files

But now it's too far back in history to effectively rebase out, even with
interactive and --rebase-merges :(

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add missing <algorithm> include for std::find

It seems that this was already pulled in on some platforms, but not on
others

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix Flake8 warnings in granite conversion module

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove clip_assembler in favor of clip_image_f32.append_token

Per conversation in the PR, the clip_assembler pattern was too invasive.
This is a compromise that limits model-specific blocks to add_media where
each preprocessed tile is annotated with an injection type, after which all
the token counting logic is generic and the newline injection itself is
handled in the graph based on the value for the given tile image.

Branch: Granite4Vision
AI-usage: draft (Bob, OpenCode + Qwen 3.6 35b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor(convert): Split n_deepstack_layers and deepstack_layers (array)

Branch: Granite4Vision
AI-usage: full (Bob, OpenCode + Qwen3.6-35b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor(src): Handle n_deepstack_layers and deepstack_layers GGUF keys

Branch: Granite4Vision
AI-usage: draft (Bob, OpenCode + Qwen3.6-35b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix GGUF key for deepstack_layers_arr

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove pre-scaling embeddings and skip scaling for raw embd inputs

This follows how gemma3 and gemma4 handle embedding scaling by skipping the
multiplier for raw input embeddings.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: deepstack_layers(_arr) -> deepstack_mapping(_arr)

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Fully revert changes to n_deepstack_layers and qwen3vl*

Since we're going to keep the GGUF KVs separate, it makes sense to just
keep the hparams separate too to limit the scope of this branch. The down
side is that n_deepstack_layers and deepstack_mapping_arr are potentially
conflicting.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Revert removal of "is_deepstack_layers" GGUF KV

This KV is not used at all on the c++ side, so it's fully dead, but there's
also no need to conflate this cleanup with the addition of G4V.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unnecessary ggml_cont and build_forward_expand in cbx

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: Clean up comments

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Tighter and more flexible code for g4v_build_block

This could be refactored to look a lot more like granite-speech, but the
overall block constructs before/after the qformer are pretty different, so
for now I'm going to leave it as is and just tighten a bit.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unnecessary `unordered_set` include

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add architecture guard on deepstack_mapping_arr printout

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unnecessary AI-gen comment

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Always initialize deepstack_mapping_arr with -1 values

This was causing `test-llama-archs` to fail, likely due to trying to save
the uninitialized values, then re-loading them. It's safer to always
initialize so that other models don't forget and end up with undefined
behavior.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: Remove TODO about block/vs non-block tensor mapping

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Move is_vision_feature_layer logic into clip_hparams

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Use a bool for append_token

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: Remove unnecessary comment

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unused get_model api

yikes!

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Rearrange helpers for g4v to be private members and use build_attn

Branch: Granite4Vision
AI-usage: full (Bob, OpenCode + Qwen3.6-35b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix off-by-one in vision layer index

This was inherited from the Claude Code implementation that pushed the
negative index inversion down into the model file.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix norm/post_norm mixup in conversion

face. palm. :(

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: More descriptive tensor names

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Apply PR cleanup for new conversion changes

AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix(convert): Remove duplicate V_ENC_EMBD_IMGNL

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: append_token -> add_newline

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: Comment cleanup

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Cleaner error handling/checking

NOTE: format_string is not available in granite.cpp (and including
clip-impl.h to get it doesn't compile, so I think it violates the intended
encapsulation), so std::stringstream is the simplest answer.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* model: fix build failed (ggml-org#24193)

* vulkan: add fwht support for Intel with shmem reduction (ggml-org#23964)

* vulkan: add fwht support for Intel with shmem reduction

* don't use N as workgroup size

* disable subgroup shuffle on MoltenVK AMD

* disable fwht shader on Intel Windows due to driver bug

* common/chat : unify and fix LFM2/LFM2.5 tool parser (ggml-org#24178)

* opencl: improve get_rows, cpy, concat and q6_k flat gemv (ggml-org#24160)

* opencl: allow multiple workgroups for large rows

* opencl: improve small cpy

* opencl: packed concat for small input

* opencl: tweak flat q6_K gemv, increase N_DST and remap threads

* context : fix off-by-one comparisons to n_gpu_layers (ggml-org#24208)

* model : rename local n_layer_all variable (ggml-org#24209)

* vulkan: check coopmat2 features before reporting support (ggml-org#24186)

* mtmd, server: add "placeholder bitmap" for counting tokens , add */input_tokens API (ggml-org#23913)

* mtmd: add "placeholder bitmap" for counting tokens w/o preprocessing

* fast path skip preproc for placeholder

* fix build

* correct the api

* add server endpoint + tests

* add object name

* update docs

* add proxy handling

* fix build

* fix audio input path

* use is_placeholder in process_mtmd_prompt()

* nits

* nits (2)

* docs: clarify chat/completions/input_tokens is not official

* fix merge problem

* completion : fix format specifier in LOG_INF (ggml-org#24213)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* completion : remove useless statics (ggml-org#24226)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* mtmd: support "frame merge" for qwen-vl-based models (ggml-org#21858)

* feat: add video support for Qwen3.5

* various clean up

* revise the design

* fix llava-uhd case

* nits

* nits 2

---------

Co-authored-by: andrewmd5 <1297077+andrewmd5@users.noreply.github.com>

* common/chat : fix LFM2/LFM2.5 reasoning round-trip and <think> leak (ggml-org#24234)

* common/chat : fix LFM2 reasoning round-trip and stray <think> leak
* Gate by reasoning format and whether the template supports <think>

* docker : bump cuda13 to 13.3.0 (ggml-org#24228)

* convert : fix Gemma4 with no audio encoder (ggml-org#24242)

* arg: Skip mmproj download when user supplied mmproj (ggml-org#24239)

* chore(sync): adapt DFlash to hparams.n_layer() method post-ggml-org#24060

src/models/dflash.cpp had three direct uses of `hparams.n_layer`. The
upstream hparams refactor (ggml-org#24060) turned that into a
method `n_layer()` (effective count, excludes nextn layers).

DFlash drafter has no nextn layers, so `n_layer()` and the raw field
`n_layer_all` are numerically equal — picked `n_layer()` to match the
new accessor convention. Behavior-preserving.

* llama: Gemma 4 MTP

* fix multi-seq

* add assert that draft + shared kv should be on same device

* add Q rot when cache is quantized

* add temp hack to not use fit with gemma4, rm later

* add exception in test-llama-archs

* move assistant to separate file

* add unified assistant

* cont : adjust to hparams changes

* cont : avoid computations on the CPU

* cont : clean-up

* cont : clean-up

* cont : fix handling of unused tensors

* cont : fix undefined

* fix typo

* cont : enable gemma4 graph reuse

* cont : fix assert

* cont : fix quantized cache

* cont : fix names

* cont : fix names

* cont : add reference for draft positions

* cont : fix multi-modality

* cont : add comment about ctx_src

* cont : clean-up server fit logic

* cont : clean-up llama_context

* py : fix names

* cont : rename ctx_src -> ctx_other

* chore(sync): drop intermediate llama_set_mtp_source call

The first PR ggml-org#23398 commit added an `llama_set_mtp_source(ctx_dft, ctx_tgt)`
call after `llama_init_from_model`. Later cleanup commits in the same PR
removed that API and moved the wiring to `cparams.ctx_other = ctx_tgt`
set BEFORE init. Our keep-both resolution carried the intermediate call
forward; this drops it to match the PR's final API.

Drops 1 use of removed symbol, no behavior change (the rebased
cparams.ctx_other assignment is what's actually used).

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Co-authored-by: Gerard Martinez <gmarzjr@proton.me>
Co-authored-by: Pascal <admin@serveurperso.com>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Yongyue Sun <abioy.sun@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Kartik Sirohi <99896785+sirohikartik@users.noreply.github.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: forforever73 <63285796+forforever73@users.noreply.github.com>
Co-authored-by: lvyichen <lvyichen@stepfun.com>
Co-authored-by: MagicExists <106458387+gugugiyu@users.noreply.github.com>
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Co-authored-by: Bartowski <3266127+bartowski1182@users.noreply.github.com>
Co-authored-by: viggy <70774793+vignesh191@users.noreply.github.com>
Co-authored-by: Mason Milburn <masonmilby@gmail.com>
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
Co-authored-by: Charles Xu <charles.xu@arm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Mario <191101255+wariuccio@users.noreply.github.com>
Co-authored-by: therealkenc <therealkenc@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Ruben Ortlam <rortlam@redhat.com>
Co-authored-by: Tarek Dakhran <tarek@liquid.ai>
Co-authored-by: lhez <lih@qti.qualcomm.com>
Co-authored-by: Adrien Gallouët <angt@huggingface.co>
Co-authored-by: andrewmd5 <1297077+andrewmd5@users.noreply.github.com>
Co-authored-by: konradmb <konradmb@o2.pl>
Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
…#94)

The MTP draft memory-probe path (server-context.cpp ~line 856) creates
a throwaway llama_context with `cparams.ctx_type = LLAMA_CONTEXT_TYPE_MTP`
to measure context+compute bytes for fit_params. For the Gemma4-Assistant
arch, this throws because `cparams.ctx_other` is required and the target
context doesn't exist yet at probe time — upstream's own
src/llama-context.cpp init explicitly notes "this is normal during memory
fitting" in the exception message and carries a TODO to switch to a typed
llama_exception so the warning can be skipped.

Until that upstream change lands and flows in via a master sync, scan the
exception message for the self-identifying "normal during memory fitting"
marker and downgrade WRN -> DBG for that specific case. Real failures
(model load failed, etc.) still surface as SRV_WRN.

Eliminates the misleading "[spec] failed to measure draft model memory:
failed to create llama_context from model" line that appears on every
gemma-4-12b-qat-mtp pod start despite the model loading + running fine
at ~110 tok/s (verified Phase 6 on titan, image
unified-llm:mtp-pr23398-5e6dff22).

Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
* scripts(pascal): P5200 build notes + bench harness + Vulkan baseline

Working notes for getting ht-llama.cpp running on the Quadro P5200
(Pascal sm_61, 16 GB). Toolkit wall: CUDA 13 dropped sm_61, so CUDA
backend requires aur/cuda-pascal 12.9.1 + gcc14. Driver 580 still runs
sm_61 binaries fine.

Vulkan baseline (Llama-3.1-8B Q4_K_M, ngl=99, fa=0, build f6feddb):
  pp128  269 t/s  pp512  278 t/s  pp2048 251 t/s
  tg32    35 t/s  tg128   35 t/s

CUDA results pending cuda-pascal install (gcc14 source build dominates).

Untracked primer (quadro-p5200-llamacpp-primer.md) referenced as the
source for the FP16-1/64-FP32, INT8 DP4A, and ggml-org#7188 FA-fix facts.

* scripts(pascal): CUDA backend bench results + complete install recipe

CUDA 12.9 toolkit built and benched on the Quadro P5200 (Pascal sm_61).
Five obstacles climbed on stock Arch:
1) CUDA 13 dropped Pascal       → installed 12.9 from runfile --extract
2) Runfile libxml2.so.2 missing → bypassed installer with --extract
3) gcc-15/16 too new for nvcc   → gcc-14 from archlinux-archive
4) gcc14 AUR source-build slow  → 51MB binary pkg.tar.zst (30s install)
5) glibc 2.43 cospi/sinpi clash → noexcept(true) patch on CUDA math.h+hpp

Full recipe in scripts/build-pascal-p5200.md.

Bench summary (Llama-2-7B Q4_0, ngl=99, build 5159fee, P5200 sm_61):
  CUDA fa=1: pp512=795, tg128=45.8 t/s
  Vulkan:    pp512=418, tg128=43.0 t/s
  CUDA wins pp ~1.9x, tg within 7% (bandwidth-bound).

ggml_cuda_init confirms: compute capability 6.1, VMM yes, GGML_CUDA_FORCE_MMQ
baked in (visible in nvcc cmdline). CC6.1 + MMQ-only + no cuBLAS fallback =
the INT8 dp4a path is what is running.

JSON artifacts committed alongside for replay/comparison.

* scripts(pascal): packaging recipe — rpath-clean runtime tarball for Omarchy ISO

Adds §7 Packaging covering the cmake-install + patchelf + symlink-chain +
stage-and-tar pipeline that produces pascal-cuda-artifacts.tar.zst (the
runtime fast-path consumed by hai-os-dev's autoinstall). Also drops the
stale "TODO — fill in once build-cuda completes" placeholder and moves
Sources to the true end of the doc.

Recipe reproduces the verified-clean tarball: rpath stripped on all
installed targets, libllama/libllama-common copied + patchelfed, symlinks
recreated, members rooted at opt/ for `tar -C / -xf` extraction, ld.so.conf.d
snippet documented so no LD_LIBRARY_PATH is needed at runtime.

* scripts(pascal): correct §7 tarball size + add reference sha256

Was: prose-estimate "~810 MB before zstd, ~470 MB after" — actual is
~816 MB unpacked, 512 MB compressed (110 members). Adds the reference
sha256 from the verified crystal build for hai-os-dev to byte-check
against. Notes zstd non-determinism so re-runs are expected to differ.

* scripts(pascal): field primer + Omarchy autoinstall handoff guide

Round out the Pascal/P5200 enablement bundle (PR #99) with the two
human-facing companions to scripts/build-pascal-p5200.md:

- quadro-p5200-llamacpp-primer.md: Pascal/GP104 + llama.cpp field guide
  (the two facts that drive every decision, CUDA vs Vulkan, measured
  1080-parity numbers, 16 GB VRAM sizing, optimization checklist).
- quadro-p5200-omarchy-autoinstall.md: 7-question handoff guide for
  hai-os-dev — extra packages (no AUR), CUDA-12.9 runfile pin, build
  flags, the five obstacles + fixes (glibc 2.43 noexcept patch incl.),
  pre-build at image time, HaiOS integration points, verified
  512 MB / sha256 0efed65... reference tarball, measured baseline.

Both docs reference the canonical recipe at scripts/build-pascal-p5200.md
and the verified tarball cached at crystal:/home/me/pascal-cuda-artifacts.tar.zst.

* scripts(pascal): v2 build flags (server+router) + Gemma4 MTP bench JSONs

Recipe update: add -DLLAMA_BUILD_SERVER=ON + -DLLAMA_BUILD_TESTS=OFF to
the CUDA configure step. Required for the llama-app unified router
(bin/llama) to link — without server-on, libllama-server-impl.so is not
built and llama-app link fails with `cannot find -lllama-server-impl`.
Also required for Gemma4 MTP: ctx_other wiring for the Gemma4Assistant
draft class lives only in tools/server/server-context.cpp; the
standalone llama-speculative-simple binary segfaults with
"Gemma4Assistant requires ctx_other to be set".

Rationale block also captures the spec-decode footgun: --spec-type
defaults to `none`, so -md <draft> alone is silently ignored. Must pass
--spec-type draft-mtp to engage. The /props
default_generation_settings.params["speculative.types"] field is
per-REQUEST sampler default, NOT the server engine state — the
canonical engagement read is server stderr (draft acceptance line +
statistics draft-mtp: ... summary).

Bench JSONs (crystal Pascal P5200, Gemma4 12B QAT Q4_K_XL, sm_61 CUDA
FORCE_MMQ, -fa on, -ngl 99, ctx 4096, greedy temp=0/top_k=1):

  baseline (no MTP, llama-bench):
    pp128=465.71 t/s, pp512=456.37 t/s
    tg32=25.54 t/s,  tg128=25.54 t/s  (flat — bandwidth-bound)

  MTP A/B via `llama serve` /completion (degenerate "0"×128 output):
    A baseline (--spec-type none):     25.26 t/s
    B MTP (--spec-type draft-mtp):    103.72 t/s   ← 4.11× CEILING
    draft acceptance: 1.00 (118/118)  — trivially predictable, not deployment

  MTP A/B via /v1/chat/completions (non-degenerate, 256 tokens):
    A baseline: 25.18 t/s
    B MTP:      76.06 t/s   ← 3.02× REPRESENTATIVE greedy speedup
    draft acceptance: 0.7627 (225/295)
    bit-identical content sha A vs B (greedy lossless property)

All three regimes labeled in the JSON so 4.11× isn't quoted as the
deployment number — the representative ~3× greedy or the
memory-recorded titan 1.66× (default sampling) are the honest reads.

* scripts(pascal): v2 server/MTP docs — §6/§7 scope flip + Gemma4 MTP numbers

Follow-on to 3662be4 (v2 build flags). Lands the doc side of the
LLAMA_BUILD_SERVER=ON v2 build into the two human-facing companions.

omarchy autoinstall guide §6:
- v1/v2 tarball table: v2 = pascal-cuda-artifacts-v2-server.tar.zst,
  sha 2528d952..., 515.5 MB, 121 members, server+router scope. v1
  (0efed65..., untouched) stays valid for non-serving bakes; v2 is the
  additive serving-capable successor, not a recall.
- serving footgun: --spec-type defaults to `none` (-md silently ignored);
  engagement proof is server stderr, not /props.
- Gemma4 MTP results, three clearly-labeled regimes (lossless A/B):
  4.11x degenerate ceiling / 3.02x representative greedy (headline) /
  1.66x sampling deployment ref.

build-pascal §7 packaging:
- version the tarball filename; never overwrite a live pull source.
- v1/v2 size+sha table.
- reconcile the stale "router not in this tarball" section to v2 reality:
  member-delta (+11), single-.so impls, lib64 prune, extraction-validate.
- note that bin/llama-server / bin/llama-cli are separate targets, not in
  llama-app's dep closure (reproducing v2 needs them in --target).

Also folds in a one-line build-target fix (line 80: add llama-server +
llama-cli to --target) that landed in the shared tree from the
fork-manager session concurrently — verified correct, kept so the recipe
reproduces v2.

* scripts(pascal): #100 bullets 1-3+6 bench evidence — server, router, gpu-only, vision+MTP

Closes 4 of 10 issue #100 bullets on Pascal P5200 (v2 server-capable build):

- bullet 1 (llama-server): standalone /opt/ht-llama-cuda/bin/llama-server
  → ready in 4s, /health 200, /completion 40 tok @ 25.84 t/s
- bullet 2 (llama-server router): unified `bin/llama serve` shim
  → ready in 4s, /health 200, /completion 40 tok @ 25.82 t/s
- bullet 3 (gpu only works): both runs above use -ngl 99 -fa on
- bullet 6 (gemma4 12b qat mtp all modalities): combined mmproj +
  draft-mtp + spec engine
  → A. coexistence: /v1/chat with image_url + --spec-type draft-mtp
    engaged → predicted=96, stderr draft acceptance = 0.66102 (78/118)
  → B. grounding (decoupled to mtmd-cli, avoids Gemma4 chat-template
    empty-content quirk): all 3 ground-truth features matched (PASCAL,
    P5200, red rectangle); requires --jinja (otherwise std::runtime_error
    custom-template-not-supported abort).

Methodology:
- regression band ±3% pinned vs committed baseline 25.54 tg / 76.05 MTP;
  both server-router runs in band (24.77-26.31 t/s window).
- engagement read on stderr (draft acceptance / draft-mtp stats), NOT
  /props (per --spec-type footgun memory).
- chat-content quirk explicitly noted in JSON so empty content does not
  read as fail or regression.

Bullets 4 (gpu+cpu offload) + 7-10 (qwen 27B/35B-MoE / gemma 26B/31B)
land in subsequent commits once the lithium IQ3-class + titan 31B IQ4_XS
transfers complete on crystal.

* scripts(pascal): #100 regression rerun + nit fold-ins (slug form, cross-harness note)

Regression bench (task #15): re-ran the gemma4-12B-QAT bench from regime-2 on
the v2 server-capable build to lock the no-regression gate against the
committed baseline (25.54 t/s) and MTP reference (76.05 t/s).
- Baseline /v1/chat greedy: mean 24.96 t/s across 3 reps (-2.31% vs 25.54),
  in band.
- MTP /v1/chat greedy:  mean 75.07 t/s across 3 reps (-1.29% vs 76.05),
  in band.
- Draft acceptance: 0.76271 — bit-identical to committed regime-2
  (225/295 accepted/generated). Strong determinism proof.

Fold-in nits on the 2 already-committed JSONs (no dedicated fix commit per
crystal-assist's review):
- Memory slug citations switched from hyphen-form to underscore-form to
  match the actual slug names (feedback_spec_type_footgun,
  reference_mtmd_cli_jinja_required) — resolves to exact-match in tooling.
- bench-pascal-server-router-smoke.json: added cross_harness_note clarifying
  the +1.2% server-endpoint vs llama-bench tg128 agreement STRENGTHENS
  the no-regression claim (different harnesses, in band).

* scripts(pascal): #100 bullets 8, 4, 10 bench evidence + regression nit fold-ins

Three model bench JSONs from the v2 server-capable Pascal build:

bench-pascal-qwen3.6-27b-iq3-xxs.json (bullet 8): Qwen3.6-27B at
UD-IQ3_XXS (11.17 GiB), -ngl 99 -fa on -c 4096 → mode=full-gpu,
65/0/65 layers, gpu_residency_pct=95.45%. /completion 11.27 t/s,
/v1/chat 10.44 t/s, gpu free 4 GiB after load. Content reply:
"The capital of France is Paris." (qwen3.6 thinking mode active).

bench-pascal-gemma4-31b-iq4-xs-offload-{ngl40,ngl48}.json (bullets 4 + 10):
ngl=40 phase-1 → ngl=48 phase-2-verify accelerator (crystal-assist's
recipe): per_layer_combined = (gpu.model + gpu.context) / layers_gpu
at ngl=40 = 309 MiB; ngl_max = 40 + floor((2954 - 400) / 309) = 48.
Phase-2 verify PASS at -ngl=48: 49/62 layers GPU, 13/62 layers CPU,
4.95 t/s /completion, 4.98 t/s /v1/chat. Dense-layer partial-offload,
host_model=3967 MiB, host_context=768 MiB. Card 96% utilized.
gemma-4-31B-IQ4_XS is the smallest 31B quant available anywhere on
titan or lithium (sweep done by crystal-assist) — confirms 31B = the
documented offload-demo model, closes bullets 4 AND 10 in one bench.

Regression rerun JSON: minor wording fix — the bit-identical content
sha 01ba4719c80b6fe9 is sha256(b"\n") (single newline), not empty
string or null. Banks the harness blind-spot that hashing `jq -r .content`
output cannot distinguish JSON-null vs "" vs literal "null" vs "\n".
A==B determinism conclusion stands (per crystal-assist review).

* scripts(pascal): #100 bullets 7 + 9 bench evidence — qwen 35B MoE + gemma4 26B MoE

Closes the last two model bullets:

bench-pascal-qwen3.6-35b-a3b-iq3-xxs.json (bullet 7): Qwen3.6-35B-A3B
(MoE, 3B active) at UD-IQ3_XXS (12.30 GiB), -ngl 99 -fa on -c 4096 →
mode=full-gpu, 41/0/41 layers, gpu_residency_pct=96.85%. /completion
44.45 t/s, /v1/chat 40.96 t/s — fastest of any tested model (3B active
keeps per-token compute light). Content reply: "The capital of France
is Paris." VRAM 13003/16384 MiB after load (3 GiB headroom).

bench-pascal-gemma4-26b-a4b-iq4-xs.json (bullet 9): Gemma4-26B-A4B
(MoE, 128 experts / 8 active per token) at UD-IQ4_XS (12.66 GiB),
-ngl 99 -fa on -c 4096 → 31/0/31 layers on GPU, /completion 42.10 t/s,
/v1/chat 42.36 t/s. Content reply: "### Answer: The capital of France
is Paris." VRAM 14345/16384 MiB after load.

Classifier note (banked in JSON): the 26B host_model=748 MiB tripped
the harness's 600-MiB expert-MoE threshold. Inspection of the gemma4
config (vocab=262144, hidden=5120, IQ4_XS bytes/weight) confirms 748
is the embedding tensor + boundary buffers (≈671 MiB pure embedding),
NOT expert offload — all 128 experts are in gpu.model_mib=12952.
PRIMARY layer-count signal (31/0/31) correctly reads full-GPU. The
600 MiB threshold was calibrated to 12B embeddings (~540 MiB) and
under-scales for larger vocab×hidden_dim products. Mode patched to
full-gpu with classifier_note explaining the misfire + suggested
remediation (host_model_pct_of_total < 10-15% = embedding-pattern;
≥ that = real expert offload).

All 10 issue #100 bullets now have committed bench evidence.

---------

Co-authored-by: marksverdhei <marksverd@gmail.com>
Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
… default (#101)

The heierchat snapshot at tools/server/public/ was wired into llama-server
as priority 1 in scripts/ui-assets.cmake, overriding the LLAMA_USE_PREBUILT_UI
HF-bucket download even though that flag defaults ON. The snapshot expects a
heierchat-app backend that isn't present on bare llama-server, which causes
the embedded webui to hang on "Initializing connection to heierchat server…"
when hit at llama-server's root.

Now that heierchat is a standalone product talking to llama-server over its
OpenAI-compatible API, the embedded UI reverts to the upstream llama.cpp
default — fetched as a prebuilt bundle from the llama-ui HF bucket at build
time via the existing tools/ui/ cmake scaffolding (no nodejs required, no
fork-side build).

Behaviour after this commit:
- copy_public_dist priority 1 → MISS (public/ removed)
- copy_src_dist priority 2 → MISS (tools/ui/src/ has no svelte sources)
- BUILD_UI npm priority 3 → no-op (no source to build)
- HF_ENABLED priority 4 → downloads upstream llama-ui bundle, embeds via
  llama-ui-embed (C++, host-compiler only)

Docs updated to reflect the new arrangement. The copy_public_dist cmake
function is preserved as a manual override for local experimentation.

Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
…tor (#103)

The upstream sync (#93) vendored the llama_hparams getter refactor:
n_layer is now a getter returning n_layer_all - n_layer_nextn, with
the settable member renamed. test-dflash still assigned the getter,
breaking every CUDA/cpu CI build since 2026-06-07.

n_layer_nextn defaults to 0, so n_layer_all = 5 preserves the tests'
intent exactly. Verified: test-dflash compiles and all cases pass.

Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
… logprobs partial-sort (#102)

* cuda(fattn): vec kernel instances for D == 512 with matched quantized KV types

Gemma 4 global attention layers (head size 512) previously always dispatched
to the tile (pre-Volta) or MMA (Turing+) kernels, both of which require K/V
dequantized to F16 -- with a quantized KV cache that staging pass re-reads
and re-writes the entire per-layer KV every decode step.

Add vec kernel instances for D == 512 with matched q4_0/q8_0 KV types (the
vec kernel reads quantized KV directly) and dispatch to them for the small
batch sizes the vec kernel already owns on each arch. Gated on matched
quantized types and logit_softcap == 0 (vec only compiles softcap variants
for D == 128/256).

test-backend-ops previously had no quantized-KV FA coverage above head size
72; add Gemma4-shaped hs=512 cases (q8_0/q4_0, GQA, nb 1/2/3/32, sinks).
All 2899 FLASH_ATTN_EXT cases pass on CUDA (sm_86) vs CPU reference.

* server: avoid full-vocab sort when computing token probabilities

get_token_probabilities() sorted the entire vocabulary (262k entries for
Gemma) by logit on every emitted token when n_probs > 0, and the caller
then linearly scanned the sorted vector again to find the sampled token's
probability.

The softmax normalization only needs the max and the sum of the logits --
both O(n) without sorting. Select the top n_probs tokens with a partial
sort and return the sampled token's probability directly from the same
pass: O(V log V + V) per token becomes O(V + k log k).

No output change: same top-k ordering, same normalization over the full
candidate set.

* tests: add D == 512 quantized-KV FA perf cases

Gemma 4 global-attention decode shapes (D=512, GQA=4, nb=1, q8_0/q4_0 KV)
for test-backend-ops perf mode. RTX 3090 (sm_86), MMA+dequant -> vec:

  kv=4096  q8_0: 155.0 -> 76.5 us/run (2.03x)   q4_0: 150.4 -> 90.9 us/run (1.65x)
  kv=8192  q8_0: 302.5 -> 145.3 us/run (2.08x)  q4_0: 277.8 -> 163.1 us/run (1.70x)
  kv=16384 q8_0: 558.5 -> 286.1 us/run (1.95x)  q4_0: 533.5 -> 298.3 us/run (1.79x)

* cuda(fattn): restrict D == 512 vec dispatch to gqa_ratio <= 4

The vec kernel re-reads K/V once per Q head; tile/MMA amortize K/V
reads across Q heads via the GQA optimization (at the cost of a
dequant-to-F16 staging pass for quantized KV). Measured crossover on
both sm_61 (TILE baseline) and sm_86 (MMA baseline): vec wins ~1.4-2.0x
per-op at gqa_ratio <= 4, but loses 1.1-2.5x at gqa_ratio == 16 -- the
Gemma 4 global-attention deployment shape (MQA, n_head_kv == 1).

Adds the deployment shape (nh=1, nr=16) to correctness and perf test
cases so the dispatch decision stays measurable.

---------

Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
#74)

All SYCL jobs in this workflow are commented out upstream (PR
ggml-org#23705) to save Actions resources. With the upstream
auto-triggers (push to master, pull_request matching ggml/src/ggml-sycl/**)
still firing, every fork-sync that touches master creates a zero-job
'failure' run, polluting our Actions dashboard.

Strip the auto-triggers on ht; keep workflow_dispatch so the job stays
manually invokable if/when upstream re-enables SYCL CI. Will need to be
re-applied after future master syncs that pull in the upstream version
of this file.
Same pattern as PR #74 (build-sycl.yml): upstream PR ggml-org#23705 commented
out every job in this workflow, but the push:master + pull_request
auto-triggers stayed live. Any PR touching ggml/src/ggml-cann/** would
create a zero-job "failure" run.

Keep workflow_dispatch so the workflow stays available for manual
runs once jobs are re-enabled (which requires provisioning dedicated
runners per the upstream TODO).

Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
… embedding example path (#76)

Epoch #73 task 3 (docs review on tools/server/README.md).

Three broken in-tree links found:

* `chat.mjs` and `chat.sh` — both removed upstream by
  ggml-org#23870 (server: remove obsolete scripts). README
  still referenced them under '## More examples / Interactive mode'.
  Drop the entire section since both samples are gone; the OpenAI-compat
  endpoints documented elsewhere in this file cover the same surface.

* `../embedding` — embedding moved from `tools/embedding` to
  `examples/embedding` upstream. Fix the relative path.
Epoch #81 task 5 (docs review). Both rows in the server-specific
params table reference qwen3-omni TTS support that does not exist:

* Neither --talker-model nor --code2wav-model is registered in
  common/arg.cpp; no C++ source mentions the strings "talker" or
  "code2wav" (only model-conversion code in conversion/qwen3.py
  references them as tensor name prefixes, which is unrelated).
* The /v1/audio/speech endpoint the help text promises is also
  absent — only /v1/audio/transcriptions is wired up via
  routes.post_transcriptions_oai in server.cpp:198. The /v1/audio/speech
  string appears only in the webui's TTS *client* code (it calls out
  to a separate OpenAI-compatible TTS server, not back to llama-server).

The vocoder-related flags that DO exist (--model-vocoder,
--tts-use-guide-tokens) are still documented elsewhere and unchanged.

Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
…t trace (#89)

Epoch #86 task 2 (docs review round 2 on README-dev.md).

The "Example trace of a request" section described two methods that
don't exist by those names: response->update() and response->to_json().
Greppers chasing those names find nothing. The real calls are
result->update(states[idx]) (inside server_response_reader::next at
server-queue.cpp:396) and result->to_json() (called from multiple
sites in server-context.cpp ~3854-3987).

Rewrites the two affected bullets to use the actual member names and
to thread the call sites through server_response_reader (which is the
component that does both calls — server_res_generator is what owns
the reader, but doesn't make these calls directly).

No other drift found in the doc — confirmed server_routes,
server_res_generator, launch_slot_with_task, task_result_state, and
handle_completions_impl all exist by those names in current source.

Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
The Backend & quantization table omitted two HT-specific speculative
decoding features that have shipped to ht:

- DFlash (LLM_ARCH_DFLASH, --spec-type dflash, custom CUDA kernels for
  partial-accept feature extraction) — landed via PR #62 (b0daec5),
  integrates the z-lab DFlash block-diffusion drafter against Gemma4
  31B targets.

- Gemma4 MTP (gemma4-assistant arch + --spec-type draft-mtp) — vendored
  via PR #93 (4c09765) ahead of upstream PR ggml-org#23398
  merge so the gemma-4-12b-qat-mtp preset can ship on titan. Marked
  with Tracked-upstream=ggml-org#23398 since it retires when that PR merges and
  flows through a normal master sync.

Found during a §7 documentation freshness sweep — the inventory exists
to be authoritative ("consult it before assuming a behaviour is
upstream stock" per AGENTS.md), so omissions defeat the purpose.

Docs-only, no code touched.

Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
…rt (#77)

Epoch #73 task 4 (bug-hunt on server-models.cpp).

Two race-condition latent bugs in router state machine. Both used
`std::map::operator[]` where `find()` was needed; silent default-insert
on miss produced false-positive predicate / phantom mapping entries.

## Bug 1: unload_lru() cv.wait predicate (line 762)

```
cv.wait(lk, [&]() {
    return mapping[lru_model_name].meta.status == SERVER_MODEL_STATUS_UNLOADED;
});
```

The default-init `server_model_meta` has `status = SERVER_MODEL_STATUS_UNLOADED`,
so this predicate is spuriously true if the model is missing — AND it
inserts a garbage entry into `mapping` as a side effect.

Race: another thread (or a reload) unloads/removes the model between
`unload(lru_model_name)` (which released its own lock) and our re-acquire.
Predicate returns true on the inserted default; we proceed thinking the
unload completed, but mapping is now polluted with a phantom entry.

Fix: `find()`; missing → treat as done.

## Bug 2: load() after cv.wait(!is_reloading) (line 779)

```
cv.wait(lk, [this]() { return !is_reloading; });
auto meta = mapping[name].meta;
if (meta.status != SERVER_MODEL_STATUS_UNLOADED) { ... return; }
```

`has_model()` was checked earlier under its own lock-cycle, then
`unload_lru()` cycled the lock, then we re-acquire here. If a reload
ran in that window and erased `name` from mapping, `mapping[name]`
silently inserts a default. The early-return is bypassed
(default status = UNLOADED) and we proceed to spawn a child with
empty preset args.

Fix: `find()`; missing → log + bail.

## Verified

- ✅ `cmake --build build --target llama-server` succeeds locally
- ✅ No behavior change on the non-racy path (find()+early-return on miss
   is operationally equivalent to operator[]+early-return on
   default-UNLOADED, minus the silent insertion)

## Out of scope

Other `mapping[]` sites in load() (lines 1020, 1031) are reachable only
while the lock is held continuously from the now-guarded read at 779, so
no race exists there. `mapping[name]` at line 1186 (`proxy_request`)
also reads under-lock after `get_meta()` confirmed presence.

Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
…ids (#83)

Bug-hunt round 3 finding (sweep on tools/server/server-queue.cpp).
server_response::remove_waiting_task_ids (multi-id) only erased entries
from waiting_task_ids — unlike its single-id sibling remove_waiting_task_id
which also dropped any pending entries in queue_results for the same id.

When a reader tears down mid-stream (HTTP client disconnect, abort, or
ordinary destruction during streaming with results still in flight), the
asymmetric multi-id path leaves partial results stranded in
queue_results. Those entries can never be matched by a future recv()
(no caller waits on those ids any more), but the predicate
`!queue_results.empty()` in recv()'s cv.wait still fires immediately,
so the next recv() for an unrelated task id spin-waits at 100% CPU
until that task's own result arrives and the for-loop finds it.

Add the same erase-if pass to remove_waiting_task_ids and drop the
now-redundant per-id remove_waiting_task_id calls from
server_response_reader::stop() — they were the only thing covering
this orphan case before, and only on the cancel branch.

Caller audit: remove_waiting_task_ids has exactly one production
call site (server_response_reader::stop), so the cleanup is
limited-blast-radius. The cancel-loop's redundant single-id calls
were 1 extra mutex acquire per cancelled task; removing them
recovers that as a minor side effect.

Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
…78)

Epoch #73 task 5 (DFlash perf scout, in lieu of titan-gated bench).

The top-5 logits selection at common/speculative.cpp:935 was
unconditional — `if (i == 1)` gated when (once per draft call) but not
whether. The LOG_INF below it is verbosity-gated, so on production the
log is suppressed, but the O(n_vocab * log 5) selection still runs.

On gemma-class vocabs (~256k tokens) the selection burns ~1ms per draft
call. At Round-10's measured ~8% accept rate, every output token costs
several draft calls — so this debug computation is in the steady-state
hot path.

Fix: extend the gate to `if (i == 1 && dflash_debug)`. `dflash_debug`
is the cached env-var probe already declared at line 883 (used by the
features-debug block immediately above). When LLAMA_DFLASH_DEBUG is set
the diagnostic still fires; production is unaffected.

Found during epoch #73 task 5 — DFlash hot-path scout. Local CPU build
verifies; behavior change only when LLAMA_DFLASH_DEBUG is set (was
unconditional → now gated; same code runs when enabled).

Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
… (#71)

Measured perplexity on Qwen3.5-0.8B-BF16 / wikitext-2 / ctx=512:

| cache-type | PPL    | vs f16 |
|------------|--------|--------|
| f16        | 19.08  | baseline |
| q8_0       | 19.08  | lossless |
| tbq3_0     | 1252.30 | 65x worse |
| tbq4_0     | 1393.00 | 73x worse |

TBQ KV-cache produces near-random output. Likely root cause is statistical:
TBQ's rotated-domain codebook was calibrated for weight distributions, not
the K/V tensor distributions seen during inference. The encoding scheme
itself cannot faithfully represent KV values.

Snoop-kube's cluster audit confirms zero deployments use tbq* KV-cache
(every host uses q8_0 or q4_0). DFlash also defaults to q8_0 (PR #65).
No production consumer exists.

This PR adds a one-line experimental note to the --cache-type-k/v and
--cache-type-k-draft/v-draft help text, referencing issue #70 for the
full data + recommendation. Code path stays in place — Markus may have
roadmap intent I'm not aware of; this just stops anyone reading --help
from assuming tbq* is a usable choice without checking.

Follow-ups if Markus prefers full removal:
* drop tbq3_0/tbq4_0 from common/arg.cpp's kv_cache_types list
* keep the ftypes (TBQ weight quantization is separate from KV use)
* close issues ggml-org#124 + ggml-org#125 as wont-fix
#82)

When generation hits `STOP_TYPE_LIMIT` (max_output_tokens / ctx-size cap),
the OAI Responses code paths hardcoded `"status": "completed"` everywhere
— top-level response, per-message output items, function_call items, and
the streaming `response.completed` SSE event. Agentic clients (Codex CLI,
etc.) couldn't tell a finished response from a truncated one and ended
up feeding partial output back into conversation history, triggering
infinite retry loops on JSON-parse failures (issue #19, Phase 2).

Per the OAI Responses spec, branch on the stop type in:

* `server_task_result_cmpl_final::to_json_oaicompat_resp()` — emit
  `"status": "incomplete"` on the top-level response, all output items
  inherit the same status, plus `"incomplete_details": {"reason":
  "max_output_tokens"}` at the top level.
* `to_json_oaicompat_resp_stream()` — same mapping on the per-item
  statuses, plus the final SSE event becomes `response.incomplete` (vs
  `response.completed`) with `incomplete_details` on the payload.

Doesn't address Phase 2 of the issue (HTTP 400 + actionable message
from `func_args_not_string`) — that requires typed exception plumbing
through common/chat.cpp into the server error path. Phase 1 alone
prevents the cascade in the first place: clients see truncation as
truncation, not as a malformed completed response.

Test coverage in test_compat_oai_responses.py:

* `test_responses_truncation_emits_incomplete_status` — non-streaming:
  `max_output_tokens: 2` on tinyllama2 reliably trips STOP_TYPE_LIMIT;
  assert status=incomplete + incomplete_details + per-item status.
* `test_responses_truncation_stream_emits_incomplete_event` — streaming:
  same setup, verify a `response.incomplete` event arrives with the
  same payload shape.

Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
* fix(server): per-slot byte cap on context checkpoints (closes #67)

The existing checkpoint cap is count-only (`--ctx-checkpoints`, default 32),
which lets a single slot accumulate ~20 GB of host-RAM checkpoints under
heierchat's long contexts and drives titan into SystemOOM (37 GB anon-rss
on the 46 GB / 3 GB-swap node).

Adds a per-slot byte budget:

* `--ctx-checkpoints-max-mib N` (env `LLAMA_ARG_CTX_CHECKPOINTS_MAX_MIB`),
  default 4096 MiB / slot, 0 = disabled (count-only legacy behavior).
* Eviction in `create_checkpoint` now FIFO-evicts until BOTH caps satisfy.
  Whichever cap bites first is reported via `reason=count|bytes` in the
  warning log so it's diagnosable from titan logs.
* The success log now also reports `slot total = X MiB / Y MiB cap` so the
  current footprint is visible per checkpoint create.

A 4 GiB-per-slot default bounds total host-RAM checkpoint use at
`n_slots * 4 GiB`. With heierchat's typical `--parallel 1` (DFlash gate)
that's 4 GiB worst-case; with `--parallel 4` it's 16 GiB — both well
under titan's 46 GiB.

Follow-up (snoop-kube discussed): a dynamic cap based on
`/proc/meminfo MemAvailable * 0.3` would adapt better than the fixed
default — left for a separate change once the byte-cap mechanism is
in production.

* test(server): unit tests for per-slot checkpoint byte cap (#68)

Adds tools/server/tests/unit/test_ctx_checkpoints_bytes_cap.py with
four scenarios for --ctx-checkpoints-max-mib:

* default-args: server starts; if checkpoints are created the
  "slot total = X MiB / Y MiB cap" footprint marker appears in the
  create_checkpoint log line.
* --ctx-checkpoints-max-mib 0: byte cap disabled, server starts fine,
  request succeeds (count-only legacy behavior).
* negative value: arg parser rejects, ServerProcess.start() raises.
* tiny byte cap + multi-turn chat: when eviction fires, the log
  reports reason=bytes. Skipped if tinyllama2 doesn't accumulate
  any checkpoints in the short conversation (rather than flaking).

ServerProcess gains three knobs for the existing/new flags:
n_ctx_checkpoints, checkpoint_min_step, ctx_checkpoints_max_mib —
all default to None (use server defaults) and only emit a CLI flag
when set, so existing tests are unaffected.

* fix(server): bail early from create_checkpoint when --ctx-checkpoints=0

Bug-hunt finding during the PR #68 review territory: the eviction loop
unconditionally calls slot.prompt.checkpoints.front() / .erase(begin())
based only on size >= n_ctx_checkpoints. When n_ctx_checkpoints is 0 and
the list is empty (the user-likely "disable checkpoints" intent), both
calls hit empty-container UB.

The arg parser accepts 0 without complaint and silently wraps negative
ints via the size_t cast to SIZE_MAX (which is also a no-op cap). Rather
than tighten the arg parser and risk breaking unknown callers, treat
n_ctx_checkpoints <= 0 as "checkpoints disabled" at the call boundary —
a sensible interpretation that's also what the negative-wrap was de facto
delivering.

Adds test_ctx_checkpoints_zero_disables_creation to the existing
test_ctx_checkpoints_bytes_cap.py: drives the server with
--ctx-checkpoints 0 and asserts no "created context checkpoint" line
ever fires while requests still succeed.

---------

Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
marksverdhei and others added 9 commits June 12, 2026 20:36
* scripts(dflash): Round-12 target-precision bench + parity scaffold + gguf guard

Three additive scripts for the DFlash accept-rate investigation (Round-12),
none touching tracked source so they sit cleanly alongside the PR #53 squash:

- gguf-meta.py: numpy-free GGUF header reader with --check-instruct, which
  refuses base-fine-tune and truncated/stub GGUFs. Prevents the base-vs-instruct
  confound (an -it-trained DFlash drafter benched against a base target).
- bench-dflash-target-sweep.sh: sweeps the TARGET quant (drafter fixed) to test
  whether target-side quant noise off the drafter's bf16 training distribution
  drives the 8% vs ~21% accept gap. Accept recomputed from raw n_accept/n_drafted
  counts; mean +/- sample stddev over N runs; REAL(>1sigma)/within-noise deltas.
- dflash-logit-parity.py: scaffold for FORWARD logit parity vs the z-lab PyTorch
  drafter (Round-7b only did weight parity). Constants read data-driven from the
  drafter config.json; reference forward marked TODO(zlab) pending the z-lab
  modeling code (HF repo ships weights only).

* scripts(dflash): gguf-meta --check-instruct rejects truncated tensor data

The guard validated the GGUF header but not that the tensor DATA was present, so
a file truncated mid-write (valid header, missing weights) passed --check-instruct
and would have been benched — loading garbage or crashing mid-run. Caught
empirically: the corrupt gemma-4-31B-it-Q5_K_M.gguf (1.5GB, header intact) slipped
through.

read_meta() now walks the tensor-info section, computes the minimum file size
implied by the tensor offsets + alignment, and sets _data_complete. --check-instruct
rejects when actual size < implied minimum. Same failure class as the HF-xet silent
shard drop the download step hit. Verified: corrupt Q5 (1.5GB < 21.7GB) REFUSED;
Q8_0/BF16/Q4_K_M/IQ4_XS all complete and ACCEPT.
* scripts(dflash): deployment-parity prompt suite + bench harness

15 prompts × 3 classes (MT-Bench / HumanEval / GSM8K, 5 each) targeting z-lab's
published Gemma τ table (MT 4.23 / HE 8.00 / GSM8K 7.53 at conc=1, BF16, block=16).
Greedy temp=0, --spec-draft-n-max=15, fixed seed for reproducibility.

bench-dflash-parity.sh runs the suite against llama-speculative-simple and
emits per-prompt {tau, n_accept, n_drafted, decode_tps} as JSON. snoop-kube
runs the SAME prompts against vLLM/SGLang with z-lab/gemma-4-31B-it-DFlash on
titan and emits the same shape — we diff cell-by-cell to localize the gap.

tau computed as n_predict / (n_predict - n_accept), the same convention as
z-lab's dflash_generate's acceptance_lengths.

* scripts(dflash): bench harness env knobs for CPU-offload + cluster protection

- DFLASH_PARITY_NGL / NGLD: override target/drafter -ngl (default 99).
  Needed to fit larger targets that don't fit a single 24G card; with -ngl 35
  the Q8_0 31B target loads alongside the 1.2G Q6_K drafter on one 3090.
- DFLASH_PARITY_TIMEOUT: per-prompt timeout (default 240s). CPU-offload runs
  for BF16 targets take minutes per prompt at low GPU layer counts.
- DFLASH_PARITY_THREADS: --threads cap. On centurion (etcd HA control-plane
  member) leave >=2 cores free so long CPU-offload runs don't add fsync
  latency that wobbles cluster heartbeat/leader-election.
- DFLASH_PARITY_NICE: nice -n prefix (0-19). Sets the bench at minimum
  priority on shared boxes.

Defaults preserve the prior full-GPU behavior; opt-in only.

---------

Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
…ep 1) (#69)

Adds an optional `out_bytes_per_device` output parameter to
`common_fit_params` (defaulting to nullptr so all existing callers are
unaffected). On SUCCESS, populates with the projected per-device byte
demand for the resolved plan — index 0 is the CPU device, indices 1..N
are GPU/accel devices in the same order as `tensor_split`.

This is the foundation for issue #66 (per-GPU-aware router fit). The
router currently admits candidates against the TOTAL VRAM pool via
`common_fit_params` + `--models-max`, ignoring that two models pinned
to CUDA0 collide on GPU0's 24 GiB even though the 48 GiB pool says fits.

Subsequent steps (separate PRs):
* (2) Track `reserved_per_device[]` in `server_models`, populated from
      this output on each load and decremented on unload.
* (3) Replace count-only `unload_lru()` with
      `unload_lru_for_devices(targeted, candidate_per_device)` that
      LRU-evicts from the constrained device set.
* (4) Wire admit decision against `free_device[d] - reserved[d]` per
      targeted device.

No behavior change in this PR — the new output is opt-in and the
existing planning logic is untouched. Both fit-impl return paths
(the early MoE-trivial path and the full-search path) populate
`out_bytes_per_device` identically with the final `mem` vector that
fit_impl already computes internally.
…#66 step 2 prep) (#72)

* feat(fit-params): --fit-print-plan emits per-device byte plan as JSON (#66 step 2 prep)

The router's per-GPU admit decision (#66) needs the per-device byte
demand for a candidate model BEFORE spawning the child subprocess.
PR #69 added the underlying `out_bytes_per_device` output to
`common_fit_params`; this PR exposes it via the existing
`tools/fit-params` CLI as a subprocess-friendly JSON output.

* New CLI flag `--fit-print-plan` (env `LLAMA_ARG_FIT_PRINT_PLAN`).
* On success, prints a single-line JSON object on stdout:
    {"per_device_bytes":[N0,N1,...],"n_devices":K,"total_bytes":T}
  plan[i] = i-th GPU/accel device, same order as tensor_split; CPU
  host memory NOT included. Empty plan for CPU-only builds.
* On fit failure, emits an explicit JSON failure marker and exits 1:
    {"error":"fit_failed","status":N}
* common/fit.cpp: populate `out_bytes_per_device` at the three early-
  return paths (the impl had three 'no changes needed' fast-paths that
  bypassed the main return point where PR #69 wrote the plan). Doc
  string in common/fit.h corrected — plan covers GPU devices only.

Designed to be subprocessed from `server_models::compute_admit_plan(name)`
(#66 step 2 — out-of-process approach per the architectural call on
issue #66 / task ggml-org#123). The router parses this JSON, tracks
`reserved[d]` for in-flight LOADING models, admits candidates against
`live_cudaMemGetInfo(d) - reserved[d]`. Mutually exclusive with the
existing `--fit-print` mode; if both set, `--fit-print-plan` wins.

Local CPU build verified: `--help` renders the new flag, empty plan
returned for CPU-only build as expected. GPU verification deferred to
snoop-kube's canary-cycle.

* test(fit-params): smoke for --fit-print-plan JSON output (PR #72 coverage) (#75)

PR #72 added the --fit-print-plan flag to llama-fit-params without test
coverage. This adds a tools/fit-params/tests.sh (pattern lifted from
tools/gguf-split/tests.sh) that downloads a small Qwen3-0.6B GGUF and
verifies six invariants:

1. success-path emits single-line JSON
2. schema has per_device_bytes / n_devices / total_bytes with correct types
3. len(per_device_bytes) == n_devices
4. total_bytes == sum(per_device_bytes)
5. on CPU-only builds (n_devices==0): plan is empty, total is 0
6. fit-failure (nonexistent model) emits the documented "error":"fit_failed"
   JSON marker on stdout (not garbage) so subprocess callers can
   distinguish fit-failure from parse-failure

Run with: tools/fit-params/tests.sh path/to/build/bin

Verified locally on CPU-only build: ALL fit-params --fit-print-plan
smoke tests PASSED.
Bug-hunt round 4 finding on tools/server/server-http.cpp.

The api-key middleware's allowlist (/health, /v1/health, /models,
/v1/models, /, /index.html, /bundle.js, /bundle.css) was matched
against req.path verbatim. With --api-prefix /llama, the routes are
registered at "/llama/health" etc. (handlers attach to
path_prefix + path in server_http_context::{get,post}), so req.path
arrives as "/llama/health" and never matches the un-prefixed
allowlist. Health probes and the static webui assets then 401 under
--api-key, defeating the point of marking them public.

Capture path_prefix into the middleware lambda and strip it from
req.path before the allowlist lookup.

Caller audit:
- Only public_endpoints lookup needs the strip; the api-key validity
  check itself is correct regardless of prefix.
- The unrelated bundle-of-tests fix in utils.py adapts the health
  poll in ServerProcess.start() so tests can actually use api_prefix.

Tests in test_security.py:
- test_api_prefix_keeps_public_endpoints_public — drives all 4
  prefix+public endpoints with --api-key + --api-prefix and asserts
  no 401.
- test_api_prefix_still_requires_key_for_private_routes — guards the
  inverse: private endpoints under prefix MUST still 401 without a key.

20/20 security tests pass locally (was 18 before).

Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
…104)

turboq_rotate_block_forward/inverse looped to QK_K (256) while every
caller (tbq3_0/tbq4_0 quantize + dequantize) passes TBQ_BLK_SIZE (128)
float buffers — a 128-float OOB read+write per block. Confirmed with
ASAN (heap-buffer-overflow in matvec_row via quantize_row_tbq3_0_ref)
and the cause of the Windows x64 CI failures: test-quantize-fns
SEGFAULT / test-quantize-perf 0xc0000374 (STATUS_HEAP_CORRUPTION).

Loop bound fixed to TBQ_BLK_SIZE. No behavior change for the valid
region: the extra iteration only produced the out-of-bounds garbage.
test-quantize-fns + test-quantize-perf now pass under ASAN+UBSAN.

Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
…#105)

* ci: unbreak fork CI — dead self-hosted labels + invalid sycl workflow

Three fork-side CI defects, all inherited from upstream's runner topology:

- build-cmake-pkg.yml: [self-hosted, Linux, CPU] matches no runner in this
  org (only ht-org-k8s-* with Linux,X64,k8s exists) -> the job queued
  forever and the 'CI (cpu)' workflow has never concluded since the
  2026-06-07 CI refactor sync. GitHub-hosted ubuntu-latest (repo is public).

- python-lint.yml: [self-hosted, fast] same story; flake8 dead-queued on
  every push since 2026-05-24. ubuntu-latest.

- build-sycl.yml: #74 stripped the auto-triggers, but the file itself is
  schema-INVALID (all jobs commented out = empty jobs map), and GitHub
  creates a zero-job 'failure' run on every push for invalid workflows
  regardless of triggers. Add a never-running placeholder job so the file
  parses; drop it when upstream re-enables SYCL CI.

* ci(lint): flake8 green — NP100 per-file-ignores for stdout CLIs + drop unused np binding

The linter has been dead-queued since 2026-05-24 (runner label); reviving
it surfaced 36 violations, all in downstream diagnostic scripts:

- scripts/{gguf-meta,compare-dflash-weights,dflash-logit-parity}.py write
  their reports to stdout by design — NP100 (no print(), use logging)
  targets library code. Per-file-ignores in .flake8.
- dflash-logit-parity.py: F841 — numpy presence-check kept, dead binding
  dropped.

---------

Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
…ifications (#106)

Every downstream change now carries a Why column — the reason it exists
on ht — and the section states the audit rule: a change we can no longer
justify gets dropped at the next upstream sync.

New coverage (previously undocumented): D=512 FA vec kernels, server
table (heierchat/router glue, ctx-checkpoint byte cap, --api-prefix
public endpoints, Responses incomplete status, logprobs partial-sort,
fit-params byte plan, router hardening, termd, log-noise fixes),
scripts & validation (DFlash bench suite, Pascal P5200 recipe,
downstream test coverage), and the full Build/CI delta (trigger strips,
runner re-targeting, fork meta).

Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
Sync the ht fork onto upstream llama.cpp master: merge-base 0066404 ->
upstream d8a24cc, 114 upstream commits. Resolved 96 conflicts (63
modify/delete + 32 content + 1 add/add).

Key resolution decisions (full rationale in PR):
- tools/ui/* (63 files): kept deleted. The fork ships a prebuilt UI from the
  llama-ui HF bucket, not the upstream Svelte source tree.
- mtmd cluster (clip/mtmd/mtmd-helper, 6 files): converged to upstream. ht's
  mtmd was vendored upstream commits now behind master (wrapper-struct bitmap
  API, video/ffmpeg support, builder pattern); verified no ht-unique features
  lost; server callers updated to the wrapper API.
- Gemma4 MTP / DFlash / eagle3: merged both sides. Kept ht's DFlash arch,
  hparams, and speculative wiring; adopted upstream's now-merged eagle3 +
  masked-embedding support (vendored PR ggml-org#23398 converging upstream).
- KV cache: adopted upstream's v_cells_impl shared_ptr refactor, which
  mainlines the dual-context cell-sharing ht hacked in via `other`.
- llama-graph kq-mask: took upstream's buffer-guarded mask set (matches the
  shared-cells refactor); dropped the now-duplicate unconditional call.
- Embedding scale: kept ht's strict `ubatch.token` guard (Gemma4 vision must
  not scale raw image embeddings) over upstream's deepstack-aware variant.
- Server: kept ht features (router glue, #94 benign-warning downgrade, #87
  api-prefix stripping, DFlash draft + MTMD-draft wiring) merged with
  upstream's expanded public endpoints, mtmd wrapper API, and int64 counts.

Post-merge fixes (auto-merge artifacts the build surfaced):
- removed duplicate llama_model::fc member (both sides added it)
- restored `n_embd` in output_reserve for upstream's embeddings-layer-inp path

Validation: builds clean (118 targets, CPU Release); ctest -L main 47/47 green.
Deployment-shaped paths (dual-context speculative KV sharing, Gemma4 vision
embedding scale, mtmd video) need a real-model smoke before fleet rollout.
@marksverdhei

Copy link
Copy Markdown
Author

Superseded — the sync landed via linear rebase instead of this merge commit.

Per maintainer direction (keep ht a clean linear delta on top of upstream, no merge nodes), the resolved tree from this PR's merge commit a3c3b14ab was reconstructed as 8 clean per-feature commits directly on upstream d8a24ccee and force-pushed to ht:

5671fe017 chore(ht): fork infrastructure — build, CI, branding, drop embedded webui source
d3fdc2e54 feat(quant): 128-block TurboQ TBQ3_0/TBQ4_0 + CUDA flash-attention + D=512 KV vec
6cd34472f feat(decode): DFlash block-diffusion + Gemma4 MTP speculative decoding
43197aedc feat(server): downstream router + heierchat + fit-params + hardening
40ab9dd39 feat(termd): sandboxed terminal daemon for tool execution
676fda3d4 feat(convert): LoRA conversion support for DeepSeek MLA kv_b_proj
7b04528f1 build(pascal): P5200 CUDA build recipe + DFlash bench/parity harness + research notes
ad59cdb9c docs(readme): HT Fork Changes inventory + tool README corrections
  • ht's tree is byte-identical to this PR's merge tree (verified git diff origin/ht a3c3b14ab empty).
  • merge-base(ht, upstream/master) now correctly advances to d8a24ccee — the actual goal of the sync — while ht stays a 0-merge linear delta.
  • Validated: build clean (118 targets) + ctest -L main 47/47.
  • Rollback tag: ht-pre-rebase-2026-06-13 (exact pre-force-push tip 43870e770, on origin).

Closing as landed. The chore/upstream-sync-2026-06-13 branch can be deleted.

@marksverdhei marksverdhei deleted the chore/upstream-sync-2026-06-13 branch June 13, 2026 09:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant