chore(sync): upstream master → ht (114-commit sync 2026-06-13)#107
Closed
marksverdhei wants to merge 39 commits into
Closed
chore(sync): upstream master → ht (114-commit sync 2026-06-13)#107marksverdhei wants to merge 39 commits into
marksverdhei wants to merge 39 commits into
Conversation
The full webui source tree (tools/ui/.storybook, src/, static/, tests/, docs/, configs) is now maintained at heiervang-technologies/heierchat as a standalone repo. Drop it from ht-llama.cpp along with the upstream ui.yml CI workflow that builds it. The minimal build glue stays: tools/ui/CMakeLists.txt, tools/ui/embed.cpp, tools/ui/sources.cmake, scripts/ui-assets.cmake — enough to wire up llama-ui as a static lib so tools/server can keep linking it. With no source files present, scripts/ui-assets.cmake hits its 'no assets available' fallback and emits empty ui.cpp/ui.h; llama_ui_find_asset() returns nullptr and the server simply 404s for embedded UI routes. Heierchat is served separately (its own build pipeline), and the pre-built snapshot in tools/server/public/ is committed for any deployment that wants to wire its own filesystem-serving route.
…nels Adds two new ftypes — LLAMA_FTYPE_MOSTLY_TBQ3_0 (3.125 bpw) and LLAMA_FTYPE_MOSTLY_TBQ4_0 (4.125 bpw) — built on 128-element blocks (half of QK_K, finer quantization granularity at low bpw). CPU + CUDA implementations include native dot-products and dequant kernels. The CUDA flash-attention path consumes K/V directly in quantized format, eliminating the dequantize-to-f16 intermediate. Plumbed through include/llama.h ftypes, src/llama-quant.cpp, src/llama-graph.cpp, src/llama-kv-cache.cpp, tools/quantize, tools/llama-bench, and tests/test-backend-ops + test-quantize-fns. CLI and completion docs updated with the new cache-type values.
common/preset.cpp: scan the models directory for GGUF files whose
general.type=adapter, read their architecture / name / version, and
expose them via common_preset_context. Lets the server attach
LoRA adapters automatically instead of requiring per-preset wiring.
common/arg.cpp + common.h:
- Register TBQ3_0 / TBQ4_0 in the KV cache type table (used by
--cache-type-k / --cache-type-v alongside the TurboQ feature).
- Add --remap-developer-role / LLAMA_ARG_REMAP_DEVELOPER_ROLE so
requests with the OpenAI "developer" role rewrite to "system"
before chat-template application — needed for Qwen3.5 et al
whose templates reject unknown roles.
Carries ht's server-side customizations on top of upstream router mode: * server-models: discovered_adapters tracking + LoRA discovery wiring, pick_any_resident() for the "any" model sentinel, custom stop-timeout handling, sync load/unload via wait_until_ready / wait_until_unloaded, child-to-router CMD_CHILD_TO_ROUTER_INFO support, plus LLAMA_APP_CMD env-var injection in update_args() for child supervision. * server-chat: developer-role remap, image queue and unbundled-tool endpoint helpers, downstream chat-template fallbacks. * server-context / server-task / server.cpp: glue and routing enhancements for the above. * server-common: shared helpers (env injection, json utils). * tests/unit/test_router.py: regression coverage for the router. * README + README-dev: heierchat pointers (webui is no longer in-tree) and ht-side server architecture notes.
Rust service (tools/termd/) that runs in a hardened container alongside llama-server and exposes a websocket terminal for tool-call execution. Used by heierchat to run user-approved shell snippets with sandbox guardrails (docker isolation, allowlisted commands, per-session state). * HTTP control plane + WS streaming endpoint * per-session shell state, async I/O over tokio * sandbox_guard: command policy enforcement * docker.rs: container lifecycle (start, exec, kill) * ht-termd.service: systemd unit for managed deploys
The base model splits kv_b_proj into k_b_proj (transposed) and v_b_proj. LoraTorchTensor can't handle the required split+transpose on 3D tensors, so decompose the LoRA and apply the split/transpose to the raw A/B tensors directly before yielding both decomposed adapter tensors.
Ht-fork housekeeping: * README.md, CONTRIBUTING.md, AGENTS.md, media/ht-llama-banner.png: fork branding and contributor pointers. * .gitignore: downstream-only artifact ignores (deploy bundles etc.). * docs/research/: design notes carried in-tree (diff-edit tool, file-editing research, LLM diff-edit literature). * .github/workflows/: - aioc.yml — heiercloud compatibility probe - fork-sync.yml — automated upstream-master sync - python-lint.yml — fork-specific lint config - release.yml — fork release pipeline * tools/server/public/: pre-built heierchat bundle snapshot embedded so the default server experience works without a separate UI deploy. * scripts/snapdragon/qdc/requirements.txt: Dependabot security bumps (idna 3.15, urllib3 2.7.0, pytest 9.0.3).
scripts/ui-assets.cmake gets a new Priority 1 that copies the committed heierchat snapshot in <repo>/tools/server/public/ into the build's DIST_DIR, so llama-ui-embed picks it up and llama_ui_find_asset() can serve it. Existing tools/ui/dist priority is demoted to Priority 2 as a legacy / manual-override path. Add tools/server/public/loading.html (carried from upstream master's tools/ui/static/loading.html) so the 4-asset set is complete and the server has a loading screen during model load. README.md: collapse the eight WebUI / Desktop-shell subsections into a single section pointing at heierchat. Drop the stale internal links to tools/server/webui/* and tools/server/webui-tauri/* that no longer resolve in this repo.
) `tools/server/server-models.cpp:923` used `child_proc->return_status`, which is a POSIX-only field of `subprocess_s` — the Windows variant of the struct has `hProcess` instead, so this fails to compile on every Windows CI runner. Replace the direct field access with a portable `subprocess_join(child_proc.get(), &child_exit)` call (same pattern already used at line 908 of the same file). `subprocess_alive()` has already reaped the zombie on POSIX, so `subprocess_join()` returns immediately with the cached status.
PR #52 switched TBQ3_0/TBQ4_0 from 256-element to 128-element blocks, but tests/test-quantize-fns.cpp wasn't updated: * `test_tbq3_norm_scaling` allocated a single `block_tbq3_0` (128 elements) on the stack but passed `QK_K` (256) to `quantize_row_tbq3_0_ref`. The ref function writes `k / TBQ_BLK_SIZE` = 2 blocks, overrunning the single-block buffer. x86 silently scribbled past the local; arm64 stack canaries caught it as '*** stack smashing detected ***' and aborted the whole test binary. Fix: pass `TBQ_BLK_SIZE` and assert against `sqrtf(TBQ_BLK_SIZE)`. * Bumped tolerances slightly: - `MAX_QUANTIZATION_TOTAL_ERROR_TBQ4` 0.0025 → 0.0035 - `MAX_DOT_PRODUCT_ERROR_TBQ3` 0.05 → 0.06 - Added `MAX_DOT_PRODUCT_ERROR_TBQ4` = 0.03 (TBQ4 was falling through to the default 0.02, which the 128-block path now exceeds). The threshold bumps are tight (~20%) — worth a follow-up to confirm the 128-block migration isn't masking a real quality regression on uniform random data. Real-model evals (perplexity, MMLU) should govern accept/ reject of the migration; these tests are just smoke.
Adds DFlash speculative decoding as a per-arch model class:
* src/models/dflash.cpp (new): `llama_model_dflash` — block-diffusion
drafter that proposes N tokens per step against the target model's
context-conditional embeddings. SWA-aware attention mask, n_block
noise tokens layered against ctx tokens, per-layer is_swa_impl
routing.
* src/llama-arch.{cpp,h}: LLM_ARCH_DFLASH registered (NeoX rope).
* src/llama-graph.{cpp,h}: `llm_graph_input_dflash` carries target
hidden + masks, sets them on the graph each ubatch.
* src/llama-model{.cpp,.h,-loader.cpp}: arch dispatch, tensor loader
hooks, std::array<int,5> instantiation for the noise schedule.
* src/llama-cparams.h + llama-hparams.h + llama-context.h +
llama.cpp: cparams flag + ctx wiring + arch-dispatch entry.
* src/models/{gemma4,llama,models}.* : hooks so the existing target
archs cooperate with the dflash drafter.
* common/speculative.cpp + speculative.h: COMMON_SPECULATIVE_TYPE_DFLASH
config; auto-enable logic that gates 'draft' type on actual draft-
model path presence while leaving room for dflash / eagle3 / mtp.
* tools/server: router-aware integration (last_used_ms field,
pick_any_resident wired with dflash flag), server-task.h/cpp + queue
+ server.cpp glue, --parallel 1 gate left in place per current
Round-11 status.
* tests/test-dflash.cpp + scripts/* + HANDOFF.md: smoke harness,
bench, weight-compare, regression artifacts.
Squashed from 42 commits on the feat/dflash-integration branch (the
previous Round-11 lifecycle iterations). Rebased onto the post-rewrite
ht baseline; conflicts resolved against the per-arch class hierarchy
that's now upstream-stock (renamed swa_layers → is_swa_impl, single
remap_developer_role definition in server_chat_params, has_draft_simple
auto-enable kept dflash-aware).
…65) Per Markus 2026-06-04: DFlash quality measurement should use a Q8_0 target rather than Q4_K_M, since Q4_K_M introduces enough target-side quantization noise to confound DFlash's own accept-rate signal. Q8_0 fits in 38 GB total, well within titan A100 80 GB. * Default `TARGET` is now `gemma-4-31B-it-Q8_0.gguf`. Override via `--target PATH` or `DFLASH_BENCH_TARGET` env var. * Also added `DFLASH_BENCH_DRAFTER_DIR` env var for consistency. * Comment block documents VRAM math for Q4_K_M / Q8_0 / BF16 targets so future runs can pick the right card.
…l-org#23398 vendor (#93) * build : use umbrella Headers directory for XCFramework module map (ggml-org#23974) The XCFramework generated by build-xcframework.sh creates a module map that manually lists public headers. That list can fall out of sync with the framework's Headers directory. The module map is currently missing ggml-opt.h, which is present in the framework headers. This can cause downstream Apple builds to fail with: Include of non-modular header inside framework module 'llama' Use the framework's Headers directory itself as the module map umbrella instead of maintaining a manual header list. This makes all public headers under the generated framework's Headers directory part of the llama module. * webui: fix tool selector toggle/counter, key tools by stable identity (ggml-org#24065) * webui: fix tool selector toggle/counter, key tools by stable identity Key the disabled set, counts and toggles by a stable per-tool key instead of bare function name, deduped from one canonical list. Per-tool checkboxes become presentational (single row handler, no nested button), category checkboxes drop the tristate (n/total carries partial). One getEnabledToolsForLLM keeps normalized MCP schemas and dedupes by name. * ui: use SvelteSet and SvelteMap for local tool collections to satisfy svelte/prefer-svelte-reactivity * agents: refactor, include more guidelines (ggml-org#24111) * agents: refactor, include more guidelines * better example * rephrase a bit * add more examples * nits * server: avoid unnecessary checkpoint restore when new tokens are present (ggml-org#24110) * server: avoid unnecessary checkpoint restore when new tokens are present The pos_min_thold calculation unconditionally subtracts 1 to ensure at least one token is evaluated for logits when no new tokens exist. However, when the request contains new tokens beyond the cached prefix, this -1 is overly conservative and may trigger an unnecessary checkpoint restore. Conditionally apply the -1 only when n_past >= task.n_tokens() (no new tokens), avoiding redundant KV state restoration when there is actual work to do. * cont : add ref --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * ggml: vectorize ggml_vec_dot_q4_1_q8_1 with WASM SIMD128 (ggml-org#22209) * ggml: vectorize ggml_vec_dot_q4_1_q8_1 with WASM SIMD128 Optimize the inner loop of ggml_vec_dot_q4_1_q8_1_generic using WASM SIMD128 intrinsics, gated behind #ifdef __wasm_simd128__ so non-wasm builds are completely unaffected. Approach: - single wasm_v128_load covers all 32 packed 4-bit weights - nibbles unpacked via AND/SHR into two u8x16 registers - widened to i16 before multiply (WASM SIMD has no i8*i8 instruction) - 4x wasm_i32x4_dot_i16x8 calls accumulate all 32 element pairs - horizontal reduce via 4x wasm_i32x4_extract_lane Benchmark (node v25, emcc -O3 -msimd128, 64 blocks x QK8_1=32, 200k iterations): | impl | ns/call | speedup | |--------|---------|---------| | scalar | 880.7 | 1.00x | | simd | 257.8 | 3.42x | Correctness verified against scalar reference across 10 random seeds with exact output match. * ggml: move q4_1_q8_1 WASM SIMD implementation to wasm backend Relocate the SIMD128 implementation of ggml_vec_dot_q4_1_q8_1 to ggml/src/ggml-cpu/arch/wasm/quants.c to follow architecture-specific layout. Restore the generic implementation in ggml/src/ggml-cpu/quants.c. Move for loop in the else block. * ggml: use generic q4_1_q8_1 fallback in wasm backend * convert: Fix Gemma 4 Unified conversion (ggml-org#24118) * Fix Gemma 4 Unified conversion * Set audio hidden size to audio_embed_dim * return filter to save memory (ggml-org#24125) Co-authored-by: lvyichen <lvyichen@stepfun.com> * ui: added single line reasoning preview (ggml-org#23601) * webui: added single line reasoning preview. * patch: reduce width slightly for the previewing section * refactor: move formatter constants to the right file * feat: reimplement reasoning preview with throttled dynamic per-line rendering * chore: fix spacing Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * chore: refactor to requested changes * refactor: grouped by capture pattern instead of block-level + inline * ui: fax interrupt state only trigger for 1st reasoning message * chore: make reasoning preview respects showThoughtInProgress setting * chore; newline at EOF Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * fix: thread rawContent so collapsible content can handle compute preview * patch: showThoughtInProgress accidentally blocks rawContent being passed * chore: fix lint * chore: change smoke test --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * ui: Fixed packages (ggml-org#24119) * chore(ui): pin package versions to currently installed - Update all dependencies and devDependencies to match exactly what's in package-lock.json - This ensures reproducible builds by locking to specific versions rather than semver ranges * chore: Update packages * chore: Move remaining dependencies to devDependencies * fix: Add missing `mermaid` package * chore: Update `cookie` package to `v1.1.1` * chore: Formatting * test: Update test configs * Move duplicated imatrix code into single common imatrix-loader.cpp (ggml-org#22445) * Deduplicate imatrix loading code * Add back LLAMA_TRACE, early exit on quantize missing metadata * webui: [a11y] fix keyboard navigation issues in chat interface and sidebar (ggml-org#23132) * use child snippets for landing and chat message elements * make ... icon visible in conversation history menu * conversation history forward tab fix * add snippet fix for fork icon in conversation history * focus/keyboard fix for attachment x icon and scroll left/right * formatting * fix scroll down issue * simply Statistics and pointer events in scrolldown * create storybook tests and move to folder * improve tests to actually assert on element * arg: fix double mtp downloads (ggml-org#24128) * server : disable on-device spec checkpoints (ggml-org#24108) * sycl : port multi-column MMVQ from CUDA backend (ggml-org#21845) mmvq: Port the ncols_dst optimization from ggml-cuda/mmvq.cu to SYCL. Read weights once per dispatch instead of once per column. Covers all standard quant types + reorder paths for Q4_0, Q8_0, Q3_K, Q4_K, Q5_K, Q6_K. IQ types (except IQ4_XS) excluded due to incompatible vec_dot signatures. ggml-sycl: The weight reorder was only bootstrapped on single-token mat-vec (ne[1] == 1). Speculative / MTP verify issues only multi-column mat-vec, so it never triggered the reorder and ran on the slower non-reorder kernel. Bootstrap it on small multi-column batches (ne[1] <= 8) too. * ci : build-msys job slimming [no ci] (ggml-org#24157) This PR attempts to slim down the dependencies for build-msys jobs making the same changes that we applied in whisper.cpp to reduce the size of the github actions cache, and should also improve the run time due to fewer dependencies that need to be installed. I realize this is a scheduled job but I think it would still make sense to apply these changes. Refs: ggml-org/whisper.cpp#3858 * CUDA: enroll mul_mat_vec_q_moe into pdl (ggml-org#24087) * Enroll mul_mat_vec_q_moe into PDL, boosting MTP performance on BW Data collected on a B4500: Before ``` (llama.cpp) ➜ llama.cpp git:(master) ✗ python mtp-bench.py code_python pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=202.8 code_cpp pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=212.8 explain_concept pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=196.4 summarize pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=226.6 qa_factual pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=225.1 translation pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=201.5 creative_short pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=197.2 stepwise_math pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=209.2 long_code_review pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=208.9 ``` After ``` (llama.cpp) ➜ llama.cpp git:(master) ✗ python mtp-bench.py code_python pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=211.9 code_cpp pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=224.6 explain_concept pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=207.8 summarize pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=240.2 qa_factual pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=238.5 translation pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=213.4 creative_short pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=208.8 stepwise_math pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=221.7 long_code_review pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=220.7 ``` Server launched with: ``` ➜ llama.cpp git:(osimons/enroll_mul_mat_vec_q_moe_into_PDL) ✗ ./build-x64-linux-gcc-reldbg/bin/llama-server \ -m /mnt/share/gguf/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -dio \ --spec-type draft-mtp \ --spec-draft-n-max 2 \ -ngl all \ -fa on \ --host 0.0.0.0 \ --port 8080 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" ``` * LC to overlap with following kernels * kleidiai : dynamic chunck-based scheduling for hybrid execution (ggml-org#23819) * hparams : refactor `hparams.n_layer` (ggml-org#24060) * hparams : refactor hparams.n_layer * cont : remove `n_layer_kv()`, use n_layer_all instead * cont : type consistency * pi : update SYSTEM.md * models : fix Step3.5 MTP * cont : remove duplicate switch cases * cont : explicitly set `false` to extra layers for `is_swa` and `is_recr` * cont : fix nextn layer count handling Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * minor : fix lint issues (ggml-org#24165) * docs: Update quantization readme (ggml-org#24133) * Update quantization readme * install requirements * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * dos2unix suggestions --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * ui: add ignore-scripts=true to npmrc (ggml-org#24149) * Fix link to available UI settings (ggml-org#24169) The current link is to a non-existent file. I had a look at the repo, spotted the file containing the UI configuration key and updated the link * ui: run npm install when package-lock.json is newer than node_modules (ggml-org#24171) * model : fix llama_model::n_gpu_layers() (ggml-org#24188) * cli: fix model params not propagated (ggml-org#23893) Fixes ggml-org#23847 * TP: round up granularity to 128 (ggml-org#24180) * TP: round up granularity to 128 * remove assert * model, mtmd: Granite4 Vision (ggml-org#23545) * feat(convert): Get language model conversion working for 4.1 vision Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(convert): Skip multimodal tensors for GraniteMoeHybrid (vision 4.0) Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Disable vocab padding for non-hybrid models that use GraniteMoeHybrid Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Plumb python-side vision projector names and mappings There are several awkward things here: 1. Most of these are essentially identical to the audio qformer tensors. On the c++ side, that's mapped using the prefix, so the rest of the GGUF name needs to align, but on the python side there's no prefix notion, so they all get duplicated. 2. There are a couple of net-new tensors for vision, in particular PROJ_NORM. In both speech and vision, the QF_PROJ_NORM is qualified as belonging to the qformer portion, but the GGUF name is simply proj_norm which conflicts with the ideal name for this new PROJ_NORM that is not qualified as part of the qformer. To get around this, I used "proj_layernorm" as the GGUF name. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add python side architecture name Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add python-side plumbing for setting FEATURE_LAYERS hparam Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add c++ side tensor naming defines NOTE: Usage of these hasn't been updated to include prefix yet Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(mtmd): Convert vision_feature_layer to an ordered vector We need to preserve the ordering of these feature index values so that they can be mapped to the sub-tensors within the stacked projectors. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(mtmd): Add architecture label plumbing Branch: Granite4Vision AI-usage: full (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat(wip): Add partial conversion for mmproj This handles stacking the projector tensors and setting the new harams Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add gguf_writer and constant support for new hparams and deepstack layer arr Branch: Granite4Vision AI-usage: draft (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Full conversion for mmproj w/ tensor mappings Branch: Granite4Vision AI-usage: full (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add lm_head skip for mmproj for 4.0 Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: De-alias text_config architecture in convert_lora_to_gguf.py Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add --trust-remote-code arg to convert_lora_to_gguf.py This defaults to False, but allows a user to enable it programmaticly instead of using the interactive prompt. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: De-alias model.language_model. -> model. for lora adapters Branch: Granite4Vision AI-usage: full (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Extend language model tensor dealiasing in adapters Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unnecessary registration for GraniteSpeech in language model Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Plumb through mm prefix formatting for qformer tensors Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Refactor vision projector tensors to use predictor ID as the block This is cleaner than stacking them. The modeling file hard-codes single-layer qformers, so we can punt on the multiipule multi-layer projectors problem. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add spatial offests array hparam conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add stub plumbing for granite vision in mtmd Branch: Granite4Vision AI-usage: draft (OpenCode + qwen3.5:122b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add new hparam and tensor naming in clip-impl.h New hparams: - KEY_PROJ_SAMPLE_QUERY_SIDE - KEY_PROJ_SAMPLE_WINDOW_SIDE - KEY_PROJ_SPATIAL_OFFSETS New tensors: - TN_MULTI_PROJ_IMG_POS - TN_MULTI_PROJ_QUERY - TN_MULTI_PROJ_LAYERNORM - TN_MULTI_PROJ_LINEAR - TN_MULTI_PROJ_NORM Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Move deepstack_layer_arr to llm hparam instead of mmproj Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove IS_DEEPSTACK_LAYERS This appears to have been added during Qwen3 VL (ggml-org#16780), but it was never actually used. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: n_deepstack_layers -> deepstack_layer_arr The old logic hard coded a correspondence between the first N layers of the LLM and the 1->N entries in the input embeddings. Now, that relationship is maintained at loading time if the GGUF value is single-valued. If it is multi-valued, it loads directly allowing for deepstack layers to be spaced out throughout the model. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use try/catch for single/multi valued deepstack info The alternative would be to use get_key_or_arr, but then the single value would be populated through the entire array and we'd need to detect that and update it with the right correspondence. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add deepstack injection point for granite LLM The use of ggml_add here assumes that the elements of inp_embd will be pre- arranged to be the full embedding length with only the vision-mask'ed portions non-zero from the projector. This matches how Qwen3VL does it. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: add missing vision attn layernorm eps Branch: Granite4Vision AI-usage: full (OpenCode + Qwen 3.6-35B) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Hoist qformer tensors into qf_block and hold a vector for multi-proj Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix missing prefix template for TN_QF_PROJ_LINEAR It's not strictly necessary since vision uses the blockwise version, but it makes the loading consistent. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add embedding scale and image grid pinpoints hparams in conversion Also remove dead parsing for self._deepstack_layer_arr Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add mtmd KEY_ section for hparams shared with the LLM In this case, we need the EMBEDDING_SCALE so we can unscale the image embeddings to compensate for applying embedding scale to the input embeddings Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Implement c++ hparam parsing Branch: Granite4Vision AI-usage: draft (Claude Code) Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Flatten pinpoints in conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing break Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: No reason to have modality prefix for img_pos Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add tensor loading Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(convert): Fix confusion between proj.norm and proj.qformer.layernorm Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Use the right portion of speech for tensor loading! Also plumb through the layernorm -> post_norm naming change Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add logging of deepstack_layers_arr if set I also changed the print_f output type to int32_t to avoid printing overflow values for -1. This could cause overflows on the other side, but I can't imagine a value for any of the current array hparams that would trigger that. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Make sure input embeddings are cont before f_embedding_scale Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add init and mmproj_embd cases for g4v The n_mmproj_embd is 1+ to make space for the text embedding and all 8 projectors Branch: Granite4Vision AI-usage: draft (Bob) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Invert (h, w) -> (w, h) pinpoints Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Reorder projectors based on llm index and skip the first injection The multi-projector stack has a strange asymmetry based on how it's currently implemented for qwen3vl: on the mmproj side, it's all N projectors, but the output of the "first" (by inp_embd index) projector is automatically consumed as if it were a standard single-projector mmproj, so the deepstack portion needs to only contain the 1-N entries. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * fix: Fix mmproj hparams in conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * fix: Fix ordering/logic for deepstack injection in granite Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * fix: Fix preprocessing config to match what the model needs Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * wip: Partial port of Eli's implementation This is still pretty broken, but it's getting closer. It now happily generates tokens, but the values are quite incorrect still. I suspect it's caused by the mapping of projectors from safetensors to their respective orders here. Also, this implementation breaks encapsulation pretty badly in mtmd_encode. This will need a big refactor to put the G4V-specific encoding logic somewhere more appropriate. Branch: Granite4Vision AI-usage: draft (Claude Code, Bob) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com> * fix: Fix the pre-scaling on the input embeddings to correctly invert the scale We've got tokens! They still don't line up quite right, so something's a little off, but we're getting much closer now. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: invert embedding multiplier -> base_scale at load Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix setting image_resize_pad after new enum introduced Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add G4V to mmproj mapping in conversion Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Re-add padding disable for non-hybrid hybrid models Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Simplify G4V n_tokens computation This is slightly more efficient and flexible for when we implement the unpad cropping. IMO, it's also clearer that it is adding the number of image_newline tokens (embeddings) to the grid, rather than recomputing the entire count. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add new clip APIs for post-tile-encoding assembly Granite 4 Vision uses llava-next style pack-and-unpad which requires injecting the learned newline after each row of the tile grid. A row here is a single row of the grid which is composed of (grid_x * cols_per_tile) * (grid_y * rows_per_tile), so the result is newlines injected in between individual tile rows, thus not something that can be handled with the standard llava-uhd block-wise endcoding. Branch: Granite4Vision AI-usage: draft (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Add model interfaces for granite 4 vision assembler I'm on the fence about the best organization of this. These free functions allow the per-architecture logic in clip.cpp to access the model-specific graph building, but they still require a fair bit of model-specific logic in clip.cpp which is not ideal. I think a better approach may be to replicate what is done with the graph builders themselves (and possibly even make the assembler part of the model's existing graph builder). Branch: Granite4Vision AI-usage: full (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove all g4v-specific branching from mtmd.cpp in favor of clip assembler Branch: Granite4Vision AI-usage: full (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor(mtmd): Consolidate assembler logic into clip_assembler class family Just like `clip_graph` is the base class for building the model-specific encoder graphs, `clip_assembler` will be the base class for building the model-specific assembler graphs. This allows the assembly pattern to follow how the encoder pattern is implemented where the model-specific logic lives in a subclass co-located with the encoder graph builder that gets constructed by a simple factory method. Branch: Granite4Vision AI-usage: full (Claude Code + Opus 4.7) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Comment improvement Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: granite_vision -> granite4_vision Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove dead codepath for Qwen3VL add_vision_is_deepstack These pieces were never used on the c++ side (removed there in an earlier commit), so this is just cleanup that I missed before. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Oops! I did not mean to commit one of my prompt files But now it's too far back in history to effectively rebase out, even with interactive and --rebase-merges :( Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing <algorithm> include for std::find It seems that this was already pulled in on some platforms, but not on others Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix Flake8 warnings in granite conversion module Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove clip_assembler in favor of clip_image_f32.append_token Per conversation in the PR, the clip_assembler pattern was too invasive. This is a compromise that limits model-specific blocks to add_media where each preprocessed tile is annotated with an injection type, after which all the token counting logic is generic and the newline injection itself is handled in the graph based on the value for the given tile image. Branch: Granite4Vision AI-usage: draft (Bob, OpenCode + Qwen 3.6 35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor(convert): Split n_deepstack_layers and deepstack_layers (array) Branch: Granite4Vision AI-usage: full (Bob, OpenCode + Qwen3.6-35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor(src): Handle n_deepstack_layers and deepstack_layers GGUF keys Branch: Granite4Vision AI-usage: draft (Bob, OpenCode + Qwen3.6-35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix GGUF key for deepstack_layers_arr Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Remove pre-scaling embeddings and skip scaling for raw embd inputs This follows how gemma3 and gemma4 handle embedding scaling by skipping the multiplier for raw input embeddings. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: deepstack_layers(_arr) -> deepstack_mapping(_arr) Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Fully revert changes to n_deepstack_layers and qwen3vl* Since we're going to keep the GGUF KVs separate, it makes sense to just keep the hparams separate too to limit the scope of this branch. The down side is that n_deepstack_layers and deepstack_mapping_arr are potentially conflicting. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Revert removal of "is_deepstack_layers" GGUF KV This KV is not used at all on the c++ side, so it's fully dead, but there's also no need to conflate this cleanup with the addition of G4V. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unnecessary ggml_cont and build_forward_expand in cbx Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Clean up comments Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Tighter and more flexible code for g4v_build_block This could be refactored to look a lot more like granite-speech, but the overall block constructs before/after the qformer are pretty different, so for now I'm going to leave it as is and just tighten a bit. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unnecessary `unordered_set` include Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add architecture guard on deepstack_mapping_arr printout Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unnecessary AI-gen comment Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Always initialize deepstack_mapping_arr with -1 values This was causing `test-llama-archs` to fail, likely due to trying to save the uninitialized values, then re-loading them. It's safer to always initialize so that other models don't forget and end up with undefined behavior. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Remove TODO about block/vs non-block tensor mapping Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Move is_vision_feature_layer logic into clip_hparams Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use a bool for append_token Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Remove unnecessary comment Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Remove unused get_model api yikes! Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Rearrange helpers for g4v to be private members and use build_attn Branch: Granite4Vision AI-usage: full (Bob, OpenCode + Qwen3.6-35b) Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix off-by-one in vision layer index This was inherited from the Claude Code implementation that pushed the negative index inversion down into the model file. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Fix norm/post_norm mixup in conversion face. palm. :( Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: More descriptive tensor names Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Apply PR cleanup for new conversion changes AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix(convert): Remove duplicate V_ENC_EMBD_IMGNL Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: append_token -> add_newline Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: Comment cleanup Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Cleaner error handling/checking NOTE: format_string is not available in granite.cpp (and including clip-impl.h to get it doesn't compile, so I think it violates the intended encapsulation), so std::stringstream is the simplest answer. Branch: Granite4Vision AI-usage: none Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * model: fix build failed (ggml-org#24193) * vulkan: add fwht support for Intel with shmem reduction (ggml-org#23964) * vulkan: add fwht support for Intel with shmem reduction * don't use N as workgroup size * disable subgroup shuffle on MoltenVK AMD * disable fwht shader on Intel Windows due to driver bug * common/chat : unify and fix LFM2/LFM2.5 tool parser (ggml-org#24178) * opencl: improve get_rows, cpy, concat and q6_k flat gemv (ggml-org#24160) * opencl: allow multiple workgroups for large rows * opencl: improve small cpy * opencl: packed concat for small input * opencl: tweak flat q6_K gemv, increase N_DST and remap threads * context : fix off-by-one comparisons to n_gpu_layers (ggml-org#24208) * model : rename local n_layer_all variable (ggml-org#24209) * vulkan: check coopmat2 features before reporting support (ggml-org#24186) * mtmd, server: add "placeholder bitmap" for counting tokens , add */input_tokens API (ggml-org#23913) * mtmd: add "placeholder bitmap" for counting tokens w/o preprocessing * fast path skip preproc for placeholder * fix build * correct the api * add server endpoint + tests * add object name * update docs * add proxy handling * fix build * fix audio input path * use is_placeholder in process_mtmd_prompt() * nits * nits (2) * docs: clarify chat/completions/input_tokens is not official * fix merge problem * completion : fix format specifier in LOG_INF (ggml-org#24213) Signed-off-by: Adrien Gallouët <angt@huggingface.co> * completion : remove useless statics (ggml-org#24226) Signed-off-by: Adrien Gallouët <angt@huggingface.co> * mtmd: support "frame merge" for qwen-vl-based models (ggml-org#21858) * feat: add video support for Qwen3.5 * various clean up * revise the design * fix llava-uhd case * nits * nits 2 --------- Co-authored-by: andrewmd5 <1297077+andrewmd5@users.noreply.github.com> * common/chat : fix LFM2/LFM2.5 reasoning round-trip and <think> leak (ggml-org#24234) * common/chat : fix LFM2 reasoning round-trip and stray <think> leak * Gate by reasoning format and whether the template supports <think> * docker : bump cuda13 to 13.3.0 (ggml-org#24228) * convert : fix Gemma4 with no audio encoder (ggml-org#24242) * arg: Skip mmproj download when user supplied mmproj (ggml-org#24239) * chore(sync): adapt DFlash to hparams.n_layer() method post-ggml-org#24060 src/models/dflash.cpp had three direct uses of `hparams.n_layer`. The upstream hparams refactor (ggml-org#24060) turned that into a method `n_layer()` (effective count, excludes nextn layers). DFlash drafter has no nextn layers, so `n_layer()` and the raw field `n_layer_all` are numerically equal — picked `n_layer()` to match the new accessor convention. Behavior-preserving. * llama: Gemma 4 MTP * fix multi-seq * add assert that draft + shared kv should be on same device * add Q rot when cache is quantized * add temp hack to not use fit with gemma4, rm later * add exception in test-llama-archs * move assistant to separate file * add unified assistant * cont : adjust to hparams changes * cont : avoid computations on the CPU * cont : clean-up * cont : clean-up * cont : fix handling of unused tensors * cont : fix undefined * fix typo * cont : enable gemma4 graph reuse * cont : fix assert * cont : fix quantized cache * cont : fix names * cont : fix names * cont : add reference for draft positions * cont : fix multi-modality * cont : add comment about ctx_src * cont : clean-up server fit logic * cont : clean-up llama_context * py : fix names * cont : rename ctx_src -> ctx_other * chore(sync): drop intermediate llama_set_mtp_source call The first PR ggml-org#23398 commit added an `llama_set_mtp_source(ctx_dft, ctx_tgt)` call after `llama_init_from_model`. Later cleanup commits in the same PR removed that API and moved the wiring to `cparams.ctx_other = ctx_tgt` set BEFORE init. Our keep-both resolution carried the intermediate call forward; this drops it to match the PR's final API. Drops 1 use of removed symbol, no behavior change (the rebased cparams.ctx_other assignment is what's actually used). --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Signed-off-by: Adrien Gallouët <angt@huggingface.co> Co-authored-by: Gerard Martinez <gmarzjr@proton.me> Co-authored-by: Pascal <admin@serveurperso.com> Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> Co-authored-by: Yongyue Sun <abioy.sun@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Kartik Sirohi <99896785+sirohikartik@users.noreply.github.com> Co-authored-by: Pedro Cuenca <pedro@huggingface.co> Co-authored-by: forforever73 <63285796+forforever73@users.noreply.github.com> Co-authored-by: lvyichen <lvyichen@stepfun.com> Co-authored-by: MagicExists <106458387+gugugiyu@users.noreply.github.com> Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> Co-authored-by: Bartowski <3266127+bartowski1182@users.noreply.github.com> Co-authored-by: viggy <70774793+vignesh191@users.noreply.github.com> Co-authored-by: Mason Milburn <masonmilby@gmail.com> Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com> Co-authored-by: Oliver Simons <osimons@nvidia.com> Co-authored-by: Charles Xu <charles.xu@arm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Mario <191101255+wariuccio@users.noreply.github.com> Co-authored-by: therealkenc <therealkenc@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Ruben Ortlam <rortlam@redhat.com> Co-authored-by: Tarek Dakhran <tarek@liquid.ai> Co-authored-by: lhez <lih@qti.qualcomm.com> Co-authored-by: Adrien Gallouët <angt@huggingface.co> Co-authored-by: andrewmd5 <1297077+andrewmd5@users.noreply.github.com> Co-authored-by: konradmb <konradmb@o2.pl> Co-authored-by: marksverdhei <mark.sverdhei@gmail.com> Co-authored-by: Aman Gupta <amangupta052@gmail.com>
…#94) The MTP draft memory-probe path (server-context.cpp ~line 856) creates a throwaway llama_context with `cparams.ctx_type = LLAMA_CONTEXT_TYPE_MTP` to measure context+compute bytes for fit_params. For the Gemma4-Assistant arch, this throws because `cparams.ctx_other` is required and the target context doesn't exist yet at probe time — upstream's own src/llama-context.cpp init explicitly notes "this is normal during memory fitting" in the exception message and carries a TODO to switch to a typed llama_exception so the warning can be skipped. Until that upstream change lands and flows in via a master sync, scan the exception message for the self-identifying "normal during memory fitting" marker and downgrade WRN -> DBG for that specific case. Real failures (model load failed, etc.) still surface as SRV_WRN. Eliminates the misleading "[spec] failed to measure draft model memory: failed to create llama_context from model" line that appears on every gemma-4-12b-qat-mtp pod start despite the model loading + running fine at ~110 tok/s (verified Phase 6 on titan, image unified-llm:mtp-pr23398-5e6dff22). Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
* scripts(pascal): P5200 build notes + bench harness + Vulkan baseline Working notes for getting ht-llama.cpp running on the Quadro P5200 (Pascal sm_61, 16 GB). Toolkit wall: CUDA 13 dropped sm_61, so CUDA backend requires aur/cuda-pascal 12.9.1 + gcc14. Driver 580 still runs sm_61 binaries fine. Vulkan baseline (Llama-3.1-8B Q4_K_M, ngl=99, fa=0, build f6feddb): pp128 269 t/s pp512 278 t/s pp2048 251 t/s tg32 35 t/s tg128 35 t/s CUDA results pending cuda-pascal install (gcc14 source build dominates). Untracked primer (quadro-p5200-llamacpp-primer.md) referenced as the source for the FP16-1/64-FP32, INT8 DP4A, and ggml-org#7188 FA-fix facts. * scripts(pascal): CUDA backend bench results + complete install recipe CUDA 12.9 toolkit built and benched on the Quadro P5200 (Pascal sm_61). Five obstacles climbed on stock Arch: 1) CUDA 13 dropped Pascal → installed 12.9 from runfile --extract 2) Runfile libxml2.so.2 missing → bypassed installer with --extract 3) gcc-15/16 too new for nvcc → gcc-14 from archlinux-archive 4) gcc14 AUR source-build slow → 51MB binary pkg.tar.zst (30s install) 5) glibc 2.43 cospi/sinpi clash → noexcept(true) patch on CUDA math.h+hpp Full recipe in scripts/build-pascal-p5200.md. Bench summary (Llama-2-7B Q4_0, ngl=99, build 5159fee, P5200 sm_61): CUDA fa=1: pp512=795, tg128=45.8 t/s Vulkan: pp512=418, tg128=43.0 t/s CUDA wins pp ~1.9x, tg within 7% (bandwidth-bound). ggml_cuda_init confirms: compute capability 6.1, VMM yes, GGML_CUDA_FORCE_MMQ baked in (visible in nvcc cmdline). CC6.1 + MMQ-only + no cuBLAS fallback = the INT8 dp4a path is what is running. JSON artifacts committed alongside for replay/comparison. * scripts(pascal): packaging recipe — rpath-clean runtime tarball for Omarchy ISO Adds §7 Packaging covering the cmake-install + patchelf + symlink-chain + stage-and-tar pipeline that produces pascal-cuda-artifacts.tar.zst (the runtime fast-path consumed by hai-os-dev's autoinstall). Also drops the stale "TODO — fill in once build-cuda completes" placeholder and moves Sources to the true end of the doc. Recipe reproduces the verified-clean tarball: rpath stripped on all installed targets, libllama/libllama-common copied + patchelfed, symlinks recreated, members rooted at opt/ for `tar -C / -xf` extraction, ld.so.conf.d snippet documented so no LD_LIBRARY_PATH is needed at runtime. * scripts(pascal): correct §7 tarball size + add reference sha256 Was: prose-estimate "~810 MB before zstd, ~470 MB after" — actual is ~816 MB unpacked, 512 MB compressed (110 members). Adds the reference sha256 from the verified crystal build for hai-os-dev to byte-check against. Notes zstd non-determinism so re-runs are expected to differ. * scripts(pascal): field primer + Omarchy autoinstall handoff guide Round out the Pascal/P5200 enablement bundle (PR #99) with the two human-facing companions to scripts/build-pascal-p5200.md: - quadro-p5200-llamacpp-primer.md: Pascal/GP104 + llama.cpp field guide (the two facts that drive every decision, CUDA vs Vulkan, measured 1080-parity numbers, 16 GB VRAM sizing, optimization checklist). - quadro-p5200-omarchy-autoinstall.md: 7-question handoff guide for hai-os-dev — extra packages (no AUR), CUDA-12.9 runfile pin, build flags, the five obstacles + fixes (glibc 2.43 noexcept patch incl.), pre-build at image time, HaiOS integration points, verified 512 MB / sha256 0efed65... reference tarball, measured baseline. Both docs reference the canonical recipe at scripts/build-pascal-p5200.md and the verified tarball cached at crystal:/home/me/pascal-cuda-artifacts.tar.zst. * scripts(pascal): v2 build flags (server+router) + Gemma4 MTP bench JSONs Recipe update: add -DLLAMA_BUILD_SERVER=ON + -DLLAMA_BUILD_TESTS=OFF to the CUDA configure step. Required for the llama-app unified router (bin/llama) to link — without server-on, libllama-server-impl.so is not built and llama-app link fails with `cannot find -lllama-server-impl`. Also required for Gemma4 MTP: ctx_other wiring for the Gemma4Assistant draft class lives only in tools/server/server-context.cpp; the standalone llama-speculative-simple binary segfaults with "Gemma4Assistant requires ctx_other to be set". Rationale block also captures the spec-decode footgun: --spec-type defaults to `none`, so -md <draft> alone is silently ignored. Must pass --spec-type draft-mtp to engage. The /props default_generation_settings.params["speculative.types"] field is per-REQUEST sampler default, NOT the server engine state — the canonical engagement read is server stderr (draft acceptance line + statistics draft-mtp: ... summary). Bench JSONs (crystal Pascal P5200, Gemma4 12B QAT Q4_K_XL, sm_61 CUDA FORCE_MMQ, -fa on, -ngl 99, ctx 4096, greedy temp=0/top_k=1): baseline (no MTP, llama-bench): pp128=465.71 t/s, pp512=456.37 t/s tg32=25.54 t/s, tg128=25.54 t/s (flat — bandwidth-bound) MTP A/B via `llama serve` /completion (degenerate "0"×128 output): A baseline (--spec-type none): 25.26 t/s B MTP (--spec-type draft-mtp): 103.72 t/s ← 4.11× CEILING draft acceptance: 1.00 (118/118) — trivially predictable, not deployment MTP A/B via /v1/chat/completions (non-degenerate, 256 tokens): A baseline: 25.18 t/s B MTP: 76.06 t/s ← 3.02× REPRESENTATIVE greedy speedup draft acceptance: 0.7627 (225/295) bit-identical content sha A vs B (greedy lossless property) All three regimes labeled in the JSON so 4.11× isn't quoted as the deployment number — the representative ~3× greedy or the memory-recorded titan 1.66× (default sampling) are the honest reads. * scripts(pascal): v2 server/MTP docs — §6/§7 scope flip + Gemma4 MTP numbers Follow-on to 3662be4 (v2 build flags). Lands the doc side of the LLAMA_BUILD_SERVER=ON v2 build into the two human-facing companions. omarchy autoinstall guide §6: - v1/v2 tarball table: v2 = pascal-cuda-artifacts-v2-server.tar.zst, sha 2528d952..., 515.5 MB, 121 members, server+router scope. v1 (0efed65..., untouched) stays valid for non-serving bakes; v2 is the additive serving-capable successor, not a recall. - serving footgun: --spec-type defaults to `none` (-md silently ignored); engagement proof is server stderr, not /props. - Gemma4 MTP results, three clearly-labeled regimes (lossless A/B): 4.11x degenerate ceiling / 3.02x representative greedy (headline) / 1.66x sampling deployment ref. build-pascal §7 packaging: - version the tarball filename; never overwrite a live pull source. - v1/v2 size+sha table. - reconcile the stale "router not in this tarball" section to v2 reality: member-delta (+11), single-.so impls, lib64 prune, extraction-validate. - note that bin/llama-server / bin/llama-cli are separate targets, not in llama-app's dep closure (reproducing v2 needs them in --target). Also folds in a one-line build-target fix (line 80: add llama-server + llama-cli to --target) that landed in the shared tree from the fork-manager session concurrently — verified correct, kept so the recipe reproduces v2. * scripts(pascal): #100 bullets 1-3+6 bench evidence — server, router, gpu-only, vision+MTP Closes 4 of 10 issue #100 bullets on Pascal P5200 (v2 server-capable build): - bullet 1 (llama-server): standalone /opt/ht-llama-cuda/bin/llama-server → ready in 4s, /health 200, /completion 40 tok @ 25.84 t/s - bullet 2 (llama-server router): unified `bin/llama serve` shim → ready in 4s, /health 200, /completion 40 tok @ 25.82 t/s - bullet 3 (gpu only works): both runs above use -ngl 99 -fa on - bullet 6 (gemma4 12b qat mtp all modalities): combined mmproj + draft-mtp + spec engine → A. coexistence: /v1/chat with image_url + --spec-type draft-mtp engaged → predicted=96, stderr draft acceptance = 0.66102 (78/118) → B. grounding (decoupled to mtmd-cli, avoids Gemma4 chat-template empty-content quirk): all 3 ground-truth features matched (PASCAL, P5200, red rectangle); requires --jinja (otherwise std::runtime_error custom-template-not-supported abort). Methodology: - regression band ±3% pinned vs committed baseline 25.54 tg / 76.05 MTP; both server-router runs in band (24.77-26.31 t/s window). - engagement read on stderr (draft acceptance / draft-mtp stats), NOT /props (per --spec-type footgun memory). - chat-content quirk explicitly noted in JSON so empty content does not read as fail or regression. Bullets 4 (gpu+cpu offload) + 7-10 (qwen 27B/35B-MoE / gemma 26B/31B) land in subsequent commits once the lithium IQ3-class + titan 31B IQ4_XS transfers complete on crystal. * scripts(pascal): #100 regression rerun + nit fold-ins (slug form, cross-harness note) Regression bench (task #15): re-ran the gemma4-12B-QAT bench from regime-2 on the v2 server-capable build to lock the no-regression gate against the committed baseline (25.54 t/s) and MTP reference (76.05 t/s). - Baseline /v1/chat greedy: mean 24.96 t/s across 3 reps (-2.31% vs 25.54), in band. - MTP /v1/chat greedy: mean 75.07 t/s across 3 reps (-1.29% vs 76.05), in band. - Draft acceptance: 0.76271 — bit-identical to committed regime-2 (225/295 accepted/generated). Strong determinism proof. Fold-in nits on the 2 already-committed JSONs (no dedicated fix commit per crystal-assist's review): - Memory slug citations switched from hyphen-form to underscore-form to match the actual slug names (feedback_spec_type_footgun, reference_mtmd_cli_jinja_required) — resolves to exact-match in tooling. - bench-pascal-server-router-smoke.json: added cross_harness_note clarifying the +1.2% server-endpoint vs llama-bench tg128 agreement STRENGTHENS the no-regression claim (different harnesses, in band). * scripts(pascal): #100 bullets 8, 4, 10 bench evidence + regression nit fold-ins Three model bench JSONs from the v2 server-capable Pascal build: bench-pascal-qwen3.6-27b-iq3-xxs.json (bullet 8): Qwen3.6-27B at UD-IQ3_XXS (11.17 GiB), -ngl 99 -fa on -c 4096 → mode=full-gpu, 65/0/65 layers, gpu_residency_pct=95.45%. /completion 11.27 t/s, /v1/chat 10.44 t/s, gpu free 4 GiB after load. Content reply: "The capital of France is Paris." (qwen3.6 thinking mode active). bench-pascal-gemma4-31b-iq4-xs-offload-{ngl40,ngl48}.json (bullets 4 + 10): ngl=40 phase-1 → ngl=48 phase-2-verify accelerator (crystal-assist's recipe): per_layer_combined = (gpu.model + gpu.context) / layers_gpu at ngl=40 = 309 MiB; ngl_max = 40 + floor((2954 - 400) / 309) = 48. Phase-2 verify PASS at -ngl=48: 49/62 layers GPU, 13/62 layers CPU, 4.95 t/s /completion, 4.98 t/s /v1/chat. Dense-layer partial-offload, host_model=3967 MiB, host_context=768 MiB. Card 96% utilized. gemma-4-31B-IQ4_XS is the smallest 31B quant available anywhere on titan or lithium (sweep done by crystal-assist) — confirms 31B = the documented offload-demo model, closes bullets 4 AND 10 in one bench. Regression rerun JSON: minor wording fix — the bit-identical content sha 01ba4719c80b6fe9 is sha256(b"\n") (single newline), not empty string or null. Banks the harness blind-spot that hashing `jq -r .content` output cannot distinguish JSON-null vs "" vs literal "null" vs "\n". A==B determinism conclusion stands (per crystal-assist review). * scripts(pascal): #100 bullets 7 + 9 bench evidence — qwen 35B MoE + gemma4 26B MoE Closes the last two model bullets: bench-pascal-qwen3.6-35b-a3b-iq3-xxs.json (bullet 7): Qwen3.6-35B-A3B (MoE, 3B active) at UD-IQ3_XXS (12.30 GiB), -ngl 99 -fa on -c 4096 → mode=full-gpu, 41/0/41 layers, gpu_residency_pct=96.85%. /completion 44.45 t/s, /v1/chat 40.96 t/s — fastest of any tested model (3B active keeps per-token compute light). Content reply: "The capital of France is Paris." VRAM 13003/16384 MiB after load (3 GiB headroom). bench-pascal-gemma4-26b-a4b-iq4-xs.json (bullet 9): Gemma4-26B-A4B (MoE, 128 experts / 8 active per token) at UD-IQ4_XS (12.66 GiB), -ngl 99 -fa on -c 4096 → 31/0/31 layers on GPU, /completion 42.10 t/s, /v1/chat 42.36 t/s. Content reply: "### Answer: The capital of France is Paris." VRAM 14345/16384 MiB after load. Classifier note (banked in JSON): the 26B host_model=748 MiB tripped the harness's 600-MiB expert-MoE threshold. Inspection of the gemma4 config (vocab=262144, hidden=5120, IQ4_XS bytes/weight) confirms 748 is the embedding tensor + boundary buffers (≈671 MiB pure embedding), NOT expert offload — all 128 experts are in gpu.model_mib=12952. PRIMARY layer-count signal (31/0/31) correctly reads full-GPU. The 600 MiB threshold was calibrated to 12B embeddings (~540 MiB) and under-scales for larger vocab×hidden_dim products. Mode patched to full-gpu with classifier_note explaining the misfire + suggested remediation (host_model_pct_of_total < 10-15% = embedding-pattern; ≥ that = real expert offload). All 10 issue #100 bullets now have committed bench evidence. --------- Co-authored-by: marksverdhei <marksverd@gmail.com> Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
… default (#101) The heierchat snapshot at tools/server/public/ was wired into llama-server as priority 1 in scripts/ui-assets.cmake, overriding the LLAMA_USE_PREBUILT_UI HF-bucket download even though that flag defaults ON. The snapshot expects a heierchat-app backend that isn't present on bare llama-server, which causes the embedded webui to hang on "Initializing connection to heierchat server…" when hit at llama-server's root. Now that heierchat is a standalone product talking to llama-server over its OpenAI-compatible API, the embedded UI reverts to the upstream llama.cpp default — fetched as a prebuilt bundle from the llama-ui HF bucket at build time via the existing tools/ui/ cmake scaffolding (no nodejs required, no fork-side build). Behaviour after this commit: - copy_public_dist priority 1 → MISS (public/ removed) - copy_src_dist priority 2 → MISS (tools/ui/src/ has no svelte sources) - BUILD_UI npm priority 3 → no-op (no source to build) - HF_ENABLED priority 4 → downloads upstream llama-ui bundle, embeds via llama-ui-embed (C++, host-compiler only) Docs updated to reflect the new arrangement. The copy_public_dist cmake function is preserved as a manual override for local experimentation. Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
…tor (#103) The upstream sync (#93) vendored the llama_hparams getter refactor: n_layer is now a getter returning n_layer_all - n_layer_nextn, with the settable member renamed. test-dflash still assigned the getter, breaking every CUDA/cpu CI build since 2026-06-07. n_layer_nextn defaults to 0, so n_layer_all = 5 preserves the tests' intent exactly. Verified: test-dflash compiles and all cases pass. Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
… logprobs partial-sort (#102) * cuda(fattn): vec kernel instances for D == 512 with matched quantized KV types Gemma 4 global attention layers (head size 512) previously always dispatched to the tile (pre-Volta) or MMA (Turing+) kernels, both of which require K/V dequantized to F16 -- with a quantized KV cache that staging pass re-reads and re-writes the entire per-layer KV every decode step. Add vec kernel instances for D == 512 with matched q4_0/q8_0 KV types (the vec kernel reads quantized KV directly) and dispatch to them for the small batch sizes the vec kernel already owns on each arch. Gated on matched quantized types and logit_softcap == 0 (vec only compiles softcap variants for D == 128/256). test-backend-ops previously had no quantized-KV FA coverage above head size 72; add Gemma4-shaped hs=512 cases (q8_0/q4_0, GQA, nb 1/2/3/32, sinks). All 2899 FLASH_ATTN_EXT cases pass on CUDA (sm_86) vs CPU reference. * server: avoid full-vocab sort when computing token probabilities get_token_probabilities() sorted the entire vocabulary (262k entries for Gemma) by logit on every emitted token when n_probs > 0, and the caller then linearly scanned the sorted vector again to find the sampled token's probability. The softmax normalization only needs the max and the sum of the logits -- both O(n) without sorting. Select the top n_probs tokens with a partial sort and return the sampled token's probability directly from the same pass: O(V log V + V) per token becomes O(V + k log k). No output change: same top-k ordering, same normalization over the full candidate set. * tests: add D == 512 quantized-KV FA perf cases Gemma 4 global-attention decode shapes (D=512, GQA=4, nb=1, q8_0/q4_0 KV) for test-backend-ops perf mode. RTX 3090 (sm_86), MMA+dequant -> vec: kv=4096 q8_0: 155.0 -> 76.5 us/run (2.03x) q4_0: 150.4 -> 90.9 us/run (1.65x) kv=8192 q8_0: 302.5 -> 145.3 us/run (2.08x) q4_0: 277.8 -> 163.1 us/run (1.70x) kv=16384 q8_0: 558.5 -> 286.1 us/run (1.95x) q4_0: 533.5 -> 298.3 us/run (1.79x) * cuda(fattn): restrict D == 512 vec dispatch to gqa_ratio <= 4 The vec kernel re-reads K/V once per Q head; tile/MMA amortize K/V reads across Q heads via the GQA optimization (at the cost of a dequant-to-F16 staging pass for quantized KV). Measured crossover on both sm_61 (TILE baseline) and sm_86 (MMA baseline): vec wins ~1.4-2.0x per-op at gqa_ratio <= 4, but loses 1.1-2.5x at gqa_ratio == 16 -- the Gemma 4 global-attention deployment shape (MQA, n_head_kv == 1). Adds the deployment shape (nh=1, nr=16) to correctness and perf test cases so the dispatch decision stays measurable. --------- Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
#74) All SYCL jobs in this workflow are commented out upstream (PR ggml-org#23705) to save Actions resources. With the upstream auto-triggers (push to master, pull_request matching ggml/src/ggml-sycl/**) still firing, every fork-sync that touches master creates a zero-job 'failure' run, polluting our Actions dashboard. Strip the auto-triggers on ht; keep workflow_dispatch so the job stays manually invokable if/when upstream re-enables SYCL CI. Will need to be re-applied after future master syncs that pull in the upstream version of this file.
Same pattern as PR #74 (build-sycl.yml): upstream PR ggml-org#23705 commented out every job in this workflow, but the push:master + pull_request auto-triggers stayed live. Any PR touching ggml/src/ggml-cann/** would create a zero-job "failure" run. Keep workflow_dispatch so the workflow stays available for manual runs once jobs are re-enabled (which requires provisioning dedicated runners per the upstream TODO). Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
… embedding example path (#76) Epoch #73 task 3 (docs review on tools/server/README.md). Three broken in-tree links found: * `chat.mjs` and `chat.sh` — both removed upstream by ggml-org#23870 (server: remove obsolete scripts). README still referenced them under '## More examples / Interactive mode'. Drop the entire section since both samples are gone; the OpenAI-compat endpoints documented elsewhere in this file cover the same surface. * `../embedding` — embedding moved from `tools/embedding` to `examples/embedding` upstream. Fix the relative path.
Epoch #81 task 5 (docs review). Both rows in the server-specific params table reference qwen3-omni TTS support that does not exist: * Neither --talker-model nor --code2wav-model is registered in common/arg.cpp; no C++ source mentions the strings "talker" or "code2wav" (only model-conversion code in conversion/qwen3.py references them as tensor name prefixes, which is unrelated). * The /v1/audio/speech endpoint the help text promises is also absent — only /v1/audio/transcriptions is wired up via routes.post_transcriptions_oai in server.cpp:198. The /v1/audio/speech string appears only in the webui's TTS *client* code (it calls out to a separate OpenAI-compatible TTS server, not back to llama-server). The vocoder-related flags that DO exist (--model-vocoder, --tts-use-guide-tokens) are still documented elsewhere and unchanged. Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
…t trace (#89) Epoch #86 task 2 (docs review round 2 on README-dev.md). The "Example trace of a request" section described two methods that don't exist by those names: response->update() and response->to_json(). Greppers chasing those names find nothing. The real calls are result->update(states[idx]) (inside server_response_reader::next at server-queue.cpp:396) and result->to_json() (called from multiple sites in server-context.cpp ~3854-3987). Rewrites the two affected bullets to use the actual member names and to thread the call sites through server_response_reader (which is the component that does both calls — server_res_generator is what owns the reader, but doesn't make these calls directly). No other drift found in the doc — confirmed server_routes, server_res_generator, launch_slot_with_task, task_result_state, and handle_completions_impl all exist by those names in current source. Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
The Backend & quantization table omitted two HT-specific speculative decoding features that have shipped to ht: - DFlash (LLM_ARCH_DFLASH, --spec-type dflash, custom CUDA kernels for partial-accept feature extraction) — landed via PR #62 (b0daec5), integrates the z-lab DFlash block-diffusion drafter against Gemma4 31B targets. - Gemma4 MTP (gemma4-assistant arch + --spec-type draft-mtp) — vendored via PR #93 (4c09765) ahead of upstream PR ggml-org#23398 merge so the gemma-4-12b-qat-mtp preset can ship on titan. Marked with Tracked-upstream=ggml-org#23398 since it retires when that PR merges and flows through a normal master sync. Found during a §7 documentation freshness sweep — the inventory exists to be authoritative ("consult it before assuming a behaviour is upstream stock" per AGENTS.md), so omissions defeat the purpose. Docs-only, no code touched. Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
…rt (#77) Epoch #73 task 4 (bug-hunt on server-models.cpp). Two race-condition latent bugs in router state machine. Both used `std::map::operator[]` where `find()` was needed; silent default-insert on miss produced false-positive predicate / phantom mapping entries. ## Bug 1: unload_lru() cv.wait predicate (line 762) ``` cv.wait(lk, [&]() { return mapping[lru_model_name].meta.status == SERVER_MODEL_STATUS_UNLOADED; }); ``` The default-init `server_model_meta` has `status = SERVER_MODEL_STATUS_UNLOADED`, so this predicate is spuriously true if the model is missing — AND it inserts a garbage entry into `mapping` as a side effect. Race: another thread (or a reload) unloads/removes the model between `unload(lru_model_name)` (which released its own lock) and our re-acquire. Predicate returns true on the inserted default; we proceed thinking the unload completed, but mapping is now polluted with a phantom entry. Fix: `find()`; missing → treat as done. ## Bug 2: load() after cv.wait(!is_reloading) (line 779) ``` cv.wait(lk, [this]() { return !is_reloading; }); auto meta = mapping[name].meta; if (meta.status != SERVER_MODEL_STATUS_UNLOADED) { ... return; } ``` `has_model()` was checked earlier under its own lock-cycle, then `unload_lru()` cycled the lock, then we re-acquire here. If a reload ran in that window and erased `name` from mapping, `mapping[name]` silently inserts a default. The early-return is bypassed (default status = UNLOADED) and we proceed to spawn a child with empty preset args. Fix: `find()`; missing → log + bail. ## Verified - ✅ `cmake --build build --target llama-server` succeeds locally - ✅ No behavior change on the non-racy path (find()+early-return on miss is operationally equivalent to operator[]+early-return on default-UNLOADED, minus the silent insertion) ## Out of scope Other `mapping[]` sites in load() (lines 1020, 1031) are reachable only while the lock is held continuously from the now-guarded read at 779, so no race exists there. `mapping[name]` at line 1186 (`proxy_request`) also reads under-lock after `get_meta()` confirmed presence. Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
…ids (#83) Bug-hunt round 3 finding (sweep on tools/server/server-queue.cpp). server_response::remove_waiting_task_ids (multi-id) only erased entries from waiting_task_ids — unlike its single-id sibling remove_waiting_task_id which also dropped any pending entries in queue_results for the same id. When a reader tears down mid-stream (HTTP client disconnect, abort, or ordinary destruction during streaming with results still in flight), the asymmetric multi-id path leaves partial results stranded in queue_results. Those entries can never be matched by a future recv() (no caller waits on those ids any more), but the predicate `!queue_results.empty()` in recv()'s cv.wait still fires immediately, so the next recv() for an unrelated task id spin-waits at 100% CPU until that task's own result arrives and the for-loop finds it. Add the same erase-if pass to remove_waiting_task_ids and drop the now-redundant per-id remove_waiting_task_id calls from server_response_reader::stop() — they were the only thing covering this orphan case before, and only on the cancel branch. Caller audit: remove_waiting_task_ids has exactly one production call site (server_response_reader::stop), so the cleanup is limited-blast-radius. The cancel-loop's redundant single-id calls were 1 extra mutex acquire per cancelled task; removing them recovers that as a minor side effect. Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
…78) Epoch #73 task 5 (DFlash perf scout, in lieu of titan-gated bench). The top-5 logits selection at common/speculative.cpp:935 was unconditional — `if (i == 1)` gated when (once per draft call) but not whether. The LOG_INF below it is verbosity-gated, so on production the log is suppressed, but the O(n_vocab * log 5) selection still runs. On gemma-class vocabs (~256k tokens) the selection burns ~1ms per draft call. At Round-10's measured ~8% accept rate, every output token costs several draft calls — so this debug computation is in the steady-state hot path. Fix: extend the gate to `if (i == 1 && dflash_debug)`. `dflash_debug` is the cached env-var probe already declared at line 883 (used by the features-debug block immediately above). When LLAMA_DFLASH_DEBUG is set the diagnostic still fires; production is unaffected. Found during epoch #73 task 5 — DFlash hot-path scout. Local CPU build verifies; behavior change only when LLAMA_DFLASH_DEBUG is set (was unconditional → now gated; same code runs when enabled). Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
… (#71) Measured perplexity on Qwen3.5-0.8B-BF16 / wikitext-2 / ctx=512: | cache-type | PPL | vs f16 | |------------|--------|--------| | f16 | 19.08 | baseline | | q8_0 | 19.08 | lossless | | tbq3_0 | 1252.30 | 65x worse | | tbq4_0 | 1393.00 | 73x worse | TBQ KV-cache produces near-random output. Likely root cause is statistical: TBQ's rotated-domain codebook was calibrated for weight distributions, not the K/V tensor distributions seen during inference. The encoding scheme itself cannot faithfully represent KV values. Snoop-kube's cluster audit confirms zero deployments use tbq* KV-cache (every host uses q8_0 or q4_0). DFlash also defaults to q8_0 (PR #65). No production consumer exists. This PR adds a one-line experimental note to the --cache-type-k/v and --cache-type-k-draft/v-draft help text, referencing issue #70 for the full data + recommendation. Code path stays in place — Markus may have roadmap intent I'm not aware of; this just stops anyone reading --help from assuming tbq* is a usable choice without checking. Follow-ups if Markus prefers full removal: * drop tbq3_0/tbq4_0 from common/arg.cpp's kv_cache_types list * keep the ftypes (TBQ weight quantization is separate from KV use) * close issues ggml-org#124 + ggml-org#125 as wont-fix
#82) When generation hits `STOP_TYPE_LIMIT` (max_output_tokens / ctx-size cap), the OAI Responses code paths hardcoded `"status": "completed"` everywhere — top-level response, per-message output items, function_call items, and the streaming `response.completed` SSE event. Agentic clients (Codex CLI, etc.) couldn't tell a finished response from a truncated one and ended up feeding partial output back into conversation history, triggering infinite retry loops on JSON-parse failures (issue #19, Phase 2). Per the OAI Responses spec, branch on the stop type in: * `server_task_result_cmpl_final::to_json_oaicompat_resp()` — emit `"status": "incomplete"` on the top-level response, all output items inherit the same status, plus `"incomplete_details": {"reason": "max_output_tokens"}` at the top level. * `to_json_oaicompat_resp_stream()` — same mapping on the per-item statuses, plus the final SSE event becomes `response.incomplete` (vs `response.completed`) with `incomplete_details` on the payload. Doesn't address Phase 2 of the issue (HTTP 400 + actionable message from `func_args_not_string`) — that requires typed exception plumbing through common/chat.cpp into the server error path. Phase 1 alone prevents the cascade in the first place: clients see truncation as truncation, not as a malformed completed response. Test coverage in test_compat_oai_responses.py: * `test_responses_truncation_emits_incomplete_status` — non-streaming: `max_output_tokens: 2` on tinyllama2 reliably trips STOP_TYPE_LIMIT; assert status=incomplete + incomplete_details + per-item status. * `test_responses_truncation_stream_emits_incomplete_event` — streaming: same setup, verify a `response.incomplete` event arrives with the same payload shape. Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
* fix(server): per-slot byte cap on context checkpoints (closes #67) The existing checkpoint cap is count-only (`--ctx-checkpoints`, default 32), which lets a single slot accumulate ~20 GB of host-RAM checkpoints under heierchat's long contexts and drives titan into SystemOOM (37 GB anon-rss on the 46 GB / 3 GB-swap node). Adds a per-slot byte budget: * `--ctx-checkpoints-max-mib N` (env `LLAMA_ARG_CTX_CHECKPOINTS_MAX_MIB`), default 4096 MiB / slot, 0 = disabled (count-only legacy behavior). * Eviction in `create_checkpoint` now FIFO-evicts until BOTH caps satisfy. Whichever cap bites first is reported via `reason=count|bytes` in the warning log so it's diagnosable from titan logs. * The success log now also reports `slot total = X MiB / Y MiB cap` so the current footprint is visible per checkpoint create. A 4 GiB-per-slot default bounds total host-RAM checkpoint use at `n_slots * 4 GiB`. With heierchat's typical `--parallel 1` (DFlash gate) that's 4 GiB worst-case; with `--parallel 4` it's 16 GiB — both well under titan's 46 GiB. Follow-up (snoop-kube discussed): a dynamic cap based on `/proc/meminfo MemAvailable * 0.3` would adapt better than the fixed default — left for a separate change once the byte-cap mechanism is in production. * test(server): unit tests for per-slot checkpoint byte cap (#68) Adds tools/server/tests/unit/test_ctx_checkpoints_bytes_cap.py with four scenarios for --ctx-checkpoints-max-mib: * default-args: server starts; if checkpoints are created the "slot total = X MiB / Y MiB cap" footprint marker appears in the create_checkpoint log line. * --ctx-checkpoints-max-mib 0: byte cap disabled, server starts fine, request succeeds (count-only legacy behavior). * negative value: arg parser rejects, ServerProcess.start() raises. * tiny byte cap + multi-turn chat: when eviction fires, the log reports reason=bytes. Skipped if tinyllama2 doesn't accumulate any checkpoints in the short conversation (rather than flaking). ServerProcess gains three knobs for the existing/new flags: n_ctx_checkpoints, checkpoint_min_step, ctx_checkpoints_max_mib — all default to None (use server defaults) and only emit a CLI flag when set, so existing tests are unaffected. * fix(server): bail early from create_checkpoint when --ctx-checkpoints=0 Bug-hunt finding during the PR #68 review territory: the eviction loop unconditionally calls slot.prompt.checkpoints.front() / .erase(begin()) based only on size >= n_ctx_checkpoints. When n_ctx_checkpoints is 0 and the list is empty (the user-likely "disable checkpoints" intent), both calls hit empty-container UB. The arg parser accepts 0 without complaint and silently wraps negative ints via the size_t cast to SIZE_MAX (which is also a no-op cap). Rather than tighten the arg parser and risk breaking unknown callers, treat n_ctx_checkpoints <= 0 as "checkpoints disabled" at the call boundary — a sensible interpretation that's also what the negative-wrap was de facto delivering. Adds test_ctx_checkpoints_zero_disables_creation to the existing test_ctx_checkpoints_bytes_cap.py: drives the server with --ctx-checkpoints 0 and asserts no "created context checkpoint" line ever fires while requests still succeed. --------- Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
* scripts(dflash): Round-12 target-precision bench + parity scaffold + gguf guard Three additive scripts for the DFlash accept-rate investigation (Round-12), none touching tracked source so they sit cleanly alongside the PR #53 squash: - gguf-meta.py: numpy-free GGUF header reader with --check-instruct, which refuses base-fine-tune and truncated/stub GGUFs. Prevents the base-vs-instruct confound (an -it-trained DFlash drafter benched against a base target). - bench-dflash-target-sweep.sh: sweeps the TARGET quant (drafter fixed) to test whether target-side quant noise off the drafter's bf16 training distribution drives the 8% vs ~21% accept gap. Accept recomputed from raw n_accept/n_drafted counts; mean +/- sample stddev over N runs; REAL(>1sigma)/within-noise deltas. - dflash-logit-parity.py: scaffold for FORWARD logit parity vs the z-lab PyTorch drafter (Round-7b only did weight parity). Constants read data-driven from the drafter config.json; reference forward marked TODO(zlab) pending the z-lab modeling code (HF repo ships weights only). * scripts(dflash): gguf-meta --check-instruct rejects truncated tensor data The guard validated the GGUF header but not that the tensor DATA was present, so a file truncated mid-write (valid header, missing weights) passed --check-instruct and would have been benched — loading garbage or crashing mid-run. Caught empirically: the corrupt gemma-4-31B-it-Q5_K_M.gguf (1.5GB, header intact) slipped through. read_meta() now walks the tensor-info section, computes the minimum file size implied by the tensor offsets + alignment, and sets _data_complete. --check-instruct rejects when actual size < implied minimum. Same failure class as the HF-xet silent shard drop the download step hit. Verified: corrupt Q5 (1.5GB < 21.7GB) REFUSED; Q8_0/BF16/Q4_K_M/IQ4_XS all complete and ACCEPT.
* scripts(dflash): deployment-parity prompt suite + bench harness
15 prompts × 3 classes (MT-Bench / HumanEval / GSM8K, 5 each) targeting z-lab's
published Gemma τ table (MT 4.23 / HE 8.00 / GSM8K 7.53 at conc=1, BF16, block=16).
Greedy temp=0, --spec-draft-n-max=15, fixed seed for reproducibility.
bench-dflash-parity.sh runs the suite against llama-speculative-simple and
emits per-prompt {tau, n_accept, n_drafted, decode_tps} as JSON. snoop-kube
runs the SAME prompts against vLLM/SGLang with z-lab/gemma-4-31B-it-DFlash on
titan and emits the same shape — we diff cell-by-cell to localize the gap.
tau computed as n_predict / (n_predict - n_accept), the same convention as
z-lab's dflash_generate's acceptance_lengths.
* scripts(dflash): bench harness env knobs for CPU-offload + cluster protection
- DFLASH_PARITY_NGL / NGLD: override target/drafter -ngl (default 99).
Needed to fit larger targets that don't fit a single 24G card; with -ngl 35
the Q8_0 31B target loads alongside the 1.2G Q6_K drafter on one 3090.
- DFLASH_PARITY_TIMEOUT: per-prompt timeout (default 240s). CPU-offload runs
for BF16 targets take minutes per prompt at low GPU layer counts.
- DFLASH_PARITY_THREADS: --threads cap. On centurion (etcd HA control-plane
member) leave >=2 cores free so long CPU-offload runs don't add fsync
latency that wobbles cluster heartbeat/leader-election.
- DFLASH_PARITY_NICE: nice -n prefix (0-19). Sets the bench at minimum
priority on shared boxes.
Defaults preserve the prior full-GPU behavior; opt-in only.
---------
Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
…ep 1) (#69) Adds an optional `out_bytes_per_device` output parameter to `common_fit_params` (defaulting to nullptr so all existing callers are unaffected). On SUCCESS, populates with the projected per-device byte demand for the resolved plan — index 0 is the CPU device, indices 1..N are GPU/accel devices in the same order as `tensor_split`. This is the foundation for issue #66 (per-GPU-aware router fit). The router currently admits candidates against the TOTAL VRAM pool via `common_fit_params` + `--models-max`, ignoring that two models pinned to CUDA0 collide on GPU0's 24 GiB even though the 48 GiB pool says fits. Subsequent steps (separate PRs): * (2) Track `reserved_per_device[]` in `server_models`, populated from this output on each load and decremented on unload. * (3) Replace count-only `unload_lru()` with `unload_lru_for_devices(targeted, candidate_per_device)` that LRU-evicts from the constrained device set. * (4) Wire admit decision against `free_device[d] - reserved[d]` per targeted device. No behavior change in this PR — the new output is opt-in and the existing planning logic is untouched. Both fit-impl return paths (the early MoE-trivial path and the full-search path) populate `out_bytes_per_device` identically with the final `mem` vector that fit_impl already computes internally.
…#66 step 2 prep) (#72) * feat(fit-params): --fit-print-plan emits per-device byte plan as JSON (#66 step 2 prep) The router's per-GPU admit decision (#66) needs the per-device byte demand for a candidate model BEFORE spawning the child subprocess. PR #69 added the underlying `out_bytes_per_device` output to `common_fit_params`; this PR exposes it via the existing `tools/fit-params` CLI as a subprocess-friendly JSON output. * New CLI flag `--fit-print-plan` (env `LLAMA_ARG_FIT_PRINT_PLAN`). * On success, prints a single-line JSON object on stdout: {"per_device_bytes":[N0,N1,...],"n_devices":K,"total_bytes":T} plan[i] = i-th GPU/accel device, same order as tensor_split; CPU host memory NOT included. Empty plan for CPU-only builds. * On fit failure, emits an explicit JSON failure marker and exits 1: {"error":"fit_failed","status":N} * common/fit.cpp: populate `out_bytes_per_device` at the three early- return paths (the impl had three 'no changes needed' fast-paths that bypassed the main return point where PR #69 wrote the plan). Doc string in common/fit.h corrected — plan covers GPU devices only. Designed to be subprocessed from `server_models::compute_admit_plan(name)` (#66 step 2 — out-of-process approach per the architectural call on issue #66 / task ggml-org#123). The router parses this JSON, tracks `reserved[d]` for in-flight LOADING models, admits candidates against `live_cudaMemGetInfo(d) - reserved[d]`. Mutually exclusive with the existing `--fit-print` mode; if both set, `--fit-print-plan` wins. Local CPU build verified: `--help` renders the new flag, empty plan returned for CPU-only build as expected. GPU verification deferred to snoop-kube's canary-cycle. * test(fit-params): smoke for --fit-print-plan JSON output (PR #72 coverage) (#75) PR #72 added the --fit-print-plan flag to llama-fit-params without test coverage. This adds a tools/fit-params/tests.sh (pattern lifted from tools/gguf-split/tests.sh) that downloads a small Qwen3-0.6B GGUF and verifies six invariants: 1. success-path emits single-line JSON 2. schema has per_device_bytes / n_devices / total_bytes with correct types 3. len(per_device_bytes) == n_devices 4. total_bytes == sum(per_device_bytes) 5. on CPU-only builds (n_devices==0): plan is empty, total is 0 6. fit-failure (nonexistent model) emits the documented "error":"fit_failed" JSON marker on stdout (not garbage) so subprocess callers can distinguish fit-failure from parse-failure Run with: tools/fit-params/tests.sh path/to/build/bin Verified locally on CPU-only build: ALL fit-params --fit-print-plan smoke tests PASSED.
Bug-hunt round 4 finding on tools/server/server-http.cpp.
The api-key middleware's allowlist (/health, /v1/health, /models,
/v1/models, /, /index.html, /bundle.js, /bundle.css) was matched
against req.path verbatim. With --api-prefix /llama, the routes are
registered at "/llama/health" etc. (handlers attach to
path_prefix + path in server_http_context::{get,post}), so req.path
arrives as "/llama/health" and never matches the un-prefixed
allowlist. Health probes and the static webui assets then 401 under
--api-key, defeating the point of marking them public.
Capture path_prefix into the middleware lambda and strip it from
req.path before the allowlist lookup.
Caller audit:
- Only public_endpoints lookup needs the strip; the api-key validity
check itself is correct regardless of prefix.
- The unrelated bundle-of-tests fix in utils.py adapts the health
poll in ServerProcess.start() so tests can actually use api_prefix.
Tests in test_security.py:
- test_api_prefix_keeps_public_endpoints_public — drives all 4
prefix+public endpoints with --api-key + --api-prefix and asserts
no 401.
- test_api_prefix_still_requires_key_for_private_routes — guards the
inverse: private endpoints under prefix MUST still 401 without a key.
20/20 security tests pass locally (was 18 before).
Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
…104) turboq_rotate_block_forward/inverse looped to QK_K (256) while every caller (tbq3_0/tbq4_0 quantize + dequantize) passes TBQ_BLK_SIZE (128) float buffers — a 128-float OOB read+write per block. Confirmed with ASAN (heap-buffer-overflow in matvec_row via quantize_row_tbq3_0_ref) and the cause of the Windows x64 CI failures: test-quantize-fns SEGFAULT / test-quantize-perf 0xc0000374 (STATUS_HEAP_CORRUPTION). Loop bound fixed to TBQ_BLK_SIZE. No behavior change for the valid region: the extra iteration only produced the out-of-bounds garbage. test-quantize-fns + test-quantize-perf now pass under ASAN+UBSAN. Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
…#105) * ci: unbreak fork CI — dead self-hosted labels + invalid sycl workflow Three fork-side CI defects, all inherited from upstream's runner topology: - build-cmake-pkg.yml: [self-hosted, Linux, CPU] matches no runner in this org (only ht-org-k8s-* with Linux,X64,k8s exists) -> the job queued forever and the 'CI (cpu)' workflow has never concluded since the 2026-06-07 CI refactor sync. GitHub-hosted ubuntu-latest (repo is public). - python-lint.yml: [self-hosted, fast] same story; flake8 dead-queued on every push since 2026-05-24. ubuntu-latest. - build-sycl.yml: #74 stripped the auto-triggers, but the file itself is schema-INVALID (all jobs commented out = empty jobs map), and GitHub creates a zero-job 'failure' run on every push for invalid workflows regardless of triggers. Add a never-running placeholder job so the file parses; drop it when upstream re-enables SYCL CI. * ci(lint): flake8 green — NP100 per-file-ignores for stdout CLIs + drop unused np binding The linter has been dead-queued since 2026-05-24 (runner label); reviving it surfaced 36 violations, all in downstream diagnostic scripts: - scripts/{gguf-meta,compare-dflash-weights,dflash-logit-parity}.py write their reports to stdout by design — NP100 (no print(), use logging) targets library code. Per-file-ignores in .flake8. - dflash-logit-parity.py: F841 — numpy presence-check kept, dead binding dropped. --------- Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
…ifications (#106) Every downstream change now carries a Why column — the reason it exists on ht — and the section states the audit rule: a change we can no longer justify gets dropped at the next upstream sync. New coverage (previously undocumented): D=512 FA vec kernels, server table (heierchat/router glue, ctx-checkpoint byte cap, --api-prefix public endpoints, Responses incomplete status, logprobs partial-sort, fit-params byte plan, router hardening, termd, log-noise fixes), scripts & validation (DFlash bench suite, Pascal P5200 recipe, downstream test coverage), and the full Build/CI delta (trigger strips, runner re-targeting, fork meta). Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
Sync the ht fork onto upstream llama.cpp master: merge-base 0066404 -> upstream d8a24cc, 114 upstream commits. Resolved 96 conflicts (63 modify/delete + 32 content + 1 add/add). Key resolution decisions (full rationale in PR): - tools/ui/* (63 files): kept deleted. The fork ships a prebuilt UI from the llama-ui HF bucket, not the upstream Svelte source tree. - mtmd cluster (clip/mtmd/mtmd-helper, 6 files): converged to upstream. ht's mtmd was vendored upstream commits now behind master (wrapper-struct bitmap API, video/ffmpeg support, builder pattern); verified no ht-unique features lost; server callers updated to the wrapper API. - Gemma4 MTP / DFlash / eagle3: merged both sides. Kept ht's DFlash arch, hparams, and speculative wiring; adopted upstream's now-merged eagle3 + masked-embedding support (vendored PR ggml-org#23398 converging upstream). - KV cache: adopted upstream's v_cells_impl shared_ptr refactor, which mainlines the dual-context cell-sharing ht hacked in via `other`. - llama-graph kq-mask: took upstream's buffer-guarded mask set (matches the shared-cells refactor); dropped the now-duplicate unconditional call. - Embedding scale: kept ht's strict `ubatch.token` guard (Gemma4 vision must not scale raw image embeddings) over upstream's deepstack-aware variant. - Server: kept ht features (router glue, #94 benign-warning downgrade, #87 api-prefix stripping, DFlash draft + MTMD-draft wiring) merged with upstream's expanded public endpoints, mtmd wrapper API, and int64 counts. Post-merge fixes (auto-merge artifacts the build surfaced): - removed duplicate llama_model::fc member (both sides added it) - restored `n_embd` in output_reserve for upstream's embeddings-layer-inp path Validation: builds clean (118 targets, CPU Release); ctest -L main 47/47 green. Deployment-shaped paths (dual-context speculative KV sharing, Gemma4 vision embedding scale, mtmd video) need a real-model smoke before fleet rollout.
Author
|
Superseded — the sync landed via linear rebase instead of this merge commit. Per maintainer direction (keep ht a clean linear delta on top of upstream, no merge nodes), the resolved tree from this PR's merge commit
Closing as landed. The |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Upstream sync: ht ← upstream/master (114 commits)
Syncs the
htfork onto the latest upstreamllama.cppmaster.006640408→ upstreamd8a24ccee(114 upstream commits)d8a24ccee, so the next sync measures its delta correctly (a vendored/squashed merge would not — see below).Important
Merge this with a MERGE COMMIT (or fast-forward), NOT squash. This is a true 2-parent merge commit; squashing flattens it, the merge-base would not advance, and the next sync would again over-count its delta. This is the one PR where the repo's default squash-merge policy must be overridden.
Validation
-Werror.ctest -L main— 47/47 green (incl.test-quantize-fns,test-llama-archscovering the DFlash+eagle3 arch merge,test-backend-ops).test_basic5/6,test_completion38/40. The 2 failures are sandbox-environment artifacts, not regressions: webuiindex.htmlcouldn't download from thellama-uiHF bucket offline (test_no_webui), and one test needs a model not in the offline cache (..._stream_..._stops).Key resolution decisions
tools/ui/*(63 files)llama-uiHF bucket, not the upstream Svelte source tree.tools/mtmd/*(6 files)v_cells_implrefactorshared_ptrcells mainline the dual-context cell-sharing ht hacked in viaother; the init-list shares the source cache's storage whenotheris set.llama-graphkq-maskubatch.tokenguardPost-merge fixes (auto-merge artifacts the build surfaced, not conflicts): removed a duplicate
llama_model::fcmember; restoredn_embdinoutput_reservefor upstream's embeddings-layer-inp path.ctest+ the small-model server smoke don't exercise the deployment-shaped paths most affected by the merge. Before deploying to titan/crystal:v_cells_implrefactor changed the cell-sharing mechanism.Rollback: tag
ht-pre-sync-2026-06-12(immutable).