feat: LoRA adapter auto-discovery and frontend UI#15
Closed
marksverdhei wants to merge 15 commits into
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Register Qwen2_5OmniThinkerForConditionalGeneration architecture for text and mmproj GGUF conversion. Handle config structure difference where the Thinker-only variant has vision/audio configs at the top level. Add pooling type detection for embedding use cases. Fix audio tensor routing to base MmprojModel class. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
)

* docs: add ht-fork documentation, branding, and discussion links

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* convert: support LoRA conversion for MLA kv_b_proj

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* ci: add fork sync automation

* feat: add --remap-developer-role flag to translate developer→system

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: support LCO-Embedding-Omni (Qwen2.5 Omni Thinker) GGUF conversion

  Register Qwen2_5OmniThinkerForConditionalGeneration architecture for text and mmproj GGUF conversion. Handle config structure difference where the Thinker-only variant has vision/audio configs at the top level. Add pooling type detection for embedding use cases. Fix audio tensor routing to base MmprojModel class.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* ci: add ht branch to flake8 lint workflow triggers

* feat: welcome agentic contributions, remove upstream AI restrictions

  - Delete AGENTS.md (upstream's anti-AI contributor guidelines)
  - Replace restrictive AI Usage Policy with welcoming Agentic Contributions section
  - Update README to highlight fork's pragmatic stance on AI contributions

  Unlike upstream, we evaluate code by quality, not by how it was written.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* webui: add cancel button for in-progress model loading

  Allow users to cancel a model that is stuck loading or taking too long in the router mode model selector. The cancel button appears next to the loading spinner in both the model selector dropdown/sheet trigger and within individual model option rows.

  Uses the existing /models/unload endpoint which already supports unloading models in LOADING state. The frontend polling loop is interrupted via AbortController to prevent stale error toasts.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* webui: add cancelling state indicator and fix cancel polling

  - Show orange "Cancelling" indicator with spinner while cancel is in progress
  - Poll until server confirms model is no longer in LOADING state before clearing the cancelling indicator
  - Guard against redundant unload calls on already-unloaded models
  - Keep loadingModelId alive during cancel so selector trigger shows the cancelling state correctly

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(webui): color-coded spinners for model load/unload/cancel states

  - Loading: green spinner, clockwise
  - Unloading: red spinner, reverse direction with "Unloading" label
  - Cancelling: orange spinner, reverse direction
  - Track unloading state separately in models store

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(webui): address PR review feedback for cancel model loading

  - Remove duplicated cancel logic from ModelsSelector and ModelsSelectorSheet by deriving loading/cancelling state from the store (issue #1)
  - Fix race condition: no longer set isLoadingModel=false before cancel completes, preventing brief UI flash (issue #2)
  - Add MAX_CANCEL_POLL_ATTEMPTS (60) timeout to cancel polling loop to prevent infinite polling if server never transitions (issue #3)
  - Replace div cancel buttons with proper <button> elements for keyboard accessibility and screen reader support (issue #4)

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
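The bounded cancel-polling behaviour described in the commits above can be sketched roughly as follows. This is an illustration only, not the code from the PR: `pollUntilNotLoading`, `ModelState`, `PollOutcome`, and the injected `fetchState`/`delay` callbacks are hypothetical names, and only the constant `MAX_CANCEL_POLL_ATTEMPTS` (60) comes from the commit message.

```typescript
// Hypothetical model states; the real store may use different names.
type ModelState = "loading" | "loaded" | "unloaded";
type PollOutcome = "settled" | "timed_out" | "aborted";

// Value taken from the commit message; everything else here is an assumption.
const MAX_CANCEL_POLL_ATTEMPTS = 60;

interface PollOptions {
  maxAttempts?: number;
  // Injected wait between polls (e.g. a setTimeout wrapper); optional for tests.
  delay?: () => Promise<void>;
  // Anything with an `aborted` flag works, including a real AbortSignal.
  signal?: { aborted: boolean };
}

// Poll until the server no longer reports LOADING, giving up after a
// fixed number of attempts so the UI cannot poll forever.
async function pollUntilNotLoading(
  fetchState: () => Promise<ModelState>,
  opts: PollOptions = {}
): Promise<PollOutcome> {
  const maxAttempts = opts.maxAttempts ?? MAX_CANCEL_POLL_ATTEMPTS;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if (opts.signal?.aborted) return "aborted";
    const state = await fetchState();
    if (state !== "loading") return "settled";
    if (opts.delay) await opts.delay();
  }
  return "timed_out";
}
```

In this sketch the caller would clear the "Cancelling" indicator on `settled` and decide separately how to surface `timed_out`; how the actual store handles that case is not stated in the commit message.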
When using --models-dir (router mode), the server now reads GGUF metadata from .gguf files to distinguish LoRA adapters from models. Adapters with general.type="adapter" are collected separately and automatically passed to child model instances via --lora and --lora-init-without-apply when their general.architecture matches the model being loaded.

Changes:
- Add common_lora_adapter_info struct and common_models_dir_result to preset.h
- Add load_from_models_dir_with_lora() that uses gguf_init_from_file() to read metadata and classify files as models or adapters
- Inject matching LoRA adapters into child process args at spawn time
- Expose discovered adapters in GET /v1/models response as lora_adapters array
- Add discovered_adapters storage and get_discovered_adapters() to server_models

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
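The classification and matching logic described above lives in the C++ server. As a language-neutral illustration, it can be sketched in TypeScript over already-parsed metadata. The `GgufInfo` shape and the `classifyGguf`/`loraArgsFor` helpers are hypothetical; only the metadata keys (general.type, general.architecture) and the two CLI flags come from the commit message, and the assumption that `--lora` is repeated once per adapter is mine.

```typescript
// Hypothetical view of the two GGUF metadata keys the commit message mentions.
interface GgufInfo {
  path: string;
  type: string;         // value of general.type
  architecture: string; // value of general.architecture
}

// Split scanned files into models and adapters by general.type.
function classifyGguf(files: GgufInfo[]): { models: GgufInfo[]; adapters: GgufInfo[] } {
  const models: GgufInfo[] = [];
  const adapters: GgufInfo[] = [];
  for (const file of files) {
    if (file.type === "adapter") {
      adapters.push(file);
    } else {
      models.push(file);
    }
  }
  return { models, adapters };
}

// Build the extra child-process args for one model: every adapter whose
// architecture matches, loaded but not applied.
function loraArgsFor(model: GgufInfo, adapters: GgufInfo[]): string[] {
  const matching = adapters.filter((a) => a.architecture === model.architecture);
  if (matching.length === 0) return [];
  const args: string[] = [];
  for (const adapter of matching) {
    args.push("--lora", adapter.path);
  }
  args.push("--lora-init-without-apply");
  return args;
}
```

Matching on architecture alone is coarse: two models of the same architecture but different sizes would both receive the adapter, which is why a later runtime check on tensor shapes is still needed.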
Add frontend UI for discovering and managing LoRA adapters loaded on the server. The UI only appears when adapters are available (zero disruption otherwise) and integrates with the existing chat completion flow.

New files:
- lora.service.ts: API service for GET/POST /lora-adapters
- lora.svelte.ts: Reactive store with adapter state, toggle, scale
- LoraAdapters.svelte: Collapsible panel with per-adapter switch + slider

Integration:
- LoRA panel renders above ChatFormActions inside the chat input area
- Active adapter scales are included in chat completion requests via the lora field in getApiOptions()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
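A minimal sketch of the toggle and request-building behaviour described above, assuming an adapter is "active" whenever its scale is non-zero and that the lora request field is a list of `{ id, scale }` objects. The names `LoraAdapterState`, `toggleAdapter`, and `buildLoraField`, and the fallback scale of 1, are hypothetical and not taken from the PR.

```typescript
// Hypothetical per-adapter state kept by the store.
interface LoraAdapterState {
  id: number;
  path: string;
  scale: number;         // 0 means the adapter is off
  previousScale: number; // last non-zero scale, restored when toggled back on
}

// Toggle an adapter: switching off remembers the current scale,
// switching on restores it (or falls back to 1 if none was recorded).
function toggleAdapter(adapter: LoraAdapterState): LoraAdapterState {
  if (adapter.scale !== 0) {
    return { ...adapter, previousScale: adapter.scale, scale: 0 };
  }
  const restored = adapter.previousScale !== 0 ? adapter.previousScale : 1;
  return { ...adapter, scale: restored };
}

// Build the lora field for a chat completion request; returns undefined
// when no adapter is active so the field can be omitted entirely.
function buildLoraField(
  adapters: LoraAdapterState[]
): { id: number; scale: number }[] | undefined {
  const active = adapters
    .filter((a) => a.scale !== 0)
    .map((a) => ({ id: a.id, scale: a.scale }));
  return active.length > 0 ? active : undefined;
}
```

Returning `undefined` rather than an empty array keeps requests unchanged for servers without adapters, which fits the "zero disruption" goal stated in the commit message.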
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix proxy_post to accept model name from query params (needed for endpoints like POST /lora-adapters where body is an array)
- Pass selected model ID to LoRA service/store in router mode
- Re-fetch adapters when model changes via $effect
- Add /lora-adapters to vite dev proxy config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
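Because the POST /lora-adapters body is a JSON array, there is no object in which to carry a model field, so the model has to travel in the URL. A rough sketch of what the client side of that might look like follows; the query parameter name `model`, the `prepareLoraUpdate` helper, and the `{ id, scale }` body shape are assumptions rather than details confirmed by the commit message.

```typescript
interface LoraScaleUpdate {
  id: number;
  scale: number;
}

interface PreparedRequest {
  url: string;
  method: "POST";
  body: string;
}

// Prepare a POST /lora-adapters request. The body stays a plain array,
// so in router mode the target model is passed as a query parameter.
// NOTE: the parameter name "model" is an assumption for illustration.
function prepareLoraUpdate(
  baseUrl: string,
  updates: LoraScaleUpdate[],
  modelId?: string
): PreparedRequest {
  const query = modelId ? `?model=${encodeURIComponent(modelId)}` : "";
  return {
    url: `${baseUrl}/lora-adapters${query}`,
    method: "POST",
    body: JSON.stringify(updates),
  };
}
```

Outside router mode `modelId` is simply omitted and the URL has no query string.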
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
5772904 to
22e54b4
Compare
6846da3 to
139f68e
Compare
Author
|
Closing as superseded — |
marksverdhei
pushed a commit
that referenced
this pull request
Jun 9, 2026
…ss-harness note) Regression bench (task #15): re-ran the gemma4-12B-QAT bench from regime-2 on the v2 server-capable build to lock the no-regression gate against the committed baseline (25.54 t/s) and MTP reference (76.05 t/s). - Baseline /v1/chat greedy: mean 24.96 t/s across 3 reps (-2.31% vs 25.54), in band. - MTP /v1/chat greedy: mean 75.07 t/s across 3 reps (-1.29% vs 76.05), in band. - Draft acceptance: 0.76271 — bit-identical to committed regime-2 (225/295 accepted/generated). Strong determinism proof. Fold-in nits on the 2 already-committed JSONs (no dedicated fix commit per crystal-assist's review): - Memory slug citations switched from hyphen-form to underscore-form to match the actual slug names (feedback_spec_type_footgun, reference_mtmd_cli_jinja_required) — resolves to exact-match in tooling. - bench-pascal-server-router-smoke.json: added cross_harness_note clarifying the +1.2% server-endpoint vs llama-bench tg128 agreement STRENGTHENS the no-regression claim (different harnesses, in band).
marksverdhei
added a commit
that referenced
this pull request
Jun 9, 2026
* scripts(pascal): P5200 build notes + bench harness + Vulkan baseline Working notes for getting ht-llama.cpp running on the Quadro P5200 (Pascal sm_61, 16 GB). Toolkit wall: CUDA 13 dropped sm_61, so CUDA backend requires aur/cuda-pascal 12.9.1 + gcc14. Driver 580 still runs sm_61 binaries fine. Vulkan baseline (Llama-3.1-8B Q4_K_M, ngl=99, fa=0, build f6feddb): pp128 269 t/s pp512 278 t/s pp2048 251 t/s tg32 35 t/s tg128 35 t/s CUDA results pending cuda-pascal install (gcc14 source build dominates). Untracked primer (quadro-p5200-llamacpp-primer.md) referenced as the source for the FP16-1/64-FP32, INT8 DP4A, and ggml-org#7188 FA-fix facts. * scripts(pascal): CUDA backend bench results + complete install recipe CUDA 12.9 toolkit built and benched on the Quadro P5200 (Pascal sm_61). Five obstacles climbed on stock Arch: 1) CUDA 13 dropped Pascal → installed 12.9 from runfile --extract 2) Runfile libxml2.so.2 missing → bypassed installer with --extract 3) gcc-15/16 too new for nvcc → gcc-14 from archlinux-archive 4) gcc14 AUR source-build slow → 51MB binary pkg.tar.zst (30s install) 5) glibc 2.43 cospi/sinpi clash → noexcept(true) patch on CUDA math.h+hpp Full recipe in scripts/build-pascal-p5200.md. Bench summary (Llama-2-7B Q4_0, ngl=99, build 5159fee, P5200 sm_61): CUDA fa=1: pp512=795, tg128=45.8 t/s Vulkan: pp512=418, tg128=43.0 t/s CUDA wins pp ~1.9x, tg within 7% (bandwidth-bound). ggml_cuda_init confirms: compute capability 6.1, VMM yes, GGML_CUDA_FORCE_MMQ baked in (visible in nvcc cmdline). CC6.1 + MMQ-only + no cuBLAS fallback = the INT8 dp4a path is what is running. JSON artifacts committed alongside for replay/comparison. * scripts(pascal): packaging recipe — rpath-clean runtime tarball for Omarchy ISO Adds §7 Packaging covering the cmake-install + patchelf + symlink-chain + stage-and-tar pipeline that produces pascal-cuda-artifacts.tar.zst (the runtime fast-path consumed by hai-os-dev's autoinstall). 
Also drops the stale "TODO — fill in once build-cuda completes" placeholder and moves Sources to the true end of the doc. Recipe reproduces the verified-clean tarball: rpath stripped on all installed targets, libllama/libllama-common copied + patchelfed, symlinks recreated, members rooted at opt/ for `tar -C / -xf` extraction, ld.so.conf.d snippet documented so no LD_LIBRARY_PATH is needed at runtime. * scripts(pascal): correct §7 tarball size + add reference sha256 Was: prose-estimate "~810 MB before zstd, ~470 MB after" — actual is ~816 MB unpacked, 512 MB compressed (110 members). Adds the reference sha256 from the verified crystal build for hai-os-dev to byte-check against. Notes zstd non-determinism so re-runs are expected to differ. * scripts(pascal): field primer + Omarchy autoinstall handoff guide Round out the Pascal/P5200 enablement bundle (PR #99) with the two human-facing companions to scripts/build-pascal-p5200.md: - quadro-p5200-llamacpp-primer.md: Pascal/GP104 + llama.cpp field guide (the two facts that drive every decision, CUDA vs Vulkan, measured 1080-parity numbers, 16 GB VRAM sizing, optimization checklist). - quadro-p5200-omarchy-autoinstall.md: 7-question handoff guide for hai-os-dev — extra packages (no AUR), CUDA-12.9 runfile pin, build flags, the five obstacles + fixes (glibc 2.43 noexcept patch incl.), pre-build at image time, HaiOS integration points, verified 512 MB / sha256 0efed65... reference tarball, measured baseline. Both docs reference the canonical recipe at scripts/build-pascal-p5200.md and the verified tarball cached at crystal:/home/me/pascal-cuda-artifacts.tar.zst. * scripts(pascal): v2 build flags (server+router) + Gemma4 MTP bench JSONs Recipe update: add -DLLAMA_BUILD_SERVER=ON + -DLLAMA_BUILD_TESTS=OFF to the CUDA configure step. Required for the llama-app unified router (bin/llama) to link — without server-on, libllama-server-impl.so is not built and llama-app link fails with `cannot find -lllama-server-impl`. 
Also required for Gemma4 MTP: ctx_other wiring for the Gemma4Assistant draft class lives only in tools/server/server-context.cpp; the standalone llama-speculative-simple binary segfaults with "Gemma4Assistant requires ctx_other to be set". Rationale block also captures the spec-decode footgun: --spec-type defaults to `none`, so -md <draft> alone is silently ignored. Must pass --spec-type draft-mtp to engage. The /props default_generation_settings.params["speculative.types"] field is per-REQUEST sampler default, NOT the server engine state — the canonical engagement read is server stderr (draft acceptance line + statistics draft-mtp: ... summary). Bench JSONs (crystal Pascal P5200, Gemma4 12B QAT Q4_K_XL, sm_61 CUDA FORCE_MMQ, -fa on, -ngl 99, ctx 4096, greedy temp=0/top_k=1): baseline (no MTP, llama-bench): pp128=465.71 t/s, pp512=456.37 t/s tg32=25.54 t/s, tg128=25.54 t/s (flat — bandwidth-bound) MTP A/B via `llama serve` /completion (degenerate "0"×128 output): A baseline (--spec-type none): 25.26 t/s B MTP (--spec-type draft-mtp): 103.72 t/s ← 4.11× CEILING draft acceptance: 1.00 (118/118) — trivially predictable, not deployment MTP A/B via /v1/chat/completions (non-degenerate, 256 tokens): A baseline: 25.18 t/s B MTP: 76.06 t/s ← 3.02× REPRESENTATIVE greedy speedup draft acceptance: 0.7627 (225/295) bit-identical content sha A vs B (greedy lossless property) All three regimes labeled in the JSON so 4.11× isn't quoted as the deployment number — the representative ~3× greedy or the memory-recorded titan 1.66× (default sampling) are the honest reads. * scripts(pascal): v2 server/MTP docs — §6/§7 scope flip + Gemma4 MTP numbers Follow-on to 3662be4 (v2 build flags). Lands the doc side of the LLAMA_BUILD_SERVER=ON v2 build into the two human-facing companions. omarchy autoinstall guide §6: - v1/v2 tarball table: v2 = pascal-cuda-artifacts-v2-server.tar.zst, sha 2528d952..., 515.5 MB, 121 members, server+router scope. 
v1 (0efed65..., untouched) stays valid for non-serving bakes; v2 is the additive serving-capable successor, not a recall. - serving footgun: --spec-type defaults to `none` (-md silently ignored); engagement proof is server stderr, not /props. - Gemma4 MTP results, three clearly-labeled regimes (lossless A/B): 4.11x degenerate ceiling / 3.02x representative greedy (headline) / 1.66x sampling deployment ref. build-pascal §7 packaging: - version the tarball filename; never overwrite a live pull source. - v1/v2 size+sha table. - reconcile the stale "router not in this tarball" section to v2 reality: member-delta (+11), single-.so impls, lib64 prune, extraction-validate. - note that bin/llama-server / bin/llama-cli are separate targets, not in llama-app's dep closure (reproducing v2 needs them in --target). Also folds in a one-line build-target fix (line 80: add llama-server + llama-cli to --target) that landed in the shared tree from the fork-manager session concurrently — verified correct, kept so the recipe reproduces v2. * scripts(pascal): #100 bullets 1-3+6 bench evidence — server, router, gpu-only, vision+MTP Closes 4 of 10 issue #100 bullets on Pascal P5200 (v2 server-capable build): - bullet 1 (llama-server): standalone /opt/ht-llama-cuda/bin/llama-server → ready in 4s, /health 200, /completion 40 tok @ 25.84 t/s - bullet 2 (llama-server router): unified `bin/llama serve` shim → ready in 4s, /health 200, /completion 40 tok @ 25.82 t/s - bullet 3 (gpu only works): both runs above use -ngl 99 -fa on - bullet 6 (gemma4 12b qat mtp all modalities): combined mmproj + draft-mtp + spec engine → A. coexistence: /v1/chat with image_url + --spec-type draft-mtp engaged → predicted=96, stderr draft acceptance = 0.66102 (78/118) → B. grounding (decoupled to mtmd-cli, avoids Gemma4 chat-template empty-content quirk): all 3 ground-truth features matched (PASCAL, P5200, red rectangle); requires --jinja (otherwise std::runtime_error custom-template-not-supported abort). 
Methodology: - regression band ±3% pinned vs committed baseline 25.54 tg / 76.05 MTP; both server-router runs in band (24.77-26.31 t/s window). - engagement read on stderr (draft acceptance / draft-mtp stats), NOT /props (per --spec-type footgun memory). - chat-content quirk explicitly noted in JSON so empty content does not read as fail or regression. Bullets 4 (gpu+cpu offload) + 7-10 (qwen 27B/35B-MoE / gemma 26B/31B) land in subsequent commits once the lithium IQ3-class + titan 31B IQ4_XS transfers complete on crystal. * scripts(pascal): #100 regression rerun + nit fold-ins (slug form, cross-harness note) Regression bench (task #15): re-ran the gemma4-12B-QAT bench from regime-2 on the v2 server-capable build to lock the no-regression gate against the committed baseline (25.54 t/s) and MTP reference (76.05 t/s). - Baseline /v1/chat greedy: mean 24.96 t/s across 3 reps (-2.31% vs 25.54), in band. - MTP /v1/chat greedy: mean 75.07 t/s across 3 reps (-1.29% vs 76.05), in band. - Draft acceptance: 0.76271 — bit-identical to committed regime-2 (225/295 accepted/generated). Strong determinism proof. Fold-in nits on the 2 already-committed JSONs (no dedicated fix commit per crystal-assist's review): - Memory slug citations switched from hyphen-form to underscore-form to match the actual slug names (feedback_spec_type_footgun, reference_mtmd_cli_jinja_required) — resolves to exact-match in tooling. - bench-pascal-server-router-smoke.json: added cross_harness_note clarifying the +1.2% server-endpoint vs llama-bench tg128 agreement STRENGTHENS the no-regression claim (different harnesses, in band). * scripts(pascal): #100 bullets 8, 4, 10 bench evidence + regression nit fold-ins Three model bench JSONs from the v2 server-capable Pascal build: bench-pascal-qwen3.6-27b-iq3-xxs.json (bullet 8): Qwen3.6-27B at UD-IQ3_XXS (11.17 GiB), -ngl 99 -fa on -c 4096 → mode=full-gpu, 65/0/65 layers, gpu_residency_pct=95.45%. 
/completion 11.27 t/s, /v1/chat 10.44 t/s, gpu free 4 GiB after load. Content reply: "The capital of France is Paris." (qwen3.6 thinking mode active). bench-pascal-gemma4-31b-iq4-xs-offload-{ngl40,ngl48}.json (bullets 4 + 10): ngl=40 phase-1 → ngl=48 phase-2-verify accelerator (crystal-assist's recipe): per_layer_combined = (gpu.model + gpu.context) / layers_gpu at ngl=40 = 309 MiB; ngl_max = 40 + floor((2954 - 400) / 309) = 48. Phase-2 verify PASS at -ngl=48: 49/62 layers GPU, 13/62 layers CPU, 4.95 t/s /completion, 4.98 t/s /v1/chat. Dense-layer partial-offload, host_model=3967 MiB, host_context=768 MiB. Card 96% utilized. gemma-4-31B-IQ4_XS is the smallest 31B quant available anywhere on titan or lithium (sweep done by crystal-assist) — confirms 31B = the documented offload-demo model, closes bullets 4 AND 10 in one bench. Regression rerun JSON: minor wording fix — the bit-identical content sha 01ba4719c80b6fe9 is sha256(b"\n") (single newline), not empty string or null. Banks the harness blind-spot that hashing `jq -r .content` output cannot distinguish JSON-null vs "" vs literal "null" vs "\n". A==B determinism conclusion stands (per crystal-assist review). * scripts(pascal): #100 bullets 7 + 9 bench evidence — qwen 35B MoE + gemma4 26B MoE Closes the last two model bullets: bench-pascal-qwen3.6-35b-a3b-iq3-xxs.json (bullet 7): Qwen3.6-35B-A3B (MoE, 3B active) at UD-IQ3_XXS (12.30 GiB), -ngl 99 -fa on -c 4096 → mode=full-gpu, 41/0/41 layers, gpu_residency_pct=96.85%. /completion 44.45 t/s, /v1/chat 40.96 t/s — fastest of any tested model (3B active keeps per-token compute light). Content reply: "The capital of France is Paris." VRAM 13003/16384 MiB after load (3 GiB headroom). bench-pascal-gemma4-26b-a4b-iq4-xs.json (bullet 9): Gemma4-26B-A4B (MoE, 128 experts / 8 active per token) at UD-IQ4_XS (12.66 GiB), -ngl 99 -fa on -c 4096 → 31/0/31 layers on GPU, /completion 42.10 t/s, /v1/chat 42.36 t/s. Content reply: "### Answer: The capital of France is Paris." 
VRAM 14345/16384 MiB after load. Classifier note (banked in JSON): the 26B host_model=748 MiB tripped the harness's 600-MiB expert-MoE threshold. Inspection of the gemma4 config (vocab=262144, hidden=5120, IQ4_XS bytes/weight) confirms 748 is the embedding tensor + boundary buffers (≈671 MiB pure embedding), NOT expert offload — all 128 experts are in gpu.model_mib=12952. PRIMARY layer-count signal (31/0/31) correctly reads full-GPU. The 600 MiB threshold was calibrated to 12B embeddings (~540 MiB) and under-scales for larger vocab×hidden_dim products. Mode patched to full-gpu with classifier_note explaining the misfire + suggested remediation (host_model_pct_of_total < 10-15% = embedding-pattern; ≥ that = real expert offload). All 10 issue #100 bullets now have committed bench evidence. --------- Co-authored-by: marksverdhei <marksverd@gmail.com> Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
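Two pieces of arithmetic in the bench notes above (the ±3% regression band and the ngl_max estimate) can be reproduced with a short sketch. The function names are hypothetical, and reading 2954 as free VRAM in MiB and 400 as a reserve margin is an interpretation of the formula, not something the commit message states.

```typescript
// Percent deviation from a committed baseline, and whether it falls
// inside a symmetric band (the notes above use ±3%).
function inBand(
  measured: number,
  baseline: number,
  bandPct: number = 3
): { deltaPct: number; ok: boolean } {
  const deltaPct = ((measured - baseline) / baseline) * 100;
  return { deltaPct, ok: Math.abs(deltaPct) <= bandPct };
}

// Estimate the largest -ngl that fits, given a probe run at nglProbe:
// remaining VRAM minus a reserve, divided by the per-layer cost.
// ASSUMPTION: parameter meanings are inferred from the quoted formula.
function estimateNglMax(
  nglProbe: number,
  freeMiB: number,
  reserveMiB: number,
  perLayerMiB: number
): number {
  return nglProbe + Math.floor((freeMiB - reserveMiB) / perLayerMiB);
}

// MTP rerun from the notes: 75.07 t/s against the 76.05 t/s reference.
const mtp = inBand(75.07, 76.05);
console.log(mtp.deltaPct.toFixed(2), mtp.ok);

// Offload sizing from the notes: 40 + floor((2954 - 400) / 309).
console.log(estimateNglMax(40, 2954, 400, 309));
```

With the numbers quoted in the notes, the band check gives -1.29% (in band) and the sizing rule gives 48, matching the values reported there.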
Summary
- Auto-discover LoRA adapters in --models-dir by reading GGUF metadata (general.type = "adapter") to distinguish adapters from models
- Pass discovered adapters to child model instances via --lora-init-without-apply with architecture-based matching
- Expose discovered adapters in the /v1/models API response
- Add a frontend panel for managing adapters; active adapter scales are sent with chat completion requests in the lora field

Backend changes (Closes #14)
- common/preset.h: new common_lora_adapter_info and common_models_dir_result structs
- common/preset.cpp: gguf_is_lora_adapter() helper reads GGUF metadata (no tensor loading); load_from_models_dir_with_lora() classifies files as models vs adapters
- tools/server/server-models.cpp: discovery at startup, architecture matching at spawn time, adapter list in /v1/models response
- tools/server/server-models.h: discovered_adapters storage and accessor

Frontend changes (Closes #13)
- tools/server/webui/src/lib/services/lora.service.ts: GET/POST /lora-adapters API calls
- tools/server/webui/src/lib/stores/lora.svelte.ts: Svelte 5 reactive store with toggle (remembers previous scale), scale adjustment, change tracking
- tools/server/webui/src/lib/components/app/lora/LoraAdapters.svelte: collapsible panel with switch toggles and scale sliders per adapter
- tools/server/webui/src/lib/stores/chat.svelte.ts: includes lora field in completion requests when adapters are active

Design decisions
- Adapters are matched to models by the general.architecture field (e.g. "llama"), validated at runtime by tensor shape comparison
- Adapters are loaded with --lora-init-without-apply (scale=0 until toggled)

Test plan
- Start the server with --lora <adapter.gguf> and verify adapter panel appears
- … --models-dir
- … (npm run build in webui)

🤖 Generated with Claude Code