Skip to content

feat: LoRA adapter auto-discovery and frontend UI#15

Closed
marksverdhei wants to merge 15 commits into
htfrom
feat/lora-discovery-and-ui
Closed

feat: LoRA adapter auto-discovery and frontend UI#15
marksverdhei wants to merge 15 commits into
htfrom
feat/lora-discovery-and-ui

Conversation

@marksverdhei

Copy link
Copy Markdown

Summary

  • Auto-discover LoRA adapters from --models-dir by reading GGUF metadata (general.type = "adapter") to distinguish adapters from models
  • Inject matching adapters into child model processes using --lora-init-without-apply with architecture-based matching
  • Expose discovered adapters in the /v1/models API response
  • Add frontend LoRA adapter panel with per-adapter toggle, scale slider (0.0-2.0), and apply button
  • Pass LoRA config in chat requests via the per-request lora field

Backend changes (Closes #14)

  • common/preset.h — new common_lora_adapter_info and common_models_dir_result structs
  • common/preset.cppgguf_is_lora_adapter() helper reads GGUF metadata (no tensor loading); load_from_models_dir_with_lora() classifies files as models vs adapters
  • tools/server/server-models.cpp — discovery at startup, architecture matching at spawn time, adapter list in /v1/models response
  • tools/server/server-models.hdiscovered_adapters storage and accessor

Frontend changes (Closes #13)

  • tools/server/webui/src/lib/services/lora.service.ts — GET/POST /lora-adapters API calls
  • tools/server/webui/src/lib/stores/lora.svelte.ts — Svelte 5 reactive store with toggle (remembers previous scale), scale adjustment, change tracking
  • tools/server/webui/src/lib/components/app/lora/LoraAdapters.svelte — collapsible panel with switch toggles and scale sliders per adapter
  • tools/server/webui/src/lib/stores/chat.svelte.ts — includes lora field in completion requests when adapters are active

Design decisions

  • Zero disruption: LoRA UI is invisible when no adapters are loaded
  • Architecture matching: adapters are matched to models by general.architecture field (e.g. "llama"), validated at runtime by tensor shape comparison
  • Multiple adapters: supports arbitrary number of simultaneous adapters with independent scales
  • Lazy loading: adapters loaded with --lora-init-without-apply (scale=0 until toggled)

Test plan

  • Verify no UI changes when server has no LoRA adapters
  • Start server with --lora <adapter.gguf> and verify adapter panel appears
  • Test toggle and scale slider functionality
  • Test per-request LoRA in chat completions
  • Test auto-discovery with adapters in --models-dir
  • Verify build passes (npm run build in webui)

🤖 Generated with Claude Code

marksverdhei and others added 15 commits March 11, 2026 09:31
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Register Qwen2_5OmniThinkerForConditionalGeneration architecture for text
and mmproj GGUF conversion. Handle config structure difference where the
Thinker-only variant has vision/audio configs at the top level. Add pooling
type detection for embedding use cases. Fix audio tensor routing to base
MmprojModel class.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
)

* docs: add ht-fork documentation, branding, and discussion links

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* convert: support LoRA conversion for MLA kv_b_proj

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* ci: add fork sync automation

* feat: add --remap-developer-role flag to translate developer→system

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: support LCO-Embedding-Omni (Qwen2.5 Omni Thinker) GGUF conversion

Register Qwen2_5OmniThinkerForConditionalGeneration architecture for text
and mmproj GGUF conversion. Handle config structure difference where the
Thinker-only variant has vision/audio configs at the top level. Add pooling
type detection for embedding use cases. Fix audio tensor routing to base
MmprojModel class.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* ci: add ht branch to flake8 lint workflow triggers

* feat: welcome agentic contributions, remove upstream AI restrictions

- Delete AGENTS.md (upstream's anti-AI contributor guidelines)
- Replace restrictive AI Usage Policy with welcoming Agentic Contributions section
- Update README to highlight fork's pragmatic stance on AI contributions

Unlike upstream, we evaluate code by quality, not by how it was written.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* webui: add cancel button for in-progress model loading

Allow users to cancel a model that is stuck loading or taking too long
in the router mode model selector. The cancel button appears next to
the loading spinner in both the model selector dropdown/sheet trigger
and within individual model option rows.

Uses the existing /models/unload endpoint which already supports
unloading models in LOADING state. The frontend polling loop is
interrupted via AbortController to prevent stale error toasts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* webui: add cancelling state indicator and fix cancel polling

- Show orange "Cancelling" indicator with spinner while cancel is in progress
- Poll until server confirms model is no longer in LOADING state before
  clearing the cancelling indicator
- Guard against redundant unload calls on already-unloaded models
- Keep loadingModelId alive during cancel so selector trigger shows
  the cancelling state correctly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(webui): color-coded spinners for model load/unload/cancel states

- Loading: green spinner, clockwise
- Unloading: red spinner, reverse direction with "Unloading" label
- Cancelling: orange spinner, reverse direction
- Track unloading state separately in models store

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(webui): address PR review feedback for cancel model loading

- Remove duplicated cancel logic from ModelsSelector and ModelsSelectorSheet
  by deriving loading/cancelling state from the store (issue #1)
- Fix race condition: no longer set isLoadingModel=false before cancel
  completes, preventing brief UI flash (issue #2)
- Add MAX_CANCEL_POLL_ATTEMPTS (60) timeout to cancel polling loop
  to prevent infinite polling if server never transitions (issue #3)
- Replace div cancel buttons with proper <button> elements for
  keyboard accessibility and screen reader support (issue #4)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
)

- Rename all frontend references from "llama.cpp" to "ht-llama.cpp"
- Dark mode: turquoise-tinted backgrounds, purple-tinted text
- Light mode: inverted — turquoise backgrounds, purple text
- Add reverse spin animation utility class

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
When using --models-dir (router mode), the server now reads GGUF metadata
from .gguf files to distinguish LoRA adapters from models. Adapters with
general.type="adapter" are collected separately and automatically passed
to child model instances via --lora and --lora-init-without-apply when
their general.architecture matches the model being loaded.

Changes:
- Add common_lora_adapter_info struct and common_models_dir_result to preset.h
- Add load_from_models_dir_with_lora() that uses gguf_init_from_file() to
  read metadata and classify files as models or adapters
- Inject matching LoRA adapters into child process args at spawn time
- Expose discovered adapters in GET /v1/models response as lora_adapters array
- Add discovered_adapters storage and get_discovered_adapters() to server_models

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add frontend UI for discovering and managing LoRA adapters loaded on the
server. The UI only appears when adapters are available (zero disruption
otherwise) and integrates with the existing chat completion flow.

New files:
- lora.service.ts: API service for GET/POST /lora-adapters
- lora.svelte.ts: Reactive store with adapter state, toggle, scale
- LoraAdapters.svelte: Collapsible panel with per-adapter switch + slider

Integration:
- LoRA panel renders above ChatFormActions inside the chat input area
- Active adapter scales are included in chat completion requests via the
  lora field in getApiOptions()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix proxy_post to accept model name from query params (needed for
  endpoints like POST /lora-adapters where body is an array)
- Pass selected model ID to LoRA service/store in router mode
- Re-fetch adapters when model changes via $effect
- Add /lora-adapters to vite dev proxy config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@marksverdhei marksverdhei force-pushed the ht branch 5 times, most recently from 5772904 to 22e54b4 Compare March 31, 2026 11:50
@marksverdhei marksverdhei force-pushed the ht branch 3 times, most recently from 6846da3 to 139f68e Compare April 12, 2026 09:32
@marksverdhei

Copy link
Copy Markdown
Author

Closing as superseded — b824b323a feat(webui): LoRA adapter auto-discovery and UI is already on ht via direct merge. The branch is 653 commits behind and rebasing it would just reproduce work that's already shipped.

marksverdhei pushed a commit that referenced this pull request Jun 9, 2026
…ss-harness note)

Regression bench (task #15): re-ran the gemma4-12B-QAT bench from regime-2 on
the v2 server-capable build to lock the no-regression gate against the
committed baseline (25.54 t/s) and MTP reference (76.05 t/s).
- Baseline /v1/chat greedy: mean 24.96 t/s across 3 reps (-2.31% vs 25.54),
  in band.
- MTP /v1/chat greedy:  mean 75.07 t/s across 3 reps (-1.29% vs 76.05),
  in band.
- Draft acceptance: 0.76271 — bit-identical to committed regime-2
  (225/295 accepted/generated). Strong determinism proof.

Fold-in nits on the 2 already-committed JSONs (no dedicated fix commit per
crystal-assist's review):
- Memory slug citations switched from hyphen-form to underscore-form to
  match the actual slug names (feedback_spec_type_footgun,
  reference_mtmd_cli_jinja_required) — resolves to exact-match in tooling.
- bench-pascal-server-router-smoke.json: added cross_harness_note clarifying
  the +1.2% server-endpoint vs llama-bench tg128 agreement STRENGTHENS
  the no-regression claim (different harnesses, in band).
marksverdhei added a commit that referenced this pull request Jun 9, 2026
* scripts(pascal): P5200 build notes + bench harness + Vulkan baseline

Working notes for getting ht-llama.cpp running on the Quadro P5200
(Pascal sm_61, 16 GB). Toolkit wall: CUDA 13 dropped sm_61, so CUDA
backend requires aur/cuda-pascal 12.9.1 + gcc14. Driver 580 still runs
sm_61 binaries fine.

Vulkan baseline (Llama-3.1-8B Q4_K_M, ngl=99, fa=0, build f6feddb):
  pp128  269 t/s  pp512  278 t/s  pp2048 251 t/s
  tg32    35 t/s  tg128   35 t/s

CUDA results pending cuda-pascal install (gcc14 source build dominates).

Untracked primer (quadro-p5200-llamacpp-primer.md) referenced as the
source for the FP16-1/64-FP32, INT8 DP4A, and ggml-org#7188 FA-fix facts.

* scripts(pascal): CUDA backend bench results + complete install recipe

CUDA 12.9 toolkit built and benched on the Quadro P5200 (Pascal sm_61).
Five obstacles climbed on stock Arch:
1) CUDA 13 dropped Pascal       → installed 12.9 from runfile --extract
2) Runfile libxml2.so.2 missing → bypassed installer with --extract
3) gcc-15/16 too new for nvcc   → gcc-14 from archlinux-archive
4) gcc14 AUR source-build slow  → 51MB binary pkg.tar.zst (30s install)
5) glibc 2.43 cospi/sinpi clash → noexcept(true) patch on CUDA math.h+hpp

Full recipe in scripts/build-pascal-p5200.md.

Bench summary (Llama-2-7B Q4_0, ngl=99, build 5159fee, P5200 sm_61):
  CUDA fa=1: pp512=795, tg128=45.8 t/s
  Vulkan:    pp512=418, tg128=43.0 t/s
  CUDA wins pp ~1.9x, tg within 7% (bandwidth-bound).

ggml_cuda_init confirms: compute capability 6.1, VMM yes, GGML_CUDA_FORCE_MMQ
baked in (visible in nvcc cmdline). CC6.1 + MMQ-only + no cuBLAS fallback =
the INT8 dp4a path is what is running.

JSON artifacts committed alongside for replay/comparison.

* scripts(pascal): packaging recipe — rpath-clean runtime tarball for Omarchy ISO

Adds §7 Packaging covering the cmake-install + patchelf + symlink-chain +
stage-and-tar pipeline that produces pascal-cuda-artifacts.tar.zst (the
runtime fast-path consumed by hai-os-dev's autoinstall). Also drops the
stale "TODO — fill in once build-cuda completes" placeholder and moves
Sources to the true end of the doc.

Recipe reproduces the verified-clean tarball: rpath stripped on all
installed targets, libllama/libllama-common copied + patchelfed, symlinks
recreated, members rooted at opt/ for `tar -C / -xf` extraction, ld.so.conf.d
snippet documented so no LD_LIBRARY_PATH is needed at runtime.

* scripts(pascal): correct §7 tarball size + add reference sha256

Was: prose-estimate "~810 MB before zstd, ~470 MB after" — actual is
~816 MB unpacked, 512 MB compressed (110 members). Adds the reference
sha256 from the verified crystal build for hai-os-dev to byte-check
against. Notes zstd non-determinism so re-runs are expected to differ.

* scripts(pascal): field primer + Omarchy autoinstall handoff guide

Round out the Pascal/P5200 enablement bundle (PR #99) with the two
human-facing companions to scripts/build-pascal-p5200.md:

- quadro-p5200-llamacpp-primer.md: Pascal/GP104 + llama.cpp field guide
  (the two facts that drive every decision, CUDA vs Vulkan, measured
  1080-parity numbers, 16 GB VRAM sizing, optimization checklist).
- quadro-p5200-omarchy-autoinstall.md: 7-question handoff guide for
  hai-os-dev — extra packages (no AUR), CUDA-12.9 runfile pin, build
  flags, the five obstacles + fixes (glibc 2.43 noexcept patch incl.),
  pre-build at image time, HaiOS integration points, verified
  512 MB / sha256 0efed65... reference tarball, measured baseline.

Both docs reference the canonical recipe at scripts/build-pascal-p5200.md
and the verified tarball cached at crystal:/home/me/pascal-cuda-artifacts.tar.zst.

* scripts(pascal): v2 build flags (server+router) + Gemma4 MTP bench JSONs

Recipe update: add -DLLAMA_BUILD_SERVER=ON + -DLLAMA_BUILD_TESTS=OFF to
the CUDA configure step. Required for the llama-app unified router
(bin/llama) to link — without server-on, libllama-server-impl.so is not
built and llama-app link fails with `cannot find -lllama-server-impl`.
Also required for Gemma4 MTP: ctx_other wiring for the Gemma4Assistant
draft class lives only in tools/server/server-context.cpp; the
standalone llama-speculative-simple binary segfaults with
"Gemma4Assistant requires ctx_other to be set".

Rationale block also captures the spec-decode footgun: --spec-type
defaults to `none`, so -md <draft> alone is silently ignored. Must pass
--spec-type draft-mtp to engage. The /props
default_generation_settings.params["speculative.types"] field is
per-REQUEST sampler default, NOT the server engine state — the
canonical engagement read is server stderr (draft acceptance line +
statistics draft-mtp: ... summary).

Bench JSONs (crystal Pascal P5200, Gemma4 12B QAT Q4_K_XL, sm_61 CUDA
FORCE_MMQ, -fa on, -ngl 99, ctx 4096, greedy temp=0/top_k=1):

  baseline (no MTP, llama-bench):
    pp128=465.71 t/s, pp512=456.37 t/s
    tg32=25.54 t/s,  tg128=25.54 t/s  (flat — bandwidth-bound)

  MTP A/B via `llama serve` /completion (degenerate "0"×128 output):
    A baseline (--spec-type none):     25.26 t/s
    B MTP (--spec-type draft-mtp):    103.72 t/s   ← 4.11× CEILING
    draft acceptance: 1.00 (118/118)  — trivially predictable, not deployment

  MTP A/B via /v1/chat/completions (non-degenerate, 256 tokens):
    A baseline: 25.18 t/s
    B MTP:      76.06 t/s   ← 3.02× REPRESENTATIVE greedy speedup
    draft acceptance: 0.7627 (225/295)
    bit-identical content sha A vs B (greedy lossless property)

All three regimes labeled in the JSON so 4.11× isn't quoted as the
deployment number — the representative ~3× greedy or the
memory-recorded titan 1.66× (default sampling) are the honest reads.

* scripts(pascal): v2 server/MTP docs — §6/§7 scope flip + Gemma4 MTP numbers

Follow-on to 3662be4 (v2 build flags). Lands the doc side of the
LLAMA_BUILD_SERVER=ON v2 build into the two human-facing companions.

omarchy autoinstall guide §6:
- v1/v2 tarball table: v2 = pascal-cuda-artifacts-v2-server.tar.zst,
  sha 2528d952..., 515.5 MB, 121 members, server+router scope. v1
  (0efed65..., untouched) stays valid for non-serving bakes; v2 is the
  additive serving-capable successor, not a recall.
- serving footgun: --spec-type defaults to `none` (-md silently ignored);
  engagement proof is server stderr, not /props.
- Gemma4 MTP results, three clearly-labeled regimes (lossless A/B):
  4.11x degenerate ceiling / 3.02x representative greedy (headline) /
  1.66x sampling deployment ref.

build-pascal §7 packaging:
- version the tarball filename; never overwrite a live pull source.
- v1/v2 size+sha table.
- reconcile the stale "router not in this tarball" section to v2 reality:
  member-delta (+11), single-.so impls, lib64 prune, extraction-validate.
- note that bin/llama-server / bin/llama-cli are separate targets, not in
  llama-app's dep closure (reproducing v2 needs them in --target).

Also folds in a one-line build-target fix (line 80: add llama-server +
llama-cli to --target) that landed in the shared tree from the
fork-manager session concurrently — verified correct, kept so the recipe
reproduces v2.

* scripts(pascal): #100 bullets 1-3+6 bench evidence — server, router, gpu-only, vision+MTP

Closes 4 of 10 issue #100 bullets on Pascal P5200 (v2 server-capable build):

- bullet 1 (llama-server): standalone /opt/ht-llama-cuda/bin/llama-server
  → ready in 4s, /health 200, /completion 40 tok @ 25.84 t/s
- bullet 2 (llama-server router): unified `bin/llama serve` shim
  → ready in 4s, /health 200, /completion 40 tok @ 25.82 t/s
- bullet 3 (gpu only works): both runs above use -ngl 99 -fa on
- bullet 6 (gemma4 12b qat mtp all modalities): combined mmproj +
  draft-mtp + spec engine
  → A. coexistence: /v1/chat with image_url + --spec-type draft-mtp
    engaged → predicted=96, stderr draft acceptance = 0.66102 (78/118)
  → B. grounding (decoupled to mtmd-cli, avoids Gemma4 chat-template
    empty-content quirk): all 3 ground-truth features matched (PASCAL,
    P5200, red rectangle); requires --jinja (otherwise std::runtime_error
    custom-template-not-supported abort).

Methodology:
- regression band ±3% pinned vs committed baseline 25.54 tg / 76.05 MTP;
  both server-router runs in band (24.77-26.31 t/s window).
- engagement read on stderr (draft acceptance / draft-mtp stats), NOT
  /props (per --spec-type footgun memory).
- chat-content quirk explicitly noted in JSON so empty content does not
  read as fail or regression.

Bullets 4 (gpu+cpu offload) + 7-10 (qwen 27B/35B-MoE / gemma 26B/31B)
land in subsequent commits once the lithium IQ3-class + titan 31B IQ4_XS
transfers complete on crystal.

* scripts(pascal): #100 regression rerun + nit fold-ins (slug form, cross-harness note)

Regression bench (task #15): re-ran the gemma4-12B-QAT bench from regime-2 on
the v2 server-capable build to lock the no-regression gate against the
committed baseline (25.54 t/s) and MTP reference (76.05 t/s).
- Baseline /v1/chat greedy: mean 24.96 t/s across 3 reps (-2.31% vs 25.54),
  in band.
- MTP /v1/chat greedy:  mean 75.07 t/s across 3 reps (-1.29% vs 76.05),
  in band.
- Draft acceptance: 0.76271 — bit-identical to committed regime-2
  (225/295 accepted/generated). Strong determinism proof.

Fold-in nits on the 2 already-committed JSONs (no dedicated fix commit per
crystal-assist's review):
- Memory slug citations switched from hyphen-form to underscore-form to
  match the actual slug names (feedback_spec_type_footgun,
  reference_mtmd_cli_jinja_required) — resolves to exact-match in tooling.
- bench-pascal-server-router-smoke.json: added cross_harness_note clarifying
  the +1.2% server-endpoint vs llama-bench tg128 agreement STRENGTHENS
  the no-regression claim (different harnesses, in band).

* scripts(pascal): #100 bullets 8, 4, 10 bench evidence + regression nit fold-ins

Three model bench JSONs from the v2 server-capable Pascal build:

bench-pascal-qwen3.6-27b-iq3-xxs.json (bullet 8): Qwen3.6-27B at
UD-IQ3_XXS (11.17 GiB), -ngl 99 -fa on -c 4096 → mode=full-gpu,
65/0/65 layers, gpu_residency_pct=95.45%. /completion 11.27 t/s,
/v1/chat 10.44 t/s, gpu free 4 GiB after load. Content reply:
"The capital of France is Paris." (qwen3.6 thinking mode active).

bench-pascal-gemma4-31b-iq4-xs-offload-{ngl40,ngl48}.json (bullets 4 + 10):
ngl=40 phase-1 → ngl=48 phase-2-verify accelerator (crystal-assist's
recipe): per_layer_combined = (gpu.model + gpu.context) / layers_gpu
at ngl=40 = 309 MiB; ngl_max = 40 + floor((2954 - 400) / 309) = 48.
Phase-2 verify PASS at -ngl=48: 49/62 layers GPU, 13/62 layers CPU,
4.95 t/s /completion, 4.98 t/s /v1/chat. Dense-layer partial-offload,
host_model=3967 MiB, host_context=768 MiB. Card 96% utilized.
gemma-4-31B-IQ4_XS is the smallest 31B quant available anywhere on
titan or lithium (sweep done by crystal-assist) — confirms 31B = the
documented offload-demo model, closes bullets 4 AND 10 in one bench.

Regression rerun JSON: minor wording fix — the bit-identical content
sha 01ba4719c80b6fe9 is sha256(b"\n") (single newline), not empty
string or null. Banks the harness blind-spot that hashing `jq -r .content`
output cannot distinguish JSON-null vs "" vs literal "null" vs "\n".
A==B determinism conclusion stands (per crystal-assist review).

* scripts(pascal): #100 bullets 7 + 9 bench evidence — qwen 35B MoE + gemma4 26B MoE

Closes the last two model bullets:

bench-pascal-qwen3.6-35b-a3b-iq3-xxs.json (bullet 7): Qwen3.6-35B-A3B
(MoE, 3B active) at UD-IQ3_XXS (12.30 GiB), -ngl 99 -fa on -c 4096 →
mode=full-gpu, 41/0/41 layers, gpu_residency_pct=96.85%. /completion
44.45 t/s, /v1/chat 40.96 t/s — fastest of any tested model (3B active
keeps per-token compute light). Content reply: "The capital of France
is Paris." VRAM 13003/16384 MiB after load (3 GiB headroom).

bench-pascal-gemma4-26b-a4b-iq4-xs.json (bullet 9): Gemma4-26B-A4B
(MoE, 128 experts / 8 active per token) at UD-IQ4_XS (12.66 GiB),
-ngl 99 -fa on -c 4096 → 31/0/31 layers on GPU, /completion 42.10 t/s,
/v1/chat 42.36 t/s. Content reply: "### Answer: The capital of France
is Paris." VRAM 14345/16384 MiB after load.

Classifier note (banked in JSON): the 26B host_model=748 MiB tripped
the harness's 600-MiB expert-MoE threshold. Inspection of the gemma4
config (vocab=262144, hidden=5120, IQ4_XS bytes/weight) confirms 748
is the embedding tensor + boundary buffers (≈671 MiB pure embedding),
NOT expert offload — all 128 experts are in gpu.model_mib=12952.
PRIMARY layer-count signal (31/0/31) correctly reads full-GPU. The
600 MiB threshold was calibrated to 12B embeddings (~540 MiB) and
under-scales for larger vocab×hidden_dim products. Mode patched to
full-gpu with classifier_note explaining the misfire + suggested
remediation (host_model_pct_of_total < 10-15% = embedding-pattern;
≥ that = real expert offload).

All 10 issue #100 bullets now have committed bench evidence.

---------

Co-authored-by: marksverdhei <marksverd@gmail.com>
Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: auto-discover LoRA adapters from models directory [Frontend] Add lora-adapter toggler / selection list for selected model

1 participant