feat: steering hints — mid-inference context injection #18
Closed
marksverdhei wants to merge 15 commits into
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Register Qwen2_5OmniThinkerForConditionalGeneration architecture for text and mmproj GGUF conversion. Handle config structure difference where the Thinker-only variant has vision/audio configs at the top level. Add pooling type detection for embedding use cases. Fix audio tensor routing to base MmprojModel class. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
)
* docs: add ht-fork documentation, branding, and discussion links
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* convert: support LoRA conversion for MLA kv_b_proj
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* ci: add fork sync automation
* feat: add --remap-developer-role flag to translate developer→system
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: support LCO-Embedding-Omni (Qwen2.5 Omni Thinker) GGUF conversion
  Register Qwen2_5OmniThinkerForConditionalGeneration architecture for text and mmproj GGUF conversion. Handle config structure difference where the Thinker-only variant has vision/audio configs at the top level. Add pooling type detection for embedding use cases. Fix audio tensor routing to base MmprojModel class.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* ci: add ht branch to flake8 lint workflow triggers
* feat: welcome agentic contributions, remove upstream AI restrictions
  - Delete AGENTS.md (upstream's anti-AI contributor guidelines)
  - Replace restrictive AI Usage Policy with welcoming Agentic Contributions section
  - Update README to highlight fork's pragmatic stance on AI contributions
  Unlike upstream, we evaluate code by quality, not by how it was written.
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* webui: add cancel button for in-progress model loading
  Allow users to cancel a model that is stuck loading or taking too long in the router mode model selector. The cancel button appears next to the loading spinner in both the model selector dropdown/sheet trigger and within individual model option rows.
  Uses the existing /models/unload endpoint which already supports unloading models in LOADING state. The frontend polling loop is interrupted via AbortController to prevent stale error toasts.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* webui: add cancelling state indicator and fix cancel polling
  - Show orange "Cancelling" indicator with spinner while cancel is in progress
  - Poll until server confirms model is no longer in LOADING state before clearing the cancelling indicator
  - Guard against redundant unload calls on already-unloaded models
  - Keep loadingModelId alive during cancel so selector trigger shows the cancelling state correctly
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat(webui): color-coded spinners for model load/unload/cancel states
  - Loading: green spinner, clockwise
  - Unloading: red spinner, reverse direction with "Unloading" label
  - Cancelling: orange spinner, reverse direction
  - Track unloading state separately in models store
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix(webui): address PR review feedback for cancel model loading
  - Remove duplicated cancel logic from ModelsSelector and ModelsSelectorSheet by deriving loading/cancelling state from the store (issue #1)
  - Fix race condition: no longer set isLoadingModel=false before cancel completes, preventing brief UI flash (issue #2)
  - Add MAX_CANCEL_POLL_ATTEMPTS (60) timeout to cancel polling loop to prevent infinite polling if server never transitions (issue #3)
  - Replace div cancel buttons with proper <button> elements for keyboard accessibility and screen reader support (issue #4)
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Auto-discover LoRA adapters from the models directory by scanning GGUF metadata (general.type = "adapter") and match them to models by architecture. Adapters are loaded with --lora-init-without-apply so they start disabled and can be toggled on via the UI.
Frontend adds a Popover-based LoRA dropdown in the chat action bar (next to model selector) with multi-select checkboxes and scale inputs. Includes "Show only matching" toggle to view all discovered adapters. Works in both MODEL and ROUTER mode.
Backend changes:
- Add GGUF metadata scanning for adapter classification (preset.cpp)
- Auto-inject matching LoRA adapters into child process args (server-models.cpp)
- Include discovered adapters in /v1/models response
- Fix router proxy for /lora-adapters POST (array body fallback to query param)
Frontend changes:
- New LoraAdapters popover component with checkbox multi-select
- LoRA service with router mode support (query param routing)
- Reactive store with toggle, scale, change tracking, apply
- Integration in ChatFormActions bar and chat completion requests
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comprehensive research and design for mid-inference context injection via KV cache manipulation. Covers literature review (CAA, SVF, FASB, RepE, EasySteer, AI Steerability 360), llama.cpp architecture analysis (KV cache, batch positions, K-shift, chat templates), upstream landscape (control vectors, seq operations, server architecture), identified gaps, and phased implementation plan.
Closes #17
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Core library (src/llama-steering.h, src/llama-steering.cpp):
- llama_steering_hint_inject(): shifts KV cache positions via seq_add,
builds a batch at the gap positions, and decodes hint tokens
- llama_steering_hint_prepare(): wraps text with chat template and
tokenizes with special token parsing for use with inject
Public API (include/llama.h):
- Added steering hints section between adapter cvec and memory APIs
Server integration (tools/server/):
- New task type SERVER_TASK_TYPE_STEERING_INJECT
- POST /v1/steering/inject and /steering/inject endpoints
- Accepts: id_slot, text, role (default: system), position (default: -1)
- Updates slot token tracking after injection
- Returns: {"success": true, "n_injected": N}
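A minimal client-side sketch of the request and response shapes listed above. The field names and defaults come from this commit message; the base URL `http://localhost:8080`, the helper names, and the handling of a non-success reply are assumptions made for illustration, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_inject_request(base_url, text, id_slot=0, role="system", position=-1):
    """Build (but do not send) a POST request for the steering endpoint.

    Defaults for role and position mirror the ones listed in the commit message.
    """
    payload = {"id_slot": id_slot, "text": text, "role": role, "position": position}
    return urllib.request.Request(
        url=f"{base_url}/v1/steering/inject",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_inject_response(body):
    """Return n_injected from a {"success": true, "n_injected": N} reply.

    Assumption: anything without a truthy "success" field is treated as an error.
    """
    reply = json.loads(body)
    if not reply.get("success"):
        raise RuntimeError(f"steering injection failed: {reply}")
    return reply["n_injected"]


# Hypothetical base URL; adjust to wherever the server is listening.
req = build_inject_request("http://localhost:8080", "Answer in French from now on.")

# To actually send it (requires a running server built from this branch):
#   with urllib.request.urlopen(req) as resp:
#       n_injected = parse_inject_response(resp.read())
```

Per the test notes below, the target slot has to be in the generating state when the request arrives, so a real client would issue this while a streaming completion is still running.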
Tested with Qwen2.5-0.5B-Instruct:
- Chat completions streaming + mid-generation steering hint injection
- 18 tokens injected successfully during active generation
- Model behavior changed after injection (counting disrupted)
- Responses API works, steering requires slot to be in generating state
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add context size overflow check before injection
- Add decode failure rollback (undo position shift on error)
- Remove unused <cstring> include
- Add automated pytest suite (6 tests covering error cases and injection during active generation)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
5772904 to 22e54b4
cb142b0 to f40775c
marksverdhei pushed a commit that referenced this pull request Apr 12, 2026
)
* ggml: backend-agnostic tensor parallelism
* support for GPT-OSS, Qwen 3 MoE
* partial Vulkan fix
* add support for 4/8 GPUs
* unconditional peer access
* re-use buffers + ggml contexts
* fix output pattern
* NCCL support
* GGML: HIP: add RCCL support
* Remove shfl and AllReduce from backend interface
* move allocation workaround out of ggml-alloc.c
* 2d tensor set/get support
* Fix the seg fault without NCCL
* Apply suggestion from JohannesGaessler
* support for tensor dims % n_devs != 0
* fix view_offs scaling
* arbitrary num. of GPUs/tensor split
* fix compilation
* better granularity estimate
* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA. Fix compilation errors.
* partial Qwen 3 Next support
* Fix qwen3 30b (#8)
* Fix crash with Qwen-30B-A3B Q4_0
  Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.
* Decide block size based on tensor quantization type
* Fix crashes due to KV cache serialization (#9)
  KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.
* metal : fix build (#7)
* static memory allocations, fix usage count
* fix tensor granularity
* more even memory distribution
* use BF16 for allreduce
* rebase fixup
* better error message for unsupported architectures
* Fix device mismatch during scatter of allReduce. (#11)
  There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies
* Enable the previous allreduce implementation. It is better in both perf and stability (#12)
* delay AllReduce for Moe for less I/O
* build : clean-up compile warnings
* backend : move most of the meta backend API to ggml-backend-impl.h
* cont : hide unused public API in the implementation
* llama : use llama_device + remove ggml_backend_dev_is_meta()
* ggml-backend : remove unused alloc include
* minor : remove regex include
* ggml : introduce ggml-ext.h for staging new APIs
* rebase fixup
* fix tests
* llama : more robust logic for determining Meta devices (#16)
* llama : more robust logic for determining Meta devices
* cont : fix devs size check
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cont : fix log type
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* disable roundtrip for meta backend
* fix arch selection
* Qwen 3.5 support
* fix Gemma 4 MoE
* fix OpenVino, SYCL
* fix test-llama-archs for CPU-only builds
* Fix Qwen 3.5 MoE
* disable meta backend tests for WebGPU
* tests : filter CPU-based devices from the Meta backend tests (#17)
* meta : formatting, naming, indentation (#18)
* formatting : llama-model.cpp
* formatting : ggml-ext.h
* formatting : ggml-backend-meta.cpp
* meta : add TODO
* add documentation
* better error messages
* fix GPT-OSS
---------
Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Author
Closing as stale — branch is 645 commits behind |
Summary
Implements steering hints — mid-inference context injection via KV cache manipulation. This allows injecting user text into active generation at a specific context position, creating overlapping activations that steer model output without interrupting reasoning. "Telepathy for LLMs."
Core mechanism
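- Shift existing KV cache positions with seq_add to create a gap
- Build a batch at the gap positions and decode the hint tokens with llama_decode()

Key insight: llama.cpp's causal attention mask is purely position-based. It doesn't care when a KV entry was written, only what position it claims.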
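The mechanism can be illustrated with a toy model of the cache. This is only a sketch in plain Python, not the PR's C++ code: `KVEntry`, `seq_add`, `inject_hint`, and `visible` are hypothetical names, the shift is a simplified stand-in for a seq_add-style operation, and details such as rotary position re-encoding (K-shift) are ignored.

```python
from dataclasses import dataclass


@dataclass
class KVEntry:
    pos: int         # position the entry claims in the sequence
    token: str       # token text, kept only for readability
    written_at: int  # order in which the entry was written to the cache


def seq_add(cache, p0, delta):
    """Shift every entry with pos >= p0 by delta (toy stand-in for the real shift)."""
    for entry in cache:
        if entry.pos >= p0:
            entry.pos += delta


def inject_hint(cache, hint_tokens, position):
    """Open a gap at `position`, then write the hint tokens into the gap."""
    seq_add(cache, position, len(hint_tokens))
    clock = max((e.written_at for e in cache), default=-1) + 1
    for i, tok in enumerate(hint_tokens):
        cache.append(KVEntry(pos=position + i, token=tok, written_at=clock + i))


def visible(cache, query_pos):
    """Position-based causal mask: a query sees entries with pos <= query_pos,
    regardless of when they were written."""
    ordered = sorted(cache, key=lambda e: e.pos)
    return [e.token for e in ordered if e.pos <= query_pos]


cache = [KVEntry(pos=i, token=t, written_at=i) for i, t in enumerate(["A", "B", "C", "D"])]
inject_hint(cache, ["h1", "h2"], position=2)
print(visible(cache, query_pos=5))  # ['A', 'B', 'h1', 'h2', 'C', 'D']
```

The hint entries are written last but claim positions 2 and 3, so a later query attends to them as if they had always been part of the context.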
API
Library functions (include/llama.h):
- llama_steering_hint_inject(): inject pre-tokenized hint tokens into KV cache at a given position
- llama_steering_hint_prepare(): wrap text with chat template and tokenize for use with inject

Server endpoints:
- POST /steering/inject and POST /v1/steering/inject
- Accepts: { "id_slot": 0, "text": "...", "role": "system", "position": -1 }
- Returns: { "success": true, "n_injected": N }

Safety checks
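The follow-up commit in this PR describes two checks: a context size overflow check before injection, and a rollback of the position shift if decoding fails. A toy sketch of that pattern follows, assuming a simplified cache that only tracks positions; `ToyContext`, `shift`, `inject_with_checks`, and the `decode` callback are hypothetical names, not the PR's API.

```python
class ToyContext:
    """Minimal stand-in for a context: a size limit plus the occupied positions."""

    def __init__(self, n_ctx, positions):
        self.n_ctx = n_ctx
        self.positions = list(positions)


def shift(ctx, p0, delta):
    """Shift every position >= p0 by delta."""
    ctx.positions = [p + delta if p >= p0 else p for p in ctx.positions]


def inject_with_checks(ctx, n_hint, position, decode):
    """Return True if the hint was injected, False if it was rejected or rolled back.

    `decode` is a callback taking the gap positions and returning True on success.
    """
    # Check 1: refuse the injection if the hint would overflow the context.
    if len(ctx.positions) + n_hint > ctx.n_ctx:
        return False

    # Open the gap, then try to decode the hint tokens into it.
    shift(ctx, position, n_hint)
    gap = list(range(position, position + n_hint))
    if not decode(gap):
        # Check 2: decoding failed, so undo the shift and leave the cache as it was.
        shift(ctx, position + n_hint, -n_hint)
        return False

    ctx.positions.extend(gap)
    return True


ctx = ToyContext(n_ctx=8, positions=range(4))
print(inject_with_checks(ctx, 2, 2, decode=lambda gap: False), sorted(ctx.positions))
print(inject_with_checks(ctx, 2, 2, decode=lambda gap: True), sorted(ctx.positions))
```

The rollback shifts from `position + n_hint` because, after the first shift, that is where the displaced entries start; nothing occupies the gap itself when decoding has failed.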
Files changed
- docs/steering-hints.md
- include/llama.h
- src/llama-steering.h
- src/llama-steering.cpp
- src/CMakeLists.txt
- tools/server/server-task.h
- tools/server/server-task.cpp
- tools/server/server-context.h
- tools/server/server-context.cpp
- tools/server/server.cpp
- tools/server/tests/unit/test_steering.py

Known limitations
Relates to #17
Test plan
test_steering.py)

🤖 Generated with Claude Code