feat: steering hints — mid-inference context injection by marksverdhei · Pull Request #18 · heiervang-technologies/ht-llama.cpp

marksverdhei · 2026-03-12T10:42:57Z

Summary

Implements steering hints — mid-inference context injection via KV cache manipulation. This allows injecting user text into active generation at a specific context position, creating overlapping activations that steer model output without interrupting reasoning. "Telepathy for LLMs."

Core mechanism

1. The user sends hint text while the model is actively generating.
2. The hint text is wrapped with chat template tags (full open + close).
3. KV cache positions are shifted via seq_add to create a gap.
4. The hint tokens are decoded into the gap via llama_decode().
5. K-shift re-rotates RoPE, so the model sees the hint as native context.
6. Generation resumes with the hint influencing all future tokens.

Key insight: llama.cpp's causal attention mask is purely position-based — it doesn't care when a KV entry was written, only what position it claims.
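The position-based masking idea can be illustrated with a toy model. This is a minimal sketch, not the PR's implementation: `KVEntry`, `seq_add`, `inject_hint`, and `visible` are hypothetical Python stand-ins that only track positions, and the real code operates on llama.cpp's KV cache and re-rotates RoPE, which the sketch does not model.

```python
from dataclasses import dataclass


@dataclass
class KVEntry:
    pos: int      # position the entry claims (the only thing the mask looks at)
    token: str    # stand-in for the cached key/value tensors
    written: int  # write order, deliberately ignored by the mask


def seq_add(cache, p0, delta):
    """Toy analogue of a seq_add shift: move every entry at or after p0 by delta."""
    for entry in cache:
        if entry.pos >= p0:
            entry.pos += delta


def inject_hint(cache, hint_tokens, position):
    """Open a gap at `position`, then write the hint tokens into it."""
    seq_add(cache, position, len(hint_tokens))
    clock = max((e.written for e in cache), default=-1) + 1
    for i, tok in enumerate(hint_tokens):
        cache.append(KVEntry(pos=position + i, token=tok, written=clock + i))


def visible(cache, query_pos):
    """Position-based causal mask: a query sees entries with pos <= query_pos."""
    ordered = sorted(cache, key=lambda e: e.pos)
    return [e.token for e in ordered if e.pos <= query_pos]


# Five tokens already in the cache; the model has generated "1" and "2".
cache = [KVEntry(i, t, i) for i, t in enumerate(["<user>", "count", "<asst>", "1", "2"])]

# Inject a three-token hint before the generated tokens (position 3).
inject_hint(cache, ["<sys>", "stop", "</sys>"], position=3)

# The next generated token sits at position 8 and sees the hint in order,
# even though the hint entries were written last.
context = visible(cache, query_pos=8)
print(context)
# ['<user>', 'count', '<asst>', '<sys>', 'stop', '</sys>', '1', '2']
```

The `written` field never appears in `visible`, which is the point of the key insight: only the claimed position decides what a later token can attend to.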

API

Library functions (include/llama.h):

- `llama_steering_hint_inject()`: injects pre-tokenized hint tokens into the KV cache at a given position
- `llama_steering_hint_prepare()`: wraps text with the chat template and tokenizes it for use with inject

Server endpoints:

- `POST /steering/inject` and `POST /v1/steering/inject`
- Body: `{ "id_slot": 0, "text": "...", "role": "system", "position": -1 }` (`role` defaults to `system`, `position` defaults to `-1`)
- Returns: `{ "success": true, "n_injected": N }`
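A client call against the endpoint might look like the following. This is a sketch under assumptions: the base URL `http://localhost:8080` and the helper names `build_inject_request`, `parse_inject_response`, and `send` are illustrative and not part of the PR; only the route, the body fields, and the success response shape come from the description above. The shape of error responses is not documented here, so the parser treats anything without `"success": true` as a failure.

```python
import json
import urllib.request


def build_inject_request(base_url, id_slot, text, role="system", position=-1):
    """Build (but do not send) a POST request for the steering inject route."""
    body = {"id_slot": id_slot, "text": text, "role": role, "position": position}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/steering/inject",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_inject_response(raw):
    """Return n_injected from a success response; raise on anything else."""
    payload = json.loads(raw)
    if not payload.get("success"):
        raise RuntimeError(f"injection failed: {payload}")
    return payload["n_injected"]


def send(request, timeout=5.0):
    """Send the request to a running server (not called in this sketch)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_inject_response(response.read())


# Assumed server address; adjust to wherever the server is listening.
request = build_inject_request("http://localhost:8080", id_slot=0, text="Answer in French.")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Because the slot must be generating when the request arrives, a caller would typically start a streaming completion first and call `send(request)` from a second thread or process while tokens are still arriving.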

Safety checks

- Context size overflow check before injection
- Slot must be in generating state
- Error recovery: if decode fails, the position shift is rolled back
- Input validation for required fields
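The ordering of these checks can be sketched on a list of positions. This is an illustrative toy, not the PR's code: `guarded_inject`, `DecodeError`, and the `decode` callback are hypothetical names, and signalling failures with exceptions or a zero return is an assumption made for the example.

```python
class DecodeError(RuntimeError):
    """Raised by the hypothetical decode callback when decoding fails."""


def guarded_inject(positions, n_ctx, position, hint_len, decode, generating=True):
    """Inject `hint_len` positions at `position`; return how many were injected.

    `positions` is mutated in place and left untouched if anything fails.
    """
    if not generating:
        raise ValueError("slot is not generating")
    if len(positions) + hint_len > n_ctx:
        raise ValueError("hint does not fit in the context window")

    # Open the gap: shift every position at or after the insertion point.
    for i, p in enumerate(positions):
        if p >= position:
            positions[i] = p + hint_len

    gap = list(range(position, position + hint_len))
    try:
        decode(gap)
    except DecodeError:
        # Roll back: undo the shift so the cache looks as it did before.
        for i, p in enumerate(positions):
            if p >= position + hint_len:
                positions[i] = p - hint_len
        return 0

    positions.extend(gap)
    positions.sort()
    return hint_len


def failing_decode(gap):
    raise DecodeError("simulated decode failure")


ok = [0, 1, 2, 3, 4]
n_ok = guarded_inject(ok, n_ctx=16, position=3, hint_len=2, decode=lambda gap: None)

bad = [0, 1, 2, 3, 4]
n_bad = guarded_inject(bad, n_ctx=16, position=3, hint_len=2, decode=failing_decode)

print(n_ok, ok)    # 2 [0, 1, 2, 3, 4, 5, 6]
print(n_bad, bad)  # 0 [0, 1, 2, 3, 4]
```

Running the cheap checks (slot state, context size) before the shift means the only state that ever needs undoing is the position shift itself.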

Files changed

| File | Change |
| --- | --- |
| `docs/steering-hints.md` | Comprehensive design document with literature review |
| `include/llama.h` | Public API declarations |
| `src/llama-steering.h` | Internal header |
| `src/llama-steering.cpp` | Core implementation |
| `src/CMakeLists.txt` | Build integration |
| `tools/server/server-task.h` | Task type and result struct |
| `tools/server/server-task.cpp` | Result serialization |
| `tools/server/server-context.h` | Route handler declaration |
| `tools/server/server-context.cpp` | Task handler + HTTP route |
| `tools/server/server.cpp` | Endpoint registration |
| `tools/server/tests/unit/test_steering.py` | Automated tests |

Known limitations

- Injection timing is critical: on fast hardware, the slot may finish generating before the inject request arrives
- Very small models may produce repetitive output after injection (observed with a 3B model)
- No queuing mechanism for inject requests that arrive after generation completes
- The first commit message says "Closes #17" (Steering hints: mid-inference context injection via overlapping activations), but the issue should remain open for follow-up work

Relates to #17

Test plan

- Build succeeds (library + server)
- Manual test: inject during streaming generation (CPU)
- Manual test: inject during streaming generation (GPU/RTX 3090)
- Error cases: invalid slot, idle slot, missing fields
- Context size overflow protection
- Decode failure rollback
- Automated pytest suite (test_steering.py)

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


* docs: add ht-fork documentation, branding, and discussion links
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* convert: support LoRA conversion for MLA kv_b_proj
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* ci: add fork sync automation
* feat: add --remap-developer-role flag to translate developer→system
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: support LCO-Embedding-Omni (Qwen2.5 Omni Thinker) GGUF conversion
  Register Qwen2_5OmniThinkerForConditionalGeneration architecture for text and mmproj GGUF conversion. Handle config structure difference where the Thinker-only variant has vision/audio configs at the top level. Add pooling type detection for embedding use cases. Fix audio tensor routing to base MmprojModel class.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* ci: add ht branch to flake8 lint workflow triggers
* feat: welcome agentic contributions, remove upstream AI restrictions
  - Delete AGENTS.md (upstream's anti-AI contributor guidelines)
  - Replace restrictive AI Usage Policy with welcoming Agentic Contributions section
  - Update README to highlight fork's pragmatic stance on AI contributions
  Unlike upstream, we evaluate code by quality, not by how it was written.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* webui: add cancel button for in-progress model loading
  Allow users to cancel a model that is stuck loading or taking too long in the router mode model selector. The cancel button appears next to the loading spinner in both the model selector dropdown/sheet trigger and within individual model option rows. Uses the existing /models/unload endpoint which already supports unloading models in LOADING state. The frontend polling loop is interrupted via AbortController to prevent stale error toasts.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* webui: add cancelling state indicator and fix cancel polling
  - Show orange "Cancelling" indicator with spinner while cancel is in progress
  - Poll until server confirms model is no longer in LOADING state before clearing the cancelling indicator
  - Guard against redundant unload calls on already-unloaded models
  - Keep loadingModelId alive during cancel so selector trigger shows the cancelling state correctly
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat(webui): color-coded spinners for model load/unload/cancel states
  - Loading: green spinner, clockwise
  - Unloading: red spinner, reverse direction with "Unloading" label
  - Cancelling: orange spinner, reverse direction
  - Track unloading state separately in models store
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix(webui): address PR review feedback for cancel model loading
  - Remove duplicated cancel logic from ModelsSelector and ModelsSelectorSheet by deriving loading/cancelling state from the store (issue #1)
  - Fix race condition: no longer set isLoadingModel=false before cancel completes, preventing brief UI flash (issue #2)
  - Add MAX_CANCEL_POLL_ATTEMPTS (60) timeout to cancel polling loop to prevent infinite polling if server never transitions (issue #3)
  - Replace div cancel buttons with proper <button> elements for keyboard accessibility and screen reader support (issue #4)
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

- Rename all frontend references from "llama.cpp" to "ht-llama.cpp"
- Dark mode: turquoise-tinted backgrounds, purple-tinted text
- Light mode: inverted (turquoise backgrounds, purple text)
- Add reverse spin animation utility class

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Auto-discover LoRA adapters from the models directory by scanning GGUF metadata (general.type = "adapter") and match them to models by architecture. Adapters are loaded with --lora-init-without-apply so they start disabled and can be toggled on via the UI.

Frontend adds a Popover-based LoRA dropdown in the chat action bar (next to model selector) with multi-select checkboxes and scale inputs. Includes "Show only matching" toggle to view all discovered adapters. Works in both MODEL and ROUTER mode.

Backend changes:
- Add GGUF metadata scanning for adapter classification (preset.cpp)
- Auto-inject matching LoRA adapters into child process args (server-models.cpp)
- Include discovered adapters in /v1/models response
- Fix router proxy for /lora-adapters POST (array body fallback to query param)

Frontend changes:
- New LoraAdapters popover component with checkbox multi-select
- LoRA service with router mode support (query param routing)
- Reactive store with toggle, scale, change tracking, apply
- Integration in ChatFormActions bar and chat completion requests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Comprehensive research and design for mid-inference context injection via KV cache manipulation. Covers literature review (CAA, SVF, FASB, RepE, EasySteer, AI Steerability 360), llama.cpp architecture analysis (KV cache, batch positions, K-shift, chat templates), upstream landscape (control vectors, seq operations, server architecture), identified gaps, and phased implementation plan.

Closes #17

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Core library (src/llama-steering.h, src/llama-steering.cpp):
- llama_steering_hint_inject(): shifts KV cache positions via seq_add, builds a batch at the gap positions, and decodes hint tokens
- llama_steering_hint_prepare(): wraps text with chat template and tokenizes with special token parsing for use with inject

Public API (include/llama.h):
- Added steering hints section between adapter cvec and memory APIs

Server integration (tools/server/):
- New task type SERVER_TASK_TYPE_STEERING_INJECT
- POST /v1/steering/inject and /steering/inject endpoints
- Accepts: id_slot, text, role (default: system), position (default: -1)
- Updates slot token tracking after injection
- Returns: {"success": true, "n_injected": N}

Tested with Qwen2.5-0.5B-Instruct:
- Chat completions streaming + mid-generation steering hint injection
- 18 tokens injected successfully during active generation
- Model behavior changed after injection (counting disrupted)
- Responses API works, steering requires slot to be in generating state

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add context size overflow check before injection
- Add decode failure rollback (undo position shift on error)
- Remove unused <cstring> include
- Add automated pytest suite (6 tests covering error cases and injection during active generation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* ggml: backend-agnostic tensor parallelism
* support for GPT-OSS, Qwen 3 MoE
* partial Vulkan fix
* add support for 4/8 GPUs
* unconditional peer access
* re-use buffers + ggml contexts
* fix output pattern
* NCCL support
* GGML: HIP: add RCCL support
* Remove shfl and AllReduce from backend interface
* move allocation workaround out of ggml-alloc.c
* 2d tensor set/get support
* Fix the seg fault without NCCL
* Apply suggestion from JohannesGaessler
* support for tensor dims % n_devs != 0
* fix view_offs scaling
* arbitrary num. of GPUs/tensor split
* fix compilation
* better granularity estimate
* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA. Fix compilation errors.
* partial Qwen 3 Next support
* Fix qwen3 30b (#8)
* Fix crash with Qwen-30B-A3B Q4_0
  Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.
* Decide block size based on tensor quantization type
* Fix crashes due to KV cache serialization (#9)
  KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.
* metal : fix build (#7)
* static memory allocations, fix usage count
* fix tensor granularity
* more even memory distribution
* use BF16 for allreduce
* rebase fixup
* better error message for unsupported architectures
* Fix device mismatch during scatter of allReduce. (#11)
  There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies
* Enable the previous allreduce implementation. It is better in both perf and stability (#12)
* delay AllReduce for Moe for less I/O
* build : clean-up compile warnings
* backend : move most of the meta backend API to ggml-backend-impl.h
* cont : hide unused public API in the implementation
* llama : use llama_device + remove ggml_backend_dev_is_meta()
* ggml-backend : remove unused alloc include
* minor : remove regex include
* ggml : introduce ggml-ext.h for staging new APIs
* rebase fixup
* fix tests
* llama : more robust logic for determining Meta devices (#16)
* llama : more robust logic for determining Meta devices
* cont : fix devs size check
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cont : fix log type
  Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* disable roundtrip for meta backend
* fix arch selection
* Qwen 3.5 support
* fix Gemma 4 MoE
* fix OpenVino, SYCL
* fix test-llama-archs for CPU-only builds
* Fix Qwen 3.5 MoE
* disable meta backend tests for WebGPU
* tests : filter CPU-based devices from the Meta backend tests (#17)
* meta : formatting, naming, indentation (#18)
* formatting : llama-model.cpp
* formatting : ggml-ext.h
* formatting : ggml-backend-meta.cpp
* meta : add TODO
* add documentation
* better error messages
* fix GPT-OSS

---------

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

marksverdhei · 2026-04-27T13:59:08Z

Closing as stale — branch is 645 commits behind ht and the steering-hints feature hasn't been a current priority. Reopen against a fresh branch off ht if you want to revive this work; the design doc commit on the branch is the canonical reference for the approach.

marksverdhei and others added 15 commits March 11, 2026 09:31

docs: add ht-fork documentation, branding, and discussion links

3372a61

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

convert: support LoRA conversion for MLA kv_b_proj

8e1dab3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ci: add fork sync automation

83965cf

feat: add --remap-developer-role flag to translate developer→system

a0d6cdd

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ci: add ht branch to flake8 lint workflow triggers

5c8f037

feat: welcome agentic contributions, remove upstream AI restrictions (#6)

870fb9e

merge: sync with upstream master (Qwen3.5 GATED_DELTA_NET op + fixes)

b1a190e

marksverdhei marked this pull request as ready for review March 12, 2026 14:01

marksverdhei force-pushed the ht branch 5 times, most recently from 5772904 to 22e54b4 Compare March 31, 2026 11:50

marksverdhei force-pushed the ht branch 2 times, most recently from cb142b0 to f40775c Compare April 7, 2026 12:35

marksverdhei force-pushed the ht branch from 6846da3 to 139f68e Compare April 12, 2026 09:32

marksverdhei closed this Apr 27, 2026

marksverdhei wants to merge 15 commits into ht from feature/steering-hints