feat: steering hints — mid-inference context injection#18

Closed
marksverdhei wants to merge 15 commits into ht from feature/steering-hints

marksverdhei commented Mar 12, 2026

Summary

Implements steering hints — mid-inference context injection via KV cache manipulation. This allows injecting user text into active generation at a specific context position, creating overlapping activations that steer model output without interrupting reasoning. "Telepathy for LLMs."

Core mechanism

  1. User sends hint text while model is actively generating
  2. Hint text is wrapped with chat template tags (full open+close)
  3. KV cache positions are shifted via seq_add to create a gap
  4. Hint tokens are decoded into the gap via llama_decode()
  5. K-shift re-rotates RoPE — model sees hint as native context
  6. Generation resumes with hint influencing all future tokens
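
Steps 3 and 4 can be sketched with a toy cache. This is a minimal illustration, not the PR's implementation: `ToyKVCache`, `inject_hint`, and the placeholder tokens are hypothetical names, and the sketch tracks only positions, not real key/value tensors or RoPE.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKVCache:
    """Toy stand-in for a KV cache: each entry is (position, token)."""
    entries: list = field(default_factory=list)

    def append(self, pos, token):
        # Entries are stored in write order, which may differ from position order.
        self.entries.append((pos, token))

    def seq_add(self, p0, delta):
        """Shift every entry whose position is >= p0 by delta."""
        self.entries = [
            (pos + delta if pos >= p0 else pos, tok) for pos, tok in self.entries
        ]

    def in_position_order(self):
        """Tokens ordered by the position each entry claims."""
        return [tok for _, tok in sorted(self.entries, key=lambda e: e[0])]


def inject_hint(cache, hint_tokens, position):
    """Open a gap at `position`, then write the hint tokens into it."""
    cache.seq_add(position, len(hint_tokens))
    for offset, tok in enumerate(hint_tokens):
        cache.append(position + offset, tok)
    return len(hint_tokens)


cache = ToyKVCache()
for pos, tok in enumerate(["<sys>", "count", "to", "ten", "1", "2", "3"]):
    cache.append(pos, tok)

n_injected = inject_hint(cache, ["<hint>", "stop", "at", "five", "</hint>"], position=4)
print(n_injected)                 # 5
print(cache.in_position_order())  # hint sits between "ten" and "1"
```

The hint entries are physically last in the cache but claim positions 4 to 8, which is what lets them appear "inside" the earlier context.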

Key insight: llama.cpp's causal attention mask is purely position-based — it doesn't care when a KV entry was written, only what position it claims.
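
A simplified model of that position-based mask, assuming a single sequence (the hypothetical `causal_mask` below ignores sequence ids and every other detail of the real mask construction):

```python
def causal_mask(query_pos, key_positions):
    """True where a key is visible: its claimed position is <= the query's."""
    return [k_pos <= query_pos for k_pos in key_positions]


# Cache cells in physical write order. The last two cells were written most
# recently but claim positions 2 and 3 (an injected hint).
key_positions = [0, 1, 4, 5, 2, 3]

# A query at position 3 sees the late-written hint cells and not the
# earlier-written cells that now claim positions 4 and 5.
print(causal_mask(3, key_positions))  # [True, True, False, False, True, True]

# A token generated after the injection (position 6) sees everything.
print(causal_mask(6, key_positions))  # [True, True, True, True, True, True]
```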

API

Library functions (include/llama.h):

  • llama_steering_hint_inject() — inject pre-tokenized hint tokens into KV cache at a given position
  • llama_steering_hint_prepare() — wrap text with chat template and tokenize for use with inject
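
To make the "prepare" step concrete, here is a rough sketch of wrapping a hint in a complete chat turn. It assumes a ChatML-style template (the format used by Qwen2.5 models); the real function is described as using the model's chat template, and `wrap_hint_chatml` and `toy_tokenize` are hypothetical helpers, not part of the API above.

```python
import re

# ChatML turn delimiters, treated as indivisible special tokens.
SPECIAL = re.compile(r"(<\|im_start\|>|<\|im_end\|>)")


def wrap_hint_chatml(text, role="system"):
    """Wrap hint text in a full ChatML turn: opening tag, role, text, closing tag."""
    return f"<|im_start|>{role}\n{text}<|im_end|>\n"


def toy_tokenize(wrapped):
    """Whitespace tokenizer that keeps the special tags as single tokens."""
    tokens = []
    for piece in SPECIAL.split(wrapped):
        if SPECIAL.fullmatch(piece):
            tokens.append(piece)
        else:
            tokens.extend(piece.split())
    return tokens


wrapped = wrap_hint_chatml("Stop at five.")
print(repr(wrapped))
print(toy_tokenize(wrapped))
```

A real tokenizer would produce subword ids; the point here is only that the tags must be parsed as special tokens rather than as literal text.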

Server endpoints:

  • POST /steering/inject and POST /v1/steering/inject
  • Body: { "id_slot": 0, "text": "...", "role": "system", "position": -1 }
  • Returns: { "success": true, "n_injected": N }
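
A client-side sketch of building this request with the Python standard library. The base URL and the helper names are assumptions; the body fields and path are the ones listed above. The meaning of `position: -1` is not spelled out here, so the sketch simply passes it through as the default.

```python
import json
import urllib.request


def build_inject_request(base_url, id_slot, text, role="system", position=-1):
    """Build (but do not send) a POST request for the steering inject endpoint."""
    body = {"id_slot": id_slot, "text": text, "role": role, "position": position}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/steering/inject",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_inject(request, timeout=5.0):
    """Send the request and return the parsed JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Assumed server address; adjust to wherever the server is listening.
request = build_inject_request("http://localhost:8080", id_slot=0, text="Stop at five.")
print(request.full_url)
print(request.data.decode("utf-8"))
```

Calling `send_inject(request)` only makes sense while the slot is actively generating, for example from a second thread while a streaming completion is in flight.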

Safety checks

  • Context size overflow check before injection
  • Slot must be in generating state
  • Error recovery: if decode fails, position shift is rolled back
  • Input validation for required fields
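
The ordering of these checks around the position shift can be sketched as follows. This is a toy model under stated assumptions: positions are a plain list, `decode` is a caller-supplied stand-in for the real decode call, and `checked_inject`, `seq_add`, and `DecodeError` are hypothetical names.

```python
class DecodeError(RuntimeError):
    """Raised by the stand-in decode callable to simulate a failed decode."""


def seq_add(positions, p0, delta):
    """Shift, in place, every position that is >= p0 by delta."""
    for i, pos in enumerate(positions):
        if pos >= p0:
            positions[i] = pos + delta


def checked_inject(positions, n_ctx, hint_len, at, decode, generating=True):
    """Validate, open a gap, decode into it, and undo the shift on failure."""
    if not generating:
        raise ValueError("slot is not in generating state")
    if len(positions) + hint_len > n_ctx:
        raise ValueError("hint would overflow the context")

    seq_add(positions, at, hint_len)           # open the gap
    gap = list(range(at, at + hint_len))
    try:
        decode(gap)                            # may raise DecodeError
    except DecodeError:
        seq_add(positions, at + hint_len, -hint_len)  # roll the shift back
        raise
    positions.extend(gap)                      # hint cells are written last
    return hint_len


positions = [0, 1, 2, 3]
print(checked_inject(positions, n_ctx=8, hint_len=2, at=2, decode=lambda gap: None))
print(positions)  # [0, 1, 4, 5, 2, 3]
```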

Files changed

| File | Change |
| --- | --- |
| docs/steering-hints.md | Comprehensive design document with literature review |
| include/llama.h | Public API declarations |
| src/llama-steering.h | Internal header |
| src/llama-steering.cpp | Core implementation |
| src/CMakeLists.txt | Build integration |
| tools/server/server-task.h | Task type and result struct |
| tools/server/server-task.cpp | Result serialization |
| tools/server/server-context.h | Route handler declaration |
| tools/server/server-context.cpp | Task handler + HTTP route |
| tools/server/server.cpp | Endpoint registration |
| tools/server/tests/unit/test_steering.py | Automated tests |

Known limitations

  • Injection timing is critical — on fast hardware, the slot may finish generating before the inject request arrives
  • Very small models may produce repetitive output after injection (observed with 3B model)
  • No queuing mechanism for inject requests that arrive after generation completes
  • First commit message says "Closes #17", but the issue should remain open for follow-up work

Relates to #17

Test plan

  • Build succeeds (library + server)
  • Manual test: inject during streaming generation (CPU)
  • Manual test: inject during streaming generation (GPU/RTX 3090)
  • Error cases: invalid slot, idle slot, missing fields
  • Context size overflow protection
  • Decode failure rollback
  • Automated pytest suite (test_steering.py)

🤖 Generated with Claude Code

marksverdhei and others added 15 commits March 11, 2026 09:31
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Register Qwen2_5OmniThinkerForConditionalGeneration architecture for text
and mmproj GGUF conversion. Handle config structure difference where the
Thinker-only variant has vision/audio configs at the top level. Add pooling
type detection for embedding use cases. Fix audio tensor routing to base
MmprojModel class.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add ht-fork documentation, branding, and discussion links

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* convert: support LoRA conversion for MLA kv_b_proj

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* ci: add fork sync automation

* feat: add --remap-developer-role flag to translate developer→system

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: support LCO-Embedding-Omni (Qwen2.5 Omni Thinker) GGUF conversion

Register Qwen2_5OmniThinkerForConditionalGeneration architecture for text
and mmproj GGUF conversion. Handle config structure difference where the
Thinker-only variant has vision/audio configs at the top level. Add pooling
type detection for embedding use cases. Fix audio tensor routing to base
MmprojModel class.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* ci: add ht branch to flake8 lint workflow triggers

* feat: welcome agentic contributions, remove upstream AI restrictions

- Delete AGENTS.md (upstream's anti-AI contributor guidelines)
- Replace restrictive AI Usage Policy with welcoming Agentic Contributions section
- Update README to highlight fork's pragmatic stance on AI contributions

Unlike upstream, we evaluate code by quality, not by how it was written.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* webui: add cancel button for in-progress model loading

Allow users to cancel a model that is stuck loading or taking too long
in the router mode model selector. The cancel button appears next to
the loading spinner in both the model selector dropdown/sheet trigger
and within individual model option rows.

Uses the existing /models/unload endpoint which already supports
unloading models in LOADING state. The frontend polling loop is
interrupted via AbortController to prevent stale error toasts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* webui: add cancelling state indicator and fix cancel polling

- Show orange "Cancelling" indicator with spinner while cancel is in progress
- Poll until server confirms model is no longer in LOADING state before
  clearing the cancelling indicator
- Guard against redundant unload calls on already-unloaded models
- Keep loadingModelId alive during cancel so selector trigger shows
  the cancelling state correctly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(webui): color-coded spinners for model load/unload/cancel states

- Loading: green spinner, clockwise
- Unloading: red spinner, reverse direction with "Unloading" label
- Cancelling: orange spinner, reverse direction
- Track unloading state separately in models store

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(webui): address PR review feedback for cancel model loading

- Remove duplicated cancel logic from ModelsSelector and ModelsSelectorSheet
  by deriving loading/cancelling state from the store (issue #1)
- Fix race condition: no longer set isLoadingModel=false before cancel
  completes, preventing brief UI flash (issue #2)
- Add MAX_CANCEL_POLL_ATTEMPTS (60) timeout to cancel polling loop
  to prevent infinite polling if server never transitions (issue #3)
- Replace div cancel buttons with proper <button> elements for
  keyboard accessibility and screen reader support (issue #4)
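
The bounded cancel-polling idea (issue #3 above) can be sketched in a language-neutral way. The webui itself is not written in Python; `poll_until_not_loading`, the `get_status` callable, and the `"UNLOADED"` status string are hypothetical, while the limit of 60 and the `LOADING` state come from the commit message.

```python
MAX_CANCEL_POLL_ATTEMPTS = 60


def poll_until_not_loading(get_status, max_attempts=MAX_CANCEL_POLL_ATTEMPTS,
                           sleep=lambda: None):
    """Poll until the status leaves LOADING; give up after max_attempts.

    Returns the attempt number on which the transition was seen, or None
    if the status never changed within the limit.
    """
    for attempt in range(1, max_attempts + 1):
        if get_status() != "LOADING":
            return attempt
        sleep()  # stand-in for the delay between polls
    return None


# Simulated server that leaves LOADING on the third poll.
statuses = iter(["LOADING", "LOADING", "UNLOADED"])
print(poll_until_not_loading(lambda: next(statuses)))          # 3

# Simulated server that never transitions: the loop stops at the limit.
print(poll_until_not_loading(lambda: "LOADING", max_attempts=5))  # None
```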

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

- Rename all frontend references from "llama.cpp" to "ht-llama.cpp"
- Dark mode: turquoise-tinted backgrounds, purple-tinted text
- Light mode: inverted — turquoise backgrounds, purple text
- Add reverse spin animation utility class

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Auto-discover LoRA adapters from the models directory by scanning GGUF
metadata (general.type = "adapter") and match them to models by
architecture. Adapters are loaded with --lora-init-without-apply so
they start disabled and can be toggled on via the UI.
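
A sketch of the matching rule described above, assuming the GGUF metadata has already been read into plain dictionaries (the actual scanning lives in preset.cpp and is not reproduced here). `match_adapters` and `lora_args` are hypothetical helpers, and the file names are made up for illustration.

```python
def match_adapters(model_meta, discovered):
    """Return paths of adapters whose architecture matches the model's.

    `discovered` maps a file path to that file's GGUF metadata dictionary.
    """
    arch = model_meta.get("general.architecture")
    return [
        path
        for path, meta in discovered.items()
        if meta.get("general.type") == "adapter"
        and meta.get("general.architecture") == arch
    ]


def lora_args(adapter_paths):
    """Build child-process arguments so adapters load but start disabled."""
    args = []
    for path in adapter_paths:
        args += ["--lora", path]
    if adapter_paths:
        args.append("--lora-init-without-apply")
    return args


model = {"general.architecture": "qwen2"}
found = {
    "style.gguf": {"general.type": "adapter", "general.architecture": "qwen2"},
    "other.gguf": {"general.type": "adapter", "general.architecture": "llama"},
    "base.gguf": {"general.type": "model", "general.architecture": "qwen2"},
}
print(lora_args(match_adapters(model, found)))
```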

Frontend adds a Popover-based LoRA dropdown in the chat action bar
(next to model selector) with multi-select checkboxes and scale inputs.
Includes "Show only matching" toggle to view all discovered adapters.
Works in both MODEL and ROUTER mode.

Backend changes:
- Add GGUF metadata scanning for adapter classification (preset.cpp)
- Auto-inject matching LoRA adapters into child process args (server-models.cpp)
- Include discovered adapters in /v1/models response
- Fix router proxy for /lora-adapters POST (array body fallback to query param)

Frontend changes:
- New LoraAdapters popover component with checkbox multi-select
- LoRA service with router mode support (query param routing)
- Reactive store with toggle, scale, change tracking, apply
- Integration in ChatFormActions bar and chat completion requests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comprehensive research and design for mid-inference context injection
via KV cache manipulation. Covers literature review (CAA, SVF, FASB,
RepE, EasySteer, AI Steerability 360), llama.cpp architecture analysis
(KV cache, batch positions, K-shift, chat templates), upstream landscape
(control vectors, seq operations, server architecture), identified gaps,
and phased implementation plan.

Closes #17

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Core library (src/llama-steering.h, src/llama-steering.cpp):
- llama_steering_hint_inject(): shifts KV cache positions via seq_add,
  builds a batch at the gap positions, and decodes hint tokens
- llama_steering_hint_prepare(): wraps text with chat template and
  tokenizes with special token parsing for use with inject

Public API (include/llama.h):
- Added steering hints section between adapter cvec and memory APIs

Server integration (tools/server/):
- New task type SERVER_TASK_TYPE_STEERING_INJECT
- POST /v1/steering/inject and /steering/inject endpoints
- Accepts: id_slot, text, role (default: system), position (default: -1)
- Updates slot token tracking after injection
- Returns: {"success": true, "n_injected": N}

Tested with Qwen2.5-0.5B-Instruct:
- Chat completions streaming + mid-generation steering hint injection
- 18 tokens injected successfully during active generation
- Model behavior changed after injection (counting disrupted)
- Responses API works, steering requires slot to be in generating state

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add context size overflow check before injection
- Add decode failure rollback (undo position shift on error)
- Remove unused <cstring> include
- Add automated pytest suite (6 tests covering error cases and
  injection during active generation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
marksverdhei marked this pull request as ready for review March 12, 2026 14:01
marksverdhei force-pushed the ht branch 5 times, most recently from 5772904 to 22e54b4, March 31, 2026 11:50
marksverdhei force-pushed the ht branch 2 times, most recently from cb142b0 to f40775c, April 7, 2026 12:35
marksverdhei pushed a commit that referenced this pull request Apr 12, 2026

* ggml: backend-agnostic tensor parallelism

* support for GPT-OSS, Qwen 3 MoE

* partial Vulkan fix

* add support for 4/8 GPUs

* unconditional peer access

* re-use buffers + ggml contexts

* fix output pattern

* NCCL support

* GGML: HIP: add RCCL support

* Remove shfl and AllReduce from backend interface

* move allocation workaround out of ggml-alloc.c

* 2d tensor set/get support

* Fix the seg fault without NCCL

* Apply suggestion from JohannesGaessler

* support for tensor dims % n_devs != 0

* fix view_offs scaling

* arbitrary num. of GPUs/tensor split

* fix compilation

* better granularity estimate

* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.

Fix compilation errors.

* partial Qwen 3 Next support

* Fix qwen3 30b (#8)

* Fix crash with Qwen-30B-A3B Q4_0

Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.

* Decide block size based on tensor quantization type
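
The arithmetic behind this fix can be checked directly. The sketch below assumes two GPUs (the commit message does not state the device count) and uses 32, the Q4_0 block size, as the finer granularity; `even_split_possible` is a hypothetical helper, not the backend's logic.

```python
def even_split_possible(dim, granularity, n_devices):
    """True if dim splits into whole blocks that divide evenly across devices."""
    if dim % granularity != 0:
        return False
    return (dim // granularity) % n_devices == 0


# 768 / 256 = 3 blocks, which cannot be shared evenly by 2 devices.
print(even_split_possible(768, 256, 2))  # False

# 768 / 32 = 24 blocks, i.e. 12 blocks (384 elements) per device.
print(even_split_possible(768, 32, 2))   # True
```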

* Fix crashes due to KV cache serialization (#9)

KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.

* metal : fix build (#7)

* static memory allocations, fix usage count

* fix tensor granularity

* more even memory distribution

* use BF16 for allreduce

* rebase fixup

* better error message for unsupported architectures

* Fix device mismatch during scatter of allReduce. (#11)

There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies

* Enable the previous allreduce implementation. It is better in both perf and stability (#12)

* delay AllReduce for Moe for less I/O

* build : clean-up compile warnings

* backend : move most of the meta backend API to ggml-backend-impl.h

* cont : hide unused public API in the implementation

* llama : use llama_device + remove ggml_backend_dev_is_meta()

* ggml-backend : remove unused alloc include

* minor : remove regex include

* ggml : introduce ggml-ext.h for staging new APIs

* rebase fixup

* fix tests

* llama : more robust logic for determining Meta devices (#16)

* llama : more robust logic for determining Meta devices

* cont : fix devs size check

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cont : fix log type

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* disable roundtrip for meta backend

* fix arch selection

* Qwen 3.5 support

* fix Gemma 4 MoE

* fix OpenVino, SYCL

* fix test-llama-archs for CPU-only builds

* Fix Qwen 3.5 MoE

* disable meta backend tests for WebGPU

* tests : filter CPU-based devices from the Meta backend tests (#17)

* meta : formatting, naming, indentation (#18)

* formatting : llama-model.cpp

* formatting : ggml-ext.h

* formatting : ggml-backend-meta.cpp

* meta : add TODO

* add documentation

* better error messages

* fix GPT-OSS

---------

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@marksverdhei

Author

Closing as stale — branch is 645 commits behind ht and the steering-hints feature hasn't been a current priority. Reopen against a fresh branch off ht if you want to revive this work; the design doc commit on the branch is the canonical reference for the approach.

Development

Successfully merging this pull request may close these issues.

Steering hints: mid-inference context injection via overlapping activations
