Summary
Implement steering hints — the ability to inject user input into an active inference pass at a given context position, creating overlapping activations that steer model output without interrupting ongoing reasoning. This can be thought of as "telepathy": the model receives a fully-formed user message mid-generation without the usual turn-taking interruption.
Background & Prior Art
What agentic CLIs call "steering"
Several agentic CLI tools have introduced "steering hints" as a UX concept:
- Gemini CLI (issue #18782, PR #19307): "experimental in-progress steering hints" — user types while the model is thinking, text is injected into continuation turns as a hidden instruction. Explicitly prompt-level: "inject a hidden steering instruction into continuation/follow-up turns".
- Gemini CLI (issue #17197): Proposed /inject command: "the string is pushed to the conversation history as a 'User' message or 'System' hint".
- OpenAI Codex CLI: Removed the steer feature flag and standardized on an always-on steer path in the TUI. Interactive mode (--interactive) allows mid-turn guidance. Again, text-level context manipulation.
All of these operate at the prompt/context level — they queue text and inject it into the next reasoning step. None modify model activations directly.
Real activation-level steering in llama.cpp
llama.cpp already supports control vectors (PR #5970, issue #1460):
- llama_set_adapter_cvec() applies per-layer vectors to activations during the forward pass
- llama_adapter_cvec::apply_to() adds the steering tensor via ggml_add(ctx, cur, layer_dir)
- Layer range [il_start, il_end] scopes which layers receive the vector
- Vectors generated via PCA on positive/negative prompt pair activations (tools/cvector-generator/)
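As a rough numerical illustration of the control-vector path described above, the sketch below adds a per-layer direction to a hidden-state vector only for layers inside the inclusive range [il_start, il_end]. The `ControlVector` struct and this free-standing `apply_to` are hypothetical stand-ins that operate on `std::vector<float>`; they are not the llama.cpp implementation, which works on ggml tensors inside the compute graph.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a loaded control vector: one direction per layer.
struct ControlVector {
    int il_start = 0;                          // first steered layer (inclusive)
    int il_end   = 0;                          // last steered layer (inclusive)
    std::vector<std::vector<float>> layer_dir; // layer_dir[il] has n_embd entries
};

// Returns cur + layer_dir[il] when layer il is inside [il_start, il_end];
// otherwise returns cur unchanged.
std::vector<float> apply_to(const ControlVector & cv, std::vector<float> cur, int il) {
    if (il < cv.il_start || il > cv.il_end) return cur;
    if (il < 0 || (std::size_t) il >= cv.layer_dir.size()) return cur;
    const std::vector<float> & dir = cv.layer_dir[il];
    if (dir.size() != cur.size()) return cur;  // shape mismatch: leave untouched
    for (std::size_t i = 0; i < cur.size(); ++i) {
        cur[i] += dir[i];
    }
    return cur;
}
```

The point of the sketch is the contrast with the proposal that follows: the direction here is fixed ahead of time and added at every position, whereas a steering hint is derived from text supplied during generation.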
Research context
- Contrastive Activation Addition (CAA): Steering vectors computed from residual stream activation differences (Rimsky et al., 2023)
- Steering Vector Fields (Feb 2026): Learn differentiable scoring functions whose gradient defines steering direction per activation — context-aware rather than static
- SADI / FASB / CAST: Adaptive per-input steering methods that determine intervention strength on-the-fly during inference
- EasySteer: Unified framework with pluggable steering methods and pre-computed vectors for 8 domains
- AI Steerability 360: IBM toolkit for systematic LLM steering (arxiv 2603.07837)
Proposal
Core idea
During active token generation, accept user text input and:
- Tokenize it as a complete, properly-wrapped user message (with full chat template tags — see caveats)
- Encode these tokens into the KV cache at a target context position offset (overlapping with the current generation window)
- The model's attention mechanism naturally picks up these new KV entries, steering subsequent token generation
This differs from both prompt injection (waits for next turn) and static control vectors (pre-computed, fixed direction). It's dynamic, text-derived, position-targeted context steering.
How it works mechanically
llama.cpp's llama_batch already supports arbitrary position assignment:
typedef struct llama_batch {
    llama_token  *  token;   // token ids
    llama_pos    *  pos;     // positions in sequence
    llama_seq_id ** seq_id;  // sequence membership
    // ...
} llama_batch;
The KV cache stores each token's key and value tensors, indexed by (seq_id, pos). By constructing a batch with the steering hint tokens at specific positions and calling llama_decode(), we write new KV entries that later tokens in the same sequence can attend to, subject to the causal mask.
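To make the position assignment concrete, here is a minimal sketch that lays out hint tokens at consecutive positions on one sequence. `HintBatch` and `make_hint_batch` are hypothetical, simplified mirrors of the `llama_batch` fields shown above (one sequence id per token, plus a logits flag), so that the example compiles without llama.h. It is not a drop-in replacement for the real struct.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified mirror of the llama_batch fields discussed above.
struct HintBatch {
    std::vector<int32_t> token;   // token ids
    std::vector<int32_t> pos;     // position of each token in its sequence
    std::vector<int32_t> seq_id;  // one sequence id per token
    std::vector<int8_t>  logits;  // 1 where output logits are requested
};

// Places hint tokens at consecutive positions starting at first_pos on
// sequence seq. Logits are requested only for the last hint token, since
// that is the only output needed to continue sampling.
HintBatch make_hint_batch(const std::vector<int32_t> & hint_tokens,
                          int32_t first_pos, int32_t seq) {
    HintBatch b;
    for (std::size_t i = 0; i < hint_tokens.size(); ++i) {
        b.token.push_back(hint_tokens[i]);
        b.pos.push_back(first_pos + (int32_t) i);
        b.seq_id.push_back(seq);
        b.logits.push_back(i + 1 == hint_tokens.size() ? 1 : 0);
    }
    return b;
}
```

With generation currently at position N = 100, `make_hint_batch(tokens, 101, 0)` assigns positions 101, 102, ... to the hint. In llama.cpp itself the same layout would presumably be written into a batch allocated with llama_batch_init() before calling llama_decode().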
Sequence of operations
1. Model is generating token at position N
2. User types steering hint: "focus on error handling"
3. System tokenizes hint with chat template wrapping:
<|im_start|>user\nfocus on error handling<|im_end|>
4. Construct batch with hint tokens at positions [N+1, N+k]
(or on a parallel sequence that shares KV attention)
5. llama_decode() the hint batch — writes to KV cache
6. Model continues generating from position N+1 onward,
now attending to both its own prior context AND the hint
Caveats & Open Questions
Chat template handling
This is the trickiest part. User steering inputs must be fully wrapped with proper chat template tags so the model interprets them correctly:
- Must use the model's actual chat template (Jinja or built-in)
- The hint needs complete open+close tags (e.g., <|im_start|>user\n...<|im_end|> for ChatML)
- Cannot leave tags unclosed — the model will treat unclosed tags as continuation of the current assistant turn
- llama_chat_apply_template() can be used to wrap the hint as a single-message conversation with add_ass=false
- Special tokens must be properly tokenized (not as text literals)
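The wrapping rules above can be sketched for the ChatML case only. `wrap_hint_chatml` is a hypothetical helper that hard-codes the ChatML tags from the example; a real implementation would go through the model's own template (for instance via llama_chat_apply_template()) rather than string concatenation. The sketch also rejects hints that already contain ChatML control tags, on the assumption that user-typed tags should never be able to open or close a turn.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: wraps a hint as one complete ChatML user message.
// Returns false (leaving out untouched) when the hint is empty or already
// contains ChatML control tags, which could leave a turn unclosed.
bool wrap_hint_chatml(const std::string & hint, std::string & out) {
    if (hint.empty()) return false;
    if (hint.find("<|im_start|>") != std::string::npos) return false;
    if (hint.find("<|im_end|>")   != std::string::npos) return false;
    out = "<|im_start|>user\n" + hint + "<|im_end|>";
    return true;
}
```

The resulting string still has to be tokenized with special-token parsing enabled, so that the tags become single control tokens rather than literal text.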
Position collision & attention masking
- If hint tokens occupy positions that overlap with tokens the model is actively generating, we get activation interference
- Options: (a) place hints at positions ahead of current generation, (b) use a separate sequence ID with shared KV attention, (c) use position offsets that the model hasn't reached yet
- Need to understand RoPE position encoding implications — positions that are "future" relative to current generation may cause attention pattern issues
KV cache capacity
- Each hint consumes KV cache slots proportional to its token count
- For long-running generations, repeated hints could exhaust cache
- May need an eviction/compaction strategy for expired hints
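One possible shape for such a strategy is a fixed cell budget for hints with oldest-first eviction. The sketch below only does the bookkeeping; `HintSpan` and `HintBudget` are hypothetical names, and the step that actually frees the evicted positions in the KV cache is deliberately left out because it depends on how hints are placed.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical record of the KV cells occupied by one injected hint.
struct HintSpan {
    int first_pos;  // first position the hint occupies
    int n_tokens;   // number of KV cells it consumes
};

// Tracks hint cell usage against a fixed budget, evicting oldest hints first.
class HintBudget {
public:
    explicit HintBudget(int max_cells) : max_cells_(max_cells) {}

    // Records a new hint, then evicts the oldest hints until the total fits.
    // The newest hint is never evicted, even if it alone exceeds the budget.
    // Returns how many hints were evicted.
    int add(const HintSpan & span) {
        spans_.push_back(span);
        used_ += span.n_tokens;
        int evicted = 0;
        while (used_ > max_cells_ && spans_.size() > 1) {
            used_ -= spans_.front().n_tokens;
            spans_.pop_front();
            ++evicted;
        }
        return evicted;
    }

    int used() const { return used_; }
    std::size_t count() const { return spans_.size(); }

private:
    int max_cells_;
    int used_ = 0;
    std::deque<HintSpan> spans_;
};
```

Oldest-first is only one policy; whether removing a hint's entries mid-generation degrades output is an open question this sketch does not address.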
Causality
- Standard causal attention means tokens only attend to previous positions
- Hints placed at future positions won't be seen until generation reaches them
- Hints placed at current/past positions create "retroactive" steering — the model hasn't seen them before but now attends to them
- This is the "telepathy" effect: existing KV entries are unchanged, but new entries appear that influence future attention computations
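The visibility rules in this list can be checked with a toy model of the cache. The sketch assumes the standard causal rule (an entry is visible to a query token when it is on the same sequence and its position is not greater than the query's); `KvCell`, `is_visible`, and `count_visible` are hypothetical names, not llama.cpp types.

```cpp
#include <cassert>
#include <vector>

// Hypothetical toy KV entry: which sequence it belongs to and at which position.
struct KvCell {
    int seq;
    int pos;
};

// Standard causal rule: same sequence, and not in the query token's future.
bool is_visible(const KvCell & cell, int query_seq, int query_pos) {
    return cell.seq == query_seq && cell.pos <= query_pos;
}

// Counts the cache entries a query token at (query_seq, query_pos) can attend to.
int count_visible(const std::vector<KvCell> & cache, int query_seq, int query_pos) {
    int n = 0;
    for (const KvCell & c : cache) {
        if (is_visible(c, query_seq, query_pos)) ++n;
    }
    return n;
}
```

For example, with context at positions 0..4 and a three-token hint written at positions 5..7, a query at position 4 still sees only the five context entries, while a query at position 8 sees all eight.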
Alternative: parallel sequence steering
Instead of overlapping positions, use llama_memory_seq_cp() to branch the sequence:
seq 0: [system prompt] [user msg] [assistant generating...]
seq 1: [copy of seq 0] + [steering hint tokens]
Then switch generation to seq 1. This avoids position collision but costs more KV memory.
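The branching idea can be illustrated with a toy cache in which each cell carries the set of sequences it belongs to. `ToyKvCache` and its methods are hypothetical; in particular, the assumption that a sequence copy shares prefix cells rather than duplicating them is a property of this toy model, and the real cache may behave differently depending on its configuration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical toy cell: a position plus the sequences that can see it.
struct Cell {
    int pos;
    std::set<int> seqs;
};

struct ToyKvCache {
    std::vector<Cell> cells;

    // Adds a new cell owned by a single sequence.
    void append(int seq, int pos) {
        Cell c;
        c.pos = pos;
        c.seqs.insert(seq);
        cells.push_back(c);
    }

    // Tags every cell of src with dst as well; no cells are duplicated.
    void seq_cp(int src, int dst) {
        for (Cell & c : cells) {
            if (c.seqs.count(src)) c.seqs.insert(dst);
        }
    }

    // Number of cells visible to the given sequence.
    int n_cells_for(int seq) const {
        int n = 0;
        for (const Cell & c : cells) {
            if (c.seqs.count(seq)) ++n;
        }
        return n;
    }
};
```

In this model the extra cost of seq 1 is the hint itself plus whatever is generated after the switch, since those cells belong to seq 1 only.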
Performance considerations
- llama_decode() of the hint batch is a forward pass: compute cost is proportional to hint length
- For short hints ("focus on errors", "be more concise"), this is negligible
- Could batch hint processing with the next token generation step
Scope
In scope
- API for injecting steering hint text at a given context position during active generation
- Proper chat template wrapping of hint text
- Integration with existing KV cache and batch infrastructure
- Server endpoint for submitting steering hints to active completions
- Basic CLI support (type-while-generating)
Out of scope (for now)
- Activation-level steering vector computation from hint text (future enhancement)
- Automatic position selection heuristics
- Multi-modal steering hints
- Hint persistence across context shifts
References