Summary
Replace routine compaction-as-default with agent-aware handoff: when an agent's context approaches a configurable threshold (proposed default ~80% of the model's window — ~800K on 1M models), the agent writes a structured handoff prompt for a successor and the engine spawns a fresh agent loop seeded with that prompt. The successor's context begins clean, with the handoff as its stable cache prefix.
This is one concrete mechanism for the goal already tracked at #541 ("V4: cache-maximal context as default, demote routine compaction"). Filing as its own issue because the mechanism is specific enough to design + spec without conflating with #541's broader intent.
Why this beats compaction for cache behavior
V4 caches shared prefixes at 128-token granularity with ~90% cost discount. Compaction breaks that cache: the new summarized prefix is byte-different from anything cached upstream, so the next turn pays the full uncached prompt cost. On a 800K-token-prefix session, that's roughly a 10× cost spike on the post-compaction turn.
Handoff doesn't pay that cost — it doesn't mutate the existing agent's prefix at all. The outgoing agent finishes its current turn cleanly; the incoming agent starts with a brand new prefix (the handoff prompt) that becomes its own stable cache key for the rest of its run.
Empirically, the cost-of-aggressive-compaction signal has been visible in community feedback (some users disable compaction entirely to dodge the cache miss). That's a reasonable workaround for the symptom but loses the long-context benefit. Handoff keeps the long-context benefit by being more architectural about when the prefix changes.
Mechanism sketch
-
Awareness threshold — engine surfaces context fill ratio to the agent each turn. At ~70% the agent is told "approaching limit; consider wrapping current sub-task." At ~80% it's told "draft a handoff." At ~90% it's told "stop and hand off now." Thresholds configurable.
-
Handoff prompt format — structured markdown the agent writes into a designated artifact (e.g. .deepseek/handoff.md — already a real path used by prompts.rs::HANDOFF_RELATIVE_PATH for cross-session handoffs). Sections:
- Goal (what the user is trying to do)
- Current state (what's been done, what's in flight)
- Open blockers (what the next agent needs to figure out)
- Recent decisions (what's been ruled in / out)
- Files-of-interest (working set)
- Recovery hints (commands to re-discover state if anything's missing)
-
Successor spawn — engine reads the handoff, builds a fresh system prompt that prepends the handoff block, and starts a new turn loop. The user sees a marker in the transcript ("→ handed off to successor agent") but otherwise continues talking normally.
-
Hybrid fallback — if the agent self-assesses "this conversation can't be cleanly handed off" (e.g. mid-thought reasoning chain that needs full continuity), it can opt to compact instead. The decision becomes the agent's, not the system's. Both behaviors stay available; the default flips from compact-first to handoff-first.
Risks worth naming
- Producing a great handoff is itself expensive. The outgoing agent has to spend reasoning effort to write the handoff right when its context is most cluttered. Mitigation: the handoff template is structured (not free-form summary) so the cost is bounded.
- UX cost of mid-task handoff. The user notices the transition. Mitigation: clear visual marker, plus the successor opens with "I'm picking up from {summary}" so the seam reads as continuity, not amnesia.
- Not all conversations have natural break-points. Single-thread deep debugging can't be cleanly chunked. Hybrid fallback (compact-when-handoff-fails) handles this.
- Handoff quality is bounded by the outgoing agent's self-awareness. Some signals it should track: open files, recent tool failures, decisions, what the user was just told. We can scaffold these via the handoff template.
Relation to existing issues
Out of scope
- The actual handoff template content (separate spec).
- Multi-agent orchestration patterns beyond simple succession.
- v0.8.11 work — this is v0.9.0 candidate.
Reporter
Maintainer-scoped design exploration (Hmbown), 2026-05-04. Spawned from a session reviewing the cache-cost economics of routine compaction.
Summary
Replace routine compaction-as-default with agent-aware handoff: when an agent's context approaches a configurable threshold (proposed default ~80% of the model's window — ~800K on 1M models), the agent writes a structured handoff prompt for a successor and the engine spawns a fresh agent loop seeded with that prompt. The successor's context begins clean, with the handoff as its stable cache prefix.
This is one concrete mechanism for the goal already tracked at #541 ("V4: cache-maximal context as default, demote routine compaction"). Filing as its own issue because the mechanism is specific enough to design + spec without conflating with #541's broader intent.
Why this beats compaction for cache behavior
V4 caches shared prefixes at 128-token granularity with ~90% cost discount. Compaction breaks that cache: the new summarized prefix is byte-different from anything cached upstream, so the next turn pays the full uncached prompt cost. On a 800K-token-prefix session, that's roughly a 10× cost spike on the post-compaction turn.
Handoff doesn't pay that cost — it doesn't mutate the existing agent's prefix at all. The outgoing agent finishes its current turn cleanly; the incoming agent starts with a brand new prefix (the handoff prompt) that becomes its own stable cache key for the rest of its run.
Empirically, the cost-of-aggressive-compaction signal has been visible in community feedback (some users disable compaction entirely to dodge the cache miss). That's a reasonable workaround for the symptom but loses the long-context benefit. Handoff keeps the long-context benefit by being more architectural about when the prefix changes.
Mechanism sketch
Awareness threshold — engine surfaces context fill ratio to the agent each turn. At ~70% the agent is told "approaching limit; consider wrapping current sub-task." At ~80% it's told "draft a handoff." At ~90% it's told "stop and hand off now." Thresholds configurable.
Handoff prompt format — structured markdown the agent writes into a designated artifact (e.g.
.deepseek/handoff.md— already a real path used byprompts.rs::HANDOFF_RELATIVE_PATHfor cross-session handoffs). Sections:Successor spawn — engine reads the handoff, builds a fresh system prompt that prepends the handoff block, and starts a new turn loop. The user sees a marker in the transcript ("→ handed off to successor agent") but otherwise continues talking normally.
Hybrid fallback — if the agent self-assesses "this conversation can't be cleanly handed off" (e.g. mid-thought reasoning chain that needs full continuity), it can opt to compact instead. The decision becomes the agent's, not the system's. Both behaviors stay available; the default flips from compact-first to handoff-first.
Risks worth naming
Relation to existing issues
Out of scope
Reporter
Maintainer-scoped design exploration (Hmbown), 2026-05-04. Spawned from a session reviewing the cache-cost economics of routine compaction.