- Offline first — no API key, no network required; scales to the cloud when you need it
- Tools that spawn agents — the model decides when to go deeper
- Multi-hop tool use — emergent hypothesis refinement and adaptive tool selection
- Shared KV prefix — agents inherit full attention from parent
- Tree search — fork sampler, grammar, and metrics atomically for LATS / MCTS
- Branch comparison — N attempts from one origin, measure agreement
- Parallel agents, amortized compute — N agents advance in one GPU pass (bin-packed)
What you can build
Research pipelines
Search, read, hypothesize, verify — across local files, web, databases, or
any data source.
Personal assistants
Multi-turn agents with persistent KV state, connected to any service or API.
Code agents
Navigate codebases, trace dependencies, run tests, propose changes.
Data analysis
Query databases, aggregate results, produce reports with source attribution.
Support agents
Search knowledge bases, follow troubleshooting trees, escalate with full
context.
Your workflow
Any process where agents need tools, shared state, and structured cleanup.
Traditional Agents: There’s no attention continuity between LLM calls.
Each request re-encodes the full context and re-computes attention from scratch. The KV state from the prior call — the specific weighted relationships the model found between tokens — is gone. The next call re-reads a transcript of what a previous generation produced and re-computes Q·Kᵀ over the entire sequence. Every agent framework today works this way: call a tool, get a result, rebuild the prompt, make a new request. The model’s prior attention state is discarded. Continuity is simulated by making the model re-read its own output as text.

Continuous Context Agents: Agents branch, inherit, and build on each other’s attention state.
They share a physical frontier in the KV cache. Every branch inherits the full attention state of its parent — prior generations, tool results, and prefilled context remain addressable at their original positions. Forking is O(1) metadata. Context is never re-encoded.

When the model computes Q·Kᵀ for its next token, it attends over all K vectors
at positions 0..N — including those written during prior tool result prefills.
Child branches attend over the parent’s KV vectors at shared positions — the
same physical key-value pairs, not a re-encoding. No information bottleneck.
No lossy compression step.
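The branching described above can be sketched as a copy-on-write tree. This is a minimal illustration, not lloyal's implementation — the class and method names are invented: a fork is just a new record pointing at its parent, and a child's attention span is the concatenation of its ancestors' segments, by reference rather than by copy.

```python
class Branch:
    """Hypothetical sketch of a KV-cache branch that shares its prefix."""

    def __init__(self, parent=None):
        self.parent = parent  # shared prefix lives in the parent chain
        self.kv = []          # (key, value) entries this branch appended

    def append(self, k, v):
        self.kv.append((k, v))

    def fork(self):
        # O(1): no tensors move — the child just points at the parent.
        return Branch(parent=self)

    def attended_kv(self):
        # Positions 0..N: walk to the root, then concatenate each
        # ancestor's segment — the same physical entries, not copies.
        chain, node = [], self
        while node is not None:
            chain.append(node.kv)
            node = node.parent
        return [entry for seg in reversed(chain) for entry in seg]


root = Branch()
root.append("k_sys", "v_sys")      # system prompt, decoded once
a, b = root.fork(), root.fork()    # two agents, one shared prefix
a.append("k_tool", "v_tool")       # a tool result lands on branch a only
```

After the fork, `a.attended_kv()` sees the system prompt plus its own tool result, `b.attended_kv()` sees only the shared prefix, and the prefix entry in both is the identical object held by `root` — shared state, not a re-encoding.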
What emerges from this
Attention continuity changes what agents do with tools. We observe agents forming and testing hypotheses through iterative tool use — narrowing, discovering, hypothesizing, then verifying — not prompted, but emergent from the attention mechanics. Later search queries reference concepts absent from the original question, discovered during earlier reads and still physically present in the KV cache. See Concurrency Model — The Decision Boundary for the full mechanism with receipts from real pipeline runs.

Why this runs on your laptop (and phone)
Prefix sharing, scratchpad extraction, and position-aware forking aren’t performance optimizations. Without them, Continuous Context Agents don’t exist on consumer hardware — they’d need a datacenter. The efficiency is what makes the architecture possible at 16K context.

Three agents sharing a 16K window can’t fit if each re-decodes 900 tokens of tool schemas. With prefix sharing, those tokens are decoded once. Every fork inherits them. Measured across a real pipeline: 4.4x fewer tokens processed than a prompt-rebuilding approach.

A single web search result can be 1,500–3,700 tokens. Scratchpad extraction attends to the full result on an ephemeral branch, compresses it via grammar-constrained generation, then prunes the branch. The compressed result stays; the ephemeral KV is freed.

ContextPressure reads available headroom and makes real-time orchestration decisions — how many sub-agents to spawn, when to extract partial findings, when to synthesize early. Same pipeline code on a 32K cloud GPU or a 16K laptop. Depth adapts to the hardware. Runs fully offline — no API keys, no network, no data leaving the device.

The stack
- Tools: Anything an agent can call — databases, APIs, filesystems, web search, services. You define the interface.
- Agent Pools: Parallel agents on shared KV. System prompt decoded once, inherited by every fork.
- Sources: Any data backend — local files, web, vector stores, email, JIRA. Five-method contract.
- Pipelines: Compose generator stages into any workflow your application needs.
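To see why the shared-prefix design above pays for itself, here is back-of-envelope arithmetic. The 900-token schema figure comes from the text; the number and size of tool calls are made-up inputs, so the resulting ratio is illustrative, not the measured 4.4x.

```python
def rebuild_tokens(prefix, steps):
    """Prompt rebuilding: every call re-encodes the whole transcript so far."""
    total, ctx = 0, prefix
    for new in steps:
        ctx += new       # transcript grows by the new tool result
        total += ctx     # and the full context is processed again
    return total

def shared_prefix_tokens(prefix, steps):
    """Shared KV prefix: the prefix is encoded once; each call adds only its suffix."""
    return prefix + sum(steps)

# Hypothetical run: 900-token tool schemas, six tool calls of ~400 tokens each.
prefix, steps = 900, [400] * 6
ratio = rebuild_tokens(prefix, steps) / shared_prefix_tokens(prefix, steps)
print(f"{ratio:.1f}x fewer tokens with a shared prefix")
```

The gap widens with every additional hop, because prompt rebuilding pays for the entire transcript on each call while the shared prefix pays for each token exactly once.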
Start building
Quick Start
Your first agent with tools in 5 minutes.
Thinking in lloyal
New to generators and structured concurrency? Start with the mental model.
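As a taste of that mental model, here is a minimal generator pipeline with structured cleanup — plain Python, not lloyal's API: closing the downstream stage closes its upstream, so teardown runs in a deterministic, reverse order.

```python
def source(items):
    """Upstream stage: yields items, with cleanup guaranteed by finally."""
    try:
        yield from items
    finally:
        print("source closed")   # runs when the pipeline is closed

def uppercase(stage):
    """Downstream stage: transforms items and owns its upstream's lifetime."""
    try:
        for item in stage:
            yield item.upper()
    finally:
        stage.close()            # structured: closing this stage closes upstream

pipeline = uppercase(source(["a", "b", "c"]))
print(next(pipeline))            # → "A"
pipeline.close()                 # triggers both finally blocks, innermost last
```

Calling `pipeline.close()` raises `GeneratorExit` inside `uppercase`, whose `finally` closes `source` in turn — the same shape of ownership and cleanup that structured concurrency gives to whole agent trees.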