Traditional agents rebuild context each step. Continuous Context Agents persist KV state — tool results prefilled into attention, branches forked at O(1) cost.
  • Offline first — no API key, no network required; scales to the cloud
  • Tools that spawn agents — the model decides when to go deeper
  • Multi-hop tool use — emergent hypothesis refinement and adaptive querying
  • Shared KV prefix — agents inherit full attention from parent
  • Tree search — fork sampler, grammar, and metrics atomically for LATS / MCTS
  • Branch comparison — N attempts from one origin, measure agreement
  • Parallel agents, amortized compute — N agents advance in one GPU pass (bin-packed)
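
The "N agents advance in one GPU pass" idea can be sketched as a simple token-budget packer. Everything here (names, the budget of 512) is illustrative, not the lloyal API:

```typescript
// Toy bin-packer: each agent contributes the tokens it needs processed
// this step (1 for a decode, more for a tool-result prefill); steps are
// packed into GPU passes under a token budget.

interface Step { agentId: number; tokens: number }

function packIntoPasses(steps: Step[], budget: number): Step[][] {
  const passes: Step[][] = [];
  let current: Step[] = [];
  let used = 0;
  for (const s of steps) {
    // Start a new pass only when the budget would overflow.
    if (used + s.tokens > budget && current.length > 0) {
      passes.push(current);
      current = [];
      used = 0;
    }
    current.push(s);
    used += s.tokens;
  }
  if (current.length > 0) passes.push(current);
  return passes;
}

// Three decoding agents (1 token each) plus one prefilling a 500-token
// tool result all fit in a single 512-token pass.
const passes = packIntoPasses(
  [{ agentId: 0, tokens: 1 }, { agentId: 1, tokens: 1 },
   { agentId: 2, tokens: 1 }, { agentId: 3, tokens: 500 }],
  512,
);
// passes.length === 1 → four agents advance in one pass
```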

What you can build

Research pipelines

Search, read, hypothesize, verify — across local files, web, databases, or any data source.

Personal assistants

Multi-turn agents with persistent KV state, connected to any service or API.

Code agents

Navigate codebases, trace dependencies, run tests, propose changes.

Data analysis

Query databases, aggregate results, produce reports with source attribution.

Support agents

Search knowledge bases, follow troubleshooting trees, escalate with full context.

Your workflow

Any process where agents need tools, shared state, and structured cleanup.

Traditional Agents: There’s no attention continuity between LLM calls.

Each request re-encodes the full context and re-computes attention from scratch. The KV state from the prior call — the specific weighted relationships the model found between tokens — is gone. The next call re-reads a transcript of what a previous generation produced and re-computes Q·Kᵀ over the entire sequence. Every agent framework today works this way. Call a tool, get a result, rebuild the prompt, make a new request. The model’s prior attention state is discarded. Continuity is simulated by making the model re-read its own output as text.
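
To make that cost difference concrete, here is a rough token-count model (all numbers assumed for illustration, not lloyal measurements). Prompt rebuilding re-encodes the whole transcript at every step; a persistent KV cache encodes each token once:

```typescript
// Rough cost model for `steps` tool-use steps, each adding `delta` new
// tokens on top of a `base`-token prompt.

// Prompt rebuilding: step k re-encodes the entire transcript so far.
function rebuildTokens(steps: number, base: number, delta: number): number {
  let total = 0;
  for (let k = 1; k <= steps; k++) total += base + k * delta;
  return total;
}

// Persistent KV cache: the prompt is encoded once; each step encodes
// only its new tokens (tool result + generation).
function persistentTokens(steps: number, base: number, delta: number): number {
  return base + steps * delta;
}

// 8 steps, 900-token prompt, 400 new tokens per step:
const rebuilt = rebuildTokens(8, 900, 400);      // 21,600 tokens processed
const kept = persistentTokens(8, 900, 400);      //  4,100 tokens processed
// Rebuilding grows quadratically with depth; the persistent cache grows linearly.
```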

Continuous Context Agents: Agents branch, inherit, and build on each other’s attention state.

They share a physical frontier in the KV cache. Every branch inherits the full attention state of its parent — prior generations, tool results, and prefilled context remain addressable at their original positions. Forking is O(1) metadata. Context is never re-encoded.

When the model computes Q·Kᵀ for its next token, it attends over all K vectors at positions 0..N — including those written during prior tool-result prefills. Child branches attend over the parent’s KV vectors at shared positions — the same physical key-value pairs, not a re-encoding. No information bottleneck. No lossy compression step.

A fork is a metadata operation on shared KV cells — O(1) cost, zero tensor copy. Three agents forked from the same root attend to the same physical key-value pairs for the prefix while writing their own unique context above the fork point. When a tool returns a result, it’s prefilled directly into the agent’s KV cache. The model’s next generation step sees the complete result at its original position in the attention window. When a sub-agent forks from a parent, it inherits every tool call, every tool result, every token the parent generated. The sub-agent’s first token attends over the parent’s full accumulated state.
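
The fork-as-metadata idea can be sketched with a toy structure: a child branch stores only a parent pointer and a fork position, and resolves prefix reads through the parent. This is an illustration of the technique, not lloyal's internals:

```typescript
// Toy KV cache where a fork is metadata only: no cells are copied;
// the prefix is read through the parent chain.

type KV = { key: number; value: number };

class Branch {
  private cells: KV[] = [];                // cells written above the fork point
  constructor(
    private parent: Branch | null = null,
    private forkAt: number = 0,            // absolute position where this branch begins
  ) {}

  append(cell: KV): void { this.cells.push(cell); }

  length(): number { return this.forkAt + this.cells.length; }

  // Read the cell at an absolute position, walking parents for the prefix.
  at(pos: number): KV {
    if (pos >= this.forkAt) return this.cells[pos - this.forkAt];
    return this.parent!.at(pos);
  }

  // O(1) fork: a pointer and a position, zero tensor copy.
  fork(): Branch { return new Branch(this, this.length()); }
}

const root = new Branch();
root.append({ key: 1, value: 10 });
root.append({ key: 2, value: 20 });

const a = root.fork();
const b = root.fork();
a.append({ key: 3, value: 30 });   // a's private context above the fork
b.append({ key: 4, value: 40 });   // b diverges without touching a

// a.at(0) and b.at(0) return the SAME physical cell, not a re-encoding.
```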

What emerges from this

Attention continuity changes what agents do with tools. We observe agents forming and testing hypotheses through iterative tool use — narrowing, discovering, hypothesizing, then verifying — not prompted, but emergent from the attention mechanics. Later search queries reference concepts absent from the original question, discovered during earlier reads and still physically present in the KV cache. See Concurrency Model — The Decision Boundary for the full mechanism with receipts from real pipeline runs.

Why this runs on your laptop (and phone)

Prefix sharing, scratchpad extraction, and position-aware forking aren’t performance optimizations. Without them, Continuous Context Agents don’t exist on consumer hardware — they’d need a datacenter. The efficiency is what makes the architecture possible at 16K context.

Three agents sharing a 16K window can’t fit if each re-decodes 900 tokens of tool schemas. With prefix sharing, those tokens are decoded once. Every fork inherits them. Measured across a real pipeline: 4.4x fewer tokens processed than a prompt-rebuilding approach.

A single web search result can be 1,500–3,700 tokens. Scratchpad extraction attends to the full result on an ephemeral branch, compresses it via grammar-constrained generation, then prunes the branch. The compressed result stays; the ephemeral KV is freed.

ContextPressure reads available headroom and makes real-time orchestration decisions — how many sub-agents to spawn, when to extract partial findings, when to synthesize early. Same pipeline code on a 32K cloud GPU or a 16K laptop. Depth adapts to the hardware. Runs fully offline — no API keys, no network, no data leaving the device.
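
A headroom-driven decision of that kind can be sketched as a small budget function. The name `spawnBudget` and every number below are assumptions for illustration, not the ContextPressure API:

```typescript
// Given a context window and current usage, how many sub-agents can
// still be forked if each needs a reserved token budget?
function spawnBudget(windowTokens: number, usedTokens: number, perAgentReserve: number): number {
  const headroom = windowTokens - usedTokens;
  return Math.max(0, Math.floor(headroom / perAgentReserve));
}

// Same code, different hardware: depth adapts to the window.
const onLaptop = spawnBudget(16_384, 9_000, 2_500);  // 16K window → 2 sub-agents
const onCloud = spawnBudget(32_768, 9_000, 2_500);   // 32K window → 9 sub-agents
```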

The stack

Your application

@lloyal-labs/rig              Tools, data sources, pipeline building blocks

@lloyal-labs/lloyal-agents    Continuous Context Agent orchestration

@lloyal-labs/sdk              Inference runtime: branches, sessions, KV cache

@lloyal-labs/lloyal.node      Native GPU backend (macOS · Linux · Windows)
  • Tools: Anything an agent can call — databases, APIs, filesystems, web search, services. You define the interface.
  • Agent Pools: Parallel agents on shared KV. System prompt decoded once, inherited by every fork.
  • Sources: Any data backend — local files, web, vector stores, email, JIRA. Five-method contract.
  • Pipelines: Compose generator stages into any workflow your application needs.
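
Composing generator stages can be sketched generically. The stage names below are hypothetical; the lloyal pipeline API may differ:

```typescript
// A pipeline stage lazily consumes the previous stage's output.
type Stage<In, Out> = (input: Iterable<In>) => Generator<Out>;

function* mapStage<A, B>(items: Iterable<A>, f: (a: A) => B): Generator<B> {
  for (const item of items) yield f(item);
}

// Two hypothetical stages, composed by plain function application.
const search: Stage<string, string> = function* (queries) {
  yield* mapStage(queries, (q) => `results for ${q}`);
};
const summarize: Stage<string, string> = function* (docs) {
  yield* mapStage(docs, (d) => `summary of ${d}`);
};

const out = [...summarize(search(["kv cache forking"]))];
// out[0] === "summary of results for kv cache forking"
```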

Start building

Quick Start

Your first agent with tools in 5 minutes.

Thinking in lloyal

New to generators and structured concurrency? Start with the mental model.