G1 preamble doesn't scale past ~10 grafts even with caps — needs carving / lazy-load / summarization

Gap-class follow-up to G1 ([hermes-agent#55](https://github.com/TechDevGroup/hermes-agent/issues/55) / PR #59). Surfaced by a structural finding on the polynomial-explorer worker: empty responses across all 3 models tried (including `mistral-large` and `coding-groq`) when MCP tools and the vertical preamble are both attached. Even after fix #69 (synthetic recovery on `finish_reason=stop`), the worker can't make real progress, because the root contributor is preamble size.

## Current G1 behavior (correction to the field report)

My plugin **does** cap output today:

```python
# plugins/devagentic-vertical-preamble/preamble.py
_PER_GRAFT_CONTENT_CHARS = 2048
_MAX_GRAFTS_RENDERED = 8
_MAX_TOTAL_PREAMBLE_CHARS = 32768
```

So 70 grafts → **at most 8 rendered, with graft content capped at 16KB (8 × 2048 chars) plus headers**, and the total preamble capped at 32KB. That is NOT "all 70 verbatim", but it is still substantial, and it is a per-call constant rather than relevance-weighted.

## Why the cap doesn't save us

Even with caps in place:

| Contributor | Approx tokens |
|---|---:|
| Vertical preamble (8 grafts cap-rendered) | ~8K |
| Worker-guardrails doc (~3KB in poly-explorer's case) | ~1K |
| 31 MCP tool schemas (post-G2/G3/G4) | ~12-15K |
| Conversation history + task prompt | varies, 2-5K |
| **Total** | **~23-29K tokens before any model response** |
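
As a quick check on the table, the rows can be summed directly. This is only arithmetic over the approximate figures above; the dictionary keys are made-up labels, and the 128K window is the figure this issue cites for `mistral-large`.

```python
# Approximate figures copied from the table; each value is (low, high) in tokens.
budget = {
    "vertical_preamble": (8_000, 8_000),
    "worker_guardrails": (1_000, 1_000),
    "mcp_tool_schemas": (12_000, 15_000),
    "history_and_task": (2_000, 5_000),
}

low = sum(lo for lo, _ in budget.values())
high = sum(hi for _, hi in budget.values())

context_window = 128_000  # window cited for mistral-large in this issue
summary = f"{low}-{high} tokens, at most {high / context_window:.0%} of the window"
print(summary)  # 23000-29000 tokens, at most 23% of the window
```

Even the high end uses under a quarter of the window, which is why raw token count alone is an unsatisfying explanation.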

`mistral-large` has a 128K context window — 30K shouldn't overwhelm it. But empirically: it returns `content="" + finish_reason=stop + tool_calls=null` (the failure shape captured by the hermes-side diagnostic in PR #68). Hypothesis: it's not token budget, it's **schema complexity overwhelming attention**. 31 tool schemas + dense graft content (markdown with code blocks, residue tuples, polynomial formulas) is a structural attention load that some models silently fail on rather than producing content.

`coding-groq` (groq-gpt-oss-120b) shows the same failure shape. Smaller models even more so.
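
For reference, the failure shape can be detected with a small predicate. This is a sketch that assumes an OpenAI-style chat-completions `choice` dictionary; `is_empty_stop` is a hypothetical name and is not the diagnostic shipped in PR #68.

```python
from typing import Any, Mapping


def is_empty_stop(choice: Mapping[str, Any]) -> bool:
    """True when a choice has no content, no tool calls, and finish_reason == 'stop'."""
    message = choice.get("message") or {}
    # Treat None, "" and whitespace-only content as empty.
    has_content = bool((message.get("content") or "").strip())
    # Treat None and [] as "no tool calls".
    has_tool_calls = bool(message.get("tool_calls"))
    return (
        choice.get("finish_reason") == "stop"
        and not has_content
        and not has_tool_calls
    )


failing = {"finish_reason": "stop", "message": {"content": "", "tool_calls": None}}
print(is_empty_stop(failing))  # True
```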

## Why this matters for the blackbox vertical pattern

The cohesive blackbox spin-up flow (devagentic#203 §3.1) assumes vertical-spec + grafted-context auto-inject. With G1 as-shipped, that works for verticals with **≤ ~3 small grafts**. polynomial-explorer (8 sources, 70 docs) is the first vertical big enough to expose the structural ceiling. Future verticals will hit it sooner since the MCP tool count grows monotonically (G2b/c/d filed at #60/#61/#62, +4-6 more tools).

## Three approach options (orchestrator-proposed)

### (a) Summarize-on-ingest
At `spinUpVertical` time, run each graft through a paid-tier summarizer; store summaries alongside originals as `kind:grafted-context-summary` docs. Preamble loader sends summaries (~200 tokens each) instead of raw content.

- **Pros**: cheap per-turn (~1.6K tokens for 8 summaries vs 8K raw); summaries are concept-tuned; ship-once cost.
- **Cons**: information loss; need full-text fetch tool for deep-dive turns; summary quality is sensitive to the summarizer's familiarity with the domain; per-vertical pre-ship cost; needs a summarization model wired (paid tier or local distillation pass).
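
A minimal sketch of the ingest step, assuming grafts are plain dictionaries with `graft_id` and `content` keys (both hypothetical field names). The summarizer is injected as a callable so the sketch stays independent of whichever model ends up wired; `first_sentence` is only a stand-in for it.

```python
from typing import Callable


def summarize_on_ingest(
    grafts: list[dict], summarize: Callable[[str], str]
) -> list[dict]:
    """Return one summary doc per graft; the originals are left untouched."""
    summaries = []
    for graft in grafts:
        summaries.append({
            "kind": "grafted-context-summary",  # doc kind named in option (a)
            "graft_id": graft["graft_id"],       # link back for full-text fetch
            "content": summarize(graft["content"]),
        })
    return summaries


def first_sentence(text: str) -> str:
    """Stand-in summarizer: keeps only the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."


grafts = [{"graft_id": "g1", "content": "Residues repeat mod 7. More detail follows."}]
docs = summarize_on_ingest(grafts, first_sentence)
print(docs[0]["content"])  # Residues repeat mod 7.
```

Keeping `graft_id` on the summary doc is what would let a deep-dive turn fetch the original body later.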

### (b) Lazy-load on-demand via tool query (my recommendation)
**Don't inject grafted-context bodies in preamble.** Preamble shows only an INDEX: `[{graft_id, source, path, 1-line abstract}, …]`. Worker fetches bodies on demand via a new `grafted_context_fetch(graft_id)` MCP tool. Pairs with existing `lane_h_fetch` design pattern from G4.

- **Pros**: zero per-turn preamble cost above the index (~50 tokens per graft × 70 = ~3.5K); worker controls what it loads (and when); matches existing lane_h_list/lane_h_fetch surface so workers already know the idiom; no new infra needed; vertical scales to ~hundreds of grafts.
- **Cons**: extra round-trip per fetched graft; worker has to KNOW it should ask (CLAUDE.md instruction needed); loses the "everything visible by default" reflexive property of the preamble pattern; can mis-fetch (asks for graft it didn't need).
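
The fetch side of option (b) could look roughly like the sketch below. Only the tool name `grafted_context_fetch(graft_id)` comes from this proposal; the in-memory `GRAFT_STORE`, the sample grafts, and the return shape are assumptions standing in for the devagentic-backed lookup.

```python
# Hypothetical in-memory stand-in for the devagentic-backed graft store.
GRAFT_STORE = {
    "g-001": {"source": "notes", "path": "residues.md", "content": "Residue tuples mod 7."},
    "g-002": {"source": "papers", "path": "roots.md", "content": "Root bounds for quartics."},
}


def grafted_context_fetch(graft_id: str) -> dict:
    """Return one graft body, or a structured error the worker can act on."""
    graft = GRAFT_STORE.get(graft_id)
    if graft is None:
        # Listing the known ids gives the worker a way to recover from a bad id.
        return {
            "ok": False,
            "error": f"unknown graft_id: {graft_id}",
            "known_ids": sorted(GRAFT_STORE),
        }
    return {"ok": True, "graft_id": graft_id, "content": graft["content"]}


print(grafted_context_fetch("g-002")["content"])  # Root bounds for quartics.
print(grafted_context_fetch("g-999")["ok"])       # False
```

Returning a structured error instead of raising is one way to soften the mis-fetch con: a wrong id costs a round-trip but tells the worker what it could have asked for.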

### (c) Carve to top-N by relevance against current turn
Per-turn relevance scoring: embed user message → score grafts by cosine similarity → inject top-N (e.g. top-3). This can use either a proper embedding model or weaker lexical scoring (BM25 / keyword overlap).

- **Pros**: dynamic per-turn relevance; respects "preamble = what's relevant now" semantics; matches RAG-style context injection.
- **Cons**: needs embedding infrastructure (devagentic doesn't have an embedding model wired) or accepts weaker lexical scoring; per-turn compute cost; cold-turn problem (first turn has no signal to score against); may miss long-tail relevant grafts.
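
To illustrate the weaker lexical variant, here is a sketch using plain keyword overlap. It is neither BM25 nor embedding cosine similarity; `tokenize`, `top_n_grafts`, and the sample grafts are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric tokens, deduplicated."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_n_grafts(message: str, grafts: dict[str, str], n: int = 3) -> list[str]:
    """Rank graft ids by the fraction of message tokens found in each graft body."""
    query = tokenize(message)
    if not query:
        return []  # cold turn: nothing to score against
    scores = {
        graft_id: len(query & tokenize(body)) / len(query)
        for graft_id, body in grafts.items()
    }
    # Highest score first; ties broken by id so the ordering is deterministic.
    ranked = sorted(scores, key=lambda graft_id: (-scores[graft_id], graft_id))
    return [graft_id for graft_id in ranked[:n] if scores[graft_id] > 0]


grafts = {
    "a": "residue tuples modulo seven",
    "b": "root bounds for quartic polynomials",
    "c": "cooking recipes",
}
print(top_n_grafts("bounds on quartic roots", grafts, n=2))  # ['b']
```

The example also shows the long-tail risk: "roots" does not match "root", so anything short of stemming or embeddings will miss near-matches like this.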

## Recommended sequencing

Ship **(b)** first. It matches the existing G4 `lane_h_fetch` pattern, requires only a new devagentic query plus a new hermes MCP tool, and lets the worker's CLAUDE.md carry the "fetch graft on demand" instruction. Concretely, this is the short series of small PRs scoped below.

Defer **(a)** until paid-tier-summarizer infrastructure lands generally (likely R3/R4 of devagentic#210's flow-router work, or the existing scaffold-cache pattern from `framework_config.scaffold_cache_role`).

Defer **(c)** until embeddings are wired into devagentic — currently no in-process embedding model; would require adding mistral-embed or similar to the model_cfg.

## Concrete next-PR scope for option (b)

1. **devagentic**: new GraphQL query `graftedContextById(userId, graftId)` returning a single doc body. Sister to `verticalContext` (devagentic#205) and `reasoningGraftCandidates` (devagentic#207). ~30 LOC + test.
2. **hermes-agent**: new `grafted_context_fetch(graft_id)` MCP tool in a `devagentic-grafts` plugin (or extend `devagentic-lane-h` since the pattern is identical). ~50 LOC + test.
3. **hermes-agent**: modify the G1 preamble loader to render an INDEX (id + source + path + 1-line abstract taken from the first 80 chars of content) instead of full content. Keep worker-guardrails fully inlined (those are short + critical). Per-turn cost: 70 grafts × ~50 tokens = ~3.5K tokens, vs ~8K tokens for the current cap-rendered preamble (32KB cap).
4. **CLAUDE.md template** for spinUp: add a line teaching the worker about `grafted_context_fetch` — fetch what you need, when you need it.
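
The index rendering in step 3 could be sketched as follows. The field names (`graft_id`, `source`, `path`, `content`) and the pipe-separated line format are assumptions; only the 80-char abstract rule comes from the scope above.

```python
def render_index(grafts: list[dict], abstract_chars: int = 80) -> str:
    """One line per graft: id, source, path, and a truncated one-line abstract."""
    lines = []
    for graft in grafts:
        # Collapse all whitespace so the abstract always fits on a single line,
        # then keep only the first `abstract_chars` characters.
        abstract = " ".join(graft["content"].split())[:abstract_chars]
        lines.append(
            f"- {graft['graft_id']} | {graft['source']} | {graft['path']} | {abstract}"
        )
    return "\n".join(lines)


grafts = [
    {
        "graft_id": "g-001",
        "source": "notes",
        "path": "residues.md",
        "content": "# Residues\n\n" + "x" * 200,
    }
]
index = render_index(grafts)
print(index)
```

Because every entry is bounded by the abstract length plus its identifiers, the index grows linearly with graft count and stays independent of graft body size.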

Estimated effort: 3-4 PRs, ~4-6h of focused work. Per your call, this is not a quick fix; it splits naturally into a tight series.

## Severity / priority

Per orchestrator: not urgent; polynomial-explorer can wait. The fusion stack itself is healthy. File for proper design treatment; the empty-content fix in #69 keeps workers from burning retries while this gap remains unfixed (the worker sees the synthetic recovery message instructing it how to proceed despite preamble overload).

## Related

- hermes-agent#55 (G1 preamble loader, original ship)
- hermes-agent#67 (empty-content failure that surfaced this) — PR #69 ships the immediate-symptom fix
- devagentic#203 §3.1 (canonical blackbox vertical spin-up flow that this gap class breaks at scale)
- devagentic#210 (flow-router parent — R5 workflow-preamble cache is conceptually adjacent)
