Gap class follow-up to G1 (hermes-agent#55 / PR #59). Surfaced by structural finding on polynomial-explorer worker: empty responses across all 3 tried models (mistral-large, coding-groq) when MCP tools + vertical preamble both attached. Even after fix #69 (synthetic recovery on finish_reason=stop), the worker can't actually make progress because the root contributor is preamble size.
Current G1 behavior (correction to the field report)
My plugin does cap output today:
# plugins/devagentic-vertical-preamble/preamble.py
_PER_GRAFT_CONTENT_CHARS = 2048
_MAX_GRAFTS_RENDERED = 8
_MAX_TOTAL_PREAMBLE_CHARS = 32768
So 70 grafts → at most 8 rendered, capped at 16KB graft content + headers, total ≤32KB preamble. NOT "all 70 verbatim" but still substantial — and that's a per-call constant, not relevance-weighted.
Why the cap doesn't save us
Even with caps in place:
| Contributor | Approx tokens |
|---|---|
| Vertical preamble (8 grafts cap-rendered) | ~8K |
| Worker-guardrails doc (~3KB in poly-explorer's case) | ~1K |
| 31 MCP tool schemas (post-G2/G3/G4) | ~12-15K |
| Conversation history + task prompt | varies, 2-5K |
| Total | ~25-30K tokens before any model response |
mistral-large has a 128K context window, so 30K tokens shouldn't overwhelm it. Empirically, though, it returns content="" + finish_reason=stop + tool_calls=null (the failure shape captured by the hermes-side diagnostic in PR #68). Hypothesis: the problem is not token budget but schema complexity overwhelming attention. 31 tool schemas plus dense graft content (markdown with code blocks, residue tuples, polynomial formulas) form a structural attention load that some models silently fail on rather than producing content.
coding-groq (groq-gpt-oss-120b) shows the same failure shape. Smaller models even more so.
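The budget argument can be checked with a quick tally. The sketch below only re-adds the estimates from the table and compares them to the 128K window quoted for mistral-large; the variable names are illustrative. The itemized ranges sum to 23-29K, which the table rounds to ~25-30K.

```python
# Token estimates copied from the table above, as (low, high) ranges.
CONTEXT_WINDOW_TOKENS = 128_000  # mistral-large, per the text

contributors = {
    "vertical preamble (8 grafts cap-rendered)": (8_000, 8_000),
    "worker-guardrails doc": (1_000, 1_000),
    "31 MCP tool schemas": (12_000, 15_000),
    "conversation history + task prompt": (2_000, 5_000),
}

low = sum(lo for lo, _ in contributors.values())
high = sum(hi for _, hi in contributors.values())
print(f"pre-response load: {low}-{high} tokens")
print(
    f"share of window: {low / CONTEXT_WINDOW_TOKENS:.0%}"
    f"-{high / CONTEXT_WINDOW_TOKENS:.0%}"
)
```

Even at the high end the load stays under a quarter of the window, which is why the hypothesis points at structural complexity rather than raw token count.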
Why this matters for the blackbox vertical pattern
The cohesive blackbox spin-up flow (devagentic#203 §3.1) assumes vertical-spec + grafted-context auto-inject. With G1 as-shipped, that works for verticals with ≤ ~3 small grafts. polynomial-explorer (8 sources, 70 docs) is the first vertical big enough to expose the structural ceiling. Future verticals will hit it sooner since the MCP tool count grows monotonically (G2b/c/d filed at #60/#61/#62, +4-6 more tools).
Three approach options (orchestrator-proposed)
(a) Summarize-on-ingest
At spinUpVertical time, run each graft through a paid-tier summarizer; store summaries alongside originals as kind:grafted-context-summary docs. Preamble loader sends summaries (~200 tokens each) instead of raw content.
- Pros: cheap per-turn (~1.6K tokens for 8 summaries vs 8K raw); summaries are concept-tuned; ship-once cost.
- Cons: information loss; need full-text fetch tool for deep-dive turns; summary quality is sensitive to the summarizer's familiarity with the domain; per-vertical pre-ship cost; needs a summarization model wired (paid tier or local distillation pass).
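A minimal sketch of the ingest step for option (a), assuming the summarizer is injected as a plain callable. The `kind` value comes from the text above; `build_summary_docs`, the `graft_id` and `content` field names, and the placeholder `first_sentence` summarizer are hypothetical and stand in for infrastructure that does not exist yet.

```python
from typing import Callable


def build_summary_docs(
    grafts: list[dict], summarize: Callable[[str], str]
) -> list[dict]:
    """Produce one summary doc per graft; the originals are left untouched.

    `summarize` stands in for the paid-tier summarizer, which is not wired yet.
    """
    return [
        {
            "kind": "grafted-context-summary",
            "graft_id": graft["graft_id"],
            "summary": summarize(graft["content"]),
        }
        for graft in grafts
    ]


def first_sentence(text: str) -> str:
    """Placeholder summarizer for the demo only: keep the first sentence."""
    return text.split(". ")[0].strip()


grafts = [{"graft_id": "g1", "content": "First point. Second point."}]
summary_docs = build_summary_docs(grafts, first_sentence)
print(summary_docs)
```

Keeping the summarizer as a parameter means the storage shape can be settled before a summarization model is chosen.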
(b) Lazy-load on-demand via tool query (my recommendation)
Don't inject grafted-context bodies in preamble. Preamble shows only an INDEX: [{graft_id, source, path, 1-line abstract}, …]. Worker fetches bodies on demand via a new grafted_context_fetch(graft_id) MCP tool. Pairs with existing lane_h_fetch design pattern from G4.
- Pros: zero per-turn preamble cost above the index (~50 tokens per graft × 70 = ~3.5K); worker controls what it loads (and when); matches existing lane_h_list/lane_h_fetch surface so workers already know the idiom; no new infra needed; vertical scales to ~hundreds of grafts.
- Cons: extra round-trip per fetched graft; worker has to KNOW it should ask (CLAUDE.md instruction needed); loses the "everything visible by default" reflexive property of the preamble pattern; can mis-fetch (asks for graft it didn't need).
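To make the on-demand idiom in option (b) concrete, here is a small in-memory sketch of the fetch side. Only the tool name `grafted_context_fetch(graft_id)` comes from the proposal; the `GraftStore` class, the field names, and the error-string convention are assumptions, and a real tool would go through the devagentic query rather than a dict.

```python
class GraftStore:
    """In-memory stand-in for the backend that holds graft bodies (hypothetical)."""

    def __init__(self, grafts: list[dict]) -> None:
        self._bodies = {g["graft_id"]: g["content"] for g in grafts}
        self.fetches: list[str] = []  # record of what the worker actually loaded

    def grafted_context_fetch(self, graft_id: str) -> str:
        """Return one graft body; unknown ids yield an error string, not an exception."""
        body = self._bodies.get(graft_id)
        if body is None:
            return f"error: unknown graft_id {graft_id!r}"
        self.fetches.append(graft_id)
        return body


store = GraftStore([
    {"graft_id": "g1", "content": "body of the first graft"},
    {"graft_id": "g2", "content": "body of the second graft"},
])
print(store.grafted_context_fetch("g2"))  # one round-trip, one body
print(store.grafted_context_fetch("g9"))  # mis-fetch surfaces as a readable error
```

Returning an error string rather than raising is one way to let the worker recover from a bad id within the same turn. Whether the existing lane_h_fetch tool behaves this way is not stated here.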
(c) Carve to top-N by relevance against current turn
Per-turn relevance scoring: embed user message → score grafts by cosine similarity → inject top-K (e.g. top-3). Either with proper embedding model OR weak lexical scoring (BM25 / keyword overlap).
- Pros: dynamic per-turn relevance; respects "preamble = what's relevant now" semantics; matches RAG-style context injection.
- Cons: needs embedding infrastructure (devagentic doesn't have an embedding model wired) or accepts weaker lexical scoring; per-turn compute cost; cold-turn problem (first turn has no signal to score against); may miss long-tail relevant grafts.
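The weaker lexical variant of option (c) needs no embedding model. The sketch below scores grafts by plain keyword overlap with the user message and keeps the top K. This is a simplification of BM25, not BM25 itself, and the function names and graft fields are illustrative.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric word set; no stemming or stopword removal."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_k_grafts(message: str, grafts: list[dict], k: int = 3) -> list[dict]:
    """Rank grafts by number of words shared with the message; drop zero-overlap grafts."""
    query = tokenize(message)
    scored = []
    for graft in grafts:
        overlap = len(query & tokenize(graft["content"]))
        if overlap:
            scored.append((overlap, graft))
    # sort() is stable, so ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [graft for _, graft in scored[:k]]


grafts = [
    {"graft_id": "g1", "content": "polynomial factoring modulo a prime"},
    {"graft_id": "g2", "content": "deployment checklist for the worker"},
    {"graft_id": "g3", "content": "residue tables"},
]
picked = top_k_grafts("factor the polynomial modulo a prime", grafts, k=2)
print([g["graft_id"] for g in picked])
```

Two of the listed cons show up directly: an empty first-turn message selects nothing (the cold-turn problem), and "factor" does not match "factoring", so exact-token overlap can miss relevant grafts.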
Recommended sequencing
Ship (b) first: it matches the existing G4 lane_h_fetch pattern, requires only a new devagentic query + new hermes MCP tool, and the worker's CLAUDE.md can include the "fetch graft on demand" instruction. The core is concretely 2 small PRs on the existing pattern (the query and the tool); the preamble loader change and the CLAUDE.md template line round out the series scoped below.
Defer (a) until paid-tier-summarizer infrastructure lands generally (likely R3/R4 of devagentic#210's flow-router work, or the existing scaffold-cache pattern from framework_config.scaffold_cache_role).
Defer (c) until embeddings are wired into devagentic — currently no in-process embedding model; would require adding mistral-embed or similar to the model_cfg.
Concrete next-PR scope for option (b)
- devagentic: new GraphQL query graftedContextById(userId, graftId) returning a single doc body. Sister to verticalContext (devagentic#205) and reasoningGraftCandidates (devagentic#207). ~30 LOC + test.
- hermes-agent: new grafted_context_fetch(graft_id) MCP tool in a devagentic-grafts plugin (or extend devagentic-lane-h since the pattern is identical). ~50 LOC + test.
- hermes-agent: modify G1 preamble loader to render INDEX (id + source + path + 1-line abstract from first 80 chars of content) instead of full content. Keep worker-guardrails fully inlined (those are short + critical). Per-turn cost: 70 grafts × ~50 tokens = ~3.5K tokens vs up to ~8K tokens (32KB) for the current capped preamble.
- CLAUDE.md template for spinUp: add a line teaching the worker about grafted_context_fetch: fetch what you need, when you need it.
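The index rendering described in the loader item could look roughly like the sketch below. The 80-character abstract and the id/source/path fields come from the scope list; the `render_index` name, the `|` separator, and collapsing whitespace before truncating are assumptions made for illustration.

```python
def render_index(grafts: list[dict], abstract_chars: int = 80) -> str:
    """Render one line per graft: id, source, path, and a short abstract."""
    lines = []
    for graft in grafts:
        # Collapse whitespace first so a multi-line body still yields a one-line abstract.
        abstract = " ".join(graft["content"].split())[:abstract_chars]
        lines.append(
            f"- {graft['graft_id']} | {graft['source']} | {graft['path']} | {abstract}"
        )
    return "\n".join(lines)


# 70 synthetic grafts with multi-KB bodies, mirroring polynomial-explorer's doc count.
grafts = [
    {
        "graft_id": f"g{i:02d}",
        "source": f"src{i % 8}",
        "path": f"docs/doc{i:02d}.md",
        "content": f"Doc {i} heading\n\n" + "body text " * 500,
    }
    for i in range(70)
]
index = render_index(grafts)
print(index.splitlines()[0])
print(f"{len(index)} chars for {len(grafts)} grafts")
```

Running this on real graft data would give a measured index size to compare against the per-turn estimate, since actual ids and paths may be longer than the synthetic ones used here.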
Estimated effort: 3-4 PRs, ~4-6h focused. Not a quick fix per your call — splits naturally into a tight series.
Severity / priority
Per orchestrator: not urgent; polynomial-explorer can wait. Fusion stack itself is healthy. File for proper design treatment; the empty-content fix in #69 stops workers from BURNING retries even while this gap is unfixed (the worker sees the synthetic recovery message instructing it how to proceed despite preamble overload).
Related