G1 preamble doesn't scale past ~10 grafts even with caps — needs carving / lazy-load / summarization #71

@PowerCreek

Gap class follow-up to G1 (hermes-agent#55 / PR #59). Surfaced by a structural finding on the polynomial-explorer worker: responses came back empty across all 3 models tried (including mistral-large and coding-groq) whenever MCP tools and the vertical preamble were both attached. Even after fix #69 (synthetic recovery on finish_reason=stop), the worker can't make progress, because the root contributor is preamble size.

Current G1 behavior (correction to the field report)

My plugin does cap output today:

```python
# plugins/devagentic-vertical-preamble/preamble.py
_PER_GRAFT_CONTENT_CHARS = 2048
_MAX_GRAFTS_RENDERED = 8
_MAX_TOTAL_PREAMBLE_CHARS = 32768
```

So 70 grafts yield at most 8 rendered, each capped at 2048 chars (16KB of graft content plus headers), with the whole preamble held to ≤32KB. That is not "all 70 verbatim", but it is still substantial, and it is a per-call constant rather than relevance-weighted.

Why the cap doesn't save us

Even with caps in place:

| Contributor | Approx tokens |
| --- | --- |
| Vertical preamble (8 grafts cap-rendered) | ~8K |
| Worker-guardrails doc (~3KB in poly-explorer's case) | ~1K |
| 31 MCP tool schemas (post-G2/G3/G4) | ~12-15K |
| Conversation history + task prompt | varies, 2-5K |
| Total | ~25-30K tokens before any model response |
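As a quick sanity check on the contributor estimates above, the arithmetic can be sketched as follows. The figures are copied from the list; the dictionary layout and names are illustrative only, and the 128K window is the mistral-large figure quoted in this issue.

```python
# Token estimates copied from the contributor list (in thousands of tokens).
# Ranges are (low, high); single estimates repeat the same value.
CONTRIBUTORS_K = {
    "vertical preamble (8 grafts)": (8, 8),
    "worker-guardrails doc": (1, 1),
    "31 MCP tool schemas": (12, 15),
    "history + task prompt": (2, 5),
}

low = sum(lo for lo, _ in CONTRIBUTORS_K.values())
high = sum(hi for _, hi in CONTRIBUTORS_K.values())
CONTEXT_WINDOW_K = 128  # mistral-large, per the issue text

print(f"pre-response load: ~{low}-{high}K tokens")
print(f"share of window: {low / CONTEXT_WINDOW_K:.0%}-{high / CONTEXT_WINDOW_K:.0%}")
```

The rows sum to roughly 23-29K tokens, in line with the rounded ~25-30K total and under a quarter of the window, which is why raw token budget alone looks like an unlikely explanation.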

mistral-large has a 128K context window, so ~30K tokens shouldn't overwhelm it. Empirically, though, it returns content="" + finish_reason=stop + tool_calls=null (the failure shape captured by the hermes-side diagnostic in PR #68). Hypothesis: the problem is not token budget but schema complexity overwhelming attention. 31 tool schemas plus dense graft content (markdown with code blocks, residue tuples, polynomial formulas) form a structural attention load on which some models fail silently instead of producing content.

coding-groq (groq-gpt-oss-120b) shows the same failure shape, and smaller models show it even more readily.
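For reference, the failure shape can be checked mechanically. The sketch below assumes an OpenAI-style chat-completion payload (`choices[0].message.content`, `tool_calls`, `finish_reason`); `is_empty_stop` is a hypothetical helper, not the diagnostic from PR #68.

```python
from typing import Any


def is_empty_stop(response: dict[str, Any]) -> bool:
    """True when the first choice has no content, no tool calls, and finish_reason == 'stop'."""
    choices = response.get("choices") or []
    if not choices:
        return False
    choice = choices[0]
    message = choice.get("message") or {}
    # Treat None, "", and whitespace-only content as empty.
    has_content = bool((message.get("content") or "").strip())
    # Treat None and [] as "no tool calls".
    has_tool_calls = bool(message.get("tool_calls"))
    return choice.get("finish_reason") == "stop" and not has_content and not has_tool_calls


failing = {"choices": [{"finish_reason": "stop", "message": {"content": "", "tool_calls": None}}]}
healthy = {"choices": [{"finish_reason": "stop", "message": {"content": "ok", "tool_calls": None}}]}
```

A check like this separates the silent-failure case from a truncated response (finish_reason=length) or a legitimate tool call, which matters when deciding whether a retry is worth it.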

Why this matters for the blackbox vertical pattern

The cohesive blackbox spin-up flow (devagentic#203 §3.1) assumes vertical-spec + grafted-context auto-inject. With G1 as-shipped, that works for verticals with ≤ ~3 small grafts. polynomial-explorer (8 sources, 70 docs) is the first vertical big enough to expose the structural ceiling. Future verticals will hit it sooner since the MCP tool count grows monotonically (G2b/c/d filed at #60/#61/#62, +4-6 more tools).

Three approach options (orchestrator-proposed)

(a) Summarize-on-ingest

At spinUpVertical time, run each graft through a paid-tier summarizer; store summaries alongside originals as kind:grafted-context-summary docs. Preamble loader sends summaries (~200 tokens each) instead of raw content.

  • Pros: cheap per-turn (~1.6K tokens for 8 summaries vs 8K raw); summaries are concept-tuned; ship-once cost.
  • Cons: information loss; need full-text fetch tool for deep-dive turns; summary quality is sensitive to the summarizer's familiarity with the domain; per-vertical pre-ship cost; needs a summarization model wired (paid tier or local distillation pass).
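A minimal sketch of option (a), assuming the summarizer is injected as a plain callable. The `kind` value comes from the text above; `summarize_on_ingest`, the `graft_id` and `content` keys, and the placeholder `first_sentence` summarizer are hypothetical stand-ins for whatever spinUpVertical and the paid-tier model would actually provide.

```python
from typing import Callable

SUMMARY_KIND = "grafted-context-summary"  # kind name taken from the issue text


def summarize_on_ingest(grafts: list[dict], summarize: Callable[[str], str]) -> list[dict]:
    """Build one summary doc per graft, keeping a pointer back to the original."""
    return [
        {
            "kind": SUMMARY_KIND,
            "graft_id": graft["graft_id"],
            "content": summarize(graft["content"]),
        }
        for graft in grafts
    ]


def first_sentence(text: str) -> str:
    """Placeholder summarizer for the sketch: keeps text up to the first period."""
    head, _, _ = text.partition(".")
    return head.strip() + "."


docs = summarize_on_ingest(
    [{"graft_id": "g1", "content": "Residues mod 7 repeat. Further detail follows."}],
    first_sentence,
)
```

Keeping `graft_id` on each summary doc is what would let a full-text fetch tool recover the original for deep-dive turns.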

(b) Lazy-load on-demand via tool query (my recommendation)

Don't inject grafted-context bodies in preamble. Preamble shows only an INDEX: [{graft_id, source, path, 1-line abstract}, …]. Worker fetches bodies on demand via a new grafted_context_fetch(graft_id) MCP tool. Pairs with existing lane_h_fetch design pattern from G4.

  • Pros: zero per-turn preamble cost above the index (~50 tokens per graft × 70 = ~3.5K); worker controls what it loads (and when); matches existing lane_h_list/lane_h_fetch surface so workers already know the idiom; no new infra needed; vertical scales to ~hundreds of grafts.
  • Cons: extra round-trip per fetched graft; worker has to KNOW it should ask (CLAUDE.md instruction needed); loses the "everything visible by default" reflexive property of the preamble pattern; can mis-fetch (asks for graft it didn't need).
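The fetch side of option (b) could look roughly like the sketch below. The tool name `grafted_context_fetch(graft_id)` is from the proposal; the in-memory `GRAFTS` store, the return shape, and the error payload are assumptions made so the example runs standalone (the real tool would go through the devagentic query).

```python
GRAFTS = {  # hypothetical in-memory stand-in for the devagentic store
    "g1": {"source": "notes", "path": "residues.md", "content": "Full residue table ..."},
    "g2": {"source": "notes", "path": "bounds.md", "content": "Degree bounds ..."},
}


def grafted_context_fetch(graft_id: str) -> dict:
    """Return one graft body, or an error that lists valid ids so the worker can retry."""
    graft = GRAFTS.get(graft_id)
    if graft is None:
        return {"ok": False, "error": f"unknown graft_id {graft_id!r}", "known_ids": sorted(GRAFTS)}
    return {"ok": True, "graft_id": graft_id, "content": graft["content"]}
```

Returning the known ids on a miss is one way to soften the mis-fetch con: a worker that asks for the wrong graft gets enough information to correct itself in the next call.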

(c) Carve to top-N by relevance against current turn

Per-turn relevance scoring: embed user message → score grafts by cosine similarity → inject top-K (e.g. top-3). Scoring can use a proper embedding model or, failing that, weaker lexical scoring (BM25 / keyword overlap).

  • Pros: dynamic per-turn relevance; respects "preamble = what's relevant now" semantics; matches RAG-style context injection.
  • Cons: needs embedding infrastructure (devagentic doesn't have an embedding model wired) or accepts weaker lexical scoring; per-turn compute cost; cold-turn problem (first turn has no signal to score against); may miss long-tail relevant grafts.
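To illustrate the weaker lexical variant of option (c), here is a sketch that uses plain keyword overlap rather than BM25 or embeddings. `tokenize`, `top_k_grafts`, and the graft dict keys are hypothetical names chosen for the example.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase word set; a deliberately weak stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_k_grafts(message: str, grafts: list[dict], k: int = 3) -> list[dict]:
    """Rank grafts by keyword overlap with the message and keep the k best matches."""
    query = tokenize(message)
    scored = [(len(query & tokenize(g["content"])), g) for g in grafts]
    # Stable sort keeps original order among ties; zero-overlap grafts are dropped.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [g for _, g in ranked[:k]]


grafts = [
    {"graft_id": "a", "content": "polynomial residue classes"},
    {"graft_id": "b", "content": "deployment checklist"},
    {"graft_id": "c", "content": "residue tuples for polynomial factoring mod p"},
]
picked = top_k_grafts("factor this polynomial mod p using residue tuples", grafts, k=2)
cold = top_k_grafts("", grafts)
```

Here `picked` contains grafts `c` then `a`, while `cold` is empty: with no message text there is nothing to score against, which is the cold-turn problem listed in the cons. The miss on "factor" versus "factoring" also shows how exact-token overlap can lose long-tail matches.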

Recommended sequencing

Ship (b) first. It matches the existing G4 lane_h_fetch pattern, requires only a new devagentic query plus a new hermes MCP tool, and the worker's CLAUDE.md can carry the "fetch graft on demand" instruction. The query and the tool are concretely 2 small PRs on the existing pattern.

Defer (a) until paid-tier-summarizer infrastructure lands generally (likely R3/R4 of devagentic#210's flow-router work, or the existing scaffold-cache pattern from framework_config.scaffold_cache_role).

Defer (c) until embeddings are wired into devagentic. There is currently no in-process embedding model, so this would require adding mistral-embed or similar to the model_cfg.

Concrete next-PR scope for option (b)

  1. devagentic: new GraphQL query graftedContextById(userId, graftId) returning a single doc body. Sister to verticalContext (devagentic#205) and reasoningGraftCandidates (devagentic#207). ~30 LOC + test.
  2. hermes-agent: new grafted_context_fetch(graft_id) MCP tool in a devagentic-grafts plugin (or extend devagentic-lane-h since the pattern is identical). ~50 LOC + test.
  3. hermes-agent: modify G1 preamble loader to render INDEX (id + source + path + 1-line abstract from first 80 chars of content) instead of full content. Keep worker-guardrails fully inlined (those are short + critical). Per-turn cost: 70 grafts × ~50 tokens = ~3.5K tokens, vs the current ~8K tokens (up to 32KB) of cap-rendered graft content.
  4. CLAUDE.md template for spinUp: add a line teaching the worker about grafted_context_fetch — fetch what you need, when you need it.
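Step 3 above can be sketched as follows. The 80-character abstract and the "guardrails stay inlined" rule come from the list; `index_line`, `render_index_preamble`, the row layout, and the whitespace collapsing applied before truncation are assumptions for illustration.

```python
ABSTRACT_CHARS = 80  # "first 80 chars of content", per step 3


def index_line(graft: dict) -> str:
    """One index row: id, source, path, and a single-line abstract."""
    # Collapse newlines and runs of spaces so the abstract stays on one line.
    abstract = " ".join(graft["content"].split())[:ABSTRACT_CHARS]
    return f"- {graft['graft_id']} | {graft['source']} | {graft['path']} | {abstract}"


def render_index_preamble(guardrails: str, grafts: list[dict]) -> str:
    """Guardrails stay inlined in full; graft bodies are replaced by index rows."""
    rows = "\n".join(index_line(g) for g in grafts)
    return f"{guardrails}\n\nGrafted context index (fetch bodies with grafted_context_fetch):\n{rows}"


grafts = [
    {"graft_id": f"g{i}", "source": "src", "path": f"doc{i}.md", "content": "line one\nline two " * 100}
    for i in range(70)
]
index_preamble = render_index_preamble("GUARDRAILS: stay within the vertical.", grafts)
```

Because each row is bounded by the abstract length plus a short header, the index grows linearly with graft count instead of being cut off at a fixed number of rendered grafts.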

Estimated effort: 3-4 PRs, ~4-6h of focused work. Per your call this is not a quick fix; it splits naturally into a tight series.

Severity / priority

Per orchestrator: not urgent; polynomial-explorer can wait. The fusion stack itself is healthy. Filing for proper design treatment; the empty-content fix in #69 keeps workers from burning retries while this gap remains unfixed (the worker sees the synthetic recovery, which instructs it how to proceed despite preamble overload).
