feat: add v1 agent-scoped memory scope for LCM tools #2
Open
jacoblyles wants to merge 1 commit into
100yenadmin referenced this pull request in 100yenadmin/lossless-claw on May 7, 2026
Opus subagent analysis of v4.1 baseline (333 blocks) vs v4.2 stubs (689 blocks) at the same 258K-token budget recommended four mitigations to address moderate-risk findings:

1. Recency cue [t-NNm] on turn headers
2. Semantic stub wrapping in <lcm-stub> XML tags
3. Empty-assistant collapsing
4. Resolution markers at completion boundaries

Applied first-principles-architectural-decision skill (research, run-the-system, where-it-lives diagrams, adversarial debate) before building any of them.

Verdict: REJECT ALL FOUR. Each fails on a specific load-bearing constraint:

- #1 fails on prefix-cache stability: a clock-based tag changes the rendered string on every assemble, invalidating the cache that v4.2's whole value proposition relies on. User timestamps already exist inline.
- #2 fails on "novelty has cost, format already works": the existing [LCM Tool Output: file_xxx | …] bracket form is correctly parsed by Opus in live tests (drilldown via lcm_describe works on Option F format). Replacing a working v4.1-trained format with a novel XML form is unjustified churn.
- #3 fails on the Anthropic/OpenAI wire contract. The "empty assistants" contain tool_use blocks (required to live in assistant turns; paired with tool_results by toolCallId). Dropping them would break pairing, and providers reject orphan tool_results.
- #4 fails on detection signal. There is no reliable way to mark "work completed": user phrases like "go ahead" / "yes" / "keep digging" oscillate. False positives are strictly worse than no marker (they license premature stubbing).

Adversarial debate at ≥95% confidence target on each. AGAINST won on all four.

Decision record committed for future operators who hit similar moderate-risk findings and reach for similar mitigations.

Final v4.2 shipping shape: Options C + D + F at commit e309bed. Architecturally additive, reversible, default-off. Empirically: 333→689 items at same budget; Opus drills down correctly; no confabulation observed.
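The prefix-cache objection to mitigation #1 can be illustrated with a small sketch. Everything here is hypothetical: the `Turn` shape, both render functions, and the use of a shared character prefix as a stand-in for what a provider-side prefix cache could reuse are assumptions for illustration, not the project's actual assembler.

```typescript
// Hypothetical turn shape; the real LCM message type is not shown in this thread.
interface Turn {
  role: "user" | "assistant";
  sentAtMs: number;
  text: string;
}

// Stable header: depends only on stored data, so repeated assembles
// produce byte-identical output.
function renderStable(turn: Turn): string {
  return `[${turn.role} @ ${new Date(turn.sentAtMs).toISOString()}] ${turn.text}`;
}

// Clock-based header in the spirit of the rejected [t-NNm] cue:
// the tag depends on "now", so the output drifts between assembles.
function renderWithRecencyCue(turn: Turn, nowMs: number): string {
  const minutesAgo = Math.floor((nowMs - turn.sentAtMs) / 60_000);
  return `[t-${minutesAgo}m] [${turn.role}] ${turn.text}`;
}

// Length of the shared leading prefix of two strings; a rough proxy
// for how much of a rendered context a prefix cache could reuse.
function commonPrefixLength(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  return i;
}

const turn: Turn = { role: "user", sentAtMs: 0, text: "please check the build logs" };

// Two assembles of the same turn, five minutes apart.
const firstAssemble = renderWithRecencyCue(turn, 10 * 60_000);
const secondAssemble = renderWithRecencyCue(turn, 15 * 60_000);

// The stable form is identical across assembles; the clock-based form
// diverges inside the very first header.
const stableUnchanged = renderStable(turn) === renderStable(turn);
const sharedPrefix = commonPrefixLength(firstAssemble, secondAssemble);
```

In this sketch the two clock-based renders share only the leading `[t-1` before diverging, so everything after the first header would miss the cache, while the stable render is reusable in full.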
100yenadmin referenced this pull request in 100yenadmin/lossless-claw on May 7, 2026
…pattern

Wire #2 of 3 for the agent context-management architecture (Wave-14).

# What this lands

Tools that could push context over budget now run a pre-call gate BEFORE doing work: estimate the result size; if (currentTokens + estimated) / tokenBudget > REFUSAL_THRESHOLD (0.92), return a structured `{ok: false, needsCompact: true, ...}` payload instead. The agent reads the payload, calls lcm_compact, and retries: the natural negotiation pattern.

Without this layer, an agent at 78% context calling `lcm_describe expandMessages=true expandMessagesLimit=20` (estimated 13K tokens) lands at ~84% AT BEST, but worst-case messages can saturate the result cap and push past 100%, causing context_length_exceeded errors mid-turn.

# Tools wired

PRE-CHECK ENFORCED (7):

- lcm_grep (5 modes)
- lcm_semantic_recall
- lcm_describe (HIGHEST priority; biggest blow-up risk per Agent C)
- lcm_expand_query
- lcm_get_entity
- lcm_search_entities
- lcm_compact (small footprint; included for uniform agent UX)

NOT WIRED (intentionally; self-protecting or out-of-scope):

- lcm_synthesize_around: internal 50K source cap; prompt-bounded output ~2-3K. Per Agent B, can't blow context.
- lcm_expand: sub-agent-only, has its own grant ledger

# Files

NEW:

- `src/plugin/needs-compact-gate.ts` (~190 LOC): REFUSAL_THRESHOLD constant (0.92, calibrated against real DB), per-tool `estimateResultTokens(toolName, params)` formulas, the `evaluateNeedsCompactGate` core logic, and a `runWithTokenGate` wrapper helper that tools use to compose pre-check + post-call cache accumulation.
- `test/v41-needs-compact-gate.test.ts` (~120 LOC): 19 tests covering per-tool estimator math, refusal logic, suggested-action narrowing, bypass-on-missing-telemetry, and threshold boundary cases.

EDITED (each ~5-10 LOC of changes):

- src/tools/lcm-grep-tool.ts: gate at top of execute, tap on returns
- src/tools/lcm-describe-tool.ts: gate + tap on final return
- src/tools/lcm-semantic-recall-tool.ts: runWithTokenGate wrapper
- src/tools/lcm-expand-query-tool.ts: wrapper
- src/tools/lcm-get-entity-tool.ts: wrapper
- src/tools/lcm-search-entities-tool.ts: wrapper
- src/plugin/index.ts: pass `getRuntimeContext` to all 7 tool factories
- src/plugin/token-state.ts: add `tapResultForTokenAccounting` helper

# How the agent experience works

```
Agent: lcm_describe id=sum_xxx expandMessages=true expandMessagesLimit=30

Tool gate:
  estimatedResultTokens = 10000 (capped)
  currentRatio = 0.78
  projectedRatio = (156000 + 10000) / 200000 = 0.83
  → BELOW 0.92 → run normally

Agent: lcm_describe id=sum_yyy expandMessages=true expandMessagesLimit=30

Tool gate:
  currentRatio = 0.89  // accumulated from previous result
  projectedRatio = 0.94
  → OVER 0.92 → REFUSE

Tool returns:
{
  ok: false,
  needsCompact: true,
  reason: "context-overflow-prevention",
  currentRatio: 0.89,
  estimatedResultTokens: 10000,
  projectedRatio: 0.94,
  note: "Serving this call would push context to 94% of budget...",
  suggested_actions: [
    "lcm_compact then retry with same params",
    "retry with expandMessagesLimit=15"
  ]
}

Agent: reads, calls lcm_compact, retries. Now at 70% — call succeeds.
```

# Threshold (0.92) calibration

Wave-14 Agent A sampled Eva's live DB (3,904 leaves, 414 condensed, 315K messages). Per-tool result hard cap is 10K tokens (MAX_RESULT_CHARS / 4).

With 200K context:

- 0.95 cushion → 10K headroom = zero margin (one capped call → 100%)
- 0.92 cushion → 16K headroom = one capped call + agent response
- Lower thresholds → over-refusal on safe calls

# Per-tool estimator confidence

(Per Wave-14 Agent C calibration against actual format strings)

- lcm_grep regex/full_text/hybrid/semantic: 90%
- lcm_grep verbatim: 60% (variable per-message size)
- lcm_semantic_recall: 90%
- lcm_describe (no expand): 70%
- lcm_describe (expand flags): 60% (high subtree variance)
- lcm_get_entity / lcm_search_entities: 90%
- lcm_expand_query: 80%

Estimator capped at HARD_CAP_TOKENS (10K) regardless of natural estimate, which protects against under-estimation. Tools that return less than estimated just have headroom; tools with bad estimates get their natural cap protection.

# Verification

- 1592/1592 tests passing (1573 baseline + 19 new gate tests)
- 7/7 release-readiness preflight checks pass
- 330 TS errors (under 700 baseline; PR introduced none)

# What's next (Commit 3 of 3)

Synchronous compaction at critical pressure (`afterTurn` deferred-mode drain runs sync at >0.85 currentRatio). System-level safety net behind the agent-driven layers.
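The pre-call gate described under "What this lands" can be sketched roughly as follows. This is a minimal illustration, not the contents of `src/plugin/needs-compact-gate.ts`: the constants and payload field names come from the commit message, while `GateInput`, `GateDecision`, and `evaluateGateSketch` are hypothetical names, and the handling of missing telemetry is an assumption based on the "bypass-on-missing-telemetry" test description.

```typescript
// Values taken from the commit message.
const REFUSAL_THRESHOLD = 0.92;
const HARD_CAP_TOKENS = 10_000; // per-tool result cap (MAX_RESULT_CHARS / 4)

// Hypothetical input shape; the real gate reads these from runtime context.
interface GateInput {
  currentTokens?: number; // undefined when telemetry is missing
  tokenBudget?: number;
  naturalEstimate: number; // per-tool estimate before capping
}

// Hypothetical result shape; refusal fields mirror the payload in the example.
type GateDecision =
  | { ok: true }
  | {
      ok: false;
      needsCompact: true;
      reason: "context-overflow-prevention";
      currentRatio: number;
      estimatedResultTokens: number;
      projectedRatio: number;
    };

function evaluateGateSketch(input: GateInput): GateDecision {
  const { currentTokens, tokenBudget, naturalEstimate } = input;

  // Assumed bypass: with no telemetry there is nothing to compare against,
  // so the call is allowed to proceed.
  if (currentTokens === undefined || tokenBudget === undefined || tokenBudget <= 0) {
    return { ok: true };
  }

  // Cap the estimate at the hard per-tool result cap.
  const estimatedResultTokens = Math.min(Math.max(0, naturalEstimate), HARD_CAP_TOKENS);
  const currentRatio = currentTokens / tokenBudget;
  const projectedRatio = (currentTokens + estimatedResultTokens) / tokenBudget;

  // Refuse only when the projection strictly exceeds the threshold.
  if (projectedRatio > REFUSAL_THRESHOLD) {
    return {
      ok: false,
      needsCompact: true,
      reason: "context-overflow-prevention",
      currentRatio,
      estimatedResultTokens,
      projectedRatio,
    };
  }
  return { ok: true };
}

// The two calls from the walkthrough: 156K and 178K tokens used of a 200K budget.
const firstCall = evaluateGateSketch({ currentTokens: 156_000, tokenBudget: 200_000, naturalEstimate: 13_000 });
const secondCall = evaluateGateSketch({ currentTokens: 178_000, tokenBudget: 200_000, naturalEstimate: 13_000 });
```

With these inputs the first call projects to 0.83 and runs, while the second projects to 0.94 and is refused, matching the walkthrough. The same arithmetic gives the headroom figures in the calibration section: (1 - 0.92) × 200K = 16K tokens, versus 10K at 0.95.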
Collaborator
@jacoblyles triage pass update: I marked this priority:P3 enhancement/linked-pr/stale-check. The P0 delegated-retrieval leakage path was fixed by #768, so this older agent-scoped memory branch may now be partially superseded. Can you confirm whether there is remaining scope behavior here that #768 did not cover?
Summary
src/plugins/agent-memory-scope
Test
npx vitest run test/agent-memory-scope.test.ts test/lcm-tools.test.ts test/lcm-expand-query-tool.test.ts
Notes
allowAgentScope, maxAgentConversations) from OpenClaw