(ignore) feat(v4.2): stub-tier stratification — externalize old tool results, agent drills down via lcm_describe(file_xxx) by 100yenadmin · Pull Request #626 · Martian-Engineering/lossless-claw

100yenadmin · 2026-05-07T16:25:49Z

Stacked on top of #613 (LCM v4.1 omnibus). PR #626 has 9 commits + ~2,140 LOC of v4.2 delta over electricsheephq:feat/lcm-v4.1-omnibus head (536784c). Reviewers can either wait for #613 to merge, after which this branch rebases cleanly onto main, or review the delta directly via git diff 536784c...feat/lcm-stub-tier-stratification.

The problem this solves

When a long session pushes against the token budget at assemble time, v4.1's only lever for evictable items is "drop the whole row." Heavy tool results (12K+ tokens for a verbose Read/Bash/Grep) force the budget into a bad choice:

Keep the heavy old tool result → lose 3 older summaries
Drop the tool result → lose its assistant pairing and meaning too

Measured on a real DB (live snapshot, 2.6 GB, 315k messages), session 0cb8928b at 258k budget: chronological eviction kept 333 items.

What this PR does

Adds a per-row sidecar (messages.large_content) that stores a file_xxx id pointing to the externalized payload in large_files (existing v4.1 storage table). At assemble time, evictable tool-result rows with the sidecar populated are replaced with the v4.1 [LCM Tool Output: file_xxx | tool=… | N bytes] reference format that's been in production for months. Drilldown uses the existing lcm_describe(id="file_xxx") path.

[LCM Tool Output: file_85b322f5fda14187ab641331 | tool=exec | 34,975 bytes]

Exploration Summary:
Tool: exec | Command: bash -lc 'cd /Users/lume/.openclaw/workspace && echo "== selfheal ==" && sed -n "1,260p" scripts/evaos-support/selfheal.sh && echo "== config-guard ==" …

Use lcm_describe with the file id to inspect the full output.

The Exploration Summary line carries a one-line preview of the originating tool_input (path / command / pattern / sessionId) so an agent reading the conversation can match a user reference like "the selfheal.sh script you read earlier" to the right fileId, then drill down.
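As an illustration of the reference format quoted above, the sketch below renders and parses the `[LCM Tool Output: …]` header line. The `StubRef` shape and both helper names are hypothetical and are not taken from the repository; only the header layout comes from the example in this description.

```typescript
// Hypothetical shape for illustration; not the repository's actual type.
interface StubRef {
  fileId: string; // e.g. "file_85b322f5fda14187ab641331"
  tool: string;   // originating tool name, e.g. "exec"
  bytes: number;  // size of the externalized payload
}

// Insert thousands separators without relying on locale data: 34975 -> "34,975".
function groupThousands(n: number): string {
  return String(n).replace(/\B(?=(\d{3})+(?!\d))/g, ",");
}

// Render the header line in the layout shown in the example above.
function formatStubHeader(ref: StubRef): string {
  return `[LCM Tool Output: ${ref.fileId} | tool=${ref.tool} | ${groupThousands(ref.bytes)} bytes]`;
}

// Assumed pattern: a file_ id, a tool name without "|" or "]", and a byte count.
const STUB_HEADER = /^\[LCM Tool Output: (file_[0-9A-Za-z]+) \| tool=([^|\]]+?) \| ([\d,]+) bytes\]$/;

// Parse a header line back into its parts; returns null for non-stub text.
function parseStubHeader(headerLine: string): StubRef | null {
  const m = STUB_HEADER.exec(headerLine.trim());
  if (m === null) return null;
  return { fileId: m[1], tool: m[2], bytes: Number(m[3].replace(/,/g, "")) };
}
```

A reader-side tool could use a parser like this to pull the `file_xxx` id out of a stub before issuing the `lcm_describe` call.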

Architecture

| Layer | File | What |
| --- | --- | --- |
| Schema | `src/db/migration.ts` | `messages.large_content TEXT` (idempotent ALTER); `PRAGMA busy_timeout=30000` before `BEGIN EXCLUSIVE` to coexist with running gateway |
| Store | `src/store/conversation-store.ts` | Project + map row → `MessageRecord.largeContent` |
| Assembler | `src/assembler.ts` | New `applyStubSubstitution()` runs before budget pass on evictable items only; fresh tail (last ~64 turns / 24K tokens) is never stubbed; assistant turns are never stubbed; only tool-result `content` is replaced |
| Engine | `src/engine.ts` | Forwards `config.stubLargeToolPayloads` (default `false`) |
| Tool description | `src/tools/lcm-describe-tool.ts` | Strengthened to mention `[LCM Tool Output: file_xxx]` references so the model's tool-selection heuristics fire |
| Migration script | `scripts/lcm-blob-migrate.mjs` | Idempotent (only touches `large_content IS NULL` rows); 200-row chunked transactions; `PRAGMA busy_timeout`; `wal_checkpoint(TRUNCATE)` after large UPDATE; populates `large_files.exploration_summary` with the `tool_input`-derived disambiguator |

Empirical bench (live-DB snapshot)

Session 0cb8928b, 6,804 messages, 258k token budget:

| Variant | Items | Tokens | Stubbed | Tokens saved |
| --- | --- | --- | --- | --- |
| v4.1 baseline | 333 | 252,288 | 0 | 0 |
| v4.2 (full migration) | 689 | 257,849 | 86 | 412,373 |

Tool-result count is identical in both (101 in each). v4.2 doesn't displace tool outputs; it stubs heavy ones and reuses the freed budget to fit more older history. Same token budget, about 2× the items (333 → 689) and roughly 1.8× the wall-clock span of context (~74 min → ~130 min on this conversation).

Drilldown validation (Opus subagents)

Spawned Claude Opus 4.1 subagents and gave them the assembled prompt as a transcript file. Three scenarios:

| Test | Result |
| --- | --- |
| Conversational summary ("what did we work on?") | Substantive coherent answer drawn from assistant turns. Zero tool calls needed. No confabulation. |
| Specific elided-content probe (before tool_input disambiguator) | Opus correctly refused to guess but couldn't match user reference to fileId; assistant `tool_use` blocks were stripped by the assembler |
| Same probe after Option F | Found correct fileId, wrote `lcm_describe(id="file_xxx")` call, refused to fabricate ✅ |

Quoting Opus on the Option F result:

"The Option F Exploration Summary: Tool: … | Command: … line was extremely helpful. The command string contained sed -n \"1,260p\" scripts/evaos-support/selfheal.sh literally — that's an unambiguous keyword match for 'what file's contents are in this elided blob.' Without that line I'd have had to grep the assistant text for nearby fileId references and guess, or call lcm_describe on multiple candidates speculatively. With it, the mapping was one grep away."

Critically, Opus does not confabulate when content is unavailable.

Mitigation evaluation (post-skill review)

Four mitigations were proposed by Opus to address moderate-risk items found in the comparative analysis. We applied the first-principles-architectural-decision skill (research / run-the-system / where-it-lives diagrams / adversarial debate at ≥95% confidence) before deciding to build any of them.

Verdict: REJECT ALL FOUR. Decision record committed at audit/v42-bench/DECISION-mitigations.md. One-line summary:

| Mitigation | Decision | Why |
| --- | --- | --- |
| Recency cue `[t-NNm]` | REJECT | Cache thrashing (clock-based string changes prefix every assemble). User timestamps already exist inline. |
| `<lcm-stub>` XML wrapping | REJECT | Existing `[LCM Tool Output:]` format works in live test. Novel format = unproven regression risk. |
| Empty-assistant collapsing | REJECT | "Empty" assistant turns contain `tool_use` blocks (required by Anthropic/OpenAI wire contract). Collapsing would break `tool_use ↔ tool_result` pairing. |
| Resolution markers | REJECT | No reliable signal for "work completed". False positives strictly worse than no marker. |

What's NOT stubbed

Fresh tail: last ~64 turns or 24,000 tokens — agent's working memory preserved verbatim
Assistant turns: never stubbed; the narrative of what was done is always intact
User turns: never stubbed
Tool messages without large_content: never stubbed (legacy / unmigrated rows untouched)
Tool messages whose runtime role degraded to assistant: never stubbed — phantom drilldown risk avoided
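The exclusion rules above can be read as a single eligibility predicate. The sketch below is one way to express them, assuming a simplified `ContextItem` shape; the field names and the `isStubEligible` helper are hypothetical and do not reproduce the actual `applyStubSubstitution()` code.

```typescript
// Hypothetical, simplified item shape; the real assembler types may differ.
interface ContextItem {
  role: "user" | "assistant" | "tool"; // runtime role after any degradation
  inFreshTail: boolean;                // inside the protected recent window
  largeContent: string | null;         // file_xxx id from messages.large_content, if migrated
}

// Returns true only when every "never stubbed" rule listed above is passed.
function isStubEligible(item: ContextItem, stubLargeToolPayloads: boolean): boolean {
  if (!stubLargeToolPayloads) return false; // flag off: behave like v4.1
  if (item.inFreshTail) return false;       // fresh tail is preserved verbatim
  if (item.role !== "tool") return false;   // user, assistant, and role-degraded rows
  return item.largeContent !== null;        // legacy / unmigrated rows stay untouched
}
```

Checking the runtime role rather than the stored role is what keeps a tool message that degraded to `assistant` out of the stub path in this sketch.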

Default off

Behind config.stubLargeToolPayloads (default false). With the flag off the new code paths don't run and assembly is byte-identical to v4.1.

Tests

1592/1592 pass (added 5 new tests in test/v42-stub-tier.test.ts):

emits stubs only for evictable externalized tool messages (boundary)
preserves tool_use ↔ tool_result pairing when stubbing
never stubs tool messages without externalized files (legacy rows)
preserves multi-block tool_result content shape (image + text)
drilldown round-trip: agent can recover the full payload via the file_xxx referenced in the stub

Plus the harness scripts (used by the Opus subagent test above):

scripts/v42-assemble-bench.mjs — token/item bench
scripts/v42-drilldown-harness.mjs — real-LLM drilldown test (OpenRouter, multi-mode prompts)
scripts/v42-dump-prompt.mjs — transcript dumper for sub-agent A/B testing
scripts/lcm-blob-migrate.mjs — idempotent, reversible blob migration

How to download and test

# Pull this PR's branch
gh pr checkout 626 --repo Martian-Engineering/lossless-claw

# Verify you're at the v4.2 head
git log --oneline -1
# expected: 85f922d docs(v4.2): decision record …

# Install deps (re-uses existing setup)
npm ci

# Run the full test suite — should be 1592/1592 pass
VOYAGE_API_KEY=$(cat ~/.openclaw/credentials/voyage-api-key) \
LCM_TEST_VEC0_PATH=$HOME/.openclaw/extensions/node_modules/sqlite-vec-darwin-arm64/vec0.dylib \
  npx vitest run --reporter=dot

# Build the dist tarball for deployment
npm run build && npm pack
# → produces martian-engineering-lossless-claw-<version>.tgz

# Snapshot your live LCM DB before testing migration
mkdir -p /tmp/v42-test
cp ~/.openclaw/lcm.db /tmp/v42-test/lcm-test.db

# Schema migration (idempotent, additive ALTER — runs automatically when
# the new plugin is loaded; can also be triggered by running the bench)
VOYAGE_API_KEY=... LCM_TEST_VEC0_PATH=... \
  npx tsx scripts/v42-assemble-bench.mjs --db /tmp/v42-test/lcm-test.db --variant baseline

# Dry-run the blob migration to see what would be touched
node scripts/lcm-blob-migrate.mjs --db /tmp/v42-test/lcm-test.db --threshold-bytes 8000 --dry-run

# Apply for real (stores file_xxx id in large_content; writes payload files
# to <storage-dir>; populates large_files row with tool_input disambiguator)
mkdir -p /tmp/v42-test/files
node scripts/lcm-blob-migrate.mjs --db /tmp/v42-test/lcm-test.db --threshold-bytes 8000 \
    --storage-dir /tmp/v42-test/files

# Bench the migrated DB with stubs ON
VOYAGE_API_KEY=... LCM_TEST_VEC0_PATH=... \
  npx tsx scripts/v42-assemble-bench.mjs \
    --db /tmp/v42-test/lcm-test.db --variant v42-stubs \
    --session-id <your-session-id>
# → should report estimatedTokens, contextItems, and stubStats (stubbedCount > 0)

To deploy in a real session and observe live drilldown behavior:

# 1. Stop gateway
launchctl unload -w ~/Library/LaunchAgents/ai.openclaw.gateway.plist

# 2. Replace plugin
EXT=~/.openclaw/extensions/lossless-claw
rm -rf "$EXT" && mkdir -p "$EXT"
tar -xzf martian-engineering-lossless-claw-*.tgz --strip-components=1 -C "$EXT"

# 3. Migrate the live DB (note: use --dry-run first to size the eligibility set)
node "$EXT/scripts/lcm-blob-migrate.mjs" --db ~/.openclaw/lcm.db --threshold-bytes 8000

# 4. Enable the flag
python3 -c "
import json, pathlib
p = pathlib.Path.home() / '.openclaw/openclaw.json'
cfg = json.loads(p.read_text())
plugins = cfg.setdefault('plugins', {})
lc = plugins.setdefault('lossless-claw', {})
lc['stubLargeToolPayloads'] = True
p.write_text(json.dumps(cfg, indent=2))
"

# 5. Restart gateway
launchctl load -w ~/Library/LaunchAgents/ai.openclaw.gateway.plist

# 6. Watch for stubs in assemble logs
tail -f ~/.openclaw/logs/gateway.log | grep -E 'stubStats|lcm_describe|LCM Tool Output'

Reversibility

| Step | How to reverse |
| --- | --- |
| Schema ALTER | Idempotent. Column can stay; readers ignore it. |
| Blob migration | `UPDATE messages SET large_content = NULL` |
| Storage files | `rm -rf <storage-dir>` |
| Flag enable | `stubLargeToolPayloads = false` + restart |

Test plan

Live runtime drilldown (post-merge, in a controlled session): enable flag, run a tool-heavy session, observe drilldown invocation rate. Pass: ≥70% (target 90%) on conversational queries that reference specific elided tool calls.
JOIN cost on a large DB: confirm getLargeFile() lookups in resolveMessageItem don't regress assembler latency on a 2.6GB DB
Reviewer audit of applyStubSubstitution for additional edge cases beyond what's already covered in tests
Reversibility drill: enable → migrate → disable → confirm assembly returns to baseline
Idle DB sizing: run lcm-blob-migrate.mjs --dry-run against an operator's DB to size the eligibility set

Files

| Layer | Source LOC (excl. comments) |
| --- | --- |
| Schema | ~10 |
| Store | ~10 |
| Assembler | ~75 |
| Engine | 5 |
| Tools | ~10 |
| Total source | ~110 |
| Tests | ~380 |
| Migration script | ~190 |
| Harnesses | ~600 |
Total v4.2 delta vs #613: 13 files, +2,139 / -8 LOC.

Commits (vs #613 head `536784c`)

847e232 assemble-bench scaffolding
6e1b857 Variant B (per-row large_content sidecar — initial design)
99611f2 Option C — route stubs through file_xxx (fixes P0 from adversarial review: messageId path was unresolvable)
b02e659 real-LLM drilldown harness
89be6c9 Option D — strengthen lcm_describe description with [LCM Tool Output:] mention
ce5561a harness: add --conversational / --realistic / --no-stubs modes
37ca40d Option F — tool_input disambiguator in exploration_summary
85f922d decision record (REJECT all four post-Opus mitigations)

First commit of the v4.1 omnibus implementation. Smallest possible slice: introduces the cross-process concurrency model module and the `lcm_worker_lock` table that enables a sidecar worker process for cold maintenance work (condensation, extraction, embedding backfill, theme consolidation, eval, profile rebuild).

Resolves v4.1.1 amendment A9 (`last_heartbeat_at` column required by §0.5 fallback rule: gateway can take over only when BOTH `expires_at < now` AND `last_heartbeat_at < now - 300s`).

Changes:
- src/concurrency/model.ts (NEW): single source of truth for §0 invariants, busy_timeout constants, worker job-kind catalogue, and defensive assertion helpers (assertForeignKeysEnabled, assertBusyTimeoutForRole). Documents the no-LLM-in-write-tx invariant and the worker_threads heartbeat requirement (v4.1.1 A9).
- src/db/migration.ts (+25 lines): new `ensureLcmWorkerLockTable` migration step. Idempotent CREATE TABLE IF NOT EXISTS, runs after FTS setup, before the BEGIN EXCLUSIVE COMMIT.
- test/concurrency-model.test.ts (NEW, 10 tests): verifies invariant ordering (worker timeout < gateway, TTL ≥ 3× heartbeat, fallback soak > TTL), job-kind catalogue, and assertion helpers.
- test/lcm-worker-lock.test.ts (NEW, 4 tests): verifies migration creates the table with the right columns (including A9's last_heartbeat_at), is idempotent, supports basic acquire/heartbeat, and supports stale-lock GC.

Verification:
- npm run build: passes
- npm test --run: 48 files / 872 tests passing (up from 858 baseline, +14 new tests, zero regressions)
- Live DB ground-truth check: ran the new DDL against a copy of /Users/lume/.openclaw/lcm.db (2.5GB, 762 conversations, 3771 leaf summaries). Migration succeeds; existing data untouched; acquire pattern works; PK conflict throws as expected.

Notes:
- Code-as-ground-truth pivot: per the v4.1.1 plan, each commit cites the amendment(s) it resolves and is verified against live data.
- v4.1.1 A6 finding (PRAGMA foreign_keys = OFF on Eva's CLI test) partially superseded: src/db/connection.ts:configureConnection() already sets it ON for every connection that goes through the standard path. The new assertForeignKeysEnabled() is a defensive guardrail for future code paths that bypass configureConnection.
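The §0.5 fallback rule cited in this commit (both the lease and the heartbeat must be stale before the gateway takes over) can be sketched as a pure predicate. The row shape, field names, and function name below are hypothetical; only the two conditions and the 300s window come from the commit message.

```typescript
// 300s fallback soak from the A9 rule, expressed in milliseconds.
const FALLBACK_SOAK_MS = 300 * 1000;

// Hypothetical in-memory view of an lcm_worker_lock row (timestamps as epoch ms).
interface WorkerLockRow {
  expiresAt: number;       // lease expiry
  lastHeartbeatAt: number; // last heartbeat written by the worker
}

// The gateway may take over only when BOTH conditions hold:
// the lease has expired AND the heartbeat is older than the soak window.
function gatewayMayTakeOver(lock: WorkerLockRow, nowMs: number): boolean {
  const leaseExpired = lock.expiresAt < nowMs;
  const heartbeatStale = lock.lastHeartbeatAt < nowMs - FALLBACK_SOAK_MS;
  return leaseExpired && heartbeatStale;
}
```

Requiring both signals means a worker whose lease lapsed but which is still heartbeating is not preempted in this sketch.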

…_feature_flags (A.02)

Resolves v4.1.1 amendments A2 (suppress_reason + superseded_by columns) and A8 (feature-flag storage). Adds the v3.1 columns the v4.1 spec depends on (session_key, suppressed_at, entity_index, contains_suppressed_leaves) since v3.1 never shipped to upstream.

Changes:
- src/db/migration.ts (+104 LOC):
  - ensureSummaryV41Columns(db): adds 7 columns to summaries via the existing PRAGMA table_info / ADD COLUMN pattern (matches ensureSummaryDepthColumn / ensureSummaryMetadataColumns / etc.):
    - session_key TEXT NOT NULL DEFAULT '' (v3.1 A1)
    - suppressed_at TEXT (v3.1 A3)
    - entity_index TEXT (v3.1 §7.2)
    - contains_suppressed_leaves INTEGER NOT NULL DEFAULT 0 (v3.1 A3)
    - suppress_reason TEXT (v4.1.1 A2)
    - superseded_by TEXT REFERENCES summaries ON DELETE SET NULL (v4.1.1 A2/A4)
    - leaf_summarizer_cap_was INTEGER (v4.1)
  - ensureMessageSuppressedAtColumn(db): adds messages.suppressed_at (v3.1 A3 cascade target for lcm_quote / lcm_factcheck filtering)
  - ensureLcmFeatureFlagsTable(db): clean new table `lcm_feature_flags(flag PK, value NOT NULL, updated_at NOT NULL)`
  - lcm_worker_lock TEXT PK explicitly NOT NULL (SQLite legacy quirk allows NULL in TEXT PK columns without it).
- test/v41-summaries-columns.test.ts (NEW, 12 tests):
  - Per-column verifications (NOT NULL, default value, FK target/action)
  - lcm_feature_flags schema + basic set/read pattern
  - Legacy `lcm_migration_flags` coexistence verified

Verification:
- npm run build: passes
- npm test --run: 49 files / 884 tests passing (+12 from A.01's 872, 0 regressions)
- Live DB ground-truth check on copy of /Users/lume/.openclaw/lcm.db:
  - summaries 14 → 21 columns; 7 v4.1 cols added.
  - messages gains suppressed_at; 3774 leaves preserved.
  - lcm_worker_lock + lcm_feature_flags created.
  - Eva's legacy lcm_rollups* + lcm_migration_flags untouched.
  - 4187 summaries now have session_key='' (A.08 backfill target).

Code-as-ground-truth findings (revising v4.1.1 spec):
1. v4.1.1 A8 originally said "extend lcm_migration_flags with value column." That table doesn't exist in upstream src/; it only exists on Eva's live DB from old fork-side code. Replaced with a clean new `lcm_feature_flags` table. Eva's legacy table stays alongside, untouched.
2. v4.1.1 A6 (PRAGMA foreign_keys = OFF) is partly misleading: the codebase's src/db/connection.ts:configureConnection() already sets foreign_keys = ON for every connection through the standard path. Eva's earlier sqlite3 CLI test was using a different connection, not the production path. The new src/concurrency/model.ts already provides assertForeignKeysEnabled() as a defensive guardrail.
3. SQLite TEXT PRIMARY KEY columns do NOT auto-enforce NOT NULL (legacy behavior). Both new tables (lcm_worker_lock, lcm_feature_flags) now have explicit NOT NULL on their PK column. Caught by tests.
4. SQLite ADD COLUMN with REFERENCES requires NULL default: verified `superseded_by TEXT REFERENCES summaries(summary_id) ON DELETE SET NULL` works as ALTER TABLE ADD COLUMN (no NOT NULL allowed). Documented in ensureSummaryV41Columns docstring.

… + audit (A.03)

Adds the four "support tables" the worker process and operator surface need before the heavy schema (synthesis cache, embeddings, entities, themes) lands. Each is a clean idempotent CREATE TABLE IF NOT EXISTS.

Resolves v4.1.1:
- A3 (`lcm_extraction_queue`): gateway atomically inserts a queue row with every leaf write; worker drains it for entity coreference and procedure-recheck. CHECK constraint on `kind` ('entity' | 'procedure-recheck'). Indexes on pending (queued_at WHERE picked_at IS NULL) and dead-letter (attempts >= 5).
- B2, partial (`lcm_purge_rebuild_queue`): persistent rebuild queue for `lcm_purge --immediate`. T1 fires suppression cascade + enqueues; worker drains using A4 forwarder pattern. Indexes on pending + purge_session_id.
- B3, partial (`lcm_voyage_rate_state`): cross-process rate-limit budget for Voyage embed + rerank. SQLite serializes BEGIN IMMEDIATE naturally so gateway + worker coordinate via this shared row. CHECK constraint on bucket ('embed' | 'rerank'). Seeded with both rows idempotently (`INSERT OR IGNORE`). Spec note: HTTP call MUST happen AFTER the COMMIT; wrapping HTTP in BEGIN IMMEDIATE would serialize every gateway query embed and add 200-2000ms latency.
- §C item (`lcm_session_key_audit`): reversibility log for §2.1 step 1 re-key of 5 legacy convs. Allows operator `/lcm undo-session-key-rekey <conv_id>` if the spike's identification was wrong for any of those convs.

Changes:
- src/db/migration.ts (+90 LOC): four `runMigrationStep` blocks added inline after the v3.1+v4.1 column work from A.02
- test/v41-support-tables.test.ts (NEW, 9 tests): per-table schema verification (columns, FKs, indexes, CHECK constraints), CHECK rejection paths, idempotent re-run verification, brief-tx update pattern verification for rate state

Verification:
- npm test --run: 50 files / 893 tests passing (+9 from A.02's 884, zero regressions)
- Live DB ground-truth check on copy of /Users/lume/.openclaw/lcm.db:
  - PRE lcm_ tables: 5 (legacy lcm_migration_flags + lcm_migration_state + 3 lcm_rollups* from Eva's fork)
  - POST lcm_ tables: 9 (5 legacy preserved + 4 new)
  - voyage rate state seeded with embed + rerank rows
  - 3774 leaves preserved, 762 conversations preserved
  - Eva's lcm_rollups* untouched (out-of-scope for v4.1; v4.1 replaces its functionality via lcm_synthesis_cache landing in A.04)

Notes:
- All four FKs use the production summaries / conversations tables; CASCADE on DELETE is the right semantics (queue/audit rows are derived; if their parent is genuinely deleted, they should follow).
- Per v4.1.1 A6 (now confirmed code-side): connection.ts already enforces foreign_keys = ON, so these CASCADEs work in production.

… cache_leaf_refs + synthesis_audit (A.04)

Adds the four-table synthesis layer per v4.1 §3 + §1.3 + v4.1.1 B1/B4. Tables created in dependency order so FKs work on first run: prompt_registry → synthesis_cache (FK on prompt_id) → cache_leaf_refs (FK on cache_id) → synthesis_audit (FK on prompt_id + either summary_id or cache_id).

Resolves v4.1.1:
- B1 (`lcm_synthesis_audit` schema): pass_output is NULLable (insert with NULL before LLM call, UPDATE on return). Adds `status` column ('started' | 'completed' | 'failed') for orphan-row tracking. Started-GC index supports the 1-hour orphan cleanup query.
- B4: UNIQUE lookup index on `lcm_synthesis_cache` enables cross-process single-flight via INSERT OR IGNORE pattern (loser of race reads back in-flight row, polls for status='ready').
- §3 + §1.3: prompt registry with versioning per (memory_type, tier_label, pass_kind, version) tuple. Append-only; bundle_version groups prompt sets for synchronized voice-consistency rebuild.
- §3: synthesis cache with status='building' single-flight, prompt_id FK enables prompt-selective invalidation (NEVER touches durable summaries.content rows; closes v3 design principle 4 violation that v4 had introduced).
- v3.1 A3 extension: cache_leaf_refs inverse index for proactive purge on lcm_suppress (cascades both directions: ref deleted when either cache_id OR leaf_summary_id parent is deleted).

Changes:
- src/db/migration.ts (+150 LOC): four runMigrationStep blocks, all idempotent, all in dependency order.
- test/v41-synthesis-tables.test.ts (NEW, 14 tests):
  - prompt_registry: CHECK constraint enforcement (memory_type, pass_kind), UNIQUE constraint on (memory_type, tier_label, pass_kind, version)
  - synthesis_cache: status + tier_label CHECK enforcement, INSERT OR IGNORE single-flight pattern (ON CONFLICT DO NOTHING)
  - cache_leaf_refs: bidirectional CASCADE behavior verified
  - synthesis_audit: pass_output NULLable, started→completed pattern, CHECK requiring at least one target column, started-GC index exists

Verification:
- npm test --run: 51 files / 907 tests passing (+14 from A.03's 893, zero regressions)
- Live DB ground-truth check on copy of /Users/lume/.openclaw/lcm.db:
  - PRE: 5 lcm_ tables (legacy)
  - POST A.01-A.04 cumulative: 15 lcm_ tables = 5 legacy preserved + 10 new (worker_lock, feature_flags, extraction_queue, purge_rebuild_queue, voyage_rate_state, session_key_audit, prompt_registry, synthesis_cache, cache_leaf_refs, synthesis_audit)
  - 3774 leaves preserved, 762 conversations preserved. PRAGMA foreign_keys=1.

Notes:
- DB copies for end-to-end verification moved to /Volumes/LEXAR/lcm-tmp (the live DB is 2.5GB; /tmp filled up after a few iterations).
- B4 UNIQUE index uses COALESCE(grep_filter, '') so SQLite can index the expression deterministically (NULL-grep_filter rows would otherwise not be uniquely-indexed since NULL ≠ NULL in SQL semantics).

… (A.05)

Per v4.1 §11 + v4.1.1 (revising v4 design):
- N≥100 stratified queries (50% fts-easy, 25% fts-medium, 25% paraphrastic).
- 2× empirical SD threshold (calibrate by 5x repeated baseline runs).
- Ensemble judge (3 different model families).
- Mixed absolute+pairwise scoring per dimension.
- Drift index for cumulative regression.
- Measures BOTH retrieval_recall AND synthesis_quality (separate metrics per v4.1.1; closes the v4 gap where eval collapsed them).

Tables (dependency order):
- lcm_eval_query_set: query set registry (e.g. 'eva-baseline-v2')
- lcm_eval_query: per-query rows with stratum CHECK constraint, optional reference_summary for gold-standard comparison, must_not_regress flag for critical Eva queries
- lcm_eval_run: per-run rows with separate retrieval_recall_score AND synthesis_quality_score, ensemble judge_models JSON, noise_floor_sd for drift calibration, trigger CHECK constraint
- lcm_eval_drift: cumulative-delta drift index per query_set

All cascade via FK on query_set_id deletion.

Verified:
- 52 files / 915 tests passing (+8 from A.04, zero regressions)
- Live DB copy: 15 → 19 lcm_ tables. 3774 leaves preserved.
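The "2× empirical SD threshold" calibrated from repeated baseline runs can be illustrated with a small calculation. This is only a sketch under stated assumptions: scores are higher-is-better, the noise floor is the sample standard deviation (n − 1 denominator) of the baseline runs, and a candidate is compared against the baseline mean. None of these choices, nor the function names, are confirmed by the commit.

```typescript
// Arithmetic mean of a non-empty list of scores.
function meanScore(xs: number[]): number {
  return xs.reduce((acc, x) => acc + x, 0) / xs.length;
}

// Sample standard deviation (n - 1 denominator); assumes at least two runs.
function sampleSd(xs: number[]): number {
  const m = meanScore(xs);
  const sumSq = xs.reduce((acc, x) => acc + (x - m) * (x - m), 0);
  return Math.sqrt(sumSq / (xs.length - 1));
}

// Flag a regression only when the candidate falls more than k noise-floor SDs
// below the baseline mean (k = 2 mirrors the "2x empirical SD" threshold).
function isRegression(baselineRuns: number[], candidate: number, k: number = 2): boolean {
  return candidate < meanScore(baselineRuns) - k * sampleSd(baselineRuns);
}

// Example: five repeated baseline runs of the same query set.
const baselineRuns = [0.80, 0.82, 0.78, 0.81, 0.79];
const noiseFloorSd = sampleSd(baselineRuns); // candidate value for a noise_floor_sd column
```

With a threshold derived this way, run-to-run judge noise smaller than the measured spread does not trip the regression check.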

…ions + procedures + intentions (A.06)

Per v4.1 §7 + v4.1.1 B5/B6/B7/B8/B11. Five tables for the extraction layer (entity coreference + procedures + intentions tracking).

Tables (all idempotent, dependency-ordered):
- lcm_entity_type_registry: freeform entity_type catalogue (Eva domain has session_key, config_flag, R-XXX agent IDs, error_code, etc.; no closed CHECK enum, per v4.1.1 §C).
- lcm_entities: simplified schema (no separate aliases table per v4.1.1 B5; alternate surface forms denormalized into JSON column). UNIQUE index (session_key, canonical_text COLLATE NOCASE) enables case-insensitive cross-process single-flight (B4 pattern). FK to summaries(first_seen_in_summary_id) ON DELETE SET NULL.
- lcm_entity_mentions: tracks each mention site. CASCADE on both entity_id and summary_id deletion (basis for v4.1.1 §C suppression cascade: when leaf gets suppressed, mentions cascade-delete).
- lcm_procedures: status lifecycle ('draft'|'active'|'stale'|'archived'|'deprecated'); extraction_source distinguishes auto (clustering pipeline) from 'manual' (lcm_remember_procedure tool, v4.1.1 B8 fix for one-shot procedures).
- lcm_intentions: 3 statuses ('pending'|'fulfilled'|'cancelled' per B11); resolution_text + resolved_at columns for capture context. source_leaf_id is NULL-allowed since ON DELETE SET NULL requires it.

Verified:
- 53 files / 929 tests passing (+14 from A.05, zero regressions)
- All 5 tables created, FK + CHECK constraints enforced.

….07)

Per v4.1 §1 + v4.1.1 A5/A7. The MANAGED tables only; vec0 virtual table itself defers to Group B (requires sqlite-vec extension load, best-effort per A7's two-transaction pattern).
- lcm_embedding_profile: model registry (model_name PK, dim, active flag, archive_after for graceful retirement). Group B startup seeds voyage-4-large after successful sqlite-vec load.
- lcm_embedding_meta: sidecar with composite PK (embedded_id, embedded_kind, embedding_model) enabling parallel rows during model-bump cutover. CHECK on embedded_kind ('summary' | 'entity' | 'theme'). FK to lcm_embedding_profile prevents orphan model refs. No FK on embedded_id (polymorphic per v4.1.1 §C item); orphan cleanup via idle pass in Group B.

Verified:
- 54 files / 934 tests passing (+5 from A.06, zero regressions)

…4.1 read patterns (A.08)

Per v4.1: adds 5 partial/composite indexes that the new retrieval + suppression + idle-rebuild paths need. All CREATE INDEX IF NOT EXISTS, all idempotent, all conditional on the v4.1 columns added by A.02.

Indexes:
- summaries_session_key_kind_latest_idx: cross-conv assemble + retrieval scope filter. Partial WHERE session_key != '' (skips pre-A.09 backfill rows so the index stays compact during the cleanup window).
- summaries_suppressed_idx: WHERE suppressed_at IS NOT NULL; small footprint partial index for the suppression filter on every retrieval.
- summaries_contains_suppressed_idx: WHERE contains_suppressed_leaves = 1 AND superseded_by IS NULL; §8.1 idle-rebuild candidate scan.
- messages_suppressed_idx: WHERE suppressed_at IS NOT NULL; for lcm_quote / lcm_factcheck filtering.
- conversations_session_key_v41_idx: WHERE session_key IS NOT NULL; boosts the cross-conv JOIN path that legacy:conv_<id> session_keys use (existing conversations_session_key_active_created_idx is on the active flag too, which legacy convs don't satisfy).

Verified:
- 55 files / 942 tests passing (+7 from A.07, zero regressions)

…lowup) The optimizer picks full table scan for tiny test datasets (3 rows), not the new index — that's the right query plan for that data size, just not what the test asserted. Index PRESENCE verification (the other 6 tests in this file) covers what unit tests can; index USE in production data shape is verified by A.09's live-DB run-script.

…JOIN backfill (A.09)

Per v4.1 §2.1 (universal cleanup; per-user re-keying like Eva's 5-legacy-convs → agent:main:main is OPERATOR-DRIVEN via Group F's `/lcm reconcile-session-keys`, NOT hardcoded into upstream migration).

Three idempotent migration steps:
1. backfillConversationSessionKeys: every NULL conversations.session_key gets backfilled to 'legacy:conv_<id>'. Each re-key writes a row to lcm_session_key_audit (deterministic audit_id derived from conv_id ensures idempotent re-runs don't duplicate audit rows). Closes v4.1.1 A5 (NULL collapse to empty bucket would destroy cross-conv identity for legacy data).
2. backfillSummarySessionKeys: every summary still at the A.02 default session_key='' gets backfilled from the parent conversation via JOIN. After step 1 ran, conversations.session_key is non-NULL for all rows. Idempotent: condition is WHERE session_key = '' so already-set rows are preserved.
3. backfillForkRollupsSessionKeys: forward-compat for Eva's fork-side lcm_rollups table (created by PR Martian-Engineering#516, not in upstream src). Only touches the table if it exists AND has session_key column. No-op on fresh upstream installs.

Verified on copy of Eva's live DB (/Volumes/LEXAR/lcm-tmp/lcm-test.db):
- PRE: 762 convs, 522 NULL session_keys, 4 agent:main:main, 0 legacy:
- POST: 762 convs, 0 NULL, 4 agent:main:main preserved, 522 legacy:conv_*
- 4187 summary session_key backfills (all summaries now keyed)
- 522 audit rows recorded
- 5 legacy convs identified as having leaves (target for Eva's future `/lcm reconcile-session-keys` to merge into agent:main:main)
- 56 files / 947 tests passing (+6 from A.08, zero regressions)
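The idempotency argument for step 1 can be sketched with an in-memory stand-in for the two tables. Everything here is illustrative: the record shapes, the `backfillSessionKeys` name, and the `session-key-backfill:<id>` audit id format are hypothetical. Only the `legacy:conv_<id>` key format and the idea of a deterministic audit id derived from the conversation id come from the commit.

```typescript
// Hypothetical in-memory stand-ins for conversations and lcm_session_key_audit rows.
interface ConversationRow { id: number; sessionKey: string | null; }
interface AuditRow { auditId: string; convId: number; newKey: string; }

// Backfill NULL session keys; returns how many conversations were re-keyed.
function backfillSessionKeys(convs: ConversationRow[], audit: Map<string, AuditRow>): number {
  let changed = 0;
  for (const conv of convs) {
    if (conv.sessionKey !== null) continue;          // already keyed: leave untouched
    const newKey = `legacy:conv_${conv.id}`;         // key format from the commit
    const auditId = `session-key-backfill:${conv.id}`; // hypothetical deterministic id
    conv.sessionKey = newKey;
    if (!audit.has(auditId)) {
      // Keyed insert: a repeated run cannot create a second audit row.
      audit.set(auditId, { auditId, convId: conv.id, newKey });
    }
    changed += 1;
  }
  return changed;
}
```

Running the function twice over the same data changes nothing on the second pass, which is the property the migration relies on.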

… (A.10)

Per v4.1 §2.2: fixes the leaf-summarizer cap bug. The empirical-spike-agent found 543 leaves on Eva's live DB pegged at exactly 2,415 tokens (the LLM hitting the old 2400 default and producing artificially-truncated summaries). This commit raises the default in two places that share the constant:
- src/summarize.ts:50 DEFAULT_LEAF_TARGET_TOKENS: 2400 → 4000
- src/db/config.ts:464 fallback default for pc.leafTargetTokens: 2400 → 4000

Comment added to both locations citing the empirical finding so future readers see the rationale.

Voyage embedding (Group B) supports 32K input context, so 4000-token leaves are well within budget. Average leaf on Eva's corpus is 1,167 tokens (most leaves don't approach the cap); the change only affects leaves where the source content is dense enough to need it.

Existing 543 capped leaves on Eva's DB stay as-is: regenerating them from source messages is expensive (LLM calls) and is operator-driven, not a migration step. Leaves are immutable per v3 design principle 4.

Tests:
- test/v41-leaf-cap.test.ts (NEW, 3 tests): verifies new constant + rationale comment present
- test/config.test.ts: updated existing assertion 2400 → 4000

950/950 tests passing.

Raw fetch wrapper for Voyage AI. We do NOT use the voyageai npm SDK: v0.2.1 has an ESM resolution bug confirmed during Phase A spike (see docs/projects/lcm-rollup-overhaul/voyage-spike-results.md).

Two entry points: embedTexts() and rerankCandidates(). Both:
- Send `truncation: false` so over-cap docs are surfaced as 400 errors rather than silently clipped (lossless invariant: a truncated embedding produces a vector that doesn't reflect the source, with no signal in the vector itself that anything was dropped).
- Throw typed VoyageError on every failure mode (auth/bad_request/rate_limit/server_error/network/unexpected) so callers can react appropriately. Backfill cron will use `kind` to decide whether to park, requeue, or surface to operator.
- Retry on 5xx + network errors with exponential backoff (capped 30s). NOT on 4xx (caller bug; retrying just spends quota).
- Honor Retry-After header on 429 (seconds OR HTTP-date).
- Support mock fetch injection for tests: no module-level state, no globals, no live API calls in CI.

Token budget constants exported for callers:
- MAX_TOKENS_PER_EMBED_BATCH = 80K (Voyage caps at 120K, tokenizer counts ~9.5% higher than our token_count, so 80K leaves margin).
- MAX_TOKENS_PER_EMBED_DOC = 30K (voyage-4-large per-doc cap is 32K).
- MAX_TOKENS_PER_RERANK_CALL = 600K (rerank-2.5 per-call total).

Privacy: error messages strip Voyage-echoed input from 400 responses (some Voyage 400s include the input verbatim, which could leak PII to logs that aren't supposed to see it). Raw responseBody preserved on the VoyageError for callers that need it.

Coverage: 22 tests, all mock fetch:
- embed happy path (input_type, ordering, empty input, truncation flag)
- rerank happy path (top_k, sorting, id join)
- all 6 error kinds + retry behavior
- VOYAGE_API_KEY env var resolution

Resolves: foundation for v4.1 §13 (embedding generation + reranking).

Next (B.02): per-model vec0 table creation.
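Two of the retry rules described in this commit lend themselves to small pure helpers: the capped exponential backoff and the `Retry-After` header that may carry either seconds or an HTTP-date. The sketch below is an assumption-laden illustration; the helper names and the 500 ms base delay are hypothetical, and only the 30s cap and the two header forms come from the commit message.

```typescript
// Backoff cap from the commit message (30s); the base delay is an assumed value.
const MAX_BACKOFF_MS = 30 * 1000;
const ASSUMED_BASE_DELAY_MS = 500;

// Exponential backoff: base * 2^attempt, never exceeding the cap.
function backoffDelayMs(attempt: number, baseMs: number = ASSUMED_BASE_DELAY_MS): number {
  return Math.min(MAX_BACKOFF_MS, baseMs * Math.pow(2, attempt));
}

// Parse a Retry-After header value into a wait time in milliseconds.
// Accepts delta-seconds ("7") or an HTTP-date; returns null when absent or unparseable.
function parseRetryAfterMs(header: string | null, nowMs: number): number | null {
  if (header === null) return null;
  const trimmed = header.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, dateMs - nowMs); // a date in the past means no extra wait
}
```

Passing the current time in as `nowMs` keeps the parser deterministic, which fits the commit's stated goal of tests with no globals or live calls.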

…(B.02) Centralizes all sqlite-vec interaction in src/embeddings/store.ts. Callers never touch vec0 SQL directly. Reasons documented in module header, but short version:

1. sqlite-vec is best-effort. tryLoadSqliteVec() searches candidate paths (env, plugin node_modules, ~/.openclaw/extensions) and returns boolean. If false, the rest of LCM still works (FTS-only retrieval). Aligned with v4.1.1 A7 graceful-degrade amendment.

2. vec0 has class-of-column quirks that bite: INTEGER metadata cols reject JS number literals (need BigInt at the binding site), and auxiliary cols throw "illegal WHERE constraint" if filtered inside MATCH queries. Schema choice:

     embedding      float[<dim>]  -- the vector
     +embedded_id   text          -- AUX (never WHERE-filtered)
     embedded_kind  text          -- METADATA (filterable in MATCH)
     suppressed     integer       -- METADATA (filterable in MATCH)

   Empirically verified: WHERE on +embedded_kind crashes vec0; WHERE on plain `embedded_kind text` (metadata) works. Centralizing this here so future code can't accidentally pick the wrong column class.

3. Profile dim is immutable. registerEmbeddingProfile() throws on mismatch. To switch dim, bump the model name (e.g. add a suffix) and run cutover; never silently change dim of an existing profile.

API surface:
- tryLoadSqliteVec(db, opts) → boolean
- vec0Version(db) → "v0.1.9" | null
- candidateVec0Paths() → string[] (for diagnostics)
- embeddingsTableName(modelName) → "lcm_embeddings_<slug>"
- embeddingsTableExists(db, modelName) → boolean
- registerEmbeddingProfile(db, modelName, dim)
- ensureEmbeddingsTable(db, modelName, dim)
- recordEmbedding(db, {modelName, embeddedId, embeddedKind, vector, suppressed?, sourceTokenCount}): vec0 INSERT + meta UPSERT
- replaceEmbedding(...): DELETE-then-INSERT (for re-embed)
- deleteEmbedding(...): for purge cascade
- markEmbeddingSuppressed(...): UPDATE metadata (works on metadata cols; would corrupt if used on PARTITION KEY per v4.1.1 finding)
- searchSimilar(db, {modelName, queryVector, k, embeddedKinds, excludeSuppressed}): KNN with default exclude-suppressed
- isEmbedded(db, {embeddedId, embeddedKind, modelName}) → boolean

Coverage: 28 tests
- 15 always-on: name validation, candidate paths, graceful degrade, profile registration with dim mismatch / bad-input rejection
- 13 vec0-gated: load extension, ensure table, record/replace/delete embedding, KNN with kind filter, KNN with suppression, mark suppressed flips visibility, two independent models per DB

The vec0-gated suite uses LCM_TEST_VEC0_PATH env var override (or defaults to /Users/lume/.openclaw/... on dev). vitest.config.ts overrides $HOME so homedir() inside tests doesn't see the dev install; this gate accommodates that.

Build: dist/index.js = 708.4kb (was 708.4kb pre-B.02: empty plugin import boundary, store module is tree-shaken from index.ts which doesn't import it yet; gateway picks up via Group B.05 leaf-time embed wire-up).

Tests: 1000 passing (was 972 before B.02; +28 new).

Resolves: foundation for v4.1 §13 (vec0 storage layer).

Next (B.03): AFTER DELETE TRIGGER on summaries → cascades suppression + deletion into vec0 (since FK from vec0 → summaries corrupts vec0).

…B.03) Three new SQLite triggers, each with a specific job: 1. Per-model `lcm_embed_suppress_<slug>` (in src/embeddings/store.ts): AFTER UPDATE OF suppressed_at ON summaries WHEN (NEW.suppressed_at IS NULL) != (OLD.suppressed_at IS NULL) → mirrors the NULL-vs-not transition into vec0.suppressed metadata column for the corresponding embedded_id (kind='summary'). Why a trigger: suppression can be set from any path — operator's /lcm purge, agent tool, manual SQL, future migration cleanup. A trigger guarantees the cascade by-DB rather than by-convention. Why metadata col + WHEN clause: the trigger fires only on actual transitions, not on every other UPDATE; vec0 metadata column is pre-filterable in KNN MATCH queries (auxiliary cols throw "illegal WHERE constraint" — verified empirically). 2. Per-model `lcm_embed_delete_<slug>` (in src/embeddings/store.ts): AFTER DELETE ON summaries → DELETE matching vec0 row. Why a trigger and not FK CASCADE: vec0 corrupts under FK (v4.1.1 finding from upstream review). Trigger is the only safe path to keep vec0 + summaries in sync on hard-delete. 3. Shared `lcm_embedding_meta_cleanup_summary` (in src/db/migration.ts): AFTER DELETE ON summaries → DELETE matching lcm_embedding_meta row WHERE kind='summary'. Why this is in migration not store: lcm_embedding_meta exists once regardless of how many vec0 model tables exist (it's a cross-model sidecar). The kind='summary' filter prevents accidental cleanup of polymorphic entity/theme rows. Entity/theme cleanup triggers will land in Groups E/G when those embeddings ship. Per-model triggers are created idempotently when ensureEmbeddingsTable is called for a model. dropEmbeddingsTriggers() is exported for the model-archival cutover path (Group F operator surface). 
Coverage: 9 new tests (3 always-on, 6 vec0-gated): - meta-table cleanup trigger only deletes kind='summary' (entity row untouched) - meta cleanup trigger is idempotent across re-migration - suppression cascade NULL → not-NULL hides row from KNN - un-suppression cascade not-NULL → NULL restores visibility - WHEN clause skips no-op transitions (NULL → NULL, or content updates) - delete cascade removes vec0 row + meta row - two-model setup: cleanup hits both vec0 tables - dropEmbeddingsTriggers stops cascade firing - re-creating triggers is idempotent Live-DB verification: copied Eva's lcm.db (4187 summaries, 762 conversations) to /Volumes/LEXAR; migration completes in 3.9s; meta cleanup trigger created cleanly. Tests: 1009 passing (was 1000 before B.03; +9 new). Resolves: v4.1 §10 suppression cascade for vec0 retrieval surfaces. Next (B.fix): fold Group A adversarial-pass fixes (Gap 2 NULL UNIQUE on lcm_prompt_registry; Gap 7 wire concurrency assertions; Gap 9 add live-DB regression test).

Resolves Gaps 2, 7, 9 from the Group A adversarial code review:

Gap 2 (MED): lcm_prompt_registry NULL tier_label deduplication. SQLite treats multiple NULL values as distinct in UNIQUE constraints, so the original UNIQUE(memory_type, tier_label, pass_kind, version) admits duplicate rows when tier_label IS NULL. The synthesis spec requires singletons-per-version, so add a follow-up migration step (ensureLcmPromptRegistryNullSafeUniqueIdx) that creates a COALESCE-based UNIQUE INDEX. Same pattern is already used for lcm_synthesis_cache_lookup_uniq. The original UNIQUE constraint stays (catches non-NULL collisions); the new index catches NULL collisions.

Gap 7 (LOW): wire assertForeignKeysEnabled into configureConnection. src/concurrency/model.ts already exports assertForeignKeysEnabled(db) but nothing in production calls it. Add a call after the existing PRAGMA foreign_keys = ON in src/db/connection.ts:configureConnection so any future regression that opens a connection without FK enforcement (which would silently degrade every ON DELETE CASCADE in the schema) fails fast. assertBusyTimeoutForRole wiring is intentionally deferred to Group B.05 (worker startup) per the Group A reviewer's recommendation.

Gap 9 (MED): live-DB-shape regression test. All other v41-*.test.ts files start from a fresh :memory: and run the full migration on an empty DB. None tested the migration against a partially pre-existing schema (where conversations / summaries / messages already exist with rows but lcm_* tables don't yet). The Eva-live-DB verification was one-off and not in CI. New test v41-pre-existing-schema-migration.test.ts seeds the upstream pre-v4.1 baseline shape, inserts conversations + summaries + messages, runs runLcmMigrations, and verifies: NULL session_keys are backfilled, audit rows exist, summaries.session_key is JOIN-backfilled, all 21 v4.1 tables exist, the new lcm_prompt_registry_uniq_lookup index exists, and re-runs are idempotent.

Helper module on top of A.01's lcm_worker_lock table. Acquisition is atomic via PRIMARY KEY uniqueness on (job_kind): INSERT OR IGNORE returns 1 if we got it, 0 if someone else holds it.

API:
- acquireLock(db, jobKind, {workerId, ttlMs?, jobSessionKey?, jobMetadata?}) → boolean. GCs expired locks BEFORE acquiring (≤ datetime('now'), so ttl=0 is immediately reclaimable; race-safe via INSERT OR IGNORE).
- releaseLock(db, jobKind, workerId) → boolean. Only frees if the workerId matches (prevents accidental cross-worker release).
- heartbeatLock(db, jobKind, workerId, ttlMs?) → boolean. Updates expires_at + last_heartbeat_at. Returns false if the lock was preempted (caller MUST abort to avoid double-processing).
- lockInfo(db, jobKind) → LockInfo | null. Used by /lcm health.
- generateWorkerId(role) → string. Format `<role>-<pid>-<ms>-<6hex>`.

Used by Group B.04 backfill cron (next commit) and Groups E (extraction) + G (themes consolidation) + worker scaffolding (B.05).

Coverage: 13 tests (single-process acquire/release, TTL+GC behavior, heartbeat semantics including preemption-detection, metadata round-trip, multi-kind isolation, generateWorkerId uniqueness).

Tests: 1017 → 1030 (+13).

Resolves: §0 cross-process lock primitive used by all worker jobs.

Next (B.04b): backfill cron module that uses these primitives.
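The acquire / release / heartbeat semantics described above can be modeled in memory as a rough sketch. The real module works against the lcm_worker_lock SQLite table; here a Map stands in for that table, the class name `LockTableModel` is hypothetical, and the clock is passed in explicitly so the TTL behavior is easy to follow.

```typescript
// In-memory model of the lock semantics; an illustration only, not the PR's SQLite code.
interface LockRow {
  workerId: string;
  expiresAt: number; // epoch milliseconds
}

class LockTableModel {
  private rows = new Map<string, LockRow>();

  /** Returns true if this worker now holds the lock for jobKind. */
  acquireLock(jobKind: string, workerId: string, ttlMs: number, now: number): boolean {
    const existing = this.rows.get(jobKind);
    // GC an expired lock first (expiresAt <= now, so ttl=0 is immediately reclaimable).
    if (existing !== undefined && existing.expiresAt <= now) this.rows.delete(jobKind);
    // Stand-in for INSERT OR IGNORE on the job_kind primary key.
    if (this.rows.has(jobKind)) return false;
    this.rows.set(jobKind, { workerId, expiresAt: now + ttlMs });
    return true;
  }

  /** Only the holder can release; a mismatched workerId is a no-op. */
  releaseLock(jobKind: string, workerId: string): boolean {
    const row = this.rows.get(jobKind);
    if (row === undefined || row.workerId !== workerId) return false;
    this.rows.delete(jobKind);
    return true;
  }

  /** Extends the TTL; false means the lock was preempted and the caller must abort. */
  heartbeatLock(jobKind: string, workerId: string, ttlMs: number, now: number): boolean {
    const row = this.rows.get(jobKind);
    if (row === undefined || row.workerId !== workerId) return false;
    row.expiresAt = now + ttlMs;
    return true;
  }
}
```

The point of the model is the preemption case: once worker "a" lets its TTL lapse and worker "b" acquires, a late heartbeat from "a" returns false rather than silently extending a lock it no longer owns.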

…(E.spike) Wraps ml-hclust (mljs ecosystem) for use by Group E procedure clustering. Library choice rationale (full notes in module header): - ESM-native (this plugin ships ESM only) - MIT licensed, actively maintained (v4.0.0 published 2025-11-26) - Small footprint (~48KB unpacked); esbuild tree-shakes most transitive deps. Bundle delta: 708.7kb → 709.4kb (+0.7KB; index.ts doesn't import yet — Group E will pull it in) - Accepts precomputed distance matrix (we pass cosine distance), so we can do Ward+cosine without hacking the lib's internal euclidean - Cluster.cut(height) AND Cluster.group(K) both supported, satisfying both "let dendrogram decide" and "force K" use cases Architecture choice notes: - Ward + cosine on precomputed matrix: same approximation scipy gives you (linkage(method="ward", metric="cosine")). Mathematically loose (Ward assumes squared Euclidean) but conventional for text embeddings. Fallback method: "average" (UPGMA) — no Euclidean assumption — if empirical eval shows wonky merges. - Pre-normalize each vector once → cosine distance becomes (1 - dot). Halves the inner-loop cost and centralizes float-drift clamping. - O(N^2 D) distance build + O(N^3) agnes. For N=2000 D=1024 that's ~few seconds in JS — comfortably within the worker-process budget. Alternatives considered + rejected: - hierarchical-clustering-js: 404 on npm - density-clustering: wrong algorithm family (DBSCAN/k-means only) - clusterfck: deprecated - clustering-js: abandoned API: - clusterHierarchical({vectors, cutHeight?, numClusters?}) → ClusterResult Coverage: 11 tests - empty input, single vector, identical vectors, separable groups - force-K mode, mixed-dim rejection, non-Float32Array rejection, cutHeight validation, internal coverage check - 100-vector perf sanity (<2s) Built (subagent: a1e8a944580405a69) — research + library survey done in parallel with Group B.04 work; spec checked + tests verified before committing. Tests: 1030 → 1041 (+11). 
Resolves: foundation for Group E procedure clustering. Group E will: (1) pre-filter leaves (structural — numbered steps / commands / explicit "how to" markers, NOT FTS verb regex) (2) call clusterHierarchical() over voyage-4-large embeddings (3) filter to clusters with ≥8 members + LLM-judge confidence > 0.9 (4) write to lcm_procedures with status='active'
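The "pre-normalize once, then cosine distance is 1 - dot" step from the architecture notes can be sketched as below. This is an illustration under assumptions: the function names are hypothetical, zero vectors are left as zeros (the commit does not say how they are handled), and the resulting matrix is only the precomputed input that a hierarchical clustering library would consume, not the clustering itself.

```typescript
// Hypothetical sketch of the distance-matrix build; not the PR's module.

/** Scale a vector to unit length. A zero vector is returned as all zeros (assumed behavior). */
function normalizeVector(v: Float32Array): Float64Array {
  let sumSq = 0;
  for (let i = 0; i < v.length; i++) sumSq += v[i] * v[i];
  const norm = Math.sqrt(sumSq);
  const out = new Float64Array(v.length);
  if (norm === 0) return out;
  for (let i = 0; i < v.length; i++) out[i] = v[i] / norm;
  return out;
}

/** O(N^2 * D) pairwise cosine distances over pre-normalized vectors. */
function cosineDistanceMatrix(vectors: Float32Array[]): number[][] {
  const dim = vectors.length > 0 ? vectors[0].length : 0;
  for (const v of vectors) {
    if (v.length !== dim) throw new Error("all vectors must share one dimension");
  }
  const unit = vectors.map(normalizeVector);
  const n = unit.length;
  const dist: number[][] = Array.from({ length: n }, () => new Array<number>(n).fill(0));
  for (let i = 0; i < n; i++) {
    for (let j = i + 1; j < n; j++) {
      let dot = 0;
      for (let d = 0; d < dim; d++) dot += unit[i][d] * unit[j][d];
      // Clamp float drift so distances stay within [0, 2].
      const distance = Math.min(2, Math.max(0, 1 - dot));
      dist[i][j] = distance;
      dist[j][i] = distance;
    }
  }
  return dist;
}
```

Normalizing once per vector means the inner loop is a plain dot product, and the clamp lives in one place instead of at every call site.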

…idempotent (B.04b) Walks unembedded leaves, batches by token budget, calls Voyage, writes vec0 + meta. Designed as a single-tick API: caller (worker scheduler) invokes once per tick; the function acquires lcm_worker_lock, processes up to perTickLimit documents, releases lock, returns BackfillResult.

API:
- runBackfillTick(db, opts) → Promise<BackfillResult>
- countPendingDocs(db, args) → number (for /lcm health and tick-scheduling)

BackfillOptions covers: model + Voyage model dispatch, input_type (MUST be 'document' for backfill), API key + mock fetch, RPS pacing (default 0.5 = one call per 2s), batch token cap (default 80K), per-tick doc cap (default 200), token-count min/max (default 1 .. 30K), worker_id override (for stable IDs across ticks), onBatchComplete hook for telemetry, skipLock for tests.

BackfillResult tracks: embeddedCount, skippedOverCap (rows above the 30K cap, requiring operator attention), skipped[] (per-row failures with kind='voyage_400'/'voyage_other'/'over_cap'), perTickLimitReached (scheduler reschedules if true), lockNotAcquired (scheduler skips this tick), voyageTokensConsumed (API usage telemetry), durationMs.

Invariants:
1. NO LLM/network in any DB write tx. Each Voyage HTTP call lives OUTSIDE the per-batch transaction; rate-state UPDATE (when added in B.04c follow-up) will be a brief BEGIN IMMEDIATE that COMMITs before the HTTP call (never holds a write lock through HTTP latency).
2. Single-flight via worker lock: gateway-fallback safe.
3. Resumable: each batch's writes commit independently. Crash mid-tick loses one in-flight batch worth of Voyage spend at most. Next tick picks up still-unembedded rows.
4. Idempotent on per-row basis. SELECT pre-filters rows that already have a non-archived `lcm_embedding_meta` entry; a duplicate-write would just be a no-op via INSERT OR REPLACE.
5. Suppression-aware: rows where `summaries.suppressed_at IS NOT NULL` are excluded.
6. Per-tick failure blocklist: failed_summary_ids set excludes them from subsequent SELECTs within the same tick. Next tick re-attempts (Voyage may have recovered). Without this, a persistent 400 would spin the loop until perTickLimit.
7. Auth errors are FATAL: re-thrown so the operator gets surfaced. Still releases the lock via try/finally.

Heartbeat: lock heartbeat fires every batch. If preempted (heartbeat returns false), tick aborts cleanly without partial state.

Coverage: 13 tests (all vec0-gated, mock fetch, NO live API):
- basic embed-all, isEmbedded reflects state
- skip suppressed leaves (no Voyage call for them)
- idempotent on second tick (zero new Voyage calls)
- over-cap leaves filtered at SELECT (countPendingDocs verifies)
- perTickLimit caps work + perTickLimitReached flag
- 400 records skipped doc, no abort
- 401 (auth) re-thrown, lock released via finally
- 500 records skipped, continues with other batches
- lockNotAcquired when another worker holds (no Voyage call)
- lock released on success
- lock released even on auth error
- batches packed to maxBatchTokens (greedy bin-pack)
- countPendingDocs accurate

Tests: 1041 → 1054 (+13).

Resolves: foundation for v4.1 §13 backfill: first-run embedding of existing summaries on Eva's live DB. Group B.05 (next) wires async leaf-time embed for new leaves so the cron only handles backfill of the 4187-row corpus, not new ongoing leaves.
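The "batches packed to maxBatchTokens (greedy bin-pack)" behavior can be sketched as follows. This is a minimal sequential-greedy illustration, not the PR's implementation: `packBatches` and `PendingDoc` are hypothetical names, and the over-cap filter is done in memory here for simplicity, whereas the commit says over-cap rows are filtered at SELECT time.

```typescript
// Hypothetical sketch of token-budget batching; defaults mirror the stated 80K / 30K caps.
interface PendingDoc {
  id: string;
  tokenCount: number;
}

interface PackResult {
  batches: PendingDoc[][];
  overCap: PendingDoc[]; // docs above the per-doc cap, left for operator attention
}

function packBatches(
  docs: PendingDoc[],
  maxBatchTokens: number = 80_000,
  maxDocTokens: number = 30_000,
): PackResult {
  const batches: PendingDoc[][] = [];
  const overCap: PendingDoc[] = [];
  let current: PendingDoc[] = [];
  let currentTokens = 0;

  for (const doc of docs) {
    if (doc.tokenCount > maxDocTokens) {
      overCap.push(doc);
      continue;
    }
    // Close the current batch when this doc would push it past the budget.
    if (current.length > 0 && currentTokens + doc.tokenCount > maxBatchTokens) {
      batches.push(current);
      current = [];
      currentTokens = 0;
    }
    current.push(doc);
    currentTokens += doc.tokenCount;
  }
  if (current.length > 0) batches.push(current);
  return { batches, overCap };
}
```

Because each batch is a separate Voyage call whose writes commit independently, a crash between batches costs at most the one batch in flight.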

….05) Two pieces, both foundation for Group F's `/lcm worker` operator surface (later) and to close Group A adversarial-review Gap 8. ## 1. Worker loop (src/concurrency/worker-loop.ts) Generic single-process worker loop. One Node process running multiple background jobs cooperatively, single-threaded, each with its own cadence. Cross-process safety via lcm_worker_lock from B.04a. API: - new WorkerLoop(db, {jobs: WorkerJob[], onJobComplete?}) - loop.start() → idempotent, schedules setInterval per job - loop.stop({gracefulTimeoutMs?: 30000}) → waits for in-flight ticks - loop.runOnce(kind) → outside-schedule manual tick (used by leaf-write hooks to nudge backfill, and by `/lcm worker tick` operator command) - loop.isRunning() / loop.inFlightCount() — for /lcm health Design choices: - setInterval (not setTimeout chain): predictable cadence, dispatcher skips overlapping ticks rather than queuing — extra ticks lose, not queued forever. - Errors in jobs captured via onJobComplete, never propagate to loop — one bad tick doesn't crash the worker. - generationId guard: stop()-then-start() doesn't run leftover ticks from the old loop. - validateJobs() at construction: duplicate kinds + invalid intervalMs rejected up-front (programmer error). NOT yet wired into plugin lifecycle. Group F's /lcm worker [start|stop] operator command will instantiate it with the actual job list. Until then, the loop is a library — the embedding store + backfill modules are usable standalone. NOT using worker_threads. v4.1.1 A9 foresees true heartbeat-isolation via worker_threads, but that's a future commit. setInterval-driven dispatch is fine for our cadences (5-60s). ## 2. Leaf-write session_key fix (Gap 8 from Group A adversarial review) src/store/summary-store.ts:411 — INSERT INTO summaries now atomically populates session_key from a sub-SELECT of conversations.session_key. Closes the gap where new summaries inserted between gateway boots had session_key='' until next boot's JOIN-backfill ran. 
The COALESCE defends against (theoretically impossible) NULL conversations.session_key. This means every newly-written summary IMMEDIATELY participates in session_key-filtered partial indexes (summaries_session_key_kind_latest_idx from A.08), without waiting for migration boot. All 1054 existing tests still pass — change is additive (default still '' if conversation has no session_key, but the migration ensures every conv has one). Coverage: 13 new worker-loop tests - start/stop idempotency - schedules at cadence (timing-based) - two jobs with different intervals - overlapping ticks skipped (not queued) - errors in jobs captured + loop continues - graceful stop waits for in-flight - graceful stop returns false on timeout - runOnce returns result, throws on unknown kind, throws on in-flight - validates duplicate kinds + bad intervalMs Tests: 1054 → 1067 (+13). Resolves: foundation for v4.1 §0 worker scheduling + Group A Gap 8. Group B is now complete (B.01 Voyage client, B.02 vec0, B.03 cascade triggers, B.fix polish, B.04a worker-lock, B.04b backfill cron, B.05 worker loop + session_key fix). Next: Group B adversarial pass, then Group C retrieval (hybrid lcm_grep, lcm_semantic_recall).

… join (C.01) Wraps the embed-query → vec0 KNN → JOIN-back-to-summaries flow used by both `lcm_semantic_recall` (Group C) AND the hybrid mode of `lcm_grep` (C.02). Centralizing here so the two callers can't drift on suppression semantics, kind filtering, or session-key scope. API: - getActiveEmbeddingModel(db) → {modelName, dim} | null Picks active=1 + archive_after IS NULL row, most-recent registered_at on ties (handles model-cutover gracefully). - runSemanticSearch(db, opts) → Promise<SemanticSearchResult> Throws SemanticSearchUnavailableError if vec0 not loaded OR no active profile OR vec0 table missing — caller decides whether to degrade (FTS-only) or surface error. SemanticSearchOptions covers: query (text) OR queryVector (precomputed), session_keys / conversation_ids / since / before / summary_kinds filters, embedded_kinds default ['summary'], excludeSuppressed default true, all Voyage knobs (apiKey/fetch/maxRetries/inputType — default 'query' for asymmetric retrieval). Suppression filtered at TWO layers (defense in depth — race between trigger fire and KNN call could leak a stale row through metadata): 1. vec0 metadata `suppressed = 0` pre-filter inside MATCH 2. Final JOIN to summaries WHERE `suppressed_at IS NULL` session_key scope uses the column populated atomically at write time per Group A Gap 8 fix (in B.05). conversation_id, time, and kind filters all bind via parameterized SQL — no injection vectors. 
Coverage: 15 tests - getActiveEmbeddingModel: null when no profile, picks active+ most-recent, excludes archived - SemanticSearchUnavailableError when vec0 not loaded / no profile - input validation: requires query OR queryVector; dim mismatch - happy path: ranked hits, joined content + metadata - suppression filter (default + opt-in to include) - session_keys filter restricts to matching sessions - conversation_ids filter restricts to matching conversations - since/before time filter - Voyage call with input_type='query' verified, voyageTokensConsumed tracked - summary_kinds filter (leaf vs condensed) Tests: 1067 → 1082 (+15). Resolves: foundation for v4.1 §13 retrieval pipeline. Next (C.02): new lcm_semantic_recall tool + hybrid mode for lcm_grep that calls this service alongside FTS and merges with Voyage rerank-2.5.

…rank (C.02a) Combines FTS5 candidates with vec0 KNN candidates, deduplicates by summary_id, then either: - Reranks via Voyage rerank-2.5 (default) — produces final relevance scoring across the union, taking advantage of the spike-validated +52.5pp lift on paraphrastic queries - OR reciprocal-rank-fusion (RRF) when rerank=false OR when Voyage rerank fails (transient 5xx; auth re-thrown for operator surfacing) API: - runHybridSearch(db, opts) → Promise<HybridSearchResult> opts: query, kFts (default 50), kSemantic (default 50), topN (default 20), filters (sessionKeys/conversationIds/since/before/summaryKinds), excludeSuppressed default true, rerank default true, voyage HTTP knobs. Caller injects ftsSearch() so this module doesn't take ownership of FTS5 sanitization or hybrid-recency sort logic — that lives in the existing SummaryStore/RetrievalEngine path. HybridHit returned with: - {summaryId, conversationId, sessionKey, kind, content, tokenCount, createdAt} - score (rerank score OR RRF score) - fromFts / fromSemantic provenance flags - semanticDistance (cosine), ftsRank — for diagnostics + caller display Graceful degrade: - vec0 not loaded → degradedToFtsOnly=true, FTS-only result - rerank 5xx → degradedSkippedRerank=true, RRF fallback - rerank 401 (auth) → re-thrown; operator must fix API key - empty query → throws (programmer error) Suppression: both FTS-side and semantic-side default to excludeSuppressed. Rerank input is post-suppression union, so no post-rerank filter needed. NOT YET WIRED into lcm_grep tool. Next commit (C.02b) extends the tool with mode='hybrid' that calls runHybridSearch with summaryStore.searchSummaries adapted to FtsHit shape. 
Coverage: 8 tests (vec0-gated, mock fetch — NO live API): - merges FTS + semantic, rerank produces top-N - dedupe overlap (FTS + semantic both find same doc) - vec0 unavailable → FTS-only with degraded flag - rerank 500 → RRF fallback with degraded flag - rerank 401 → re-thrown - rerank=false explicit → RRF mode, no Voyage rerank call - empty query rejected - no candidates → empty hits Tests: 1082 → 1090 (+8). Resolves: foundation for hybrid retrieval. Used by C.02b (lcm_grep mode='hybrid') AND C.04 (lcm_synthesize_around window_kind='semantic').
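The reciprocal-rank-fusion fallback used when rerank is disabled or fails can be sketched as below. This is the standard RRF formula, not code from the PR: the function name is hypothetical, and the constant k = 60 is the conventional default from the RRF literature, assumed here because the commit message does not state the value used.

```typescript
// Hypothetical RRF sketch; k = 60 is an assumed constant, not taken from the PR.
interface FusedHit {
  id: string;
  score: number;
}

/**
 * Each ranking is a list of summary ids, best first, with no duplicates inside a list.
 * A document's fused score is the sum over lists of 1 / (k + rank), with rank starting at 1.
 */
function reciprocalRankFusion(rankings: string[][], k: number = 60): FusedHit[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const previous = scores.get(id) ?? 0;
      scores.set(id, previous + 1 / (k + index + 1));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    // Highest score first; ties broken by id so the output is deterministic.
    (a, b) => b.score - a.score || a.id.localeCompare(b.id),
  );
}
```

Summing per-list contributions into one map also performs the dedupe-by-summary_id step: a document found by both FTS and semantic search appears once, with a higher fused score than one found by either alone at the same rank.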

…paths (C.03) v4.1 §10 invariant: every agent-facing retrieval surface defaults to exclude-suppressed. Adds `WHERE suppressed_at IS NULL` to four search code paths in SummaryStore: 1. searchFullText (FTS5 path) — alias `s.suppressed_at IS NULL` 2. searchLike (LIKE-fallback path) — `suppressed_at IS NULL` 3. searchCjkTrigram (CJK FTS path) — alias `s.suppressed_at IS NULL` 4. searchRegex — `suppressed_at IS NULL` These four functions back the existing `lcm_grep` tool's regex / full_text modes (and the new C.02b hybrid mode via the ftsSearch callback). Suppressed leaves now never surface to agents through any search-side path. The vec0 retrieval surfaces (semantic-search, hybrid-search) already filter via metadata pre-filter (vec0 `suppressed=0`) AND defense-in- depth JOIN to summaries.suppressed_at IS NULL. Both layers are independently tested. What this DOESN'T change: - getSummary(id), getSummaryParents/Children/Subtree, getSummaryMessages, context-item reads — these are structural lookups used by lineage / expansion / assembler. The architecture's "7 read paths" cascade handles them by suppressing-at-source (assembler builds context from latest non-suppressed leaves; expansion respects contains_suppressed_leaves flag for condensed). A per-method excludeSuppressed default param refactor was considered but deferred. - lcm-doctor / lcm-command operator paths — operator tooling intentionally sees ALL rows including suppressed (for cleanup, audit, doctor checks). Coverage: 4 new tests (LIKE/full_text path, regex path, restore-on- unsuppress, multiple-suppression). Tests: 1090 → 1094 (+4). Resolves: v4.1 §10 invariant for SummaryStore search paths.

Wires the semantic-search service from src/embeddings/ into a new agent-callable tool. lcm_semantic_recall is the purely-semantic counterpart to lcm_grep; agents use it for paraphrastic queries that exact-match FTS would miss. Hybrid (keyword + semantic) is reserved for lcm_grep mode='hybrid' (Group C.02b). The tool resolves conversation scope via the existing resolveLcmConversationScope helper, parses since/before like lcm_grep, and gracefully degrades when sqlite-vec is missing or when VOYAGE_API_KEY is not set — both surfaces return jsonResult errors that direct the agent back to lcm_grep instead of throwing. A small public getDb() accessor is added to LcmContextEngine so tools can call runSemanticSearch(db, opts) directly without plumbing a new dependency through the LcmDependencies surface. Mirrors the existing getRetrieval() / getConversationStore() / getSummaryStore() pattern. Manifest contracts.tools updated to match the new register call site (guarded by manifest.test.ts). Tests cover input validation (empty query, bad timestamps, missing scope), graceful degradation (vec0 unavailable, missing API key), happy path with mocked Voyage fetch, conversationId scope filter, and since/before passthrough — vec0-dependent tests skip cleanly when the extension isn't installed. Refs: architecture v4.1 §13.

… collision (B.fix2) Resolves Group B adversarial-pass HIGH/BLOCKER findings:

## Gap 1 (BLOCKER): backfill heartbeat vs Voyage retry budget

src/embeddings/backfill.ts: was using Voyage client's default retry + timeout (3 retries × 60s = ~4 min worst-case per batch). With WORKER_LOCK_TTL_MS=90s, a stuck batch can let another worker GC the lock and start backfilling the same docs → Voyage double-bill + duplicate vec0 rows (auxiliary cols have no UNIQUE constraint to catch this).

Fix: introduce `voyageMaxRetries` default = 1 + `voyageTimeoutMs` default = 30s in BackfillOptions. Worst-case per batch now:

    2 attempts × 30s + ~0.5s backoff ≈ 60.5s

Comfortably under 90s lock TTL → another worker can't preempt mid-batch. Caller can override either knob (e.g. for first-run backfill where contention is low and longer Voyage tolerance is acceptable). Tests that need to surface 5xx immediately use voyageMaxRetries: 0.

## Gap 2 (HIGH): slug collision silently corrupts KNN

src/embeddings/store.ts: registerEmbeddingProfile() didn't check that the new model_name's sluggified form was already in use. Two profiles like `voyage-4-large` and `voyage_4_large` both sluggify to `voyage4large` → same vec0 table → inserts from both profiles route to one table → KNN cross-contaminates.

Fix: scan existing profiles for slug equality BEFORE INSERT OR IGNORE. Throws with explanatory message identifying the existing model_name that already owns the slug. The existing `MODEL_NAME_PATTERN = /^[A-Za-z0-9._-]{1,64}$/` allows `-`, `_`, `.` (all of which are stripped by sluggification), so false-collision risk is real, not hypothetical.

## Gap 8 (LOW, folded in): dim upper bound consistency

ensureEmbeddingsTable rejects dim > 4096; registerEmbeddingProfile had no upper bound, leaving an orphaned profile if caller did register-then-ensure. Aligned both functions to reject dim > 4096 in registerEmbeddingProfile too.

## Coverage: 8 new tests in v41-group-b-fix2.test.ts

- Slug collision rejected: dash↔underscore↔dot↔case variants
- Genuinely-different slug allowed
- Re-registering same model still idempotent
- Collision detection order-independent
- Dim > 4096 rejected (matching ensureEmbeddingsTable)
- Dim = 4096 accepted (boundary)
- Backfill default voyageMaxRetries=1 (proven by call count = 2)
- Backfill caller can override voyageMaxRetries: 0

Tests: 1094 → 1112 (+18; also includes 10 from C.01b subagent).

Group B adversarial Gaps 3-7 (3 MED + 1 LOW remaining) are doc/comment polish; deferred to cycle-2 review.
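The slug-collision check from Gap 2 can be sketched as follows. This is an illustration under assumptions, not the PR's code: `sluggify` here lowercases and strips every non-alphanumeric character (inferred from the `voyage-4-large` / `voyage_4_large` example and the case-variant test), and `assertNoSlugCollision` is a hypothetical name that takes the existing model names as a plain array instead of reading them from the database.

```typescript
// Hypothetical sketch of the collision scan; the pattern is the one quoted in the commit message.
const MODEL_NAME_PATTERN = /^[A-Za-z0-9._-]{1,64}$/;

/** Assumed sluggification: lowercase, then drop everything that is not a-z or 0-9. */
function sluggify(modelName: string): string {
  return modelName.toLowerCase().replace(/[^a-z0-9]/g, "");
}

/** Throws if a different, already-registered model name maps to the same slug. */
function assertNoSlugCollision(existingModelNames: string[], candidate: string): void {
  if (!MODEL_NAME_PATTERN.test(candidate)) {
    throw new Error(`invalid model name: ${candidate}`);
  }
  const slug = sluggify(candidate);
  for (const existing of existingModelNames) {
    // Re-registering the exact same name stays idempotent.
    if (existing !== candidate && sluggify(existing) === slug) {
      throw new Error(
        `model name "${candidate}" collides with existing "${existing}" (both map to slug "${slug}")`,
      );
    }
  }
}
```

Running the scan before the insert is what makes the check order-independent: whichever of two colliding names is registered second is the one that gets rejected.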

Extends lcm_grep with a third mode='hybrid' that blends FTS + semantic vector search via Voyage rerank. The schema enum picks up the new value, and the tool description points agents at lcm_semantic_recall for purely-semantic exploration so the two surfaces stay distinguishable. The hybrid path delegates to runHybridSearch (src/embeddings/), passing a small adapter that wraps summaryStore.searchSummaries(mode:'full_text' sort:'relevance') and hydrates the snippets back to full FtsHit shape via a single batched SELECT against summaries by summary_id. We could have piped each hit through getSummary, but the IN(...) batch is one round-trip and the values we need (session_key, content, token_count, created_at, conversation_id) are already on the row. Output format mirrors the regex/full_text branch — same '## LCM Grep Results' header, '**Mode:** hybrid' line, conversation scope + time filter — but with hybrid-specific extras: - per-hit provenance flag: [from FTS+semantic] / [from FTS only] / [from semantic only] - rerank/RRF score - degraded warnings: '*(semantic search unavailable; degraded to FTS-only)*' when vec0 is missing, '*(rerank failed; using RRF fusion fallback)*' when rerank network errors and we fall back to reciprocal-rank-fusion Auth errors from Voyage surface as a jsonResult error message that points the agent at mode='full_text' as the keyword-only fallback. Tests cover schema enum + description metadata, the degraded-vec0-missing path (FTS-only mode with the warning + FTS-only provenance flag), happy path with mocked Voyage embed + rerank (mixed provenance flags + score-ordered hits), and the rerank-failed RRF fallback path. Refs: architecture v4.1 §13.

Versioned prompt templates per (memory_type, tier_label, pass_kind). Append-only — old versions stay archived (active=0); new versions inserted with active=1, previous-active row deactivated atomically. Backed by lcm_prompt_registry (created in A.04, NULL-tier UNIQUE patched in B.fix Gap 2). Schema: (prompt_id PK, memory_type, tier_label NULLABLE, pass_kind, version, template, model_recommendation, active, bundle_version, notes) API: - getActivePrompt(db, {memoryType, tierLabel, passKind}) → PromptRecord | null - getPromptById(db, promptId) → PromptRecord | null (used by synthesis-cache to verify the prompt_id is still current or look up the archived version that was used) - registerPrompt(db, opts) → string (the new prompt_id) Atomic: deactivates previous + inserts new in BEGIN IMMEDIATE. Auto-versions (max(version) + 1 within triple). - listActivePrompts(db) → for /lcm health - bumpBundleVersion(db) → for voice-consistency rebuilds NULL tierLabel handling: matched literally (not coerced to "") in both lookup and update. Aligns with B.fix Gap 2's NULL-safe UNIQUE index on (memory_type, COALESCE(tier_label, ''), pass_kind, version) — the registry treats NULL and '' as DIFFERENT for purposes of routing, even though the UNIQUE index treats them as the same for collision detection. Why versioning matters for cache invalidation: lcm_synthesis_cache (D.02 next commit) will FK on prompt_id. 
When a prompt is updated: - Old cache entries reference the now-archived prompt_id → stale - New synthesis calls write rows with the new prompt_id → fresh - Cache invalidation can be SELECTIVE (only entries with archived prompt_id need rebuild) — never touches durable summaries.content Coverage: 11 tests - register + getActivePrompt happy path - re-register same triple deactivates previous + bumps version - per-triple version isolation (different triples independent) - NULL tierLabel matched literally - getActivePrompt returns null when none registered - promptIdOverride respected - modelRecommendation/bundleVersion/notes round-trip - listActivePrompts excludes archived - bumpBundleVersion increments active prompts only - atomic transaction rolls back on PK collision Tests: 1112 → 1123 (+11). Resolves: foundation for v4.1 §3 synthesis. Next (D.02): synthesis dispatch that uses this registry for prompt selection.

Extends the lcm_describe summary payload with two fields agents need when reasoning across session families: - sessionKey: pulled from the parent conversations row (which holds the same value as summaries.session_key per the Gap 8 / B.05 atomic-write invariant). The SummaryRecord public store API doesn't carry session_key through, so retrieval.describeSummary() fans out a parallel conversationStore.getConversation(conversationId) alongside the existing parents/children/messages/subtree fetches. Empty string when the parent conversation has no session_key. - timeRange: a normalized {earliestAt, latestAt, createdAt} struct that mirrors the three time fields already present on the summary. Convenience for callers that prefer one bracket over three siblings. Both fields are also surfaced in the text rendering — the meta line now carries 'sessionKey=...' and 'created=...' alongside the existing 'range=earliest..latest', so agents inspecting summaries get the session affiliation and creation time visible without parsing the JSON details. Tests cover both the populated path (sessionKey appears verbatim, timeRange struct round-trips through details) and the empty path (sessionKey rendered as '-' for missing values). Refs: architecture v4.1 §13.

…D.02) Per-tier dispatch on top of D.01's prompt registry. Picks model + pass strategy per tier label, runs the LLM call(s), records every pass to lcm_synthesis_audit, returns final synthesized text. Per-tier strategies (per architecture-v4.1 §3 + literature consensus that critique-revise underperforms single-pass for summarization): daily → single-pass (mini model) weekly → single-pass (mid model) monthly → single + verify_fidelity (premium model) — verify_fidelity prompt asks "are there claims in the summary that aren't in the source?" — separate model call, returns 'OK' or 'HALLUCINATION: <details>' yearly → best-of-N (N=3) + judge (premium-thinking) — N candidates run in parallel; judge prompt picks the best by index (0..N-1) custom → single-pass (mid model) filtered → single-pass (mid model) Default models: claude-haiku-4-5 (daily), claude-sonnet-4-5 (weekly, custom, filtered), claude-opus-4-7 (monthly), claude-opus-4-7-thinking (yearly). Override per-prompt via lcm_prompt_registry.model_recommendation or per-call via SynthesizeRequest.{modelOverride, forceModel}. API: - dispatchSynthesis(db, llmCall, req: SynthesizeRequest) → Promise<SynthesizeResult> - LlmCall is INJECTED — production wires to existing pi-ai infrastructure (Group F integration); tests inject deterministic mocks. Keeps dispatch decoupled from the existing summarize.ts (which is geared to per-leaf compaction in the gateway hot path — different concerns). SynthesizeRequest covers: tier, memoryType, sourceText, target (summary_id OR cache_id), passSessionId (groups multi-pass audit rows), bestOfN override (yearly), model overrides. SynthesizeResult: output, primaryPromptId, audit IDs, total latency, total cost cents, hallucinationFlagged (monthly), bestOfN detail (yearly: n + selectedIndex + all candidates). Audit trail: every pass writes a 'started' row up-front (forensic record even if LLM crashes mid-call), then UPDATEs to 'completed' or 'failed' with output + latency + cost + last_error. 
Error handling: - missing_prompt: thrown if the (memoryType, tier, single|judge) triple has no active prompt registered. Operator must register via /lcm command (Group F) or seed in deployment. - llm_failure: re-thrown after writing audit row with status='failed' and last_error set. Caller (synthesis worker) decides whether to retry or surface to operator. - judge_failure: yearly tier judge returned malformed output (no digit, or out-of-range). Indicates a bad judge prompt — the candidate outputs are intact in audit rows for manual recovery. Template rendering: simple {{source_text}}, {{tier}}, {{memory_type}} substitutions for the primary template; {{candidate_summary}} for verify; {{candidates}} (rendered as numbered list) for judge. Coverage: 16 tests - DEFAULT_MODEL_BY_TIER + PASS_STRATEGY_BY_TIER constants - daily / weekly: single-pass, audit row, default model - monthly: single + verify; hallucinationFlagged true vs false vs skipped (no verify prompt) - yearly: 3 candidates + judge picks 1; bestOfN=5 override; judge output without digit → judge_failure; missing judge prompt → missing_prompt - missing primary prompt → missing_prompt - LLM call exception → llm_failure + audit row.status='failed' + last_error captured - prompt model_recommendation overrides tier default - forceModel + modelOverride wins - template substitution Tests: 1130 → 1146 (+16; subagent's C.05 already merged). Resolves: foundation for v4.1 §3 synthesis. Next (D.03): eval harness for measuring retrieval recall + synthesis quality on Eva's stratified N=100 query corpus.
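
The template substitution and judge-output handling described above can be sketched as follows. This is an illustration under assumptions, not the dispatch module: `renderTemplate`, `renderCandidates`, and `parseJudgeIndex` are hypothetical helper names, and the "first run of digits" parsing rule and the `N. text` list layout are guesses that fit the stated failure modes (no digit, or out of range).

```typescript
// Replace {{name}} placeholders; unknown placeholders are left untouched.
function renderTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match: string, key: string) => {
    const value = vars[key];
    return value === undefined ? match : value;
  });
}

// Render best-of-N candidates as a numbered list, indexed 0..N-1 (assumed layout).
function renderCandidates(candidates: string[]): string {
  return candidates.map((c, i) => `${i}. ${c}`).join("\n");
}

// Pick the judge's chosen index; malformed output maps to judge_failure.
function parseJudgeIndex(judgeOutput: string, n: number): number {
  const m = /\d+/.exec(judgeOutput);
  const idx = m ? Number(m[0]) : NaN;
  if (!Number.isInteger(idx) || idx < 0 || idx >= n) {
    throw new Error("judge_failure");
  }
  return idx;
}

// Usage: build a judge prompt for three yearly-tier candidates.
const judgePrompt = renderTemplate("Pick the best summary by index:\n{{candidates}}", {
  candidates: renderCandidates(["draft A", "draft B", "draft C"]),
});
```

Throwing on a malformed judge reply, rather than defaulting to candidate 0, matches the commit's intent that the candidates stay intact in the audit rows for manual recovery.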

Heuristic gate before procedure clustering. Most leaves are conversational; only a small fraction look like procedures. We pre-filter by the SHAPE of the content (not by FTS verb regex, which 3 adversarial agents flagged as too noisy + many false negatives).

Three structural signals (compose with OR):
- numbered-steps: 3+ lines starting with "1.", "Step 1:", "1)", "(1)", etc. Strict counting (no "1. ... only 2 ..."). Score weight: 0.4
- command-block: 2+ shell-command-shaped lines: $-prompt, ❯-prompt, %-prompt, > -prompt; lines inside ```bash/sh/zsh/shell``` fences; lines starting with recognized tools (git/npm/pnpm/yarn/docker/kubectl/terraform/aws/gcloud/az/gh/cargo/python/node/psql/mysql/redis-cli). Score weight: 0.4
- how-to-marker: 2+ unambiguous markers like "how to ", "the procedure for ", "steps to ", "in order to ", "first/then/finally,". Conservative: a single marker is too noisy (lots of conversational uses). Score weight: 0.3

A leaf is a clustering CANDIDATE if any one signal fires. The score (sum of fired weights, capped at 1) is exposed for downstream ranking; Group E's clustering call may threshold on it.

API:
- prefilterContent(content) → {isCandidate, signals[], score}
- prefilterLeaves<T>(leaves[]) → only the candidate rows, with {signals, score} attached

Pure module: no DB, no LLM, no async. Safe to call inline.

Coverage: 18 tests
- numbered-steps: markdown, "Step N:", "N)", insufficient count, prose with embedded numbers
- command-block: $ prompt, fenced bash, line-start tool names, single-command rejection
- how-to-marker: 2+ markers fire, single marker doesn't
- composite: multi-signal stack, score cap at 1, plain conversation
- input edges: empty, undefined, null
- prefilterLeaves batch helper

Tests: 1146 → 1164 (+18).

Resolves: foundation for v4.1 §6.2 procedure clustering. Next (E.02): clustering pass that runs ml-hclust over candidate leaves' embeddings.
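
A rough sketch of the three-signal gate, to show how the OR composition and the capped score interact. The regexes here are simplified guesses (the commit's "strict counting" rule is not implemented), and `prefilterContentSketch` is a hypothetical stand-in for the real `prefilterContent`.

```typescript
type Signal = "numbered-steps" | "command-block" | "how-to-marker";

interface PrefilterResult {
  isCandidate: boolean;
  signals: Signal[];
  score: number;
}

// Weights as stated in the commit message.
const WEIGHTS: Record<Signal, number> = {
  "numbered-steps": 0.4,
  "command-block": 0.4,
  "how-to-marker": 0.3,
};

// Simplified patterns (assumptions, not the module's actual regexes).
const NUMBERED = /^\s*(?:\(?\d+[.)]\s|step\s+\d+\s*:)/i;
const PROMPT = /^(?:\$|❯|%|>)\s+\S/;
const TOOLS =
  /^(?:git|npm|pnpm|yarn|docker|kubectl|terraform|aws|gcloud|az|gh|cargo|python|node|psql|mysql|redis-cli)\s/;
const FENCE_OPEN = /^```(?:bash|sh|zsh|shell)\s*$/;
const HOWTO = /how to |the procedure for |steps to |in order to |\b(?:first|then|finally),/gi;

function prefilterContentSketch(content: string | null | undefined): PrefilterResult {
  const text = content ?? "";
  let numbered = 0;
  let commands = 0;
  let inFence = false;
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (inFence) {
      if (line.startsWith("```")) inFence = false;
      else if (line.length > 0) commands++; // every non-empty fenced line counts
      continue;
    }
    if (FENCE_OPEN.test(line)) {
      inFence = true;
      continue;
    }
    if (NUMBERED.test(raw)) numbered++;
    if (PROMPT.test(line) || TOOLS.test(line)) commands++;
  }
  const markers = (text.match(HOWTO) ?? []).length;

  const signals: Signal[] = [];
  if (numbered >= 3) signals.push("numbered-steps");
  if (commands >= 2) signals.push("command-block");
  if (markers >= 2) signals.push("how-to-marker");
  // Sum of fired weights, capped at 1 (all three would otherwise reach 1.1).
  const score = Math.min(1, signals.reduce((sum, s) => sum + WEIGHTS[s], 0));
  return { isCandidate: signals.length > 0, signals, score };
}
```

Because the function is pure and synchronous, a batch helper in the spirit of `prefilterLeaves` reduces to a `map` plus `filter` over the leaves.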

…lose + recalibrate)

Methodology: Research → Run → Diagram → Debate → Decide → Implement.

Step 1 data (live DB): the estimator at needs-compact-gate.ts:88-104 is 4× too high for this corpus. Real expandMessages=20 emits 2,551–3,604 tokens (median ~140 tokens/msg = ~560 chars/msg); the estimator predicted 12-13K tokens (assumed 600 tokens/msg = 2400 chars/msg). The corpus DAG is also flat parent-of-1: 414 condensed summaries each have 1 direct child, so expandChildren=20 emits 0-1 child of ~2K tokens, not 20×.

Step 3 adversarial review caught: the originally-proposed fix (pre-call refuse with 5K grant default) was wrong-domain protection (the F1 anti-pattern from feedback_adversarial_review_domain_check.md). The actual sub-agent grant protection already exists at lcm-describe-tool.ts:329-342 (pre-emit redaction) and :637-659 (post-emit consumeTokenBudget ledger). Adding a new gate on top of a 4×-broken estimator was building on a bad foundation.

Step 4 decision: don't add new code; recalibrate the existing estimator. Coefficients now match empirical observations:
- expandChildren k * 4075 → k * 2000 (typical 2K-token children)
- expandMessages k * 2400 → k * 600 (typical 150-token messages)

Tests updated to reflect new estimator output (4200 tokens for expandMessages=20, was capped at 10K).

The F2 reviewer's failure scenario (grant + over-disclosure) is theoretical against this corpus; validation showed the audit table has 2 rows total (P0 follow-up: instrument audit writes so we get production data on real grant sizing).

LOC: 3 (coefficient changes only) + ~10 test fixture updates.

Documents:
- /tmp/research-f2-f6-data.md (Step 1 distributions)
- /tmp/validation-f2-f5-f6.md (Step 1.5 actual tool execution)
- /tmp/adversarial-f2.md (Step 3 hostile-reviewer position)
- /tmp/decision-phase2-final.md (Step 4 decision record)
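
The expandMessages arithmetic behind the "4× too high" finding can be checked with a short sketch. It assumes the coefficient is in characters and uses the 4 chars/token ratio implied by "140 tokens/msg = ~560 chars/msg"; the function name is hypothetical, and any fixed base overhead the real estimator adds (which would account for the 4200-token test figure) is deliberately not modeled.

```typescript
// Ratio implied by the commit's own numbers (560 chars ≈ 140 tokens).
const CHARS_PER_TOKEN = 4;

// Per-message contribution only; no base overhead is modeled here.
function estimateExpandMessagesTokens(k: number, charsPerMessage: number): number {
  return Math.round((k * charsPerMessage) / CHARS_PER_TOKEN);
}

const oldEstimate = estimateExpandMessagesTokens(20, 2400); // old coefficient
const newEstimate = estimateExpandMessagesTokens(20, 600); // recalibrated coefficient

// Observed range for expandMessages=20 on the live DB, from the commit message.
const observed = { min: 2551, max: 3604 };
const newIsInObservedRange = newEstimate >= observed.min && newEstimate <= observed.max;
```

With these inputs the old coefficient yields 12,000 tokens, in line with the reported 12-13K prediction, while the recalibrated one yields 3,000 tokens, inside the observed 2,551–3,604 band.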

…Summarizer Methodology: Research → Debate → Decide → Implement. Step 1 archeology found two LlmCall wrappers: - createWorkerLlmCall (worker-llm.ts:52-126) honors args.model + returns actualModel - buildLlmCallFromSummarizer (this file) ignored args.model + returned no actualModel Wave-11 commit e96e03e finding Martian-Engineering#4 ("Documentation accuracy" heading) fixed the tool description's overclaim — but did NOT adjudicate the audit-row gap. lcm_synthesis_audit.model_used recorded the dispatched intent (pickModel's recommendation), not the actually-resolved model. Operators debugging a synthesis failure would see the wrong model in audit logs. Step 3 adversarial review verified: the original "close as won't-fix" recommendation overclaimed Wave-11 precedent. The decision record had already filed a P3 follow-up to do this exact 10 LOC fix — calling it won't-fix while filing P3 was contradictory. Just do the fix. Step 4 decision: thread a `resolveActualModel: () => string | undefined` parameter into the wrapper. Pass `() => summarizerBuilt.model` from the call site. This eliminates the audit/execution gap. The wrapper now returns `actualModel` from the summarizer's resolved primary candidate (src/summarize.ts:1688-1695). Caveat documented in code comment: if mid-call fallback fires, the recorded model may not match the candidate that actually succeeded. Strictly better than recording dispatched intent. Future improvement: have the summarizer surface the candidate that actually ran. Tool description also updated to say "audit table records the resolved model that actually ran" (was: "records the per-tier model name in the audit table") — the contract is now honest end-to-end. LOC: 10 (parameter + return field + call site + description text). Documents: /tmp/adversarial-f8.md, /tmp/decision-phase2-final.md

…e wrapper

Methodology: Research → Run → Debate → Decide → Implement.

Step 1.5 validation (live DB): drift can flip the needsCompact gate ALLOW↔REFUSE decision, but only in a narrow 80-85%-of-budget anchor band. Drift is bounded to single-iteration (resets on next llm_output).

Step 3 adversarial review caught: the originally-proposed Option A (spot-tap the missed return paths) was a scope undercount. lcm_describe has 4 return paths (lines 137 refusal, 661 summary, 707 file, 713 fallthrough); the original commit ed05cc0 said "tap on final return" (singular) and only tapped 137 + 713. The 661 + 707 paths emit the LARGEST result payloads in the file (full subtree+expansion at 661, full file content at 707). Spot-tap left those untapped. The proposed "invariant test" would either be theater (regex passes today's bug) or force wrapper migration anyway.

Step 4 decision: migrate to the runWithTokenGate wrapper. The wrapper does the pre-call gate AND the post-call tap automatically: a single return funnel, so it is structurally impossible to skip a tap on ANY future return path.

Removed:
- Inline `evaluateNeedsCompactGate` import + invocation (lines 130-138)
- Inline `tapResultForTokenAccounting` calls (lines 137 + 713)
- Direct `tapResultForTokenAccounting` import

Added:
- `runWithTokenGate` import
- Single wrapper invocation at the top of `execute`, with all 4 return paths flowing through `inner: async () => {...}`

The 4 return paths (no longer tapped inline, because the wrapper does it):
- jsonResult(refusal) for invalid input
- { content, details } for summary results
- { content, details } for file results
- jsonResult(result) fallthrough

Net diff: +6 lines (wrapper) - 4 lines (deleted inline gate + taps) = +2 LOC, but the bug class is closed structurally.

Documents: /tmp/adversarial-f5.md, /tmp/decision-phase2-final.md
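
The "single return funnel" idea can be illustrated with a small sketch. The real `runWithTokenGate` signature is not shown in this PR, so the option names (`gate`, `tap`, `inner`), the result shape, and the choice to tap refusals are all assumptions made for illustration.

```typescript
interface ToolResult {
  content: string;
  details?: unknown;
}

interface GateDecision {
  allow: boolean;
  reason?: string;
}

// Hypothetical wrapper: pre-call gate, then one exit point that always taps.
async function runWithTokenGateSketch(opts: {
  gate: () => GateDecision;
  tap: (result: ToolResult) => void;
  inner: () => Promise<ToolResult>;
}): Promise<ToolResult> {
  const decision = opts.gate();
  const result: ToolResult = decision.allow
    ? await opts.inner() // every return path of the tool body ends up here
    : { content: `refused: ${decision.reason ?? "needs compaction"}` };
  opts.tap(result); // single funnel: no return path can skip accounting
  return result;
}
```

However many `return` statements the tool body grows later, they all resolve through `inner`, so token accounting no longer depends on each path remembering to call the tap itself.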

…tim per-hit cap Two related changes in lcm-grep-tool.ts. Methodology: Research → Run → Debate → Decide. Both flipped after adversarial review caught my mistakes. # F5 — wrapper migration Adversarial review counted 12 untapped return paths total (across grep + describe), not the 4 I claimed. In grep alone: - Line 392: regex/full_text success - Lines 590, 598, 604: hybrid error returns (in runHybridLcmGrep) - Line 661: hybrid success - Lines 761, 774, 779: semantic error returns (in runSemanticLcmGrep) - Line 854: semantic success - Line 1063: verbatim success Spot-tap was whack-a-mole. Wave-9 → Wave-12 has hit the same antipattern twice already. The structural fix is the wrapper migration. Removed: inline `evaluateNeedsCompactGate` + 4 `tapResultForTokenAccounting` calls in execute body (early-error paths). Added: single `runWithTokenGate` wrapper around the entire body. All return paths — including helper functions' internal error returns — now flow through the wrapper's auto-tap exit. Single return funnel, can't skip a tap. # F6 — verbatim per-hit content cap (5K chars) Live-DB validation showed 5/5 plausible verbatim queries leak 6-12× the markdown disclosure via `details.hits[].content`: markdown caps at 25-33K chars while details carries 200-385K chars per call. Empirical single hits up to 200K chars exist (5× the entire markdown budget). Adversarial review caught my original "metadata-only details" (Option D) recommendation as factually wrong: I had claimed "verified zero callers" but actual grep found 20+ active callers including: - test/lcm-grep-verbatim-mode.test.ts (canonical contract test) - test/v41-five-questions.test.ts (entire Type-C citation suite) - test/v41-adversarial-scenarios.test.ts (defense-in-depth regressions) - scripts/v41-qa-runner.mjs (live-DB harness, "critical" severity) Decision flipped to Option A: keep `content` field but cap each hit at 5K chars, slice `details.hits` to `renderedRowCount` (rows actually emitted into markdown). 
5K is the 96th percentile of message lengths in the observed corpus — typical messages fit fine, the long-tail tool-output dumps get capped with `contentTruncated: true` + `fullContentLength` flag pointing at lcm_describe(messageId, expandMessages=true) for the full body. New fields in details: - truncated: bool (markdown loop broke early) - hits[i].contentTruncated: bool (this hit's content was capped) - hits[i].fullContentLength: number (so caller can decide if follow-up via lcm_describe is worth it) # Tests 10 verbatim tests pass (was 8): 2 new invariants pin the cap behavior + the renderedRowCount slicing. - "INVARIANT: per-hit content cap at 5K chars + truncation flags" - "INVARIANT: details.hits sliced to renderedRowCount when markdown truncates" The 20+ existing callers all still pass (verified): they assert against substrings + messageIds, not full-content equality. LOC: ~50 (F5 wrapper migration) + ~30 (F6 cap + flags) + ~50 (new tests). Documents: - /tmp/adversarial-f5.md - /tmp/adversarial-f6.md - /tmp/decision-phase2-final.md - /tmp/research-f2-f6-data.md (F6 message-length distributions) - /tmp/validation-f2-f5-f6.md (F6 dual-channel leak measurements)
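
A minimal sketch of the F6 behavior: slice `details.hits` to the rows actually rendered, then cap each hit's content at 5K chars with truncation flags. `capHits` is a hypothetical helper, and always populating `fullContentLength` (rather than only on truncated hits) is an assumption.

```typescript
interface Hit {
  messageId: number;
  content: string;
}

interface CappedHit extends Hit {
  contentTruncated: boolean;
  fullContentLength: number;
}

const PER_HIT_CAP_CHARS = 5000;

function capHits(
  hits: Hit[],
  renderedRowCount: number,
  cap: number = PER_HIT_CAP_CHARS,
): CappedHit[] {
  // Only rows that made it into the markdown are exposed in details.
  return hits.slice(0, renderedRowCount).map((h) => ({
    ...h,
    content: h.content.slice(0, cap),
    contentTruncated: h.content.length > cap,
    // Lets the caller decide whether a follow-up lcm_describe is worthwhile.
    fullContentLength: h.content.length,
  }));
}

// Usage: a 200K-char tool dump is capped; the short message is untouched.
const capped = capHits(
  [
    { messageId: 1, content: "short message" },
    { messageId: 2, content: "x".repeat(200_000) },
    { messageId: 3, content: "not rendered" },
  ],
  2,
);
```

Keeping the `content` field, instead of switching to metadata-only details, is what lets the existing substring and messageId assertions keep passing.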

Wave-12 reviewer F4 landed the suppression-aware aggregate CTE in lcm_get_entity AND lcm_search_entities via parallel edits — byte-identical SQL maintained in two places, a parallel-edit drift hazard. The first-principles-architectural-decision methodology run (research + adversarial debate + reach-for analysis) chose Option B (extract shared helper) over Option A (merge into lcm_entity { mode }) for the entity axis: - Both adversarial agents independently recommended B (helper) over A - Reach-for v1 (25 scenarios) found search_entities orphaned (0 reaches) but reach-for v2 (30 scenarios incl. browse/fuzzy F1-F5) found it REACHABLE when scenarios target its niche (3 first-reaches on F1, F2, F4) - The original "consolidate" verdict was a scenario-coverage artifact, not tool orphaning. Both tools have earned their keep. Helper at src/tools/lcm-entity-shared.ts exports: - VISIBLE_MENTIONS_CTE — the WITH visible_mentions AS (...) clause - entityAggCte({ includeFirstIn }) — the , entity_agg AS (...) clause, with the get-entity-only first_in column toggleable Both tools now build their query as: ${VISIBLE_MENTIONS_CTE}${entityAggCte({ includeFirstIn: true|false })} SELECT ... FROM lcm_entities e JOIN entity_agg ea ON ... WHERE ... Surface unchanged. Tests unchanged (20/20 pass). Documents: - /tmp/research-entity-consolidation.md (Step 1) - /tmp/step2-entity-consolidation-options.md (Step 2) - /tmp/adversarial-entity-A.md, /tmp/adversarial-entity-C.md (Step 3) - /tmp/reach-for-analysis.md (Step 1.7 v1) - /tmp/reach-for-analysis-v2.md (Step 1.7 v2)

…ic' (9→8 tools) # Wave-12 consolidation SA — final ship The first-principles-architectural-decision methodology run produced a nuanced verdict for tool consolidation. The semantic axis got consolidated; the entity axis did not. ## Decision: drop lcm_semantic_recall, fold capabilities into lcm_grep Reach-for analysis (Step 1.7) showed: - v1 (25 scenarios): 0 first-reaches for lcm_semantic_recall - v2 (30 scenarios incl. F1-F5 browse/fuzzy/cost-cheap): 1 narrow first-reach - Even with its tailor-made F5 scenario, it only barely beat lcm_grep mode='semantic'. No durable niche. Code archeology (Step 1.5) found the introducing commit `1e09df9` itself admitted "lcm_semantic_recall kept distinct (**same cost** as mode='semantic'; both exposed for clarity per challenger C2 verdict)." The "for clarity" justification was invalidated by circular descriptions that defer to each other ("for purely-semantic exploration prefer lcm_semantic_recall" inside lcm_grep, vs "reserve lcm_semantic_recall for purely semantic exploration" inside recall). Changes: 1. **Schema**: added `summaryKinds` filter to lcm_grep (was the only recall-only differentiator). Honored only by mode='semantic' / 'hybrid'; ignored elsewhere. 2. **Implementation**: deleted src/tools/lcm-semantic-recall-tool.ts. Plumbing through runSemanticLcmGrep already shared underlying `runSemanticSearch` + confidence-band logic. 3. **Manifest**: removed from openclaw.plugin.json. 9 → 8 tools. 4. **Plugin index**: removed import + registerTool call. 5. **needs-compact-gate.ts**: removed lcm_semantic_recall case in estimateResultTokens (folded into lcm_grep semantic estimator). 6. **Tests**: removed lcm-semantic-recall-tool.test.ts; updated 4 tests that referenced recall (parity-invariants, adversarial-scenarios, five-questions, tool-budget-guardrail) to use lcm_grep mode='semantic'. 7. 
**Description fix**: lcm_grep description no longer cross-defers to recall; tells the agent semantic mode is the standalone pure-vector path with optional summaryKinds filter. ## Decision: KEEP lcm_search_entities (axis-different from earlier plan) Reach-for v1 had also flagged lcm_search_entities as orphaned (0 first-reaches in 25 scenarios). v2 with F1-F5 added flipped this: - F1 (browse all entities of a type): reached for lcm_search_entities - F2 (fuzzy-name lookup): reached for lcm_search_entities - F4 (filter by entity_type): reached for lcm_search_entities - 3 first-reaches across F-scenarios where the description fits The original v1 zero was a SCENARIO COVERAGE artifact — THE_FIVE_QUESTIONS was biased toward expert queries that already named the canonical entity. Adding browse/fuzzy/type-filter scenarios revealed the tool serves a real niche. Eva's intuition that the v1 reach-for picture was incomplete was correct. Description rewrite leads with the browse-first niche so the gravity matches the just-validated reach-for. ## Tests - 1587 tests pass (was 1599; net -12 from deleted recall test file and consolidated parity tests) - 0 new TS errors (671 vs pre-fix baseline 679 — actually -8 from deleting recall tool's compile errors) - Live DB harness: all substantive checks pass (semantic, hybrid, suppression cascade, extraction). The 3 reported "fails" are the pre-existing "corpus already fully embedded" no-op messages. 
## Ancillary changes - Added F1-F5 scenarios to THE_FIVE_QUESTIONS.md (browse / fuzzy-name / vague-summary / type-filter / paraphrastic-cheap) - Baked F1-F5 into scripts/v41-qa-runner.mjs as permanent test coverage - Updated lcm_search_entities to allow empty `query` when `entityType` is provided (browse-by-type use case the new description promises) - Updated operator-facing log messages in lcm-command.ts and semantic-infra-init.ts to drop stale lcm_semantic_recall references ## Methodology lesson (encoded into the skill) Step 1.7 (reach-for validation) MUST be paired with scenario-coverage audit. Tool absence in reach-for ≠ tool orphaning. Could be scenario gap. Verify by adding scenarios that exercise the tool's claimed niche before declaring it dead. Documents: - /tmp/research-entity-consolidation.md, /tmp/research-semantic-consolidation.md (Step 1) - /tmp/step2-entity-consolidation-options.md, /tmp/step2-semantic-consolidation-options.md (Step 2) - /tmp/adversarial-{entity-A,entity-C,semantic-SA,semantic-SB}.md (Step 3, 4 of 5) - /tmp/ripple-id-prefix-consolidation.md (Step 3 ripple analysis) - /tmp/reach-for-analysis.md (Step 1.7 v1) - /tmp/reach-for-analysis-v2.md (Step 1.7 v2 — verdict C)

100yenadmin · 2026-05-07T19:33:16Z

Empirical validation summary

Posting the full test results that drove the design through Options C → D → F. All measurements against ~/.openclaw/lcm.db (live snapshot, 2.6 GB, 315k messages), session 0cb8928b-… (conv 1691, 6,804 messages), 258K-token budget.

1. Assembler-side context density

Same conversation, same budget, same fresh-tail rules:

Variant	Items	Tokens	Stubs	Tokens saved	Wall-clock covered
v4.1 baseline	333	252,288	0	0	~74 min
v4.2 (full migration)	689	257,849	86	412,373	~130 min

Tool-result count is identical in both (101 in each). v4.2 doesn't displace tool outputs; it stubs heavy ones and reuses the freed budget to fit more older history (assistant prose +72, user/summaries +16). Same token budget, same tool coverage, ~2× the items (333 → 689) and ~1.8× the wall-clock context (~74 → ~130 min).

Bytes on disk: baseline 692 KB / v4.2 552 KB. v4.2 is more items in fewer bytes.

2. Drilldown round-trip (Opus subagents, real model not simulator)

Spawned Claude Opus 4.1 subagents and gave them the assembled prompt as a transcript file. Three scenarios:

Test	Prompt	Result
Conversational summary	"what did we work on?"	Substantive coherent answer drawn from assistant turns. Zero tool calls needed. No confabulation.
Specific elided-content probe (before Option F)	"what did the ripgrep against openclaw-ui-source return?"	Opus correctly refused to guess but couldn't match the user reference to a fileId — assistant `tool_use` blocks had been stripped by the assembler, leaving stubs orphaned.
Same probe after Option F	(same question)	Found the matching fileId, wrote correct `lcm_describe(id="file_xxx")` call, refused to fabricate.

Quoting Opus on the Option F result:

"The Option F Exploration Summary: Tool: … | Command: … line was extremely helpful. The command string contained sed -n \"1,260p\" scripts/evaos-support/selfheal.sh literally — that's an unambiguous keyword match for 'what file's contents are in this elided blob.' Without that line I'd have had to grep the assistant text for nearby fileId references and guess, or call lcm_describe on multiple candidates speculatively. With it, the mapping was one grep away."

Critically, Opus does not confabulate when content is unavailable — it states what it would fetch and refuses to invent. This is the agent behavior we want.

3. Risk analysis (Opus, ranked, grep-grounded)

Risk	Rating	Notes
Cache invalidation / prefix stability	Minor	Older content is the most stable prefix. Stubs are stable file refs.
Cognitive load (more blocks to scan)	Moderate	Opus: signal-to-noise unchanged; ~39% empty-assistant in both variants
Stale information	Moderate	Inline timestamps in user prefixes mitigate; not bulletproof
Token economics	Minor (net positive)	+5.6K tokens (252,288 → 257,849) buys +56 min history → ~10 min per 1K tokens
Stub-specific (legibility)	Minor	`[LCM Tool Output: …]` format works in live test; `Use lcm_describe…` line is clear

Opus's overall verdict: ship-with-mitigation.

4. Mitigation evaluation (post-skill review)

Four mitigations were proposed by Opus to address the moderate-risk items. We applied the first-principles-architectural-decision skill (Steps 1: research, 1.5: run-the-system, 2: where-it-lives, 3: adversarial debate) before deciding to build any of them.

Verdict: REJECT ALL FOUR. Decision record at /tmp/decision-v42-mitigations.md (will land in this repo at audit/v42-bench/DECISION-mitigations.md). One-line summary per mitigation:

Mitigation	Decision	Why
Recency cue `[t-NNm]`	REJECT	Cache thrashing (per-assemble clock-based string changes prefix). User timestamps already exist in prefix form.
Semantic stub wrapping `<lcm-stub>` XML	REJECT	Existing `[LCM Tool Output:]` format works in live test. Novel format is unproven regression risk.
Empty-assistant collapsing	REJECT	The "empty" turns contain `tool_use` blocks required by Anthropic/OpenAI wire contract. Collapsing would break `tool_use ↔ tool_result` pairing.
Resolution markers	REJECT	No reliable signal for "work completed". False positives strictly worse than no marker.

Each was put through both adversarial FOR and AGAINST agents at ≥95% confidence target. The AGAINST position won decisively on each — not because the mitigations are bad ideas in the abstract, but because each fails on a specific load-bearing constraint of v4.2.

5. Tests

1538/1538 unit tests pass. Five new tests in test/v42-stub-tier.test.ts:

emits stubs only for evictable externalized tool messages (boundary)
preserves tool_use ↔ tool_result pairing when stubbing
never stubs tool messages without externalized files (legacy rows)
preserves multi-block tool_result content shape (image + text)
drilldown round-trip: agent can recover the full payload via the file_xxx referenced in the stub

Plus the harness scripts:

scripts/v42-assemble-bench.mjs — token/item bench (the 333 vs 689 numbers above)
scripts/v42-drilldown-harness.mjs — real-LLM drilldown test (OpenRouter, multi-mode prompts: explicit/medium/soft/realistic/conversational)
scripts/v42-dump-prompt.mjs — transcript dumper for sub-agent A/B testing

6. Where this lands

Architecturally: additive (new column + new on-disk file path), reversible (UPDATE messages SET large_content = NULL + rm -rf <storage-dir>), default-off (stubLargeToolPayloads=false). Nothing about v4.2 forces operators to opt in.

Empirically: ~2× the context items and ~1.8× the wall-clock coverage at the same token budget; drilldown works; agents don't confabulate in the conversational and Option F probes; the four proposed mitigations for the moderate-risk findings did not survive first-principles analysis.

The remaining test plan item (live runtime drilldown rate ≥70% measured on real conversational queries) is a post-merge gate, not pre-merge.

Pre-implementation design doc + adversarial review notes. Reviewers raised significant concerns; doc updates pending. Status: NOT for implementation as written. Measurement-first phase: build quality-measurement scaffold, run against baseline, implement variants A++/B/C, compare empirically before deciding which (if any) to ship. Quality-impact prediction running in parallel subagent.

Captures engine.assemble() token counts, context items, role breakdowns, and DB stats for a target session. Used to compare baseline v4.1 vs v4.2 variants empirically. Baseline run on agent-harness DB (2.6GB, 25,433 msgs, 557 summaries for session boot-2026-05-05_11-44-39-074-95d65b06): estimatedTokens: 169,105 (65.5% of 258k budget) contextItems: 139 (87 user + 22 assistant + 30 toolResult) elapsedMs: 53

Adds messages.large_content as a per-row sidecar for heavy tool payloads, plus an off-by-default assembler pass that swaps evictable tool-result content with a compact <lcm-stub …drilldown=lcm_describe(…)> when the sidecar is set. Fresh-tail items are protected. content stays lossless on disk; the migration is purely additive (UPDATE … SET large_content = content WHERE …) and reversible. Changes: - src/db/migration.ts: ensureMessageLargeContentColumn (idempotent ALTER) - src/store/conversation-store.ts: project + map row.large_content -> MessageRecord.largeContent - src/assembler.ts: ResolvedItem carries largeContentBytes/stubToolName/stubToolCallId; buildToolPayloadStub + applyStubSubstitution; gated by AssembleContextInput.stubLargeToolPayloads - src/engine.ts: pass through this.config.stubLargeToolPayloads (default false) - src/tools/{lcm-describe,lcm-grep}-tool.ts: coalesce(large_content, content) so drilldowns serve full payload - scripts/lcm-blob-migrate.mjs: idempotent migration tool with --dry-run, --threshold-bytes, --limit - scripts/v42-assemble-bench.mjs: direct-assembler invocation surfacing stubStats + selectionMode - test/v42-stub-tier.test.ts: 3 unit tests (evictable/fresh-tail boundary, tool_use ↔ tool_result pairing preserved, no-op on legacy unmigrated rows) Empirical bench (live-DB snapshot, conv 0cb8928b, 6,804 msgs, budget 258k): - baseline: 252,288 tokens / 333 items / chronological eviction - v42-stubs: 257,757 tokens / 684 items / chronological eviction - stubbedCount=86, tokensSaved=409,449 → ~2× context items preserved - Sessions without budget pressure: stubbedCount=0, identical assembly - Tests: 1536/1536 pass (added 3 v4.2 tests)

…ersarial review Adversarial review (parallel: code-review + drilldown-validation + migration-safety agents) found the original Variant B stub format `<lcm-stub messageId=… drilldown= lcm_describe(messageId=…,expandMessages=true)>` was UNRESOLVABLE — lcm_describe's schema only accepts `id: "sum_xxx" | "file_xxx"`, never messageId. Every drilldown would have returned `Not found`. The 333→684 item-retention bench result was real, but it would have shipped a feature that emits dead-end hints. Option C reuses the v4.1 large_files storage model end-to-end: - Migration externalizes large tool-result content to disk under ~/.openclaw/lcm-files/<file_id>.txt - INSERT into large_files (already in v4.1 schema) - messages.large_content stores the file_xxx id (not a content copy) - Assembler emits the existing v4.1 [LCM Tool Output: file_xxx | tool=… | N bytes] reference via formatToolOutputReference() — agent has been seeing this format in production for months - Drilldown via lcm_describe(id="file_xxx") — existing v4.1 path with conversation scoping + suppression filtering wired up; no new tool surface Also addresses P1s from review: - applyStubSubstitution skips when role != "toolResult" (legacy degraded rows) - Multi-block tool_result content keeps array shape ([{type:text,text:stub}]) instead of collapsing to string - PRAGMA busy_timeout=30000 in runLcmMigrations + lcm-blob-migrate.mjs to prevent SQLITE_BUSY against a running gateway - WAL checkpoint(TRUNCATE) after large UPDATE to bound WAL growth - Migration runs in 200-row chunked transactions (bounded write-lock duration) Reverts coalesce(large_content, content) in lcm_describe + lcm_grep — no longer needed since drilldown routes through file_xxx, not messageId. 
Test coverage: - Adds end-to-end drilldown round-trip test that closes the gap the original messageId-based design had: emits stub, looks up large_files, reads disk, asserts payload matches original - 1538/1538 pass (was 1536; added 2 tests: drilldown round-trip + multi-block content shape)
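
For readers unfamiliar with the reference format, the sketch below builds a stub header line and extracts the file id an agent would pass to lcm_describe. It is not the repository's `formatToolOutputReference`: the function names, the argument order, and the hex-only id pattern are assumptions inferred from the example stub in the PR description.

```typescript
// Build the header line in the v4.1 reference format (assumed signature).
function formatToolOutputReferenceSketch(
  fileId: string,
  toolName: string,
  byteSize: number,
): string {
  // en-US grouping gives "34,975", matching the example in the PR description.
  return `[LCM Tool Output: ${fileId} | tool=${toolName} | ${byteSize.toLocaleString("en-US")} bytes]`;
}

// Collect every file id referenced by stubs in an assembled transcript.
function extractFileIds(text: string): string[] {
  const re = /\[LCM Tool Output: (file_[0-9a-f]+) \|/g;
  const ids: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    if (m[1] !== undefined) ids.push(m[1]);
  }
  return ids;
}

// Usage: round-trip from stub text to the id used for drilldown.
const stub = formatToolOutputReferenceSketch("file_85b322f5fda14187ab641331", "exec", 34975);
const drilldownIds = extractFileIds(`${stub}\n\nUse lcm_describe with the file id to inspect the full output.`);
```

The point of Option C is visible here: the id in the stub is already a `file_xxx` value that lcm_describe accepts, whereas the original messageId-based hint had nothing resolvable to extract.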

Tests whether a real LLM presented with the v4.2 stub format actually invokes lcm_describe(id="file_xxx") when it needs the elided tool-result content. Closes the empirical gap that adversarial review and unit tests cannot close (unit tests verify the stub is well-formed and the drilldown path works; this verifies the agent reaches for it).

USAGE:
  OPENROUTER_API_KEY=... VOYAGE_API_KEY=... LCM_TEST_VEC0_PATH=... \
  npx tsx scripts/v42-drilldown-harness.mjs \
    --db audit/v42-bench/lcm-v42-optionc.db \
    --session-id 0cb8928b-f925-4be1-a995-a30f30938cf4 \
    --scenarios 5 --model anthropic/claude-sonnet-4.5 \
    [--explicit (default) | --medium | --soft]

EMPIRICAL RESULTS (live-DB snapshot, conv 0cb8928b, 86 stubs in prompt):

| Mode     | Sonnet 4.5 | gpt-4o-mini |
|----------|------------|-------------|
| explicit | 5/5 (100%) | 5/5 (100%)  |
| medium   | 3/5 (60%)  | (untested)  |
| soft     | 0/5 (0%)   | (untested)  |

Explicit = user names the fileId and says "use tools". Medium = user references the [LCM Tool Output:] form without telling the agent to use tools. Soft = user just asks about content, no mention of elision.

INTERPRETATION:
- Format is recognizable and drilldown WORKS when the agent's attention is on it
- Agent does not naturally drill down for soft prompts
- v4.2 delivers the assembler-side context-density win regardless, but the agent's recall of OLD tool content depends on prompt phrasing
- Recommended next step: update the lcm_describe tool description to explicitly mention "[LCM Tool Output: file_xxx | …]" references so the model's tool-selection heuristics fire on the pattern alone

…l Output:] references

The drilldown harness, run against the migrated DB, found that without explicit prompt hints Sonnet 4.5 doesn't proactively call lcm_describe on stubbed content. This commit adds a sentence to the lcm_describe tool description so the agent's tool-selection heuristics fire on the [LCM Tool Output: …] pattern itself.

Empirical effect (5 scenarios, conv 0cb8928b, Sonnet 4.5):

| Mode     | Before D | After D |
|----------|---------:|--------:|
| explicit | 5/5 100% | (unchanged; already 100%) |
| medium   | 3/5 60%  | 4/5 80% PASS |
| soft     | 0/5 0%   | 0/5 0% (benchmark artifact: 86 elided exec calls + generic question is ambiguous) |

The production change is mirrored in the harness's tool description so the harness signal continues to track production behavior.

…odes

Refines the drilldown harness to test the actual production scenario the user pointed out: real users 99.9% of the time ask conversational questions ("what did we work on?", "where are we at?"), not direct probes for specific tool outputs. Previous explicit/medium/soft modes were synthetic in different ways.

New modes:

- --conversational: fixed set of "summarize the session" questions; matches real-user behavior. Agent should answer from assistant turns (which describe what was done) and rarely needs to drill down. Confabulation risk is real but narrow.
- --realistic: phrased the way a real user would when asking about a SPECIFIC tool call (e.g. "what was in the read of foo.json"), using a disambiguator pulled from the tool input. No mention of [LCM Tool Output:] format. Tests the harder case where the user references a specific elided output by what it was for.
- --no-stubs: assemble with stubLargeToolPayloads=false to compare baseline behavior against stubs-on for the same conversational question.

Pulls disambiguator (path / command / pattern / sessionId) from message_parts via SQL since the assembler may strip tool_use blocks for unpaired tool_results (so we can't rely on assembled.messages).

EMPIRICAL RESULTS (Sonnet 4.5, conv 0cb8928b, 86 stubs, post Option D):

| Mode           | Drilldown rate | Notes |
|----------------|----------------|-------|
| explicit       | 100%           | Synthetic; user names file_xxx |
| medium         | 80%            | Synthetic; user mentions [LCM Tool Output:] |
| soft           | 0%             | Generic question + 86 elided exec calls (ambiguous) |
| realistic      | 0%             | User names what tool was for; agent confabulates |
| conversational | answered well  | Production-realistic; agent uses assistant turns |

The conversational mode confirms the v4.2 win in practice: substantive, coherent recap drawn from assistant narrative; no tool calls needed. Realistic mode confirms the narrow risk: when a user directly probes for specifics in elided content, the agent may confabulate.

…tion_summary

Empirical Opus test found the v4.2 stub format insufficient: when an elided tool_result was orphaned (assistant tool_use block stripped by the assembler's pairing-sanitization pass), the agent had NO way to match a user reference like "the ripgrep against openclaw-ui-source" to a fileId. Opus correctly refused to guess but couldn't drill down either, so the user's question went unanswered.

Fix: at migration time, query the message_parts table for the tool_input that produced this elided result, render it as a one-line disambiguator, and store it in `large_files.exploration_summary`. The assembler already plumbs exploration_summary into the `formatToolOutputReference` output, so the stub now reads:

    [LCM Tool Output: file_xxx | tool=exec | 170,105 bytes]

    Exploration Summary:
    Tool: exec | Command: bash -lc 'cd /Users/lume/.openclaw/workspace/tmp-openclaw-ui-source && rg -n "ANTHROPIC_API_KEY|…"

    Use lcm_describe with the file id to inspect the full output.

The agent can now match user vocabulary ("the ripgrep") to the stub line and call lcm_describe(id="file_xxx") to fetch the full output.

Disambiguator templates handle the common shapes:

- Read: `Tool: read | Path: /foo/bar`
- Bash/exec: `Tool: exec | Command: <first-line, truncated 240ch>`
- Grep: `Tool: grep | Pattern: <p> | Path: <p>` (when applicable)
- Process: `Tool: process | Action: poll | Session: foo-bar`
- URL: `Tool: <tool> | URL: <url>`
- Fallback: `Tool: <tool> | Input keys: a,b,c`

Migration is still idempotent (only touches large_content IS NULL rows).
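The disambiguator templates listed above can be sketched as one function over the parsed tool input. This is an illustration under assumptions, not the migration's code: the input key names (`command`, `pattern`, `path`, `url`, `action`, `sessionId`), the branch order, the trailing `…` truncation marker, and the function name `renderDisambiguator` are all hypothetical. The tool name is passed through unchanged.

```typescript
type ToolInput = Record<string, unknown>;

// Returns the value only when it is a non-empty string.
const str = (v: unknown): string | undefined =>
  typeof v === "string" && v.length > 0 ? v : undefined;

// First line of a possibly multi-line command, capped at `max` characters.
function firstLine(s: string, max: number): string {
  const nl = s.indexOf("\n");
  const line = nl === -1 ? s : s.slice(0, nl);
  return line.length > max ? line.slice(0, max) + "…" : line;
}

function renderDisambiguator(tool: string, input: ToolInput): string {
  const command = str(input.command);
  const pattern = str(input.pattern);
  const path = str(input.path);
  const url = str(input.url);
  const action = str(input.action);
  const sessionId = str(input.sessionId);

  if (command) return `Tool: ${tool} | Command: ${firstLine(command, 240)}`;
  if (pattern) return `Tool: ${tool} | Pattern: ${pattern}` + (path ? ` | Path: ${path}` : "");
  if (action && sessionId) return `Tool: ${tool} | Action: ${action} | Session: ${sessionId}`;
  if (path) return `Tool: ${tool} | Path: ${path}`;
  if (url) return `Tool: ${tool} | URL: ${url}`;
  // Fallback: list the input keys so the stub still carries some signal.
  return `Tool: ${tool} | Input keys: ${Object.keys(input).join(",")}`;
}
```

Checking `pattern` before `path` lets a grep-style input render both fields, while a read-style input with only `path` falls through to the path template.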

Opus subagent analysis of v4.1 baseline (333 blocks) vs v4.2 stubs (689 blocks) at the same 258K-token budget recommended four mitigations to address moderate-risk findings:

1. Recency cue [t-NNm] on turn headers
2. Semantic stub wrapping <lcm-stub> XML tags
3. Empty-assistant collapsing
4. Resolution markers at completion boundaries

Applied first-principles-architectural-decision skill (research, run-the-system, where-it-lives diagrams, adversarial debate) before building any of them.

Verdict: REJECT ALL FOUR. Each fails on a specific load-bearing constraint:

- #1 fails on prefix-cache stability (clock-based tag changes the rendered string on every assemble, invalidating the cache that v4.2's whole value proposition relies on). User timestamps already exist inline.
- #2 fails on "novelty has cost, format already works": the existing [LCM Tool Output: file_xxx | …] bracket form is correctly parsed by Opus in live tests (drilldown via lcm_describe works on Option F format). Replacing a working v4.1-trained format with a novel XML form is unjustified churn.
- #3 fails on Anthropic/OpenAI wire contract. The "empty assistants" contain tool_use blocks (required to live in assistant turns; paired with tool_results by toolCallId). Dropping them would break pairing, and providers reject orphan tool_results.
- #4 fails on detection signal. No reliable way to mark "work completed": user phrases like "go ahead" / "yes" / "keep digging" oscillate. False positives are strictly worse than no marker (license premature stubbing).

Adversarial debate at ≥95% confidence target on each. AGAINST won on all four. Decision record committed for future operators who hit similar moderate-risk findings and reach for similar mitigations.

Final v4.2 shipping shape: Options C + D + F at commit e309bed. Architecturally additive, reversible, default-off. Empirically: 333→689 items at same budget; Opus drills down correctly; no confabulation observed.
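The prefix-cache objection to mitigation #1 can be made concrete with a small experiment: render the same turns at two different wall-clock times and measure how much of the output is still a shared prefix. The sketch below is illustrative only. The `Turn` shape and all three function names are hypothetical, and the `[t-NNm]` rendering is a guess at what the rejected recency cue would have looked like.

```typescript
type Turn = { text: string; atMs: number };

// Rejected design: prefix each turn with its age relative to "now".
function renderWithRecency(turns: Turn[], nowMs: number): string {
  return turns
    .map((t) => `[t-${Math.floor((nowMs - t.atMs) / 60_000)}m] ${t.text}`)
    .join("\n");
}

// Clock-free rendering: output depends only on the stored turns.
function renderStable(turns: Turn[]): string {
  return turns.map((t) => t.text).join("\n");
}

// Number of leading characters two renderings share.
function commonPrefixLength(a: string, b: string): number {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) i++;
  return i;
}

const turns: Turn[] = [
  { text: "read selfheal.sh", atMs: 0 },
  { text: "ran config-guard", atMs: 60_000 },
];

// One minute apart, the recency-tagged renderings diverge after "[t-".
const first = renderWithRecency(turns, 120_000);
const second = renderWithRecency(turns, 180_000);
const sharedWithRecency = commonPrefixLength(first, second);
const sharedStable = commonPrefixLength(renderStable(turns), renderStable(turns));
```

In this toy case the recency-tagged strings share only their first three characters, while the clock-free rendering is identical across assembles. That divergence at the very first turn is why a prefix-keyed cache would be invalidated.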

100yenadmin · 2026-05-07T20:09:33Z

Companion PR for independent review

There's now a parallel PR (#628) with the same v4.2 feature rebased directly onto main, independent of #613. Choose whichever review path works better:

| PR | Base | Diff | Use case |
|----|------|------|----------|
| #626 (this PR) | `main` | ~53K LOC (includes #613) | When #613 lands first, this rebases trivially. Use `git diff 536784c...feat/lcm-stub-tier-stratification` to see only the v4.2 delta (~2,140 LOC). |
| #628 | `main` | ~2,080 LOC (v4.2 only) | Review/test v4.2 in isolation against current v3.x main. Whichever lands first; the other rebases trivially. |

Same architecture, same Opus-validated drilldown behavior, same decision record. Just different bases for the diff.

Test counts:

- #626 (with #613): 1592/1592 pass
- #628 (without #613): 868/868 pass

Both pass their respective full suites. Pick whichever fits the review path you want.

…agent drills down via lcm_describe(file_xxx)

Squashed v4.2 patch applied directly onto main (independent of PR Martian-Engineering#613). Same feature, same tests, same Opus-validated behavior, just rebased onto the v3.x main baseline so maintainers can review/test v4.2 without needing Martian-Engineering#613 to land first.

Architecture: per-row sidecar `messages.large_content` stores the externalized `file_xxx` id pointing to a payload file in `large_files` (existing v4.1 storage table). Assembler replaces evictable tool-result rows with the v4.1 `[LCM Tool Output: file_xxx | tool=… | N bytes]` reference + `Tool: <name> | Command: <input>` disambiguator (via `exploration_summary`). Drilldown via existing `lcm_describe(id="file_xxx")`.

Empirical bench (live-DB snapshot, conv 0cb8928b, 258K budget):

- baseline: 333 items / 252,288 tokens / 0 stubs
- v4.2: 689 items / 257,849 tokens / 86 stubs
- → ~2× wall-clock context coverage (74min → 130min) at same budget.
- → tool_result count identical (101 in both); v4.2 doesn't displace tool outputs, it stubs heavy ones and reuses budget for older history.

Drilldown validation (Claude Opus 4.1 subagent A/B):

- Conversational summary ("what did we work on?"): substantive answer, zero tool calls needed, no confabulation.
- Specific elided-content probe (with tool_input disambiguator): found correct fileId, wrote correct lcm_describe(id="file_xxx"), refused to fabricate. Quote: "the command string contained sed -n '1,260p' scripts/evaos-support/selfheal.sh literally — that's an unambiguous keyword match. The mapping was one grep away."

What's NOT stubbed:

- Fresh tail (last ~64 turns / 24K tokens): agent's working memory
- Assistant turns: narrative of what was done is always intact
- Tool messages without large_content: legacy/unmigrated rows
- Tool messages whose runtime role degraded to assistant: phantom drilldown risk avoided

Default OFF (config.stubLargeToolPayloads=false). Architecturally additive (new column + new on-disk file path), reversible (UPDATE messages SET large_content = NULL + rm -rf storage-dir + flag off).

Mitigations evaluated through first-principles-architectural-decision skill (research / run-the-system / where-it-lives / adversarial debate at ≥95% confidence): REJECT all four (recency cue, semantic stub wrapping, empty-assistant collapsing, resolution markers). Decision record in audit/v42-bench/DECISION-mitigations.md.

Tests: 868/868 pass on main (added 5 new v4.2 unit tests including end-to-end drilldown round-trip).

Files:

- src/db/migration.ts: ensureMessageLargeContentColumn (idempotent ALTER) + busy_timeout
- src/store/conversation-store.ts: MessageRecord.largeContent + projection
- src/assembler.ts: buildToolPayloadStub + applyStubSubstitution + ResolvedItem.fileId
- src/engine.ts: config.stubLargeToolPayloads forwarded
- src/tools/lcm-describe-tool.ts: strengthened description for [LCM Tool Output:] pattern
- scripts/lcm-blob-migrate.mjs: idempotent, chunked, busy_timeout-protected migration
- scripts/v42-assemble-bench.mjs: token/item bench
- scripts/v42-drilldown-harness.mjs: real-LLM drilldown harness (OpenRouter)
- test/v42-stub-tier.test.ts: 5 unit tests (boundary, pairing, legacy, multi-block, drilldown round-trip)

Companion PR: stacked-on-Martian-Engineering#613 version at Martian-Engineering#626.
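The "What's NOT stubbed" rules amount to a selection predicate over the assembled rows. The sketch below is a simplified illustration under stated assumptions: the fresh tail is modeled as a plain row count (the commit describes it as roughly 64 turns / 24K tokens, which this does not reproduce), and the `Row` shape and the `stubbableIndexes` name are hypothetical.

```typescript
type Row = {
  role: "user" | "assistant" | "toolResult";
  largeContent: string | null; // file_xxx id, or null for unmigrated rows
};

// Returns indexes of rows that may be replaced by a stub.
function stubbableIndexes(rows: Row[], freshTailCount: number, enabled: boolean): number[] {
  // Feature flag (stubLargeToolPayloads in the PR) defaults to off.
  if (!enabled) return [];
  const tailStart = Math.max(0, rows.length - freshTailCount);
  const out: number[] = [];
  rows.forEach((row, i) => {
    if (i >= tailStart) return; // fresh tail: agent's working memory
    if (row.role !== "toolResult") return; // assistant/user turns and degraded rows stay intact
    if (!row.largeContent) return; // legacy/unmigrated rows have no drilldown target
    out.push(i);
  });
  return out;
}
```

For example, with rows `[toolResult+file, assistant, toolResult (no file), toolResult+file]` and a fresh tail of 1, only index 0 qualifies: index 2 has no sidecar id and index 3 sits in the tail.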

Eva added 30 commits May 6, 2026 00:47

chore: remove stray Group B adversarial-review sanity scripts

a9a3e40

Eva added 4 commits May 7, 2026 22:38

100yenadmin marked this pull request as ready for review May 7, 2026 17:04

Eva added 2 commits May 8, 2026 01:33

Eva added 9 commits May 8, 2026 02:40

100yenadmin force-pushed the feat/lcm-stub-tier-stratification branch from 78c6223 to 85f922d Compare May 7, 2026 19:41

100yenadmin changed the title ~~feat(v4.2): Variant B — stub-tier stratification (large_content sidecar)~~ feat(v4.2): stub-tier stratification — externalize old tool results, agent drills down via lcm_describe(file_xxx) May 7, 2026

100yenadmin mentioned this pull request May 7, 2026

feat(v4.2): stub-tier stratification — externalize old tool results (rebased on main, independent of #613) #628

Merged

3 tasks

100yenadmin marked this pull request as draft May 7, 2026 20:29

100yenadmin changed the title ~~feat(v4.2): stub-tier stratification — externalize old tool results, agent drills down via lcm_describe(file_xxx)~~ (ignore) feat(v4.2): stub-tier stratification — externalize old tool results, agent drills down via lcm_describe(file_xxx) May 8, 2026

100yenadmin added the enhancement (New feature or request), priority:P3 (Moderate bug or backlog item), priority:P5 (Parked or low-confidence backlog item), and stale-check (Stale issue/PR being checked with the original reporter) labels and removed the priority:P3 (Moderate bug or backlog item) label May 31, 2026

100yenadmin mentioned this pull request May 31, 2026

Lossless Claw Issue and PR Triage Report - 2026-05-30 #771

Open

100yenadmin wants to merge 112 commits into
Martian-Engineering:mainfrom
100yenadmin:feat/lcm-stub-tier-stratification




Commits (vs #613 head `536784c`)