feat(rollup): open loop and blocker state tracking#9
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b1acbca2be
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| FROM lcm_episodes | ||
| WHERE conversation_id = ? | ||
| AND last_day = ? | ||
| AND status IN ('active', 'stale') |
There was a problem hiding this comment.
Include completed episodes in end-of-day lookup
syncConversationEpisodes uses getEpisodeEndingOnDay to decide whether to extend or create an episode, but this query excludes completed rows. After an episode is completed, the next sync pass over the same historical days cannot find it and creates a new duplicate episode for the same day span, so repeated rebuilds keep adding duplicate completed episodes and skew episode history.
Useful? React with 👍 / 👎.
| topic: Type.Optional( | ||
| Type.String({ | ||
| description: "Filter results to a specific topic or keyword", | ||
| }), |
There was a problem hiding this comment.
Apply the declared topic filter in lcm_recent
The schema now advertises a topic parameter as a keyword filter, but the execution path never reads p.topic to narrow rollups, tracker output, episode search, or fallback retrieval. Calls like lcm_recent({ period: "week", topic: "deploy" }) therefore return unfiltered results, which is a functional mismatch with the tool contract and can mislead callers expecting scoped output.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
Adds persistent cross-day tracking for rollup “Key Items” (blockers/open items/decisions/questions) and introduces an “episode” concept for grouping consecutive daily rollups, surfaced via new lcm_recent period modes.
Changes:
- Create/resolve rollup trackers from daily rollup content and expose “open_items” / “blockers” views via
lcm_recent. - Add episode detection + storage and expose
lcm_recentmodes for listing/searching episodes. - Extend daily rollup rendering to include an “Open Items” key-items line and add initial tracker tests.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| test/lcm-recent-trackers.test.ts | Adds vitest coverage for tracker extraction + lcm_recent blockers output. |
| src/tracker-extractor.ts | Implements parsing/classification of rollup content into trackers and resolution detection. |
| src/tools/lcm-recent-tool.ts | Extends lcm_recent to support trackers and episode modes. |
| src/store/tracker-store.ts | Adds persistence layer for tracker lifecycle (open/resolved/stale) queries. |
| src/store/index.ts | Exports tracker store/types from store barrel. |
| src/store/episode-store.ts | Adds persistence layer for episodes. |
| src/rollup-builder.ts | Emits “Open Items” in daily rollups; runs tracker extraction + episode sync during build. |
| src/episode-detector.ts | Adds keyword-based episode detection across consecutive day rollups. |
| src/db/migration.ts | Adds SQLite tables + indexes for trackers and episodes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| period: Type.Optional( | ||
| Type.String({ | ||
| description: | ||
| 'Time period: "today", "yesterday", "morning", "afternoon", "evening", "7d", "week", "month", "30d", "open_items", "blockers", "episodes", "episode:keyword", "Nh" (up to 72h, e.g. "6h"), or "date:YYYY-MM-DD"', | ||
| }), | ||
| ), |
| period: Type.Optional( | ||
| Type.String({ | ||
| description: | ||
| 'Time period: "today", "yesterday", "morning", "afternoon", "evening", "7d", "week", "month", "30d", "open_items", "blockers", "episodes", "episode:keyword", "Nh" (up to 72h, e.g. "6h"), or "date:YYYY-MM-DD"', |
| topic: Type.Optional( | ||
| Type.String({ | ||
| description: "Filter results to a specific topic or keyword", | ||
| }), | ||
| ), |
| if (resolution.trackerKind) { | ||
| const startDay = getZonedDayString(resolution.start, timezone); | ||
| const openTrackers = trackerStore | ||
| .getOpenTrackers(conversationId, resolution.trackerKind) | ||
| .filter((tracker) => tracker.source_day >= startDay); | ||
|
|
| lines.push(`## Recent Activity: ${resolution.label}`); | ||
| lines.push( | ||
| `**Period:** ${formatDisplayTime(resolution.start, timezone)} — ${formatDisplayTime(resolution.end, timezone)}`, | ||
| ); | ||
| lines.push("**Status:** ready"); | ||
| lines.push(`**Open items:** ${openTrackers.length}`); | ||
| lines.push(""); |
| } | ||
|
|
||
| function stripBulletPrefix(value: string): string { | ||
| return value.replace(/^[-*•\d.)\s\[]+/, "").replace(/^\d{2}:\d{2}\]\s*/, "").trim(); |
| syncConversationEpisodes(conversationId: number, now: Date = new Date()): EpisodeDetectionResult { | ||
| const result: EpisodeDetectionResult = { created: 0, extended: 0, completed: 0, stale: 0 }; | ||
| const dailyRollups = this.rollupStore | ||
| .listRollups(conversationId, "day", 365) | ||
| .filter((rollup) => rollup.status === "ready" || rollup.status === "stale") | ||
| .sort((left, right) => left.period_key.localeCompare(right.period_key)); |
| if (resolution.mode === "episodes") { | ||
| const episodes = episodeStore.getActiveEpisodes(conversationId); | ||
| const lines: string[] = []; | ||
| lines.push("## Active Episodes"); | ||
| lines.push(`**Conversation:** ${conversationId}`); | ||
| lines.push(`**Count:** ${episodes.length}`); | ||
| lines.push(""); | ||
| if (episodes.length === 0) { | ||
| lines.push("No active or stale episodes found."); | ||
| } else { | ||
| for (const episode of episodes) { | ||
| const keywords = parseJsonStringArray(episode.keywords); | ||
| lines.push(`### ${episode.title}`); | ||
| lines.push(`- Status: ${episode.status}`); | ||
| lines.push(`- Duration: ${episode.day_count} day(s)`); | ||
| lines.push(`- Days: ${formatDayRange(episode.first_day, episode.last_day)}`); | ||
| lines.push(`- Last activity: ${episode.last_day}`); | ||
| lines.push(`- Keywords: ${keywords.length > 0 ? keywords.join(", ") : "None"}`); | ||
| lines.push(""); | ||
| } | ||
| } | ||
| return { | ||
| content: [{ type: "text", text: lines.join("\n") }], | ||
| details: { | ||
| status: "ready", | ||
| episodeCount: episodes.length, | ||
| }, | ||
| }; | ||
| } | ||
|
|
||
| if (resolution.mode === "episode") { | ||
| const episode = episodeStore.searchEpisodes(conversationId, resolution.episodeKeyword ?? "")[0] ?? null; | ||
| if (!episode) { | ||
| return jsonResult({ error: `No episode matched keyword \"${resolution.episodeKeyword}\".` }); | ||
| } | ||
|
|
||
| const rollups = parseJsonStringArray(episode.rollup_ids) | ||
| .map((rollupId) => rollupStore.getRollupById(rollupId)) | ||
| .filter((rollup): rollup is NonNullable<typeof rollup> => Boolean(rollup)) | ||
| .sort((left, right) => left.period_key.localeCompare(right.period_key)); | ||
| const keywords = parseJsonStringArray(episode.keywords); | ||
| const lines: string[] = []; | ||
| lines.push(`## Episode: ${episode.title}`); | ||
| lines.push(`**Status:** ${episode.status}`); | ||
| lines.push(`**Duration:** ${episode.day_count} day(s)`); | ||
| lines.push(`**Days:** ${formatDayRange(episode.first_day, episode.last_day)}`); | ||
| lines.push(`**Keywords:** ${keywords.length > 0 ? keywords.join(", ") : "None"}`); | ||
| lines.push(""); | ||
| if (rollups.length === 0) { | ||
| lines.push("No rollups were attached to this episode."); | ||
| } else { | ||
| lines.push("## Day-by-day rollups"); | ||
| lines.push(""); | ||
| for (const rollup of rollups) { | ||
| lines.push(`### ${rollup.period_key}`); | ||
| lines.push(rollup.content.trim()); | ||
| lines.push(""); | ||
| } | ||
| } | ||
| return { | ||
| content: [{ type: "text", text: lines.join("\n") }], | ||
| details: { | ||
| status: "ready", | ||
| episodeId: episode.episode_id, | ||
| rollupCount: rollups.length, | ||
| }, | ||
| }; | ||
| } |
| AND ( | ||
| (source_day >= ? AND source_day <= ?) | ||
| OR (resolved_day IS NOT NULL AND resolved_day >= ? AND resolved_day <= ?) | ||
| ) | ||
| ORDER BY source_day ASC, created_at ASC`, | ||
| ) | ||
| .all(conversationId, startDay, endDay, startDay, endDay) as TrackerRow[]; |
There was a problem hiding this comment.
🚩 Missing changeset for new user-facing features
AGENTS.md requires that PRs changing user-facing behavior include a .changeset/*.md file. This PR introduces new user-facing tool capabilities (episodes, trackers, new period types in lcm_recent), but the existing changeset wise-bears-doubt.md describes an unrelated change ("Externalize inline base64 images..."). A new changeset with at least a minor bump should be added for this feature work. Per AGENTS.md: "Treat a PR as not release-ready until the changeset question has been answered."
Was this helpful? React with 👍 or 👎 to provide feedback.
| function toDayKey(date: Date): string { | ||
| return date.toISOString().slice(0, 10); | ||
| } |
There was a problem hiding this comment.
🔴 toDayKey uses UTC instead of configured timezone, causing incorrect stale episode detection
In episode-detector.ts:253-255, toDayKey converts a Date to a day string using UTC (date.toISOString().slice(0, 10)). However, episode.last_day values come from rollup period_key, which is computed using timezone-aware getLocalDateKey(date, timezone) at src/rollup-builder.ts:298-305. The EpisodeDetector class has no access to the timezone — it's not passed in the constructor or syncConversationEpisodes (line 41). The caller at src/rollup-builder.ts:155-156 does not pass the timezone either.
For non-UTC timezones, toDayKey(now) can differ from the true local day by ±1 day. For example, at 11 PM in America/New_York (which is 3 AM UTC the next day), toDayKey(now) returns "2026-04-14" while the local day is "2026-04-13". This causes episodes to be marked stale prematurely (1 day early) or not marked stale when they should be.
Prompt for agents
The EpisodeDetector.syncConversationEpisodes method at line 113 computes dayGap(episode.last_day, toDayKey(now)), where toDayKey uses UTC (date.toISOString().slice(0,10)). But episode.last_day comes from rollup period_key which uses the configured timezone via getLocalDateKey in rollup-builder.ts.
To fix: either pass the timezone to EpisodeDetector (add it to the constructor or the syncConversationEpisodes method), or import getLocalDateKey from rollup-builder.ts and use it instead of toDayKey. The caller in rollup-builder.ts:154-156 has this.config.timezone available and should pass it through.
Was this helpful? React with 👍 or 👎 to provide feedback.
| throw new Error( | ||
| 'period must be one of "today", "yesterday", "7d", "week", "month", "30d", or "date:YYYY-MM-DD".', | ||
| 'period must be one of "today", "yesterday", "morning", "afternoon", "evening", "7d", "week", "month", "30d", "open_items", "blockers", "episodes", "episode:keyword", "Nh" (up to 72h), or "date:YYYY-MM-DD".', | ||
| ); |
There was a problem hiding this comment.
🔴 Schema advertises "morning", "afternoon", "evening", and "Nh" periods that are not implemented in resolvePeriod
The LcmRecentSchema description at line 17 and the error message at line 302 both list "morning", "afternoon", "evening", and "Nh" (e.g., "6h") as valid period values. However, resolvePeriod has no handler for any of these — they all fall through to the throw at line 301-303. Since an LLM agent reads the tool's description to decide which parameter values to use, it will attempt these documented periods and receive errors every time.
Prompt for agents
The resolvePeriod function in lcm-recent-tool.ts needs handlers for the periods advertised in the schema description (line 17) and error message (line 302): "morning", "afternoon", "evening", and the "Nh" hour-based pattern (e.g. "6h", up to 72h).
For time-of-day periods, compute appropriate start/end times in the user timezone. For the Nh pattern, match with a regex like /^(\d{1,2})h$/ and compute a window from (now - N hours) to now.
Alternatively, if these periods are not yet ready, remove them from the schema description and the error message so LLM agents do not attempt to use them.
Was this helpful? React with 👍 or 👎 to provide feedback.
| period: Type.Optional( | ||
| Type.String({ | ||
| description: | ||
| 'Time period: "today", "yesterday", "morning", "afternoon", "evening", "7d", "week", "month", "30d", "open_items", "blockers", "episodes", "episode:keyword", "Nh" (up to 72h, e.g. "6h"), or "date:YYYY-MM-DD"', | ||
| }), |
There was a problem hiding this comment.
🔴 period is marked Optional in schema but resolvePeriod has no default for empty/missing value
The period field was changed from Type.String (required) to Type.Optional(Type.String(...)) at lines 14-18. When omitted, the execute function falls through to resolvePeriod(String(p.period ?? ""), timezone) at line 405, which calls resolvePeriod("", timezone). The empty string matches no branch and throws an error. The tool gracefully returns the error message, but an LLM agent seeing the parameter as optional will sometimes omit it, expecting a default like "today". This makes every no-period call fail.
| period: Type.Optional( | |
| Type.String({ | |
| description: | |
| 'Time period: "today", "yesterday", "morning", "afternoon", "evening", "7d", "week", "month", "30d", "open_items", "blockers", "episodes", "episode:keyword", "Nh" (up to 72h, e.g. "6h"), or "date:YYYY-MM-DD"', | |
| }), | |
| period: Type.Optional( | |
| Type.String({ | |
| description: | |
| 'Time period: "today", "yesterday", "morning", "afternoon", "evening", "7d", "week", "month", "30d", "open_items", "blockers", "episodes", "episode:keyword", "Nh" (up to 72h, e.g. "6h"), or "date:YYYY-MM-DD". Defaults to "today".', | |
| }), | |
| ), |
Was this helpful? React with 👍 or 👎 to provide feedback.
| topic: Type.Optional( | ||
| Type.String({ | ||
| description: "Filter results to a specific topic or keyword", | ||
| }), |
There was a problem hiding this comment.
🟡 topic parameter declared in schema but never used in execute function
The topic parameter is defined at lines 20-23 in LcmRecentSchema with description: "Filter results to a specific topic or keyword". However, p.topic is never referenced anywhere in the execute function — it's completely ignored. If an LLM agent passes topic: "authentication" expecting filtered results, it receives unfiltered output with no indication that the filter was ignored.
Prompt for agents
The topic parameter is declared in LcmRecentSchema at line 20-23 but never read or applied in the execute function. Either implement topic filtering (e.g. by filtering rollup content or passing it to the episode/tracker search), or remove the parameter from the schema until it's implemented. Having an unused parameter in the tool schema misleads LLM agents into thinking filtering is active.
Was this helpful? React with 👍 or 👎 to provide feedback.
|
|
||
| const keywordCache = new Map<string, string[]>(); | ||
|
|
||
| for (let index = 0; index < dailyRollups.length - 1; index += 1) { |
There was a problem hiding this comment.
📝 Info: EpisodeDetector.syncConversationEpisodes does not process the last rollup in isolation
The for loop at src/episode-detector.ts:54 iterates index < dailyRollups.length - 1, processing pairs (current, next). This means the last rollup is only ever processed as next, never as current. If the very last rollup in the list represents the end of an episode (no subsequent day), the episode won't be completed in the loop body. However, this is handled by the stale detection loop at lines 109-118, which marks episodes as stale after 3 days without activity. This is acceptable behavior — an episode ending on the last known rollup day stays active for a short grace period before being marked stale.
Was this helpful? React with 👍 or 👎 to provide feedback.
| function fuzzyOverlap(left: string, right: string): number { | ||
| const leftWords = tokenize(left); | ||
| const rightWords = tokenize(right); | ||
| if (leftWords.size === 0 || rightWords.size === 0) { | ||
| return 0; | ||
| } | ||
| let overlap = 0; | ||
| for (const word of leftWords) { | ||
| if (rightWords.has(word)) { | ||
| overlap += 1; | ||
| } | ||
| } | ||
| return overlap / leftWords.size; |
There was a problem hiding this comment.
📝 Info: fuzzyOverlap is asymmetric — only divides by left set size
At src/tracker-extractor.ts:186, fuzzyOverlap returns overlap / leftWords.size, making it asymmetric. This means fuzzyOverlap(A, B) can differ from fuzzyOverlap(B, A). This matters in two call sites: isSameTracker (line 167) always puts the existing tracker content as left, and isResolvedByLine (line 171) puts the tracker content as left and the resolved line as right. This means a short tracker content will have a high overlap score even if the resolved line has many extra words, which is the desired behavior (a short blocker description is considered resolved if most of its words appear in the resolution line). The asymmetry appears intentional for this use case, not a bug.
Was this helpful? React with 👍 or 👎 to provide feedback.
| extractTrackersFromRollup({ | ||
| conversationId, | ||
| dateKey, | ||
| rollupId, | ||
| rollupContent: draft.content, | ||
| trackerStore: new TrackerStore(this.store.db), | ||
| }); |
There was a problem hiding this comment.
📝 Info: New TrackerStore and EpisodeStore instantiated on every buildDayRollup / buildDailyRollups call
In src/rollup-builder.ts:271, a new TrackerStore(this.store.db) is created inside the transaction callback for every buildDayRollup call. Similarly at line 155, new EpisodeStore(this.store.db) is created for each buildDailyRollups invocation. These stores are lightweight wrappers around the database handle with no caching or state, so the repeated instantiation has negligible overhead. In lcm-recent-tool.ts:479-480, TrackerStore and EpisodeStore are also created per-request. This is consistent with how RollupStore is already used in the tool. Not a bug, but could be cleaned up if store creation becomes more expensive in the future.
Was this helpful? React with 👍 or 👎 to provide feedback.
| extractTrackersFromRollup({ | ||
| conversationId, | ||
| dateKey, | ||
| rollupId, | ||
| rollupContent: draft.content, | ||
| trackerStore: new TrackerStore(this.store.db), | ||
| }); |
There was a problem hiding this comment.
📝 Info: extractTrackersFromRollup called inside database transaction could cause nested transaction issues
At src/rollup-builder.ts:266-272, extractTrackersFromRollup is called inside withDatabaseTransaction(..., 'BEGIN IMMEDIATE', ...). The tracker extractor calls trackerStore.getOpenTrackers and trackerStore.createTracker and trackerStore.resolveTracker, which all call db.prepare(...).run(...) directly without their own transactions. Since these are plain SQL statements executed within the already-open transaction, they participate in it correctly. The TrackerStore methods (src/store/tracker-store.ts) don't use withDatabaseTransaction internally, so there's no nested transaction conflict. This is safe.
Was this helpful? React with 👍 or 👎 to provide feedback.
| getActiveEpisodes(conversationId: number): EpisodeRow[] { | ||
| return this.db | ||
| .prepare( | ||
| `SELECT | ||
| episode_id, | ||
| conversation_id, | ||
| title, | ||
| status, | ||
| first_day, | ||
| last_day, | ||
| keywords, | ||
| rollup_ids, | ||
| day_count, | ||
| created_at, | ||
| updated_at | ||
| FROM lcm_episodes | ||
| WHERE conversation_id = ? | ||
| AND status IN ('active', 'stale') | ||
| ORDER BY last_day DESC, updated_at DESC`, | ||
| ) | ||
| .all(conversationId) as unknown as EpisodeRow[]; | ||
| } |
There was a problem hiding this comment.
📝 Info: getActiveEpisodes returns both 'active' and 'stale' episodes despite the method name
In src/store/episode-store.ts:113-134, getActiveEpisodes queries for status IN ('active', 'stale'). The method name suggests only active episodes, but it also returns stale ones. The caller in episode-detector.ts:109-117 filters with if (episode.status !== 'active') continue, which makes this safe. The method name is slightly misleading but the behavior is intentional — stale episodes are included so callers can decide what to do with them. The lcm-recent-tool at line 484 also uses this to show both active and stale episodes to the user.
Was this helpful? React with 👍 or 👎 to provide feedback.
|
Closing this local/fork tracker PR as historical source material. The idea is still valuable, but it should be rebuilt as an observed/non-authoritative tracker layer after upstream Martian-Engineering#517, not merged from this branch as-is. Follow-up issue: Canonical plan:
|
|
Closed as part of LCM PR-stack cleanup. See the preceding comment for the canonical upstream PR or follow-up issue. |
|
Fork cleanup note: tracker/open-loop concepts are represented upstream by Martian-Engineering/lossless-claw PR Martian-Engineering#530, which keeps them as observed-work evidence with advisory status, confidence, and provenance rather than authoritative tasks. |
…c failure mode * src/db/migration.ts: `ensureMessagePartsTable` JSDoc now explains the post-Martian-Engineering#569 split-bulk approach exec's statements individually, so the legacy silent-abort failure mode no longer applies on new installs. The guard remains as defense-in-depth for legacy DBs upgraded across the silent-abort window. Inline comment at the registration site rewritten to match. * test/migration.test.ts: drop the `PR #9` reference (an internal agent-generated PR-numbering artifact); reword to PR-agnostic phrasing and reference the actual issue/PR numbers (Martian-Engineering#569, Martian-Engineering#482) for the migration change. No code change. Addresses review on Martian-Engineering#579.
…MED + 1 LOW) Three Opus 1M-context agents reviewed the P1-P8 commit (e182f24) at ≥95% confidence. Fixed everything HIGH/MED + a small LOW. All 1328 tests still passing. HIGH #1 (semantic-search.ts:286): entity-only return path was missing the new mandatory cosineSimilarity field — would have crashed downstream `.toFixed(3)` calls when caller had embedded entities/themes and no summary candidates returned. Added cosine derivation to that branch. HIGH #2 (lcm-grep-tool.ts:268): full_text mode was applying our new sanitizeFts5Pattern AND the existing store-layer sanitizer (in conversation-store / summary-store via fts5-sanitize.ts). Composition is actually safe (verified by tracing) but redundant; removed the tool-layer sanitize from full_text path. Verbatim path keeps it (verbatim has its own SQL path bypassing the store sanitizer). HIGH #3 (lcm-grep-tool.ts:725-735): when FTS5 isn't available, the catch-block fallback to `m.content LIKE ?` was looking for the raw pattern in `binds` to replace — but `binds` was poisoned by sanitizeFts5Pattern (`v4.1` → `"v4.1"`). findIndex returned -1, no replacement happened, LIKE got the literal phrase-quoted form. All sanitized verbatim queries silently returned 0 hits on no-FTS5 SQLite installations. Fixed: replace at known-position index 0 (the FTS-MATCH bind is always pushed first). HIGH #4 (lcm-grep-tool.ts:99): role enum included only user / assistant / tool / all — but messages table contains 'system' role too. system messages were silently unfilterable. Added 'system' to schema enum and to the runtime VALID_ROLES set. MED #5 (semantic-search.ts:127): cosineSimilarity doc-comment thresholds said ≥0.8/0.6/0.4 but actual impl used ≥0.65/0.5/0.35. Doc fixed. MED #6 (lcm-describe-tool.ts:241): early header signal said "N candidates; details below" based on raw childIds.length, but detail block could say "0/N (all suppressed)" if everything was suppressed — contradictory signals. Reworded header to "N raw candidate(s) before suppression filter; survivors + details below" so it doesn't lie. MED #7 (lcm-describe-tool.ts:381): expandMessagesOffset had no upper bound, enabling adversarial DoS via huge OFFSET scans. Clamped at 100k (well past any realistic 216-msg leaf). MED #8 (lcm-search-entities-tool.ts:208): the P8 catalogStatus probe ran COUNT(*) on lcm_entities globally — full-table scan on multi-million-entity DBs. Replaced with EXISTS(SELECT 1 ... LIMIT 1) which short-circuits at first row. LOW #9 (lcm-describe-tool.ts:418): when expandMessagesOffset >= totalMessages, status was misleadingly "ok" with 0 results. Added distinct "offset-past-end" status variant so callers can distinguish "leaf is empty" vs "you paginated past the end". Verified end-to-end on snapshot DB: - role: "system" no longer schema-rejected - offset 50000 (clamped to 100k cap) returns "offset-past-end" status Tests: 1328 passing (no regressions; existing tests cover the changed contracts via type-checked fields).
…W closed Ten parallel Opus 1M-context agents reviewed PR Martian-Engineering#613 partitioned by surface (migration / voyage / synthesis / hybrid+retrieval / agent tools / concurrency / extraction / operator / tests / docs+manifest). All HIGH+MED findings closed below; QA runner improved alongside. DATA-CORRUPTION / AVAILABILITY HIGH FIXES ========================================= Synthesis (Auditor #3 #1 #2 #5): - INSERT → INSERT OR IGNORE on lcm_synthesis_cache so concurrent callers don't crash with UNIQUE collision; latch-loser re-SELECTs and either returns cached result or "building elsewhere" hint. - Reap zombie 'building' rows older than 10 min before INSERT (prevents process-killed-mid-dispatch availability latch). - Audit GC: prune 'started' audit rows >1h and 'completed'/'failed' rows >30 days on every synthesize_around call. Bounded growth. Voyage (Auditor #2 #1 #2 #3 #4): - MAX_TOKENS_PER_EMBED_DOC: 30k → 27k (Voyage tokenizer counts ~9.5% higher than DB token_count; 30k × 1.095 = 32.85k > 32k Voyage cap → 400 errors on 28-30k stored-token leaves). - BACKOFF_CAP_MS: 30s → 25s (so worst-case retry path 25s + 30s + 30s = 85s leaves 5s margin under WORKER_LOCK_TTL_MS=90s). - heartbeatLock now requires `expires_at > now` predicate, refusing to extend an already-expired lock (prevented two-workers-think-both-own race when our long Voyage call exceeded TTL). - writeBatch wraps each row in SAVEPOINT so per-row failure rolls back JUST that row's vec0+meta partial writes (was leaving phantom vec0 rows when meta-side INSERT failed). Hybrid retrieval (Auditor #4 #2 #3): - FTS adapter in lcm-grep-tool now over-fetches + post-filters on sessionKeys/summaryKinds (was silently dropping these filters, leaking cross-session content into hybrid results — violated v4.1 §10 session-family scoping invariant). - Semantic-search time filter changed from `s.created_at` to `julianday(COALESCE(latest_at, created_at))` to match FTS arm. Was returning divergent sets for the same since/before window. Entity coref (Auditor #7 #1 #2 #3 #4 #5): - Entity ID generation: Math.random() (32-bit, ~64K collision) → crypto.randomUUID()-derived 48-bit suffix. - Mention ID: 16-char prefix truncation → FNV-1a content hash. Long surfaces sharing the first 16 chars no longer silently collide. - Entity INSERT → INSERT OR IGNORE + re-SELECT winner. Prevents ROLLBACK + retry-forever loop when two ticks process the same canonical surface concurrently. - occurrence_count: bump ONLY when a new mention row is actually inserted (was double-counting on idempotent re-process). - Extractor 16K char silent truncation now logs a warn line with the dropped-chars count. Concurrency (Auditor #6 #4): - extraction-autostart now calls tickExtraction (orchestrator-wrapped with acquireLock/releaseLock) instead of runCoreferenceTick directly. Prevents two gateway processes from double-processing the queue. Migration (Auditor #1 #3): - widenLcmSynthesisCacheTierCheck_v413 now DELETEs orphaned lcm_synthesis_audit rows before DROP-ing lcm_synthesis_cache. With foreign_keys=OFF during migration (the standard pattern), audit rows would have become dangling references; now they're cleaned. OPERATOR SURFACE (Auditor #8 BLOCKER #1) ======================================== - /lcm purge command now wired (was dead code). Soft mode only (immediate cut from PR). Defaults to dry-run preview; --apply to actually suppress. --allow-main-session gates Eva's primary thread. Required: --reason "..." + at least one criterion (--session-key, --summary-ids, --since, --before, --min-token-count). MED FIXES ========= - dispatch.ts verify_fidelity regex: `/^\s*OK\b/i` → `/(?:^|\n)\s*OK\b/i` so model preambles before "OK" don't false-positive a hallucination flag (Auditor #3 #4). - lcm_describe budget=0 now emits an explicit "delegated grant exhausted" line instead of silently showing budget=over on every node (Auditor #5 #3). - lcm_get_entity / lcm_search_entities entityType docs now list the actual extractor-produced types (person_name, pr_number, agent_id, etc.) instead of the fictitious ('person', 'project', 'pr', 'commit', 'file') that never matched (Auditor #7 #8). QA RUNNER IMPROVEMENTS (Auditor #9) ==================================== - adv-empty-pattern: vacuous predicate fixed; now asserts either graceful error OR 0 matches. - Added 2 missing-tool smokes: adv-lcm-get-entity-smoke and adv-lcm-expand-query-smoke (8 tools now exercised, was 5 of 8). - Determinism: replaced `ORDER BY RANDOM()` and unsorted `LIMIT 1` with stable `ORDER BY summary_id ASC LIMIT 1 OFFSET ?` so re-runs pick the same leaves and report deltas cleanly. - JSON output now includes `schemaVersion: "1.0.0"`. - Voyage cost rate corrected: 0.00012 → 0.00018 per 1K tokens (under-reported by ~33%). DOC RECONCILIATION ================== - PR_DESCRIPTION.md: 22/25 claim now annotated with live-harness refinement (14/25 high confidence + 8/25 degraded UX + 3/25 fallback). - HARNESS_REPORT_2026-05-06.md: prepended status banner + per-bug [FIXED in commit X] annotations so reviewers reading the report end-to-end see what's still open vs. closed. VERIFICATION ============ - 1328/1328 tests passing (no regressions; 2 tests updated for intentional behavior changes — voyage cap 30k→27k, batching test sizes 30k→25k to stay under new cap). - QA runner: smoke 8/8, adversarial 10/10, full 30/30 — all clean. - Total cost ~$0.11 per full QA run. DEFERRED TO CYCLE-3 (acknowledged in PR description, not blocking merge) ========================================================================= - Auditor #6 #1-#3 (concurrency doc overclaims about busy_timeout + fallback-soak + heartbeat-on-worker-thread): in-process model means these guarantees aren't load-bearing today. Doc to be reconciled when worker-thread isolation lands in cycle-3. - Auditor #7 #6 idle GC for zero-mention entities: not blocking; occurrence_count only ever bumps up, never down. - P9 / P10 from harness report: low priority, no immediate workaround needed.
Wave-2 ran 10 Opus 1M-context agents over the post-Wave-1 commit. Key findings + fixes: CRITICAL CRASH BUG ================== Wave-2 Auditor #1 finding #1 (HIGH): the synthesis cache loser-path SELECT queried column `output` but the schema has `content` (migration.ts:1506). EVERY concurrent ready-cache hit threw `no such column: output`. Single-flight winner-already-ready fast-path was completely broken. Fix: changed SELECT to use `content`, response field renamed `text`. DATA-CORRECTNESS HIGH ===================== Auditor #1 #2: zombie cache janitor only reaped `'building'` rows; `'failed'` rows would block all future synthesis of the same window forever. Now reaps both. Added `recent_failure` response shape so caller can distinguish from `building_elsewhere`. Auditor #2 finding F1: parseRetryAfterMs silently clamped Voyage server-supplied Retry-After to BACKOFF_CAP_MS (25s), so a `Retry-After: 60` was retried at 25s — still rate-limited, wasting a retry slot. Also tightly coupled with WORKER_LOCK_TTL_MS=90s. Fix: honor server retry-after up to 5min cap; if it exceeds the lock-aware budget (60s), throw rate_limit immediately so caller releases lock and the next autostart tick retries cleanly. Auditor #6 BUG-2 + BUG-3 (HIGH): /lcm purge dry-run preview used its own SQL with `datetime(created_at)` while runPurge used raw `created_at >= ?`. Edge cases (timezones, microseconds) gave divergent counts; --summary-ids dry-run returned input length without filtering for actually-existing leaves. Also the empty- criteria dry-run scared operators with whole-DB count. Fix: extracted `previewPurgeAffected(db, opts)` from purge.ts and wired the dry-run to use it. Added validation parity, --allow-main- session warning, race-window note in output. Auditor #7 finding A1 (HIGH): time-filter inconsistency across tools — summary FTS + semantic used `julianday(COALESCE(latest_at, created_at))` (post Wave-1) but synthesize-around still used `datetime(created_at)` and verbatim grep used `datetime(m.created_at)`. Cross-tool: same `since`/`before` window returned different result sets depending on which tool the agent picked. Fix: synthesize-around now uses `julianday(COALESCE(latest_at, created_at))`. Verbatim grep (messages — no latest_at) now uses `julianday(m.created_at)` for syntactic parity. TEST COVERAGE GAP ================= Auditor #8 finding F1: zero test coverage for the Wave-1 migration DELETE-before-DROP fix. Fix: added 3 new tests in v41-synthesis-tables.test.ts: - DELETE prunes only orphan-pointing rows, preserves target_summary_id-pointing rows - re-running runLcmMigrations on already-widened DB is a no-op - schema includes wide CHECK including 'monthly' on first migration Auditor #8 finding F2: bare catch in migration too broad — could swallow corrupted-DB errors. Now narrowed to expected "no such table.*lcm_synthesis_audit" pattern; re-throws otherwise. QA RUNNER IMPROVEMENTS ====================== Auditor #9 HIGH-2: OFFSET overflow returned `undefined` row, target became `undefined`, predicate accepted any error → tests passed on empty corpus. Fix: fall back to OFFSET 0 (first leaf) if requested offset exceeds row count. Sentinel `__NO_LEAVES_IN_CORPUS__` when even that fails. Auditor #9 HIGH-3: B/C predicates only checked for `r.error` → 0-hit returns silently passed. Fix: added `Array.isArray(r.details?.hits)` assertion + per-hit shape validation (content, role for verbatim). DOC RECONCILIATION ================== Auditor #10 F1: HARNESS_REPORT internally inconsistent (banner said "30/30 pass" but verdict body still showed 14/8/3). Reconciled: explicit "two numbers reflect two rubrics" explanation. Auditor #10 F2: THE_FIVE_QUESTIONS.md still said "22/25 PRIMARY coverage" without live-harness annotation. Added post-fix verification note pointing to QA runner + HARNESS_REPORT. Auditor #10 F3: PR_DESCRIPTION listed "5 operator commands" but the plugin exposes 9 (status, health, worker, reconcile-session-keys, eval, purge, backup, rotate, doctor + help). Fixed to 9 with descriptions. CROSS-TOOL NAMING PARITY ========================= Auditor #7 A2 (MED): synthesize-around emits `voyage_tokens_consumed` (snake_case) while semantic-recall emits `voyageTokensConsumed` (camelCase). The tool's output uses snake_case throughout for internal consistency, so we added `voyageTokensConsumed` as a camelCase alias alongside the original. VERIFICATION ============ - 1331/1331 tests passing (1328 baseline + 3 new migration tests) - QA runner full suite: 30/30 pass - QA runner adversarial suite: 10/10 pass - Total cost: ~$0.11 per full QA run DEFERRED (acknowledged, not blocking merge) ============================================ - Auditor #2 F3 (heartbeat between batches, not mid-batch): the SAVEPOINT-per-row + heartbeatLock-with-expires_at-predicate combination already detects lock theft cleanly; mid-batch heartbeat is a cycle-3 hardening item. - Auditor #6 #11 (operator permission gate on /lcm purge): the command runs without an explicit auth gate at the plugin registration site. Gate is delegated to the OpenClaw plugin contract layer (per the existing convention with reconcile- session-keys, doctor clean apply, etc.). If/when OpenClaw exposes isOperatorSession() to plugins, all destructive subcommands will consume it together. - Auditor #1 #4 (verify_fidelity regex still has edge case where "OK" appears mid-line in negative context): improvement over Wave-1; full negative-context detection requires a more sophisticated parser. - Auditor #1 #5 (audit GC scans full table per call): cost is ~1ms; future move to scheduled background sweep. - Auditor #3 F2/F3 (entity coref single-flight contract): improvements documented; in-process inFlight + DB-row-level lock combination is sufficient for current single-process deployments. - Auditor #9 HIGH-1 (QA-runner durationMs varies across runs): timing fields are inherently non-deterministic; row selection IS now stable which is the actual reproducibility property.
Wave-3 ran 10 Opus 1M-context agents on the post-Wave-2 commit. Three agents (#3, #8, #9) couldn't see the post-Wave-2 tree — they looked at stale checkouts and produced no usable findings. The remaining seven surfaced 11 real issues. DATA-CORRECTNESS HIGH ===================== Auditor #1 H1: `recent_failure` response (Wave-2 addition) didn't include `failure_reason` even though we stored it on the row — caller saw a generic hint instead of the actual cause one column away. Fix: SELECT `failure_reason` from the loser-path query and surface it in the response. Truncate to 200 chars in the hint. Auditor #1 H2: 10-min `failed`-row TTL caused hammering during long Voyage outages — every 10 min, every distinct (session, range, fp) tuple would re-attempt LLM, fail, mark failed, repeat. With many windows this cascaded into a steady DDoS against the LLM provider. Fix: exponential backoff per cache row — `TTL_MIN * 2^audit_attempts`, capped at 6h. Audit row count gives us attempt history per cache_id. Auditor #1 H3: `building_elsewhere` had no max-retries hint — if the winner process died between INSERT and the next zombie sweep, every concurrent caller would loop indefinitely. Fix: compute `retry_after_ms = max(0, building_started_at + 10min - now)` so callers can sleep precisely once instead of polling. Auditor #1 M1: audit GC's 30-day branch had no index — full-table scan on every `synthesize_around` call. Fix: added partial index `lcm_synthesis_audit_completed_gc_idx` on `(ran_at) WHERE status IN ('completed', 'failed')` so both GC branches are O(log n). Auditor #1 M2: janitor DELETE + INSERT OR IGNORE were not atomic — cross-process callers could sneak in between, causing benign latch loss + unexpected `building_elsewhere` responses. Fix: wrapped both in `BEGIN IMMEDIATE` ... `COMMIT` so the operation is serialized at the SQLite write-lock level. Auditor #4 #3 (HIGH): `lcm_grep mode='semantic'` details.hits[] was missing `conversationId` (broke parity with hybrid + verbatim modes) and missing `cosineSimilarity` + `confidenceBand` (broke parity with `lcm_semantic_recall`). Cross-tool agents JSON-parsing the response shape would hit drift. Fix: details.hits now mirrors `lcm_semantic_recall` exactly: {summaryId, conversationId, sessionKey, kind, distance, cosineSimilarity, tokenCount, createdAt}. Tool now also emits `confidenceBand` at the top level + warns on low/noise just like semantic-recall. DOC FIXES ========= Auditor #6 #2/#3: README.md was stale — listed only 3 v3-era tools (`lcm_grep`, `lcm_describe`, `lcm_expand`) and 5 of the 9 commands. Fix: rewrote the tool list (8 tools with one-liners) and command section (9 subcommands with full flags). TEST COVERAGE FILLS (Auditor #7 top-3 priority gaps) ===================================================== Added 8 new tests (1331 → 1339): 1. `operator-purge.test.ts` previewPurgeAffected parity (4 tests): - Range purge: preview count == affectedLeafIds.length - --summary-ids: filters out non-leaf, already-suppressed, nonexistent - since/before time filter: preview matches apply - Empty match: preview returns 0 cleanly 2. `voyage-client.test.ts` lock-budget retry behavior (2 tests): - Retry-After > 60s threshold: throws immediately, does NOT sleep, elapsed time < 2s (proven by wall-clock measurement) - Retry-After ≤ 60s: server-supplied value honored, retries as expected 3. `lcm-synthesize-around-tool.test.ts` schema column-name regression (2 tests): - Schema has `content` (not `output`); all 6 columns the loser-path SELECT references exist - Literal SELECT used by loser-path executes without error against the real schema (proves the Wave-2 crash bug can't regress) VERIFICATION ============ - 1339/1339 tests passing - QA runner full suite: 30/30 - QA runner adversarial: 10/10 - Total cost ~$0.11 per full QA run DEFERRED (acknowledged, not blocking) ====================================== - Auditor #1 L1 (test exercises only the SQL DELETE not the full migration step): the DELETE-in-isolation is sufficient for what changed; the migration step itself has its own coverage in `v41-pre-existing-schema-migration.test.ts`. - Auditor #2 F2/F3 (60s lock-budget threshold has zero margin under worst-case scenarios): the Wave-1 heartbeat-with-expires_at predicate detects lock theft cleanly even if budget is exhausted; tightening the threshold further is a future hardening item. - Auditor #4 confirmed-clean items (suppression filter parity, error envelope shape, conversation-scope error message) — no further work needed. - Auditor #5 (E2E smoke): documented real UX gaps in `lcm_synthesize_around` discoverability (target= vs query=, window_kind required) — would require schema-description rewrites; queued for cycle-3 ergonomics pass. Audit cycle stats: - Wave-1: 17 HIGH + 9 MED + 1 LOW closed across 1 commit - Wave-2: 19 findings (4 HIGH + 4 MED + 1 LOW + others) closed - Wave-3: 11 findings closed (this commit) - Total: 36+11 = 47 findings closed across 3 commits - 1339 tests passing
… + 15 P1 closed After Eva's correct push for full-PR re-audits (Waves 5-6 were focused on diffs only and missed regressions in untouched surfaces), Wave-7 ran 22 parallel Opus 1M-context agents at ~1k LOC each across the full ~22K LOC production codebase. Surfaced 7 actionable P0s + ~30 P1s + ~25 P2s + ~15 P3s. (1 P0 from Auditor #17 was confused — was reading a stale clone path; ignored.) P0 — DATA / SECURITY / CORRECTNESS (7 closed) ============================================= Auditor #14 P0-1 (CRITICAL — security): /lcm purge --apply lacked any operator-session gate. The purge.ts module docstring explicitly required "callers MUST gate via deps.isOperatorSession() or equivalent" but the lcm-command.ts dispatch site at line 2626 wired runPurge with ZERO check. Any agent that could issue /lcm slash commands could purge another session's data — including Eva's primary thread via --allow-main-session. Fix: gate the entire `case "purge":` dispatch on `ctx.senderIsOwner` (the OpenClaw plugin SDK owner-only flag). Both dry-run preview AND --apply require owner; preview is gated because it leaks which leaves match the criteria. Auditor #14 P0-2 (data loss): Purge cascade orphaned shared messages. The UPDATE messages SET suppressed_at WHERE message_id IN (SELECT ... FROM summary_messages WHERE summary_id IN (...)) silently suppressed messages even when they were referenced by NON-purged leaves. assemble() filters on suppressed_at IS NULL → those non-purged leaves lost their underlying message content invisibly. Fix: added NOT EXISTS predicate that requires every other referencing summary to ALSO be in the purge set OR already suppressed before suppressing the message. Auditor #6 P0 (cache pollution): sessionKeyForCache fell back to "" in period mode when targetSummary was null AND input.sessionKey was empty. The cache UNIQUE constraint then collapsed multiple users' caches together — caller A's synthesis would surface in caller B's loser-path SELECT. Fix: 4-tier fallback chain — targetSummary's key → input.sessionKey → conversationIds[0]'s session_key (looked up from conversations table) → "agent:main:main" as last-resort default. Auditor #9 P0-2: expandMessages did not honor the W4 budget=0 expansion-block; only expandChildren did. A delegated caller with grant=0 calling expandMessages=true got full message content despite the documented "expansion is blocked" assertion. Fix: identical budgetExhausted gate added to the expandMessages branch. Auditor #12 P0-A: Per-row SAVEPOINT MISSING in entity-coreference batch tx. A single bad surface (FK violation, encoding issue, CHECK failure) ROLLBACKed the WHOLE LEAF — discarding all valid mentions already inserted AND failing to bump attempts (the dead-letter gate), producing an infinite-retry loop on poison surfaces. Fix: each entity surface now gets its own SAVEPOINT inside the batch tx. Per-row failure rolls back JUST that surface; siblings + queue UPDATE survive. Failures recorded in itemDetail.error per-index for operator visibility. Auditor #9 P0-1: describe()'s "raw count" header LIED. It labeled `s.childIds.length` as "raw candidate(s) before suppression filter" but childIds was already suppression-filtered upstream by getSummaryChildren default. Agents reading the header believed they were seeing pre-filter counts. Fix: re-query the actual raw count via a cheap COUNT(*) on summary_parents and emit honest "X of Y raw" phrasing. When all children suppressed, distinguishes from "no children" (terminal node) — was previously indistinguishable. Auditor #19 P0: scripts/v41-synthesize-around-smoke.mjs still used copyFileSync against the live WAL DB (W4 fixed v41-live-db-harness.mjs + preflight but missed this third script). Mid-checkpoint copies produce malformed snapshots. Fix: VACUUM INTO atomic snapshot. P1 — HIGH IMPACT (15 closed) ============================= - Auditor #1 P1: searchLikeCjk used `new Date()` instead of parseUtcTimestamp → CJK fallback timestamps offset by host's local TZ. Other 4 search paths used parseUtcTimestamp; CJK was the outlier. - Auditor #2 P1: Voyage responseBody privacy. W4 fixed only the 400 path; 401/403/429/5xx/4xx-other still attached raw bodyText to the exception. Same Sentry/log-capture vector. Fix: route ALL non-200 responseBody through summarizeBody for parity. - Auditor #4/13 P1: tickExtraction ignored result.lockLostMidTick. W4 added the field but the wrapper returned `lockAcquired: true` regardless. Now flips to false when heartbeat reported lock-loss mid-tick → autostart can detect + back off. - Auditor #5 P1.1: best-of-N used Promise.all → one failed candidate threw away successful peers' work. Fix: Promise.allSettled. Throw only if ALL fail; judge picks among survivors. - Auditor #5 P1.2: best-of-N with N=1 still ran judge — judge prompt expects 0..N-1 indexed candidates; many models emit 1-indexed and trip judge_failure. Fix: skip judge when only 1 candidate survived. - Auditor #6 P1: parsePeriodShortcut regex over-accepted undocumented variants (last-3day, last-3-d). Fix: tightened to /^last-(\d+)d$| ^last-(\d+)-days$/ matching only documented forms. - Auditor #8 P1-3: sort silent override. Agent passing sort=relevance with mode=regex got recency without warning. Fix: details now surfaces sortIgnored: true + requestedSort/effectiveSort. - Auditor #8 P1-2: kFts/kSemantic over-fetch was max(limit, 50). At limit=200, rerank had ZERO headroom. Fix: 3× limit, floored at 50, capped at 500 (Voyage rerank budget). - Auditor #21 + #8 P1-6: hybrid confidenceBand thresholds reuse cosine calibration on rerank scores (different scale). Fix: emit confidenceBandSource: "cosine" | "rerank" so callers know which signal drove the band. - Auditor #12 P1-A: extractor placeholder pre-scan (W4 promised but never implemented). Fix: refuse extraction if leaf content contains XML envelope-like patterns (defense-in-depth against injection). - Auditor #12 P1-E: dead-letter UPDATE failure left attempts at 0 → infinite retry. Fix: try second simpler bump-only UPDATE if the first (with last_error) fails. - Auditor #18 P1: promptAwareEviction violates "structural-only" invariant. Fix: documented as opt-in with WARNING comment in config.ts that flagging it on breaks deterministic replay. - Auditor #20 P1-3: README synthesize_around description was anchor-required-only — period mode (the lcm_recent replacement) not mentioned. Fix: 3-mode breakdown. - Auditor #20 P1-4: THE_FIVE_QUESTIONS stale prose declared "themes/procedures/entities" all live. Themes + procedures were CUT (preserved in Martian-Engineering#616). Fix: explicit coverage status note. VERIFICATION ============ - 1345/1345 unit tests passing (no regressions) - QA runner full: 30/30 pass - QA runner adversarial: 10/10 pass (not re-run; W6 baseline) - Total cost ~$0.11 per full QA run DEFERRED (acknowledged) ======================== - A14 P1: lcm_purge_audit table — needs schema migration; defer to cycle-3. Workaround: purge_session_id is returned + suppress_reason is recorded per leaf row. - A18 P1: summarizeWithEscalation silent over-cap truncation — separate from W4 fallback marker fix; cycle-3 ergonomics. - A8 P1-5: details.hits[] shape drift across 5 grep modes — by-design difference (regex/full_text are aggregates; hybrid/semantic/verbatim are per-row). Documented in agent-tools.md. - A8 P1-4: verbatim recency-only ordering — by-design (citation use case prioritizes "what was said most recently"). - A10 P1-01: lcm_expand 24-day legacy timeout — sub-agent-only path, bounded by grant TTL. - A10 P1-06: runExpand `?? 0` fallthrough — multi-conv grant path not exercised by lcm_expand_query (always single-conv). - Various P2/P3 cosmetic items.
…0 + 9 P1 closed + 4 new regression tests Wave-9 was the first audit cycle to give every agent FULL FILE context (not just diffs) plus cross-cutting checklists tailored to their slice, plus all prior wave findings as known-closed reference. Eva's directive: "agents need ENOUGH CONTEXT not to introduce new issues while fixing minor ones." Wave-9 also added a TS-strict closure pass (separate commit 11f10a6) that brought PR-introduced TS errors from 30 → 0. 11 agents (slicing by responsibility, ~14.7k LOC src + 12.5k LOC tests + 2.2k LOC scripts): #1 Lossless core — engine, assembler, retrieval, summarize, compaction #2 Migration + schema — db/migration, all migration tests #3 Storage layer — summary-store, conversation-store #4 Search tools — lcm_grep, lcm_semantic_recall, hybrid, semantic #5 Drilldown tools — lcm_describe, lcm_expand, lcm_expand_query #6 Entity + extraction — lcm_get_entity, lcm_search_entities, coreference #7 Synthesis — synthesize_around, dispatch, prompt-registry, seed #8 Voyage stack — voyage/client, embeddings/store/backfill/semantic #9 Worker + concurrency — concurrency/*, autostarts, worker-orchestrator #10 Operator surface — purge, health, reconcile, eval-runner, plugin #11 Scripts/QA-runner — coverage-gap audit Eva caught after launch Findings: 1 P0 + 13 P1 + 22 P2 + 42 P3 = ~77 unique (Agent #2 P2 and Agent #7 P1 converged on same `{{date_range}}` bug.) This commit closes the P0 + 9 of 13 P1s + adds 4 regression tests. Remaining P1s + all P2/P3 are documented in PR comment for follow-up. P0 (CLOSED) — Owner gate parity (Agent #10): - /lcm reconcile-session-keys --apply lacked senderIsOwner (Wave-7 P0-1 had only added it to /lcm purge). Cross-session data theft vector: non-owner agent could re-key Eva's primary thread into an attacker bucket via --allow-main-session. - /lcm worker tick embedding-backfill same gap (lower-impact: DoS-by-billing on the operator's Voyage account). - Both fixed: same gate pattern as case "purge" applied to both. - 3 new regression tests pin the gate behavior so future refactors can't silently regress. P1 fixes (9 of 13): P1.1 (Agent #5) — Citation-fabrication count threaded through ExpandQueryReply. Wave-4+W6+W8 chain validated citedIds internally (rejected fabricated IDs against summaries table) but buildExpandQueryReply silently dropped the counts. Agent now sees citedIdsRejectedAsFabricated + citedIdsExceededValidationCap in the JSON reply (omitted when zero, summed across buckets in multi-conv path). P1.2 (Agent #5) — lcm_describe expandChildren/expandMessages now consumes the grant token budget. Previously the budget was CHECKED (budgetExhausted detection) but never DECREMENTED. With 50 children + 50 messages × ~2K tokens each = ~100K tokens delivered per call without grant cap touching. Now sums consumed tokens and calls authManager.consumeTokenBudget() for sub-agent sessions. Closes the unbudgeted side-channel that defeated the W4/W6 expansion budget. P1.3 (Agent #4) — lcm_grep --mode semantic VoyageError contract parity. Previously caught only `auth` and SemanticSearchUnavailable; let rate_limit/server_error/network/bad_request/unexpected propagate as unhandled tool errors. lcm_semantic_recall correctly catches all VoyageError kinds. Now mirrored — both surfaces routed for Question B have identical error contract. P1.4 (Agent #4) — lcm_grep --mode verbatim CJK fallback. messages_fts uses tokenize='porter unicode61' which can't segment CJK ideographs — MATCH on 中文 returned 0 rows WITHOUT throwing, so the exception-driven LIKE fallback never fired. Now containsCjk(pattern) detected at JS layer, routes directly to LIKE substring match (skipping FTS join entirely). 1 new regression test covers Chinese characters. P1.5 (Agent #10) — reconcileSessionKeys TOCTOU race. affectedConvs snapshot was taken OUTSIDE BEGIN IMMEDIATE; concurrent INSERT/UPDATE between snapshot and tx-acquire could be UPDATE-moved without an audit row, silently dropping it → loss-of-undo on a destructive op. Same pattern as Wave-8 P1's runSoftPurgeAtomic fix. Refactored: active-conflict pre-check + affectedConvs SELECT + UPDATEs all run inside the same BEGIN IMMEDIATE. P1.6 (Agent #10) — runRecallEval setTimeout leak. Promise.race spawned a timer that was never cleared on adapter resolve. N=100 queries × 30s = 30s tail-latency floor + event-loop liveness held open (process never exits in scripts). Added try/finally with clearTimeout. P1.8 (Agent #1) — Compaction fallback marker regression. Wave-4 P0 fix in summarize.ts tagged fallback content with "[LCM fallback summary - model unavailable]" — but because the marker adds ~25 tokens, the resulting summary is LARGER than the source, so summarizeWithEscalation rejected it as "didn't compress" and fell through to compaction.ts's OWN buildDeterministicFallback which emitted raw truncated content with NO marker, silently undoing the W4 fix for any source <= max(targetTokens*4, 256) chars (i.e. most leaves under LLM outage). Fix: prepend the same marker in compaction.ts's fallback. Empty-source path tagged for parity. P1.9 (Agent #2 + #7 convergence) — {{date_range}} placeholder orphaned in seed prompts vs renderer. dispatch.renderPrompt only substituted source_text/tier/memory_type. Seeded daily/weekly/ monthly templates used {{date_range}} literally; SynthesizeRequest had no dateRange field. Currently latent (synthesize_around clamps to custom/filtered) but becomes P0 the moment a daily/weekly/monthly synthesis worker wires up. Same class as Final.review.3 Loop 4 Bug 4.2. Fix: dropped {{date_range}} from seeded templates (use "from a single day/week/month" phrasing instead). Caller can bake explicit ranges into sourceText if needed. P1.10-P1.13 (Agent #11) — QA harness coverage gaps: P1.10 — process.chdir("/tmp/lossless-claw-upstream") hardcoded made the QA harness unrunnable anywhere except that exact path. Replaced with a sentinel-file existence check that errors fast with a clear "run from repo root" message. P1.11 — adv-lcm-expand-query-smoke was vacuous: predicate returned null unconditionally, args omitted required `prompt` field. Now exercises full dispatch path with real prompt + asserts response shape (answer + citedIds, or graceful LLM-unavailable error). P1.12 — Period mode (lcm_recent replacement, most reviewer-debated capability) had ZERO harness coverage. Added 2 new test cases: period='yesterday' and period='last-7d' (covers the W7-tightened hyphenated parser). P1.13 — lcm_grep regex/full_text modes had ZERO harness coverage (2 of 5 documented modes). Added 2 new test cases asserting the regex/full_text response shape (totalMatches/messageCount/ summaryCount, not details.hits which is hybrid-only). Verifications: - npx tsc --noEmit → 739 errors (exactly matches origin/main baseline; ZERO PR-introduced TS errors) - npx vitest run → 1353/1353 passing (1349 baseline + 3 owner-gate + 1 CJK regression tests) - All Wave-9 fixes verified at code level on real file paths Deferred P1s (4 of 13) — handled in follow-up commits / cycle-3: - P1.7: TOCTOU between affectedConvs and active-conflict pre-check is now closed (folded into P1.5 fix above). - Agent #5 P2 multi-bucket DEFAULT_MAX_CONVERSATION_BUCKETS=3 silent drop is documented but deferred (ergonomic, not safety). - Agent #4 cosineSimilarity not clamped in hybrid mode: trivial 2-line fix but not safety. - Agent #5 dead `runDelegatedExpansionLoop` in lcm_expand: cleanup task, no behavior change. Pattern observation: Wave-9's full-file-context approach paid off — caught the same class of bug (missing owner gate) on the SISTER case of a previously-fixed P0, which a narrow-diff audit could not have spotted. Future audits should keep this approach.
… 4 sub-agent test layers + 8 source bugs closed A separate reviewer raised 12 findings on PR Martian-Engineering#613 with the strategic bar "don't just make the findings disappear; make the PR truthful under real operator scenarios." User correctly noted "wasn't sure if verified" so I verified each before fixing. Verification result: 12-for-12 real bugs. Combined with 4 parallel test-quality sub-agents addressing antipatterns A8 (concurrency) + A9 (schema drift) + A1/A4 (adversarial scenarios + fixture-test circularity) + A4-at-scale (stress fixture). # Reviewer findings (all 12 closed) ## P1 (5) - **#1 Period synthesis timezone** (src/tools/lcm-synthesize-around-tool.ts): parsePeriodShortcut anchored "today/yesterday/this-week/last-week/ this-month/last-month" at UTC midnight. A Bangkok operator (UTC+7) at 02:00 local asking "yesterday" got UTC-yesterday — ~17 hours off. Operator-trust violation. Now uses Intl.DateTimeFormat to compute local-day boundaries in lcm.timezone (configured IANA TZ); samples the offset at local noon to avoid DST-fold ambiguity. Relative forms (last-Nh, last-Nd) stay UTC-anchored (now-minus-N, not day-anchored). - **#2 Synthesis cache key** (src/db/migration.ts + src/tools/lcm-synthesize-around-tool.ts): UNIQUE index keyed only on (session_key, range_start, range_end, leaf_fingerprint, grep_filter). Two correctness bugs: (a) tier='custom' then tier='filtered' for same range/leaves silently returned wrong-tier cached text, (b) registerPrompt changing the active prompt left cache serving stale text from the old prompt. Now includes tier_label + prompt_id in both the UNIQUE index and the lookup SELECT. Cache is rebuildable so wiping under the new key is safe. - **#4 /lcm eval owner gate** (src/plugin/lcm-command.ts): /lcm eval mutates lcm_eval_run + lcm_eval_query_result tables AND can use Voyage in hybrid mode (small but non-zero quota cost). Wave-9 Agent #10 had classified it as READ_ONLY — the reviewer correctly challenged that classification. Now gated on senderIsOwner and added to the authorization-invariant test's DESTRUCTIVE_OPERATOR_CASES list. - **#5 Voyage rerank token budget** (src/embeddings/hybrid-search.ts): rerank sent ALL candidates' full content with no enforcement of the ~600K-token cap. Realistic queries with many large condensed summaries hit Voyage 400 → silent RRF degradation, losing the +52.5pp paraphrastic recall lift. Now packs candidates into rerank input cumulatively until 85% of MAX_TOKENS_PER_RERANK_CALL, dropping tail when over budget. Surfaces rerankPackTruncated + rerankPackedCount in HybridSearchResult. - **#6 lcm_describe base content not charged** (src/tools/lcm-describe-tool.ts): Wave-9 P1.2 fix added consumeTokenBudget for expandedChildren + expandedMessages but skipped the base summary's s.content (which lines.push()es ALL of it). A sub-agent could lcm_describe a 30K-token condensed summary with NO expansion flags and drain context for free. Now charges base s.tokenCount too. ## P2 (5) - **#3 Suppressed entity leakage** (src/tools/lcm-get-entity-tool.ts + src/tools/lcm-search-entities-tool.ts): when ALL mentions of an entity were suppressed via /lcm purge, the entity row in lcm_entities still leaked canonical_text + alternate_surfaces + metadata via both tools. The reviewer's framing: "suppression means invisible to agents, period." Both tools now require at least one unsuppressed mention via EXISTS guard. The "not found" branch now covers both "no such entity" AND "all mentions suppressed" indistinguishably (so an attacker can't infer entity existence). Updated test fixtures' insertEntity helpers to auto-create a default visible mention; tests that explicitly want the all-suppressed case opt out via noDefaultMention: true. - **#7 Pending-extractions count** (src/extraction/entity-coreference.ts): countPendingExtractions filtered only on (kind, completed_at IS NULL), but runCoreferenceTick's selector ALSO requires (attempts < 5, summaries.suppressed_at IS NULL). Mismatch caused autostart to spin forever on rows the tick would never select. Predicate now exactly matches the selector. - **#8 QA runner period coverage + exit semantics** (scripts/v41-qa-runner.mjs): period test cases I added in Wave-9 P1.12 omitted window_kind="period" (required by the tool), so they only hit schema-validation early-return and the regex match on 'period' made them trivially pass. Added the required field. Plus failedImportant had no exit branch — runner exited 0 on any "important" failure, advisory-only. Added exit code 1 for important failures so the runner can act as a release gate. - **#9 sqlite-vec install honesty** (package.json + semantic-infra-init.ts): sqlite-vec wasn't in any dependencies block, init log was log.info (low visibility), and PR_DESCRIPTION emphasized VOYAGE_API_KEY alone. Added to optionalDependencies; bumped log to log.warn with explicit install instructions + clear "what becomes unavailable" message. - **#10 Backfill complete message lies** (src/plugin/lcm-command.ts): countBackfillPending excludes leaves with token_count > MAX_TOKENS_PER_EMBED_DOC, so an over-cap leaf was neither pending nor backfilled. Worker-tick output printed "✅ Backfill complete" even when over-cap leaves remained unembedded. Added countOverCapPendingForBackfill helper; completion message now distinguishes "in-range complete + over-cap remain" from full coverage. ## P3 (2) - **#11 lcm_synthesize_around description** (src/tools/lcm-synthesize-around-tool.ts): agent-tool description still said "Two modes" (time + semantic) while schema declared three. Rewrote description + JSDoc to mention all three (period, time, semantic) and explicitly call out 'period' as the lcm_recent replacement / "what did we work on yesterday" surface. - **#12 NUL byte in source** (src/tools/lcm-synthesize-around-tool.ts:331): fingerprintLeaves used a literal NUL byte (\x00) as a hashing separator, making the file binary to grep. Replaced with the escape sequence "\0" (functionally identical at runtime, readable in source). File is now searchable. # Sub-agent test layers (4 in parallel) ## Sub-agent #1 — Concurrency / TOCTOU (test/v41-concurrency-invariants.test.ts, ~1044 LOC, 8 tests) Worker-thread-based parallel-writer harness reproduces and pins race-condition fixes: reconcileSessionKeys race (Wave-9 P1.5), runSoftPurgeAtomic race (Wave-8 P1), worker-lock acquire (5-way), heartbeat-during-LLM-call (Wave-9 Agent #8 P2), recordEmbedding DELETE-before-INSERT atomicity. Verified regression-detection by simulating pre-fix code. 0 new bugs found. ## Sub-agent #2 — Schema/placeholder drift (test/v41-schema-drift-invariants.test.ts, ~654 LOC, 19 tests) Static-analysis tests via readFileSync + regex. Catches: placeholder drift in seeded prompts vs renderer (Wave-9 P1.9 class), tier_label CHECK constraint coverage vs TS union (Final.review.3 Bug 4.4 class), manifest-vs-registered-tool drift (Wave-9 vapor-tools class), parser/handler symmetry, FK ON-DELETE explicitness. **Found 3 P3 FK drift bugs** — 3 declarations missing explicit ON DELETE clauses. Closed in this commit (lcm_synthesis_cache.prompt_id, lcm_synthesis_audit.prompt_id, lcm_embedding_meta.embedding_model → all now `ON DELETE RESTRICT`). ## Sub-agent #3 — Adversarial scenarios + fixture-test circularity audit (test/v41-adversarial-scenarios.test.ts, ~1149 LOC, 37 tests) Audit of original 25 scenarios: 16/26 strong, 9/26 weak ("only totalMatches > 0"), 1 sentinel. Strengthened 6 weak tests in v41-five-questions.test.ts (B1-B5, E2) to assert specific summary IDs. **Found 1 real fixture bug**: summaries_fts insert used `rowid` but schema declares `(summary_id UNINDEXED, content)` — original B1-B5 tests "passed" only because they matched at the messages layer, never actually exercising summary FTS. Fixed in fixture; the strengthened B1-B5 tests now actually exercise summary FTS. 37 hard adversarial scenarios spanning paraphrase, ambiguity/ranking, compound queries, negative queries, content injection (placeholder/XML/script/ SQL-injection), ranking sensitivity, cross-tool composition, suppression boundary. ## Sub-agent #4 — Stress fixture (test/fixtures/v41-stress-corpus.ts + test/v41-stress-fixture.test.ts, ~898 LOC, 11 tests) Deterministic generator for 1500-2500 leaves with realistic distribution (30% last-7-days, dense days with 100+ leaves, 5-10% suppressed, 5% CJK, near-duplicates, 5 adversarial-content leaves). 11 stress tests cover build smoke, determinism, distribution, dense-day query, suppression cascade, FTS5 perf, vec0 KNN (graceful no-op when vec0 unavailable), adversarial-content non-breaking, near-duplicate handling, recency floor. # Wave-10 reviewer regression coverage (test/v41-wave10-reviewer-regressions.test.ts, 6 tests) Pins fixes for #2 (cache UNIQUE index w/ tier+prompt), #3 (suppressed entity invisibility), #7 (pending count predicate), #10 (over-cap counting). #1 has its own dedicated v41-period-timezone.test.ts (8 tests). #4 covered by extending v41-authorization-invariants.test.ts DESTRUCTIVE_OPERATOR_CASES. # Verification - **1490/1490 tests passing** (1401 pre-Wave-10 + 89 new from this commit) - **677 TS errors** (FEWER than the 739 main baseline — type-tightening fixes cascaded from the source changes) - 4 sub-agent test files all green - 6 reviewer-regression tests all green - Authorization invariant test now covers `eval` → catches future removal of the gate # What's NOT in this commit (future work) - Mutation testing CI integration (stryker is too slow for per-PR; config exists for ad-hoc invocation) - Wave-1-9 antipattern tabulation update with Wave-10 findings
…ed 12/12 real) Fresh re-audit at 37e2b71 found 12 issues; 11 closed in this commit, 1 documented as known limitation. Reviewer was 12-for-12 real (Wave-10 was also 12-for-12; reviewer track record: 24-for-24). # CI blockers - **#1 (P1)** Auth invariant test hardcoded `/tmp/lossless-claw-upstream` path. CI failed because that path doesn't exist on GitHub runners; local runs accidentally succeeded by reading whatever stale checkout was at that path. Now resolves via `import.meta.url` → `__dirname/../src/plugin/lcm-command.ts`. Works in any worktree. - **#10 (P2)** `pnpm-lock.yaml` was stale after the Wave-10 `optionalDependencies` addition. Regenerated via `pnpm install --lockfile-only`; verified `pnpm install --frozen-lockfile` succeeds. # Security parity - **#2 (P1)** `/lcm doctor apply` and `/lcm doctor clean apply` lacked `senderIsOwner` gate. Wave-9 Agent #10 had classified the doctor cases as READ_ONLY, but the `apply` flag inside dispatches to the summarizer (cost) AND mutates summaries (state) for `doctor apply`, and DELETEs cleaner matches for `doctor clean apply`. Mirror the purge / reconcile / worker-tick / eval gate pattern. Read-only variants (no `--apply`) stay open. Plus updated `test/lcm-command.test.ts`'s `createCommandContext` helper to default `senderIsOwner: true` so existing tests for the doctor mutating paths continue passing — Wave-9 negative tests still explicitly pass `senderIsOwner: false` via overrides. Plus added 4 new tests to `v41-authorization-invariants.test.ts` pinning the Wave-11 doctor-apply gate behavior (apply-rejected, read-only-allowed for both `doctor` and `doctor clean`). - **#5 (P1)** `lcm_describe` early-budget-gate. The Wave-10 fix charged base summary tokens against the grant AFTER emitting `s.content`. For a sub-agent at zero remaining budget, the content was already disclosed before accounting could prevent it. Added an EARLY gate: if delegated session AND base summary tokens > remaining grant, redact `s.content` with a clear "[REDACTED — base summary content is N tokens but grant has only M remaining]" message and skip the charge. Closes the disclosure-before-accounting path. # Correctness - **#3 (P1)** Timezone fractional offsets + DST. Wave-10's "sample offset at noon" approach broke on: - Half-hour zones: Asia/Kolkata (UTC+5:30) → showed +5 not +5:30 - Quarter-hour zones: Asia/Kathmandu (UTC+5:45) - DST transition days: LA spring-forward 2026-03-08 → noon is in PDT (-7) but local midnight was in PST (-8); my function used the noon offset for the whole day → wrong by 1 hour Replaced with iterative converge-to-midnight algorithm: 1. Format `at` in target tz to get y/m/d 2. Probe = naive `Date.UTC(y, m-1, d, 0, 0, 0)` 3. Format probe in target tz; compute delta from target midnight 4. Adjust probe; repeat until delta=0 (typically 1-2 iters) Handles all IANA timezones, DST transitions, and arbitrary offsets. Added 3 new regression tests: - Asia/Kolkata 'yesterday' (UTC+5:30) — half-hour offset - Asia/Kathmandu 'today' (UTC+5:45) — quarter-hour offset - America/Los_Angeles 2026-03-08 — spring-forward day, asserting 'today' duration is exactly 23h - **#6 (P1)** Hybrid rerank now skips individually oversized candidates instead of bailing. Pre-fix: when the FIRST candidate exceeded the 510K-token (85% of 600K) rerank budget, the packer set `rerankPacked=[]` and broke out, disabling rerank for the whole result set. Now: oversized candidates are individually skipped (counted in `rerankPackSkippedOversized`) and packing continues with later candidates that fit. Result: a single huge FTS hit no longer takes down the whole rerank. - **#7 (P1)** Voyage `output_dimension` not forwarded. Configurable embedding dimensions (`LCM_EMBEDDING_DIM=2048` registers a 2048-dim profile in `lcm_embedding_profile`) but `embedTexts()` never sent `output_dimension` to Voyage, so Voyage returned its default (1024). vec0 INSERT then failed with dim mismatch on the per-model table. Added `outputDimension?: number` to `VoyageEmbedOptions`; forwarded via backfill (`opts.voyageOutputDimension`) and semantic-search query embed (`active.dim`). Default unchanged (omit → Voyage 1024). # Documentation accuracy - **#4 (P1)** Synthesis dispatch model claim. Tool description said "per-tier dispatch (haiku/sonnet/opus/thinking)" but actual LLM call routes through the configured summarizer chain (which ignores `args.model`). Source code already had honest comment in `buildLlmCallFromSummarizer` ("the summarizer wrapper ignores the dispatch-supplied model"); the tool description and PR description overclaimed. Updated tool description to be accurate: dispatch records the per-tier model name in the audit table, but the actual LLM call uses the operator's configured summarizer chain. # Polish - **#9 (P2)** Health archive filter. `readActiveProfile` selected on `active = 1` alone, ignoring `archive_after IS NOT NULL`. Semantic retrieval correctly filters archived; health was reporting a profile semantic search would not actually use during model cutover. Now matches: `WHERE active = 1 AND archive_after IS NULL`. - **#11 (P2)** Changeset rewritten. Old changeset only mentioned session-family recall. New changeset documents the full v4.1 release surface: 8 agent tools (with new modes), 2 worker autostarts, 9 operator commands (with owner-gating), schema changes, sqlite-vec optionalDependency, configuration env vars, and what was cut to Martian-Engineering#616. - **#12 (P3)** Stale entity-search docblock. The header comment said "entities with all-suppressed mentions can still appear here"; Wave-10 added the EXISTS guard so they no longer can. Updated comment to reflect the actual filter behavior. # Known limitation (deferred) - **#8 (P2)** Cache key still ignores resolved model. Adding `model_used` to the UNIQUE index doesn't help because model resolution is dynamic (the summarizer chain picks at call time, not before INSERT). The proper fix is invalidate-on-mismatch at cache-hit time, which is a larger refactor. Documented in the entry above + tracked for follow-up. # Verification - `npx vitest run`: **1513 / 1513 tests passing** (1502 → 1513; +11 new regression tests for Wave-11 fixes) - `npx tsc --noEmit`: **677 errors** (still below 739 main baseline; no PR-introduced TS errors) - `pnpm install --frozen-lockfile --ignore-scripts --lockfile-only`: **succeeds** (was failing pre-fix with ERR_PNPM_OUTDATED_LOCKFILE) - Authorization invariant test: now resolves the source path relative to test file via `__dirname` — works in any checkout location
Adds cross-day tracker for blockers, open items, decisions, and questions. Tracks opened/resolved state across daily rollups. Exposes via lcm_recent(period='open_items'|'blockers'). Part of lcm_recent Phase 2.