fix(agents): persist delivered assistant text [AI]#77445
Conversation
|
Codex review: needs real behavior proof before merge. Workflow note: Future ClawSweeper reviews update this same comment in place. How this review workflow works
Summary Reproducibility: yes. source-level reproduction is high confidence: current main has PR rating Rank-up moves:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. Real behavior proof Risk before merge
Maintainer options:
Next step before merge Security Review detailsBest possible solution: Merge the transcript mirror approach only after a redacted real embedded tool-using run shows delivered assistant text persisted before the next user turn while usage/accounting still comes from the real assistant message. Do we have a high-confidence way to reproduce the issue? Yes, source-level reproduction is high confidence: current main has Is this the best way to solve the issue? Yes, the current branch is a maintainable narrow fix: it writes only a filtered assistant-text mirror, preserves real assistant usage/accounting, and avoids compaction/error paths. The remaining blocker is real behavior proof, not an obvious code defect in the diff. Label changes:
Label justifications:
What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against c2004fe6622d. |
1776aca to
dcec373
Compare
|
Thanks, fixed in Changes made:
Local validation after the final rebase:
GitHub reports the PR as mergeable; the new CI run is still completing. |
dcec373 to
766d33a
Compare
|
Good catch, fixed the short-final-reply edge case in Changes made:
Local validation after rebasing onto current
|
martingarramon
left a comment
There was a problem hiding this comment.
Reviewing the latest revision after dcec373b (preserved real lastAssistant/usage) and 766d33ad (short-final-reply edge case). The factoring is now correct — synthesized mirror is purely transcript-side, real attempt-assistant retains usage/cache metadata. Three substantive concerns survive on current head:
1. Hardcoded api: "openai-responses" in synthesized mirror — src/agents/pi-embedded-runner/run/assistant-text-persistence.ts:88
reconcileAssistantTextsWithTranscript synthesizes an AssistantMessage with api: "openai-responses" regardless of the real provider. When an Anthropic / Ollama / xAI turn delivers via this reconciliation path, the durable transcript records a mirror that lies about its origin. Downstream consumers branching on api (replay hygiene, per-api transcript hooks, future analytics) will mis-route on these mirrored messages, and the bug is silent — no test fails because the only test fixtures use provider: "openai-codex".
Fix-shape: either accept api as a caller param (the call site already has params.provider available, and providers map cleanly to api modes), or use a discriminator like api: "reconciled-mirror" that downstream code can match-and-skip. Either way, add a non-openai-provider test fixture to lock the contract.
2. prePromptMessageCount semantics undocumented — assistant-text-persistence.ts:42 and call site attempt.ts:3236
The parameter slices messagesSnapshot.slice(Math.max(0, prePromptMessageCount)) with no JSDoc, no inline rationale, and no boundary test. If the caller passes a stale count (e.g., from before a repairSessionFileIfNeeded rewrite or a compaction-adjacent path), the slice silently mis-bounds: too-low → pre-prompt messages re-checked for coverage and possibly suppress valid persistence; too-high → empty slice → reconciliation always appends, including duplicates the snapshot would have caught.
Fix-shape: JSDoc on the parameter clarifying it must be the messagesSnapshot.length AT prompt entry (before any synchronous appends in this attempt), plus a unit test for boundary values 0 / messagesSnapshot.length / messagesSnapshot.length + 1.
3. messagesSnapshot.push(message) mutates caller-owned array — assistant-text-persistence.ts:91
The function mutates the input messagesSnapshot: AgentMessage[] parameter via push. The signature does not mark it readonly or Mutable*, and the caller in attempt.ts:3232 may read messagesSnapshot further along in the function — any later reads now see the synthetic mirror as if it were part of the original snapshot. Concretely: cache-observability code at attempt.ts:3242 (completePromptCacheObservation) and any post-reconcile consumer get a snapshot whose length silently grew.
If the mutation is intentional (so cacheBreak accounting reflects the reconciled message), the contract should be loud — rename to mutableMessagesSnapshot, mark as such in JSDoc, or split into resolveUnpersistedAssistantTexts (pure read) + appendReconciledAssistantText (mutating, returns new array). Today the side-effect is invisible at the call site.
What's clearly right:
- Five-condition scope guard at
attempt.ts:3225-3230(!promptError && !aborted && !yieldAborted && !timedOutDuringCompaction && !compactionOccurredThisAttempt) is the correct fence — reconciliation only on clean successful turns. - Delivery-directive stripping covers
MEDIA:/[[audio_as_voice]]/NO_REPLYcorrectly per the dedicated test case. - Short-final-reply segment-match (≤16 chars) is a sensible mitigation against incidental substring matches in long earlier assistant commentary; the regression test makes the contract explicit.
The bug being fixed is real and the factoring after sercada's two fixups is correct. Above three are residual hardening items worth resolving before merge — concern (1) is the most likely to bite on first non-openai turn that hits this path.
766d33a to
dd58d7f
Compare
|
Refresh pushed in What changed in this refresh:
Local validation on this machine:
Attempted but blocked before tests ran:
Real behavior proof is still pending; I have not claimed that gate. |
dd58d7f to
035b1b5
Compare
|
Follow-up pushed in This fixes the CI TypeScript failure from the previous refresh by passing the defined Local validation on this follow-up:
The local pre-commit hook attempted a monorepo |
035b1b5 to
88af34c
Compare
|
@clawsweeper re-review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
ClawSweeper PR egg 🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat. Where did the egg go?
|
88af34c to
9b2b46f
Compare
Summary
assistantTextswithout a matching durable assistant text message.Related #77447, #77448.
Real behavior proof (required for external PRs)
origin/mainat commits1227486ee2and9b2b46f229, using the repo runner against the actual embedded-runner transcript reconciliation path.node scripts/test-projects.mjs src/agents/pi-embedded-runner/run/attempt.test.ts src/agents/pi-embedded-runner/run/assistant-text-persistence.test.ts src/agents/pi-embedded-runner/usage-reporting.test.tsandgit diff --check HEAD~1..HEAD.MEDIA:directives, andNO_REPLYpayloads are not duplicated as visible prose. The attempt keeps the real modellastAssistantfor accounting: the regression preserves a non-zerototalTokens=125usage snapshot while the durable transcript mirror remains zero-usage.user -> assistant(toolCall) -> toolResult -> next usereven though the user had seen a final assistant answer. Before the follow-up fix, the zero-usage transcript mirror could also replacelastAssistant, risking bad usage/accounting metadata.Root Cause
assistantTextsis the user-visible final text source for some streaming/tool-call flows, but session persistence can already contain only the tool-call assistant/tool-result sequence. Nothing reconciled the delivered final text back into the durable transcript before the next turn.Testing
User-visible / Behavior Changes
Successful tool-using turns now keep delivered final assistant text in durable context for later turns. Delivery directives such as
MEDIA:and silentNO_REPLYpayloads are not persisted as assistant prose.Security Impact