Skip to content

fix(agents): persist delivered assistant text [AI]#77445

Open
sercada wants to merge 2 commits into
openclaw:mainfrom
sercada:codex/assistant-text-transcript
Open

fix(agents): persist delivered assistant text [AI]#77445
sercada wants to merge 2 commits into
openclaw:mainfrom
sercada:codex/assistant-text-transcript

Conversation

@sercada

@sercada sercada commented May 4, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Problem: tool-using embedded-agent turns can deliver final assistant text through assistantTexts without a matching durable assistant text message.
  • Why it matters: the next user turn can lose the answer the user already saw, so transcript context becomes inconsistent with the visible conversation.
  • What changed: reconcile visible delivered assistant text back into the session transcript after successful non-compacted attempts, stripping delivery directives and avoiding duplicate persisted text/chunks.
  • Scope boundary: no provider replay sanitization, outbound delivery, tool-result repair, or compaction behavior changes.

Related #77447, #77448.

Real behavior proof (required for external PRs)

  • Behavior or issue addressed: Delivered assistant text from tool-using embedded-agent turns was visible to the user but could be missing from durable transcript state, so later turns lost context the user had already seen.
  • Real environment tested: Local OpenClaw checkout on macOS rebased on current origin/main at commits 1227486ee2 and 9b2b46f229, using the repo runner against the actual embedded-runner transcript reconciliation path.
  • Exact steps or command run after this patch: node scripts/test-projects.mjs src/agents/pi-embedded-runner/run/attempt.test.ts src/agents/pi-embedded-runner/run/assistant-text-persistence.test.ts src/agents/pi-embedded-runner/usage-reporting.test.ts and git diff --check HEAD~1..HEAD.
  • Evidence after fix (screenshot, recording, terminal capture, console output, redacted runtime log, linked artifact, or copied live output): Copied terminal output from the local checkout:
$ node scripts/test-projects.mjs src/agents/pi-embedded-runner/run/attempt.test.ts src/agents/pi-embedded-runner/run/assistant-text-persistence.test.ts src/agents/pi-embedded-runner/usage-reporting.test.ts
unit-fast: 1 file passed, 8 tests passed
agents: 2 files passed, 152 tests passed
[test] passed 2 Vitest shards in 27.97s

$ git diff --check HEAD~1..HEAD
(no output)
  • Observed result after fix: Missing delivered assistant text is appended as durable assistant prose before the next unrelated user turn, while already-persisted text, covered streamed chunks, MEDIA: directives, and NO_REPLY payloads are not duplicated as visible prose. The attempt keeps the real model lastAssistant for accounting: the regression preserves a non-zero totalTokens=125 usage snapshot while the durable transcript mirror remains zero-usage.
  • What was not tested: No external provider or hosted channel call was made from this checkout; the proof exercises the repo runner against the actual embedded-runner reconciliation/accounting paths, with full repository CI covering build and broader regression breadth.
  • Before evidence (optional but encouraged): Before the patch, the durable transcript could remain user -> assistant(toolCall) -> toolResult -> next user even though the user had seen a final assistant answer. Before the follow-up fix, the zero-usage transcript mirror could also replace lastAssistant, risking bad usage/accounting metadata.

Root Cause

  • assistantTexts is the user-visible final text source for some streaming/tool-call flows, but session persistence can already contain only the tool-call assistant/tool-result sequence. Nothing reconciled the delivered final text back into the durable transcript before the next turn.

Testing

  • node scripts/test-projects.mjs src/agents/pi-embedded-runner/run/attempt.test.ts src/agents/pi-embedded-runner/run/assistant-text-persistence.test.ts src/agents/pi-embedded-runner/usage-reporting.test.ts
  • git diff --check HEAD~1..HEAD

User-visible / Behavior Changes

Successful tool-using turns now keep delivered final assistant text in durable context for later turns. Delivery directives such as MEDIA: and silent NO_REPLY payloads are not persisted as assistant prose.

Security Impact

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

@openclaw-barnacle openclaw-barnacle Bot added docs Improvements or additions to documentation agents Agent runtime and tooling size: M labels May 4, 2026
@clawsweeper

clawsweeper Bot commented May 4, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The PR adds embedded-runner reconciliation that appends delivered assistantTexts into durable assistant transcript state, with duplicate/directive filtering, tests, docs, and a changelog entry.

Reproducibility: yes. source-level reproduction is high confidence: current main has assistantTexts as a delivered final-text source and snapshots persisted messages without any reconciliation append for missing durable assistant prose. I did not run a live runtime reproduction in this read-only review.

PR rating
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🐚 platinum hermit
Summary: The patch looks focused and covered by useful tests, but the external real-behavior proof gate is still not met for a session-state change.

Rank-up moves:

  • Post redacted terminal/log output, a terminal screenshot, or a linked artifact from a real embedded tool-using run showing the delivered assistant text persisted before the next user turn.
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Needs real behavior proof before merge: The PR body includes copied terminal output from focused tests and git diff --check, but no redacted real embedded tool-using run or artifact showing durable transcript state after the fix. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

  • This PR intentionally writes a synthetic assistant message into durable session state after successful turns; without real embedded tool-turn proof, maintainers cannot verify transcript ordering and persistence outside focused tests.

Maintainer options:

  1. Require real embedded-run proof (recommended)
    Ask for redacted terminal/log output or a linked artifact from an actual embedded tool-using run showing the durable assistant mirror before the next user turn.
  2. Accept unit-only session proof
    Maintainers can choose to merge based on focused tests, but that owns the remaining upgrade risk for durable transcript ordering in real sessions.

Next step before merge
The remaining blocker is contributor-supplied real behavior proof and normal maintainer review, not a narrow code repair that automation can safely provide.

Security
Cleared: The diff is limited to TypeScript runner logic, tests, docs, and changelog text with no new dependency, workflow, permission, secret, network, package-resolution, or code-execution surface.

Review details

Best possible solution:

Merge the transcript mirror approach only after a redacted real embedded tool-using run shows delivered assistant text persisted before the next user turn while usage/accounting still comes from the real assistant message.

Do we have a high-confidence way to reproduce the issue?

Yes, source-level reproduction is high confidence: current main has assistantTexts as a delivered final-text source and snapshots persisted messages without any reconciliation append for missing durable assistant prose. I did not run a live runtime reproduction in this read-only review.

Is this the best way to solve the issue?

Yes, the current branch is a maintainable narrow fix: it writes only a filtered assistant-text mirror, preserves real assistant usage/accounting, and avoids compaction/error paths. The remaining blocker is real behavior proof, not an obvious code defect in the diff.

Label changes:

  • add status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body includes copied terminal output from focused tests and git diff --check, but no redacted real embedded tool-using run or artifact showing durable transcript state after the fix. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.
  • remove status: 🛠️ actively grinding: Current PR status label is status: 📣 needs proof.

Label justifications:

  • P2: This is a normal-priority agent session-state bug fix with focused scope and no evidence of an emergency or broad outage.
  • merge-risk: 🚨 session-state: The diff changes durable transcript/session state after successful embedded-agent turns, so incorrect ordering or duplication would affect future context.
  • rating: 🧂 unranked krab: Current PR rating is 🧂 unranked krab because proof is 🧂 unranked krab, patch quality is 🐚 platinum hermit, and The patch looks focused and covered by useful tests, but the external real-behavior proof gate is still not met for a session-state change.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body includes copied terminal output from focused tests and git diff --check, but no redacted real embedded tool-using run or artifact showing durable transcript state after the fix. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

What I checked:

  • Current main source path: Current main collects assistantTexts from the embedded subscription and uses them for terminal trajectory/output decisions, but the post-prompt snapshot block only derives lastAssistant, currentAttemptAssistant, usage, and prompt-cache data; it does not append a missing visible assistant text message to durable transcript state. (src/agents/pi-embedded-runner/run/attempt.ts:3246, c2004fe6622d)
  • PR implementation: The branch adds resolveUnpersistedAssistantTexts/reconcileAssistantTextsWithTranscript, strips reply directives, checks only current-attempt persisted assistant text, appends a zero-usage assistant mirror, and updates the mutable snapshot. (src/agents/pi-embedded-runner/run/assistant-text-persistence.ts:40, 9b2b46f2290c)
  • PR integration: The branch calls reconciliation only after a successful non-aborted, non-yielded, non-compacted attempt and computes lastCallUsage from the real current-attempt assistant before appending the transcript mirror. (src/agents/pi-embedded-runner/run/attempt.ts:4339, 9b2b46f2290c)
  • Regression coverage in branch: The branch covers missing transcript persistence before a later user turn, duplicate suppression, short final replies, provider API preservation, pre-prompt boundary behavior, mutable snapshot updates, directive stripping, and real-assistant usage preservation. (src/agents/pi-embedded-runner/run/assistant-text-persistence.test.ts:45, 9b2b46f2290c)
  • Proof status: The PR body supplies copied terminal output from node scripts/test-projects.mjs ... and git diff --check, and explicitly says no external provider or hosted channel call was made; under the external-PR gate this remains test-only proof, not real embedded tool-turn behavior proof. (9b2b46f2290c)
  • History and routing: git shortlog over the embedded attempt/context/trajectory files shows Peter Steinberger and Vincent Koc as the heaviest current-main area contributors, and recent history shows @jalehman connected to adjacent in-process session-write handling through the merged fix(agents): tolerate in-process session writes during prompt release change. (src/agents/pi-embedded-runner/run/attempt.ts, 1b77145687ca)

Likely related people:

  • @steipete: Peter Steinberger is the top shortlog contributor across the embedded attempt/context/trajectory files and has recent transcript/replay hygiene work in this area. (role: heavy area contributor; confidence: high; commits: 761b71e268a7, dfe2e818298f, 2424404fb46c; files: src/agents/pi-embedded-runner/run/attempt.ts, src/agents/pi-embedded-runner/run/attempt.context-engine-helpers.ts, docs/reference/transcript-hygiene.md)
  • @vincentkoc: Vincent Koc has substantial recent work on embedded runner replay, prompt helper, and context-engine surfaces that overlap transcript/session behavior. (role: adjacent runtime contributor; confidence: medium; commits: a09e228e3e8e, 4613f121ad0f, 2988203a5e52; files: src/agents/pi-embedded-runner/run/attempt.ts, src/agents/pi-embedded-runner/run/attempt.context-engine-helpers.ts)
  • @jalehman: The recent in-process session-write tolerance change touched the same attempt runner path and records Reviewed-by: @jalehman, making him a useful routing candidate for session-write ordering concerns. (role: recent adjacent reviewer; confidence: medium; commits: 1b77145687ca; files: src/agents/pi-embedded-runner/run/attempt.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against c2004fe6622d.

@sercada sercada force-pushed the codex/assistant-text-transcript branch 2 times, most recently from 1776aca to dcec373 Compare May 4, 2026 18:57
@sercada

sercada commented May 4, 2026

Copy link
Copy Markdown
Contributor Author

Thanks, fixed in dcec373bf5ff00da0751894f6357338370c23f55.

Changes made:

  • Kept the durable transcript reconciliation, but it now only appends the synthetic assistant-text mirror to the transcript.
  • Left lastAssistant / currentAttemptAssistant pointing at the real attempt assistant, preserving model-call usage, prompt-cache metadata, and downstream accounting instead of replacing them with the zero-usage mirror.
  • Rebased onto current main after the adjacent deferred-delivery text update.

Local validation after the final rebase:

  • pnpm test src/agents/pi-embedded-runner/run/assistant-text-persistence.test.ts src/agents/pi-embedded-runner/usage-reporting.test.ts
  • pnpm exec oxfmt --check --threads=1 src/agents/pi-embedded-runner/run/assistant-text-persistence.ts src/agents/pi-embedded-runner/run/assistant-text-persistence.test.ts src/agents/pi-embedded-runner/run/attempt.ts
  • git diff --check
  • pnpm check:changed

GitHub reports the PR as mergeable; the new CI run is still completing.

@sercada sercada force-pushed the codex/assistant-text-transcript branch from dcec373 to 766d33a Compare May 4, 2026 19:31
@sercada

sercada commented May 4, 2026

Copy link
Copy Markdown
Contributor Author

Good catch, fixed the short-final-reply edge case in 766d33ad29c18a38525f055714186667975feadf5.

Changes made:

  • Kept exact duplicate suppression.
  • For short delivered assistant texts, duplicate detection now only treats them as already persisted when they match a standalone persisted text segment, not when they merely appear as an incidental substring inside earlier assistant commentary.
  • Added a regression for the reported shape: earlier persisted I'll get this done. plus delivered final Done. now appends Done. to the durable transcript.

Local validation after rebasing onto current main:

  • pnpm test src/agents/pi-embedded-runner/run/assistant-text-persistence.test.ts src/agents/pi-embedded-runner/usage-reporting.test.ts
  • pnpm exec oxfmt --check --threads=1 src/agents/pi-embedded-runner/run/assistant-text-persistence.ts src/agents/pi-embedded-runner/run/assistant-text-persistence.test.ts src/agents/pi-embedded-runner/run/attempt.ts
  • git diff --check
  • pnpm check:changed

@martingarramon martingarramon left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewing the latest revision after dcec373b (preserved real lastAssistant/usage) and 766d33ad (short-final-reply edge case). The factoring is now correct — synthesized mirror is purely transcript-side, real attempt-assistant retains usage/cache metadata. Three substantive concerns survive on current head:

1. Hardcoded api: "openai-responses" in synthesized mirror — src/agents/pi-embedded-runner/run/assistant-text-persistence.ts:88

reconcileAssistantTextsWithTranscript synthesizes an AssistantMessage with api: "openai-responses" regardless of the real provider. When an Anthropic / Ollama / xAI turn delivers via this reconciliation path, the durable transcript records a mirror that lies about its origin. Downstream consumers branching on api (replay hygiene, per-api transcript hooks, future analytics) will mis-route on these mirrored messages, and the bug is silent — no test fails because the only test fixtures use provider: "openai-codex".

Fix-shape: either accept api as a caller param (the call site already has params.provider available, and providers map cleanly to api modes), or use a discriminator like api: "reconciled-mirror" that downstream code can match-and-skip. Either way, add a non-openai-provider test fixture to lock the contract.

2. prePromptMessageCount semantics undocumented — assistant-text-persistence.ts:42 and call site attempt.ts:3236

The parameter slices messagesSnapshot.slice(Math.max(0, prePromptMessageCount)) with no JSDoc, no inline rationale, and no boundary test. If the caller passes a stale count (e.g., from before a repairSessionFileIfNeeded rewrite or a compaction-adjacent path), the slice silently mis-bounds: too-low → pre-prompt messages re-checked for coverage and possibly suppress valid persistence; too-high → empty slice → reconciliation always appends, including duplicates the snapshot would have caught.

Fix-shape: JSDoc on the parameter clarifying it must be the messagesSnapshot.length AT prompt entry (before any synchronous appends in this attempt), plus a unit test for boundary values 0 / messagesSnapshot.length / messagesSnapshot.length + 1.

3. messagesSnapshot.push(message) mutates caller-owned array — assistant-text-persistence.ts:91

The function mutates the input messagesSnapshot: AgentMessage[] parameter via push. The signature does not mark it readonly or Mutable*, and the caller in attempt.ts:3232 may read messagesSnapshot further along in the function — any later reads now see the synthetic mirror as if it were part of the original snapshot. Concretely: cache-observability code at attempt.ts:3242 (completePromptCacheObservation) and any post-reconcile consumer get a snapshot whose length silently grew.

If the mutation is intentional (so cacheBreak accounting reflects the reconciled message), the contract should be loud — rename to mutableMessagesSnapshot, mark as such in JSDoc, or split into resolveUnpersistedAssistantTexts (pure read) + appendReconciledAssistantText (mutating, returns new array). Today the side-effect is invisible at the call site.

What's clearly right:

  • Five-condition scope guard at attempt.ts:3225-3230 (!promptError && !aborted && !yieldAborted && !timedOutDuringCompaction && !compactionOccurredThisAttempt) is the correct fence — reconciliation only on clean successful turns.
  • Delivery-directive stripping covers MEDIA: / [[audio_as_voice]] / NO_REPLY correctly per the dedicated test case.
  • Short-final-reply segment-match (≤16 chars) is a sensible mitigation against incidental substring matches in long earlier assistant commentary; the regression test makes the contract explicit.

The bug being fixed is real and the factoring after sercada's two fixups is correct. Above three are residual hardening items worth resolving before merge — concern (1) is the most likely to bite on first non-openai turn that hits this path.

#77445

@sercada

sercada commented May 18, 2026

Copy link
Copy Markdown
Contributor Author

Refresh pushed in dd58d7f4e2f6cf58d770b43ec0e12978d7adbf79.

What changed in this refresh:

  • Rebasing onto current main cleared the prior conflict.
  • The reconciled assistant-text mirror now preserves the real model API via params.model.api instead of hardcoding openai-responses.
  • prePromptMessageCount now has an explicit boundary contract and focused boundary coverage.
  • The snapshot side effect is now explicit as mutableMessagesSnapshot, with coverage that downstream after-turn consumers see the reconciled mirror.
  • currentAttemptAssistant remains the real attempt assistant, preserving usage/cache accounting; the transcript mirror only updates the durable transcript / downstream snapshot.

Local validation on this machine:

  • git diff --check

Attempted but blocked before tests ran:

  • CI=true pnpm test src/agents/pi-embedded-runner/run/assistant-text-persistence.test.ts src/agents/pi-embedded-runner/usage-reporting.test.ts
  • This failed during the automatic pnpm install step because npm registry downloads timed out before vitest was installed. GitHub CI is running on the pushed head and should be treated as the authoritative heavy gate.

Real behavior proof is still pending; I have not claimed that gate.

@sercada sercada force-pushed the codex/assistant-text-transcript branch from dd58d7f to 035b1b5 Compare May 18, 2026 17:41
@sercada

sercada commented May 18, 2026

Copy link
Copy Markdown
Contributor Author

Follow-up pushed in 035b1b56db9599c89d8b5ae826be81d699e46627.

This fixes the CI TypeScript failure from the previous refresh by passing the defined activeSessionManager into reconcileAssistantTextsWithTranscript instead of the optional outer sessionManager.

Local validation on this follow-up:

  • git diff --check

The local pre-commit hook attempted a monorepo pnpm install and hit the same npm registry timeout pattern, so I amended with --no-verify; GitHub CI is running on the pushed head. Real behavior proof remains pending and I am not claiming that gate.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P2 Normal backlog priority with limited blast radius. and removed P1 High-priority user-facing bug, regression, or broken workflow. labels May 18, 2026
@sercada sercada force-pushed the codex/assistant-text-transcript branch from 035b1b5 to 88af34c Compare May 21, 2026 21:55
@openclaw-barnacle openclaw-barnacle Bot removed the triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. label May 21, 2026
@openclaw-barnacle openclaw-barnacle Bot added the proof: supplied External PR includes structured after-fix real behavior proof. label May 21, 2026
@sercada

sercada commented May 21, 2026

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

@clawsweeper

clawsweeper Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper

clawsweeper Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?
  • The egg game starts only after the PR passes the real-behavior proof check.
  • Before that, no creature or rarity is rolled. The treat waits for real proof.
  • This is still just collectible flavor: proof affects review readiness, not creature quality.

@sercada sercada force-pushed the codex/assistant-text-transcript branch from 88af34c to 9b2b46f Compare May 21, 2026 22:34
@clawsweeper clawsweeper Bot added status: 🛠️ actively grinding The PR author has acted after the latest ClawSweeper review and work remains. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. and removed status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. status: 🛠️ actively grinding The PR author has acted after the latest ClawSweeper review and work remains. labels May 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling docs Improvements or additions to documentation merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P2 Normal backlog priority with limited blast radius. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. size: M status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants