fix(agents): prevent Bedrock replay death loop on empty assistant content#71627
Conversation
🔒 Aisle Security AnalysisWe found 2 potential security issue(s) in this PR:
1. 🟡 Symlink/TOCTOU arbitrary file overwrite via session file repair backup/tmp paths
DescriptionThe session repair routine performs multiple writes to paths derived from If an attacker can influence where
Vulnerable code: const backupPath = `${sessionFile}.bak-${process.pid}-${Date.now()}`;
const tmpPath = `${sessionFile}.repair-${process.pid}-${Date.now()}.tmp`;
...
await fs.writeFile(backupPath, content, "utf-8");
...
await fs.writeFile(tmpPath, cleaned, "utf-8");
...
await fs.rename(tmpPath, sessionFile);Impact depends on deployment assumptions, but in a multi-user environment or where session storage directories are writable by other principals, this pattern is a common primitive for arbitrary file clobbering via symlinks/races. RecommendationHarden the repair writes against symlink/race attacks and constrain repairs to trusted session directories. Recommended mitigations (apply as appropriate for your threat model):
Example (sketch): const dir = path.dirname(sessionFile);
const base = path.resolve(TRUSTED_SESSIONS_DIR);
const resolved = path.resolve(sessionFile);
if (!resolved.startsWith(base + path.sep)) throw new Error("invalid sessionFile path");
const st = await fs.lstat(sessionFile);
if (st.isSymbolicLink()) throw new Error("refuse symlink sessionFile");
const backupPath = path.join(dir, `${path.basename(sessionFile)}.bak-${process.pid}-${Date.now()}`);
const backupFd = await fs.open(backupPath, "wx", st.mode);
await backupFd.writeFile(content, "utf-8");
await backupFd.close();
const tmpFd = await fs.open(tmpPath, "wx", st.mode);
await tmpFd.writeFile(cleaned, "utf-8");
await tmpFd.close();
await fs.rename(tmpPath, sessionFile);Also consider 2. 🟡 Diagnostic events global state can be poisoned via globalThis Symbol.for key (configurable) leading to event exfiltration/DoS
Description
If any other code running in the same Node.js process (e.g., plugins, dynamically loaded dependencies) can execute before or after this module loads, it can:
Impact depends on what diagnostic payloads contain, but diagnostic events include tool execution, model calls, message delivery, and log events; these commonly carry identifiers and operational metadata and may include sensitive operational details in some deployments. Vulnerable code: const DIAGNOSTIC_EVENTS_STATE_KEY = Symbol.for("openclaw.diagnosticEvents.state.v1");
...
const existing = globalRecord[DIAGNOSTIC_EVENTS_STATE_KEY];
if (isDiagnosticEventsState(existing)) {
return existing;
}
Object.defineProperty(globalThis, DIAGNOSTIC_EVENTS_STATE_KEY, {
configurable: true,
enumerable: false,
value: state,
writable: false,
});This creates a global state poisoning surface that can lead to:
RecommendationHarden the diagnostic singleton so it cannot be swapped/poisoned via globally discoverable keys:
Example hardening: const DIAGNOSTIC_EVENTS_STATE_KEY = Symbol("openclaw.diagnosticEvents.state.v1");
let cached: DiagnosticEventsGlobalState | undefined;
function getDiagnosticEventsState(): DiagnosticEventsGlobalState {
if (cached) return cached;
const globalRecord = globalThis as Record<PropertyKey, unknown>;
const existing = globalRecord[DIAGNOSTIC_EVENTS_STATE_KEY];
if (isDiagnosticEventsState(existing)) {
cached = existing;
return existing;
}
const state = createDiagnosticEventsState();
Object.defineProperty(globalThis, DIAGNOSTIC_EVENTS_STATE_KEY, {
configurable: false,
enumerable: false,
value: state,
writable: false,
});
cached = state;
return state;
}Also consider rejecting/repairing an Analyzed PR: #71627 at commit Last updated on: 2026-04-25T17:59:24Z |
Greptile SummaryThis PR fixes the Bedrock Converse replay death loop by addressing three hygiene gaps: One minor description inaccuracy worth noting: the PR description states Confidence Score: 4/5Safe to merge — no P0 or P1 issues; all three repair layers are well-tested and the on-disk repair is atomic with backup. All changed files have unit test coverage pinning the critical invariants (non-empty content, sentinel identity, idempotency, mirror filtering). The file-repair path reuses the existing backup-and-rename machinery, so crash safety is unchanged. No P0 or P1 issues found. No files require special attention; replay-history.ts change is the most impactful but is well-guarded by the existing strict-provider ordering fallback. Reviews (1): Last reviewed commit: "fix(agents): prevent Bedrock replay deat..." | Re-trigger Greptile |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 350a9b7861
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
5d76dd3 to
664cf7b
Compare
664cf7b to
2388602
Compare
d280b36 to
5899a93
Compare
5899a93 to
29b69a1
Compare
29b69a1 to
042f29d
Compare
|
Landed via temp rebase onto
Thanks @openperf! |
…tent (openclaw#71627) * fix(agents): prevent Bedrock replay death loop on empty assistant content Fixes openclaw#71572 * docs: document Bedrock replay repair (openclaw#71627) (thanks @openperf) * fix(diagnostics): share diagnostic event state across sdk graphs --------- Co-authored-by: Peter Steinberger <steipete@gmail.com>
…tent (openclaw#71627) * fix(agents): prevent Bedrock replay death loop on empty assistant content Fixes openclaw#71572 * docs: document Bedrock replay repair (openclaw#71627) (thanks @openperf) * fix(diagnostics): share diagnostic event state across sdk graphs --------- Co-authored-by: Peter Steinberger <steipete@gmail.com>
…tent (openclaw#71627) * fix(agents): prevent Bedrock replay death loop on empty assistant content Fixes openclaw#71572 * docs: document Bedrock replay repair (openclaw#71627) (thanks @openperf) * fix(diagnostics): share diagnostic event state across sdk graphs --------- Co-authored-by: Peter Steinberger <steipete@gmail.com>
Summary
{"role":"assistant","content":[],"stopReason":"error",...}to the session JSONL viabuildStreamErrorAssistantMessage(src/agents/stream-message-shared.ts:75-90). On the next turn Bedrock rejects the replay with"The content field in the Message object at messages.N is empty. Add a ContentBlock object to the content field and try again."Each subsequent user message replays the same poisoned turn, so the session enters a death loop and a gateway restart does not help (the JSONL is already corrupt). Affected sessions reported in Bedrock replay can enter empty assistant content death loop #71572 cover both embedded agent runs and Feishu channel delivery.buildStreamErrorAssistantMessagecreates the error turn withcontent: []. For Bedrock Converse, an assistant message must contain at least oneContentBlock, so this entry is invalid the moment it is persisted.normalizeAssistantReplayContentinsrc/agents/pi-embedded-runner/replay-history.tsonly normalises legacystringcontent; it skips empty arrays entirely. It also forwards openclaw transcript-only mirror entries (provider:"openclaw",model:"delivery-mirror"fromsrc/config/sessions/transcript.ts:128andmodel:"gateway-injected"fromsrc/gateway/server-methods/chat-transcript-inject.ts:89) to the actual provider, which duplicates content and, on strict providers, breaks turn-ordering rules.repairSessionFileIfNeededinsrc/agents/session-file-repair.tsonly drops malformed JSONL lines. The poisoned line is valid JSON, so repair leaves it untouched and the death loop survives every restart.stopReason: "error"turns — emptycontent: []is also legitimately produced by the silent-reply / NO_REPLY path (stopReason: "stop",output: 0) locked in bysrc/agents/pi-embedded-runner/run.empty-error-retry.test.ts. Substituting a failure sentinel into those turns would fabricate a failure statement in the next provider request and change model behavior even when no failure occurred, so they are left untouched.buildStreamErrorAssistantMessagewrites a single canonical sentinel text block (STREAM_ERROR_FALLBACK_TEXT = "[assistant turn failed before producing content]") intocontent. The raw provider error string stays in the peererrorMessagefield and is intentionally NOT placed incontentbecause that array is replayed back to the model on the next turn — provider error strings can carry hostnames or upstream metadata, so replaying them as assistant content opens a prompt-injection / information-disclosure surface (CWE-200). Clients/UIs readerrorMessagedirectly; providers do not include it in their wire payloads.normalizeAssistantReplayContentnow (a) drops openclawdelivery-mirror/gateway-injectedtranscript-only assistant entries from the in-memory replay copy (the persisted JSONL is unchanged, so user-facing transcript surfaces stay intact) and (b) substitutes the sentinel text block whencontentis an empty array andstopReason === "error". Existing legacy string-content wrapping behavior is preserved.repairSessionFileIfNeededrewrites persisted assistant entries that havecontent: []andstopReason: "error"to the sentinel block, in addition to the existing malformed-JSONL drop. Idempotent: a session with no eligible entries returnsrepaired:falseand bytes on disk are unchanged.stream-message-shared.tsand imported by both replay-history and session-file-repair so a session repaired offline reads byte-identically to a live stream-error turn — that byte-identity is what keeps the repair pass idempotent.src/agents/stream-message-shared.ts— exportSTREAM_ERROR_FALLBACK_TEXT;buildStreamErrorAssistantMessagereturns a non-empty content block carrying only the sentinel (raw error stays inerrorMessage).src/agents/pi-embedded-runner/replay-history.ts— extendnormalizeAssistantReplayContentto drop transcript-only openclaw assistant entries and to substitute the sentinel for empty content arrays gated onstopReason === "error"; preserve original-array-reference fast path when nothing changes.src/agents/session-file-repair.ts— extendrepairSessionFileIfNeededto detect-and-rewrite assistant entries with empty content andstopReason: "error"; addrewrittenAssistantMessagestoRepairReport(additive, optional); warn message reports only the non-zero counters via a smallbuildRepairSummaryPartshelper; combine drop and rewrite signals into therepairedflag.src/agents/stream-message-shared.test.ts(new, co-located) — covers: never returns empty content; content carries only the sentinel and never echoes raw error text (CWE-200 guard); same sentinel whenerrorMessageis blank.src/agents/pi-embedded-runner/replay-history.test.ts(new, co-located) — covers: empty content array becomes sentinel only when stopReason=error; silent-reply turn (stopReason:"stop",content:[]) is preserved untouched; legacy string-content wrapping still works;delivery-mirrorandgateway-injectedare filtered from replay; original array reference returned when nothing changes.src/agents/session-file-repair.test.ts— adds: persisted assistantcontent:[]withstopReason:"error"is rewritten with backup; warn message reports only non-zero counters; drop+rewrite warn reports both phrasings; silent-reply on disk (stopReason:"stop",content:[]) is NOT rewritten; idempotent on a healed session.normalizeAssistantReplayContent; UI/transcript history surfaces remain unaffected.stopReason:"stop",content:[]) are never rewritten — neither in replay nor on disk. The fix is exclusively scoped to the death-loop-causing error turns.errorMessagefield on the assistant message is preserved verbatim, including blank/whitespace input. Aisle's CWE-200 finding onerrorMessagefield redaction is intentionally out of scope: that field is the primary diagnostic signal UIs and operator log paths read to surface failure causes to users; redaction is a cross-cutting concern across all error paths and belongs in a separate hardening PR (or storage/transport-layer access control), not in this single message constructor.RepairReportonly gains an optional field (rewrittenAssistantMessages?: number); existing callers that ignore the return value or only readrepaired/droppedLinesare unaffected.repairSessionFileIfNeededis pre-existing behavior, not introduced by this PR — the entirefs.readFile/fs.writeFilepath is older than this fix. Streaming reads are independent hardening worth their own PR.buildStreamErrorAssistantMessage(extensions/ollama/src/stream.ts:289) intentionally left untouched here; it is provider-scoped and not in the Bedrock replay path.Reproduction
anthropic.claude-3-haikuor any other Bedrock-hosted model).ValidationException, or any error path that flows throughbuildStreamErrorAssistantMessage)."content": [],"stopReason": "error"."The content field in the Message object at messages.N is empty. Add a ContentBlock object to the content field and try again."After this fix:
"content": [{"type":"text","text": "[assistant turn failed before producing content]"}]. The persisted turn is replay-valid; the original error string remains available inerrorMessagefor clients/UIs.delivery-mirror/gateway-injectedmirror entries from the provider request, removing a separate class of strict-provider rejections.stopReason:"stop",content:[]) are preserved unchanged on every code path.Risk / Mitigation
buildStreamErrorAssistantMessageshape could affect downstream consumers that special-casecontent.length === 0.src/agents/openai-ws-stream.ts:1284,src/agents/pi-embedded-runner/run/attempt.stop-reason-recovery.ts:49,78,119) pass the value through to the agent loop's error event, which already iteratescontentdefensively. TheerrorMessagefield is preserved verbatim for clients that surface it separately.delivery-mirror/gateway-injectedfrom replay could remove user-visible turns from the model's context.transcript.ts:264 isRedundantDeliveryMirror). Persisted JSONL is unchanged, so transcript history surfaces (WebChat, TUI, REST, SSE) keep showing them. The strict-provider ordering fallback atreplay-history.ts:564-574is unchanged and remains the safety net..bak-${pid}-${ts}backup before any rewrite and uses an atomictmp → renameswap. The new rewrite path reuses the same write/backup machinery; failure paths still clean up the tmp file and returnrepaired:falsewith areason. ThestopReason === "error"gate ensures only failed turns are eligible — silent-reply turns are preserved as-is. Idempotency test ensures a second call is a no-op on a healed session.rewrittenAssistantMessagestoRepairReportcould be a typed-API regression for external callers.?:) and additive; all in-tree callers either ignore the return value or only readrepaired/droppedLines/reason. No external consumers found in the repo.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Fixes #71572