Skip to content

fix(agents): prevent Bedrock replay death loop on empty assistant content#71627

Merged
steipete merged 3 commits intoopenclaw:mainfrom
openperf:fix/bedrock-replay-empty-content-71572-v2
Apr 25, 2026
Merged

fix(agents): prevent Bedrock replay death loop on empty assistant content#71627
steipete merged 3 commits intoopenclaw:mainfrom
openperf:fix/bedrock-replay-empty-content-71572-v2

Conversation

@openperf
Copy link
Copy Markdown
Member

@openperf openperf commented Apr 25, 2026

Summary

  • Problem: When AWS Bedrock Converse fires a stream/provider error before any assistant content block is produced, OpenClaw persists {"role":"assistant","content":[],"stopReason":"error",...} to the session JSONL via buildStreamErrorAssistantMessage (src/agents/stream-message-shared.ts:75-90). On the next turn Bedrock rejects the replay with "The content field in the Message object at messages.N is empty. Add a ContentBlock object to the content field and try again." Each subsequent user message replays the same poisoned turn, so the session enters a death loop and a gateway restart does not help (the JSONL is already corrupt). Affected sessions reported in Bedrock replay can enter empty assistant content death loop #71572 cover both embedded agent runs and Feishu channel delivery.
  • Root Cause: Three transcript-hygiene gaps line up to make the failure permanent:
    1. buildStreamErrorAssistantMessage creates the error turn with content: []. For Bedrock Converse, an assistant message must contain at least one ContentBlock, so this entry is invalid the moment it is persisted.
    2. normalizeAssistantReplayContent in src/agents/pi-embedded-runner/replay-history.ts only normalises legacy string content; it skips empty arrays entirely. It also forwards openclaw transcript-only mirror entries (provider:"openclaw", model:"delivery-mirror" from src/config/sessions/transcript.ts:128 and model:"gateway-injected" from src/gateway/server-methods/chat-transcript-inject.ts:89) to the actual provider, which duplicates content and, on strict providers, breaks turn-ordering rules.
    3. repairSessionFileIfNeeded in src/agents/session-file-repair.ts only drops malformed JSONL lines. The poisoned line is valid JSON, so repair leaves it untouched and the death loop survives every restart.
  • Fix: Address each gap at its origin so a poisoned session cannot form, an in-flight session cannot replay poison even from older builds, and a session already poisoned on disk recovers without manual surgery. The fixes are scoped strictly to stopReason: "error" turns — empty content: [] is also legitimately produced by the silent-reply / NO_REPLY path (stopReason: "stop", output: 0) locked in by src/agents/pi-embedded-runner/run.empty-error-retry.test.ts. Substituting a failure sentinel into those turns would fabricate a failure statement in the next provider request and change model behavior even when no failure occurred, so they are left untouched.
    1. buildStreamErrorAssistantMessage writes a single canonical sentinel text block (STREAM_ERROR_FALLBACK_TEXT = "[assistant turn failed before producing content]") into content. The raw provider error string stays in the peer errorMessage field and is intentionally NOT placed in content because that array is replayed back to the model on the next turn — provider error strings can carry hostnames or upstream metadata, so replaying them as assistant content opens a prompt-injection / information-disclosure surface (CWE-200). Clients/UIs read errorMessage directly; providers do not include it in their wire payloads.
    2. normalizeAssistantReplayContent now (a) drops openclaw delivery-mirror / gateway-injected transcript-only assistant entries from the in-memory replay copy (the persisted JSONL is unchanged, so user-facing transcript surfaces stay intact) and (b) substitutes the sentinel text block when content is an empty array and stopReason === "error". Existing legacy string-content wrapping behavior is preserved.
    3. repairSessionFileIfNeeded rewrites persisted assistant entries that have content: [] and stopReason: "error" to the sentinel block, in addition to the existing malformed-JSONL drop. Idempotent: a session with no eligible entries returns repaired:false and bytes on disk are unchanged.
    • The sentinel string is exported from stream-message-shared.ts and imported by both replay-history and session-file-repair so a session repaired offline reads byte-identically to a live stream-error turn — that byte-identity is what keeps the repair pass idempotent.
  • What changed:
    • src/agents/stream-message-shared.ts — export STREAM_ERROR_FALLBACK_TEXT; buildStreamErrorAssistantMessage returns a non-empty content block carrying only the sentinel (raw error stays in errorMessage).
    • src/agents/pi-embedded-runner/replay-history.ts — extend normalizeAssistantReplayContent to drop transcript-only openclaw assistant entries and to substitute the sentinel for empty content arrays gated on stopReason === "error"; preserve original-array-reference fast path when nothing changes.
    • src/agents/session-file-repair.ts — extend repairSessionFileIfNeeded to detect-and-rewrite assistant entries with empty content and stopReason: "error"; add rewrittenAssistantMessages to RepairReport (additive, optional); warn message reports only the non-zero counters via a small buildRepairSummaryParts helper; combine drop and rewrite signals into the repaired flag.
    • src/agents/stream-message-shared.test.ts (new, co-located) — covers: never returns empty content; content carries only the sentinel and never echoes raw error text (CWE-200 guard); same sentinel when errorMessage is blank.
    • src/agents/pi-embedded-runner/replay-history.test.ts (new, co-located) — covers: empty content array becomes sentinel only when stopReason=error; silent-reply turn (stopReason:"stop", content:[]) is preserved untouched; legacy string-content wrapping still works; delivery-mirror and gateway-injected are filtered from replay; original array reference returned when nothing changes.
    • src/agents/session-file-repair.test.ts — adds: persisted assistant content:[] with stopReason:"error" is rewritten with backup; warn message reports only non-zero counters; drop+rewrite warn reports both phrasings; silent-reply on disk (stopReason:"stop", content:[]) is NOT rewritten; idempotent on a healed session.
  • What did NOT change (scope boundary):
    • No persisted JSONL entry is dropped from disk. Mirror filtering happens only in the in-memory replay copy in normalizeAssistantReplayContent; UI/transcript history surfaces remain unaffected.
    • Silent-reply turns (stopReason:"stop", content:[]) are never rewritten — neither in replay nor on disk. The fix is exclusively scoped to the death-loop-causing error turns.
    • errorMessage field on the assistant message is preserved verbatim, including blank/whitespace input. Aisle's CWE-200 finding on errorMessage field redaction is intentionally out of scope: that field is the primary diagnostic signal UIs and operator log paths read to surface failure causes to users; redaction is a cross-cutting concern across all error paths and belongs in a separate hardening PR (or storage/transport-layer access control), not in this single message constructor.
    • RepairReport only gains an optional field (rewrittenAssistantMessages?: number); existing callers that ignore the return value or only read repaired/droppedLines are unaffected.
    • Aisle's CWE-400 (unbounded read) finding on repairSessionFileIfNeeded is pre-existing behavior, not introduced by this PR — the entire fs.readFile/fs.writeFile path is older than this fix. Streaming reads are independent hardening worth their own PR.
    • No changes to provider plugins, transcript-policy, image sanitization, tool-call repair, compaction, or session-file format/version.
    • The Ollama plugin has its own local buildStreamErrorAssistantMessage (extensions/ollama/src/stream.ts:289) intentionally left untouched here; it is provider-scoped and not in the Bedrock replay path.

Reproduction

  1. Run an OpenClaw 2026.4.23 / 2026.4.25 gateway against AWS Bedrock Converse (e.g. anthropic.claude-3-haiku or any other Bedrock-hosted model).
  2. From an embedded agent or Feishu channel, send a prompt that triggers a stream/provider error before the model emits any content block (network drop, transient ValidationException, or any error path that flows through buildStreamErrorAssistantMessage).
  3. Observe the session JSONL: the new assistant entry has "content": [], "stopReason": "error".
  4. Send any next user message in the same session. The Bedrock SDK rejects with "The content field in the Message object at messages.N is empty. Add a ContentBlock object to the content field and try again."
  5. Restart the gateway and retry — the same validation error fires because the JSONL is unchanged.

After this fix:

  • Step 3 produces "content": [{"type":"text","text": "[assistant turn failed before producing content]"}]. The persisted turn is replay-valid; the original error string remains available in errorMessage for clients/UIs.
  • For sessions already poisoned by older builds, the repair pass on next load rewrites the empty-content error entry to a sentinel text block (with backup) and the session resumes; rerunning repair is a no-op.
  • Replay drops openclaw delivery-mirror / gateway-injected mirror entries from the provider request, removing a separate class of strict-provider rejections.
  • Silent-reply turns (stopReason:"stop", content:[]) are preserved unchanged on every code path.

Risk / Mitigation

  • Risk: Changing buildStreamErrorAssistantMessage shape could affect downstream consumers that special-case content.length === 0.
    • Mitigation: All call sites surveyed (src/agents/openai-ws-stream.ts:1284, src/agents/pi-embedded-runner/run/attempt.stop-reason-recovery.ts:49,78,119) pass the value through to the agent loop's error event, which already iterates content defensively. The errorMessage field is preserved verbatim for clients that surface it separately.
  • Risk: Filtering delivery-mirror / gateway-injected from replay could remove user-visible turns from the model's context.
    • Mitigation: These entries are transcript-only by design (see transcript.ts:264 isRedundantDeliveryMirror). Persisted JSONL is unchanged, so transcript history surfaces (WebChat, TUI, REST, SSE) keep showing them. The strict-provider ordering fallback at replay-history.ts:564-574 is unchanged and remains the safety net.
  • Risk: Repair rewriting on-disk content could damage a session if the rewrite path crashes or rewrites a non-error turn.
    • Mitigation: Repair already writes a .bak-${pid}-${ts} backup before any rewrite and uses an atomic tmp → rename swap. The new rewrite path reuses the same write/backup machinery; failure paths still clean up the tmp file and return repaired:false with a reason. The stopReason === "error" gate ensures only failed turns are eligible — silent-reply turns are preserved as-is. Idempotency test ensures a second call is a no-op on a healed session.
  • Risk: Adding rewrittenAssistantMessages to RepairReport could be a typed-API regression for external callers.
    • Mitigation: Field is optional (?:) and additive; all in-tree callers either ignore the return value or only read repaired/droppedLines/reason. No external consumers found in the repo.

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Gateway
  • Agents (embedded runner / replay)
  • Session storage (file repair)
  • Provider replay (Bedrock Converse)

Linked Issue/PR

Fixes #71572

@aisle-research-bot
Copy link
Copy Markdown

aisle-research-bot Bot commented Apr 25, 2026

🔒 Aisle Security Analysis

We found 2 potential security issue(s) in this PR:

# Severity Title
1 🟡 Medium Symlink/TOCTOU arbitrary file overwrite via session file repair backup/tmp paths
2 🟡 Medium Diagnostic events global state can be poisoned via globalThis Symbol.for key (configurable) leading to event exfiltration/DoS
1. 🟡 Symlink/TOCTOU arbitrary file overwrite via session file repair backup/tmp paths
Property Value
Severity Medium
CWE CWE-59
Location src/agents/session-file-repair.ts:144-157

Description

The session repair routine performs multiple writes to paths derived from sessionFile without protecting against symlink attacks or time-of-check/time-of-use (TOCTOU) races.

If an attacker can influence where sessionFile points (or can create files/symlinks in the same directory), they may be able to cause the process to overwrite arbitrary files accessible to the process:

  • backupPath and tmpPath are computed from sessionFile and then written with fs.writeFile(...).
  • The code does not use lstat/open(..., O_NOFOLLOW)-style protections, does not ensure the paths are not symlinks, and does not use exclusive creation ('wx') for the backup/tmp files.
  • The change in this diff expands when repair runs (it now rewrites syntactically-valid assistant entries with content: [] + stopReason: "error"), increasing the situations where these unsafe writes occur.

Vulnerable code:

const backupPath = `${sessionFile}.bak-${process.pid}-${Date.now()}`;
const tmpPath = `${sessionFile}.repair-${process.pid}-${Date.now()}.tmp`;
...
await fs.writeFile(backupPath, content, "utf-8");
...
await fs.writeFile(tmpPath, cleaned, "utf-8");
...
await fs.rename(tmpPath, sessionFile);

Impact depends on deployment assumptions, but in a multi-user environment or where session storage directories are writable by other principals, this pattern is a common primitive for arbitrary file clobbering via symlinks/races.

Recommendation

Harden the repair writes against symlink/race attacks and constrain repairs to trusted session directories.

Recommended mitigations (apply as appropriate for your threat model):

  1. Constrain sessionFile to a known base directory and reject paths that escape it (e.g., via path.resolve + prefix check).
  2. Refuse to operate on symlinks by using lstat on sessionFile and on the directory, and/or open with no-follow flags where available.
  3. Create temp/backup files safely:
    • Use exclusive creation to prevent clobbering existing files.
    • Prefer fs.open(tmpPath, 'wx', mode) and write to the file descriptor.
    • Consider creating the temp file via fs.mkdtemp in the same directory and writing within it.

Example (sketch):

const dir = path.dirname(sessionFile);
const base = path.resolve(TRUSTED_SESSIONS_DIR);
const resolved = path.resolve(sessionFile);
if (!resolved.startsWith(base + path.sep)) throw new Error("invalid sessionFile path");

const st = await fs.lstat(sessionFile);
if (st.isSymbolicLink()) throw new Error("refuse symlink sessionFile");

const backupPath = path.join(dir, `${path.basename(sessionFile)}.bak-${process.pid}-${Date.now()}`);
const backupFd = await fs.open(backupPath, "wx", st.mode);
await backupFd.writeFile(content, "utf-8");
await backupFd.close();

const tmpFd = await fs.open(tmpPath, "wx", st.mode);
await tmpFd.writeFile(cleaned, "utf-8");
await tmpFd.close();
await fs.rename(tmpPath, sessionFile);

Also consider fsync/directory fsync if durability/atomicity guarantees matter.

2. 🟡 Diagnostic events global state can be poisoned via globalThis Symbol.for key (configurable) leading to event exfiltration/DoS
Property Value
Severity Medium
CWE CWE-200
Location src/infra/diagnostic-events.ts:393-451

Description

getDiagnosticEventsState() stores and later retrieves the diagnostic-events singleton from globalThis using a globally-discoverable Symbol.for(...) key. The property is defined as configurable: true, and getDiagnosticEventsState() re-reads the value from globalThis on every call.

If any other code running in the same Node.js process (e.g., plugins, dynamically loaded dependencies) can execute before or after this module loads, it can:

  • Pre-populate globalThis[Symbol.for("openclaw.diagnosticEvents.state.v1")] with an object that passes isDiagnosticEventsState() and contains attacker-controlled listeners/asyncQueue.
  • Replace the state after initialization because the global property is configurable: true (redefinable even when writable: false). Because getDiagnosticEventsState() re-reads from globalThis, subsequent diagnostic emissions will use the attacker-supplied state.

Impact depends on what diagnostic payloads contain, but diagnostic events include tool execution, model calls, message delivery, and log events; these commonly carry identifiers and operational metadata and may include sensitive operational details in some deployments.

Vulnerable code:

const DIAGNOSTIC_EVENTS_STATE_KEY = Symbol.for("openclaw.diagnosticEvents.state.v1");
...
const existing = globalRecord[DIAGNOSTIC_EVENTS_STATE_KEY];
if (isDiagnosticEventsState(existing)) {
  return existing;
}
Object.defineProperty(globalThis, DIAGNOSTIC_EVENTS_STATE_KEY, {
  configurable: true,
  enumerable: false,
  value: state,
  writable: false,
});

This creates a global state poisoning surface that can lead to:

  • Information disclosure: attacker-controlled listeners can receive diagnostic events.
  • Denial of service: attacker can install a state with a huge asyncQueue or listeners that throw/loop, increasing CPU/memory pressure.

Recommendation

Harden the diagnostic singleton so it cannot be swapped/poisoned via globally discoverable keys:

  • Prefer module-local state (no globalThis) unless cross-bundle deduplication is required.
  • If globalThis is required:
    • Use a non-global symbol (Symbol(...)) held in module scope rather than Symbol.for(...).
    • Make the property non-configurable so it cannot be replaced after initialization.
    • Validate more strictly (e.g., ensure listeners uses the built-in Set iterator; cap asyncQueue length even when adopting an existing state).

Example hardening:

const DIAGNOSTIC_EVENTS_STATE_KEY = Symbol("openclaw.diagnosticEvents.state.v1");

let cached: DiagnosticEventsGlobalState | undefined;

function getDiagnosticEventsState(): DiagnosticEventsGlobalState {
  if (cached) return cached;

  const globalRecord = globalThis as Record<PropertyKey, unknown>;
  const existing = globalRecord[DIAGNOSTIC_EVENTS_STATE_KEY];
  if (isDiagnosticEventsState(existing)) {
    cached = existing;
    return existing;
  }

  const state = createDiagnosticEventsState();
  Object.defineProperty(globalThis, DIAGNOSTIC_EVENTS_STATE_KEY, {
    configurable: false,
    enumerable: false,
    value: state,
    writable: false,
  });
  cached = state;
  return state;
}

Also consider rejecting/repairing an existing.asyncQueue if it exceeds MAX_ASYNC_DIAGNOSTIC_EVENTS.


Analyzed PR: #71627 at commit 042f29d

Last updated on: 2026-04-25T17:59:24Z

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: M maintainer Maintainer-authored PR labels Apr 25, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 25, 2026

Greptile Summary

This PR fixes the Bedrock Converse replay death loop by addressing three hygiene gaps: buildStreamErrorAssistantMessage now always emits a sentinel text block instead of content: [], normalizeAssistantReplayContent drops transcript-only delivery-mirror/gateway-injected entries from in-memory replay and substitutes the sentinel for any empty-content assistant entry, and repairSessionFileIfNeeded rewrites already-persisted poisoned entries on disk with idempotent backup-and-rename semantics. Test coverage is solid across all three layers.

One minor description inaccuracy worth noting: the PR description states buildStreamErrorAssistantMessage now writes [{type:\"text\", text: errorMessage || sentinel}], but the implementation always places only the sentinel in content (never the raw error text). The code comment and the unit test \"places only the sentinel in content and never echoes the raw error text\" make clear this is intentional — replaying provider error strings as assistant content could leak hostnames/metadata to the model (CWE-200). The errorMessage field is preserved verbatim for UI/client surfaces.

Confidence Score: 4/5

Safe to merge — no P0 or P1 issues; all three repair layers are well-tested and the on-disk repair is atomic with backup.

All changed files have unit test coverage pinning the critical invariants (non-empty content, sentinel identity, idempotency, mirror filtering). The file-repair path reuses the existing backup-and-rename machinery, so crash safety is unchanged. No P0 or P1 issues found.

No files require special attention; replay-history.ts change is the most impactful but is well-guarded by the existing strict-provider ordering fallback.

Reviews (1): Last reviewed commit: "fix(agents): prevent Bedrock replay deat..." | Re-trigger Greptile

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 350a9b7861

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/agents/pi-embedded-runner/replay-history.ts Outdated
Comment thread src/agents/session-file-repair.ts
@openperf openperf force-pushed the fix/bedrock-replay-empty-content-71572-v2 branch 4 times, most recently from 5d76dd3 to 664cf7b Compare April 25, 2026 16:32
@steipete steipete force-pushed the fix/bedrock-replay-empty-content-71572-v2 branch from 664cf7b to 2388602 Compare April 25, 2026 17:30
@openclaw-barnacle openclaw-barnacle Bot added docs Improvements or additions to documentation size: L and removed size: M labels Apr 25, 2026
@steipete steipete force-pushed the fix/bedrock-replay-empty-content-71572-v2 branch from d280b36 to 5899a93 Compare April 25, 2026 17:46
@steipete steipete force-pushed the fix/bedrock-replay-empty-content-71572-v2 branch from 5899a93 to 29b69a1 Compare April 25, 2026 17:49
@steipete steipete force-pushed the fix/bedrock-replay-empty-content-71572-v2 branch from 29b69a1 to 042f29d Compare April 25, 2026 17:57
@steipete steipete merged commit 930d81a into openclaw:main Apr 25, 2026
65 checks passed
@steipete
Copy link
Copy Markdown
Contributor

Landed via temp rebase onto main.

  • Source head: 042f29d
  • Squash commit: 930d81a
  • CI: 56 checks green, no pending/failing checks before merge.
  • Local gates: pnpm check:test-types, pnpm lint:core, pnpm docs:list, pnpm test src/agents/stream-message-shared.test.ts src/agents/pi-embedded-runner/replay-history.test.ts src/agents/session-file-repair.test.ts, pnpm test src/plugins/loader.test.ts -t "supports legacy plugins subscribing to diagnostic events from the root sdk", pnpm test src/infra/diagnostic-events.test.ts.
  • Live Bedrock proof: direct AWS Bedrock Converse rejects replay with assistant + empty content: [], and accepts the repaired non-empty sentinel assistant turn on us.amazon.nova-micro-v1:0.

Thanks @openperf!

ayesha-aziz123 pushed a commit to ayesha-aziz123/openclaw that referenced this pull request Apr 26, 2026
…tent (openclaw#71627)

* fix(agents): prevent Bedrock replay death loop on empty assistant content

  Fixes openclaw#71572

* docs: document Bedrock replay repair (openclaw#71627) (thanks @openperf)

* fix(diagnostics): share diagnostic event state across sdk graphs

---------

Co-authored-by: Peter Steinberger <steipete@gmail.com>
ogt-redknie pushed a commit to ogt-redknie/OPENX that referenced this pull request May 2, 2026
…tent (openclaw#71627)

* fix(agents): prevent Bedrock replay death loop on empty assistant content

  Fixes openclaw#71572

* docs: document Bedrock replay repair (openclaw#71627) (thanks @openperf)

* fix(diagnostics): share diagnostic event state across sdk graphs

---------

Co-authored-by: Peter Steinberger <steipete@gmail.com>
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
…tent (openclaw#71627)

* fix(agents): prevent Bedrock replay death loop on empty assistant content

  Fixes openclaw#71572

* docs: document Bedrock replay repair (openclaw#71627) (thanks @openperf)

* fix(diagnostics): share diagnostic event state across sdk graphs

---------

Co-authored-by: Peter Steinberger <steipete@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling docs Improvements or additions to documentation maintainer Maintainer-authored PR size: L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bedrock replay can enter empty assistant content death loop

2 participants