fix(agents): add trajectory flush timeout diagnostics by galiniliev · Pull Request #82962 · openclaw/openclaw

galiniliev · 2026-05-17T06:08:13Z

Summary

Problem: pi-trajectory-flush cleanup timeout warnings only showed the timeout envelope, leaving operators unable to tell whether trajectory flush was waiting on queued writer work, event-loop yield, or file append IO.
Why it matters: the BUG evidence shows repeated cleanup timeout warnings; without bounded writer state, the next investigation cannot distinguish slow file IO from pending queued writes.
What changed: queued file writers now expose non-path diagnostics, trajectory runtime formats that state, and the embedded runner passes it into the generic cleanup timeout warning for pi-trajectory-flush.
What did NOT change (scope boundary): this does not change timeout duration, trajectory file caps, append semantics, or abort behavior for in-flight filesystem writes.

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Closes [Bug]: pi-trajectory-flush timeout warning lacks queued writer state #82961
Related fix(agents): make trajectory cleanup timeout configurable #81622
Related fix: prevent event loop saturation from trajectory flush (setImmediate yield + 10MB cap + 30s timeout) #77133
This PR fixes a bug or regression

Real behavior proof (required for external PRs)

Behavior or issue addressed: pi-trajectory-flush timeout warnings now include bounded trajectory writer state such as pending write count, queued bytes, active operation, and active write bytes.
Real environment tested: local OpenClaw checkout on Windows, Node v24.15.0, direct tsx runtime import of the patched cleanup helper; no live provider or channel credentials involved.
Exact steps or command run after this patch: node --import tsx -e '<invoke runAgentCleanupStep with step=pi-trajectory-flush, timeoutMs=5, and trajectory writer timeout details>'
Evidence after fix (screenshot, recording, terminal capture, console output, redacted runtime log, linked artifact, or copied live output):

$ node --import tsx -e '<invoke runAgentCleanupStep with trajectory writer timeout details>'
{
  "logs": [
    "agent cleanup timed out: runId=proof-run sessionId=proof-session step=pi-trajectory-flush timeoutMs=5 details=pendingWrites=2 queuedBytes=128 activeOperation=file-append yieldBeforeWrite=true activeWriteBytes=64 maxQueuedBytes=1024 maxFileBytes=2048"
  ]
}

Observed result after fix: the timeout warning includes details= with queued writer state and no local file path.
What was not tested: focused Vitest and remote Testbox/AWS proof could not complete from this host; see verification notes below. Live production gateway soak with a real long-running trajectory flush was not tested.
Before evidence (optional but encouraged): BUG evidence showed only agent cleanup timed out: runId=[redacted run id] sessionId=[redacted session id] step=pi-trajectory-flush timeoutMs=10000, with no queue or file-append state.
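
For operators comparing the before and after log lines, the following is a minimal sketch of how the new details= suffix could be split into fields. It is not part of this PR: parseCleanupTimeoutDetails is a hypothetical helper, and it assumes details= is the last field in the warning and holds space-separated key=value tokens, as in the proof output above.

```typescript
// Hypothetical operator-side helper; not part of the PR.
// Assumes details= is the last field and contains space-separated key=value
// tokens, matching the proof output in the PR body.
function parseCleanupTimeoutDetails(line: string): Record<string, string> | undefined {
  const marker = " details=";
  const start = line.indexOf(marker);
  if (start === -1) return undefined; // generic warning without owner details

  const fields: Record<string, string> = {};
  for (const token of line.slice(start + marker.length).split(" ")) {
    const eq = token.indexOf("=");
    // Skip tokens that are not key=value pairs.
    if (eq > 0) fields[token.slice(0, eq)] = token.slice(eq + 1);
  }
  return fields;
}

const proofLine =
  "agent cleanup timed out: runId=proof-run sessionId=proof-session " +
  "step=pi-trajectory-flush timeoutMs=5 details=pendingWrites=2 queuedBytes=128 " +
  "activeOperation=file-append yieldBeforeWrite=true activeWriteBytes=64 " +
  "maxQueuedBytes=1024 maxFileBytes=2048";

console.log(parseCleanupTimeoutDetails(proofLine));
```

A warning in the pre-fix shape (no details= field) yields undefined, which keeps old and new log lines distinguishable.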

Root Cause (if applicable)

Root cause: runAgentCleanupStep had no way for owner-specific cleanup steps to attach bounded state to the timeout warning, and QueuedFileWriter.flush() exposed no pending queue diagnostics.
Missing detection / guardrail: cleanup timeout tests asserted the warning string but did not cover owner-provided timeout details; trajectory runtime tests did not assert flush-state formatting.
Contributing context (if known): fix(agents): make trajectory cleanup timeout configurable #81622 made the trajectory flush timeout configurable, but did not add state showing what a timed-out flush is waiting on.
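
The seam described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: only the names runAgentCleanupStep and getTimeoutDetails and the shape of the log line come from the PR text. The parameter type, the helper names, and the handling of a rejected cleanup are hypothetical.

```typescript
// Hypothetical sketch of a cleanup step with an optional timeout-details seam.
// Real signatures in src/agents/run-cleanup-timeout.ts may differ.
type CleanupStepParams = {
  runId: string;
  sessionId: string;
  step: string;
  timeoutMs: number;
  cleanup: () => Promise<void>;
  // Optional owner-provided callback returning bounded, non-path state.
  getTimeoutDetails?: () => string | undefined;
  log: (message: string) => void;
};

// Collect details without ever throwing: a failing callback becomes a
// detailsError= suffix, so the timeout warning itself is never suppressed.
function collectTimeoutDetailSuffix(getTimeoutDetails?: () => string | undefined): string {
  if (!getTimeoutDetails) return "";
  try {
    const details = getTimeoutDetails()?.trim();
    return details ? ` details=${details}` : "";
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return ` detailsError=${message}`;
  }
}

function formatCleanupTimeoutWarning(
  params: Omit<CleanupStepParams, "cleanup" | "log">,
): string {
  return (
    `agent cleanup timed out: runId=${params.runId} sessionId=${params.sessionId} ` +
    `step=${params.step} timeoutMs=${params.timeoutMs}` +
    collectTimeoutDetailSuffix(params.getTimeoutDetails)
  );
}

// Races the cleanup against a timer and logs a warning if the timer wins.
// In this sketch a rejected cleanup propagates to the caller.
async function runAgentCleanupStepSketch(params: CleanupStepParams): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), params.timeoutMs);
  });
  try {
    const outcome = await Promise.race([
      params.cleanup().then(() => "done" as const),
      timedOut,
    ]);
    if (outcome === "timeout") params.log(formatCleanupTimeoutWarning(params));
  } finally {
    if (timer) clearTimeout(timer);
  }
}
```

Keeping the detail collection in its own function means a step that passes no callback produces exactly the generic warning, and a throwing callback changes only the suffix.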

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- Unit test
- Seam / integration test
- End-to-end test
- Existing coverage already sufficient
Target test or file: src/agents/run-cleanup-timeout.test.ts, src/agents/queued-file-writer.test.ts, src/trajectory/runtime.test.ts
Scenario the test should lock in: cleanup timeout details are included and detail callback failures do not break cleanup; queued writer diagnostics report pending bytes/writes; trajectory runtime formats writer state for timeout logs.
Why this is the smallest reliable guardrail: the behavior lives at the cleanup helper, queued writer, and trajectory runtime seams, not in a provider or channel path.
Existing test that already covers this (if any): existing cleanup timeout tests covered generic timeout logging only.
If no new test is added, why not: N/A.

User-visible / Behavior Changes

Timed-out pi-trajectory-flush cleanup warnings now include bounded writer diagnostics. Defaults and behavior are otherwise unchanged.

Diagram (if applicable)

Before:
pi-trajectory-flush -> runAgentCleanupStep -> timeout warning without flush state

After:
pi-trajectory-flush -> trajectoryRecorder.describeFlushState() -> runAgentCleanupStep -> timeout warning with bounded queue/file-append state
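
The writer and formatting stages of the "After" flow can be sketched as below. This is a simplified illustration, not the PR's implementation: describeQueue is named in the PR discussion and the field names mirror the keys in the proof log, but the class, the type names, formatFlushState, the "idle" value, and the choice to count the active write in queuedBytes are all assumptions. Caps are only reported here, not enforced.

```typescript
// Sketch only: an in-memory queued writer that exposes scalar diagnostics.
type ActiveOperation = "idle" | "file-append"; // "idle" is an assumed value

type QueuedWriterDiagnostics = {
  pendingWrites: number;
  queuedBytes: number;
  activeOperation: ActiveOperation;
  yieldBeforeWrite: boolean;
  activeWriteBytes: number;
  maxQueuedBytes: number;
  maxFileBytes: number;
};

type WriterLimits = { maxQueuedBytes: number; maxFileBytes: number; yieldBeforeWrite: boolean };

class QueuedWriterSketch {
  private chain: Promise<void> = Promise.resolve();
  private pendingWrites = 0;
  private queuedBytes = 0;
  private activeOperation: ActiveOperation = "idle";
  private activeWriteBytes = 0;
  private readonly append: (text: string) => Promise<void>;
  private readonly limits: WriterLimits;

  constructor(append: (text: string) => Promise<void>, limits: WriterLimits) {
    this.append = append;
    this.limits = limits;
  }

  // Queue a write on the promise chain; counters are updated synchronously.
  write(text: string): void {
    const bytes = new TextEncoder().encode(text).length;
    this.pendingWrites += 1;
    this.queuedBytes += bytes;
    this.chain = this.chain
      .then(async () => {
        this.activeOperation = "file-append";
        this.activeWriteBytes = bytes;
        try {
          await this.append(text);
        } finally {
          this.pendingWrites -= 1;
          this.queuedBytes -= bytes;
          this.activeOperation = "idle";
          this.activeWriteBytes = 0;
        }
      })
      .catch(() => {}); // append errors are swallowed in this sketch
  }

  flush(): Promise<void> {
    return this.chain;
  }

  // Scalars only: no file path and no payload contents.
  describeQueue(): QueuedWriterDiagnostics {
    return {
      pendingWrites: this.pendingWrites,
      queuedBytes: this.queuedBytes,
      activeOperation: this.activeOperation,
      yieldBeforeWrite: this.limits.yieldBeforeWrite,
      activeWriteBytes: this.activeWriteBytes,
      maxQueuedBytes: this.limits.maxQueuedBytes,
      maxFileBytes: this.limits.maxFileBytes,
    };
  }
}

// Formats diagnostics in the key order shown in the proof log.
function formatFlushState(d: QueuedWriterDiagnostics): string {
  return [
    `pendingWrites=${d.pendingWrites}`,
    `queuedBytes=${d.queuedBytes}`,
    `activeOperation=${d.activeOperation}`,
    `yieldBeforeWrite=${d.yieldBeforeWrite}`,
    `activeWriteBytes=${d.activeWriteBytes}`,
    `maxQueuedBytes=${d.maxQueuedBytes}`,
    `maxFileBytes=${d.maxFileBytes}`,
  ].join(" ");
}
```

Because the snapshot reads only counters, it can be taken while a flush is still pending, which is the situation the timeout warning needs to describe.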

Security Impact (required)

New permissions/capabilities? (Yes/No) No
Secrets/tokens handling changed? (Yes/No) No
New/changed network calls? (Yes/No) No
Command/tool execution surface changed? (Yes/No) No
Data access scope changed? (Yes/No) No
If any Yes, explain risk + mitigation: N/A. Diagnostics intentionally omit file paths and payload contents.

Repro + Verification

Environment

OS: Windows local checkout
Runtime/container: Node v24.15.0; direct tsx runtime proof
Model/provider: N/A
Integration/channel (if any): N/A
Relevant config (redacted): none

Steps

Confirm the BUG evidence only contains the generic pi-trajectory-flush timeout warning.
Apply this patch.
Run the direct tsx cleanup helper proof above.
Run patch hygiene checks.

Expected

The timeout warning includes bounded cleanup details when the cleanup step provides them.
Detail collection failures do not make cleanup fail after timeout.
Generic cleanup warnings without details remain unchanged.

Actual

Direct runtime proof emitted a timeout warning with details=pendingWrites=2 queuedBytes=128 activeOperation=file-append ....
git diff --check passed.

Evidence

Attach at least one:

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

Verification attempted:

git diff --check
node --import tsx -e '<invoke runAgentCleanupStep with trajectory writer timeout details>'
node scripts/run-vitest.mjs src/agents/run-cleanup-timeout.test.ts src/agents/queued-file-writer.test.ts src/trajectory/runtime.test.ts
node node_modules\vitest\vitest.mjs run src/agents/run-cleanup-timeout.test.ts --reporter=verbose --pool=threads --maxWorkers=1 --no-file-parallelism --configLoader=runner
codex review --uncommitted

Results:

git diff --check -> passed
node --import tsx ... -> emitted timeout warning with details=pendingWrites=2 queuedBytes=128 activeOperation=file-append yieldBeforeWrite=true activeWriteBytes=64 maxQueuedBytes=1024 maxFileBytes=2048
node scripts/run-vitest.mjs ... -> blocked locally by Windows spawn EPERM
node node_modules\vitest\vitest.mjs ... -> timed out without test output; lingering child was stopped
codex review --uncommitted -> blocked by OpenAI API 401 Unauthorized

Remote proof attempts:

Crabbox/Testbox via blacksmith-testbox -> blocked: blacksmith CLI not found on PATH
Crabbox AWS -> blocked: AWS credentials unavailable on this host

Human Verification (required)

Verified scenarios: manual source review of the queued writer promise chain, trajectory runtime flush-state formatting, cleanup timeout logging path, and direct runtime proof for the patched timeout message.
Edge cases checked: timeout details omitted when unavailable; timeout detail callback exceptions are converted into detailsError= without rejecting cleanup; writer diagnostics omit file paths.
What you did not verify: Vitest pass, full changed gate, Testbox/AWS proof, and live gateway soak, due to the host/tooling blockers listed above.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

No bot review conversations exist on this PR yet.

Compatibility / Migration

Backward compatible? (Yes/No) Yes
Config/env changes? (Yes/No) No
Migration needed? (Yes/No) No
If yes, exact upgrade steps: N/A

Risks and Mitigations

Risk: timeout logs become longer for trajectory flush timeouts.
- Mitigation: details are bounded scalar counters/enums only and omit file paths/payloads.
Risk: diagnostic callback could fail while logging timeout.
- Mitigation: runAgentCleanupStep catches detail callback failures and logs detailsError= instead of throwing.

clawsweeper · 2026-05-17T06:09:19Z

Codex review: needs maintainer review before merge.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The PR adds optional cleanup-timeout details, queued-file-writer diagnostics, trajectory flush-state formatting, a pi-trajectory-flush hook, focused tests, and a changelog entry.

Reproducibility: yes, at source level. Current main's cleanup helper and pi-trajectory-flush call site can only emit the generic timeout envelope, while the linked bug provides representative repeated timeout lines. I did not reproduce a live stalled gateway flush.

PR rating
Overall: 🐚 platinum hermit
Proof: 🐚 platinum hermit
Patch quality: 🐚 platinum hermit
Summary: A narrow, well-scoped bug-fix PR with sufficient direct runtime proof and no blocking code findings, but focused test execution still needs CI or maintainer-side confirmation.

Rank-up moves:

Run or wait for focused CI covering src/agents/run-cleanup-timeout.test.ts, src/agents/queued-file-writer.test.ts, and src/trajectory/runtime.test.ts before merge.

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

PR egg
✨ Hatched: 💎 rare Cosmic Lint Imp

       _..------.._          
    .-'  .-.  .-.  '-.       
   /    ( * )( * )    \      
  |        .--.        |     
  |   <\   ====   />   |     
   \    '.______.'    /      
    '-._   ____   _.-'       
        `-.____.-'           
       __/|_||_|\__          
      /__.'    '.__\         
       `-----------'         
 *===================*

Rarity: 💎 rare.
Trait: watches the merge queue.

What is this egg doing here?

Eggs appear after the PR passes real-behavior proof. It is here for vibes, not verdicts: it does not change labels, ratings, merge decisions, or automation.
The shell reacts to review momentum: open follow-up work warms it up, re-review makes it wobble, and a clean final review lets it hatch.
How to hatch it: reach status: 👀 ready for maintainer look or status: 🚀 automerge armed; that usually means sufficient real-behavior proof, no blocking P0/P1/P2 findings, no security attention needed, and clean correctness.
The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

Real behavior proof
Sufficient (live_output): The PR body includes copied after-fix live output from a local Node/tsx runtime invocation showing the timeout warning with bounded trajectory writer details and no file path.

Risk before merge
Why this matters: The PR body says focused Vitest and remote Testbox proof did not complete, so CI or equivalent focused test proof should still gate merge.

I did not run a live gateway trajectory flush timeout in this read-only review; the evidence is source-level plus the PR body's direct runtime output.

Maintainer options:

Decide the mitigation before merge
Land this narrow diagnostics path after maintainer review and focused tests or CI confirm the logging additions do not alter writer flush semantics.
Pause or close
Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge
No automated repair is needed; the remaining action is maintainer review and validation of an active protected-label PR.

Security
Cleared: The diff adds scalar diagnostics and tests only; it does not add dependencies, permissions, network calls, secret handling, or path/payload logging.

Review details

Best possible solution:

Land this narrow diagnostics path after maintainer review and focused tests or CI confirm the logging additions do not alter writer flush semantics.

Do we have a high-confidence way to reproduce the issue?

Yes, at source level: current main's cleanup helper and pi-trajectory-flush call site can only emit the generic timeout envelope, while the linked bug provides representative repeated timeout lines. I did not reproduce a live stalled gateway flush.

Is this the best way to solve the issue?

Yes, the PR uses the narrow maintainable path: expose bounded writer state at the writer, format it in the trajectory owner, and pass it through a generic cleanup timeout seam with error isolation.

Label justifications:

P2: This is a normal-priority agent diagnostics bug fix with limited blast radius and no crash, security bypass, or broad channel outage.

What I checked:

Current main timeout warning lacks owner details: runAgentCleanupStep on current main logs only run id, session id, step, and timeoutMs after a cleanup timeout. (src/agents/run-cleanup-timeout.ts:90, 04eac15f43d5)
Current main trajectory caller passes no diagnostic hook: The embedded runner currently calls runAgentCleanupStep for pi-trajectory-flush with only the cleanup callback, so the warning cannot include queued writer state. (src/agents/pi-embedded-runner/run/attempt.ts:4786, 04eac15f43d5)
PR adds bounded timeout details seam: The PR head adds getTimeoutDetails, trims optional detail strings, and catches detail callback failures as detailsError= without changing cleanup completion behavior. (src/agents/run-cleanup-timeout.ts:31, 0a76f908e380)
PR exposes non-path queued writer diagnostics: The PR head adds scalar queue diagnostics for pending writes, queued bytes, active operation, active write bytes, queue/file caps, and yield state. (src/agents/queued-file-writer.ts:7, 0a76f908e380)
PR wires trajectory flush diagnostics: The PR head formats writer diagnostics inside the trajectory owner and passes trajectoryRecorder?.describeFlushState() to the pi-trajectory-flush cleanup step. (src/trajectory/runtime.ts:212, 0a76f908e380)
PR adds focused regression coverage: The diff adds tests for timeout details, detail callback failures, queued writer diagnostics, and trajectory flush-state formatting. (src/agents/run-cleanup-timeout.test.ts:65, 0a76f908e380)

Likely related people:

steipete: git log -S'getQueuedFileWriter' points to Peter Steinberger introducing the shared queued JSONL writer, and current blame for the central files is dominated by the current imported main snapshot. (role: introduced shared writer and heavy adjacent contributor; confidence: medium; commits: 817b5812e10a, 0903fa61d08e; files: src/agents/queued-file-writer.ts, src/agents/run-cleanup-timeout.ts, src/trajectory/runtime.ts)
BunsDev: The related merged trajectory cleanup timeout PR changed runAgentCleanupStep and trajectory docs immediately before this diagnostics work. (role: recent area contributor; confidence: high; commits: 5d4a8b00721a, d330193b2d34; files: src/agents/run-cleanup-timeout.ts, docs/tools/trajectory.md)
joshavant: Recent history shows adjacent maintenance in the embedded attempt runner that owns the pi-trajectory-flush cleanup call site. (role: recent adjacent contributor; confidence: low; commits: cc835b6d7276; files: src/agents/pi-embedded-runner/run/attempt.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 04eac15f43d5.

YonganZhang · 2026-05-17T21:00:24Z

Drive-by +1 from #83050 — clawsweeper correctly flagged my parallel attempt as superseded; my version only added the getCleanupDiagnostic hook scaffold and reported writer=present/absent, whereas this PR actually exposes the queued-writer counters the issue asked for (queued bytes, active op, etc.), which is what operators actually need on a 5-second-page recovery.

One tiny suggestion if not already wired: the diagnostic-capture callback should run inside a try/catch so a future change to queuedFileWriter.getDiagnostic() that throws on a half-initialized writer can never suppress the timeout warning itself (the warning is the user's only signal that flush hung). Trivial to add if not present — diagnosticError=<msg> as the appended suffix.

Closing my parallel PR. Thanks for the broader fix.

galiniliev · 2026-05-19T06:29:26Z

Verification before merge:

Source review: confirmed the fix adds bounded trajectory writer diagnostics through runAgentCleanupStep, QueuedFileWriter.describeQueue(), createTrajectoryRuntimeRecorder().describeFlushState(), and the pi-trajectory-flush cleanup caller without changing flush timeout duration, file caps, append semantics, or abort behavior.
Local commands: git diff --check upstream/main...HEAD; node scripts/run-vitest.mjs src/agents/run-cleanup-timeout.test.ts src/agents/queued-file-writer.test.ts src/trajectory/runtime.test.ts (5 files, 41 tests passed).
CI on head 0a76f908e3800f6c8864e8feda0061b4fabcc07c: CI run 26080042512 passed; CodeQL high run 26080042491 passed; Critical Quality run 26080042519 passed; Real behavior proof run 26080041471 passed; OpenGrep PR diff run 26080042456 passed.
Behavior proof: PR body includes after-fix runtime output showing agent cleanup timed out ... step=pi-trajectory-flush ... details=pendingWrites=2 queuedBytes=128 activeOperation=file-append ..., and CI accepted the Real behavior proof marker for the exact head.
Known proof gap: I did not run a live gateway soak with real slow filesystem IO; this PR adds diagnostics for the observed missing-state warning and leaves the underlying stall investigation to the next incident/proof path.

…026.5.20) (#615)

This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [ghcr.io/openclaw/openclaw](https://openclaw.ai) ([source](https://github.com/openclaw/openclaw)) | patch | `2026.5.19` → `2026.5.20` |

---

> ⚠️ **Warning**
>
> Some dependencies could not be looked up. Check the [Dependency Dashboard](issues/567) for more information.

---

### Release Notes

<details>
<summary>openclaw/openclaw (ghcr.io/openclaw/openclaw)</summary>

### [`v2026.5.20`](https://github.com/openclaw/openclaw/blob/HEAD/CHANGELOG.md#2026520)

[Compare Source](openclaw/openclaw@v2026.5.19...v2026.5.20)

##### Changes

- Exec approvals: remove the old `cat SKILL.md && printf ... && <skill-wrapper>` allowlist compatibility path so skill files must be loaded with the read tool and only the real skill executable is auto-allowed.
- Discord: let voice sessions follow configured Discord users into voice channels, with allowed-channel checks, multi-user handoff, bounded reconciliation, and DAVE recovery preservation. ([#84264](openclaw/openclaw#84264)) Thanks [@fuller-stack-dev](https://github.com/fuller-stack-dev).
- Discord/voice: include bounded `IDENTITY.md`, `USER.md`, and `SOUL.md` profile context in realtime voice session instructions by default, with `voice.realtime.bootstrapContextFiles: []` available to disable it. ([#84499](openclaw/openclaw#84499)) Thanks [@fuller-stack-dev](https://github.com/fuller-stack-dev).
- Dependencies: bump the bundled Codex harness to `@openai/codex` `0.132.0` and refresh the app-server model-list docs for the new catalog.
- CLI/policy: add the bundled Policy plugin for policy-backed channel conformance checks, doctor lint findings, and opt-in workspace repair. ([#80407](openclaw/openclaw#80407)) Thanks [@giodl73-repo](https://github.com/giodl73-repo).
- Agents/config: allow `agents.list[].experimental.localModelLean` so lean local-model mode can be enabled for one configured agent instead of globally.
- Providers/xAI: add device-code OAuth login so remote and headless setups can authorize xAI without a localhost browser callback. ([#84005](openclaw/openclaw#84005)) Thanks [@fuller-stack-dev](https://github.com/fuller-stack-dev).
- Providers/OpenRouter: honor provider-level `params.provider` routing policy for OpenRouter requests, with model and agent params overriding the defaults. Thanks [@amknight](https://github.com/amknight).

##### Fixes

- CLI/tasks: include stale-running task maintenance decisions in `openclaw tasks maintenance --json` so retained and reconcile candidates explain backing-session, cron, CLI, and wedged-subagent state. ([#84691](openclaw/openclaw#84691)) Thanks [@efpiva](https://github.com/efpiva).
- Codex app-server: keep system-prompt reports working when bootstrap hooks provide workspace files with only a path and content, so hook-supplied SOUL/IDENTITY/TOOLS/USER context still reports injected characters correctly. ([#84736](openclaw/openclaw#84736)) Thanks [@JARVIS-Glasses](https://github.com/JARVIS-Glasses).
- Providers/MiniMax music: stop advertising `durationSeconds` control and remove prompt-injected duration hints, so `music_generate` reports MiniMax duration as an unsupported override instead of suggesting MiniMax can enforce track length. Fixes [#84508](openclaw/openclaw#84508). Thanks [@neeravmakwana](https://github.com/neeravmakwana).
- Doctor: warn when sandbox tool policy hides configured MCP server tools before provider requests. ([#84699](openclaw/openclaw#84699)) Thanks [@nxmxbbd](https://github.com/nxmxbbd).
- WhatsApp: update Baileys to `7.0.0-rc12`.
- Build: suppress per-locale `rolldown-plugin-dts:fake-js` CommonJS dts warnings emitted while bundling the intentionally-inlined `zod/v4/locales/*.d.cts` files, so `pnpm build` output stays readable after the 0.25.1 plugin bump. Thanks [@romneyda](https://github.com/romneyda).
- CLI/nodes: route lazy plugin-registration logs to stderr for JSON-mode `openclaw nodes` commands so stdout stays parseable. ([#84684](openclaw/openclaw#84684)) Thanks [@TurboTheTurtle](https://github.com/TurboTheTurtle).
- Approvals: route manual `/approve` decisions through the trusted approval runtime so active exec and plugin approvals no longer look unknown or expired.
- Mac app: update the About settings copyright year to 2026. ([#84385](openclaw/openclaw#84385)) Thanks [@pejmanjohn](https://github.com/pejmanjohn).
- Dependencies: update `@openclaw/fs-safe` to `0.2.7` so OpenClaw's default Python-helper-off policy keeps best-effort Node write fallbacks for private stores, secret writes, run logs, and media attachments on Linux/macOS.
- Infra/secrets: restore the fail-closed contract for `tryReadSecretFileSync` so credential loaders that pass `rejectSymlink: true` (Telegram, LINE, Zalo, IRC, Nextcloud Talk tokens) refuse symlinked credential files instead of silently accepting them, and the infra-state CI shard's secret-file symlink test passes again. Thanks [@romneyda](https://github.com/romneyda).
- Browser: honor the configured image sanitization limit for screenshots and labeled snapshots so browser-captured images follow the same resize policy as other image results. ([#84595](openclaw/openclaw#84595))
- Doctor: remove unrecognized `models.providers.*.models[*].compat.thinkingFormat` values during `doctor --fix` so stale provider model config can validate after upgrade. Fixes [#77803](openclaw/openclaw#77803).
- Doctor: warn when `openclaw.json` stores plaintext secret-bearing config fields, including model provider API keys and sensitive provider headers. ([#84718](openclaw/openclaw#84718)) Thanks [@lukaIvanic](https://github.com/lukaIvanic).
- Status: show the configured default, session-selected model, reason, clear hint, and docs link when a session remains pinned to a model that differs from `agents.defaults.model.primary`.
- WebChat: clear stale typing indicators when session change events mark the active chat run complete.
- Mac app: keep local packaging signed with a stable app identity for permission testing and fix Control UI production builds under current Vite/Highlight.js exports.
- macOS app: update the embedded Peekaboo bridge to 3.2.1 so OpenClaw-hosted UI automation works with current Peekaboo CLI capture flows.
- Cron: deliver preferred final assistant output for successful scheduled runs when trailing plain tool warnings remain in diagnostics instead of marking the run failed.
- fix(mattermost): fail closed on missing channel type \[AI]. ([#84091](openclaw/openclaw#84091)) Thanks [@pgondhi987](https://github.com/pgondhi987).
- Recheck rebuilt system.run argv \[AI]. ([#84090](openclaw/openclaw#84090)) Thanks [@pgondhi987](https://github.com/pgondhi987).
- CLI: keep the private QA subcommand out of exported command descriptors unless `OPENCLAW_ENABLE_PRIVATE_QA_CLI=1`, so root help and subcommand markers match runtime registration. ([#84519](openclaw/openclaw#84519))
- CLI/cron: bound `openclaw cron show` job lookup pagination so non-advancing or unbounded `cron.list` responses fail instead of hanging the command. Fixes [#83856](openclaw/openclaw#83856). ([#83989](openclaw/openclaw#83989))
- Agents/messages: stop message-tool-only turns after a successful source-channel `message` send while keeping transcript mirrors under the session write lock. ([#84289](openclaw/openclaw#84289))
- Agents: filter silent heartbeat response-tool transcript artifacts out of embedded context snapshots so later user turns are not polluted by heartbeat no-op messages. ([#83477](openclaw/openclaw#83477)) Thanks [@fuller-stack-dev](https://github.com/fuller-stack-dev).
- Agents/OpenAI: log repeated strict tool-schema downgrade diagnostics once per provider/model/tool signature, reducing duplicate debug noise while preserving `strict=false` fallback behavior. Fixes [#82930](openclaw/openclaw#82930). ([#82933](openclaw/openclaw#82933)) Thanks [@galiniliev](https://github.com/galiniliev).
- Agents/code mode: spell out the `exec` tool's JavaScript/TypeScript, no Node module, and catalog-bridge constraints in model-visible schema text so agents can use enabled tools without trial-and-error. ([#84269](openclaw/openclaw#84269)) Thanks [@Kaspre](https://github.com/Kaspre).
- Codex: give `image_generate` dynamic-tool calls a 120s default watchdog when no per-call or configured image timeout is set, so image generation no longer falls back to the generic 30s bridge timeout. ([#84254](openclaw/openclaw#84254)) Thanks [@moritzmmayerhofer](https://github.com/moritzmmayerhofer).
- Codex: avoid duplicate dynamic tool terminal diagnostics while large diagnostic backlogs drain without blocking tool responses. ([#82937](openclaw/openclaw#82937)) Thanks [@galiniliev](https://github.com/galiniliev).
- CLI/message: include a stable top-level `messageId` in `openclaw message --json` output when channel sends return one. ([#84191](openclaw/openclaw#84191)) Thanks [@100menotu001](https://github.com/100menotu001).
- Cron: preserve legacy top-level array `jobs.json` stores when loading or adding scheduled jobs so old cron jobs are no longer treated as an empty store during upgrade. Fixes [#60799](openclaw/openclaw#60799). ([#84433](openclaw/openclaw#84433)) Thanks [@IWhatsskill](https://github.com/IWhatsskill).
- Gateway/agents: use an agent's `identity.name` in Gateway agent summaries when `agents.list[].name` is unset, so configured agent labels remain visible in clients. ([#84355](openclaw/openclaw#84355); refs [#57835](openclaw/openclaw#57835)) Thanks [@luoyanglang](https://github.com/luoyanglang).
- Channels/replies: keep normal `/verbose` failed-tool progress compact in message-tool replies and prevent late text-only tool output from appearing after the final answer. ([#84303](openclaw/openclaw#84303)) Thanks [@VACInc](https://github.com/VACInc).
- Plugins/hooks: apply a default 30-second timeout to `before_compaction` and `after_compaction` hooks so a hung plugin handler no longer blocks compaction completion. ([#84153](openclaw/openclaw#84153))
- Discord: preserve disabled presentation buttons when adapting and rendering Discord message controls. ([#84188](openclaw/openclaw#84188)) Thanks [@100menotu001](https://github.com/100menotu001).
- Twitch: add a test-only client-manager registry reset helper so non-isolated Twitch tests can clear cached managers between cases. Fixes [#83887](openclaw/openclaw#83887). ([#84244](openclaw/openclaw#84244)) Thanks [@hclsys](https://github.com/hclsys).
- Cron: run main-session scheduled work on a cron-owned wake lane while preserving reply delivery context, so background cron turns no longer block human main-session chat. Fixes [#82766](openclaw/openclaw#82766). ([#82767](openclaw/openclaw#82767)) Thanks [@galiniliev](https://github.com/galiniliev).
- Cron: use structured embedded-run denial metadata for isolated scheduled tasks so blocked exec requests fail the job without treating ordinary assistant prose as a denial. ([#84067](openclaw/openclaw#84067)) Thanks [@abnershang](https://github.com/abnershang).
- Cron: keep recovered tool warnings diagnostic for successful scheduled runs so final cron output is delivered instead of being replaced by a post-processing warning. ([#84045](openclaw/openclaw#84045)) Thanks [@abnershang](https://github.com/abnershang).
- Plugins/perf: thread explicit plugin discovery results through `loadBundledCapabilityRuntimeRegistry`, `resolveBundledPluginSources`, and `listChannelCatalogEntries` so callers that already hold a discovery result skip redundant filesystem walks. Thanks [@SebTardif](https://github.com/SebTardif).
- harden update restart script creation \[AI]. ([#84088](openclaw/openclaw#84088)) Thanks [@pgondhi987](https://github.com/pgondhi987).
- Docker: keep the bundled Codex plugin in official release image keep lists so the default OpenAI agent harness remains available after Docker pruning. Fixes [#83613](openclaw/openclaw#83613). ([#83626](openclaw/openclaw#83626)) Thanks [@YuanHanzhong](https://github.com/YuanHanzhong).
- CLI/channels: preserve the first line of `openclaw channels logs` output when the rolling tail window starts exactly on a line boundary, mirroring the already-fixed `readLogSlice` behavior in `src/logging/log-tail.ts`.
- Control UI: treat terminal session status as authoritative over stale active-run flags so completed terminal runs stop showing abort/live UI. ([#84057](openclaw/openclaw#84057))
- CLI: preserve embedded equals signs in inline root option values instead of truncating after the second separator. ([#83995](openclaw/openclaw#83995)) Thanks [@ThiagoCAltoe](https://github.com/ThiagoCAltoe).
- Matrix/config: accept `messages.queue.byChannel.matrix` queue overrides and keep queue provider schema/type keys aligned for Matrix, Google Chat, and Mattermost. Thanks [@bdjben](https://github.com/bdjben).
- CLI: format `openclaw acp client` failures through the shared error formatter so object-shaped errors stay readable instead of printing `[object Object]`. Fixes [#83904](openclaw/openclaw#83904). ([#84080](openclaw/openclaw#84080))
- Providers/Ollama: default unknown-capabilities models to tool-capable so discovered native Ollama models can use tools when `/api/show` omits capabilities. ([#84055](openclaw/openclaw#84055)) Thanks [@dutifulbob](https://github.com/dutifulbob).
- Installer/Windows: launch `install.ps1` onboarding as an attached child process so fresh native Windows installs do not freeze visibly at `Starting setup...` or corrupt the wizard's terminal rendering.
- CLI/update: keep restart health checks working across one-version CLI/Gateway protocol skew and use the managed Gateway service Node for all follow-up commands even when the package root is unchanged, so `openclaw update` no longer silently switches the gateway to a different Node binary when multiple Node installations are present. Thanks [@amknight](https://github.com/amknight).
- CLI/gateway: include the running Gateway version in `gateway status` JSON output, preserving existing server metadata while falling back to status RPC data for read probes. Fixes [#56222](openclaw/openclaw#56222). Thanks [@galiniliev](https://github.com/galiniliev).
- Memory/search: close local embedding providers when active-memory searches time out so pending local model loads and embedding contexts are aborted and released. ([#83858](openclaw/openclaw#83858)) Thanks [@brokemac79](https://github.com/brokemac79).
- CLI/nodes: request pending node surface approval scopes before `openclaw nodes approve` so exec-capable node approval can use admin-scoped Gateway credentials instead of failing with `missing scope: operator.admin`. ([#84392](openclaw/openclaw#84392)) Thanks [@joshavant](https://github.com/joshavant).
- Gateway: reject slow node event sends before outbound buffers grow unbounded and log the rejected payload diagnostic. ([#84387](openclaw/openclaw#84387)) Thanks [@samzong](https://github.com/samzong).
- Agents: include bounded trajectory queued-writer diagnostics in `pi-trajectory-flush` timeout warnings so flush stalls show pending writes, queued bytes, and append state. Fixes [#82961](openclaw/openclaw#82961). ([#82962](openclaw/openclaw#82962)) Thanks [@galiniliev](https://github.com/galiniliev).
- Agents/subagents: recover stale completion announces by retrying unsupported transcript-wait wakes without transcript waiting and forcing a message-tool handoff when the requester run is already stale. Fixes [#83699](openclaw/openclaw#83699). ([#83700](openclaw/openclaw#83700)) Thanks [@galiniliev](https://github.com/galiniliev).
- Agents/subagents: constrain wildcard subagent target allowlists to configured agents while preserving explicitly listed compatibility targets. Fixes [#84040](openclaw/openclaw#84040). ([#84357](openclaw/openclaw#84357)) Thanks [@joshavant](https://github.com/joshavant).
- Providers/Anthropic: route Anthropic model refs selected with Claude CLI auth through the Claude CLI runtime so shorthand refs such as `anthropic/opus-4.7` no longer fall back to embedded Anthropic billing. Fixes [#84222](openclaw/openclaw#84222). ([#84374](openclaw/openclaw#84374)) Thanks [@joshavant](https://github.com/joshavant).
- Agents: honor explicit `models.providers.<id>.timeoutSeconds` values above the default idle watchdog for cloud and self-hosted providers, so long first-token waits no longer fall back at \~120s when the provider timeout is higher. ([#83979](openclaw/openclaw#83979)) Thanks [@yujiawei](https://github.com/yujiawei).
- Agents/Codex: keep encrypted Responses reasoning replay provenance-bound so stale mirrored Codex transcripts drop invalid encrypted content before request assembly while preserving matching same-session replay. Fixes [#83836](openclaw/openclaw#83836). ([#84367](openclaw/openclaw#84367)) Thanks [@joshavant](https://github.com/joshavant).
- Agents/subagents: skip stale embedded-run wake probes for dormant completion requesters, so late subagent completions go straight to requester-agent/direct handoff instead of producing `reason=no_active_run` queue noise. ([#82964](openclaw/openclaw#82964)) Thanks [@galiniliev](https://github.com/galiniliev).
- CLI: retry config snapshot reads after a transient failure so one rejected read no longer poisons later commands in the same process. ([#83931](openclaw/openclaw#83931)) Thanks [@honor2030](https://github.com/honor2030).
- Media: decode URL path basenames before using them as remote media fallback filenames, so files like `My%20Report.pdf` are surfaced as `My Report.pdf`. Fixes [#84050](openclaw/openclaw#84050). ([#84052](openclaw/openclaw#84052)) Thanks [@jbetala7](https://github.com/jbetala7). - WhatsApp: clarify inbound group diagnostics so observed but unregistered groups point to `channels.whatsapp.groups` without changing routing or sender authorization. ([#83846](openclaw/openclaw#83846)) Thanks [@neeravmakwana](https://github.com/neeravmakwana). - WhatsApp: drain pending outbound deliveries on a 30s periodic timer in addition to the reconnect handler, so messages enqueued while the provider is already connected no longer wait for the next reconnect to send. ([#79083](openclaw/openclaw#79083)) Thanks [@Oviemudiaga](https://github.com/Oviemudiaga). - CLI/TUI: include gateway plugin slash commands in TUI autocomplete, so connected sessions can suggest plugin-owned commands exposed by the running Gateway. ([#83640](openclaw/openclaw#83640)) Thanks [@se7en-agent](https://github.com/se7en-agent). - Gateway/mobile: restore QR setup-code handoff of bounded operator tokens for iOS and Android onboarding while keeping admin and pairing scopes out of bootstrap. ([#83684](openclaw/openclaw#83684)) Thanks [@ngutman](https://github.com/ngutman). - iOS: repair Release archive compilation for the TestFlight build. ([#84255](openclaw/openclaw#84255)) Thanks [@ngutman](https://github.com/ngutman). - Agents/compaction: bound plugin-owned CLI transcript compaction with the host safety timeout so a hung context engine can no longer stall post-turn cleanup. ([#84083](openclaw/openclaw#84083)) Thanks [@100yenadmin](https://github.com/100yenadmin). - Control UI/usage: truncate long context skill, tool, and file names in the usage panel while keeping the full name available on hover. ([#42197](openclaw/openclaw#42197)) Thanks [@Rain120](https://github.com/Rain120). 
- Codex: respect explicit `models auth order set` and `config.auth.order` precedence over stale `lastGood` in `/codex account`, and show `no working credential` when every explicit-order profile is ineligible instead of marking a lower-ranked profile as active. Fixes [#84386](openclaw/openclaw#84386). ([#84412](openclaw/openclaw#84412)) Thanks [@openperf](https://github.com/openperf). - Agents: honor `messages.suppressToolErrors` for mutating tool failures so configured chat surfaces do not receive separate warning payloads. ([#81561](openclaw/openclaw#81561)) Thanks [@moeedahmed](https://github.com/moeedahmed). - Agents/fallback: surface billing guidance for mixed rate-limit plus billing fallback exhaustion instead of generic failure copy. Fixes [#79396](openclaw/openclaw#79396). ([#79489](openclaw/openclaw#79489)) Thanks [@aayushprsingh](https://github.com/aayushprsingh). </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about these updates again. --- - [ ] If you want to rebase/retry this PR, check this box --- This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).  Reviewed-on: https://git.erwanleboucher.dev/eleboucher/homelab/pulls/615

openclaw-barnacle Bot added the labels "agents" (Agent runtime and tooling), "size: S", and "maintainer" (Maintainer-authored PR) on May 17, 2026

clawsweeper Bot mentioned this pull request May 17, 2026

[Bug]: pi-trajectory-flush timeout warning lacks queued writer state #82961

Closed

clawsweeper Bot added the labels "proof: sufficient" (ClawSweeper judged the real behavior proof convincing), "P2" (Normal backlog priority with limited blast radius), and "impact:session-state" (Session, memory, transcript, context, or agent state can drift or corrupt) on May 17, 2026

clawsweeper Bot mentioned this pull request May 17, 2026

fix(#82961): surface cleanup-step diagnostic state on agent cleanup timeout #83050

Closed

galiniliev self-assigned this May 19, 2026

galiniliev and others added 2 commits May 19, 2026 06:16

fix(agents): add trajectory flush timeout diagnostics

6ecfa8f

docs: update changelog for trajectory diagnostics

0a76f90
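
The idea behind the first commit, as the PR summary describes it (a queued writer exposing non-path state, which is then formatted into the cleanup timeout warning), can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the PR: `SketchQueuedWriter`, `formatWriterDiagnostics`, and `runCleanupStepWithTimeout` are hypothetical names, the injected `append` callback stands in for the real file append, and the field names only mirror the "pending write count, queued bytes, active operation, and active write bytes" wording above.

```typescript
// Hypothetical sketch: these names are illustrative, not OpenClaw's actual API.
type WriterDiagnostics = {
  pendingWrites: number; // chunks still waiting in the queue
  queuedBytes: number; // total size of those waiting chunks
  activeOperation: "idle" | "append";
  activeWriteBytes: number; // size of the chunk currently being appended
};

class SketchQueuedWriter {
  private readonly append: (chunk: string) => Promise<void>;
  private readonly encoder = new TextEncoder();
  private readonly queue: string[] = [];
  private queuedBytes = 0;
  private activeOperation: "idle" | "append" = "idle";
  private activeWriteBytes = 0;
  private lastError: unknown = undefined;
  private tail: Promise<void> = Promise.resolve();

  // `append` is an assumed stand-in for the real file append call.
  constructor(append: (chunk: string) => Promise<void>) {
    this.append = append;
  }

  write(chunk: string): void {
    this.queue.push(chunk);
    this.queuedBytes += this.encoder.encode(chunk).byteLength;
    // Chain onto the previous append so writes stay ordered.
    this.tail = this.tail.then(() => this.appendNext());
  }

  // Resolves once every write queued so far has been attempted.
  async flush(): Promise<void> {
    await this.tail;
    if (this.lastError !== undefined) {
      const error = this.lastError;
      this.lastError = undefined;
      throw error;
    }
  }

  // Counts and sizes only: no file paths and no payload contents.
  diagnostics(): WriterDiagnostics {
    return {
      pendingWrites: this.queue.length,
      queuedBytes: this.queuedBytes,
      activeOperation: this.activeOperation,
      activeWriteBytes: this.activeWriteBytes,
    };
  }

  private async appendNext(): Promise<void> {
    const chunk = this.queue.shift();
    if (chunk === undefined) return;
    const bytes = this.encoder.encode(chunk).byteLength;
    this.queuedBytes -= bytes;
    this.activeOperation = "append";
    this.activeWriteBytes = bytes;
    try {
      await this.append(chunk);
    } catch (error) {
      this.lastError = error; // remembered and rethrown by flush()
    } finally {
      this.activeOperation = "idle";
      this.activeWriteBytes = 0;
    }
  }
}

function formatWriterDiagnostics(d: WriterDiagnostics): string {
  return `pendingWrites=${d.pendingWrites} queuedBytes=${d.queuedBytes} ` +
    `activeOperation=${d.activeOperation} activeWriteBytes=${d.activeWriteBytes}`;
}

async function runCleanupStepWithTimeout(opts: {
  step: string;
  timeoutMs: number;
  run: () => Promise<void>;
  describeTimeout?: () => string; // evaluated only when the timeout fires
  warn: (message: string) => void;
}): Promise<"completed" | "timed_out"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<"timed_out">((resolve) => {
    timer = setTimeout(() => resolve("timed_out"), opts.timeoutMs);
  });
  try {
    const completed = opts.run().then(() => "completed" as const);
    const outcome = await Promise.race([completed, timedOut]);
    if (outcome === "timed_out") {
      const details = opts.describeTimeout?.();
      const suffix = details ? ` (${details})` : "";
      opts.warn(`cleanup step ${opts.step} timed out after ${opts.timeoutMs}ms${suffix}`);
    }
    return outcome;
  } finally {
    clearTimeout(timer);
  }
}
```

In this sketch the diagnostics callback runs only when the timeout fires, so the warning reflects the writer's state at the moment of the stall. As in the PR's stated scope, the timeout does not abort the in-flight append; it only reports on it.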

galiniliev force-pushed the bug-004-pi-trajectory-flush-cleanup branch from 80e2c8a to 0a76f90 Compare May 19, 2026 06:20

clawsweeper Bot added the label "rating: 🐚 platinum hermit" (Good normal PR readiness with ordinary maintainer review expected) on May 19, 2026

galiniliev merged commit ddeaebf into openclaw:main May 19, 2026
104 of 105 checks passed

clawsweeper Bot added the label "status: 👀 ready for maintainer look" (ClawSweeper has no concrete contributor-facing blocker left for this PR) on May 19, 2026

clawsweeper Bot mentioned this pull request May 19, 2026

fix: emit warning on consecutive QueuedFileWriter failures #57089

Closed

fix(agents): add trajectory flush timeout diagnostics #82962
galiniliev merged 2 commits into openclaw:main from galiniliev:bug-004-pi-trajectory-flush-cleanup

galiniliev commented May 17, 2026

clawsweeper Bot commented May 17, 2026 (edited)

YonganZhang commented May 17, 2026

galiniliev commented May 19, 2026
