fix(agents): release embedded-attempt session lock on every exit path #86427
Codex review: needs maintainer review before merge. Reviewed May 25, 2026, 8:51 AM ET / 12:51 UTC.

Summary
PR surface: Source +19, Tests +39. Total +58 across 3 files.
Reproducibility: yes. Source inspection shows current main can skip cleanup after trajectory flush throws, and the PR body supplies a before/after terminal harness using the real lock primitive and controller.
Review metrics: none identified.

Merge readiness
Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.
Review details
Best possible solution: Land the focused dispose/finally release after maintainer review and required CI, then close #86014 as implemented while leaving the stalled-withSessionWriteLock watchdog variant out of scope.
Do we have a high-confidence way to reproduce the issue? Yes. Source inspection shows current main can skip cleanup after trajectory flush throws, and the PR body supplies a before/after terminal harness using the real lock primitive and controller.
Is this the best way to solve the issue? Yes. An idempotent controller-owned dispose called from the outer finally is a narrow safety net because normal cleanup still transfers and releases the lock through the existing cleanup path.
AGENTS.md: found and applied where relevant.
Codex review notes: model gpt-5.5, reasoning high; reviewed against fcf0bff92965.
1d1370b to acef986
acef986 to 0f0787a
The embedded run controller acquires its session write lock eagerly at creation but, until now, released it only inside the post-run cleanup block. An exception thrown in post-prompt processing skipped that block, so the lock stayed held by the live gateway process until the watchdog reclaimed it, and later requests to the session failed with SessionWriteLockTimeoutError.

Add an idempotent dispose() to the lock controller and call it from the run's outer finally so the eagerly-held lock is released on every exit path. Normal/aborted/timed-out runs still hand the lock to acquireForCleanup first, so dispose() is a no-op then (no double release).

Fixes openclaw#86014
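The pattern described in the commit message can be sketched roughly as follows. This is a minimal, synchronous illustration and not the code from the PR: `SessionLock`, `LockController`, and `runAttempt` are hypothetical stand-ins, and only the names `dispose()` and `acquireForCleanup` come from the description above (their bodies here are assumptions).

```typescript
// Hypothetical in-memory stand-in for the session write lock (not the real primitive).
class SessionLock {
  private held = false;

  // Returns a release function; throws if the lock is already held.
  acquire(): () => void {
    if (this.held) {
      throw new Error("SessionWriteLockTimeoutError (simulated)");
    }
    this.held = true;
    let released = false;
    return () => {
      if (released) return; // releasing twice is harmless
      released = true;
      this.held = false;
    };
  }

  get isHeld(): boolean {
    return this.held;
  }
}

// Hypothetical controller that takes the lock eagerly at creation.
class LockController {
  private release: (() => void) | undefined;

  constructor(lock: SessionLock) {
    this.release = lock.acquire();
  }

  // Normal cleanup path: ownership of the release handle moves to the caller.
  acquireForCleanup(): () => void {
    const release = this.release;
    this.release = undefined;
    if (!release) {
      throw new Error("lock already transferred or disposed");
    }
    return release;
  }

  // Safety net: releases only if the lock was never handed to cleanup.
  dispose(): void {
    const release = this.release;
    this.release = undefined;
    release?.();
  }
}

// Assumed shape of a run: post-prompt work may throw before cleanup is reached.
function runAttempt(lock: SessionLock, postPrompt: () => void): void {
  const controller = new LockController(lock);
  try {
    postPrompt();
    const release = controller.acquireForCleanup();
    release();
  } finally {
    controller.dispose(); // no-op after a normal cleanup, releases otherwise
  }
}
```

In this sketch, a `postPrompt` callback that throws still leaves the lock free for the next `runAttempt` call, while a normal run releases exactly once through the cleanup path and `dispose()` does nothing.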
201c5f5 to b5dd866
Verification for PR #86427 on head

Behavior addressed: embedded attempts release the eagerly-held session write lock from outer teardown when post-prompt code exits before the normal cleanup block, preventing later requests from wedging behind

node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts
git diff --check origin/main...HEAD
./node_modules/.bin/oxfmt --check --threads=1 src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts src/agents/pi-embedded-runner/run/attempt.session-lock.ts src/agents/pi-embedded-runner/run/attempt.ts

Evidence after fix: focused embedded session-lock suite passed with 68 tests; diff check and oxfmt check passed; CI run
…#86427)

* fix(agents): release embedded-attempt session lock on every exit path

The embedded run controller acquires its session write lock eagerly at creation and released it only inside the post-run cleanup block. An exception thrown in post-prompt processing skipped that block, so the lock leaked to the live gateway process until the watchdog reclaimed it and later requests to the session failed with SessionWriteLockTimeoutError.

Add an idempotent dispose() to the lock controller and call it from the run's outer finally so the eagerly-held lock is released on every exit path. Normal/aborted/timed-out runs still hand the lock to acquireForCleanup first, so dispose() is a no-op then (no double release).

Fixes #86014

* fix: keep session lock teardown comment lean

* docs(changelog): note embedded session lock fix

---------

Co-authored-by: Peter Steinberger <steipete@gmail.com>
…026.5.26) (#682)

This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [ghcr.io/openclaw/openclaw](https://openclaw.ai) ([source](https://github.com/openclaw/openclaw)) | patch | `2026.5.22` → `2026.5.26` |

---

> ⚠️ **Warning**
>
> Some dependencies could not be looked up. Check the [Dependency Dashboard](issues/567) for more information.

---

### Release Notes

<details>
<summary>openclaw/openclaw (ghcr.io/openclaw/openclaw)</summary>

### [`v2026.5.26`](https://github.com/openclaw/openclaw/blob/HEAD/CHANGELOG.md#2026526)

[Compare Source](https://github.com/openclaw/openclaw/compare/v2026.5.22...v2026.5.26)

##### Highlights

- Faster Gateway and replies: startup avoids repeated plugin, channel, session, usage-cost, warning, scheduled-service, and filesystem scans; visible replies separate user-facing sends from slower follow-up work; Gateway runtime/session caches churn less under load.
- Transcripts are core: transcript-backed meeting summaries, source-provider chunks, cleaned user turns, media provenance, Codex mirrors, WebChat replies, and CLI/TUI replay now use one more reliable transcript path.
- More channels are production-ready: Telegram keeps typing/progress context and forum topics, iMessage handles attachment roots, remote media staging, and duplicate local Messages sources, WhatsApp restores group/media behavior, Discord improves voice playback and model picking, and Signal/iMessage/WhatsApp get reaction approvals.
- Better voice and Talk: realtime Talk runs can be inspected, steered, cancelled, or followed up from Web UI and Discord voice; wake-name handling is more tolerant without letting ambient speech trigger agents.
- Safer content boundaries: Browser snapshot reads honor SSRF policy, system-event text cannot spoof nested prompt markers, fetched file text is wrapped as external content, ClickClack inbound sender allowlists run before agent dispatch, stale device tokens are rejected, and serialized tool-call text is scrubbed from replies. - Providers, Codex, and local models are steadier: named auth profiles, OpenAI sampling params, Codex app-server resume/timeout/usage-limit recovery, dynamic tool-schema guards, xAI usage-limit surfacing, Ollama top-p normalization, and local approval resolution reduce provider-specific dead ends. - More reliable install/update/release paths: Alpine installs, trusted runtime fallback roots, stable update channels, Docker/package timeouts, Windows Scheduled Tasks, Windows/macOS proof lanes, Testbox/Crabbox delegation, plugin publish checks, and macOS runner bootstraps all got hardened. - Better observability: Activity tab, gateway secret-prep traces, tool/model stream progress, explicit fast-mode status, systemd Gateway hygiene, OpenTelemetry LLM spans, release performance evidence, and richer telemetry signals make failures easier to inspect. ##### Changes - Transcripts: add core transcript capture and source-provider support for transcript-backed meeting summaries, including the renamed Transcripts docs, CLI surface, source-provider chunks, and cleaned user-turn persistence. - Auth: add named model login profiles and supported credential migration for Hermes, OpenCode, and Codex auth profiles, with explicit opt-out and non-interactive controls. ([#​85667](https://github.com/openclaw/openclaw/issues/85667)) Thanks [@​fuller-stack-dev](https://github.com/fuller-stack-dev). - Diagnostics: trace gateway secret preparation, classify skill/tool usage, surface model stream progress, add OpenTelemetry LLM content spans, and expose alertable telemetry for blocked tools, failover, stale sessions, liveness, oversized payloads, and webhook ingress. 
([#​83019](https://github.com/openclaw/openclaw/issues/83019), [#​80370](https://github.com/openclaw/openclaw/issues/80370), [#​86191](https://github.com/openclaw/openclaw/issues/86191)) - Channels: add Signal reaction approvals, iMessage thumb approval reactions, and WhatsApp thumb approval reaction support so mobile approval flows work without textual `/approve` commands. ([#​85894](https://github.com/openclaw/openclaw/issues/85894), [#​85952](https://github.com/openclaw/openclaw/issues/85952), [#​85477](https://github.com/openclaw/openclaw/issues/85477)) - Agents/API: forward OpenAI sampling params through the Gateway and expose estimated context-budget status for active agent runs. ([#​84094](https://github.com/openclaw/openclaw/issues/84094)) - TUI/status: queue prompts submitted while an agent is busy and show explicit fast-mode state plus richer systemd Gateway hygiene in status output. ([#​86722](https://github.com/openclaw/openclaw/issues/86722), [#​87115](https://github.com/openclaw/openclaw/issues/87115), [#​86976](https://github.com/openclaw/openclaw/issues/86976)) - Exec approvals: hide durable approval actions that are unavailable for the current prompt and keep approval runtime tokens local-only so stale prompts cannot offer misleading controls. ([#​86270](https://github.com/openclaw/openclaw/issues/86270), [#​86359](https://github.com/openclaw/openclaw/issues/86359)) - Plugin SDK: add reaction approval helpers and keep diagnostic event root exports discoverable across function-name and alias-bound module graphs. ([#​86735](https://github.com/openclaw/openclaw/issues/86735), [#​87084](https://github.com/openclaw/openclaw/issues/87084)) - Android/iOS: add the Android pair-new-gateway action and improve mobile Talk mode surfaces, including iOS realtime Talk mode and Android offline voice/gateway recovery. 
([#​86798](https://github.com/openclaw/openclaw/issues/86798), [#​86355](https://github.com/openclaw/openclaw/issues/86355)) Thanks [@​ngutman](https://github.com/ngutman). - Performance: cache plugin metadata snapshots, package realpaths, stable gateway metadata, model cost indexes, channel resolution, usage-cost indexes, and session/auth hot-path facts so common Gateway and reply paths do less rediscovery. ([#​84649](https://github.com/openclaw/openclaw/issues/84649), [#​85843](https://github.com/openclaw/openclaw/issues/85843), [#​86517](https://github.com/openclaw/openclaw/issues/86517), [#​86678](https://github.com/openclaw/openclaw/issues/86678)) - Voice: expose shared realtime turn-context tracking through the realtime voice SDK and reuse it for Discord speaker attribution and wake-name context recovery. - Voice: reuse shared realtime output activity tracking in Google Meet command and node audio bridges, including recent-output checks for local barge-in detection. - Voice: expose shared realtime output activity tracking through the realtime voice SDK and reuse it for Discord playback activity and barge-in decisions. - Voice: expose shared realtime consult question matching, speakable-result extraction, and alias-aware forced-consult coordination through the realtime voice SDK, then reuse it in Gateway Talk, Voice Call, and Discord voice paths. - Voice: share activation-name matching and consult-transcript screening through the realtime voice SDK so Discord, browser voice, and meeting surfaces can reuse one implementation. - Cron: default `cron.maxConcurrentRuns` to 8 so scheduled automations and their isolated agent turns can make progress in parallel without explicit configuration. - QA-Lab: add `qa coverage --match <query>` so focused proof selection can discover matching scenarios from existing metadata before running live or remote lanes. - Discord/model picker: surface an alpha-bucket select (e.g. 
`A–G (12) · H–N (18) · O–Z (5)`) when the provider list or a provider's model list exceeds 25 items, so configs with `provider/*` wildcards stay one click from the right page instead of paginating through prev/next; falls back to numeric chunks when every item shares the same first letter. - Control UI: add an ephemeral Activity tab for sanitized live tool activity summaries without persisting raw telemetry. Fixes [#​12831](https://github.com/openclaw/openclaw/issues/12831). Thanks [@​BunsDev](https://github.com/BunsDev). - Build: include `ui:build` in the `full` and `ciArtifacts` profiles of `scripts/build-all.mjs` so `pnpm build` always rebuilds `dist/control-ui` after `tsdown` cleans `dist`, removing the second-command requirement and the missing-asset failure mode for source/runtime installs and CI artifact uploads. ([#​85206](https://github.com/openclaw/openclaw/issues/85206)) - iOS: improve Talk mode with direct realtime voice sessions, compact toolbar status, and responsive voice waveform feedback. ([#​86355](https://github.com/openclaw/openclaw/issues/86355)) Thanks [@​ngutman](https://github.com/ngutman). - Media: replace the Sharp image backend with Rastermill for metadata, resizing, EXIF orientation, and PNG alpha-preserving optimization so OpenClaw no longer installs Sharp or the WhatsApp Jimp fallback for image processing. ([#​86437](https://github.com/openclaw/openclaw/issues/86437)) - Codex: update the bundled Codex CLI to 0.134.0 and keep native compaction disabled for budget-triggered app-server turns so OpenClaw owns the recovery boundary. ([#​86772](https://github.com/openclaw/openclaw/issues/86772)) ##### Fixes - Memory/security: reject prompt-like text submitted through the explicit `memory_store` tool before embedding or storage, matching the existing auto-capture prompt-injection filter. 
([#​87142](https://github.com/openclaw/openclaw/issues/87142)) - Gateway/security: enable the default auth rate limiter for remote non-browser and HTTP gateway auth failures when `gateway.auth.rateLimit` is unset, while preserving the loopback exemption. ([#​87148](https://github.com/openclaw/openclaw/issues/87148)) - Prompt hardening: route untrusted group prompt metadata through sanitized untrusted structured context while preserving trusted operator-configured group system prompts and aligning the plugin SDK docs/test helpers. ([#​87144](https://github.com/openclaw/openclaw/issues/87144)) - Security/content boundaries: validate Browser snapshot tab URLs against SSRF policy before ChromeMCP or direct CDP reads, sanitize queued system-event text so untrusted plugin/channel labels cannot spoof nested prompt markers, wrap fetched file text and metadata as external content, apply ClickClack `allowFrom` sender allowlists before agent dispatch, reject RPCs from invalidated device-token clients during rotation, require staged sandbox media refs, and scrub serialized tool-call text from replies. ([#​78526](https://github.com/openclaw/openclaw/issues/78526), [#​87094](https://github.com/openclaw/openclaw/issues/87094), [#​87062](https://github.com/openclaw/openclaw/issues/87062), [#​83741](https://github.com/openclaw/openclaw/issues/83741), [#​70707](https://github.com/openclaw/openclaw/issues/70707), [#​86924](https://github.com/openclaw/openclaw/issues/86924)) Thanks [@​zsxsoft](https://github.com/zsxsoft), [@​ttzero25](https://github.com/ttzero25), and [@​mmaps](https://github.com/mmaps). - Transcripts/user turns: persist CLI, WebChat, media, follow-up, hook, and Codex-mirror user turns to the admitted session target; keep cleaned transcript text, inline image routing, provenance metadata, replay hooks, and fallback paths idempotent when runtimes fail or restart. 
- TUI/status/onboarding/UI: queue busy TUI prompts instead of dropping them, preserve the configured default model during onboarding, show failed tool results as errors, show config-open failures in Control UI, keep status JSON plugin scans healthy, preserve xAI usage-limit errors locally, and expose explicit fast-mode/systemd state. ([#​86722](https://github.com/openclaw/openclaw/issues/86722), [#​87000](https://github.com/openclaw/openclaw/issues/87000), [#​85786](https://github.com/openclaw/openclaw/issues/85786), [#​87108](https://github.com/openclaw/openclaw/issues/87108), [#​87001](https://github.com/openclaw/openclaw/issues/87001), [#​86614](https://github.com/openclaw/openclaw/issues/86614), [#​87115](https://github.com/openclaw/openclaw/issues/87115), [#​86976](https://github.com/openclaw/openclaw/issues/86976)) - Plugin commands/SDK: preserve plugin LLM command auth, bind native plugin command dispatch to the host agent's LLM auth, keep `onDiagnosticEvent` exports discoverable through `Function.name`, stabilize diagnostic event root aliases, correlate pathless read diagnostics, suppress transient runner failures in channel command paths, and repair local approval resolution. ([#​85936](https://github.com/openclaw/openclaw/issues/85936), [#​87084](https://github.com/openclaw/openclaw/issues/87084), [#​86977](https://github.com/openclaw/openclaw/issues/86977), [#​87069](https://github.com/openclaw/openclaw/issues/87069), [#​86771](https://github.com/openclaw/openclaw/issues/86771)) - Codex/providers: keep WebChat delivery hints out of user prompts, avoid false queued-terminal idle timeouts, share the native hook relay registry, quarantine unsupported dynamic tool schemas, preserve Claude resumed-session system prompts, normalize greedy Ollama `top_p`, preserve per-agent thinking defaults for ingress runs, and avoid native compaction takeover on budget-triggered Codex turns. 
([#​87096](https://github.com/openclaw/openclaw/issues/87096), [#​73950](https://github.com/openclaw/openclaw/issues/73950), [#​87049](https://github.com/openclaw/openclaw/issues/87049), [#​86689](https://github.com/openclaw/openclaw/issues/86689), [#​86772](https://github.com/openclaw/openclaw/issues/86772)) - Gateway/perf/release: reuse startup-warning metadata and prepared auth stores, avoid cloning live-switch and lifecycle session caches on read paths, defer warning and scheduled-service fallback imports, trim Gateway session/startup/runtime CPU churn, skip duplicate turn session touches, stop chat timeout fallback cascades, drop stale subagent announce history, bound benchmark/watch/kitchen-sink teardown waits, bound macOS/package/onboarding/plugin smoke commands, bound install finalization probes, resolve Parallels npm-update commands from guest `PATH`, and bootstrap raw AWS macOS Node/pnpm commands through `/usr/bin/env`. ([#​86997](https://github.com/openclaw/openclaw/issues/86997)) - Reply/perf: reduce visible reply delivery latency by preserving Telegram typing/progress context, lazy-loading slash-command startup metadata, avoiding hot-path model hydration, flag-gating Codex profiler timing, deferring context compaction maintenance, and tracking delivery timing. ([#​86989](https://github.com/openclaw/openclaw/issues/86989), [#​86990](https://github.com/openclaw/openclaw/issues/86990), [#​86991](https://github.com/openclaw/openclaw/issues/86991), [#​86992](https://github.com/openclaw/openclaw/issues/86992), [#​86993](https://github.com/openclaw/openclaw/issues/86993), [#​86994](https://github.com/openclaw/openclaw/issues/86994)) Thanks [@​keshavbotagent](https://github.com/keshavbotagent). - Reply/source delivery: keep TUI, Control UI, media, TTS, transcript, and Codex source-reply finals live without duplicate terminal events or stale replay artifacts. 
- Agents/replay: repair legacy tool results before replay, preserve `sessions_spawn` transcript payloads, restore current guard checks, stage sandboxed workspace media, and keep duplicate transcripts tool display metadata from reappearing. ([#​82203](https://github.com/openclaw/openclaw/issues/82203), [#​86934](https://github.com/openclaw/openclaw/issues/86934), [#​87025](https://github.com/openclaw/openclaw/issues/87025)) Thanks [@​martingarramon](https://github.com/martingarramon), [@​vincentkoc](https://github.com/vincentkoc), and [@​joshavant](https://github.com/joshavant). - Agents/sessions: handle active-fallback failures in `sessions_send` so fallback routing reports the real failure and does not leave callers with an ambiguous dropped send. ([#​86638](https://github.com/openclaw/openclaw/issues/86638)) - Agents/hooks/subagents: enforce default hook agent allowlists, recover failed subagent lifecycle completions, and keep node task lifecycle cleanup from closing the Gateway listener. ([#​86101](https://github.com/openclaw/openclaw/issues/86101)) - Codex: project newer OpenClaw chat history into resumed app-server threads and keep Codex turn timeouts inside the Codex runtime boundary so timeouts do not poison shared app-server clients or fall through to unrelated provider fallback. ([#​86677](https://github.com/openclaw/openclaw/issues/86677), [#​86476](https://github.com/openclaw/openclaw/issues/86476)) Thanks [@​TurboTheTurtle](https://github.com/TurboTheTurtle) and [@​pashpashpash](https://github.com/pashpashpash). - Config/doctor/update: narrow profiled tool-section doctor repair, keep runtime-injected legacy web-search provider config out of user-authored config validation, and keep prerelease tags excluded from stable updater resolution. 
([#​87030](https://github.com/openclaw/openclaw/issues/87030), [#​86818](https://github.com/openclaw/openclaw/issues/86818), [#​86559](https://github.com/openclaw/openclaw/issues/86559)) Thanks [@​joshavant](https://github.com/joshavant), [@​luoyanglang](https://github.com/luoyanglang), and [@​stevenepalmer](https://github.com/stevenepalmer). - Doctor/runtime: validate active bundled MCP tool schemas through the same runtime projection path so unsupported MCP input schemas are reported and quarantined instead of poisoning assistant startup. - CLI/Windows: add a Windows-only stack-size respawn for stack-heavy startup paths, default CLI logs to local timestamps, and validate timeout/banner TTY state more strictly. ([#​87031](https://github.com/openclaw/openclaw/issues/87031), [#​85387](https://github.com/openclaw/openclaw/issues/85387)) Thanks [@​giodl73-repo](https://github.com/giodl73-repo) and [@​vincentkoc](https://github.com/vincentkoc). - Locking/security: require owner identity proof before stale plugin lock removal, memoize session lock owner arguments, and avoid writing default exec approval stores unless policy state actually changed. ([#​86814](https://github.com/openclaw/openclaw/issues/86814), [#​86964](https://github.com/openclaw/openclaw/issues/86964)) Thanks [@​Alix-007](https://github.com/Alix-007) and [@​vincentkoc](https://github.com/vincentkoc). - Install/release: bound Docker package build, inventory, pack, and tarball preparation with process-group timeouts; pin shrinkwrap patch drift to the pnpm lock; harden macOS restart and dSYM packaging; and run release Docker/live timeout wrappers in the foreground so child processes cannot wedge gates. - QA/Telegram: bound Telegram user credential tar and broker calls so live proof setup fails with a timeout instead of waiting for the outer Crabbox job deadline. 
- QA/Tool Search: bound gateway E2E HTTP probes, run only the fixture plugin, and clean up temporary fixture trees after the compact tool-catalog proof completes. - Telegram/network: treat `ENETDOWN` as a transient pre-connect network failure so Telegram sends, gateway unhandled-rejection handling, and cron network retries follow the same recovery path as sibling network outages. ([#​86762](https://github.com/openclaw/openclaw/issues/86762)) Thanks [@​TurboTheTurtle](https://github.com/TurboTheTurtle). - Telegram: preserve inbound text entities, overlapping DM replies, account topic cache sidecars, outbound reply context, targeted bot-command mentions, durable group retry targets, forum topic names, and native progress callbacks. ([#​83873](https://github.com/openclaw/openclaw/issues/83873), [#​85361](https://github.com/openclaw/openclaw/issues/85361), [#​85555](https://github.com/openclaw/openclaw/issues/85555), [#​85656](https://github.com/openclaw/openclaw/issues/85656), [#​85709](https://github.com/openclaw/openclaw/issues/85709), [#​86299](https://github.com/openclaw/openclaw/issues/86299), [#​86553](https://github.com/openclaw/openclaw/issues/86553)) Thanks [@​SebTardif](https://github.com/SebTardif), [@​luoyanglang](https://github.com/luoyanglang), and [@​neeravmakwana](https://github.com/neeravmakwana). - iMessage: read image attachments from local Messages attachment roots, dedupe duplicate local Messages-source accounts, seed direct DM history, fix image/group media attachment commands, advance catchup cursors after live handling, and keep slash-command acknowledgements in the source conversation. 
([#​82642](https://github.com/openclaw/openclaw/issues/82642), [#​85475](https://github.com/openclaw/openclaw/issues/85475), [#​86569](https://github.com/openclaw/openclaw/issues/86569), [#​86705](https://github.com/openclaw/openclaw/issues/86705), [#​86706](https://github.com/openclaw/openclaw/issues/86706), [#​86770](https://github.com/openclaw/openclaw/issues/86770)) Thanks [@​homer-byte](https://github.com/homer-byte), [@​TurboTheTurtle](https://github.com/TurboTheTurtle), [@​swang430](https://github.com/swang430), and [@​OmarShahine](https://github.com/OmarShahine). - WhatsApp/QQ/Twitch/IRC/Slack: restore WhatsApp ack identity and group-drop warnings, make QQ Bot media respect `OPENCLAW_HOME`, serialize Twitch auth disconnects, store IRC channel routes canonically, and keep Slack downloaded files out of reply media. ([#​83833](https://github.com/openclaw/openclaw/issues/83833), [#​85309](https://github.com/openclaw/openclaw/issues/85309), [#​85777](https://github.com/openclaw/openclaw/issues/85777), [#​85794](https://github.com/openclaw/openclaw/issues/85794), [#​85906](https://github.com/openclaw/openclaw/issues/85906), [#​86318](https://github.com/openclaw/openclaw/issues/86318), [#​86697](https://github.com/openclaw/openclaw/issues/86697)) Thanks [@​sliverp](https://github.com/sliverp), [@​neeravmakwana](https://github.com/neeravmakwana), and [@​Kailigithub](https://github.com/Kailigithub). - Discord/voice: improve voice playback and wake replies, bucket large model picker menus, merge media captions into one message, route metadata through configured proxies, restore numeric channel sends, suppress self-reply echoes, and tighten wake matching without breaking fuzzy wake phrases. 
([#80227](https://github.com/openclaw/openclaw/issues/80227), [#86238](https://github.com/openclaw/openclaw/issues/86238), [#86487](https://github.com/openclaw/openclaw/issues/86487), [#86571](https://github.com/openclaw/openclaw/issues/86571), [#86595](https://github.com/openclaw/openclaw/issues/86595), [#86601](https://github.com/openclaw/openclaw/issues/86601))
- Codex: preserve native web-search metadata, keep oversized native thread reuse, bridge CLI API-key auth into the app server, preserve sandbox bootstrap path style, recover context-window prompt errors, honor yolo approval policy, disable native thread personality, and route compaction through Codex auth. ([#85378](https://github.com/openclaw/openclaw/issues/85378), [#85542](https://github.com/openclaw/openclaw/issues/85542), [#85891](https://github.com/openclaw/openclaw/issues/85891), [#85909](https://github.com/openclaw/openclaw/issues/85909), [#86408](https://github.com/openclaw/openclaw/issues/86408))
- Agents/runtime: enforce session lock max-hold reclaim, release embedded-attempt locks on all exits, treat aborted subagent runs as terminal, avoid runtime model hydration on hot paths, disclose scoped session list counts, derive overflow budgets from provider errors, and keep fallback errors scoped to the active model candidate. ([#70473](https://github.com/openclaw/openclaw/issues/70473), [#85764](https://github.com/openclaw/openclaw/issues/85764), [#86014](https://github.com/openclaw/openclaw/issues/86014), [#86134](https://github.com/openclaw/openclaw/issues/86134), [#86427](https://github.com/openclaw/openclaw/issues/86427), [#86944](https://github.com/openclaw/openclaw/issues/86944)) Thanks [@openperf](https://github.com/openperf), [@fuller-stack-dev](https://github.com/fuller-stack-dev), [@zhangguiping-xydt](https://github.com/zhangguiping-xydt), and [@ferminquant](https://github.com/ferminquant).
- Config/update/doctor: retry config recovery after failed backup restore, skip shell env fallback on Windows, exclude prerelease tags from the stable git channel, support deep config edits, warn instead of aborting on unreadable cron stores, prune stale bundled plugin paths, and avoid duplicate restart prompts when the Gateway is already healthy. ([#​85739](https://github.com/openclaw/openclaw/issues/85739), [#​85787](https://github.com/openclaw/openclaw/issues/85787), [#​86060](https://github.com/openclaw/openclaw/issues/86060), [#​86260](https://github.com/openclaw/openclaw/issues/86260), [#​86384](https://github.com/openclaw/openclaw/issues/86384), [#​86533](https://github.com/openclaw/openclaw/issues/86533)) Thanks [@​liaoyl830](https://github.com/liaoyl830). - Install/release: support Alpine CLI installs and runtime floors, prefer trusted startup argv runtime fallback roots, reject stale CLI node runtimes, avoid npm `min-release-age` installer failures, bound npm/package/Docker install phases, restore config parent ownership in Docker, seed Docker lockfile package tarballs before prune, make release/plugin prerelease checks fail closed instead of hanging or false-greening, and use host-visible Crabbox local work roots for Docker-backed proof. ([#​85491](https://github.com/openclaw/openclaw/issues/85491)) - Windows daemon: keep Scheduled Task gateway launches running on battery power and avoid workgroup-machine prompts for a domain user during task installation. ([#​59299](https://github.com/openclaw/openclaw/issues/59299)) - Security: avoid printing Gateway tokens in Docker, validate plugin model-pattern regexes safely, escape transcript metadata field names, harden session allowlist glob matching, audit Claude permission overrides under YOLO, and require explicit allow for ACP auto approvals. 
([#​85849](https://github.com/openclaw/openclaw/issues/85849), [#​85934](https://github.com/openclaw/openclaw/issues/85934), [#​86046](https://github.com/openclaw/openclaw/issues/86046), [#​86557](https://github.com/openclaw/openclaw/issues/86557)) - Media/images: replace Sharp with Rastermill, keep EXIF normalization best-effort, normalize HEIC/HEIF before image descriptions, route Codex image API keys through OpenAI, preserve image compression metadata, and auto-scale live tool result caps. ([#​85776](https://github.com/openclaw/openclaw/issues/85776), [#​86037](https://github.com/openclaw/openclaw/issues/86037), [#​86437](https://github.com/openclaw/openclaw/issues/86437), [#​86857](https://github.com/openclaw/openclaw/issues/86857), [#​86923](https://github.com/openclaw/openclaw/issues/86923)) - Memory: prevent semantic vector indexes from silently degrading when embeddings are unavailable, stop doctor OOMs on large session stores, preserve sidecar hooks/artifacts, write fallback dream diaries, use CJK-aware dreaming dedupe, and avoid per-file watcher FD fan-out. ([#​80613](https://github.com/openclaw/openclaw/issues/80613), [#​82928](https://github.com/openclaw/openclaw/issues/82928), [#​85060](https://github.com/openclaw/openclaw/issues/85060), [#​85704](https://github.com/openclaw/openclaw/issues/85704), [#​85967](https://github.com/openclaw/openclaw/issues/85967), [#​86701](https://github.com/openclaw/openclaw/issues/86701)) Thanks [@​brokemac79](https://github.com/brokemac79), [@​openperf](https://github.com/openperf), and [@​yaaboo-gif](https://github.com/yaaboo-gif). - Agents/sessions: include visibility metadata on restricted `sessions_list` results so scoped counts are clearly reported without widening access or exposing hidden-session counts. ([#​86944](https://github.com/openclaw/openclaw/issues/86944)) Thanks [@​ferminquant](https://github.com/ferminquant). 
- Gateway/DNS: validate wide-area discovery domains before deriving zone paths or writing zone files, so invalid `discovery.wideArea.domain` and `dns setup --domain` values fail with a DNS-name diagnostic instead of falling through to unrelated configuration errors. Thanks [@mmaps](https://github.com/mmaps).
- Agents/BTW: route fallback side-question streams through the embedded stream resolver so Anthropic-compatible MiniMax requests use the same capped transport as normal chat. ([#86312](https://github.com/openclaw/openclaw/issues/86312)) Thanks [@neeravmakwana](https://github.com/neeravmakwana).
- Telegram: treat `/command@TargetBot` bot-command entities as explicit mentions for the addressed bot so `requireMention` groups no longer drop targeted commands or captions. Fixes [#84462](https://github.com/openclaw/openclaw/issues/84462). ([#86553](https://github.com/openclaw/openclaw/issues/86553)) Thanks [@luoyanglang](https://github.com/luoyanglang).
- CI: bound Docker/Bash E2E tarball npm installs with `OPENCLAW_E2E_NPM_INSTALL_TIMEOUT` so package, onboarding, plugin, and upgrade lanes fail instead of hanging on a stuck npm install.
- CI: fail Parallels npm-update smoke jobs after the guest command timeout and cleanup backstop instead of only logging a timeout line.
- CI: bound kitchen-sink RPC HTTP probes so stalled gateway readiness or response bodies fail and retry instead of wedging the walker.
- CI: bound Telegram user Crabbox proof Bot API calls so stalled Telegram responses fail instead of wedging credential and desktop proof cleanup.
- CI: bound MCP channel stdio client initialization so Docker channel proof fails and closes the bridge transport instead of waiting for the outer job timeout.
- CI: keep `OPENCLAW_TESTBOX=1 pnpm check:changed` delegating to Blacksmith Testbox through Crabbox without forwarding local Testbox or worker env into the remote command.
- CI: send KILL after the TERM grace period for manual checkout fetch timeouts so stuck Testbox and workflow checkout retries cannot hang behind a wedged `git fetch`.
- CI: send KILL after the TERM grace period for Bun global install smoke command timeouts so trapped `openclaw` child processes cannot wedge the scheduled install smoke.
- iMessage: thread current channel/account inbound attachment roots into the image tool so iMessage-saved attachments under `~/Library/Messages/Attachments` (including the wildcard `/Users/*/Library/Messages/Attachments` root) are read through the existing inbound path policy instead of being rejected as `path-not-allowed`. Literal `localRoots` stays workspace-scoped. Fixes [#30170](https://github.com/openclaw/openclaw/issues/30170). ([#86569](https://github.com/openclaw/openclaw/issues/86569))
- QQ Bot: respect `OPENCLAW_HOME` for outbound media path resolution so `<qqmedia>` sends no longer silently fail when `HOME` and `OPENCLAW_HOME` differ (Docker / multi-user hosts). Persisted QQ Bot data (sessions, known users, refs) stays anchored on the OS home for upgrade compatibility. Fixes [#83562](https://github.com/openclaw/openclaw/issues/83562). Thanks [@sliverp](https://github.com/sliverp).
- Update: report the primary malformed `openclaw.extensions` payload error without adding a duplicate missing-main diagnostic. ([#86596](https://github.com/openclaw/openclaw/issues/86596)) Thanks [@ferminquant](https://github.com/ferminquant).
- Control UI: keep host-local Markdown file paths inert while preserving app-relative links. ([#86620](https://github.com/openclaw/openclaw/issues/86620)) Thanks [@BryanTegomoh](https://github.com/BryanTegomoh).
- Gateway: dampen repeated unauthenticated device-required probes per URL while preserving explicit-auth and paired recovery paths. ([#86575](https://github.com/openclaw/openclaw/issues/86575)) Thanks [@ferminquant](https://github.com/ferminquant).
- IRC: store inbound channel routes with the canonical `channel:#name` target and join transient channel sends before writing. ([#​85906](https://github.com/openclaw/openclaw/issues/85906)) Thanks [@​Kailigithub](https://github.com/Kailigithub). - Usage: surface unknown all-zero model pricing as missing cost entries instead of a confident `$0` total. ([#​85882](https://github.com/openclaw/openclaw/issues/85882)) Thanks [@​MichaelZelbel](https://github.com/MichaelZelbel). - Agents/Codex: honor yolo app-server approval policy only for the full `never` plus `danger-full-access` case. ([#​85909](https://github.com/openclaw/openclaw/issues/85909)) Thanks [@​earlvanze](https://github.com/earlvanze). - Gateway/Gmail: clear Gmail watcher renewal intervals on re-entry so hot reloads do not leak lifecycle timers. ([#​82947](https://github.com/openclaw/openclaw/issues/82947)) Thanks [@​SebTardif](https://github.com/SebTardif). - Logging: exit cleanly on broken stdout/stderr pipes without masking existing failure exit codes. ([#​80059](https://github.com/openclaw/openclaw/issues/80059)) Thanks [@​pavelzak](https://github.com/pavelzak). - Gateway/security: escape transcript metadata field names while extracting oversized session line prefixes. ([#​85934](https://github.com/openclaw/openclaw/issues/85934)) Thanks [@​SebTardif](https://github.com/SebTardif). - Plugins/security: validate manifest model pattern regexes with the safe-regex compiler so unsafe patterns are ignored before matching. ([#​86046](https://github.com/openclaw/openclaw/issues/86046)) Thanks [@​SebTardif](https://github.com/SebTardif). - Discord: route gateway metadata REST lookups through the configured Discord proxy so proxied accounts do not fall back to direct `discord.com` connections before opening the WebSocket. Fixes [#​80227](https://github.com/openclaw/openclaw/issues/80227). Thanks [@​Clivilwalker](https://github.com/Clivilwalker). 
- Agents/media: hydrate current-turn image attachments from filename-derived MIME types so active vision can see generated or forwarded images whose source omitted an image content type. ([#​84812](https://github.com/openclaw/openclaw/issues/84812)) Thanks [@​marchpure](https://github.com/marchpure). - Agents/fs: point workspace-only scratch-path guidance at in-workspace temp directories while keeping host-root writes rejected by the tool guard. ([#​86501](https://github.com/openclaw/openclaw/issues/86501)) Thanks [@​tianxiaochannel-oss88](https://github.com/tianxiaochannel-oss88). - Agents/media: keep async cron media completions scoped to their run session while preserving direct delivery for stale generated-media success and failure notifications. ([#​86529](https://github.com/openclaw/openclaw/issues/86529)) Thanks [@​ai-hpc](https://github.com/ai-hpc). - Gateway: emit plugin `session_end`/`session_start` hooks when `agent.send` rotates or replaces a session id, keeping hook lifecycle state aligned with `sessions.changed` notifications. Fixes [#​83507](https://github.com/openclaw/openclaw/issues/83507). ([#​85875](https://github.com/openclaw/openclaw/issues/85875)) Thanks [@​brokemac79](https://github.com/brokemac79). - OpenShell/SSH: reject malformed generated exec commands before sandbox/session setup so unresolved workflow placeholders fail fast instead of reaching the remote shell. Fixes [#​72373](https://github.com/openclaw/openclaw/issues/72373). Thanks [@​brokemac79](https://github.com/brokemac79). - Google: stop normalizing `gemini-3.1-flash-lite` to the retired preview endpoint and update Flash Lite alias guidance to the GA model id. Fixes [#​86151](https://github.com/openclaw/openclaw/issues/86151). ([#​86240](https://github.com/openclaw/openclaw/issues/86240)) Thanks [@​SebTardif](https://github.com/SebTardif). 
- Installer: make Alpine apk installs cover Git, verify the Node runtime floor, try `nodejs-current`, and report Alpine version guidance when repositories only provide older Node packages. - Agents/status: prefer the active Claude CLI OAuth auth label over an unused Anthropic env API-key label for equivalent runtime aliases. Fixes [#​80184](https://github.com/openclaw/openclaw/issues/80184). ([#​86570](https://github.com/openclaw/openclaw/issues/86570)) Thanks [@​brokemac79](https://github.com/brokemac79). - Agents/media: send direct fallback for generated media still missing after an active requester wake fails. ([#​85489](https://github.com/openclaw/openclaw/issues/85489)) Thanks [@​fuller-stack-dev](https://github.com/fuller-stack-dev). - Agents: derive overflow compaction budgets from provider-reported and synthetic over-budget token counts so confirmed context overflows compact before retrying. ([#​70473](https://github.com/openclaw/openclaw/issues/70473)) Thanks [@​fuller-stack-dev](https://github.com/fuller-stack-dev). - Agents/Codex: recover Codex context-window prompt errors through overflow compaction and surface reset guidance when recovery is exhausted. ([#​85542](https://github.com/openclaw/openclaw/issues/85542)) Thanks [@​fuller-stack-dev](https://github.com/fuller-stack-dev). - Agents/Codex: allow Codex app-server runs to bootstrap from `CODEX_API_KEY` or `OPENAI_API_KEY` when no Codex auth profile is configured. - Agents/Codex: keep selected Codex runtime routing on OpenAI-Codex while preserving direct OpenAI API-key compaction fallback. ([#​86408](https://github.com/openclaw/openclaw/issues/86408)) Thanks [@​funmerlin](https://github.com/funmerlin) and [@​VACInc](https://github.com/VACInc). - Agent transcript: include OpenClaw agent session logs when finding local transcript candidates. - Crabbox: bootstrap raw AWS macOS shell commands wrapped in absolute `time` paths so RSS probes can run Node and pnpm on fresh macOS runners. 
- Crabbox: bootstrap raw AWS macOS shell commands even when setup statements precede Node or pnpm usage.
- TUI/local: skip unnecessary secret resolution, gateway model catalog loading, bootstrap, and skill scans in explicit local-model runs so startup reaches the model request faster.
- Sessions/doctor: load large session stores without clone amplification during read-only doctor checks and reclaim stale `sessions.json.*.tmp` sidecars. Fixes [#56827](https://github.com/openclaw/openclaw/issues/56827). Thanks [@openperf](https://github.com/openperf).
- Tests: clean successful plugin gateway gauntlet isolated temp roots while keeping an explicit preservation switch for failed/debug runs.
- Plugins/perf: reuse derived plugin metadata snapshots for the lifetime of the process so reply-time skill setup no longer rescans plugin metadata on every turn.
- Discord/OpenAI voice: keep wake-name master consults using the current speaker context after ignored ambient transcripts and shorten the default capture silence grace.
- Doctor: skip redundant Gateway restart prompts when a recent supervisor restart leaves the Gateway healthy. Fixes [#86518](https://github.com/openclaw/openclaw/issues/86518). ([#86533](https://github.com/openclaw/openclaw/issues/86533)) Thanks [@liaoyl830](https://github.com/liaoyl830).
- Cron: restore suspended cron lanes to the configured/default concurrency instead of falling back to one after quota or circuit-breaker auto-resume.
- Gateway: keep session-only Control UI tool-start mirrors flowing during diagnostic queue pressure instead of silently dropping non-terminal tool updates.
- Agents/memory: return optional not-found context for missing date-only daily memory reads instead of logging benign first-run `ENOENT` failures. Fixes [#82928](https://github.com/openclaw/openclaw/issues/82928). Thanks [@galiniliev](https://github.com/galiniliev).
- Discord: merge streamed text captions into following media block replies so captions and attachments send as one message. ([#86487](https://github.com/openclaw/openclaw/issues/86487)) Thanks [@neeravmakwana](https://github.com/neeravmakwana).
- Gateway: avoid sending duplicate tool-event frames to Control UI connections that are subscribed by both run and session.
- Discord/OpenAI voice: accept broader edge-position fuzzy wake-name transcripts while keeping ambient speech gated.
- Discord/OpenAI voice: accept longer leading wake-name mistranscripts such as "Open Club" for OpenClaw.
- Agents/OpenAI-compatible: stop ModelStudio-compatible chat requests before sending system/tool-only payloads that have no usable user or assistant turn. ([#86177](https://github.com/openclaw/openclaw/issues/86177)) Thanks [@TurboTheTurtle](https://github.com/TurboTheTurtle).
- Gateway/plugins: reuse plugin package realpath checks while building installed plugin indexes so startup avoids repeated filesystem resolution work.
- Kilo Gateway: send string `stop` sequences as arrays so Kilo accepts OpenAI-compatible chat completions. ([#86461](https://github.com/openclaw/openclaw/issues/86461)) Thanks [@SebTardif](https://github.com/SebTardif).
- Discord/OpenAI voice: accept leading fuzzy wake-name transcripts such as "Monty" or "Moti" for a Molty agent while keeping ambient speech gated.
- Media understanding: convert HEIC and HEIF images to JPEG before image description providers run so iPhone photos work in direct and configured image-description flows. ([#86037](https://github.com/openclaw/openclaw/issues/86037))
- Agents: release embedded-attempt session locks from outer teardown so post-prompt exceptions cannot wedge later requests behind `SessionWriteLockTimeoutError`. Fixes [#86014](https://github.com/openclaw/openclaw/issues/86014). Thanks [@openperf](https://github.com/openperf).
- Discord/OpenAI voice: rotate Realtime sessions at provider max duration without logging the expected session-expiry event as an error. - Sessions: skip metadata-only entries during QMD-slugified session lookup so one incomplete row does not block transcript hit resolution. ([#​86327](https://github.com/openclaw/openclaw/issues/86327)) Thanks [@​abnershang](https://github.com/abnershang). - Agents/media: derive bundled plugin local-media trust from plugin tool metadata instead of importing the full plugin registry on subscription paths. ([#​84409](https://github.com/openclaw/openclaw/issues/84409)) Thanks [@​samzong](https://github.com/samzong). - Image tool: keep config-backed custom-provider API keys usable for auto-discovered vision models, including deferred image-tool execution without env keys or auth profiles. ([#​85733](https://github.com/openclaw/openclaw/issues/85733)) - Memory/local embeddings: run local GGUF embeddings in an isolated worker sidecar and degrade to configured fallback or keyword search on worker failure so native embedding crashes do not take down the Gateway. ([#​85348](https://github.com/openclaw/openclaw/issues/85348)) Thanks [@​osolmaz](https://github.com/osolmaz). - Gateway: clear the runtime config snapshot before `SIGUSR1` in-process restarts so config changes survive the next gateway loop. ([#​86388](https://github.com/openclaw/openclaw/issues/86388)) Thanks [@​XuZehan-iCenter](https://github.com/XuZehan-iCenter). - Models: show OAuth delegation markers as configured `models.json` auth while keeping runtime route usability checks strict. ([#​86378](https://github.com/openclaw/openclaw/issues/86378)) Thanks [@​rohitjavvadi](https://github.com/rohitjavvadi). - Cron: seed active scheduled and manual cron task rows with a progress summary so status surfaces do not look blank while jobs run. ([#​86313](https://github.com/openclaw/openclaw/issues/86313)) Thanks [@​ferminquant](https://github.com/ferminquant). 
- Cron: preserve unsupported persisted cron payload rows during routine store writes while keeping those rows non-runnable. Fixes [#​84922](https://github.com/openclaw/openclaw/issues/84922). ([#​86415](https://github.com/openclaw/openclaw/issues/86415)) Thanks [@​IWhatsskill](https://github.com/IWhatsskill). - Updater: exclude prerelease git tags from stable channel resolution so source updates do not check out newer alpha/rc/preview/canary tags. ([#​86260](https://github.com/openclaw/openclaw/issues/86260)) Thanks [@​stevenepalmer](https://github.com/stevenepalmer). - Security/Audit: flag webhook `hooks.token` reuse of active Gateway password auth in `openclaw security audit` while keeping password-mode startup compatibility. ([#​84338](https://github.com/openclaw/openclaw/issues/84338)) Thanks [@​coygeek](https://github.com/coygeek). - QQBot: derive the outbound reply watchdog from configured agent and provider timeouts so slow local model replies are not cut off at five minutes. Fixes [#​85267](https://github.com/openclaw/openclaw/issues/85267). ([#​85271](https://github.com/openclaw/openclaw/issues/85271)) Thanks [@​SymbolStar](https://github.com/SymbolStar). - Agents/heartbeat: stop heartbeat turns after the first valid `heartbeat_respond` so repeated response loops do not burn tokens. ([#​86357](https://github.com/openclaw/openclaw/issues/86357)) Thanks [@​udaymanish6](https://github.com/udaymanish6). - Tasks: keep retained lost tasks out of default status health counts, explain their cleanup window during maintenance, and prune lost task records after 24 hours instead of the general 7-day terminal retention. - Memory-core: keep REM dreaming focused on live light-staged memories and mark staged entries as considered so old recall history no longer dominates fresh candidates. ([#​86302](https://github.com/openclaw/openclaw/issues/86302)) Thanks [@​SebTardif](https://github.com/SebTardif). 
- Memory: abort sync instead of downgrading an existing semantic vector index to FTS-only when the configured embedding provider is temporarily unavailable. ([#​85704](https://github.com/openclaw/openclaw/issues/85704)) Thanks [@​yaaboo-gif](https://github.com/yaaboo-gif). - Telegram: propagate forum topic names through the account-scoped topic cache for native command context and topic create/edit actions. ([#​86299](https://github.com/openclaw/openclaw/issues/86299)) Thanks [@​SebTardif](https://github.com/SebTardif). - Slack: keep downloaded read-only files out of reply media so Slack file reads do not echo files back to the conversation. ([#​86318](https://github.com/openclaw/openclaw/issues/86318)) Thanks [@​neeravmakwana](https://github.com/neeravmakwana). - Cron: accept leading-plus relative durations such as `+5m` for one-shot `--at` schedules. ([#​86341](https://github.com/openclaw/openclaw/issues/86341)) Thanks [@​mushuiyu886](https://github.com/mushuiyu886). - Agents/media: preserve async-started media tool metadata so background generation starts no longer surface generic incomplete-turn warnings while replay stays unsafe. ([#​85933](https://github.com/openclaw/openclaw/issues/85933)) Thanks [@​fuller-stack-dev](https://github.com/fuller-stack-dev). - Docker E2E: dedupe scheduler lane resources so npm/service package lanes are not over-counted and serialized unnecessarily. - QA/diagnostics: add a collector-backed OpenTelemetry smoke lane, make the OTLP payload leak check scenario-aware, and keep source QA builds from failing on optional dependency imports resolved through pnpm's temp module path. - Crabbox: bootstrap Git metadata for sparse remote changed gates so raw synced workspaces can run `pnpm check:changed` from the intended diff. - xAI/LM Studio: avoid buffering ordinary bracketed or `final` prose until stream completion while watching for plain-text tool-call fallbacks. 
- Doctor: warn and continue when the cron job store exists but cannot be read so later health checks still run. Fixes [#​86102](https://github.com/openclaw/openclaw/issues/86102). ([#​86384](https://github.com/openclaw/openclaw/issues/86384)) Thanks [@​1052326311](https://github.com/1052326311). - Discord: suppress a bot's previous reply body and referenced media from prompt context when a user replies to that bot message, while keeping reply metadata for routing. ([#​86238](https://github.com/openclaw/openclaw/issues/86238)) Thanks [@​fuller-stack-dev](https://github.com/fuller-stack-dev). - Discord: restore bare numeric channel IDs for outbound message-tool sends while keeping explicit DM targets unambiguous. ([#​86571](https://github.com/openclaw/openclaw/issues/86571)) Thanks [@​joshavant](https://github.com/joshavant). - Docker E2E: avoid rebuilding the Control UI twice while preparing the shared OpenClaw package tarball for package-backed scenario runs. - Tests: avoid rebuilding the Control UI twice during the installer Docker smoke now that `pnpm build` includes `ui:build`. - Tests: give QA config mutation RPCs enough native Windows budget to finish gateway config writes and restart settle after hot scenario runs. - Tests: keep the gateway restart-inflight QA scenario focused on restart recovery on native Windows by allowing expected embedded prompt handoff errors and using the Windows-safe timeout budget. - QA-Lab: make the synthetic OpenAI provider honor generic `reply exactly:` directives after required kickoff reads so restart-recovery scenarios do not fall through to generic repo-summary prose. - Gateway: abort active `agent` RPC runs during forced restart shutdown so stale in-process turns cannot keep writing a session after the Gateway lifecycle restarts. - Crabbox: sync clean sparse worktrees through a temporary full checkout even when reusing an existing lease so tracked build-time files are not omitted. 
- Build: route `scripts/ui.js` through the shared pnpm runner and keep Control UI chunking helpers in sparse-included source so native Windows Corepack builds can produce `dist/control-ui`. - Tests: give the memory fallback QA scenario enough turn budget to exercise native Windows gateway runs instead of failing on the client timeout while the mock agent is still dispatching. - Tests: collect QA gateway CPU/RSS metrics on native Windows and give the channel baseline enough turn budget to report slow gateway runs instead of timing out before proof. - Install/update: bypass npm `min-release-age` policies with `--min-release-age=0` instead of `--before` so hosted installers keep working on npm versions that reject the combined config. ([#​84749](https://github.com/openclaw/openclaw/issues/84749)) Thanks [@​TeodoroRodrigo](https://github.com/TeodoroRodrigo). - Diagnostics: reclaim wedged session lanes when stale active-run bookkeeping blocks queued work despite no forward progress. Fixes [#​85639](https://github.com/openclaw/openclaw/issues/85639). Thanks [@​openperf](https://github.com/openperf). - WebChat: keep message-tool replies visible in the chat while still summarizing internal tool results for the model. Fixes [#​86347](https://github.com/openclaw/openclaw/issues/86347). Thanks [@​shakkernerd](https://github.com/shakkernerd). - Gateway/perf: fail startup benchmark samples when the Gateway process exits before benchmark teardown, including signal deaths after readiness probes. - Gateway/perf: fail restart benchmark samples when the Gateway exits before benchmark teardown, including clean exits and signal deaths after successful restart probes. - Agents/tests: keep model catalog visibility on static selection helpers so catalog visibility checks avoid the broad model-selection barrel import. - Agents/commitments: serialize commitment store load-modify-save writes so concurrent heartbeat and CLI updates no longer lose dismissal, sent, or attempt state. 
([#​81153](https://github.com/openclaw/openclaw/issues/81153)) Thanks [@​ai-hpc](https://github.com/ai-hpc). - xAI/LM Studio: promote plain-text tool-call fallbacks into structured tool calls and strip leaked internal tool syntax before user-facing delivery. ([#​86222](https://github.com/openclaw/openclaw/issues/86222)) Thanks [@​fuller-stack-dev](https://github.com/fuller-stack-dev). - CLI: suppress benign self-update version-skew warnings during package post-update finalization. - Gateway/perf: tighten restart and startup benchmark failure handling so long profiling runs, failed probes, and fresh Linux runners no longer produce false passing or `n/a` results. - Checks: keep intentional Knip unused-file findings optional so full CI and sparse proof workspaces stay aligned. - Docker: restore writable `~/.config` in runtime images. Fixes [#​85968](https://github.com/openclaw/openclaw/issues/85968). Thanks [@​hkoessler](https://github.com/hkoessler) and [@​Bartok9](https://github.com/Bartok9). - Plugin SDK: keep legacy root diagnostic subscriptions connected when built plugin SDK aliases resolve diagnostic helpers through a separate module graph. - Diagnostics: export alertable OTel and Prometheus signals for blocked tools, model failover, stale sessions, liveness warnings, oversized payloads, and webhook ingress while fixing shared OTLP endpoints with query strings. - Tests: normalize macOS canonical temp paths in exec allowlists, fs-safe trash assertions, installed plugin matching, Telegram topic-name stores, and built ACPX MCP server expectations so native macOS proof runners cover the intended behavior. - Codex/app-server: preserve message-tool-only source reply delivery mode on active runs so sub-agent completion wakeups can steer the active Codex turn instead of being rejected. ([#​86287](https://github.com/openclaw/openclaw/issues/86287)) Thanks [@​ferminquant](https://github.com/ferminquant). 
- Tests: sample the Windows kitchen-sink RPC gateway directly and serialize RSS probes so native runs keep the memory guard active. - Tests: normalize bundled plugin lifecycle probe paths and state-root lookup so native Windows release sweeps accept valid packaged plugin installs. - Agents/Claude CLI: route live native Bash permission requests through OpenClaw exec policy so Claude turns no longer stall on `control_request`, and document that OpenClaw exec policy is authoritative. Fixes [#​80819](https://github.com/openclaw/openclaw/issues/80819). ([#​86330](https://github.com/openclaw/openclaw/issues/86330), from [#​81971](https://github.com/openclaw/openclaw/issues/81971)) Thanks [@​guthirry](https://github.com/guthirry) and [@​sallyom](https://github.com/sallyom). - Security audit: warn when YOLO OpenClaw exec policy overrides a restrictive raw Claude `--permission-mode` for managed live sessions. ([#​86557](https://github.com/openclaw/openclaw/issues/86557)) Thanks [@​sallyom](https://github.com/sallyom). - Config: keep benign legacy metadata write anomalies out of default doctor and config command output while preserving explicit anomaly logging for diagnostics. - Codex: log when implicit app-server `never` approvals are promoted for OpenClaw tool policy, including whether the trigger was a `before_tool_call` hook or trusted tool policy. - Codex harness: make subscription usage-limit errors without reset times explain that OpenClaw cannot determine the reset and point users to wait until Codex is available, use another Codex account, or switch to another configured model/provider. Thanks [@​amknight](https://github.com/amknight). - Google Vertex: support production ADC modes such as Workload Identity Federation, service-account credentials, and metadata-server ADC for the native Vertex transport. ([#​83971](https://github.com/openclaw/openclaw/issues/83971)) Thanks [@​damianFelixPago](https://github.com/damianFelixPago). 
- Telegram: route normal `[telegram][diag]` polling diagnostics through `runtime.log` while keeping non-diag warnings and persistence failures on `runtime.error`, so healthy polling startup no longer looks like an error. Fixes [#82957](https://github.com/openclaw/openclaw/issues/82957). ([#82958](https://github.com/openclaw/openclaw/issues/82958)) Thanks [@galiniliev](https://github.com/galiniliev).
- Providers/Ollama: strip inline Kimi cloud reasoning prefixes from streamed and final visible replies while keeping ordinary Kimi answers append-only. ([#86286](https://github.com/openclaw/openclaw/issues/86286)) Thanks [@jason-allen-oneal](https://github.com/jason-allen-oneal).
- Gateway: require Talk secret authority before setup-code handoff can include Talk secrets. ([#85690](https://github.com/openclaw/openclaw/issues/85690)) Thanks [@ngutman](https://github.com/ngutman).
- Agents: keep fallback error reporting scoped to the active model candidate so stale prior-provider quota/auth text is not reported for later fallback attempts. ([#86134](https://github.com/openclaw/openclaw/issues/86134)) Thanks [@zhangguiping-xydt](https://github.com/zhangguiping-xydt).
- iMessage: dedupe watcher startup when `channels.imessage.accounts` lists both `default` and a named account that point at the same local Messages source, so the gateway no longer spawns two `imsg rpc` processes or doubles inbound replies. The dedupe is scoped to watcher startup, leaving duplicate accounts addressable for outbound sends, status, and capability listings, and `openclaw doctor` flags the redundant account with a rebinding hint. Fixes [#65141](https://github.com/openclaw/openclaw/issues/65141). ([#86705](https://github.com/openclaw/openclaw/issues/86705)) Thanks [@swang430](https://github.com/swang430).

</details>

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).

Reviewed-on: https://git.erwanleboucher.dev/eleboucher/homelab/pulls/682
* fix(agents): skip fallback for session coordination errors

  Preserve provider fallback metadata when session coordination errors are nested under provider failures.

  Co-authored-by: luyao618 <364939526@qq.com>
  (cherry picked from commit 6a5a135)

* fix(agents): tolerate in-process session writes during prompt release (openclaw#84250)

  Merged via squash.

  Prepared head SHA: 33f88fe
  Co-authored-by: tianxiaochannel-oss88 <272340815+tianxiaochannel-oss88@users.noreply.github.com>
  Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com>
  Reviewed-by: @jalehman
  (cherry picked from commit 1b77145)

* fix(agents): bound embedded compaction write locks

  Fixes the embedded attempt session write-lock watchdog so the fallback max hold time follows the resolved compaction timeout plus the existing lock grace window, instead of inheriting the full run timeout.

  Adds regression coverage for the helper and settled-compaction lock lifecycle, plus a changelog entry thanking @luoyanglang.

  Verification:
  - `pnpm test src/agents/session-write-lock.test.ts src/agents/pi-embedded-runner/run/attempt.test.ts src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts`
  - `pnpm check:changed` via Blacksmith Testbox `tbx_01ks8b6vn8se5cg1dfn3te3g47` / https://github.com/openclaw/openclaw/actions/runs/26301988670
  - Autoreview clean: `/Users/steipete/Projects/agent-scripts/skills/autoreview/scripts/autoreview --mode branch --base origin/main`
  - PR CI green on `79e8c5f1a637981d263c0268bf5666967ff4e778`: https://github.com/openclaw/openclaw/actions/runs/26302152844 and https://github.com/openclaw/openclaw/actions/runs/26302152798

  Co-authored-by: luoyanglang <hanwanlonga@gmail.com>
  (cherry picked from commit 46de078)

* fix(session-lock): enforce maxHoldMs in shouldReclaim during lock acquisition (openclaw#85764)

  * fix(session-lock): enforce maxHoldMs in shouldReclaim during lock acquisition

    - Adds optional maxHoldMs parameter to inspectLockPayload
    - Inspect now marks locks as stale when held longer than maxHoldMs
    - Passes maxHoldMs through inspectLockPayloadForSession
    - acquireSessionWriteLock's shouldReclaim callback now passes maxHoldMs

    This ensures that when a live process holds a lock for longer than maxHoldMs (default 5min), other processes can reclaim it during acquisition, matching the watchdog's existing enforcement.

    Previously shouldReclaim only used staleMs (30min default), meaning a lock held for 10+ minutes by a live PID would never be reclaimable, causing 60s timeout failures and gateway freezes.

    Closes openclaw#85762

  * fix(session-lock): add dead-PID fast-path before retry loop

    Adds a fast-path check at the top of acquireSessionWriteLock: if the lock file's owner PID is dead, remove it immediately before entering the retry loop. This saves up to timeoutMs (60s) of futile waiting when the previous lock holder has died.

    The shouldReclaim callback already handles this case, but only iteratively through the retry loop. The fast-path eliminates that unnecessary delay.

  * fix(session-lock): enforce max hold during acquisition
  * fix(session-lock): revalidate max hold safely
  * fix(session-lock): honor holder max-hold policy
  * fix(session-lock): keep cleanup from reclaiming live holders
  * fix(session-lock): remove stale locks only when unchanged
  * fix(session-lock): skip self-held max-hold reclaim
  * fix(ci): refresh gateway protocol checks

  ---------

  Co-authored-by: njuboy11 <njuboy11@users.noreply.github.com>
  Co-authored-by: Peter Steinberger <steipete@gmail.com>
  (cherry picked from commit a1eb765)

* fix(embedded-runner): preserve provider errors on cleanup takeover (openclaw#84321)

  Summary:
  - The PR preserves provider-facing embedded-runner prompt errors when cleanup detects session takeover, keeps the takeover signal fatal for fallback, and adds focused regressions.
  - PR surface: Source +52, Tests +92. Total +144 across 5 files.
  - Reproducibility: yes. Source inspection shows current main can let cleanup takeover replace a prior prompt/p ... rror and can normalize a provider-looking takeover wrapper before fallback sees it as coordination failure.

  Automerge notes:
  - PR branch already contained follow-up commit before automerge: fix(embedded-runner): preserve takeover during fallback
  - PR branch already contained follow-up commit before automerge: fix(clawsweeper): address review for automerge-openclaw-openclaw-8405…

  Validation:
  - ClawSweeper review passed for head 050c779.
  - Required merge gates passed before the squash merge.

  Prepared head SHA: 050c779
  Review: openclaw#84321 (comment)

  Co-authored-by: abnershang <abner.shang@gmail.com>
  Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
  Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com>
  Approved-by: takhoffman
  Co-authored-by: takhoffman <781889+takhoffman@users.noreply.github.com>
  (cherry picked from commit 7fbca96)

* fix(agents): release embedded-attempt session lock on every exit path (openclaw#86427)

  * fix(agents): release embedded-attempt session lock on every exit path

    The embedded run controller acquires its session write lock eagerly at creation and released it only inside the post-run cleanup block. An exception thrown in post-prompt processing skipped that block, so the lock leaked to the live gateway process until the watchdog reclaimed it and later requests to the session failed with SessionWriteLockTimeoutError.

    Add an idempotent dispose() to the lock controller and call it from the run's outer finally so the eagerly-held lock is released on every exit path. Normal/aborted/timed-out runs still hand the lock to acquireForCleanup first, so dispose() is a no-op then (no double release).

    Fixes openclaw#86014

  * fix: keep session lock teardown comment lean
  * docs(changelog): note embedded session lock fix

  ---------

  Co-authored-by: Peter Steinberger <steipete@gmail.com>
  (cherry picked from commit 32ddfc2)

* fix(agents): fence yield abort lock release

  (cherry picked from commit 0fe7479)

* fix(agents): memoize session lock owner args

  Memoize owner process argv lookups per PID during `cleanStaleLockFiles`, and yield between lock entries so startup cleanup does not monopolize the event loop while inspecting many session locks.

  This keeps lock classification semantics unchanged while avoiding repeated synchronous process-args reads for lock clusters owned by the same PID, especially the Windows PowerShell path.

  Fixes openclaw#86509.

  Verification:
  - `git diff --check origin/main...HEAD`
  - focused TSX harness against the current-main merge result: `session-lock memo regression harness passed`

  Thanks @openperf.

  Co-authored-by: openperf <16864032@qq.com>
  (cherry picked from commit c430fcd)

* fix(diagnostics): recover orphaned session activity

  Recover idle queued sessions whose diagnostic activity retained stale ownerless model or tool calls by classifying them as recoverable session.stuck after the usual recovery gates.

  Yield the event loop before stale session-lock process inspection so sync process lookup cannot monopolize lock contention paths.

  Docs now describe the widened session.stuck telemetry contract for recoverable stale bookkeeping, including ownerless activity.

  Thanks @samuelsoaress. Refs openclaw#84903.

  Co-authored-by: samuelsoaress <samuelsoares177778@gmail.com>
  (cherry picked from commit 286964c)

* [FORK][openclaw#86584] gate owned-write publish on pre-append fingerprint (fixes openclaw#86572)

  Carries unmerged upstream PR openclaw#86584 (HEAD d79a3b4) onto the boon 5.18 base as the same-lane EmbeddedAttemptSessionTakeoverError fence fix for long cron turns. Fails closed: an external mutation before pi's append fails the trust gate and still trips the fence (verified by the PR's 303-line test suite incl. the mixed-interleave negative test).

  Backfills base symbols openclaw#86584 assumes (introduced upstream between 5.18 and the PR base, not carried by the 9 merged race-fix picks):
  - session-lock.ts: MAX_BENIGN_SESSION_FENCE_{ADVANCE,REWRITE,REWRITE_RESULT}_BYTES, MAX_SAFE_FILE_OFFSET, TRANSCRIPT_ONLY_OPENCLAW_ASSISTANT_MODELS, SessionFileFenceSnapshot type, fenceSnapshot state var, ActiveWriteLockState type + activeWriteLock store fix (reuse nested writes via {active:true}), node:util + string-normalization imports.
  - transcript-append.ts: wrap appendSessionTranscriptMessage in runWithOwnedSessionTranscriptWriteLock so low-level appends acquire the owned-context lock.
  - test import fixes (appendSessionTranscriptMessage, withOwned/bindOwned, __testing).

  Drop when upstream merges openclaw#86584.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* [FORK][openclaw#86584] wire owned-transcript-write context + typecheck cleanup

  CRITICAL: wrap promptActiveSession in withOwnedSessionTranscriptWrites and bind onBlockReply/onBlockReplyFlush to the owned context in attempt.ts. Without this, pi's own transcript appends during a prompt are NOT recorded as owned, so the fence trips on them (the exact takeover the backport is meant to prevent). This wiring is an intermediate-base feature (between 5.18 and openclaw#84250's base) the merged picks didn't carry. Tests passed before only because they set the context manually.

  Also: add releaseHeldLockForAbort to the controller type; drop incidental non-fence suppressAssistantErrorPersistence passes; remove dead async benign-rewrite cluster (sessionFence{Advance,Rewrite}IsBenign + readAppendedSessionFileText + lineMatchesLinearTranscriptMigration + helpers). Our openclaw#84250-based assertSessionFileFence uses the sync owned-write path, so the async benign-detection variants are unreachable.

  tsgo core: 0 errors. 384 tests pass.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* [FORK][openclaw#86584] address codex review: prefix-validate benign advance + preserve provider error

  Finding 2 (masking gap, P2): sessionFenceAdvanceIsBenignSync only validated the APPENDED bytes, so a writer that rewrote the existing prefix AND appended a benign delivery-mirror/gateway-injected line could be laundered as an owned advance, masking a genuine external takeover (silent message loss). Now fail closed unless the current prefix is byte-identical to the trusted readSessionFileFenceSnapshot text (readSessionFilePrefixSync); absent snapshot text => not benign.

  Finding 1 (provider-error masking, P2): wrappedStreamFn's finally let a reacquireAfterPrompt() takeover error mask the original provider error when the stream itself threw. Now only surface the reacquire error when the stream succeeded; otherwise preserve the original failure.

  tsgo core: 0 errors. 384 tests pass (benign-advance acceptance + external-mutation rejection both green).

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(release): 2026.5.18-boon.1 - session-takeover hardening (boon fleet build)

  Version bump + CHANGELOG for the fork build. Also fixes a backport test-import gap: attempt.test.ts referenced `attemptTesting` (the __testing export) without importing it. Full project typecheck (tsgo -b tsconfig.projects.json): 0 errors.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(ci): no-unsafe-finally in wrappedStreamFn + drop collateral protocol/test churn

  - wrappedStreamFn: restructure provider-error-preservation without a throw inside finally (oxlint no-unsafe-finally). Same semantics: always reacquire; prefer the original stream error over a reacquire takeover error; surface reacquire error only when the stream succeeded.
  - Revert src/gateway/server-methods/agent.test.ts + GatewayModels.swift to the 5.18 baseline: the openclaw#85764 cherry-pick conflict-resolution had pulled in openclaw#85256-era internal-session-effect tests + protocol fields whose implementation isn't in this backport, breaking checks-node-agentic-gateway-methods + checks-fast-bundled-protocol.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix: remove vestigial onAssistantErrorMessagePersisted option decls

  Address cubic P2 review (PR #2): the option was declared on the guard and guard-wrapper option types but never forwarded or invoked, so any provided callback was silently ignored. The companion error-suppression feature (suppressAssistantErrorPersistence + the agent-runner/followup caller chain) is deliberately scoped OUT of this 5.18 backport, so the decls were dead plumbing left over from a cherry-pick. Remove them to keep the option surface honest; the load-bearing beforeMessagePersist fence checkpoint (openclaw#86572) is retained.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Yao <364939526@qq.com>
Co-authored-by: xiaotian <tianxiaochannel@gmail.com>
Co-authored-by: 狼哥 <hanwanlonga@gmail.com>
Co-authored-by: njuboy <njuboy11@gmail.com>
Co-authored-by: njuboy11 <njuboy11@users.noreply.github.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>
Co-authored-by: clawsweeper[bot] <274271284+clawsweeper[bot]@users.noreply.github.com>
Co-authored-by: abnershang <abner.shang@gmail.com>
Co-authored-by: takhoffman <781889+takhoffman@users.noreply.github.com>
Co-authored-by: Chunyue Wang <80630709+openperf@users.noreply.github.com>
Co-authored-by: openperf <16864032@qq.com>
Co-authored-by: Samuel Soares da Silva <samuelsoares177778@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
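The `wrappedStreamFn` restructuring described in the commit message above (always reacquire, prefer the original stream error, surface the reacquire error only when the stream succeeded, and never throw from inside `finally`) can be sketched roughly as follows. This is an illustration under stated assumptions, not the fork's code: `streamThenReacquire`, `runStream`, and `reacquire` are hypothetical names introduced here.

```typescript
// Hypothetical helper: runs a stream, then always reacquires the lock.
// No `throw` appears inside a `finally`, which is what no-unsafe-finally flags.
async function streamThenReacquire<T>(
  runStream: () => Promise<T>,
  reacquire: () => Promise<void>,
): Promise<T> {
  let result: T;
  try {
    result = await runStream();
  } catch (streamError) {
    // Stream failed: still reacquire, but ignore a reacquire failure so the
    // original stream error is the one the caller sees.
    try {
      await reacquire();
    } catch {
      // Deliberately dropped in favor of streamError.
    }
    throw streamError;
  }
  // Stream succeeded: a reacquire failure is the only error, so it propagates.
  await reacquire();
  return result;
}
```

Splitting the success and failure branches keeps the ordering explicit: the reacquire step runs exactly once on either branch, and which error wins is decided by control flow instead of by an overriding `throw` in `finally`.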
…openclaw#86427)

* fix(agents): release embedded-attempt session lock on every exit path

  The embedded run controller acquires its session write lock eagerly at creation and released it only inside the post-run cleanup block. An exception thrown in post-prompt processing skipped that block, so the lock leaked to the live gateway process until the watchdog reclaimed it and later requests to the session failed with SessionWriteLockTimeoutError.

  Add an idempotent dispose() to the lock controller and call it from the run's outer finally so the eagerly-held lock is released on every exit path. Normal/aborted/timed-out runs still hand the lock to acquireForCleanup first, so dispose() is a no-op then (no double release).

  Fixes openclaw#86014

* fix: keep session lock teardown comment lean
* docs(changelog): note embedded session lock fix

---------

Co-authored-by: Peter Steinberger <steipete@gmail.com>
Summary
- `.jsonl` file. If the run times out or a tool errors, the lock can be left held by the live gateway process: subsequent requests to that session block for the acquire timeout and then fail with `SessionWriteLockTimeoutError` (`OPENCLAW_SESSION_WRITE_LOCK_TIMEOUT`), and the retained session graph leaks memory. (SessionWriteLockTimeoutError: gateway never releases session file lock after embedded run timeout #86014)
- `createEmbeddedAttemptSessionLockController` acquires its coarse session lock (`heldLock`) eagerly at creation, and the run releases it only inside the post-run cleanup block (`acquireForCleanup` → `cleanupEmbeddedAttemptResources`, which itself releases in a `finally`). That cleanup block is ordinary try-body code. The run's outer `finally` did not release the lock; it only emitted diagnostics and restored skill env. Normal timeout/abort/error paths funnel into `aborted`/`timedOut`/`promptError` flags and do reach the cleanup block, but an exception thrown in the post-prompt processing between the prompt teardown and the cleanup block (e.g. a tool/result step or the trajectory flush erroring) escapes straight to the outer `finally`, skipping the release. The lock then stays held by the live process (its PID is alive, so it is not stale) until the long compaction-scaled watchdog window elapses or a retry re-acquires it, effectively wedging the session.
  (The reporter's "`maxHoldMs` grows unboundedly" is a fixed value: the embedded-attempt lock's `maxHoldMs` is the compaction window + grace = `1020000` ms, matching the reported lock file; the observed `createdAt` refresh is the lock being re-acquired by retries.)
- 8a060b2904d4), which replaced the single per-run `sessionLock` with this eagerly-acquired controller (release-before-I/O + reacquire) and added the post-prompt processing that can throw; the controller releases the lock only through the post-run cleanup hand-off, so an exception on the post-prompt path leaves it held.
- Add an idempotent `dispose()` that releases `heldLock` if it is still retained, and call it from the run's outer `finally` so the lock is released on every exit path. On normal/aborted/timed-out runs the cleanup block still hands the lock to `acquireForCleanup` first, so `dispose()` is a no-op then (no double release); only the exception-skips-cleanup path now actually releases via `dispose()`.
- `src/agents/pi-embedded-runner/run/attempt.session-lock.ts`: add `dispose(): Promise<void>` to `EmbeddedAttemptSessionLockController` (releases the retained `heldLock`, idempotent).
- `src/agents/pi-embedded-runner/run/attempt.ts`: capture a `releaseRetainedSessionLock` closure once the controller exists and invoke it in the run's outer `finally` (guarded so a release error is logged rather than masking the run's original error).
- `src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts`: regression coverage.
- `acquireForCleanup` → `cleanupEmbeddedAttemptResources`) and its flush behavior are unchanged; `dispose()` is a safety net.
- `maxHoldMs` derivation are unchanged. The separate "an operation inside `withSessionWriteLock` never returns" variant (its own `finally` cannot run) remains covered only by the watchdog and is out of scope.

Reproduction
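The leak and the fix described in the Summary can be sketched in a few lines of TypeScript. This is a minimal illustration, not the code in `attempt.session-lock.ts` and not the PR's harness: `FakeLock`, `createController`, and `runAttempt` are hypothetical stand-ins, and only the names `dispose()`, `acquireForCleanup`, and `heldLock` come from the description above.

```typescript
// Hypothetical stand-in for a held session write lock; it counts releases
// so that a missing or double release is visible.
class FakeLock {
  releaseCount = 0;
  async release(): Promise<void> {
    this.releaseCount += 1;
  }
}

// Hypothetical controller: holds the lock eagerly, then either hands it to
// cleanup or releases it through dispose().
function createController(lock: FakeLock) {
  let heldLock: FakeLock | undefined = lock;
  return {
    // Transfers ownership to the caller; the controller retains nothing.
    acquireForCleanup(): FakeLock | undefined {
      const handedOff = heldLock;
      heldLock = undefined;
      return handedOff;
    },
    // Idempotent: releases only if the lock is still retained.
    async dispose(): Promise<void> {
      const retained = heldLock;
      heldLock = undefined;
      await retained?.release();
    },
  };
}

// Hypothetical run: post-prompt processing may throw before cleanup runs.
async function runAttempt(lock: FakeLock, failPostPrompt: boolean): Promise<void> {
  const controller = createController(lock);
  try {
    if (failPostPrompt) {
      throw new Error("post-prompt processing failed");
    }
    // Normal path: cleanup takes the lock and releases it.
    const cleanupLock = controller.acquireForCleanup();
    await cleanupLock?.release();
  } finally {
    try {
      // Safety net: a no-op when cleanup already took the lock.
      await controller.dispose();
    } catch (releaseError) {
      // Logged instead of thrown so the run's original error is not masked.
      console.warn("session lock release failed", releaseError);
    }
  }
}
```

In this sketch both `runAttempt(lock, false)` and `runAttempt(lock, true)` leave `lock.releaseCount` at 1. Removing the `dispose()` call from the `finally` would leave the failing run's count at 0, which is the leaked-lock case the PR addresses.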
- `acquireForCleanup`.
- `finally`, which does not release the lock; the lock file persists held by the live process and the next request to that session fails with `SessionWriteLockTimeoutError`.
- `finally` calls the controller's `dispose()`, releasing the retained lock; the next request succeeds.

Real behavior proof
- `SessionWriteLockTimeoutError`.
- `acquireSessionWriteLock` and the real `createEmbeddedAttemptSessionLockController` against a temp session file, with the embedded-attempt `maxHoldMs` (`1020000`) the gateway uses. It runs an attempt that acquires the lock then throws during post-prompt processing, then issues a second `acquireSessionWriteLock` for the same session from the same live process, once without the fix (no `dispose()` in the finally) and once with it.
- `tsx /tmp/qmd86014/repro.mts`.
- `SessionWriteLockTimeoutError`; the reporter's error (`session file locked (timeout 60000ms): pid=7 …/<session>.jsonl.lock`) is reproduced verbatim (same shape, differing only in timeout/pid) and removed by the fix.
- `pnpm test src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts` (68 passed) including two new cases: `dispose()` releases the eagerly-held lock when cleanup is skipped (and is idempotent), and `dispose()` does not double-release a lock already handed to `acquireForCleanup`.
- `agentTurn` retries over time was not executed; the harness exercises the real lock primitive + real controller end to end and the unit tests cover the `dispose()` contract. The "stalled operation inside `withSessionWriteLock` never returns" variant is not addressed here (covered by the watchdog).

Risk / Mitigation
- `finally` could double-release a lock the cleanup block already released, or a release error could mask the run's original error.
- `dispose()` is idempotent: the happy/aborted/timed-out paths transfer the lock out via `acquireForCleanup` (so the controller no longer retains it and `dispose()` is a no-op), proven by a no-double-release test. The outer-`finally` call is wrapped so a release failure is logged rather than thrown, mirroring the existing teardown logging. No behavior change on paths that already released the lock.

Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Fixes #86014