fix: preserve subagent delivery after lock stalls by galiniliev · Pull Request #86540 · openclaw/openclaw

galiniliev · 2026-05-25T15:17:38Z

Summary

Preserves suspended subagent final-delivery text into durable completion fallback state before expiring the outbox payload, so expiry is not a hard loss of the child result.
Distinguishes expiry after durable fallback preservation from true payload discard.
Adds session write-lock timeout owner diagnostics including liveness, starttime, current starttime, age, and stale reasons.
Out of scope: redesigning subagent scheduling or session JSONL lock ownership.

Linked context

Which issue does this close?

Closes #86537
Closes #86538

Which issues, PRs, or discussions are related?

Related #86537
Related #86538

Was this requested by a maintainer or owner?

Local bug evidence under /mnt/c/OpenClaw/bugs/BUG-035-subagent-terminal-reconciliation-times-out-then-discards-delivery and /mnt/c/OpenClaw/bugs/BUG-036-session-write-lock-timeouts-block-main-and-subagent-lanes.

Real behavior proof (required for external PRs)

Behavior addressed: Expired suspended subagent final delivery now keeps final text durably instead of dropping the only payload, and write-lock timeouts expose actionable owner diagnostics.
Real environment tested: Linux WSL2 dev checkout plus Azure Crabbox remote validation. Before proof used Azure Crabbox silver-prawn / cbx_e06ef86ae274; after proof used Azure Crabbox violet-lobster / cbx_3eda7a42d715.
Exact steps or command run after this patch: node scripts/run-vitest.mjs src/agents/subagent-registry.test.ts src/agents/session-write-lock.test.ts; git diff --check; .agents/skills/autoreview/scripts/autoreview --mode local; Azure Crabbox before/after focused regression proof.
Evidence after fix (screenshot, recording, terminal capture, console output, redacted runtime log, linked artifact, or copied live output):

RUN  v4.1.7 /home/galini/GitHub/worktrees/bug-035-036-subagent-delivery-locks

Test Files  4 passed (4)
     Tests  144 passed (144)
  Start at  15:01:23
  Duration  16.11s

autoreview clean: no accepted/actionable findings reported
overall: patch is correct (0.84)

Azure Crabbox after-fix evidence log:

Crabbox OpenClaw evidence
timestamp=20260525T214730Z
name=pr-86540-after-fix
crabbox_id=violet-lobster
exit_status=0

tests=
build=0
types=0
check=none
proof_command_count=1

CRABBOX_PHASE:context
CRABBOX_PHASE:bootstrap
CRABBOX_PHASE:install
CRABBOX_PHASE:evidence_1
AFTER_FIX_HEAD_SHA=f9c6fc7de291905233e7d5ca65201f6d299e00ab
AFTER_FIX_TEST_EXIT=0
AFTER_FIX_REGRESSION_PASSED=1
CRABBOX_PHASE:diff
CRABBOX_PHASE:done

run summary sync=30.982s command=1m2.528s total=1m33.578s sync_skipped=false exit=0 sync_steps=ssh:611ms,manifest:635ms,preflight:0s,fingerprint:2ms,fingerprint_remote:70ms,git_seed:19.231s,manifest_write:223ms,prune:67ms,rsync:8.012s,finalize:2.039s command_phases=user-command:2.86s,context:32ms,bootstrap:16.583s,install:10.482s,evidence_1:32.558s,diff:6ms,done:2ms

Observed result after fix: The subagent registry regression expires a suspended delivery and observes discardReason: "expired-after-durable-fallback" plus completion.fallbackResultText; the session lock regression observes timeout owner text with pid=, alive=true, starttime=, and currentStarttime=. Azure Crabbox after-fix proof passed the focused regression set at head f9c6fc7de291905233e7d5ca65201f6d299e00ab.
What was not tested: Live gateway retry with the original private sessions was not rerun.
Proof limitations or environment constraints: Azure Crabbox and local proof cover the durable state and timeout diagnostic seams; the original session keys and lock paths are private and redacted.
Before evidence (optional but encouraged):

Redacted original symptom evidence:

subagent wait timed out; deferring terminal state until session reconciliation
subagent suspended delivery discarded {"reason":"expired","runId":"[redacted run id]","childSessionKey":"[redacted child session key]","requesterSessionKey":"[redacted requester session key]"}
SessionWriteLockTimeoutError: session file locked (timeout 60000ms): pid=[redacted pid] [redacted session lock path]

Azure Crabbox before-fix evidence log:

Crabbox OpenClaw evidence
timestamp=20260525T214615Z
name=pr-86540-before-regression
crabbox_id=silver-prawn
exit_status=0

tests=
build=0
types=0
check=none
proof_command_count=1

CRABBOX_PHASE:context
CRABBOX_PHASE:bootstrap
CRABBOX_PHASE:install
CRABBOX_PHASE:evidence_1
BEFORE_FIX_BASE_SHA=c4bce00727ec5f6a5b9c78e0ad34524f5dc1db54
BEFORE_FIX_TEST_EXIT=1
BEFORE_FIX_EXPECTED_REGRESSION=1
CRABBOX_PHASE:diff
CRABBOX_PHASE:done

run summary sync=1.005s command=16.214s total=18.025s sync_skipped=true exit=0 sync_steps=ssh:238ms,manifest:698ms,preflight:0s,fingerprint:2ms,fingerprint_remote:68ms command_phases=user-command:2.843s,context:41ms,bootstrap:646ms,install:390ms,evidence_1:12.284s,diff:4ms,done:3ms

Tests and validation

Which commands did you run?

node scripts/run-vitest.mjs src/agents/subagent-registry.test.ts src/agents/session-write-lock.test.ts
git diff --check
.agents/skills/autoreview/scripts/autoreview --mode local
Azure Crabbox before proof: silver-prawn / cbx_e06ef86ae274, base c4bce00727ec5f6a5b9c78e0ad34524f5dc1db54, observed BEFORE_FIX_TEST_EXIT=1 and BEFORE_FIX_EXPECTED_REGRESSION=1.
Azure Crabbox after proof: violet-lobster / cbx_3eda7a42d715, head f9c6fc7de291905233e7d5ca65201f6d299e00ab, observed AFTER_FIX_TEST_EXIT=0 and AFTER_FIX_REGRESSION_PASSED=1.

What regression coverage was added or updated?

src/agents/subagent-registry.test.ts now verifies durable fallback preservation on suspended final-delivery expiry.
src/agents/session-write-lock.test.ts now verifies timeout owner diagnostics include liveness/starttime details.

What failed before this fix, if known?

Redacted logs showed 25 suspended final deliveries discarded after repeated terminal reconciliation timeouts and 161 session write-lock timeout lines with only pid/path owner context.
Azure Crabbox before proof applied the focused regression tests to base c4bce00727ec5f6a5b9c78e0ad34524f5dc1db54 and observed the expected regression failure.

If no test was added, why not?

N/A

Risk checklist

Did user-visible behavior change? (Yes/No)

Yes

Did config, environment, or migration behavior change? (Yes/No)

No

Did security, auth, secrets, network, or tool execution behavior change? (Yes/No)

No

What is the highest-risk area?

Subagent final-delivery cleanup state.

How is that risk mitigated?

The delivery payload is still compacted on expiry, but final text is copied into the existing completion fallback field before the payload is removed; pressure-pruned deliveries remain unchanged.

Current review state

What is the next action?

Maintainer review and CI.

What is still waiting on author, maintainer, CI, or external proof?

Live replay of the original private sessions was not run. Required CI is currently blocked by an unrelated type error in extensions/qa-lab/src/live-transports/whatsapp/whatsapp-live.runtime.ts:872.

Which bot or reviewer comments were addressed?

Autoreview reported no accepted/actionable findings. Azure Crabbox before/after focused evidence was added to the PR description.

clawsweeper · 2026-05-25T15:25:09Z

Codex review: needs real behavior proof before merge. Reviewed May 29, 2026, 1:20 AM ET / 05:20 UTC.

Summary
The PR preserves suspended subagent final-delivery text in completion fallback state before expiring the delivery payload, adds an expired-after-durable-fallback discard reason, expands session write-lock timeout owner diagnostics, and updates focused agent tests.

PR surface: Source +46, Tests +15. Total +61 across 5 files.

Reproducibility: yes. from source and supplied before-proof: current main still clears completion fallback on suspended delivery discard and reports only pid-level lock timeout ownership. I did not run the repro locally in this read-only review.

Review metrics: 1 noteworthy metric.

Persisted delivery state: 1 discard reason added; completion fallback retained on expiry. This changes durable subagent run state that recovery and read paths may inspect, so maintainers should notice the session-state surface before merge.

Merge readiness
Overall: 🦐 gold shrimp
Proof: 🦪 silver shellfish
Patch quality: 🐚 platinum hermit
Result: blocked until real behavior proof from a real setup is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

[P1] Add redacted after-fix live gateway/session replay or runtime log for the contended delivery and lock diagnostic path.
[P1] Coordinate or rebase against the overlapping session-lock PRs before merge.
[P1] Get required CI green or record a maintainer waiver for unrelated failures.

Proof guidance:

[P1] Needs real behavior proof before merge: The PR body has WSL2/Azure Crabbox focused regression-test output and redacted before logs, but no after-fix live gateway/session replay or redacted runtime log; add proof with private details redacted and update the PR body to trigger re-review, or ask a maintainer to comment @clawsweeper re-review if it does not.

Risk before merge

[P1] After-fix proof exercises focused local/Azure Crabbox regression tests, but it does not include a redacted live gateway/session replay or after-fix runtime log from a contended session path.
[P2] The PR changes durable subagent completion and delivery state; if the fallback preservation is wrong, requester sessions could still miss or stale-read subagent completion text after expiry.
[P1] The same acquireSessionWriteLock surface overlaps open PRs fix(gateway): bound startup sidecar fanout #85399 and fix(agents): retry stale session lock reports #87342, and required checks at the reviewed head include type/package-boundary failures that need coordination, rebase, or maintainer waiver.

Maintainer options:

Require live proof and rebase before merge (recommended)
Ask for a redacted after-fix gateway/session replay or runtime log, then rebase against the overlapping lock-path work and rerun required checks before landing.
Accept focused proof with a maintainer waiver
Maintainers can decide the supplied failing/passing regression tests and redacted before logs are enough, but that should be recorded alongside any CI waiver before merge.
Pause behind the lock-path cluster
If one of the overlapping session-lock PRs becomes canonical, pause this PR until the durable delivery fix can be rebased or split cleanly.

Next step before merge

[P1] Protected maintainer label plus proof sufficiency, CI, and overlapping lock-path coordination are human merge decisions rather than a narrow automated repair.

Security
Cleared: No concrete security or supply-chain concern found; the diff touches core agent session state/tests and adds local lock diagnostics without dependency, workflow, credential, auth, network, or execution-surface changes.

Review details

Best possible solution:

Land a rebased branch after maintainers accept or improve the live proof, coordinate the overlapping session-lock work, and get required CI green or explicitly waived.

Do we have a high-confidence way to reproduce the issue?

Yes, from source and supplied before-proof: current main still clears completion fallback on suspended delivery discard and reports only pid-level lock timeout ownership. I did not run the repro locally in this read-only review.

Is this the best way to solve the issue?

Mostly yes: the code path is narrow and reuses existing completion fallback and lock-inspection state rather than adding a new scheduler or config surface. The unresolved part is merge readiness, especially live proof, CI, and lock-path coordination.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 13cb9f82777e.

Label changes

Label justifications:

P1: The PR targets loss of subagent completion delivery and session lock diagnostics in real agent workflows, which is an urgent agent/session-state regression class.
merge-risk: 🚨 session-state: Merging changes durable subagent completion fallback and discard state, so a bad merge could preserve or expose incorrect session completion state.
rating: 🦐 gold shrimp: Overall readiness is 🦐 gold shrimp; proof is 🦪 silver shellfish and patch quality is 🐚 platinum hermit.
status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body has WSL2/Azure Crabbox focused regression-test output and redacted before logs, but no after-fix live gateway/session replay or redacted runtime log; add proof with private details redacted and update the PR body to trigger re-review, or ask a maintainer to comment @clawsweeper re-review if it does not.

Evidence reviewed

PR surface:

Source +46, Tests +15. Total +61 across 5 files.

View PR surface stats

Area	Files	Added	Removed	Net
Source	3	52	6	+46
Tests	2	19	4	+15
Docs	0	0	0	0
Config	0	0	0	0
Generated	0	0	0	0
Other	0	0	0	0
Total	5	71	10	+61

What I checked:

Current main still drops fallback on suspended delivery expiry: Current main clears completion.fallbackResultText and fallbackCapturedAt when a suspended pending final delivery is discarded, so the central fix is not already implemented on main. (src/agents/subagent-registry.ts:810, 13cb9f82777e)
PR preserves the durable fallback before compaction: The PR copies frozenResultText or fallbackFrozenResultText into the existing completion fallback fields when expiry discards the delivery payload. (src/agents/subagent-registry.ts:707, f9c6fc7de291)
Current main timeout owner is only pid-level: Current main formats the session write-lock timeout owner as only pid=<pid> or unknown, matching the reported diagnostic gap. (src/agents/session-write-lock.ts:825, 13cb9f82777e)
PR adds lock owner diagnostics: The PR formats timeout owner details with pid, liveness, starttime/currentStarttime, createdAt, ageMs, and stale reasons from the existing lock inspection path. (src/agents/session-write-lock.ts:640, f9c6fc7de291)
Focused regression coverage exists: The PR updates focused tests to assert the new durable fallback state and expanded owner diagnostics. (src/agents/subagent-registry.test.ts:1858, f9c6fc7de291)
Protected review state and checks: Live API data shows the PR remains open with the protected maintainer label and status: 📣 needs proof; check-run data includes failed type/package-boundary jobs at the reviewed head. (f9c6fc7de291)

Likely related people:

Vincent Koc: Current shallow blame attributes the central subagent discard and session lock inspection implementations to the grafted current-main refactor commit, so this is the strongest local routing signal despite limited history depth. (role: recent area contributor; confidence: medium; commits: bf30361bc89d; files: src/agents/subagent-registry.ts, src/agents/session-write-lock.ts)
Ted Li: The latest current-main history for src/agents/subagent-registry.ts includes adjacent agent/subagent timeout work, making this a useful secondary routing signal for subagent completion behavior. (role: recent adjacent contributor; confidence: medium; commits: 8a60f39221db; files: src/agents/subagent-registry.ts)

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

clawsweeper · 2026-05-25T15:34:47Z

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?

The egg game starts only after the PR passes the real-behavior proof check.
Before that, no creature or rarity is rolled. The treat waits for real proof.
This is still just collectible flavor: proof affects review readiness, not creature quality.

omarshahine · 2026-05-25T18:09:38Z

Thanks for the PR. Before this can move forward, please add live proof from the affected surface, not just unit tests, mocked tests, or source inspection.

A useful proof update should include:

the exact build/SHA tested
the real environment used
the command, UI flow, channel flow, provider request, or other live path exercised
before/after symptom evidence where applicable
the observed result after the patch
any remaining proof gaps

Please redact secrets, tokens, phone numbers, and private message content from logs or screenshots.

martingarramon

Verified at f9c6fc7d.

Two distinct changes in this PR:

Lock timeout diagnostics (session-write-lock.ts): formatSessionLockTimeoutOwner assembles pid + liveness + starttime + age into the timeout error. The pinCurrentProcessStartTimeForTest addition in the test is needed to make starttime assertions deterministic. ✓

Durable fallback preservation (subagent-registry.ts): before expiring a suspended delivery, checks frozenResultText ?? fallbackFrozenResultText and persists it into completion.fallbackResultText if present. discardReason distinguishes "expired-after-durable-fallback" from "expired". The type in subagent-registry.types.ts is updated to match. ✓

Three CI failures need resolution before merge:

check-prod-types and check-additional-extension-package-boundary — likely the new fallbackResultText / fallbackCapturedAt fields on the completion type are not exported through the package boundary; verify src/agents/subagent-registry.types.ts re-exports reach the extension bundle entry.
check-test-types — if a new field is referenced in a test that the type declaration doesn't yet include.

LGTM pending CI.

galiniliev · 2026-05-25T23:18:53Z

Maintainer prep update for f9c6fc7de291905233e7d5ca65201f6d299e00ab.

Evidence collected

Azure Crabbox before proof: silver-prawn / cbx_e06ef86ae274, base c4bce00727ec5f6a5b9c78e0ad34524f5dc1db54, applied the PR regression tests to the base, observed BEFORE_FIX_TEST_EXIT=1 and BEFORE_FIX_EXPECTED_REGRESSION=1.
Azure Crabbox after proof: violet-lobster / cbx_3eda7a42d715, head f9c6fc7de291905233e7d5ca65201f6d299e00ab, ran pnpm test src/agents/subagent-registry.test.ts src/agents/session-write-lock.test.ts, observed AFTER_FIX_TEST_EXIT=0 and AFTER_FIX_REGRESSION_PASSED=1.
Local focused proof also passed via node scripts/run-vitest.mjs src/agents/subagent-registry.test.ts src/agents/session-write-lock.test.ts.
Autoreview was clean: no accepted/actionable findings, overall patch-correct confidence 0.84.

Review result
No blocking code findings from maintainer review. I do not see a feature drop or regression in the PR itself: the change preserves durable suspended subagent fallback text before compacting expired payloads, and the lock timeout diagnostic expansion stays scoped to owner/liveness/starttime/staleness metadata.

Current blocker
CI was rerun at https://github.com/openclaw/openclaw/actions/runs/26407542199 and still has three failed jobs: check-prod-types, check-test-types, and check-additional-extension-package-boundary. The shared failure is an unrelated type error in extensions/qa-lab/src/live-transports/whatsapp/whatsapp-live.runtime.ts:872 where rttMeasurement.source is inferred as string instead of the literal "request-to-observed-message". This PR only touches src/agents/session-write-lock* and src/agents/subagent-registry*.

I would not call this fully land-ready while required CI is red. The PR-specific code/test evidence is clean; remaining decision is whether to fix/override the unrelated qa-lab CI blocker and whether the targeted Azure before/after proof is enough versus requiring a live gateway/session replay.

fix: preserve subagent delivery after lock stalls

f9c6fc7

openclaw-barnacle Bot added agents Agent runtime and tooling size: S maintainer Maintainer-authored PR labels May 25, 2026

This was referenced May 25, 2026

[Bug]: Subagent terminal reconciliation can expire suspended final delivery #86537

Open

[Bug]: Session write-lock timeouts block subagent delivery lanes #86538

Open

galiniliev self-assigned this May 25, 2026

martingarramon reviewed May 25, 2026

View reviewed changes

Thinkscape mentioned this pull request May 26, 2026

[Bug]: sessions_yield subagent wake can compact parent branch at low context usage #86684

Open

clawsweeper Bot mentioned this pull request May 26, 2026

session file locked timeout mismatch: 60s wait vs 17min hold #86801

Closed

This comment was marked as spam.

Sign in to view

RomneyDa added dependency-guard-backfill and removed dependency-guard-backfill labels May 29, 2026

clawsweeper Bot added rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. and removed rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. labels May 29, 2026

anyech mentioned this pull request May 30, 2026

fix(discord): unblock inbound queue after stuck run #88251

Closed

clawsweeper Bot mentioned this pull request Jun 13, 2026

[Bug]: sessions_yield abort-settle timeout (2000ms) leaks session write lock on 2026.6.5 — session bricked for minutes, announces lost, staleness never fires (holder pid = gateway, alive=true) #92459

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: preserve subagent delivery after lock stalls#86540

fix: preserve subagent delivery after lock stalls#86540
galiniliev wants to merge 1 commit into
openclaw:mainfrom
galiniliev:bug-035-036-subagent-delivery-locks

galiniliev commented May 25, 2026 •

edited

Loading

Uh oh!

clawsweeper Bot commented May 25, 2026 •

edited

Loading

Uh oh!

clawsweeper Bot commented May 25, 2026

Uh oh!

omarshahine commented May 25, 2026

Uh oh!

martingarramon left a comment

Uh oh!

galiniliev commented May 25, 2026

Uh oh!

This comment was marked as spam.

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Uh oh!

Conversation

galiniliev commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Linked context

Real behavior proof (required for external PRs)

Tests and validation

Risk checklist

Current review state

Uh oh!

clawsweeper Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

clawsweeper Bot commented May 25, 2026

Uh oh!

omarshahine commented May 25, 2026

Uh oh!

martingarramon left a comment

Choose a reason for hiding this comment

Uh oh!

galiniliev commented May 25, 2026

Uh oh!

This comment was marked as spam.

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

galiniliev commented May 25, 2026 •

edited

Loading

clawsweeper Bot commented May 25, 2026 •

edited

Loading