Skip to content

fix: preserve subagent delivery after lock stalls#86540

Open
galiniliev wants to merge 1 commit into
openclaw:mainfrom
galiniliev:bug-035-036-subagent-delivery-locks
Open

fix: preserve subagent delivery after lock stalls#86540
galiniliev wants to merge 1 commit into
openclaw:mainfrom
galiniliev:bug-035-036-subagent-delivery-locks

Conversation

@galiniliev

@galiniliev galiniliev commented May 25, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Preserves suspended subagent final-delivery text into durable completion fallback state before expiring the outbox payload, so expiry is not a hard loss of the child result.
  • Distinguishes expiry after durable fallback preservation from true payload discard.
  • Adds session write-lock timeout owner diagnostics including liveness, starttime, current starttime, age, and stale reasons.
  • Out of scope: redesigning subagent scheduling or session JSONL lock ownership.

Linked context

Which issue does this close?

Closes #86537
Closes #86538

Which issues, PRs, or discussions are related?

Related #86537
Related #86538

Was this requested by a maintainer or owner?

Local bug evidence under /mnt/c/OpenClaw/bugs/BUG-035-subagent-terminal-reconciliation-times-out-then-discards-delivery and /mnt/c/OpenClaw/bugs/BUG-036-session-write-lock-timeouts-block-main-and-subagent-lanes.

Real behavior proof (required for external PRs)

  • Behavior addressed: Expired suspended subagent final delivery now keeps final text durably instead of dropping the only payload, and write-lock timeouts expose actionable owner diagnostics.
  • Real environment tested: Linux WSL2 dev checkout plus Azure Crabbox remote validation. Before proof used Azure Crabbox silver-prawn / cbx_e06ef86ae274; after proof used Azure Crabbox violet-lobster / cbx_3eda7a42d715.
  • Exact steps or command run after this patch: node scripts/run-vitest.mjs src/agents/subagent-registry.test.ts src/agents/session-write-lock.test.ts; git diff --check; .agents/skills/autoreview/scripts/autoreview --mode local; Azure Crabbox before/after focused regression proof.
  • Evidence after fix (screenshot, recording, terminal capture, console output, redacted runtime log, linked artifact, or copied live output):
RUN  v4.1.7 /home/galini/GitHub/worktrees/bug-035-036-subagent-delivery-locks

Test Files  4 passed (4)
     Tests  144 passed (144)
  Start at  15:01:23
  Duration  16.11s

autoreview clean: no accepted/actionable findings reported
overall: patch is correct (0.84)

Azure Crabbox after-fix evidence log:

Crabbox OpenClaw evidence
timestamp=20260525T214730Z
name=pr-86540-after-fix
crabbox_id=violet-lobster
exit_status=0

tests=
build=0
types=0
check=none
proof_command_count=1

CRABBOX_PHASE:context
CRABBOX_PHASE:bootstrap
CRABBOX_PHASE:install
CRABBOX_PHASE:evidence_1
AFTER_FIX_HEAD_SHA=f9c6fc7de291905233e7d5ca65201f6d299e00ab
AFTER_FIX_TEST_EXIT=0
AFTER_FIX_REGRESSION_PASSED=1
CRABBOX_PHASE:diff
CRABBOX_PHASE:done

run summary sync=30.982s command=1m2.528s total=1m33.578s sync_skipped=false exit=0 sync_steps=ssh:611ms,manifest:635ms,preflight:0s,fingerprint:2ms,fingerprint_remote:70ms,git_seed:19.231s,manifest_write:223ms,prune:67ms,rsync:8.012s,finalize:2.039s command_phases=user-command:2.86s,context:32ms,bootstrap:16.583s,install:10.482s,evidence_1:32.558s,diff:6ms,done:2ms
  • Observed result after fix: The subagent registry regression expires a suspended delivery and observes discardReason: "expired-after-durable-fallback" plus completion.fallbackResultText; the session lock regression observes timeout owner text with pid=, alive=true, starttime=, and currentStarttime=. Azure Crabbox after-fix proof passed the focused regression set at head f9c6fc7de291905233e7d5ca65201f6d299e00ab.
  • What was not tested: Live gateway retry with the original private sessions was not rerun.
  • Proof limitations or environment constraints: Azure Crabbox and local proof cover the durable state and timeout diagnostic seams; the original session keys and lock paths are private and redacted.
  • Before evidence (optional but encouraged):

Redacted original symptom evidence:

subagent wait timed out; deferring terminal state until session reconciliation
subagent suspended delivery discarded {"reason":"expired","runId":"[redacted run id]","childSessionKey":"[redacted child session key]","requesterSessionKey":"[redacted requester session key]"}
SessionWriteLockTimeoutError: session file locked (timeout 60000ms): pid=[redacted pid] [redacted session lock path]

Azure Crabbox before-fix evidence log:

Crabbox OpenClaw evidence
timestamp=20260525T214615Z
name=pr-86540-before-regression
crabbox_id=silver-prawn
exit_status=0

tests=
build=0
types=0
check=none
proof_command_count=1

CRABBOX_PHASE:context
CRABBOX_PHASE:bootstrap
CRABBOX_PHASE:install
CRABBOX_PHASE:evidence_1
BEFORE_FIX_BASE_SHA=c4bce00727ec5f6a5b9c78e0ad34524f5dc1db54
BEFORE_FIX_TEST_EXIT=1
BEFORE_FIX_EXPECTED_REGRESSION=1
CRABBOX_PHASE:diff
CRABBOX_PHASE:done

run summary sync=1.005s command=16.214s total=18.025s sync_skipped=true exit=0 sync_steps=ssh:238ms,manifest:698ms,preflight:0s,fingerprint:2ms,fingerprint_remote:68ms command_phases=user-command:2.843s,context:41ms,bootstrap:646ms,install:390ms,evidence_1:12.284s,diff:4ms,done:3ms

Tests and validation

Which commands did you run?

  • node scripts/run-vitest.mjs src/agents/subagent-registry.test.ts src/agents/session-write-lock.test.ts
  • git diff --check
  • .agents/skills/autoreview/scripts/autoreview --mode local
  • Azure Crabbox before proof: silver-prawn / cbx_e06ef86ae274, base c4bce00727ec5f6a5b9c78e0ad34524f5dc1db54, observed BEFORE_FIX_TEST_EXIT=1 and BEFORE_FIX_EXPECTED_REGRESSION=1.
  • Azure Crabbox after proof: violet-lobster / cbx_3eda7a42d715, head f9c6fc7de291905233e7d5ca65201f6d299e00ab, observed AFTER_FIX_TEST_EXIT=0 and AFTER_FIX_REGRESSION_PASSED=1.

What regression coverage was added or updated?

  • src/agents/subagent-registry.test.ts now verifies durable fallback preservation on suspended final-delivery expiry.
  • src/agents/session-write-lock.test.ts now verifies timeout owner diagnostics include liveness/starttime details.

What failed before this fix, if known?

  • Redacted logs showed 25 suspended final deliveries discarded after repeated terminal reconciliation timeouts and 161 session write-lock timeout lines with only pid/path owner context.
  • Azure Crabbox before proof applied the focused regression tests to base c4bce00727ec5f6a5b9c78e0ad34524f5dc1db54 and observed the expected regression failure.

If no test was added, why not?

N/A

Risk checklist

Did user-visible behavior change? (Yes/No)

Yes

Did config, environment, or migration behavior change? (Yes/No)

No

Did security, auth, secrets, network, or tool execution behavior change? (Yes/No)

No

What is the highest-risk area?

Subagent final-delivery cleanup state.

How is that risk mitigated?

The delivery payload is still compacted on expiry, but final text is copied into the existing completion fallback field before the payload is removed; pressure-pruned deliveries remain unchanged.

Current review state

What is the next action?

Maintainer review and CI.

What is still waiting on author, maintainer, CI, or external proof?

Live replay of the original private sessions was not run. Required CI is currently blocked by an unrelated type error in extensions/qa-lab/src/live-transports/whatsapp/whatsapp-live.runtime.ts:872.

Which bot or reviewer comments were addressed?

Autoreview reported no accepted/actionable findings. Azure Crabbox before/after focused evidence was added to the PR description.

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: S maintainer Maintainer-authored PR labels May 25, 2026
@clawsweeper

clawsweeper Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge. Reviewed May 29, 2026, 1:20 AM ET / 05:20 UTC.

Summary
The PR preserves suspended subagent final-delivery text in completion fallback state before expiring the delivery payload, adds an expired-after-durable-fallback discard reason, expands session write-lock timeout owner diagnostics, and updates focused agent tests.

PR surface: Source +46, Tests +15. Total +61 across 5 files.

Reproducibility: yes. from source and supplied before-proof: current main still clears completion fallback on suspended delivery discard and reports only pid-level lock timeout ownership. I did not run the repro locally in this read-only review.

Review metrics: 1 noteworthy metric.

  • Persisted delivery state: 1 discard reason added; completion fallback retained on expiry. This changes durable subagent run state that recovery and read paths may inspect, so maintainers should notice the session-state surface before merge.

Merge readiness
Overall: 🦐 gold shrimp
Proof: 🦪 silver shellfish
Patch quality: 🐚 platinum hermit
Result: blocked until real behavior proof from a real setup is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P1] Add redacted after-fix live gateway/session replay or runtime log for the contended delivery and lock diagnostic path.
  • [P1] Coordinate or rebase against the overlapping session-lock PRs before merge.
  • [P1] Get required CI green or record a maintainer waiver for unrelated failures.

Proof guidance:

  • [P1] Needs real behavior proof before merge: The PR body has WSL2/Azure Crabbox focused regression-test output and redacted before logs, but no after-fix live gateway/session replay or redacted runtime log; add proof with private details redacted and update the PR body to trigger re-review, or ask a maintainer to comment @clawsweeper re-review if it does not.

Risk before merge

  • [P1] After-fix proof exercises focused local/Azure Crabbox regression tests, but it does not include a redacted live gateway/session replay or after-fix runtime log from a contended session path.
  • [P2] The PR changes durable subagent completion and delivery state; if the fallback preservation is wrong, requester sessions could still miss or stale-read subagent completion text after expiry.
  • [P1] The same acquireSessionWriteLock surface overlaps open PRs fix(gateway): bound startup sidecar fanout #85399 and fix(agents): retry stale session lock reports #87342, and required checks at the reviewed head include type/package-boundary failures that need coordination, rebase, or maintainer waiver.

Maintainer options:

  1. Require live proof and rebase before merge (recommended)
    Ask for a redacted after-fix gateway/session replay or runtime log, then rebase against the overlapping lock-path work and rerun required checks before landing.
  2. Accept focused proof with a maintainer waiver
    Maintainers can decide the supplied failing/passing regression tests and redacted before logs are enough, but that should be recorded alongside any CI waiver before merge.
  3. Pause behind the lock-path cluster
    If one of the overlapping session-lock PRs becomes canonical, pause this PR until the durable delivery fix can be rebased or split cleanly.

Next step before merge

  • [P1] Protected maintainer label plus proof sufficiency, CI, and overlapping lock-path coordination are human merge decisions rather than a narrow automated repair.

Security
Cleared: No concrete security or supply-chain concern found; the diff touches core agent session state/tests and adds local lock diagnostics without dependency, workflow, credential, auth, network, or execution-surface changes.

Review details

Best possible solution:

Land a rebased branch after maintainers accept or improve the live proof, coordinate the overlapping session-lock work, and get required CI green or explicitly waived.

Do we have a high-confidence way to reproduce the issue?

Yes, from source and supplied before-proof: current main still clears completion fallback on suspended delivery discard and reports only pid-level lock timeout ownership. I did not run the repro locally in this read-only review.

Is this the best way to solve the issue?

Mostly yes: the code path is narrow and reuses existing completion fallback and lock-inspection state rather than adding a new scheduler or config surface. The unresolved part is merge readiness, especially live proof, CI, and lock-path coordination.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 13cb9f82777e.

Label changes

Label justifications:

  • P1: The PR targets loss of subagent completion delivery and session lock diagnostics in real agent workflows, which is an urgent agent/session-state regression class.
  • merge-risk: 🚨 session-state: Merging changes durable subagent completion fallback and discard state, so a bad merge could preserve or expose incorrect session completion state.
  • rating: 🦐 gold shrimp: Overall readiness is 🦐 gold shrimp; proof is 🦪 silver shellfish and patch quality is 🐚 platinum hermit.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body has WSL2/Azure Crabbox focused regression-test output and redacted before logs, but no after-fix live gateway/session replay or redacted runtime log; add proof with private details redacted and update the PR body to trigger re-review, or ask a maintainer to comment @clawsweeper re-review if it does not.
Evidence reviewed

PR surface:

Source +46, Tests +15. Total +61 across 5 files.

View PR surface stats
Area Files Added Removed Net
Source 3 52 6 +46
Tests 2 19 4 +15
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 5 71 10 +61

What I checked:

  • Current main still drops fallback on suspended delivery expiry: Current main clears completion.fallbackResultText and fallbackCapturedAt when a suspended pending final delivery is discarded, so the central fix is not already implemented on main. (src/agents/subagent-registry.ts:810, 13cb9f82777e)
  • PR preserves the durable fallback before compaction: The PR copies frozenResultText or fallbackFrozenResultText into the existing completion fallback fields when expiry discards the delivery payload. (src/agents/subagent-registry.ts:707, f9c6fc7de291)
  • Current main timeout owner is only pid-level: Current main formats the session write-lock timeout owner as only pid=<pid> or unknown, matching the reported diagnostic gap. (src/agents/session-write-lock.ts:825, 13cb9f82777e)
  • PR adds lock owner diagnostics: The PR formats timeout owner details with pid, liveness, starttime/currentStarttime, createdAt, ageMs, and stale reasons from the existing lock inspection path. (src/agents/session-write-lock.ts:640, f9c6fc7de291)
  • Focused regression coverage exists: The PR updates focused tests to assert the new durable fallback state and expanded owner diagnostics. (src/agents/subagent-registry.test.ts:1858, f9c6fc7de291)
  • Protected review state and checks: Live API data shows the PR remains open with the protected maintainer label and status: 📣 needs proof; check-run data includes failed type/package-boundary jobs at the reviewed head. (f9c6fc7de291)

Likely related people:

  • Vincent Koc: Current shallow blame attributes the central subagent discard and session lock inspection implementations to the grafted current-main refactor commit, so this is the strongest local routing signal despite limited history depth. (role: recent area contributor; confidence: medium; commits: bf30361bc89d; files: src/agents/subagent-registry.ts, src/agents/session-write-lock.ts)
  • Ted Li: The latest current-main history for src/agents/subagent-registry.ts includes adjacent agent/subagent timeout work, making this a useful secondary routing signal for subagent completion behavior. (role: recent adjacent contributor; confidence: medium; commits: 8a60f39221db; files: src/agents/subagent-registry.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. labels May 25, 2026
@clawsweeper

clawsweeper Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?
  • The egg game starts only after the PR passes the real-behavior proof check.
  • Before that, no creature or rarity is rolled. The treat waits for real proof.
  • This is still just collectible flavor: proof affects review readiness, not creature quality.

@omarshahine

Copy link
Copy Markdown
Contributor

Thanks for the PR. Before this can move forward, please add live proof from the affected surface, not just unit tests, mocked tests, or source inspection.

A useful proof update should include:

  • the exact build/SHA tested
  • the real environment used
  • the command, UI flow, channel flow, provider request, or other live path exercised
  • before/after symptom evidence where applicable
  • the observed result after the patch
  • any remaining proof gaps

Please redact secrets, tokens, phone numbers, and private message content from logs or screenshots.

@galiniliev galiniliev self-assigned this May 25, 2026

@martingarramon martingarramon left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified at f9c6fc7d.

Two distinct changes in this PR:

Lock timeout diagnostics (session-write-lock.ts): formatSessionLockTimeoutOwner assembles pid + liveness + starttime + age into the timeout error. The pinCurrentProcessStartTimeForTest addition in the test is needed to make starttime assertions deterministic. ✓

Durable fallback preservation (subagent-registry.ts): before expiring a suspended delivery, checks frozenResultText ?? fallbackFrozenResultText and persists it into completion.fallbackResultText if present. discardReason distinguishes "expired-after-durable-fallback" from "expired". The type in subagent-registry.types.ts is updated to match. ✓

Three CI failures need resolution before merge:

  • check-prod-types and check-additional-extension-package-boundary — likely the new fallbackResultText / fallbackCapturedAt fields on the completion type are not exported through the package boundary; verify src/agents/subagent-registry.types.ts re-exports reach the extension bundle entry.
  • check-test-types — if a new field is referenced in a test that the type declaration doesn't yet include.

LGTM pending CI.

@galiniliev

Copy link
Copy Markdown
Contributor Author

Maintainer prep update for f9c6fc7de291905233e7d5ca65201f6d299e00ab.

Evidence collected

  • Azure Crabbox before proof: silver-prawn / cbx_e06ef86ae274, base c4bce00727ec5f6a5b9c78e0ad34524f5dc1db54, applied the PR regression tests to the base, observed BEFORE_FIX_TEST_EXIT=1 and BEFORE_FIX_EXPECTED_REGRESSION=1.
  • Azure Crabbox after proof: violet-lobster / cbx_3eda7a42d715, head f9c6fc7de291905233e7d5ca65201f6d299e00ab, ran pnpm test src/agents/subagent-registry.test.ts src/agents/session-write-lock.test.ts, observed AFTER_FIX_TEST_EXIT=0 and AFTER_FIX_REGRESSION_PASSED=1.
  • Local focused proof also passed via node scripts/run-vitest.mjs src/agents/subagent-registry.test.ts src/agents/session-write-lock.test.ts.
  • Autoreview was clean: no accepted/actionable findings, overall patch-correct confidence 0.84.

Review result
No blocking code findings from maintainer review. I do not see a feature drop or regression in the PR itself: the change preserves durable suspended subagent fallback text before compacting expired payloads, and the lock timeout diagnostic expansion stays scoped to owner/liveness/starttime/staleness metadata.

Current blocker
CI was rerun at https://github.com/openclaw/openclaw/actions/runs/26407542199 and still has three failed jobs: check-prod-types, check-test-types, and check-additional-extension-package-boundary. The shared failure is an unrelated type error in extensions/qa-lab/src/live-transports/whatsapp/whatsapp-live.runtime.ts:872 where rttMeasurement.source is inferred as string instead of the literal "request-to-observed-message". This PR only touches src/agents/session-write-lock* and src/agents/subagent-registry*.

I would not call this fully land-ready while required CI is red. The PR-specific code/test evidence is clean; remaining decision is whether to fix/override the unrelated qa-lab CI blocker and whether the targeted Azure before/after proof is enough versus requiring a live gateway/session replay.

@BingqingLyu

This comment was marked as spam.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling maintainer Maintainer-authored PR merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P1 High-priority user-facing bug, regression, or broken workflow. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. size: S status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Session write-lock timeouts block subagent delivery lanes [Bug]: Subagent terminal reconciliation can expire suspended final delivery

5 participants