Skip to content

fix(agents): dispatch subagent spawn in process#90612

Merged
steipete merged 4 commits into
openclaw:mainfrom
lanzhi-lee:codex/sessions-spawn-inprocess
Jun 7, 2026
Merged

fix(agents): dispatch subagent spawn in process#90612
steipete merged 4 commits into
openclaw:mainfrom
lanzhi-lee:codex/sessions-spawn-inprocess

Conversation

@lanzhi-lee

Copy link
Copy Markdown
Contributor

Summary

Route sessions_spawn's internal gateway RPC through in-process dispatch when the caller is already running inside the gateway process.

Background

A Feishu-delivered profiling request exposed a sessions_spawn timeout while several agent runs were active in the same gateway process. The observed failure was not a Feishu delivery problem: the message send path eventually returned a successful message id. The failing path was the subagent spawn control plane: sessions_spawn called back into the same gateway over ws://127.0.0.1:<port>, while the gateway event loop was saturated by active agent work. That self-connection pattern can delay the WebSocket handshake and the timeout timer itself, so a configured 60s timeout can surface much later.

Announce delivery already avoids this class of failure by dispatching the gateway method in process. This PR applies the same ownership boundary to subagent spawn's internal gateway calls.

What changed

  • Added hasInProcessGatewayContext() so callers can detect whether in-process gateway dispatch is available.
  • Changed callSubagentGateway() to use dispatchGatewayMethodInProcess() when a gateway context exists, avoiding loopback WebSocket self-connections for in-process sessions_spawn work.
  • Preserved fallback behavior: outside the gateway process, subagent spawn still uses the existing callGateway() WebSocket path.
  • Preserved method scopes: admin-only lifecycle calls such as sessions.delete and sessions.patch force a synthetic admin client, while ordinary agent calls keep the normal write-scoped behavior.
  • Tightened in-process dispatch timeout behavior so timeoutMs bounds the initial response wait, while accepted/final two-phase calls still return the initial accepted response immediately when expectFinal is not requested and wait for final only when it is requested.

What this does not do

  • It does not solve all event-loop starvation. Heavy context assembly, streaming processing, or other CPU-heavy work can still delay unrelated timers and I/O.
  • It does not change model streaming, compaction, or worker-thread behavior.
  • It does not change Feishu/message delivery behavior; the reported missing message was determined to be a timing misunderstanding rather than a delivery failure.
  • It does not add a live high-concurrency reproduction harness. Coverage here is focused on the dispatch contract and timeout/scope regressions that caused the spawn path to be fragile.

Verification

  • node scripts/run-vitest.mjs src/agents/subagent-spawn.test.ts src/gateway/server-plugins.test.ts
  • pnpm run tsgo:core
  • pnpm run tsgo:test:src
  • git diff --check
  • .agents/skills/autoreview/scripts/autoreview --mode commit --commit HEAD --prompt "Review the final rebased sessions_spawn in-process dispatch commit. Focus on gateway scope preservation, timeout semantics, fallback behavior, and subagent cleanup."

Real behavior proof

Behavior addressed: sessions_spawn no longer self-connects to the local gateway WebSocket when it is already executing inside the gateway process.

Real environment tested: Local OpenClaw source checkout on macOS, rebased onto current upstream/main.

Exact steps or command run after this patch: node scripts/run-vitest.mjs src/agents/subagent-spawn.test.ts src/gateway/server-plugins.test.ts; pnpm run tsgo:core; pnpm run tsgo:test:src; git diff --check; final autoreview command listed above.

Evidence after fix: The targeted Vitest run passed 2 shards covering src/gateway/server-plugins.test.ts and src/agents/subagent-spawn.test.ts; tsgo:core and tsgo:test:src completed successfully; final autoreview reported no accepted/actionable findings.

Observed result after fix: In-process subagent spawn calls dispatch through dispatchGatewayMethodInProcess; admin cleanup calls retain synthetic admin scope; initial in-process responses time out if no response is sent, but accepted two-phase responses still return before handler completion unless final output is requested.

What was not tested: A live saturated-gateway repro with multiple concurrent real model streams and Feishu/Telegram delivery was not run in this PR.

@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime agents Agent runtime and tooling size: M triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. labels Jun 5, 2026
@clawsweeper

clawsweeper Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge. Reviewed June 7, 2026, 5:50 AM ET / 09:50 UTC.

Summary
The branch routes internal sessions_spawn gateway calls through in-process dispatch when a gateway context exists, preserves the WebSocket fallback, and adds timeout/scope tests plus docs.

PR surface: Source +115, Tests +167, Docs +2. Total +284 across 7 files.

Reproducibility: no. high-confidence live reproduction was established here. Source inspection shows current main still uses callGateway for subagent spawn control calls, and the PR adds focused contract tests for the in-process path rather than a saturated-gateway repro.

Review metrics: none identified.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🐚 platinum hermit
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • none.

Risk before merge

  • [P1] The PR changes same-process gateway dispatch timing for subagent spawn and cleanup; a missed response or scope edge could affect agent availability even with targeted tests passing.
  • [P1] No live saturated-gateway run is attached; the current proof gate is satisfied by maintainer override, so maintainers are explicitly owning that residual runtime uncertainty.

Maintainer options:

  1. Accept proof override after CI (recommended)
    Land once required checks pass if maintainers are comfortable owning the lack of attached live saturated-gateway proof for this availability-sensitive dispatch change.
  2. Require live gateway proof
    Pause merge until a redacted real OpenClaw gateway run shows sessions_spawn taking the in-process path and returning the accepted response under active agent work.

Next step before merge

  • [P2] No narrow ClawSweeper repair is indicated; the remaining action is normal maintainer merge handling around CI and the proof-override availability risk.

Security
Cleared: No concrete security or supply-chain regression was found; the scope-sensitive paths preserve write-scoped agent calls and synthetic admin scope only for admin-only lifecycle methods.

Review details

Best possible solution:

Land the focused in-process dispatch fix after required CI and maintainer acceptance of the proof override, keeping the WebSocket fallback for callers outside the gateway process.

Do we have a high-confidence way to reproduce the issue?

No high-confidence live reproduction was established here. Source inspection shows current main still uses callGateway for subagent spawn control calls, and the PR adds focused contract tests for the in-process path rather than a saturated-gateway repro.

Is this the best way to solve the issue?

Yes, this appears to be the best bounded fix for the reported self-connection failure mode: reuse the existing in-process gateway dispatch seam when a gateway context exists, while leaving broader event-loop starvation work out of scope.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against bab18d567b0c.

Label changes

Label changes:

  • add rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🐚 platinum hermit and patch quality is 🐚 platinum hermit.
  • add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Override: A maintainer applied proof: override for this PR.
  • remove rating: 🦪 silver shellfish: Current PR rating is rating: 🐚 platinum hermit, so this older rating label is no longer current.
  • remove status: 📣 needs proof: Current PR status label is status: 👀 ready for maintainer look.

Label justifications:

  • P1: The PR targets an agent subspawn timeout in an active gateway process, which can break real agent workflows.
  • merge-risk: 🚨 availability: Merging changes in-process gateway response and timeout behavior for subagent spawn, so a regression could stall or fail agent runs.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🐚 platinum hermit and patch quality is 🐚 platinum hermit.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Override: A maintainer applied proof: override for this PR.
Evidence reviewed

PR surface:

Source +115, Tests +167, Docs +2. Total +284 across 7 files.

View PR surface stats
Area Files Added Removed Net
Source 4 130 15 +115
Tests 2 167 0 +167
Docs 1 6 4 +2
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 7 303 19 +284

What I checked:

  • Current main behavior: Current main resolves admin-only scopes and then always calls callGateway, so the subagent spawn control path still uses the loopback WebSocket helper before this PR. (src/agents/subagent-spawn.ts:248, bab18d567b0c)
  • PR implementation: The PR adds hasInProcessGatewayContext() and dispatches object-parameter subagent gateway calls through dispatchGatewayMethodInProcess, while retaining callGateway(request) as the fallback. (src/agents/subagent-spawn.ts:258, 6c00c17a4588)
  • Gateway dispatch semantics: The PR changes in-process dispatch to wait for the first response with a timeout budget, return accepted responses without awaiting handler completion, and wait for final only when requested. (src/gateway/server-plugins.ts:434, 6c00c17a4588)
  • Scope contract: Core method descriptors classify agent as operator.write and sessions.delete/sessions.patch as operator.admin, matching the PR's least-privilege agent path and synthetic-admin cleanup path. (src/gateway/methods/core-descriptors.ts:154, 6c00c17a4588)
  • Regression coverage: The added tests cover in-process spawn dispatch, admin-scoped cleanup after spawn failure, first-response timeout, accepted-response return, final-response timeout budget, and post-accepted handler rejection cleanup. (src/gateway/server-plugins.test.ts:776, 6c00c17a4588)
  • Maintainer proof context: The live timeline shows proof: override was added by steipete on 2026-06-07 after the previous ClawSweeper review asked for real behavior proof. (6c00c17a4588)

Likely related people:

  • Yuval Dinodia: Current main blame for the central callSubagentGateway and in-process dispatch code points at commit bb27cbd, so this is the strongest local history signal for the current files despite the shallow/grafted history. (role: recent area contributor; confidence: medium; commits: bb27cbd46dbd; files: src/agents/subagent-spawn.ts, src/gateway/server-plugins.ts, docs/tools/subagents.md)
  • Peter Steinberger: The latest PR commit adjusts the in-process timeout budget and the timeline shows Peter added the proof override, making him a likely routing candidate for the remaining merge-risk decision. (role: recent branch contributor and reviewer context; confidence: high; commits: 6c00c17a4588; files: src/gateway/server-plugins.ts, src/gateway/server-plugins.test.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P1 High-priority user-facing bug, regression, or broken workflow. labels Jun 5, 2026
@openclaw-barnacle openclaw-barnacle Bot added the docs Improvements or additions to documentation label Jun 5, 2026
@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. labels Jun 5, 2026
@steipete steipete self-assigned this Jun 7, 2026
@steipete steipete added the proof: override Maintainer override for the external PR real behavior proof gate. label Jun 7, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the triage: mock-only-proof Candidate: PR proof only shows tests, mocks, snapshots, lint, typecheck, or CI. label Jun 7, 2026
@clawsweeper clawsweeper Bot added rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels Jun 7, 2026
@steipete

steipete commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

Land-ready proof for final head 58ef6cc9c429fc87bd14a2de16ce2d6b3d92de1b.

Maintainer fixups added here:

  • 6c00c17a458 fix(gateway): keep in-process dispatch timeout budget
  • 58ef6cc9c42 test(gateway): avoid promise executor timer returns

Local proof:

  • node scripts/run-vitest.mjs src/agents/subagent-spawn.test.ts src/gateway/server-plugins.test.ts passed: src/gateway/server-plugins.test.ts 98 tests, src/agents/subagent-spawn.test.ts 24 tests.
  • pnpm lint --threads=8 passed.
  • git diff --check origin/main...HEAD && git diff --check passed.
  • Autoreview passed with no accepted/actionable findings after the gateway timeout and timer-cleanup fixups.

CI proof:

  • Full PR checks on 58ef6cc9c429fc87bd14a2de16ce2d6b3d92de1b are green, including Real behavior proof after maintainer proof: override.

Known proof gap: no live saturated gateway/Feishu run. The override is for the narrow in-process dispatch/timer behavior covered by focused gateway/subagent tests plus full CI.

@steipete steipete merged commit 58bab0c into openclaw:main Jun 7, 2026
158 checks passed
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request Jun 8, 2026
* fix(agents): dispatch subagent spawn in process

* docs: update subagent gateway dispatch note

* fix(gateway): keep in-process dispatch timeout budget

* test(gateway): avoid promise executor timer returns

---------

Co-authored-by: Peter Steinberger <steipete@gmail.com>
wangmiao0668000666 pushed a commit to wangmiao0668000666/openclaw that referenced this pull request Jun 9, 2026
* fix(agents): dispatch subagent spawn in process

* docs: update subagent gateway dispatch note

* fix(gateway): keep in-process dispatch timeout budget

* test(gateway): avoid promise executor timer returns

---------

Co-authored-by: Peter Steinberger <steipete@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling docs Improvements or additions to documentation gateway Gateway runtime merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. P1 High-priority user-facing bug, regression, or broken workflow. proof: override Maintainer override for the external PR real behavior proof gate. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. size: M status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants