Skip to content

fix(agents,gateway): three subagent announce delivery failures in loopback token-auth setups#85716

Open
jailbirt wants to merge 1 commit into
openclaw:mainfrom
jailbirt:fix/subagent-scope-stale-lock-yield-context
Open

fix(agents,gateway): three subagent announce delivery failures in loopback token-auth setups#85716
jailbirt wants to merge 1 commit into
openclaw:mainfrom
jailbirt:fix/subagent-scope-stale-lock-yield-context

Conversation

@jailbirt

Copy link
Copy Markdown
Contributor

Summary

Three related bugs that break subagent completion delivery when running OpenClaw with loopback token-auth (e.g. Google Chat channel, sessions_spawn orchestration pattern). All three were found and validated on a production Linux/GCP deployment: OpenClaw 2026.5.20, Node 22, systemd user service.


Fix 1 — src/gateway/call.ts — device identity for BACKEND calls with explicit scopes

Fixes: #77807

shouldOmitDeviceIdentityForGatewayCall unconditionally omitted device identity for all BACKEND + GATEWAY_CLIENT + loopback calls. Internal calls such as callSubagentGateway carry explicit operator scopes like ["operator.write"]. Without device identity the gateway cannot verify those scopes against the paired device token and rejects with:

Subagent completion direct announce failed: missing scope: operator.write

Fix: return false (keep device identity) when params.scopes is non-empty. Also threads scopes through resolveDeviceIdentityForGatewayCall so the helper can inspect them.


Fix 2 — src/agents/pi-embedded-runner/run/attempt.sessions-yield.ts — await settle after timeout

waitForSessionsYieldAbortSettle raced the settle promise against a 2 s timeout and returned immediately on timeout, leaving the session transcript file lock held. The next turn on the same persistent session (e.g. a Google Chat DM) failed with "file lock stale", triggering a model fallback and surfacing an internal error message to end users.

Fix: after logging the timeout warning, await params.settlePromise.catch(() => {}) so the file lock is always released before the function returns.


Fix 3 — src/agents/pi-embedded-runner/run/attempt.sessions-yield.ts — strip context message before completion announce

When a new incoming message aborts an active sessions_yield, the openclaw.sessions_yield context message (containing [Context: The previous turn ended intentionally via sessions_yield...]) remains in the session transcript. When a subagent completion announce subsequently re-runs the agent to deliver the result, the agent sees this context message and responds via sessions_yield again (producing a custom_message). The announce system does not recognise a custom_message as a visible reply and emits:

Subagent announce give up: completion agent did not produce a visible reply

stripSessionsYieldArtifacts already strips the interrupt custom type (openclaw.sessions_yield_interrupt) but not the context custom type (openclaw.sessions_yield).

Fix: strip both types in the in-memory messages loop and in the fileEntries loop.


Test plan

  • Single sub-agent turn delivers correctly with no missing scope: operator.write
  • Two rapid consecutive messages (second arrives while sub-agent runs): both deliver, no fallback, no stale lock, no scope errors
  • Persistent session (e.g. Google Chat DM) does not accumulate stale locks across turns

Tested on production: Google Chat + Pipedrive sub-agent workflow, 5+ consecutive turns, no errors.

Reported-by: jailbirt jailbirt@theeye.io

🤖 Generated with Claude Code

Fixes three related bugs that break subagent completion delivery in
loopback token-auth setups (e.g. Google Chat with sessions_spawn).
All three were found and validated on a production Linux/GCP deployment
(openclaw 2026.5.20, Node 22, systemd user service).

---

## Fix 1 — gateway: keep device identity for BACKEND calls with explicit scopes

File: src/gateway/call.ts
Fixes: openclaw#77807

`shouldOmitDeviceIdentityForGatewayCall` unconditionally omitted device
identity for all BACKEND + GATEWAY_CLIENT + loopback calls. Internal
calls such as `callSubagentGateway` carry explicit operator scopes like
`["operator.write"]`. Without device identity the gateway cannot verify
those scopes against the paired device token and rejects with:

  Subagent completion direct announce failed: missing scope: operator.write

Fix: return false (keep device identity) when `params.scopes` is
non-empty. Also threads `scopes` through `resolveDeviceIdentityForGatewayCall`
so the helper can inspect them.

---

## Fix 2 — sessions_yield: await settle promise after timeout

File: src/agents/pi-embedded-runner/run/attempt.sessions-yield.ts

`waitForSessionsYieldAbortSettle` raced the settle promise against a
2 s timeout and returned immediately on timeout, leaving the session
transcript file lock held. The next turn on the same persistent session
(e.g. a Google Chat DM) then failed with "file lock stale", triggering
a model fallback and surfacing an internal error message to end users.

Fix: after logging the timeout warning, await `settlePromise.catch(() => {})`
so the file lock is always released before the function returns.

---

## Fix 3 — sessions_yield: strip context message before completion announce

File: src/agents/pi-embedded-runner/run/attempt.sessions-yield.ts

When a new incoming message aborts an active `sessions_yield`, the
`openclaw.sessions_yield` context message (containing "[Context: The
previous turn ended intentionally via sessions_yield...]") remains in
the session transcript. When a subagent completion announce subsequently
re-runs the agent to deliver the result, the agent sees this context
message and responds via `sessions_yield` again (a `custom_message`).
The announce system does not recognise a `custom_message` as a visible
reply and emits:

  Subagent announce give up: completion agent did not produce a visible reply

`stripSessionsYieldArtifacts` already strips the interrupt custom type
(`openclaw.sessions_yield_interrupt`) but not the context custom type
(`openclaw.sessions_yield`). Fix: strip both types in the in-memory
messages loop and in the fileEntries loop.

---

Tested on production (Google Chat + Pipedrive sub-agent workflow):
- Single sub-agent turn: delivers correctly
- Two rapid consecutive messages (second arrives while sub-agent runs):
  both deliver, no fallback, no stale lock, no scope errors

Reported-by: jailbirt <jailbirt@theeye.io>
@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime agents Agent runtime and tooling size: XS triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 23, 2026
@clawsweeper

clawsweeper Bot commented May 23, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge.

Latest ClawSweeper review: 2026-05-23 13:05 UTC / May 23, 2026, 9:05 AM ET.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The PR threads resolved scopes into Gateway device identity selection, keeps device identity for explicitly scoped backend loopback calls, waits for sessions_yield abort settle after timeout, and strips sessions_yield context custom messages from live and file transcripts.

Reproducibility: Partially. The linked scope bug has redacted live reproduction and current source still matches the implicated scope/identity path; I did not reproduce the two sessions_yield failures in a live channel during this read-only pass.

PR rating
Overall: 🧂 unranked krab
Proof: 🦪 silver shellfish
Patch quality: 🦪 silver shellfish
Summary: The root-cause signal is useful, but missing proof plus a blocking timeout regression and unupdated auth tests make the PR not quality-ready yet.

Rank-up moves:

  • Keep the sessions_yield abort settle path bounded and add focused coverage for timeout behavior.
  • Update src/gateway/call.test.ts for the new non-empty scoped identity behavior and the preserved empty/unscoped behavior.
  • Add redacted real behavior proof, such as terminal output or logs from the fixed Google Chat subagent flow.
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Needs stronger real behavior proof before merge: The PR body says a production Google Chat and Pipedrive workflow passed 5+ turns, but it does not include redacted logs, terminal output, screenshots, recordings, or an artifact showing the after-fix behavior. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

  • The sessions_yield change turns the timeout path into an unbounded wait; if activeSession.abort() never settles, a channel turn can hang instead of recovering after the warning.
  • The Gateway change applies to any non-empty scoped backend loopback shared-auth call, not only the subagent agent call, so it needs explicit auth-boundary review and regression coverage.
  • The PR body asserts production validation, but it does not include redacted logs, terminal output, screenshots, or artifacts showing after-fix behavior.

Maintainer options:

  1. Patch With Bounded Cleanup And Tests (recommended)
    Before merge, keep sessions_yield abort cleanup bounded, update the Gateway scoped-identity tests for non-empty and empty scopes, and add redacted real behavior proof for the subagent delivery flow.
  2. Explicitly Accept The Auth Contract Change
    Maintainers can accept that non-empty scoped backend loopback calls now load paired device identity, but should record why this does not broaden operator scope or credential trust.
  3. Split The Stack
    If the sessions_yield failures cannot be independently verified, split the linked scope fix from the sessions_yield transcript and settle-timeout changes.

Next step before merge
Needs contributor proof and maintainer review of the Gateway auth boundary plus the bounded sessions_yield cleanup; this is not a safe automated repair lane.

Security
Needs attention: The PR changes Gateway operator scope and device-identity selection, so it needs focused auth-boundary tests and owner review before merge.

Review findings

  • [P1] Keep the settle timeout from becoming unbounded — src/agents/pi-embedded-runner/run/attempt.sessions-yield.ts:48
  • [P2] Update the scoped backend identity contract tests — src/gateway/call.ts:343
Review details

Best possible solution:

Land a narrowed regression-led fix that preserves device identity only for the intended scoped internal call path, keeps abort-settle cleanup bounded, updates gateway and sessions_yield coverage, and includes redacted real behavior proof.

Do we have a high-confidence way to reproduce the issue?

Partially. The linked scope bug has redacted live reproduction and current source still matches the implicated scope/identity path; I did not reproduce the two sessions_yield failures in a live channel during this read-only pass.

Is this the best way to solve the issue?

No. The scoped identity direction is plausible for the linked bug, but the sessions_yield timeout must remain bounded and the auth contract needs focused tests plus security review before merge.

Label justifications:

  • P1: The item affects real subagent completion delivery in token-auth channel workflows and can block or hang user-visible agent responses.
  • merge-risk: 🚨 message-delivery: The PR changes subagent announce and sessions_yield cleanup behavior that determines whether completion messages are delivered visibly.
  • merge-risk: 🚨 security-boundary: The PR changes when Gateway backend loopback calls attach paired device identity for scoped operator calls.
  • merge-risk: 🚨 availability: The sessions_yield timeout path can become an unbounded await if abort settlement never resolves.
  • rating: 🧂 unranked krab: Current PR rating is 🧂 unranked krab because proof is 🦪 silver shellfish, patch quality is 🦪 silver shellfish, and The root-cause signal is useful, but missing proof plus a blocking timeout regression and unupdated auth tests make the PR not quality-ready yet.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs stronger real behavior proof before merge: The PR body says a production Google Chat and Pipedrive workflow passed 5+ turns, but it does not include redacted logs, terminal output, screenshots, recordings, or an artifact showing the after-fix behavior. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Full review comments:

  • [P1] Keep the settle timeout from becoming unbounded — src/agents/pi-embedded-runner/run/attempt.sessions-yield.ts:48
    This changes the timeout path from a bounded recovery into an unbounded wait on the same abort-settle promise that already failed to settle in time. If activeSession.abort() hangs, the runner never reaches the session write-lock cleanup or returns from the turn, so one stuck sessions_yield abort can hang the channel/session instead of degrading after the warning. Keep any post-timeout cleanup bounded or release the lock through a separate bounded path.
    Confidence: 0.86
  • [P2] Update the scoped backend identity contract tests — src/gateway/call.ts:343
    This new return false path makes explicitly scoped backend loopback calls load device identity, but the existing callGateway test still asserts deviceIdentity is null for scopes: ["operator.admin"]. That test will fail, and the auth boundary needs replacement coverage showing non-empty scopes keep identity while empty or unscoped shared-token loopback calls preserve the old behavior.
    Confidence: 0.91

Overall correctness: patch is incorrect
Overall confidence: 0.88

Security concerns:

  • [medium] Review scoped backend identity preservation — src/gateway/call.ts:343
    Non-empty scopes now make backend gateway-client loopback shared-auth calls attach paired device identity for all methods, which changes the existing direct-local shared-token contract and should be proven not to broaden credential or operator-scope trust.
    Confidence: 0.84

What I checked:

  • PR diff changes scoped backend loopback identity behavior: The PR adds a non-empty scopes exception to shouldOmitDeviceIdentityForGatewayCall, so scoped backend loopback shared-auth calls keep device identity instead of returning null. (src/gateway/call.ts:343, 89db7a09f11a)
  • Current main omits identity without scope awareness: Current main decides identity omission from backend mode, gateway-client name, shared auth, and loopback URL only; resolved scopes are not part of the current helper inputs. (src/gateway/call.ts:326, 917549190624)
  • Existing test contract conflicts with the PR behavior: The current callGateway test expects an explicit scoped backend loopback shared-token call to keep deviceIdentity null, so the PR needs to update or replace that contract with focused coverage for the new scope exception. (src/gateway/call.test.ts:645, 917549190624)
  • sessions_yield timeout is intentionally bounded on main: Current main races the abort settle promise against SESSIONS_YIELD_ABORT_SETTLE_TIMEOUT_MS and returns after logging timeout, while the PR awaits the original settle promise after timeout. (src/agents/pi-embedded-runner/run/attempt.sessions-yield.ts:23, 917549190624)
  • Abort settle promise comes from active session abort: The settle promise being awaited is derived from activeSession.abort(), so making the post-timeout path wait forever can hang the run if abort settlement is the stalled operation. (src/agents/pi-embedded-runner/run/attempt.ts:2465, 917549190624)
  • Related bug has live reproduction evidence: The linked issue discussion includes redacted live reproduction showing native sessions_spawn fails with missing scope: operator.write while a direct Gateway agent RPC with the same full-scope operator device token succeeds.

Likely related people:

  • Vincent Koc: Current-main blame attributes the checked-in Gateway call helper and sessions_yield helper lines to 7f05be0, which covers the central files touched by this PR. (role: recent area contributor; confidence: medium; commits: 7f05be041e06; files: src/gateway/call.ts, src/agents/pi-embedded-runner/run/attempt.sessions-yield.ts, src/agents/pi-embedded-runner/run/attempt.ts)
  • brokemac79: Recent history shows f4b92f5 simplified subagent completion handoff in src/agents/subagent-spawn.ts, adjacent to the subagent announce path affected here. (role: recent adjacent contributor; confidence: medium; commits: f4b92f5e6c70; files: src/agents/subagent-spawn.ts)
  • steipete: The public review trail on the linked scope issue routes related backend RPC and pairing behavior to earlier Gateway auth work, making this a useful review candidate for the auth boundary. (role: prior gateway auth contributor; confidence: medium; commits: e640c0a95fcb; files: src/gateway/call.ts, src/gateway/call.test.ts)
  • openperf: The prior public review on the linked scope issue identifies related subagent scope-upgrade work that left non-admin agent calls on least-privilege scope. (role: related subagent scope fix author; confidence: medium; commits: b40ef364b71c; files: src/agents/subagent-spawn.ts, src/agents/subagent-spawn.test.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 917549190624.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. merge-risk: 🚨 security-boundary 🚨 May affect sandboxing, authorization, credentials, or sensitive data. merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. labels May 23, 2026
@clawsweeper

clawsweeper Bot commented May 23, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?
  • The egg game starts only after the PR passes the real-behavior proof check.
  • Before that, no creature or rarity is rolled. The treat waits for real proof.
  • This is still just collectible flavor: proof affects review readiness, not creature quality.

@martingarramon martingarramon left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI: 67/70 — 2 real failures from Fix 1

checks-node-agentic-gateway-core and agentic-control-plane-auth-node both fail on the same root cause.


Fix 1 — condition in shouldOmitDeviceIdentityForGatewayCall is too broad

The added guard at call.ts:341-344:

if (requestedScopes.length > 0) {
  return false; // keep device identity
}

fires before the BACKEND + GATEWAY_CLIENT + loopback + hasSharedAuth block. Any call where the resolved scopes array is non-empty now retains device identity — not just subagent loopback calls.

Two tests in src/gateway/call.test.ts encode the opposite contract:

  • :427keeps direct-local backend shared-token auth independent of paired device state: expect(lastClientOptions?.deviceIdentity).toBeNull() — now fails
  • :658uses backend client metadata for explicit scoped default calls: same assertion

The production bug (subagent completion failing with missing scope: operator.write) is real. The fix path through callSubagentGateway → callGateway → callGatewayWithScopes (explicit-scopes branch) → executeGatewayRequestWithScopes → shouldOmitDeviceIdentityForGatewayCall is correct — that path needs device identity when scopes are present. One narrowing would be an explicit flag threaded from the subagent path (e.g., requireDeviceIdentity?: true); any equivalent that preserves the existing direct-local shared-token contract is acceptable.


Fix 2 — waitForSessionsYieldAbortSettle — direction correct, bounded-wait question

The await params.settlePromise.catch(() => {}) after the timeout log ensures the lock drains before the function returns. The intended direction is right. One question: if the settle promise itself stalls (e.g., a hung fs operation), this second await has no timeout fallback of its own. Is there an upper bound on how long settlePromise can remain pending, or does it need its own guard here?


Fix 3 — LGTM

stripSessionsYieldArtifacts: both the messages loop and fileEntries loop now cover SESSIONS_YIELD_CONTEXT_CUSTOM_TYPE, exactly mirroring the existing pattern for SESSIONS_YIELD_INTERRUPT_CUSTOM_TYPE. Ties the comparison to the shared constant (defined at line 5) rather than a raw string.

Fix 1 still needs narrowing before this is merge-ready.

@BingqingLyu

This comment was marked as spam.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling gateway Gateway runtime merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. merge-risk: 🚨 message-delivery 🚨 May drop, duplicate, misroute, suppress, or wrongly target messages. merge-risk: 🚨 security-boundary 🚨 May affect sandboxing, authorization, credentials, or sensitive data. P1 High-priority user-facing bug, regression, or broken workflow. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. size: XS status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: sessions_spawn fails with missing scope operator.write despite full-scope operator token

4 participants