fix(agents,gateway): three subagent announce delivery failures in loopback token-auth setups#85716
Conversation
Fixes three related bugs that break subagent completion delivery in loopback token-auth setups (e.g. Google Chat with sessions_spawn). All three were found and validated on a production Linux/GCP deployment (openclaw 2026.5.20, Node 22, systemd user service). --- ## Fix 1 — gateway: keep device identity for BACKEND calls with explicit scopes File: src/gateway/call.ts Fixes: openclaw#77807 `shouldOmitDeviceIdentityForGatewayCall` unconditionally omitted device identity for all BACKEND + GATEWAY_CLIENT + loopback calls. Internal calls such as `callSubagentGateway` carry explicit operator scopes like `["operator.write"]`. Without device identity the gateway cannot verify those scopes against the paired device token and rejects with: Subagent completion direct announce failed: missing scope: operator.write Fix: return false (keep device identity) when `params.scopes` is non-empty. Also threads `scopes` through `resolveDeviceIdentityForGatewayCall` so the helper can inspect them. --- ## Fix 2 — sessions_yield: await settle promise after timeout File: src/agents/pi-embedded-runner/run/attempt.sessions-yield.ts `waitForSessionsYieldAbortSettle` raced the settle promise against a 2 s timeout and returned immediately on timeout, leaving the session transcript file lock held. The next turn on the same persistent session (e.g. a Google Chat DM) then failed with "file lock stale", triggering a model fallback and surfacing an internal error message to end users. Fix: after logging the timeout warning, await `settlePromise.catch(() => {})` so the file lock is always released before the function returns. --- ## Fix 3 — sessions_yield: strip context message before completion announce File: src/agents/pi-embedded-runner/run/attempt.sessions-yield.ts When a new incoming message aborts an active `sessions_yield`, the `openclaw.sessions_yield` context message (containing "[Context: The previous turn ended intentionally via sessions_yield...]") remains in the session transcript. When a subagent completion announce subsequently re-runs the agent to deliver the result, the agent sees this context message and responds via `sessions_yield` again (a `custom_message`). The announce system does not recognise a `custom_message` as a visible reply and emits: Subagent announce give up: completion agent did not produce a visible reply `stripSessionsYieldArtifacts` already strips the interrupt custom type (`openclaw.sessions_yield_interrupt`) but not the context custom type (`openclaw.sessions_yield`). Fix: strip both types in the in-memory messages loop and in the fileEntries loop. --- Tested on production (Google Chat + Pipedrive sub-agent workflow): - Single sub-agent turn: delivers correctly - Two rapid consecutive messages (second arrives while sub-agent runs): both deliver, no fallback, no stale lock, no scope errors Reported-by: jailbirt <jailbirt@theeye.io>
|
Codex review: needs real behavior proof before merge. Latest ClawSweeper review: 2026-05-23 13:05 UTC / May 23, 2026, 9:05 AM ET. Workflow note: Future ClawSweeper reviews update this same comment in place. How this review workflow works
Summary Reproducibility: Partially. The linked scope bug has redacted live reproduction and current source still matches the implicated scope/identity path; I did not reproduce the two sessions_yield failures in a live channel during this read-only pass. PR rating Rank-up moves:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. Real behavior proof Risk before merge
Maintainer options:
Next step before merge Security Review findings
Review detailsBest possible solution: Land a narrowed regression-led fix that preserves device identity only for the intended scoped internal call path, keeps abort-settle cleanup bounded, updates gateway and sessions_yield coverage, and includes redacted real behavior proof. Do we have a high-confidence way to reproduce the issue? Partially. The linked scope bug has redacted live reproduction and current source still matches the implicated scope/identity path; I did not reproduce the two sessions_yield failures in a live channel during this read-only pass. Is this the best way to solve the issue? No. The scoped identity direction is plausible for the linked bug, but the sessions_yield timeout must remain bounded and the auth contract needs focused tests plus security review before merge. Label justifications:
Full review comments:
Overall correctness: patch is incorrect Security concerns:
What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 917549190624. |
|
ClawSweeper PR egg 🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat. Where did the egg go?
|
martingarramon
left a comment
There was a problem hiding this comment.
CI: 67/70 — 2 real failures from Fix 1
checks-node-agentic-gateway-core and agentic-control-plane-auth-node both fail on the same root cause.
Fix 1 — condition in shouldOmitDeviceIdentityForGatewayCall is too broad
The added guard at call.ts:341-344:
if (requestedScopes.length > 0) {
return false; // keep device identity
}fires before the BACKEND + GATEWAY_CLIENT + loopback + hasSharedAuth block. Any call where the resolved scopes array is non-empty now retains device identity — not just subagent loopback calls.
Two tests in src/gateway/call.test.ts encode the opposite contract:
:427—keeps direct-local backend shared-token auth independent of paired device state:expect(lastClientOptions?.deviceIdentity).toBeNull()— now fails:658—uses backend client metadata for explicit scoped default calls: same assertion
The production bug (subagent completion failing with missing scope: operator.write) is real. The fix path through callSubagentGateway → callGateway → callGatewayWithScopes (explicit-scopes branch) → executeGatewayRequestWithScopes → shouldOmitDeviceIdentityForGatewayCall is correct — that path needs device identity when scopes are present. One narrowing would be an explicit flag threaded from the subagent path (e.g., requireDeviceIdentity?: true); any equivalent that preserves the existing direct-local shared-token contract is acceptable.
Fix 2 — waitForSessionsYieldAbortSettle — direction correct, bounded-wait question
The await params.settlePromise.catch(() => {}) after the timeout log ensures the lock drains before the function returns. The intended direction is right. One question: if the settle promise itself stalls (e.g., a hung fs operation), this second await has no timeout fallback of its own. Is there an upper bound on how long settlePromise can remain pending, or does it need its own guard here?
Fix 3 — LGTM
stripSessionsYieldArtifacts: both the messages loop and fileEntries loop now cover SESSIONS_YIELD_CONTEXT_CUSTOM_TYPE, exactly mirroring the existing pattern for SESSIONS_YIELD_INTERRUPT_CUSTOM_TYPE. Ties the comparison to the shared constant (defined at line 5) rather than a raw string.
Fix 1 still needs narrowing before this is merge-ready.
Summary
Three related bugs that break subagent completion delivery when running OpenClaw with loopback token-auth (e.g. Google Chat channel,
sessions_spawnorchestration pattern). All three were found and validated on a production Linux/GCP deployment: OpenClaw 2026.5.20, Node 22, systemd user service.Fix 1 —
src/gateway/call.ts— device identity for BACKEND calls with explicit scopesFixes: #77807
shouldOmitDeviceIdentityForGatewayCallunconditionally omitted device identity for allBACKEND + GATEWAY_CLIENT + loopbackcalls. Internal calls such ascallSubagentGatewaycarry explicit operator scopes like["operator.write"]. Without device identity the gateway cannot verify those scopes against the paired device token and rejects with:Fix: return
false(keep device identity) whenparams.scopesis non-empty. Also threadsscopesthroughresolveDeviceIdentityForGatewayCallso the helper can inspect them.Fix 2 —
src/agents/pi-embedded-runner/run/attempt.sessions-yield.ts— await settle after timeoutwaitForSessionsYieldAbortSettleraced the settle promise against a 2 s timeout and returned immediately on timeout, leaving the session transcript file lock held. The next turn on the same persistent session (e.g. a Google Chat DM) failed with"file lock stale", triggering a model fallback and surfacing an internal error message to end users.Fix: after logging the timeout warning,
await params.settlePromise.catch(() => {})so the file lock is always released before the function returns.Fix 3 —
src/agents/pi-embedded-runner/run/attempt.sessions-yield.ts— strip context message before completion announceWhen a new incoming message aborts an active
sessions_yield, theopenclaw.sessions_yieldcontext message (containing[Context: The previous turn ended intentionally via sessions_yield...]) remains in the session transcript. When a subagent completion announce subsequently re-runs the agent to deliver the result, the agent sees this context message and responds viasessions_yieldagain (producing acustom_message). The announce system does not recognise acustom_messageas a visible reply and emits:stripSessionsYieldArtifactsalready strips the interrupt custom type (openclaw.sessions_yield_interrupt) but not the context custom type (openclaw.sessions_yield).Fix: strip both types in the in-memory messages loop and in the
fileEntriesloop.Test plan
missing scope: operator.writeTested on production: Google Chat + Pipedrive sub-agent workflow, 5+ consecutive turns, no errors.
Reported-by: jailbirt jailbirt@theeye.io
🤖 Generated with Claude Code