🐛 fix(agent): deliver sub-agent resume bridge via QStash webhook in queue mode#15620
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 65a308785c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| webhook: { | ||
| body: { parentOperationId, threadId, toolMessageId }, | ||
| delivery: 'qstash' as const, | ||
| // Keep the payload lean: the endpoint reloads the child's final state |
There was a problem hiding this comment.
Avoid unsigned fallback for sub-agent callback webhooks
For this QStash-authenticated endpoint, a publishJSON failure or missing QSTASH_TOKEN currently falls through HookDispatcher.deliverWebhook to a plain unsigned fetch; qstashAuth rejects that request, and fetchDeliver swallows the non-2xx response. In that failure mode the sub-agent completion callback is silently lost instead of being retried, leaving the parent parked, so this hook needs to fail hard or otherwise avoid the unsigned fallback.
Useful? React with 👍 / 👎.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## canary #15620 +/- ##
==========================================
+ Coverage 67.63% 67.64% +0.01%
==========================================
Files 3353 3354 +1
Lines 338269 338429 +160
Branches 35248 29500 -5748
==========================================
+ Hits 228786 228935 +149
- Misses 109292 109303 +11
Partials 191 191
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
…ueue mode The callSubAgent completion bridge was a handler-only hook, which lives in process memory: in queue mode (AGENT_RUNTIME_MODE=queue) HookDispatcher only delivers webhook-configured hooks, so the bridge never fired — the parent op stayed parked in waiting_for_async_tool forever after all sub-agents finished. - Give the bridge hook a webhook config (delivery: qstash) targeting the new /api/agent/webhooks/subagent-callback endpoint; local mode keeps the in-process handler. Both paths converge on AgentRuntimeService.completeSubAgentBridge (backfill + barrier/CAS resume). - Park-time self-check: after the parked state and operation row are persisted, re-run the resume barrier once to recover children that completed before the parent finished parking. - One-shot verify watchdog: when a completion finds the parent not yet resumable, schedule a delayed verifyAsyncToolBarrier re-check (no step lock, CAS-idempotent, never re-arms). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…p worker Post-rebase adaptation to canary's runtime restructure (#15609): - Route the webhook bridge through AiAgentService (like the /run step worker) so the runtime's models stay workspace-scoped — a bare AgentRuntimeService would be personal-scoped and the tool-message backfill / resume barrier could miss workspace-scoped rows. - Extract SubAgentBridgeParams into agentRuntime/types and add the completeSubAgentBridge passthrough next to executeStep. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
365289f to
ddabbe1
Compare
…failure
Address two review findings on the resume bridge:
- completeSubAgentBridge now checks updateToolMessage's { success } result
(it swallows transaction errors instead of throwing) and propagates all
infrastructure failures. The webhook endpoint then returns non-2xx so
QStash redelivers the whole bridge — previously a failed backfill was
acked with 200 and the parent stayed parked forever, since the verify
recheck only re-reads the barrier and cannot retry the backfill.
- New AgentHookWebhook.fallback: 'none' opts a qstash-delivered hook out of
the unsigned plain-fetch fallback, which can never authenticate against a
QStash-signed endpoint and only masked publish failures as silently
dropped 401s. The bridge hook uses it; dispatch escalates such delivery
failures to console.error instead of the debug namespace.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
# 🚀 LobeHub Release (20260610) **Release Date:** June 10, 2026 **Since v2.2.2:** 131 merged PRs · 13 contributors > This weekly release strengthens agent collaboration across cloud, desktop, CLI, and workspace flows, with steadier runtime behavior and a broader foundation for workspace-scoped data. --- ## ✨ Highlights - **Agent execution across devices** — Unifies per-device working directories, project skill discovery, and sub-agent suspend/resume behavior across server, QStash, and device RPC flows. (#15543, #15566, #15481, #15620, #15591) - **Connector and sandbox platform** — Expands connector permissions, custom OAuth MCP connector onboarding, sandbox provider support, and user-uploaded file sync into cloud sandbox runs. (#15463, #15546, #15184, #15550) - **Desktop and CLI reliability** — Fixes desktop cold-start, auto-update, Windows build, CLI skill discovery, and `lh connect` agent dispatch paths. (#15547, #15525, #15527, #15562, #15632, #15634) - **Pages and sharing** — Refreshes topic sharing, improves Page Editor layout behavior, and routes Page Agent tool execution through the server-side editor path. (#15581, #15556, #15588, #15023, #15610) - **Model availability and provider updates** — Adds user-scoped LobeHub model availability, Claude Fable 5, Qwen thinking preservation, and MiniMax M3 updates. (#15590, #15639, #13494, #15376) --- ## 🏗️ Core Product & Architecture ### Agent Runtime & Heterogeneous Agents - Improves sub-agent lifecycle handling, including async suspend/resume, queue-mode QStash resume delivery, and blocking nested sub-agent calls. (#15481, #15620, #15575) - Stabilizes heterogeneous agent ingestion and streaming with raw stream dumps, per-turn usage, image forwarding on regenerate, and duplicate-text fixes. (#15602, #15577, #15592, #15585) - Adds execution-device and working-directory controls across device RPC, legacy defaults, and remote-spawned Claude Code sessions. (#15543, #15566, #15591, #15572) - Improves runtime diagnostics and compatibility, including Gemini multimodal output capture, abort stream semantics, and trace quality analysis. (#15535, #13677, #15508) --- ## 📱 Platforms, Integrations & UX ### Connectors, Sandbox & Tools - Ships API-level connector tool permissions, custom OAuth MCP connector onboarding, and connector-first runtime execution. (#15463, #15546) - Adds sandbox provider support, cloud sandbox file sync, and safer external URL file input handling with SSRF validation. (#15184, #15550, #12657) - Improves tool visibility and execution with pinned app-fixed tools, ANSI output rendering, gateway-tunneled MCP calls, and automatic headless tool runs. (#15509, #15516, #15469, #15492) ### Desktop, CLI & Web UX - Restores desktop startup and reload behavior, preserves IPC error causes, and keeps the tab bar new-tab action visible across routes. (#15547, #15597, #15638) - Fixes desktop update and build stability for browser quit guards, macOS update signing, and Windows Visual Studio detection. (#15525, #15527, #15562) - Shows the plan-limit upgrade UI on desktop builds. (#15628) - Adds the Agent Run delivery checker and fixes CLI device dispatch plus skill list/search output. (#15489, #15634, #15632) - Refreshes onboarding, auth source preservation, topic UI states, referral/Fable campaign copy, and chat-input control bar behavior. (#15629, #15544, #15573, #15614, #15616, #15617, #15622, #15643) --- ## 🔒 Security, Reliability & Rollout Notes - External URL file input now includes SSRF validation for safer Google file handling. (#12657) - Database workspace-scope migrations are part of this release; self-hosted operators should run the normal migration path before serving the updated app. (#15446, #15465, #15468, #15472) - The release branch was re-cut from `canary` and includes the latest `main` release-version commit so `v2.2.2` is the verified compare base. --- ## 👥 Contributors @ONLY-yours, @sxjeru, @hardy-one, @xujingli, @hezhijie0327, @Coooolfan, @arvinxx, @tjx666, @Innei, @rivertwilight, @rdmclin2, @cy948, @AmAzing129 **Full Changelog**: v2.2.2...release/weekly-20260610-recut-3
💻 Change Type
🔗 Related Issue
🔀 Description of Change
In server (queue) mode, a parent agent parked on
callSubAgentnever resumed after all sub-agents finished — the run just stopped after the tool calls completed.Root cause: the sub-agent completion bridge (
createSubAgentBridgeHook) was a handler-function-only hook. Handler hooks live in process memory; in queue mode (AGENT_RUNTIME_MODE=queue)HookDispatcher.dispatchonly delivers hooks with awebhookconfig (getSerializedHooksfilters onh.webhook), so the bridge never fired.tryResumeParentFromAsyncToolwas never called and the parent op stayed inwaiting_for_async_toolforever. Local mode (in-process dispatch) was unaffected, which is why the integration test passed.Fix — webhook transport plus two race hardenings:
webhookconfig (delivery: 'qstash', explicit becausedeliverWebhookdefaults to plain fetch which the endpoint's QStash signature auth would reject) targeting the new/api/agent/webhooks/subagent-callbackendpoint. The endpoint resolves the userId from the child operation's metadata (same trust chain as/run), reloads the child's final state from the coordinator, and runs the bridge. Local mode keeps the in-process handler. Both paths converge on the newAgentRuntimeService.completeSubAgentBridge(backfill parent's placeholder tool message → barrier-check → CAS → schedule resume). Non-2xx responses let QStash redeliver, covering transient DB/Redis failures.agent_operationsrow are persisted, the parent now re-runs the resume barrier once to recover any resume that raced the park.scheduleVerifyOnHold), a delayedverifyAsyncToolBarrierre-check is scheduled (15s). It re-runs barrier + CAS without claiming the step lock, is idempotent, and never re-arms itself, so retries stay bounded at one per completion event. This covers transient failures around the last completion (a child dying between backfill and resume, a DB hiccup during the barrier read, a lost callback delivery). Pure sibling concurrency needs no extra cover: each completion checks the barrier only after committing its own backfill, so the last committer always sees every earlier one.The webhook payload is trimmed to
operationId/reason/statusviaeventFields: the endpoint reloads the child's final state from the coordinator, so the default payload (which ships the child's entire final answer vialastAssistantContent, plus any tool-produced attachments the shared lifecycle event extractor inlines) is dead weight.🧪 How to Test
Tested locally
Added/updated tests
No tests needed
src/server/routers/lambda/__tests__/integration/aiAgent/serverSubAgent.integration.test.ts— end-to-end park → sub-op → backfill → resume (local handler path) passes.AgentRuntimeService.test.ts— 7 new cases: verify scheduling on not-yet-parked / unsatisfied barrier / terminal states;completeSubAgentBridgebackfill from finalState, coordinator fallback (webhook path), error note on failure, resume despite backfill failure.subAgentCallback.test.ts— new endpoint handler tests (validation, 401, happy path, defaults, 500-for-redelivery).Full
src/server/services/agentRuntimesuite: 234 tests pass; type-check and eslint clean on touched files.Queue-mode scenario to verify in a deployed environment: ask an agent to call N sub-agents in parallel (e.g. 3 weather lookups) — the parent should resume and produce the summary after all children finish.
📝 Additional Information
execAgent's heterocompletionWebhookextraction (hooks.find(h => h.type === 'onComplete')?.webhook) is unaffected: thread hooks precede the bridge in the hook array and the firstonCompletematch still has no webhook.🤖 Generated with Claude Code