🐛 fix(agent): deliver sub-agent resume bridge via QStash webhook in queue mode by arvinxx · Pull Request #15620 · lobehub/lobehub

arvinxx · 2026-06-09T22:52:46Z

💻 Change Type

🔗 Related Issue

🔀 Description of Change

In server (queue) mode, a parent agent parked on callSubAgent never resumed after all sub-agents finished — the run just stopped after the tool calls completed.

Root cause: the sub-agent completion bridge (createSubAgentBridgeHook) was a handler-function-only hook. Handler hooks live in process memory; in queue mode (AGENT_RUNTIME_MODE=queue) HookDispatcher.dispatch only delivers hooks with a webhook config (getSerializedHooks filters on h.webhook), so the bridge never fired. tryResumeParentFromAsyncTool was never called and the parent op stayed in waiting_for_async_tool forever. Local mode (in-process dispatch) was unaffected, which is why the integration test passed.
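The filtering described above can be pictured with a minimal sketch. The `AgentHook`, `WebhookConfig`, and `SerializedHook` shapes below are assumptions made for illustration, and this `getSerializedHooks` is a stand-in that only mirrors the "filters on h.webhook" behaviour; the real implementation may differ.

```typescript
// Hypothetical shapes: the names echo the PR text, but the real types may differ.
interface WebhookConfig {
  url: string;
  delivery?: 'fetch' | 'qstash';
}

interface AgentHook {
  id: string;
  type: 'onComplete' | 'onError';
  handler?: (event: unknown) => void; // lives only in the process that registered it
  webhook?: WebhookConfig; // plain data, so it can travel with a queued step
}

interface SerializedHook {
  id: string;
  type: AgentHook['type'];
  webhook: WebhookConfig;
}

// A function cannot be serialized into a queue message, so only hooks that
// carry a webhook config are kept (stand-in for the filter on h.webhook).
function getSerializedHooks(hooks: AgentHook[]): SerializedHook[] {
  const serialized: SerializedHook[] = [];
  for (const hook of hooks) {
    if (hook.webhook) {
      serialized.push({ id: hook.id, type: hook.type, webhook: hook.webhook });
    }
  }
  return serialized;
}

// Before the fix: the bridge only has an in-process handler.
const handlerOnlyBridge: AgentHook = {
  id: 'sub-agent-bridge',
  type: 'onComplete',
  handler: () => undefined,
};

// After the fix: the same hook also carries a webhook config.
const webhookBridge: AgentHook = {
  ...handlerOnlyBridge,
  webhook: { url: '/api/agent/webhooks/subagent-callback', delivery: 'qstash' },
};

const deliverableBeforeFix = getSerializedHooks([handlerOnlyBridge]);
const deliverableAfterFix = getSerializedHooks([webhookBridge]);
```

In this sketch the handler-only bridge is filtered out entirely, which matches the deterministic hang reported for queue mode, while the webhook-carrying bridge survives serialization.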

Fix — webhook transport plus two race hardenings:

QStash webhook bridge: the hook now carries a webhook config (delivery: 'qstash', explicit because deliverWebhook defaults to plain fetch which the endpoint's QStash signature auth would reject) targeting the new /api/agent/webhooks/subagent-callback endpoint. The endpoint resolves the userId from the child operation's metadata (same trust chain as /run), reloads the child's final state from the coordinator, and runs the bridge. Local mode keeps the in-process handler. Both paths converge on the new AgentRuntimeService.completeSubAgentBridge (backfill parent's placeholder tool message → barrier-check → CAS → schedule resume). Non-2xx responses let QStash redeliver, covering transient DB/Redis failures.
Park-time self-check: sub-agents are dispatched mid-step, so a fast child could complete before the parent's parked state was persisted — its resume attempt no-oped against the status guard with nothing left to retry. After the parked state and agent_operations row are persisted, the parent now re-runs the resume barrier once to recover any resume that raced the park.
One-shot verify watchdog: when a completion finds the parent not yet resumable (scheduleVerifyOnHold), a delayed verifyAsyncToolBarrier re-check is scheduled (15s). It re-runs barrier + CAS without claiming the step lock, is idempotent, and never re-arms itself, so retries stay bounded at one per completion event. This covers transient failures around the last completion (a child dying between backfill and resume, a DB hiccup during the barrier read, a lost callback delivery). Pure sibling concurrency needs no extra cover: each completion checks the barrier only after committing its own backfill, so the last committer always sees every earlier one.
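The barrier, CAS, and park-time self-check described in the list above can be modelled in memory. This is a sketch under stated assumptions: `ParentOp`, `tryResumeParent`, `completeSubAgent`, and `parkParent` are hypothetical names, the state lives in a plain object rather than the database or Redis, and the verify watchdog is left out.

```typescript
type ParentStatus = 'running' | 'waiting_for_async_tool' | 'resuming';
type ResumeResult = 'resumed' | 'not_parked' | 'barrier_open';

// Hypothetical in-memory stand-in for the parent operation row.
interface ParentOp {
  status: ParentStatus;
  pendingToolMessageIds: string[]; // placeholder tool messages awaiting a child result
  backfilled: Set<string>; // tool messages whose child result has been written
  resumesScheduled: number;
}

function newParentOp(): ParentOp {
  return {
    status: 'running',
    pendingToolMessageIds: [],
    backfilled: new Set<string>(),
    resumesScheduled: 0,
  };
}

// Compare-and-set: only one caller can move the parent out of the parked state.
function compareAndSetStatus(op: ParentOp, expected: ParentStatus, next: ParentStatus): boolean {
  if (op.status !== expected) return false;
  op.status = next;
  return true;
}

function tryResumeParent(op: ParentOp): ResumeResult {
  // Status guard: a parent that is not parked yet cannot be resumed.
  if (op.status !== 'waiting_for_async_tool') return 'not_parked';
  // Barrier: every placeholder tool message must have been backfilled.
  const barrierSatisfied = op.pendingToolMessageIds.every((id) => op.backfilled.has(id));
  if (!barrierSatisfied) return 'barrier_open';
  // CAS: a redelivered or concurrent completion loses here and does nothing.
  if (!compareAndSetStatus(op, 'waiting_for_async_tool', 'resuming')) return 'not_parked';
  op.resumesScheduled += 1;
  return 'resumed';
}

// A child commits its own backfill first, then checks the barrier.
function completeSubAgent(op: ParentOp, toolMessageId: string): ResumeResult {
  op.backfilled.add(toolMessageId); // idempotent: adding twice changes nothing
  return tryResumeParent(op);
}

// Park-time self-check: after persisting the parked state, re-run the barrier once.
function parkParent(op: ParentOp, toolMessageIds: string[]): ResumeResult {
  op.pendingToolMessageIds = toolMessageIds;
  op.status = 'waiting_for_async_tool';
  return tryResumeParent(op);
}

// Scenario 1: a fast child finishes before the parent is parked.
const fastChildParent = newParentOp();
const fastChildTrace = [
  completeSubAgent(fastChildParent, 'tool_a'),
  parkParent(fastChildParent, ['tool_a']),
];

// Scenario 2: two children finish after the park; the last callback is redelivered.
const parallelParent = newParentOp();
const parallelTrace = [
  parkParent(parallelParent, ['tool_a', 'tool_b']),
  completeSubAgent(parallelParent, 'tool_a'),
  completeSubAgent(parallelParent, 'tool_b'),
  completeSubAgent(parallelParent, 'tool_b'),
];
```

In the first scenario the child's own attempt returns `not_parked` and the park-time self-check is what resumes the parent. In the second, the last committer sees every earlier backfill, and the redelivered callback is a no-op because the CAS has already moved the parent out of `waiting_for_async_tool`.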

The webhook payload is trimmed to operationId/reason/status via eventFields: the endpoint reloads the child's final state from the coordinator, so the default payload (which ships the child's entire final answer via lastAssistantContent, plus any tool-produced attachments the shared lifecycle event extractor inlines) is dead weight.
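As a rough illustration of the trimming, the sketch below builds a payload with and without an `eventFields` allow-list. `buildWebhookPayload` is a hypothetical helper, the sample event values are made up, and the way the static `body` is merged with event fields is an assumption rather than the dispatcher's confirmed behaviour.

```typescript
// Hypothetical event and helper; field names follow the PR description.
type LifecycleEvent = Record<string, unknown>;

function buildWebhookPayload(
  event: LifecycleEvent,
  body: Record<string, unknown>,
  eventFields?: string[],
): Record<string, unknown> {
  // No allow-list: the whole event is shipped alongside the static body.
  if (!eventFields) return { ...body, ...event };
  // Allow-list: copy only the named fields that exist on the event.
  const payload: Record<string, unknown> = { ...body };
  for (const field of eventFields) {
    if (field in event) payload[field] = event[field];
  }
  return payload;
}

// Made-up completion event for a child operation.
const completionEvent: LifecycleEvent = {
  operationId: 'op_child',
  reason: 'done',
  status: 'done',
  lastAssistantContent: 'x'.repeat(10_000), // the child's entire final answer
  attachments: [{ name: 'chart.png', size: 123_456 }],
};

// Static body, matching the keys shown in the reviewed diff.
const staticBody = {
  parentOperationId: 'op_parent',
  threadId: 'thd_1',
  toolMessageId: 'msg_tool_1',
};

const defaultPayload = buildWebhookPayload(completionEvent, staticBody);
const trimmedPayload = buildWebhookPayload(completionEvent, staticBody, [
  'operationId',
  'reason',
  'status',
]);

const defaultSize = JSON.stringify(defaultPayload).length;
const trimmedSize = JSON.stringify(trimmedPayload).length;
```

Because the endpoint reloads the child's final state from the coordinator, the trimmed payload carries only identifiers and status, and its size no longer grows with the length of the child's answer.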

🧪 How to Test

Tested locally
Added/updated tests
No tests needed
src/server/routers/lambda/__tests__/integration/aiAgent/serverSubAgent.integration.test.ts — end-to-end park → sub-op → backfill → resume (local handler path) passes.
AgentRuntimeService.test.ts — 7 new cases: verify scheduling on not-yet-parked / unsatisfied barrier / terminal states; completeSubAgentBridge backfill from finalState, coordinator fallback (webhook path), error note on failure, resume despite backfill failure.
subAgentCallback.test.ts — new endpoint handler tests (validation, 401, happy path, defaults, 500-for-redelivery).
Full src/server/services/agentRuntime suite: 234 tests pass; type-check and eslint clean on touched files.
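The status codes exercised by the endpoint tests can be summarised in a small synchronous model. Everything here is an assumption for illustration: `handleSubAgentCallback` is a hypothetical name, signature verification is reduced to a boolean, the bridge is an injected function, and the 400 for validation failures is a guess (the PR only says validation is tested).

```typescript
// Hypothetical request model: the real endpoint verifies a QStash signature
// and parses JSON; both are reduced to plain fields here.
interface CallbackRequest {
  signatureValid: boolean;
  body: {
    operationId?: string;
    parentOperationId?: string;
    toolMessageId?: string;
  };
}

interface BridgeParams {
  operationId: string;
  parentOperationId: string;
  toolMessageId: string;
}

// Injected stand-in for the bridge; it throws on infrastructure failures.
type CompleteSubAgentBridge = (params: BridgeParams) => void;

function handleSubAgentCallback(
  request: CallbackRequest,
  completeSubAgentBridge: CompleteSubAgentBridge,
): number {
  // Unsigned or badly signed requests are rejected outright.
  if (!request.signatureValid) return 401;

  const { operationId, parentOperationId, toolMessageId } = request.body;
  // Assumed status code for a malformed body.
  if (!operationId || !parentOperationId || !toolMessageId) return 400;

  try {
    completeSubAgentBridge({ operationId, parentOperationId, toolMessageId });
    return 200;
  } catch (_error) {
    // A non-2xx response lets QStash redeliver the callback later.
    return 500;
  }
}

const validBody = {
  operationId: 'op_child',
  parentOperationId: 'op_parent',
  toolMessageId: 'msg_tool_1',
};

const okBridge: CompleteSubAgentBridge = () => undefined;
const failingBridge: CompleteSubAgentBridge = () => {
  throw new Error('simulated transient database failure');
};

const unsignedStatus = handleSubAgentCallback({ signatureValid: false, body: validBody }, okBridge);
const invalidStatus = handleSubAgentCallback({ signatureValid: true, body: {} }, okBridge);
const happyStatus = handleSubAgentCallback({ signatureValid: true, body: validBody }, okBridge);
const retryStatus = handleSubAgentCallback({ signatureValid: true, body: validBody }, failingBridge);
```

The design point is the last branch: a failed bridge is reported as a server error rather than acknowledged, so the queue's retry mechanism covers transient failures.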

Queue-mode scenario to verify in a deployed environment: ask an agent to call N sub-agents in parallel (e.g. 3 weather lookups) — the parent should resume and produce the summary after all children finish.

📝 Additional Information

Reproduction characteristics: in queue mode the hang was deterministic (the bridge never fired), not a flaky race; the screenshot symptom is all sub-agent calls showing completed with the parent never producing its final answer.
execAgent's hetero completionWebhook extraction (hooks.find(h => h.type === 'onComplete')?.webhook) is unaffected: thread hooks precede the bridge in the hook array and the first onComplete match still has no webhook.
QStash redelivery of the callback is safe: the backfill is idempotent and the resume is CAS-guarded.
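The ordering argument for the hetero completionWebhook extraction can be checked with a small sketch. The hook shapes and ids below are hypothetical; only the `hooks.find(h => h.type === 'onComplete')?.webhook` expression comes from the note above.

```typescript
// Hypothetical hook shapes for illustration.
interface WebhookConfig {
  url: string;
}

interface AgentHook {
  id: string;
  type: 'onComplete' | 'onError';
  webhook?: WebhookConfig;
}

// A thread hook without a webhook, and the bridge hook with one.
const threadHook: AgentHook = { id: 'thread-complete', type: 'onComplete' };
const bridgeHook: AgentHook = {
  id: 'sub-agent-bridge',
  type: 'onComplete',
  webhook: { url: '/api/agent/webhooks/subagent-callback' },
};

// Thread hooks precede the bridge, so find() stops at the thread hook.
const hooks: AgentHook[] = [threadHook, bridgeHook];
const completionWebhook = hooks.find((h) => h.type === 'onComplete')?.webhook;

// If the order were ever reversed, the bridge's webhook would be picked up instead.
const reorderedHooks: AgentHook[] = [bridgeHook, threadHook];
const webhookIfReordered = reorderedHooks.find((h) => h.type === 'onComplete')?.webhook;
```

`Array.prototype.find` returns the first match, so the extraction stays unaffected only as long as the hook order described above holds.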

🤖 Generated with Claude Code


vercel · 2026-06-09T22:52:54Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
lobehub	Ready	Preview, Comment	Jun 10, 2026 6:06am

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 65a308785c


chatgpt-codex-connector · 2026-06-09T22:57:30Z

+      webhook: {
+        body: { parentOperationId, threadId, toolMessageId },
+        delivery: 'qstash' as const,
+        // Keep the payload lean: the endpoint reloads the child's final state


Avoid unsigned fallback for sub-agent callback webhooks

For this QStash-authenticated endpoint, a publishJSON failure or missing QSTASH_TOKEN currently falls through HookDispatcher.deliverWebhook to a plain unsigned fetch; qstashAuth rejects that request, and fetchDeliver swallows the non-2xx response. In that failure mode the sub-agent completion callback is silently lost instead of being retried, leaving the parent parked, so this hook needs to fail hard or otherwise avoid the unsigned fallback.
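The failure mode in this comment, and one way to opt out of it, can be sketched as follows. This is a simplified synchronous model, not the dispatcher's code: `deliverWebhook` here is a stand-in, the transports are injected fakes, and the `fallback: 'none'` option is modelled on the name used in the follow-up commit message.

```typescript
interface WebhookConfig {
  url: string;
  delivery?: 'fetch' | 'qstash';
  fallback?: 'fetch' | 'none'; // 'none' opts out of the unsigned plain-fetch fallback
}

interface DeliveryOutcome {
  via: 'qstash' | 'fetch';
  status?: number; // HTTP status when delivered via plain fetch
}

// Synchronous stand-ins for the network, so the control flow is easy to follow.
interface Transports {
  publishQStash: (url: string) => void; // throws when publishing fails
  plainFetch: (url: string) => number; // returns the HTTP status code
}

function deliverWebhook(webhook: WebhookConfig, transports: Transports): DeliveryOutcome {
  if (webhook.delivery === 'qstash') {
    try {
      transports.publishQStash(webhook.url);
      return { via: 'qstash' };
    } catch (error) {
      // Fail hard instead of sending a request the endpoint can never accept.
      if (webhook.fallback === 'none') throw error;
    }
  }
  // Unsigned request: a QStash-signed endpoint rejects it.
  return { via: 'fetch', status: transports.plainFetch(webhook.url) };
}

// Simulated outage: publishing fails and the endpoint rejects unsigned calls.
const brokenQStash: Transports = {
  publishQStash: () => {
    throw new Error('simulated publish failure');
  },
  plainFetch: () => 401,
};

const callbackUrl = '/api/agent/webhooks/subagent-callback';

// Without the opt-out, the callback degrades into a rejected unsigned request.
const silentOutcome = deliverWebhook({ url: callbackUrl, delivery: 'qstash' }, brokenQStash);

// With the opt-out, the publish failure surfaces to the caller.
let failedHard = false;
try {
  deliverWebhook({ url: callbackUrl, delivery: 'qstash', fallback: 'none' }, brokenQStash);
} catch (_error) {
  failedHard = true;
}
```

In the first call the caller only sees a 401 it may well ignore, which is the silently lost callback described above; in the second the error is visible and can be logged or retried.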


codecov · 2026-06-09T23:05:20Z

Codecov Report

❌ Patch coverage is 81.90955% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.64%. Comparing base (1ed93b6) to head (e648c95).
⚠️ Report is 2 commits behind head on canary.

Additional details and impacted files

@@            Coverage Diff             @@
##           canary   #15620      +/-   ##
==========================================
+ Coverage   67.63%   67.64%   +0.01%     
==========================================
  Files        3353     3354       +1     
  Lines      338269   338429     +160     
  Branches    35248    29500    -5748     
==========================================
+ Hits       228786   228935     +149     
- Misses     109292   109303      +11     
  Partials      191      191

Flag	Coverage Δ
app	`60.19% <81.90%> (+0.02%)`	⬆️
database	`98.12% <ø> (ø)`
packages/agent-manager-runtime	`49.69% <ø> (ø)`
packages/agent-runtime	`81.06% <ø> (ø)`
packages/app-config	`44.58% <ø> (ø)`
packages/builtin-tool-lobe-agent	`20.07% <ø> (ø)`
packages/context-engine	`84.12% <ø> (ø)`
packages/conversation-flow	`91.29% <ø> (ø)`
packages/device-gateway-client	`90.18% <ø> (ø)`
packages/env	`11.42% <ø> (ø)`
packages/eval-dataset-parser	`95.15% <ø> (ø)`
packages/eval-rubric	`76.11% <ø> (ø)`
packages/file-loaders	`87.89% <ø> (ø)`
packages/locales	`0.87% <ø> (ø)`
packages/memory-user-memory	`74.99% <ø> (ø)`
packages/model-bank	`99.99% <ø> (ø)`
packages/model-runtime	`84.27% <ø> (ø)`
packages/prompts	`72.51% <ø> (ø)`
packages/python-interpreter	`92.90% <ø> (ø)`
packages/ssrf-safe-fetch	`0.00% <ø> (ø)`
packages/trpc	`40.43% <ø> (ø)`
packages/types	`35.18% <ø> (ø)`
packages/utils	`85.03% <ø> (ø)`
packages/web-crawler	`88.08% <ø> (ø)`


Components	Coverage Δ
Store	`68.41% <ø> (ø)`
Services	`54.25% <ø> (ø)`
Server	`97.15% <100.00%> (+0.11%)`	⬆️
Libs	`54.03% <ø> (-0.17%)`	⬇️
Utils	`82.08% <ø> (ø)`


…ueue mode

The callSubAgent completion bridge was a handler-only hook, which lives in process memory: in queue mode (AGENT_RUNTIME_MODE=queue) HookDispatcher only delivers webhook-configured hooks, so the bridge never fired and the parent op stayed parked in waiting_for_async_tool forever after all sub-agents finished.

- Give the bridge hook a webhook config (delivery: qstash) targeting the new /api/agent/webhooks/subagent-callback endpoint; local mode keeps the in-process handler. Both paths converge on AgentRuntimeService.completeSubAgentBridge (backfill + barrier/CAS resume).
- Park-time self-check: after the parked state and operation row are persisted, re-run the resume barrier once to recover children that completed before the parent finished parking.
- One-shot verify watchdog: when a completion finds the parent not yet resumable, schedule a delayed verifyAsyncToolBarrier re-check (no step lock, CAS-idempotent, never re-arms).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…p worker

Post-rebase adaptation to canary's runtime restructure (#15609):

- Route the webhook bridge through AiAgentService (like the /run step worker) so the runtime's models stay workspace-scoped; a bare AgentRuntimeService would be personal-scoped and the tool-message backfill / resume barrier could miss workspace-scoped rows.
- Extract SubAgentBridgeParams into agentRuntime/types and add the completeSubAgentBridge passthrough next to executeStep.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…failure

Address two review findings on the resume bridge:

- completeSubAgentBridge now checks updateToolMessage's { success } result (it swallows transaction errors instead of throwing) and propagates all infrastructure failures. The webhook endpoint then returns non-2xx so QStash redelivers the whole bridge; previously a failed backfill was acked with 200 and the parent stayed parked forever, since the verify recheck only re-reads the barrier and cannot retry the backfill.
- New AgentHookWebhook.fallback: 'none' opts a qstash-delivered hook out of the unsigned plain-fetch fallback, which can never authenticate against a QStash-signed endpoint and only masked publish failures as silently dropped 401s. The bridge hook uses it; dispatch escalates such delivery failures to console.error instead of the debug namespace.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@ONLY-yours

# 🚀 LobeHub Release (20260610)

**Release Date:** June 10, 2026
**Since v2.2.2:** 131 merged PRs · 13 contributors

> This weekly release strengthens agent collaboration across cloud, desktop, CLI, and workspace flows, with steadier runtime behavior and a broader foundation for workspace-scoped data.

---

## ✨ Highlights

- **Agent execution across devices**: Unifies per-device working directories, project skill discovery, and sub-agent suspend/resume behavior across server, QStash, and device RPC flows. (#15543, #15566, #15481, #15620, #15591)
- **Connector and sandbox platform**: Expands connector permissions, custom OAuth MCP connector onboarding, sandbox provider support, and user-uploaded file sync into cloud sandbox runs. (#15463, #15546, #15184, #15550)
- **Desktop and CLI reliability**: Fixes desktop cold-start, auto-update, Windows build, CLI skill discovery, and `lh connect` agent dispatch paths. (#15547, #15525, #15527, #15562, #15632, #15634)
- **Pages and sharing**: Refreshes topic sharing, improves Page Editor layout behavior, and routes Page Agent tool execution through the server-side editor path. (#15581, #15556, #15588, #15023, #15610)
- **Model availability and provider updates**: Adds user-scoped LobeHub model availability, Claude Fable 5, Qwen thinking preservation, and MiniMax M3 updates. (#15590, #15639, #13494, #15376)

---

## 🏗️ Core Product & Architecture

### Agent Runtime & Heterogeneous Agents

- Improves sub-agent lifecycle handling, including async suspend/resume, queue-mode QStash resume delivery, and blocking nested sub-agent calls. (#15481, #15620, #15575)
- Stabilizes heterogeneous agent ingestion and streaming with raw stream dumps, per-turn usage, image forwarding on regenerate, and duplicate-text fixes. (#15602, #15577, #15592, #15585)
- Adds execution-device and working-directory controls across device RPC, legacy defaults, and remote-spawned Claude Code sessions. (#15543, #15566, #15591, #15572)
- Improves runtime diagnostics and compatibility, including Gemini multimodal output capture, abort stream semantics, and trace quality analysis. (#15535, #13677, #15508)

---

## 📱 Platforms, Integrations & UX

### Connectors, Sandbox & Tools

- Ships API-level connector tool permissions, custom OAuth MCP connector onboarding, and connector-first runtime execution. (#15463, #15546)
- Adds sandbox provider support, cloud sandbox file sync, and safer external URL file input handling with SSRF validation. (#15184, #15550, #12657)
- Improves tool visibility and execution with pinned app-fixed tools, ANSI output rendering, gateway-tunneled MCP calls, and automatic headless tool runs. (#15509, #15516, #15469, #15492)

### Desktop, CLI & Web UX

- Restores desktop startup and reload behavior, preserves IPC error causes, and keeps the tab bar new-tab action visible across routes. (#15547, #15597, #15638)
- Fixes desktop update and build stability for browser quit guards, macOS update signing, and Windows Visual Studio detection. (#15525, #15527, #15562)
- Shows the plan-limit upgrade UI on desktop builds. (#15628)
- Adds the Agent Run delivery checker and fixes CLI device dispatch plus skill list/search output. (#15489, #15634, #15632)
- Refreshes onboarding, auth source preservation, topic UI states, referral/Fable campaign copy, and chat-input control bar behavior. (#15629, #15544, #15573, #15614, #15616, #15617, #15622, #15643)

---

## 🔒 Security, Reliability & Rollout Notes

- External URL file input now includes SSRF validation for safer Google file handling. (#12657)
- Database workspace-scope migrations are part of this release; self-hosted operators should run the normal migration path before serving the updated app. (#15446, #15465, #15468, #15472)
- The release branch was re-cut from `canary` and includes the latest `main` release-version commit so `v2.2.2` is the verified compare base.

---

## 👥 Contributors

@ONLY-yours, @sxjeru, @hardy-one, @xujingli, @hezhijie0327, @Coooolfan, @arvinxx, @tjx666, @Innei, @rivertwilight, @rdmclin2, @cy948, @AmAzing129

**Full Changelog**: v2.2.2...release/weekly-20260610-recut-3

sourcery-ai Bot reviewed Jun 9, 2026


dosubot Bot added labels Jun 9, 2026: size:L (This PR changes 100-499 lines, ignoring generated files), deployment:server (Server-side database mode), feature:agent (Assistant/Agent configuration and behavior)

chatgpt-codex-connector Bot reviewed Jun 9, 2026


arvinxx and others added 4 commits June 10, 2026 07:11

📝 docs(agent): correct verify-watchdog rationale comment

e566236

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

📝 docs(agent): clarify eventFields trimming rationale

e9bfc6a

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

arvinxx force-pushed the fix/server-subagent-resume-bridge branch from 365289f to ddabbe1 Compare June 9, 2026 23:17

vercel Bot deployed to Preview June 9, 2026 23:27 View deployment

vercel Bot deployed to Preview June 10, 2026 06:06 View deployment

arvinxx merged commit fdb529d into canary Jun 10, 2026
34 of 35 checks passed

arvinxx deleted the fix/server-subagent-resume-bridge branch June 10, 2026 08:00

This was referenced Jun 10, 2026

🚀 release: 20260610 #15641

Closed

🚀 release: 20260610 #15645

Closed

🚀 release: 20260610 #15647

Merged

arvinxx merged 5 commits into canary from fix/server-subagent-resume-bridge
