
🐛 fix(agent): deliver sub-agent resume bridge via QStash webhook in queue mode#15620

Merged: arvinxx merged 5 commits into canary from fix/server-subagent-resume-bridge on Jun 10, 2026

Conversation

@arvinxx

@arvinxx arvinxx commented Jun 9, 2026

Member

💻 Change Type

  • ✨ feat
  • 🐛 fix
  • ♻️ refactor
  • 💄 style
  • 👷 build
  • ⚡️ perf
  • ✅ test
  • 📝 docs
  • 🔨 chore

🔗 Related Issue

🔀 Description of Change

In server (queue) mode, a parent agent parked on callSubAgent never resumed after all sub-agents finished — the run just stopped after the tool calls completed.

Root cause: the sub-agent completion bridge (createSubAgentBridgeHook) was a handler-function-only hook. Handler hooks live in process memory; in queue mode (AGENT_RUNTIME_MODE=queue) HookDispatcher.dispatch only delivers hooks with a webhook config (getSerializedHooks filters on h.webhook), so the bridge never fired. tryResumeParentFromAsyncTool was never called and the parent op stayed in waiting_for_async_tool forever. Local mode (in-process dispatch) was unaffected, which is why the integration test passed.
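The filtering described above can be illustrated with a minimal sketch. The types and the `deliverableHooks` helper are hypothetical stand-ins, not the repository's actual `HookDispatcher` or `getSerializedHooks`; only the rule "queue mode delivers hooks that carry a webhook config" comes from the description.

```typescript
// Hypothetical hook shape; the real types in the repo may differ.
interface SketchHook {
  id: string;
  type: 'onComplete';
  // Lives in process memory, so it cannot cross a queue hop.
  handler?: (operationId: string) => void;
  // Plain data, so it can be serialized and delivered by another worker.
  webhook?: { url: string; delivery?: 'fetch' | 'qstash' };
}

type RuntimeMode = 'local' | 'queue';

// Assumed rule: queue mode keeps only hooks that have a webhook config.
function deliverableHooks(hooks: SketchHook[], mode: RuntimeMode): SketchHook[] {
  return mode === 'queue' ? hooks.filter((h) => h.webhook !== undefined) : hooks;
}

const handlerOnlyBridge: SketchHook = {
  id: 'sub-agent-bridge',
  type: 'onComplete',
  handler: () => {},
};

const webhookBridge: SketchHook = {
  ...handlerOnlyBridge,
  webhook: { url: '/api/agent/webhooks/subagent-callback', delivery: 'qstash' },
};

// Before the fix: the bridge is filtered out in queue mode and never fires.
const queueBeforeFix = deliverableHooks([handlerOnlyBridge], 'queue').map((h) => h.id);
// After the fix: the same hook survives because it carries a webhook config.
const queueAfterFix = deliverableHooks([webhookBridge], 'queue').map((h) => h.id);
// Local mode was never affected.
const localBeforeFix = deliverableHooks([handlerOnlyBridge], 'local').map((h) => h.id);
```

This also shows why a local-mode integration test could pass while queue mode hung: the handler-only hook is still present in the local result.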

Fix — webhook transport plus two race hardenings:

  1. QStash webhook bridge: the hook now carries a webhook config (delivery: 'qstash', explicit because deliverWebhook defaults to plain fetch which the endpoint's QStash signature auth would reject) targeting the new /api/agent/webhooks/subagent-callback endpoint. The endpoint resolves the userId from the child operation's metadata (same trust chain as /run), reloads the child's final state from the coordinator, and runs the bridge. Local mode keeps the in-process handler. Both paths converge on the new AgentRuntimeService.completeSubAgentBridge (backfill parent's placeholder tool message → barrier-check → CAS → schedule resume). Non-2xx responses let QStash redeliver, covering transient DB/Redis failures.
  2. Park-time self-check: sub-agents are dispatched mid-step, so a fast child could complete before the parent's parked state was persisted — its resume attempt no-oped against the status guard with nothing left to retry. After the parked state and agent_operations row are persisted, the parent now re-runs the resume barrier once to recover any resume that raced the park.
  3. One-shot verify watchdog: when a completion finds the parent not yet resumable (scheduleVerifyOnHold), a delayed verifyAsyncToolBarrier re-check is scheduled (15s). It re-runs barrier + CAS without claiming the step lock, is idempotent, and never re-arms itself, so retries stay bounded at one per completion event. This covers transient failures around the last completion (a child dying between backfill and resume, a DB hiccup during the barrier read, a lost callback delivery). Pure sibling concurrency needs no extra cover: each completion checks the barrier only after committing its own backfill, so the last committer always sees every earlier one.
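A minimal in-memory sketch of the barrier, CAS, and park-time self-check described in the list above. All names here (`ParentOpSketch`, `tryResume`, `completeChild`, `park`) are hypothetical and only model the ordering; the real implementation persists state through the coordinator and database.

```typescript
type OpStatus = 'running' | 'waiting_for_async_tool' | 'resuming';

interface ParentOpSketch {
  status: OpStatus;
  pendingToolMessageIds: Set<string>; // placeholders not yet backfilled
  resumesScheduled: number;
}

// Compare-and-swap on status: only one caller can win the transition.
function casStatus(op: ParentOpSketch, from: OpStatus, to: OpStatus): boolean {
  if (op.status !== from) return false;
  op.status = to;
  return true;
}

// Barrier + CAS: resume only when every child has backfilled and the parent is parked.
function tryResume(op: ParentOpSketch): boolean {
  if (op.pendingToolMessageIds.size > 0) return false; // barrier not satisfied
  if (!casStatus(op, 'waiting_for_async_tool', 'resuming')) return false; // status guard
  op.resumesScheduled += 1;
  return true;
}

// A child completion commits its own backfill first, then checks the barrier.
function completeChild(op: ParentOpSketch, toolMessageId: string): boolean {
  op.pendingToolMessageIds.delete(toolMessageId);
  return tryResume(op);
}

// Parking persists the parked state, then re-runs the barrier once.
function park(op: ParentOpSketch): boolean {
  op.status = 'waiting_for_async_tool';
  return tryResume(op);
}

// Scenario: the only child finishes before the parent has finished parking.
const racedOp: ParentOpSketch = {
  status: 'running',
  pendingToolMessageIds: new Set(['tool-1']),
  resumesScheduled: 0,
};
const resumedByChild = completeChild(racedOp, 'tool-1'); // parent is not parked yet
const resumedByPark = park(racedOp); // the self-check recovers the raced resume
const resumedAgain = tryResume(racedOp); // a later re-check finds nothing to do
```

Because the resume is guarded by the status swap, a delayed re-check in the style of the verify watchdog can run the same `tryResume` path without risking a second resume.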

The webhook payload is trimmed to operationId/reason/status via eventFields: the endpoint reloads the child's final state from the coordinator, so the default payload (which ships the child's entire final answer via lastAssistantContent, plus any tool-produced attachments the shared lifecycle event extractor inlines) is dead weight.
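As a rough illustration of the trimming, the sketch below keeps only the listed fields of an event. `pickEventFields` and the sample event are hypothetical; how the real dispatcher applies `eventFields` is not shown in this PR description.

```typescript
// Keep only the named fields; everything else is left out of the payload.
function pickEventFields(
  event: Record<string, unknown>,
  fields: readonly string[],
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const field of fields) {
    if (field in event) out[field] = event[field];
  }
  return out;
}

// Hypothetical completion event, loosely based on the fields named above.
const fullEvent: Record<string, unknown> = {
  operationId: 'child-op-1',
  reason: 'done',
  status: 'completed',
  lastAssistantContent: 'A long final answer that the endpoint reloads anyway.',
  attachments: ['report.pdf'],
};

const trimmedPayload = pickEventFields(fullEvent, ['operationId', 'reason', 'status']);
const trimmedKeys = Object.keys(trimmedPayload).sort();
```

The endpoint only needs `operationId` to reload the child's final state, so the bulky fields add queue payload size without adding information.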

🧪 How to Test

  • Tested locally

  • Added/updated tests

  • No tests needed

  • src/server/routers/lambda/__tests__/integration/aiAgent/serverSubAgent.integration.test.ts — end-to-end park → sub-op → backfill → resume (local handler path) passes.

  • AgentRuntimeService.test.ts — 7 new cases: verify scheduling on not-yet-parked / unsatisfied barrier / terminal states; completeSubAgentBridge backfill from finalState, coordinator fallback (webhook path), error note on failure, resume despite backfill failure.

  • subAgentCallback.test.ts — new endpoint handler tests (validation, 401, happy path, defaults, 500-for-redelivery).

  • Full src/server/services/agentRuntime suite: 234 tests pass; type-check and eslint clean on touched files.

Queue-mode scenario to verify in a deployed environment: ask an agent to call N sub-agents in parallel (e.g. 3 weather lookups) — the parent should resume and produce the summary after all children finish.

📝 Additional Information

  • Reproduction characteristics: in queue mode the hang was deterministic (the bridge never fired), not a flaky race; the screenshot symptom is all sub-agent calls showing completed with the parent never producing its final answer.
  • execAgent's hetero completionWebhook extraction (hooks.find(h => h.type === 'onComplete')?.webhook) is unaffected: thread hooks precede the bridge in the hook array and the first onComplete match still has no webhook.
  • QStash redelivery of the callback is safe: the backfill is idempotent and the resume is CAS-guarded.
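The redelivery argument can be sketched with an in-memory stand-in. `callbackStore`, `runBridgeSketch`, and `handleCallbackSketch` are hypothetical; the status codes mirror the cases listed for the endpoint tests (validation, 401, happy path, 500-for-redelivery), with 400 for validation being an assumption.

```typescript
// Hypothetical in-memory stand-ins for the parent's tool messages and status.
const callbackStore = {
  backfilled: new Map<string, string>(),
  parentStatus: 'waiting_for_async_tool',
  resumes: 0,
  attempts: 0,
};

function runBridgeSketch(operationId: string): void {
  callbackStore.attempts += 1;
  // Simulate a transient failure on the first delivery only.
  if (callbackStore.attempts === 1) throw new Error('transient database failure');
  // Idempotent backfill: replaying writes the same key with the same value.
  callbackStore.backfilled.set(operationId, 'final answer');
  // CAS-style guard: only the first successful delivery schedules a resume.
  if (callbackStore.parentStatus === 'waiting_for_async_tool') {
    callbackStore.parentStatus = 'resuming';
    callbackStore.resumes += 1;
  }
}

function handleCallbackSketch(body: { operationId?: string }, signatureValid: boolean): number {
  if (!signatureValid) return 401;
  if (!body.operationId) return 400; // assumed validation status
  try {
    runBridgeSketch(body.operationId);
    return 200;
  } catch {
    return 500; // non-2xx asks the queue to redeliver
  }
}

// One failed delivery, one successful retry, one duplicate delivery.
const deliveryStatuses = [1, 2, 3].map(() =>
  handleCallbackSketch({ operationId: 'child-op' }, true),
);
```

The duplicate third delivery is acknowledged with 200 but changes nothing, which is the property that makes at-least-once delivery acceptable here.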

🤖 Generated with Claude Code

@sourcery-ai sourcery-ai Bot left a comment

Contributor


Sorry @arvinxx, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

@vercel

vercel Bot commented Jun 9, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| lobehub | Ready | Preview, Comment | Jun 10, 2026 6:06am |


@dosubot dosubot Bot added labels Jun 9, 2026:

  • size:L (This PR changes 100-499 lines, ignoring generated files.)
  • deployment:server (Server-side database mode)
  • feature:agent (Assistant/Agent configuration and behavior)

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 65a308785c


Comment thread apps/server/src/services/agentRuntime/AgentRuntimeService.ts Outdated
Comment on lines +3112 to +3115
```ts
webhook: {
  body: { parentOperationId, threadId, toolMessageId },
  delivery: 'qstash' as const,
  // Keep the payload lean: the endpoint reloads the child's final state
```


P2: Avoid unsigned fallback for sub-agent callback webhooks

For this QStash-authenticated endpoint, a publishJSON failure or missing QSTASH_TOKEN currently falls through HookDispatcher.deliverWebhook to a plain unsigned fetch; qstashAuth rejects that request, and fetchDeliver swallows the non-2xx response. In that failure mode the sub-agent completion callback is silently lost instead of being retried, leaving the parent parked, so this hook needs to fail hard or otherwise avoid the unsigned fallback.
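The failure mode in this finding can be sketched as follows. `deliverWebhookSketch` and its callbacks are hypothetical and are not the repository's `HookDispatcher.deliverWebhook`; the `fallback: 'none'` option name is taken from a later commit in this PR, while the `'fetch'` default value is an assumption.

```typescript
type DeliveryOutcome = 'delivered' | 'silently-dropped' | 'failed';

interface WebhookConfigSketch {
  delivery: 'fetch' | 'qstash';
  fallback?: 'fetch' | 'none'; // assumed default: fall back to plain fetch
}

function deliverWebhookSketch(
  config: WebhookConfigSketch,
  publishToQueue: () => boolean, // true when the queue accepted the message
  unsignedFetch: () => number, // returns the HTTP status of a plain fetch
): DeliveryOutcome {
  if (config.delivery === 'qstash') {
    if (publishToQueue()) return 'delivered';
    // Opt-out: surface the publish failure instead of trying an unsigned request.
    if (config.fallback === 'none') return 'failed';
  }
  const httpStatus = unsignedFetch();
  // A swallowed non-2xx response means the callback is lost without a retry.
  return httpStatus >= 200 && httpStatus < 300 ? 'delivered' : 'silently-dropped';
}

const publishFails = () => false;
const signedEndpointRejects = () => 401; // signature auth rejects unsigned requests

const withFallback = deliverWebhookSketch(
  { delivery: 'qstash' },
  publishFails,
  signedEndpointRejects,
);
const withoutFallback = deliverWebhookSketch(
  { delivery: 'qstash', fallback: 'none' },
  publishFails,
  signedEndpointRejects,
);
```

With the fallback in place the publish failure is masked as a dropped 401; with the opt-out it becomes a visible failure the caller can log or escalate.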


@codecov

codecov Bot commented Jun 9, 2026


Codecov Report

❌ Patch coverage is 81.90955% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.64%. Comparing base (1ed93b6) to head (e648c95).
⚠️ Report is 2 commits behind head on canary.

Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           canary   #15620      +/-   ##
==========================================
+ Coverage   67.63%   67.64%   +0.01%
==========================================
  Files        3353     3354       +1
  Lines      338269   338429     +160
  Branches    35248    29500    -5748
==========================================
+ Hits       228786   228935     +149
- Misses     109292   109303      +11
  Partials      191      191
```
| Flag | Coverage Δ |
| --- | --- |
| app | 60.19% <81.90%> (+0.02%) ⬆️ |
| database | 98.12% <ø> (ø) |
| packages/agent-manager-runtime | 49.69% <ø> (ø) |
| packages/agent-runtime | 81.06% <ø> (ø) |
| packages/app-config | 44.58% <ø> (ø) |
| packages/builtin-tool-lobe-agent | 20.07% <ø> (ø) |
| packages/context-engine | 84.12% <ø> (ø) |
| packages/conversation-flow | 91.29% <ø> (ø) |
| packages/device-gateway-client | 90.18% <ø> (ø) |
| packages/env | 11.42% <ø> (ø) |
| packages/eval-dataset-parser | 95.15% <ø> (ø) |
| packages/eval-rubric | 76.11% <ø> (ø) |
| packages/file-loaders | 87.89% <ø> (ø) |
| packages/locales | 0.87% <ø> (ø) |
| packages/memory-user-memory | 74.99% <ø> (ø) |
| packages/model-bank | 99.99% <ø> (ø) |
| packages/model-runtime | 84.27% <ø> (ø) |
| packages/prompts | 72.51% <ø> (ø) |
| packages/python-interpreter | 92.90% <ø> (ø) |
| packages/ssrf-safe-fetch | 0.00% <ø> (ø) |
| packages/trpc | 40.43% <ø> (ø) |
| packages/types | 35.18% <ø> (ø) |
| packages/utils | 85.03% <ø> (ø) |
| packages/web-crawler | 88.08% <ø> (ø) |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Components | Coverage Δ |
| --- | --- |
| Store | 68.41% <ø> (ø) |
| Services | 54.25% <ø> (ø) |
| Server | 97.15% <100.00%> (+0.11%) ⬆️ |
| Libs | 54.03% <ø> (-0.17%) ⬇️ |
| Utils | 82.08% <ø> (ø) |

arvinxx and others added 4 commits June 10, 2026 07:11
…ueue mode

The callSubAgent completion bridge was a handler-only hook, which lives in
process memory: in queue mode (AGENT_RUNTIME_MODE=queue) HookDispatcher only
delivers webhook-configured hooks, so the bridge never fired — the parent op
stayed parked in waiting_for_async_tool forever after all sub-agents finished.

- Give the bridge hook a webhook config (delivery: qstash) targeting the new
  /api/agent/webhooks/subagent-callback endpoint; local mode keeps the
  in-process handler. Both paths converge on
  AgentRuntimeService.completeSubAgentBridge (backfill + barrier/CAS resume).
- Park-time self-check: after the parked state and operation row are
  persisted, re-run the resume barrier once to recover children that
  completed before the parent finished parking.
- One-shot verify watchdog: when a completion finds the parent not yet
  resumable, schedule a delayed verifyAsyncToolBarrier re-check (no step
  lock, CAS-idempotent, never re-arms).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…p worker

Post-rebase adaptation to canary's runtime restructure (#15609):

- Route the webhook bridge through AiAgentService (like the /run step
  worker) so the runtime's models stay workspace-scoped — a bare
  AgentRuntimeService would be personal-scoped and the tool-message
  backfill / resume barrier could miss workspace-scoped rows.
- Extract SubAgentBridgeParams into agentRuntime/types and add the
  completeSubAgentBridge passthrough next to executeStep.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…failure

Address two review findings on the resume bridge:

- completeSubAgentBridge now checks updateToolMessage's { success } result
  (it swallows transaction errors instead of throwing) and propagates all
  infrastructure failures. The webhook endpoint then returns non-2xx so
  QStash redelivers the whole bridge — previously a failed backfill was
  acked with 200 and the parent stayed parked forever, since the verify
  recheck only re-reads the barrier and cannot retry the backfill.
- New AgentHookWebhook.fallback: 'none' opts a qstash-delivered hook out of
  the unsigned plain-fetch fallback, which can never authenticate against a
  QStash-signed endpoint and only masked publish failures as silently
  dropped 401s. The bridge hook uses it; dispatch escalates such delivery
  failures to console.error instead of the debug namespace.
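The first finding above (a store call that reports failure through its return value rather than by throwing) can be sketched like this. `UpdateToolMessageSketch`, `backfillOrThrow`, and `statusFor` are hypothetical names; only the `{ success }` result shape and the "non-2xx triggers redelivery" behavior come from the commit message.

```typescript
interface UpdateResultSketch {
  success: boolean;
}

// Hypothetical stand-in for a store call that swallows transaction errors
// and reports them via `success` instead of throwing.
type UpdateToolMessageSketch = (id: string, content: string) => UpdateResultSketch;

// Turn a swallowed failure back into an exception so the caller cannot ack it.
function backfillOrThrow(update: UpdateToolMessageSketch, id: string, content: string): void {
  const { success } = update(id, content);
  if (!success) throw new Error(`backfill failed for tool message ${id}`);
}

// Map the bridge outcome to the status code the webhook endpoint would return.
function statusFor(action: () => void): number {
  try {
    action();
    return 200;
  } catch {
    return 500; // non-2xx lets the queue redeliver the whole bridge
  }
}

const failingUpdate: UpdateToolMessageSketch = () => ({ success: false });
const workingUpdate: UpdateToolMessageSketch = () => ({ success: true });

const failedBackfillStatus = statusFor(() => backfillOrThrow(failingUpdate, 'tool-1', 'answer'));
const okBackfillStatus = statusFor(() => backfillOrThrow(workingUpdate, 'tool-1', 'answer'));
```

Without the explicit `success` check, the failing case would return 200 and the delivery would be acknowledged with the placeholder message still unfilled.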

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@arvinxx arvinxx merged commit fdb529d into canary Jun 10, 2026
34 of 35 checks passed
@arvinxx arvinxx deleted the fix/server-subagent-resume-bridge branch June 10, 2026 08:00
arvinxx added a commit that referenced this pull request Jun 10, 2026
# 🚀 LobeHub Release (20260610)

**Release Date:** June 10, 2026  
**Since v2.2.2:** 131 merged PRs · 13 contributors

> This weekly release strengthens agent collaboration across cloud,
desktop, CLI, and workspace flows, with steadier runtime behavior and a
broader foundation for workspace-scoped data.

---

## ✨ Highlights

- **Agent execution across devices** — Unifies per-device working
directories, project skill discovery, and sub-agent suspend/resume
behavior across server, QStash, and device RPC flows. (#15543, #15566,
#15481, #15620, #15591)
- **Connector and sandbox platform** — Expands connector permissions,
custom OAuth MCP connector onboarding, sandbox provider support, and
user-uploaded file sync into cloud sandbox runs. (#15463, #15546,
#15184, #15550)
- **Desktop and CLI reliability** — Fixes desktop cold-start,
auto-update, Windows build, CLI skill discovery, and `lh connect` agent
dispatch paths. (#15547, #15525, #15527, #15562, #15632, #15634)
- **Pages and sharing** — Refreshes topic sharing, improves Page Editor
layout behavior, and routes Page Agent tool execution through the
server-side editor path. (#15581, #15556, #15588, #15023, #15610)
- **Model availability and provider updates** — Adds user-scoped LobeHub
model availability, Claude Fable 5, Qwen thinking preservation, and
MiniMax M3 updates. (#15590, #15639, #13494, #15376)

---

## 🏗️ Core Product & Architecture

### Agent Runtime & Heterogeneous Agents

- Improves sub-agent lifecycle handling, including async suspend/resume,
queue-mode QStash resume delivery, and blocking nested sub-agent calls.
(#15481, #15620, #15575)
- Stabilizes heterogeneous agent ingestion and streaming with raw stream
dumps, per-turn usage, image forwarding on regenerate, and
duplicate-text fixes. (#15602, #15577, #15592, #15585)
- Adds execution-device and working-directory controls across device
RPC, legacy defaults, and remote-spawned Claude Code sessions. (#15543,
#15566, #15591, #15572)
- Improves runtime diagnostics and compatibility, including Gemini
multimodal output capture, abort stream semantics, and trace quality
analysis. (#15535, #13677, #15508)

---

## 📱 Platforms, Integrations & UX

### Connectors, Sandbox & Tools

- Ships API-level connector tool permissions, custom OAuth MCP connector
onboarding, and connector-first runtime execution. (#15463, #15546)
- Adds sandbox provider support, cloud sandbox file sync, and safer
external URL file input handling with SSRF validation. (#15184, #15550,
#12657)
- Improves tool visibility and execution with pinned app-fixed tools,
ANSI output rendering, gateway-tunneled MCP calls, and automatic
headless tool runs. (#15509, #15516, #15469, #15492)

### Desktop, CLI & Web UX

- Restores desktop startup and reload behavior, preserves IPC error
causes, and keeps the tab bar new-tab action visible across routes.
(#15547, #15597, #15638)
- Fixes desktop update and build stability for browser quit guards,
macOS update signing, and Windows Visual Studio detection. (#15525,
#15527, #15562)
- Shows the plan-limit upgrade UI on desktop builds. (#15628)
- Adds the Agent Run delivery checker and fixes CLI device dispatch plus
skill list/search output. (#15489, #15634, #15632)
- Refreshes onboarding, auth source preservation, topic UI states,
referral/Fable campaign copy, and chat-input control bar behavior.
(#15629, #15544, #15573, #15614, #15616, #15617, #15622, #15643)

---

## 🔒 Security, Reliability & Rollout Notes

- External URL file input now includes SSRF validation for safer Google
file handling. (#12657)
- Database workspace-scope migrations are part of this release;
self-hosted operators should run the normal migration path before
serving the updated app. (#15446, #15465, #15468, #15472)
- The release branch was re-cut from `canary` and includes the latest
`main` release-version commit so `v2.2.2` is the verified compare base.

---

## 👥 Contributors

@ONLY-yours, @sxjeru, @hardy-one, @xujingli, @hezhijie0327, @Coooolfan,
@arvinxx, @tjx666, @Innei, @rivertwilight, @rdmclin2, @cy948,
@AmAzing129

**Full Changelog**:
v2.2.2...release/weekly-20260610-recut-3

Labels

deployment:server Server-side database mode feature:agent Assistant/Agent configuration and behavior size:L This PR changes 100-499 lines, ignoring generated files.


1 participant