🐛 fix(server): restore sub-agent forking in QStash step worker #15609

Merged
arvinxx merged 2 commits into canary from fix/sub-agent-forking-in-step-worker
Jun 9, 2026

Conversation

@arvinxx arvinxx commented Jun 9, 2026

Member

💻 Change Type

  • 🐛 fix
  • ♻️ refactor

🔀 Description of Change

The bug. In QStash mode every agent step runs in a fresh HTTP request via the hono runStep handler, which built a bare AgentRuntimeService without the execSubAgent fork callback. The callback is an in-process closure owned by AiAgentService and never survives the queue boundary, so buildServerSubAgentRunner returned undefined → ctx.subAgent was undefined → lobe-agent.callSubAgent failed in cloud with:

SUB_AGENT_UNAVAILABLE — "Sub-agent execution is not available in this runtime."

The fix. Step through AiAgentService.executeStep instead of constructing a second bare runtime. AiAgentService already builds an internal AgentRuntimeService wired with the fork callback, so the step now runs on a runtime that carries execSubAgent. No duplicate runtime, no manual rebinding — this also respects the existing AiAgentService → AgentRuntimeService dependency direction (injecting the service the other way would be circular).
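
To make the difference concrete, here is a minimal sketch of the two wiring paths. It is not the code from this PR: the class bodies, the option shape, and the `runStepBefore` / `runStepAfter` helpers are hypothetical simplifications, and everything is kept synchronous for brevity (the real calls are presumably asynchronous).

```typescript
// Hypothetical, simplified shapes; only the class and callback names come from the PR text.
type ExecSubAgent = (params: { prompt: string }) => { output: string };

class AgentRuntimeService {
  private readonly execSubAgent?: ExecSubAgent;

  constructor(options: { execSubAgent?: ExecSubAgent } = {}) {
    this.execSubAgent = options.execSubAgent;
  }

  // A step that needs a sub-agent only works if the fork callback was injected.
  executeStep(prompt: string): string {
    if (!this.execSubAgent) {
      throw new Error('SUB_AGENT_UNAVAILABLE');
    }
    return this.execSubAgent({ prompt }).output;
  }
}

class AiAgentService {
  // Arrow field, so it can be handed over as a callback without .bind(this).
  execSubAgent: ExecSubAgent = ({ prompt }) => ({ output: `forked: ${prompt}` });

  // The internal runtime is built with the fork callback already wired in.
  private readonly runtime = new AgentRuntimeService({ execSubAgent: this.execSubAgent });

  executeStep(prompt: string): string {
    return this.runtime.executeStep(prompt);
  }
}

// Before: the step worker built a bare runtime per request, so the callback was missing.
const runStepBefore = (prompt: string): string => new AgentRuntimeService().executeStep(prompt);

// After: the step worker goes through AiAgentService and reuses its wired runtime.
const runStepAfter = (prompt: string): string => new AiAgentService().executeStep(prompt);
```

In this sketch `runStepBefore` throws because nothing supplied the callback, while `runStepAfter` reaches the fork path through the service's own runtime.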

Refactor (folded in). To separate the "task" concept from "sub-agent":

  • Renamed the internal execSubAgentTask → execSubAgent (method, runtime/tool-execution context fields, options, private callback, and the ExecSubAgent{Params,Result} types).
  • Made the method an auto-bound arrow field so it no longer needs .bind(this) when passed as a callback.
  • The external lambda procedure name (execSubAgentTask) and the client service are intentionally left unchanged.
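
The auto-bound arrow field point can be illustrated with a small standalone sketch. The class names below are hypothetical and unrelated to the real services; the sketch only shows why a prototype method loses `this` when passed as a callback and an arrow field does not.

```typescript
// Hypothetical classes, used only to contrast the two declaration styles.
class WithMethod {
  private readonly label = 'method';

  // Prototype method: `this` depends on how the function is called.
  exec(): string {
    return this.label;
  }
}

class WithArrowField {
  private readonly label = 'arrow';

  // Arrow field: `this` is captured once, when the instance is constructed.
  exec = (): string => this.label;
}

// Calls the callback detached from any instance, as a runtime would.
const call = (cb: () => string): string => cb();

const m = new WithMethod();
const a = new WithArrowField();

const fromArrow = call(a.exec); // works without .bind
const fromBound = call(m.exec.bind(m)); // the method needs an explicit bind
```

Passing `m.exec` without `.bind(m)` would throw a TypeError here, because class bodies run in strict mode and `this` is `undefined` inside the detached call.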

🧪 How to Test

  • Added/updated tests

runStep.test.ts now asserts stepping goes through AiAgentService (which preserves the fork callback) and stays workspace-scoped. Verified:

  • bun run type-check — clean across the repo
  • Affected server suites — 156 passed (runStep, RuntimeExecutors, execGroupSubAgentTask, lambda aiAgent.execGroupSubAgentTask, task integration)
  • Client store suites exercising the kept execSubAgentTask client method — 33 passed

📝 Additional Information

No API/contract change: the tRPC procedure name and client-facing types are untouched, so this is server-internal only. No migration needed.

In QStash mode every agent step runs in a fresh HTTP request via the
hono `runStep` handler, which built a bare AgentRuntimeService without
the `execSubAgent` fork callback. As a result `lobe-agent.callSubAgent`
failed with SUB_AGENT_UNAVAILABLE in cloud (the in-process callback
never survives the queue boundary).

Step through AiAgentService.executeStep instead, reusing its internal
runtime that is already wired with the fork callback — no second runtime,
no manual rebinding.

Also rename the internal `execSubAgentTask` → `execSubAgent` (method,
runtime/tool context fields, options, ExecSubAgent{Params,Result} types)
to separate the "task" concept from "sub-agent", and make the method an
auto-bound arrow field so it no longer needs `.bind(this)`. The external
lambda procedure name (`execSubAgentTask`) and the client service are
left unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@vercel

vercel Bot commented Jun 9, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
lobehub Ready Ready Preview, Comment Jun 9, 2026 5:02pm

@sourcery-ai sourcery-ai Bot left a comment

Contributor

Sorry @arvinxx, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

@dosubot dosubot Bot added the size:L (This PR changes 100-499 lines, ignoring generated files.) and feature:agent (Assistant/Agent configuration and behavior) labels Jun 9, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 98cb1346cb

Comment thread packages/types/src/agentExecution/index.ts
@codecov

codecov Bot commented Jun 9, 2026


Codecov Report

❌ Patch coverage is 68.57143% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.14%. Comparing base (af3f0ea) to head (2dce166).
⚠️ Report is 3 commits behind head on canary.

Additional details and impacted files
@@            Coverage Diff            @@
##           canary   #15609     +/-   ##
=========================================
  Coverage   67.14%   67.14%             
=========================================
  Files        3353     3353             
  Lines      338505   338506      +1     
  Branches    35060    30383   -4677     
=========================================
+ Hits       227278   227281      +3     
+ Misses     111036   111034      -2     
  Partials      191      191             
Flag Coverage Δ
app 60.14% <68.57%> (+<0.01%) ⬆️
database 89.90% <ø> (ø)
packages/agent-manager-runtime 49.69% <ø> (ø)
packages/agent-runtime 81.06% <ø> (ø)
packages/app-config 44.58% <ø> (ø)
packages/builtin-tool-lobe-agent 18.52% <ø> (ø)
packages/context-engine 84.12% <ø> (ø)
packages/conversation-flow 91.29% <ø> (ø)
packages/device-gateway-client 90.18% <ø> (ø)
packages/env 11.42% <ø> (ø)
packages/eval-dataset-parser 95.15% <ø> (ø)
packages/eval-rubric 76.11% <ø> (ø)
packages/fetch-sse 87.28% <ø> (ø)
packages/file-loaders 87.89% <ø> (ø)
packages/locales 0.87% <ø> (ø)
packages/memory-user-memory 74.99% <ø> (ø)
packages/model-bank 99.99% <ø> (ø)
packages/model-runtime 84.23% <ø> (ø)
packages/prompts 72.51% <ø> (ø)
packages/python-interpreter 92.90% <ø> (ø)
packages/ssrf-safe-fetch 0.00% <ø> (ø)
packages/trpc 40.43% <ø> (ø)
packages/types 35.15% <ø> (ø)
packages/utils 85.03% <ø> (ø)
packages/web-crawler 88.08% <ø> (ø)

Components Coverage Δ
Store 68.24% <ø> (ø)
Services 54.25% <ø> (ø)
Server 97.03% <100.00%> (ø)
Libs 54.19% <ø> (ø)
Utils 82.08% <ø> (ø)

…elegate

`execSubAgent` was a loose top-level option on AgentRuntimeService, which
hid that it is not ordinary config but an upward call: the low-level
runtime, mid-step, triggering a high-level pipeline that lives in
AiAgentService (the layer above it).

Introduce `AgentRuntimeDelegate` as the single named home for these
upward-call capabilities, and inject it as `delegate: { execSubAgent }`.
The interface doc states the convention so future "runtime must trigger a
higher-layer pipeline" capabilities land in the same place instead of
sprawling as ad-hoc options.

Scope is deliberately the injection surface (options + service field +
AiAgentService wiring). The downstream executor/tool context keeps its
flat `execSubAgent` field — the tool runner wants the unpacked capability,
not the whole delegate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
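
A rough sketch of the delegate injection described in the commit message above. Only the names `AgentRuntimeDelegate`, `AgentRuntimeService`, `delegate`, and `execSubAgent` come from the text; the option and context interfaces, the `buildToolContext` helper, and the synchronous signatures are assumptions made for illustration.

```typescript
// Hypothetical, simplified types; the real ones are presumably async and richer.
type ExecSubAgent = (params: { prompt: string }) => { output: string };

/**
 * Upward calls: capabilities the low-level runtime triggers in the layer above it.
 * Future "runtime must trigger a higher-layer pipeline" hooks would be added here.
 */
interface AgentRuntimeDelegate {
  execSubAgent?: ExecSubAgent;
}

// Assumed option shape: the delegate is the single injection point.
interface AgentRuntimeServiceOptions {
  delegate?: AgentRuntimeDelegate;
}

// Assumed tool context: it keeps the flat capability, not the whole delegate.
interface ToolExecutionContext {
  execSubAgent: ExecSubAgent | undefined;
}

class AgentRuntimeService {
  private readonly delegate: AgentRuntimeDelegate;

  constructor(options: AgentRuntimeServiceOptions = {}) {
    this.delegate = options.delegate ?? {};
  }

  // Unpack the delegate so downstream code never sees the injection surface.
  buildToolContext(): ToolExecutionContext {
    return { execSubAgent: this.delegate.execSubAgent };
  }
}

// The higher layer injects its pipeline as `delegate: { execSubAgent }`.
const execSubAgent: ExecSubAgent = ({ prompt }) => ({ output: prompt.toUpperCase() });
const wiredRuntime = new AgentRuntimeService({ delegate: { execSubAgent } });
const bareRuntime = new AgentRuntimeService();
```

Keeping the tool context flat means the tool runner can check one field for availability, while the constructor options stay the only place where new upward calls are declared.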
@arvinxx arvinxx merged commit 4b5e001 into canary Jun 9, 2026
34 of 35 checks passed
@arvinxx arvinxx deleted the fix/sub-agent-forking-in-step-worker branch June 9, 2026 16:41
@arvinxx arvinxx mentioned this pull request Jun 9, 2026
arvinxx added a commit that referenced this pull request Jun 9, 2026
…p worker

Post-rebase adaptation to canary's runtime restructure (#15609):

- Route the webhook bridge through AiAgentService (like the /run step
  worker) so the runtime's models stay workspace-scoped — a bare
  AgentRuntimeService would be personal-scoped and the tool-message
  backfill / resume barrier could miss workspace-scoped rows.
- Extract SubAgentBridgeParams into agentRuntime/types and add the
  completeSubAgentBridge passthrough next to executeStep.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
arvinxx added a commit that referenced this pull request Jun 10, 2026
…ueue mode (#15620)

* 🐛 fix(agent): deliver sub-agent resume bridge via QStash webhook in queue mode

The callSubAgent completion bridge was a handler-only hook, which lives in
process memory: in queue mode (AGENT_RUNTIME_MODE=queue) HookDispatcher only
delivers webhook-configured hooks, so the bridge never fired — the parent op
stayed parked in waiting_for_async_tool forever after all sub-agents finished.

- Give the bridge hook a webhook config (delivery: qstash) targeting the new
  /api/agent/webhooks/subagent-callback endpoint; local mode keeps the
  in-process handler. Both paths converge on
  AgentRuntimeService.completeSubAgentBridge (backfill + barrier/CAS resume).
- Park-time self-check: after the parked state and operation row are
  persisted, re-run the resume barrier once to recover children that
  completed before the parent finished parking.
- One-shot verify watchdog: when a completion finds the parent not yet
  resumable, schedule a delayed verifyAsyncToolBarrier re-check (no step
  lock, CAS-idempotent, never re-arms).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* 📝 docs(agent): correct verify-watchdog rationale comment

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* 📝 docs(agent): clarify eventFields trimming rationale

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* ♻️ refactor(agent): align subagent-callback with workspace-scoped step worker

Post-rebase adaptation to canary's runtime restructure (#15609):

- Route the webhook bridge through AiAgentService (like the /run step
  worker) so the runtime's models stay workspace-scoped — a bare
  AgentRuntimeService would be personal-scoped and the tool-message
  backfill / resume barrier could miss workspace-scoped rows.
- Extract SubAgentBridgeParams into agentRuntime/types and add the
  completeSubAgentBridge passthrough next to executeStep.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* 🐛 fix(agent): fail sub-agent callback loudly on backfill or delivery failure

Address two review findings on the resume bridge:

- completeSubAgentBridge now checks updateToolMessage's { success } result
  (it swallows transaction errors instead of throwing) and propagates all
  infrastructure failures. The webhook endpoint then returns non-2xx so
  QStash redelivers the whole bridge — previously a failed backfill was
  acked with 200 and the parent stayed parked forever, since the verify
  recheck only re-reads the barrier and cannot retry the backfill.
- New AgentHookWebhook.fallback: 'none' opts a qstash-delivered hook out of
  the unsigned plain-fetch fallback, which can never authenticate against a
  QStash-signed endpoint and only masked publish failures as silently
  dropped 401s. The bridge hook uses it; dispatch escalates such delivery
  failures to console.error instead of the debug namespace.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
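
The "fail loudly" behaviour from the last commit in this list can be sketched as follows. This is an assumption-heavy illustration, not the merged code: the function signatures, the `UpdateResult` shape, the `handleSubAgentCallback` helper, and the status codes are hypothetical, and the flow is synchronous for brevity. Only `completeSubAgentBridge`, `updateToolMessage`, and the `{ success }` result are named in the commit message.

```typescript
// Hypothetical result shape: the updater reports failure instead of throwing.
interface UpdateResult {
  success: boolean;
}

type UpdateToolMessage = (toolMessageId: string, content: string) => UpdateResult;

// Backfill the parent's tool message, then (in the real code) run the resume barrier.
function completeSubAgentBridge(
  updateToolMessage: UpdateToolMessage,
  toolMessageId: string,
  content: string,
): void {
  const { success } = updateToolMessage(toolMessageId, content);
  // A swallowed transaction error surfaces only through `success`, so check it explicitly.
  if (!success) {
    throw new Error(`backfill failed for tool message ${toolMessageId}`);
  }
}

// Hypothetical webhook handler: a non-2xx status asks the queue to redeliver.
function handleSubAgentCallback(
  updateToolMessage: UpdateToolMessage,
  body: { toolMessageId: string; content: string },
): { status: number } {
  try {
    completeSubAgentBridge(updateToolMessage, body.toolMessageId, body.content);
    return { status: 200 };
  } catch {
    // Acknowledging with 200 here would leave the parent parked forever.
    return { status: 500 };
  }
}

const okStatus = handleSubAgentCallback(() => ({ success: true }), {
  toolMessageId: 'tool-1',
  content: 'done',
}).status;

const failedStatus = handleSubAgentCallback(() => ({ success: false }), {
  toolMessageId: 'tool-1',
  content: 'done',
}).status;
```

In this sketch a failed backfill maps to a 500 response, which is what lets a retrying queue redeliver the whole bridge instead of silently dropping it.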
@arvinxx arvinxx mentioned this pull request Jun 10, 2026