🐛 fix(server): restore sub-agent forking in QStash step worker#15609
Conversation
In QStash mode every agent step runs in a fresh HTTP request via the
hono `runStep` handler, which built a bare AgentRuntimeService without
the `execSubAgent` fork callback. As a result `lobe-agent.callSubAgent`
failed with SUB_AGENT_UNAVAILABLE in cloud (the in-process callback
never survives the queue boundary).
Step through AiAgentService.executeStep instead, reusing its internal
runtime that is already wired with the fork callback — no second runtime,
no manual rebinding.
Also rename the internal `execSubAgentTask` → `execSubAgent` (method,
runtime/tool context fields, options, ExecSubAgent{Params,Result} types)
to separate the "task" concept from "sub-agent", and make the method an
auto-bound arrow field so it no longer needs `.bind(this)`. The external
lambda procedure name (`execSubAgentTask`) and the client service are
left unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 98cb1346cb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## canary #15609 +/- ##
=========================================
Coverage 67.14% 67.14%
=========================================
Files 3353 3353
Lines 338505 338506 +1
Branches 35060 30383 -4677
=========================================
+ Hits 227278 227281 +3
+ Misses 111036 111034 -2
Partials 191 191
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
…elegate
`execSubAgent` was a loose top-level option on AgentRuntimeService, which
hid that it is not ordinary config but an upward call: the low-level
runtime, mid-step, triggering a high-level pipeline that lives in
AiAgentService (the layer above it).
Introduce `AgentRuntimeDelegate` as the single named home for these
upward-call capabilities, and inject it as `delegate: { execSubAgent }`.
The interface doc states the convention so future "runtime must trigger a
higher-layer pipeline" capabilities land in the same place instead of
sprawling as ad-hoc options.
Scope is deliberately the injection surface (options + service field +
AiAgentService wiring). The downstream executor/tool context keeps its
flat `execSubAgent` field — the tool runner wants the unpacked capability,
not the whole delegate.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…p worker Post-rebase adaptation to canary's runtime restructure (#15609): - Route the webhook bridge through AiAgentService (like the /run step worker) so the runtime's models stay workspace-scoped — a bare AgentRuntimeService would be personal-scoped and the tool-message backfill / resume barrier could miss workspace-scoped rows. - Extract SubAgentBridgeParams into agentRuntime/types and add the completeSubAgentBridge passthrough next to executeStep. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ueue mode (#15620) * 🐛 fix(agent): deliver sub-agent resume bridge via QStash webhook in queue mode The callSubAgent completion bridge was a handler-only hook, which lives in process memory: in queue mode (AGENT_RUNTIME_MODE=queue) HookDispatcher only delivers webhook-configured hooks, so the bridge never fired — the parent op stayed parked in waiting_for_async_tool forever after all sub-agents finished. - Give the bridge hook a webhook config (delivery: qstash) targeting the new /api/agent/webhooks/subagent-callback endpoint; local mode keeps the in-process handler. Both paths converge on AgentRuntimeService.completeSubAgentBridge (backfill + barrier/CAS resume). - Park-time self-check: after the parked state and operation row are persisted, re-run the resume barrier once to recover children that completed before the parent finished parking. - One-shot verify watchdog: when a completion finds the parent not yet resumable, schedule a delayed verifyAsyncToolBarrier re-check (no step lock, CAS-idempotent, never re-arms). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * 📝 docs(agent): correct verify-watchdog rationale comment Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * 📝 docs(agent): clarify eventFields trimming rationale Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * ♻️ refactor(agent): align subagent-callback with workspace-scoped step worker Post-rebase adaptation to canary's runtime restructure (#15609): - Route the webhook bridge through AiAgentService (like the /run step worker) so the runtime's models stay workspace-scoped — a bare AgentRuntimeService would be personal-scoped and the tool-message backfill / resume barrier could miss workspace-scoped rows. - Extract SubAgentBridgeParams into agentRuntime/types and add the completeSubAgentBridge passthrough next to executeStep. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * 🐛 fix(agent): fail sub-agent callback loudly on backfill or delivery failure Address two review findings on the resume bridge: - completeSubAgentBridge now checks updateToolMessage's { success } result (it swallows transaction errors instead of throwing) and propagates all infrastructure failures. The webhook endpoint then returns non-2xx so QStash redelivers the whole bridge — previously a failed backfill was acked with 200 and the parent stayed parked forever, since the verify recheck only re-reads the barrier and cannot retry the backfill. - New AgentHookWebhook.fallback: 'none' opts a qstash-delivered hook out of the unsigned plain-fetch fallback, which can never authenticate against a QStash-signed endpoint and only masked publish failures as silently dropped 401s. The bridge hook uses it; dispatch escalates such delivery failures to console.error instead of the debug namespace. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
💻 Change Type
🔀 Description of Change
The bug. In QStash mode every agent step runs in a fresh HTTP request via the hono
runStephandler, which built a bareAgentRuntimeServicewithout theexecSubAgentfork callback. The callback is an in-process closure owned byAiAgentServiceand never survives the queue boundary, sobuildServerSubAgentRunnerreturnedundefined→ctx.subAgentwas undefined →lobe-agent.callSubAgentfailed in cloud with:The fix. Step through
AiAgentService.executeStepinstead of constructing a second bare runtime.AiAgentServicealready builds an internalAgentRuntimeServicewired with the fork callback, so the step now runs on a runtime that carriesexecSubAgent. No duplicate runtime, no manual rebinding — this also respects the existingAiAgentService → AgentRuntimeServicedependency direction (injecting the service the other way would be circular).Refactor (folded in). To separate the "task" concept from "sub-agent":
execSubAgentTask→execSubAgent(method, runtime/tool-execution context fields, options, private callback, and theExecSubAgent{Params,Result}types)..bind(this)when passed as a callback.execSubAgentTask) and the client service are intentionally left unchanged.🧪 How to Test
runStep.test.tsnow asserts stepping goes throughAiAgentService(which preserves the fork callback) and stays workspace-scoped. Verified:bun run type-check— clean across the reporunStep,RuntimeExecutors,execGroupSubAgentTask, lambdaaiAgent.execGroupSubAgentTask, task integration)execSubAgentTaskclient method — 33 passed📝 Additional Information
No API/contract change: the tRPC procedure name and client-facing types are untouched, so this is server-internal only. No migration needed.