[codex] fix monitor notifications for subagents#3933
Conversation
Route subagent-owned monitor notifications back into the owning agent instead of the parent queue. Keep owner agents alive while their monitors can still produce notifications, and clean up owned monitors silently when the owner exits. Fixes #3925 Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
There was a problem hiding this comment.
Pull request overview
This PR fixes misrouted Monitor tool notifications when the monitor is started by a subagent, by scoping monitor notifications to the owning agent and ensuring subagents can idle-wait to consume monitor output even when the model produces a no-tool response.
Changes:
- Track
ownerAgentIdon monitor entries and route notifications via per-agent callbacks (instead of the shared parent callback). - Extend the headless reasoning loop to accept/await “external inputs” (SendMessage + owned Monitor notifications) between model rounds, including idle-wait behavior.
- Add cleanup hooks so owned monitor callbacks are unregistered and any still-running owned monitors are cancelled when the owning agent exits (including background, fork, and resumed agents).
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| packages/core/src/tools/monitor.ts | Records ownerAgentId (via agent context) on newly started monitors. |
| packages/core/src/tools/monitor.test.ts | Adds coverage for recording owner agent id when started within agent context. |
| packages/core/src/services/monitorRegistry.ts | Adds per-agent notification routing, owner-scoped cancellation, and owner-running checks. |
| packages/core/src/services/monitorRegistry.test.ts | Verifies owner-only routing, no fallback to parent, owner-running checks, and owner cancellation behavior. |
| packages/core/src/tools/agent/agent.ts | Registers/unregisters owned monitor notification callbacks and wires external-input drain + idle-wait for subagents (bg/fg/fork). |
| packages/core/src/tools/agent/agent.test.ts | Updates stubs and adds tests asserting monitor notifications are enqueued as external inputs and cleaned up. |
| packages/core/src/agents/runtime/agent-types.ts | Introduces AgentExternalInput union type (string or notification object). |
| packages/core/src/agents/runtime/agent-headless.ts | Adds setExternalMessageWaiter + setExternalMessageWaitPredicate plumbing into AgentCore options. |
| packages/core/src/agents/runtime/agent-headless.test.ts | Adds tests for idle-wait and draining notification inputs after a no-tool response. |
| packages/core/src/agents/runtime/agent-core.ts | Implements external-input draining/injection and idle-wait (with timeout/cancel/max-turns handling). |
| packages/core/src/agents/background-tasks.ts | Extends queued message handling to AgentExternalInput and adds waitForMessages + wakeup plumbing. |
| packages/core/src/agents/background-tasks.test.ts | Adds test coverage for waitForMessages. |
| packages/core/src/agents/background-agent-resume.ts | Wires resumed agents to wait for external inputs and routes owned monitor notifications into their queues. |
| packages/core/src/agents/background-agent-resume.test.ts | Adds coverage for owned monitor notification routing on resumed agents. |
Comments suppressed due to low confidence (1)
packages/core/src/services/monitorRegistry.ts:327
- MonitorRegistry.reset() clears
monitorsbut leaves the newagentNotificationCallbacksmap intact. Since resetBackgroundStateForSessionSwitch() calls monitorRegistry.reset() on session switches, stale per-agent callbacks can be retained across sessions (and potentially keep closures/registries alive longer than intended). Consider clearingagentNotificationCallbacksinside reset() as well to fully drop per-agent routing state when the registry is reset.
reset(): void {
if (this.monitors.size === 0) return;
for (const entry of this.monitors.values()) {
this.clearIdleTimer(entry);
if (entry.status === 'running') {
entry.abortController.abort();
}
}
this.monitors.clear();
// Notify subscribers that the registry's contents changed wholesale
// — without this, the dialog snapshot in `useBackgroundTaskView`
// would keep rendering the now-cleared rows until an unrelated
// register/settle event happens. Mirrors BackgroundShellRegistry /
// BackgroundTaskRegistry's reset paths.
this.fireStatusChange();
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Clear owner monitor callbacks on registry reset, avoid owner-helper allocation in the monitor idle-wait path, and guard waitable external-input queues against abort races after listener registration. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Align Monitor owner-context imports with the runtime agent context module added on main, and update the related test to use the current runWithAgentContext signature. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Collect owner monitor ids before cancelling so pruning terminal entries during cancellation cannot affect the iteration. Add coverage for cancelling owner monitors while retained terminal entries are pruned. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
wenshao
left a comment
There was a problem hiding this comment.
[Critical] packages/core/src/agents/runtime/agent-core.ts:502: runInAgentFrames() restores the subagent name/runtime view but not the agent-id async context. The deferred approval respond path resumes tool execution through this helper, so a subagent Monitor that requires approval can execute with getCurrentAgentId() === null and be registered as top-level instead of owned by the subagent. This is not anchorable to a changed diff line, but it should be fixed by restoring the logical owner agent id via runWithAgentContext(...) when re-entering deferred approval continuations.
— gpt-5.5 via Qwen Code /review
Wake subagent external-input waits when owner monitors are stopped silently, re-check wait predicates after empty wakes, and add coverage for foreground, fork, and resumed owner-monitor routing cleanup. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Make implicit-fork external input waits safe for overlapping waiters and persist external input kind for notification observability in subagent transcripts. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
tanzhenxin
left a comment
There was a problem hiding this comment.
Review
The owner-to-callback routing fix is correct on the paths I traced. MonitorEntry carries ownerAgentId, dispatch picks the per-agent callback, and all four launch paths plus the resume service install the same trio (callback registration, external-input drain/wait, idle-wait predicate) before execution and clear it in their cleanup. Registry-layer test coverage is genuinely thorough. Two low-severity observations on cleanup edges below — neither blocks merge.
1. Resume path leaks the agent notification callback when pre-runBody setup throws (severity: low · confidence: high)
In the resume service the agent's monitor notification callback gets registered before the awaited subagent-start hook runs, but the unregister lives in the runBody finally. If the start hook (or any other await before runBody is invoked) throws, the outer catch handles the failure without clearing the callback. The leaked entry sits in the per-agent callback map until the same agent id is reused on a future resume (which overwrites it) or the registry is reset. A monitor that somehow routes through the stale closure drops silently because queueExternalInput rejects on a non-running entry, so the impact is contained — but it's still a real entry leak in the callback map.
2. Cleanup ordering produces missing-callback warnings during normal teardown (severity: low · confidence: very high)
The owned-monitor cleanup closures clear the agent's notification callback first and then cancel the agent's monitors with notify: false. The notify: false flag only suppresses the explicit terminal notification — the cancellation still aborts the child process, which fires the abort handler that flushes any buffered partial-line output. That flush goes through the same dispatch path, finds no callback (we just removed it), and emits the missing-callback warning intended to surface real routing bugs. Functionally correct (the agent has finished, dropping the line is the right outcome), but a chatty monitor whose tail output lacks a final newline produces noisy warn logs on every clean shutdown of a subagent that owned a monitor.
Verdict
COMMENT — routing fix is solid on the traced paths; the two findings above are observability nits worth tightening but not blockers.
Avoid owner monitor lost wakeups before idle wait registration, make silent monitor cancellation suppress partial-line notification dispatch, and clean resumed owner monitor callbacks when setup fails before execution. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
wenshao
left a comment
There was a problem hiding this comment.
整体评估见下方逐行评论。架构方向(Monitor owner 路由 + per-agent notification + idle-wait 语义)正确,改动分层清晰,测试覆盖充分(包括 overlap waiters / empty-wake retry / max_turns 短路 / lifecycle wake / abort 竞态 / 触发 prune 的 cancellation)。
下面 8 条主要是健壮性与兼容性建议,没有阻塞性问题。其中 #1、#2 是隐式 breaking change,如果有 SDK 公开使用建议在 release notes 中显式登记;#3、#5 影响线上行为(双发 ROUND_END 风险 / cleanup 期间 warn 噪声),建议本 PR 内修。
LGTM after #1/#2 的兼容性声明 + #3/#5 的小调整;#4/#6/#7/#8 可作为 follow-up。
Restore the logical agent id when approved tool continuations re-enter agent frames, keep external message kinds backwards compatible, and document widened background task inputs. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
|
@tanzhenxin 感谢 review,这两点我逐一确认了:
|
Resolve the latest origin/main conflict in chat recording metadata by preserving both external input kind and fork lineage fields. Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
wenshao
left a comment
There was a problem hiding this comment.
— deepseek-v4-pro via Qwen Code /review
Close the resumed agent transcript writer when setup fails before the run body starts, and cancel owner monitors before unregistering their callbacks to avoid stale notification races.\n\nCo-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
wenshao
left a comment
There was a problem hiding this comment.
补充几条非阻塞性建议,主要是文档/一致性 nits,不影响 approve 结论。CI 失败是 StandaloneSessionPicker 在 Windows 的预存在 flake,与本 PR 无关。
| monitorRegistry.cancelRunningForOwner(agentId, { notify: false }); | ||
| monitorRegistry.setAgentNotificationCallback(agentId, undefined); | ||
| monitorRegistry.setAgentLifecycleCallback(agentId, undefined); | ||
| }; |
There was a problem hiding this comment.
[Suggestion] 这个 cleanup 闭包不是幂等的:调用方(foreground/background/fork 三处)目前都只在 finally 中调用一次,但如果未来某条路径意外重复触发(比如 catch 后又 finally),第二次调用会重新走一次 cancelRunningForOwner(无 running monitor 时 no-op)+ setAgent*Callback(undefined)(无副作用)。
Resume 路径已经用 cleanedUpOwnedMonitorNotifications flag 显式 gate(见 background-agent-resume.ts:682)。建议这里也对齐:
| }; | |
| let cleanedUp = false; | |
| return () => { | |
| if (cleanedUp) return; | |
| cleanedUp = true; | |
| monitorRegistry.cancelRunningForOwner(agentId, { notify: false }); | |
| monitorRegistry.setAgentNotificationCallback(agentId, undefined); | |
| monitorRegistry.setAgentLifecycleCallback(agentId, undefined); | |
| }; |
虽然当前调用点没有重复触发,但和 resume 路径保持同一不变量,能让未来读者不必逐路径推敲幂等性。
| entry.abortController.abort(); | ||
| this.dispatchOwnerLifecycleWake(entry); | ||
| return; | ||
| } |
There was a problem hiding this comment.
[Suggestion] silent 与 noisy 路径的 settle / abort 调用顺序刚好相反:
- silent:
settle('cancelled')→abort()→dispatchOwnerLifecycleWake - noisy:
abort()→ 状态二次检查 →settle('cancelled')→emitTerminalNotification
这个顺序差异是有意的(先 settle 把状态改成 'cancelled',后续 in-flight stdout 触发 emitEvent 会被状态守卫拦下,不会再走 owner notification 路径),但代码里没注释。建议在 silent 分支顶上加一行 why,避免后续维护者按 noisy 路径的「先 abort 再 settle」惯性整成统一顺序而引入回归:
| } | |
| if (options.notify === false) { | |
| // Silent path: settle before abort so any in-flight stdout that | |
| // races with abort hits the 'running' guard in emitEvent and is | |
| // dropped instead of routed through the owner notification | |
| // callback. Noisy path tolerates the reverse order because its | |
| // terminal notification is the goal. | |
| this.settle(entry, 'cancelled'); | |
| debugLogger.info(`Monitor cancelled: ${monitorId}`); | |
| entry.abortController.abort(); | |
| this.dispatchOwnerLifecycleWake(entry); | |
| return; | |
| } |
| inputs: AgentExternalInput[]; | ||
| terminateMode?: AgentTerminateMode; | ||
| }> { | ||
| while (true) { |
There was a problem hiding this comment.
[Suggestion] while (true) 的退出依赖一个隐式不变量:「empty wake ⇒ predicate 在有限时间内翻成 false」。当前实现满足(silent cancel → settle 'cancelled' → hasRunningForOwner 立刻返回 false),但这个契约不是类型/接口层面强制的。
之前 review thread 已经讨论过、决定不加 cap。现在的代码里没有任何注释解释这个退出契约——后续如果有人增加非 cancel 触发的 lifecycle wake(比如 monitor restart),就可能破坏不变量却不自知。建议在 while (true) 上方加一句话:
// Loop exit invariant: each empty wake from waitForExternalMessages
// must correspond to a state where shouldWaitForExternalMessages()
// will eventually return false (current sources of empty wake are
// abort/timeout/registry reset/silent owner cancel — each transitions
// the predicate). New lifecycle wake sources MUST preserve this or
// this loop can spin.这是上面 thread 那次「不加 cap」决议的代码侧文档落地。
| // need to see this subagent's id as their `parentAgentId`. | ||
|
|
||
| if (isFork) { | ||
| const forkMonitorInputs = createLocalExternalInputQueue(); |
There was a problem hiding this comment.
[Suggestion] implicit fork 不接 BackgroundTaskRegistry 而走独立 createLocalExternalInputQueue(),是经过讨论的:implicit fork 没有 registry entry,也没有可被 send_message 定位的 task target,local queue 只承担 fork-owned Monitor notification。run_in_background: true 的 background fork 走 registry queue,因此 SendMessage 路径在 background fork 上是覆盖的。
这个 rationale 目前只散落在 PR review thread 里,下一个读这段代码的人没法快速理解为什么 fork 这条路径长得不一样。建议加一行内联注释:
| const forkMonitorInputs = createLocalExternalInputQueue(); | |
| if (isFork) { | |
| // Implicit fork has no BackgroundTaskRegistry entry (no task | |
| // target for send_message), so monitor notifications can't go | |
| // through registry.queueExternalInput like foreground/background | |
| // do. The local queue serves owner-routed Monitor notifications | |
| // only; SendMessage to implicit forks is intentionally | |
| // unsupported. run_in_background:true forks fall through to the | |
| // background path above, which uses the registry queue. | |
| const forkMonitorInputs = createLocalExternalInputQueue(); |
| this.eventEmitter.off(AgentEventType.TOOL_CALL, onFgToolCall); | ||
| this.eventEmitter.off(AgentEventType.USAGE_METADATA, onFgUsageMetadata); | ||
| signal?.removeEventListener('abort', onParentAbort); | ||
| cleanupOwnedMonitorNotifications(); |
There was a problem hiding this comment.
[Suggestion] 时序观察,非 bug:foreground finally 在这里先调 cleanupOwnedMonitorNotifications(),再走到下面的 registry.unregisterForeground(hookOpts.agentId)。cleanupOwnedMonitorNotifications 内部已经 cancelRunningForOwner({notify:false}) 把所有 owner monitor 切到 cancelled,所以这段窗口期不会有 monitor 触发新 notification。但 entry 在 BackgroundTaskRegistry 里仍然 status='running',理论上:
- 此时如果有别的代码路径触发
queueExternalInput(agentId, ...),仍会成功入队到一个不会再被 drain 的 entry。 - 实际触发源主要就是 owner monitor → 已经被 silent cancel 关掉。
影响很小,目前没有 bug 表现。如果后续接入新的 external input source(比如 SendMessage 在 foreground 也支持),需要记得这条路径也要在 cleanupOwnedMonitorNotifications() 里 detach,否则有 stale enqueue 风险。这条只是留个 marker,不需要现在改。
| expect(running[0].ownerAgentId).toBeUndefined(); | ||
| }); | ||
|
|
||
| it('records the current agent as owner when monitor is started by a subagent', async () => { |
There was a problem hiding this comment.
[Suggestion] 当前测试在 runWithAgentContext('agent-123', () => execute(...)) 下面断言 entry.ownerAgentId === 'agent-123',覆盖到 top-level execute 读 ALS 这一条路径。但 runInAgentFrames 的 inheritedAgentId 修复(agent-core.ts:1180、test 在 agent-core.test.ts)只在 unit 层面验证了 ALS 的恢复——没有端到端验证 deferred-approval respond → handleConfirmationResponse → attemptExecutionOfScheduledCalls → monitor.execute 整条链上 ownerAgentId 仍然被正确捕获。
建议补一个集成性测试:在 runWithAgentContext 里发起 monitor 但要求 approval,模拟 UI 调用 respond 从外部 async chain 触发,断言注册的 MonitorEntry.ownerAgentId === 'agent-123'。这能锁住 deferred-approval 链路的 owner ownership,避免后续调度器/scheduler 重构(比如把 attemptExecutionOfScheduledCalls 拆成独立 microtask)破坏不变量却 unit 测试还过。
wenshao
left a comment
There was a problem hiding this comment.
No review findings. Downgraded from Approve to Comment: CI failing: Test (windows-latest, Node 22.x). — DeepSeek/deepseek-v4-pro via Qwen Code /review
…ting Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
8f26704 to
d2b5a47
Compare
Summary
Validation
packages/coreandpackages/cli.Manual tmux E2E verification
<result>MONITOR_OWNER_ROUTE_OK</result>, followed by the subagent final success report. The parent-facing tmux log does not containMonitor "owner-route-repro" event #1: MONITOR_OWNER_ROUTE_OKorMonitor "owner-route-repro" completed, which is the pre-fix leakage pattern.tmp/monitor-owner-route-tmux-20260508-013222/tmux-readable-full.log/Users/jinye.djy/.qwen/projects/-Users-jinye-djy--codex-worktrees-b302-qwen-code/subagents/7c2b5533-c12c-49e2-b878-15857eb41a76/agent-general-purpose-tool-9df702ad847c4d28aee4831b14097253.jsonl人工验证
prompt
修复前:

prompt
修复后:

prompt
Scope / Risk
Testing Matrix
Testing matrix notes:
npm run dev -- --approval-mode yolo.Linked Issues / Bugs
Fixes #3925
Closes #3666