feat: add Agent Team experimental feature for parallel sub-agent coordination#4844
Conversation
First stage of re-porting the Agent Team feature (originally PR #2886) onto current main. The branch had diverged 362 commits behind a parallel rewrite of the agent runtime, so the feature is being re-applied stage by stage rather than merged. This stage lands the self-contained agents/team/ subsystem (TeamManager, mailbox, identity, tasks, leader permission bridge, test-utils) plus the team_create/team_delete and task_create/task_update/task_list tools, with the additive plumbing they need: - Config: TeamManager/TeamContext accessors, cleanupTeamRuntime, and isAgentTeamEnabled (settings or QWEN_CODE_ENABLE_AGENT_TEAM=1). - Tool registry: team/task tools registered lazily, gated on the flag. - Runtime hooks: completeOnIdle for one-shot teammates; args on the agent approval event; teammate-aware tool exclusion sets. - Backend types: TeamAgentHandle, optional getAgent, completeOnIdle. - New experimental.agentTeam setting. Everything is gated behind the experimental flag and inert by default. Build is green; team unit tests and all touched-file regressions pass.
Stage 1 of the Agent Team re-port. Extends the send_message tool so it can route to a teammate (or "*" for broadcast) via TeamManager in addition to its existing background-task path, and supports the shutdown_request control message (leader-only). Recipient selection is a oneOf over `to`/`task_id`. Layered on top of main's classifier integration: send_message keeps its 'ask' default permission and forwards the routing fields + message to the AUTO classifier, since the message is an instruction the recipient executes. Re-adds the team-lifecycle E2E test, which now passes end to end (create -> tasks -> messages -> list -> update -> delete).
Stage 3 of the Agent Team re-port. Adds the `name` parameter to the Agent tool: when a team is active and a name is given, the call routes through TeamManager.spawnTeammate instead of launching a one-shot subagent. Without a team the call is rejected up front rather than silently falling back. The tool description advertises team coordination only when the experimental flag is on. Ported onto main's rewritten Agent tool.
Stage 5 (partial) of the Agent Team re-port. Teaches the ToolMessage result renderer about the TeamResultDisplay and TaskListResultDisplay shapes so the team/task tools' output is shown via their returnDisplay text instead of a stringified object, and adds a JSON.stringify safeguard for any other non-string display object. The remaining CLI wiring (nonInteractiveCli + useGeminiStream team drivers, permissionController.handleTeammateApproval) is coupled to the turn-loop Teammate handling and will land together with Stage 4.
Stage 4 of the Agent Team re-port. Adds SendMessageType.Teammate and includes it in isTopLevelInteraction so that a teammate message delivered to the leader resets the loop detector and opens an interaction span, the same as a user/cron/notification turn. Per the agreed minimal integration, teammate turns deliberately do NOT run the UserQuery/Cron block — they don't bump commit attribution, aren't recorded as user messages, and don't trigger auto-memory prefetch. That keeps the edit to main's restructured turn loop to a single condition.
Stage 5 of the Agent Team re-port. Wires the non-interactive/headless run loop to the active team: it subscribes to TeamManager changes, drains teammate messages into the leader's conversation as SendMessageType.Teammate turns, waits for teammate activity when the leader has no pending tool calls, and routes teammate tool-approval requests through the session's permission channel (SDK in stream-json mode; YOLO/cancel fallback otherwise). Adds PermissionController.handleTeammateApproval and exposes it on the ControlService permission facade. Ported onto main's restructured run loop (which added its own cron/notification drain mechanism).
Final Stage 5 piece of the Agent Team re-port. Wires the interactive (TUI) useGeminiStream hook to the active team: it subscribes to TeamManager, queues teammate messages, and drains them into the conversation as SendMessageType.Teammate turns when idle, guarded against racing the notification drain. Treats teammate turns like user/cron for image-format checks and new-prompt stats. Ported onto main's rewritten hook.
E2E verificationRun headless against a live model (glm-5.1) with the feature flag on, plus the unit suite. The team flow runs end-to-end: flag-gated tool registration → create a team → spawn named teammates via the Agent tool → teammates run concurrently and report back → leader consolidates → team cleanup. Clean exits, no hangs.
With the flag off (default), the team tools are not registered and behavior is unchanged. |
|
Thanks for the PR, @tanzhenxin! This is a substantial feature — let's walk through the gate. Template looks good ✓ — all required sections present, bilingual, test plan with evidence. On direction: this is squarely on the roadmap. Claude Code already ships Agent Teams (research preview since ~2.1.33, with ongoing fixes through the latest releases — teammate model inheritance, permission propagation, memory leaks, tmux messaging). This repo has a On approach: 10K additions / 60 files is large, but the decomposition into 7 layered commits (foundation → message routing → teammate spawning → rendering → turn-loop integration → headless/interactive drivers) makes it reviewable. The entire feature is behind One thing worth thinking about (not a blocker): Moving on to code review. 🔍 中文说明感谢贡献,@tanzhenxin!这是一个重要的功能——让我们过一下门禁。 模板完整 ✓ — 所有必要章节齐全,双语,测试计划带证据。 方向:完全符合路线图。Claude Code 已经发布了 Agent Teams(从 ~2.1.33 开始的研究预览,一直到最新版本都有持续修复——队友模型继承、权限传播、内存泄漏、tmux 消息)。本仓库有 方案:10K 行新增 / 60 个文件规模较大,但拆分为 7 个层次化的提交(基础设施 → 消息路由 → 队友派生 → 渲染 → 回合循环集成 → 无界面/交互式驱动)使其可审阅。整个功能在 一个值得考虑的点(不是阻塞项): 进入代码审查 🔍 — Qwen Code · qwen3.7-max |
|
@qwen-code /review |
The Agent Team mailbox and task files import proper-lockfile, but the dependency was never declared in package.json, so a clean `npm ci` (as CI runs) failed to resolve the module — cascading into implicit-any and possibly-undefined errors in the same files. It built locally only because the working tree's node_modules already had the package from an earlier install. Adds proper-lockfile to packages/core dependencies and @types/proper-lockfile to root devDependencies (matching the original feature branch), and regenerates the lockfile. Build and typecheck are clean.
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
| promises.push(this.sendMessage(LEADER_NAME, message, fromName)); | ||
| } | ||
|
|
||
| await Promise.all(promises); |
There was a problem hiding this comment.
[Suggestion] broadcast uses Promise.all, which means if one sendMessage call fails (e.g. a teammate terminates mid-broadcast and its mailbox is deleted), the entire broadcast rejects — even though other teammates successfully received the message. The leader then sees an error with no indication of partial delivery.
Consider using Promise.allSettled and logging any individual delivery failures:
| await Promise.all(promises); | |
| const results = await Promise.allSettled(promises); | |
| for (const r of results) { | |
| if (r.status === 'rejected') { | |
| debug.warn(`broadcast delivery failed: ${r.reason}`); | |
| } | |
| } |
— qwen3.7-max via Qwen Code /review
| this.agentIdentities.clear(); | ||
| this.teamEventEmitter.removeAllListeners(); | ||
|
|
||
| await this.backend.cleanup(); |
There was a problem hiding this comment.
[Suggestion] cleanup() clears eventBridgeCleanups, pendingMessages, lastActivityAt, agentIdentities, and removes event listeners — but does not clear _shutdownPending. If a TeamManager instance is ever inspected after cleanup (in tests or debug tooling), stale entries in _shutdownPending could cause false positives.
| await this.backend.cleanup(); | |
| this._shutdownPending.clear(); | |
| await this.backend.cleanup(); |
— qwen3.7-max via Qwen Code /review
| /** | ||
| * List all tasks for a team, optionally filtered. | ||
| */ | ||
| export async function listTasks( |
There was a problem hiding this comment.
[Suggestion] listTasks() reads every task JSON file from disk on every call. This is invoked from several hot paths: isAgentBusy(), unassignTeammateTasks(), getAgentStatuses(), scanIdleAgentsForTasks() (which fires on every onTasksUpdated event), and the task_list tool itself. With many tasks, a single task update can trigger an amplification loop — the notifyTasksUpdated → scanIdleAgentsForTasks → listTasks chain does O(n) file reads per teammate.
For an experimental feature behind a flag with MAX_TEAMMATES=10, this is acceptable today. For a follow-up, consider maintaining an in-memory task index (Map) updated under the existing file lock, and using it for filtered queries — falling back to disk reads only on cold start.
— qwen3.7-max via Qwen Code /review
| // ─── Helpers ──────────────────────────────────────────────── | ||
|
|
||
| /** Get the next task ID by scanning existing files. */ | ||
| async function getNextTaskId(dir: string): Promise<string> { |
There was a problem hiding this comment.
[Suggestion] getNextTaskId() scans all task filenames on every createTask() call to find the max numeric ID. Combined with the EEXIST retry loop (up to 15 attempts), a burst of concurrent task creates can do up to 15× readdir calls.
For a follow-up, consider storing a monotonically increasing counter in a separate file protected by proper-lockfile, making ID generation O(1).
— qwen3.7-max via Qwen Code /review
| */ | ||
| private ensureLeaderInboxPolling(): void { | ||
| if (this.pollingInterval) return; | ||
| this.pollingInterval = setInterval(() => void this.pollLeaderInbox(), 500); |
There was a problem hiding this comment.
[Suggestion] The leader inbox is polled every 500ms via setInterval regardless of whether any messages are being sent. This keeps the Node.js event loop active and prevents idle-state optimizations during long teammate work periods.
For a follow-up, consider an event-driven approach where sendMessage/writeMessage notify the TeamManager via an in-process event when new messages are available, using polling only as a fallback. Alternatively, increasing the interval to 2–5s in the steady state with an immediate trigger for the hot path would reduce overhead.
— qwen3.7-max via Qwen Code /review
| * that inject the structured shutdown message directly without | ||
| * going through `requestShutdown`. */ | ||
| markShutdownRequested(name: string): void { | ||
| this._shutdownPending.add(name); |
There was a problem hiding this comment.
[Suggestion] Test coverage for TeamManager internals has notable gaps. The following methods are critical to team session correctness but have no direct tests:
cleanup()— full teardown sequence (polling stop, listener unsubscribe, event bridge cleanup) is not verified.team-delete.test.tsmocks it asvi.fn().waitForTeammateActivity— resolves on four conditions (message,terminated,timeout,aborted), none tested. ThenonInteractiveCliexit loop depends on this.allRemainingStalled/abortStalledTeammates— 600s stall threshold not tested. Risk of false positives (aborting active agents) or false negatives (infinite hangs).readLeaderInboxOrQuarantine— corrupt-file quarantine at the TeamManager level (the code path the leader actually hits during polling) is untested.
Consider adding focused tests for at least the cleanup and wait-loop paths, as they govern whether a team session can start, run, and exit correctly.
— qwen3.7-max via Qwen Code /review
…imental) Triaged the unresolved review threads from the superseded PR #2886 against the re-ported code and applied the valid fixes: - task_update(status:'deleted') now enforces the same ownership guard as updateTask, so a teammate cannot delete another teammate's task. - Completing a task and adding a blocks edge in the same call no longer leaves the dependent permanently blocked by the just-completed task. - listTasks treats a momentarily-empty (mid-create) task file as a create in flight and skips it instead of quarantining and losing the task. - Fire-and-forget coordination calls (flush, auto-claim, unassign, poll) log rejections instead of surfacing as unhandled rejections. - pollLeaderInbox re-checks the leader callback after the awaited read so a detach during the read cannot throw or drop the batch. - scanIdleAgentsForTasks skips teammates with a pending shutdown. - broadcast uses allSettled so one terminated recipient does not fail the whole broadcast. - Hybrid tool-response+teammate turns reset the loop detector, preventing a false LoopDetected when a polling leader merges teammate messages. - useGeminiStream drains its teammate queue on a manager swap; the join event carries the teammate model for the UI tab label. - Removed dead consumeUnreadByType and an unreachable ENOENT branch. Verified: core unit tests (incl. new regressions for the delete guard, the complete+addBlocks re-block, and the empty-file create race) plus live L3 (3-agent) and L4 (4-agent) E2E, both clean.
| * with an empty queue is not considered active — it has | ||
| * finished its current work and is waiting to be re-engaged. | ||
| */ | ||
| hasActiveTeammates(): boolean { |
There was a problem hiding this comment.
[Suggestion] Race between flushNextMessage and hasActiveTeammates. After flushNextMessage delivers a message to an IDLE teammate via enqueueWithIdentity (line 1244), the agent's status transition from IDLE → RUNNING is asynchronous. In the window between enqueue and status change, hasActiveTeammates() sees: agent status = IDLE, TeamManager queue = empty, and returns false.
If the nonInteractiveCli wait loop checks hasActiveTeammates() in that window, it exits prematurely and the leader may call team_delete while a teammate is about to process a just-delivered message.
Suggested fix: Track recently-enqueued messages in a separate set that hasActiveTeammates also checks:
private readonly recentlyEnqueued = new Set<string>();
// In flushNextMessage, after enqueue:
this.recentlyEnqueued.add(agentId);
// Clear in onStatusChange when agent goes non-IDLE:
this.recentlyEnqueued.delete(agentId);
// In hasActiveTeammates:
if (this.recentlyEnqueued.has(member.agentId)) return true;— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
I think this one is a false alarm — the IDLE→RUNNING transition is synchronous, not async.
enqueueMessage starts runLoop synchronously (agent-interactive.ts:251-256), whose sync prefix calls runOneRound, which sets RUNNING at agent-interactive.ts:154 before its first await (line 166). sendMessage awaits flushNextMessage, and there's no await between queue.shift() and enqueueWithIdentity (TeamManager.ts:1249-1253) — so by the time control returns to the leader, status is already RUNNING, and hasActiveTeammates returns true via the status !== IDLE check. The empty-queue + IDLE window doesn't occur.
Left unchanged — but happy to add the guard if you can point to a path (for the InProcessBackend teammates use) where enqueue doesn't synchronously start the loop.
| // name on the next `team_create`. Sweep once more after a | ||
| // short delay to catch the race. | ||
| await deleteTeamDirs(teamName); | ||
| await new Promise((r) => setTimeout(r, 250)); |
There was a problem hiding this comment.
[Suggestion] The fixed 250ms delay between the two deleteTeamDirs calls is a fragile mitigation for the recreate-after-delete race. Under high system load, slow disk I/O, or when a teammate has a long-running tool call that takes longer than 250ms to settle after cleanup(), the directory can be recreated after the second deleteTeamDirs, leaving orphan artifacts that wedge the team name on the next team_create.
Consider a retry loop with exponential backoff (e.g., 3 attempts at 100ms, 500ms, 2000ms) or watching the directory briefly for recreation, rather than relying on a single fixed delay:
| await new Promise((r) => setTimeout(r, 250)); | |
| await deleteTeamDirs(teamName); | |
| for (const delayMs of [100, 500, 2000]) { | |
| await new Promise((r) => setTimeout(r, delayMs)); | |
| await deleteTeamDirs(teamName); | |
| } |
— Qwen Code via /review
There was a problem hiding this comment.
I'd lean toward leaving the fixed delay as-is for now. The suggested [100, 500, 2000] backoff would add up to ~2.6s of guaranteed latency to every team_delete to catch a race the existing double-sweep already handles in the common case — the cure costs more than the disease here. If we see actual orphan-dir flakiness in practice I'm happy to add bounded retries (and probably watch the dir rather than fixed sleeps). Leaving open in case you disagree.
| if (this.leaderMessageCallback === wrappedCallback) { | ||
| this.leaderMessageCallback = origCb; | ||
| } | ||
| origCb?.(msg); |
There was a problem hiding this comment.
[Suggestion] When an external setLeaderMessageCallback(newCb) fires during waitForTeammateActivity (e.g., during team teardown or manager swap), the wrapper correctly avoids restoring origCb over newCb. However, the current message is still forwarded to origCb (the pre-wait callback) rather than to newCb. The external caller who installed newCb expecting to receive all subsequent messages silently misses this one.
This is a narrow edge case (team teardown is the most common trigger), but the message loss is silent and could make debugging teammate-final-message delivery confusing. Consider documenting this trade-off, or delivering to whichever callback is currently installed at the time the message arrives.
— Qwen Code via /review
There was a problem hiding this comment.
I don't think newCb actually misses the message here. Delivery reads this.leaderMessageCallback fresh at TeamManager.ts:705-708, with no await between the read and the call. Once an external setLeaderMessageCallback(newCb) runs, wrappedCallback is no longer the installed callback and is unreachable (it's only ever assigned to that field), so the next message goes straight to newCb. The wrappedCallback → origCb forward only fires when no swap happened, where forwarding to origCb is correct.
So I believe there's no lost-message path — but leaving this open in case you're seeing a concrete delivery route I'm missing.
| agentId: string, | ||
| opts?: { checkAgentBusy?: boolean; ownerName?: string }, | ||
| ): Promise<SwarmTask | undefined> { | ||
| if (opts?.checkAgentBusy) { |
There was a problem hiding this comment.
[Suggestion] TOCTOU gap in claimTask with checkAgentBusy: the isAgentBusy() check reads all in-progress tasks without holding any lock, then the per-task lock is acquired later. Between the busy check and the lock acquisition, a concurrent event (e.g., updateTask from another teammate assigning work, or scanIdleAgentsForTasks racing with this call) could give this agent a new in-progress task, making it busy after the check passed. The agent ends up owning multiple in-progress tasks despite the guard.
This is a soft constraint (the system doesn't crash), but it undermines the "one task per agent" invariant that auto-claim and the UI rely on. Consider moving the busy check inside the per-task lock, or using a compare-and-swap pattern.
— Qwen Code via /review
There was a problem hiding this comment.
Real but soft — I'd like to defer this one. isAgentBusy scans all in-progress task files, so moving it inside this task's per-task lock wouldn't actually close the gap: a concurrent claim of a different task by the same agent isn't covered by this task's lock. A correct fix needs per-agent serialization/CAS, which feels disproportionate for a best-effort "one task per agent" invariant in an experimental feature — momentary double-ownership self-heals as the agent finishes a task. Happy to file a follow-up issue to track a proper per-agent guard. Leaving open.
…eam tasks deleteTask checked ownership against a pre-lock read, so a concurrent claimTask/updateTask could reassign the owner between the check and the unlink — silently destroying another teammate's task. Acquire the lock first, then re-read and re-check ownership inside it before unlinking, mirroring updateTask. Reciprocal edge cleanup now runs after the lock is released (never holding two per-task locks at once) but before the single tasks-updated notification, so no listener observes a phantom blocker. blockTask issued its two updateTask writes via Promise.all; two calls over the same pair in opposite directions could deadlock on per-task locks. Serialize the writes to remove the lock-ordering hazard.
- Cap per-agent pending messages (MAX_PENDING_MESSAGES). The queue only drains when its recipient goes IDLE, so an unbounded queue let a single looping teammate balloon a busy teammate's memory; sendMessage now applies backpressure once the cap is reached. - Wrap auto-claimed task content (subject/description, authored by another agent) in a <task_content> envelope with a defensive instruction so it is treated as data, not as instructions to obey. - Surface fire-and-forget coordination failures (flush, auto-claim, unassign) to the leader's conversation. They were only logged via a namespaced debug logger, i.e. invisible in production, despite mapping to silent stuck-teammate / stuck-task symptoms.
A task's subject/description becomes the prompt an idle teammate auto-claims and executes with full tool access — the same privileged-sink shape as send_message. Both tools inherited the base default 'allow', which short-circuits the classifier in AUTO mode. Override getDefaultPermission to 'ask' so that injection path stays under the classifier / human-in-the-loop, matching send_message.
The JSDoc cited team teammates as the use-case for completeOnIdle:true, but teammates set it to false so they settle to IDLE (not COMPLETED) and stay alive for follow-up messages and auto-claim. Document the actual semantics and the invariant the leader's wait loop relies on.
- fireAndForget: wrap leaderMessageCallback in try/catch so a throwing callback cannot re-introduce the unhandled rejection the wrapper exists to prevent (enforces the documented 'must not throw from this catch'). - tryAutoClaimTask: nonce-tag the <task_content> envelope with the per-session envelopeNonce (same pattern as formatLeaderEnvelope) so a teammate-authored description cannot forge the closing tag and break out of the protected zone via a </task_content> payload.
Use Promise.allSettled (was Promise.all) for post-unlink edge cleanup so a single failing dependent (corrupt JSON, EACCES, lock exhaustion) no longer skips notifyTasksUpdated for the dependents that were cleaned. Without this their blockedBy is cleared but scanIdleAgentsForTasks never re-runs, leaving them stuck idle with no recovery (the task file is already unlinked, so a retry returns false). Per-failure warnings are logged.
The hook bridging team TEAMMATE_JOINED events to agent-tab registration (useTeamInProcess) was authored but never mounted in AgentViewProvider — only useArenaInProcess was. As a result teammate tabs never registered and the teammate tab bar never appeared during in-process team runs. Mount useTeamInProcess alongside useArenaInProcess, and label teammate tabs by name rather than model (teammates inherit the leader's model, so a model label collapses to a generic "teammate" and is identical across the team). Add a regression test asserting the provider mounts the team bridge.
Add a standalone streaming demo of the agent-team feature that captures the full lifecycle and the teammate tab navigation into a single GIF (scenarios/agent-team-demo.ts). Supporting engine fixes: - capture(): scroll the xterm viewport to the live bottom before screenshotting, so a capture taken after an idle period shows the current state instead of stale top-of-buffer scrollback. - scenario-runner: skip scenarios/*.ts files with no default export (driver scripts that guard their own entrypoint), so batch runs don't choke.
…lakiness
The concurrent-write test fired 10 writeMessage() calls at one inbox,
each contending for the same proper-lockfile lock with a fixed,
non-randomized backoff. On Windows, slower fs syscalls let the tail
writers exhaust the retry budget before winning the lock, throwing
ELOCKED ("Lock file is already being held") — a flaky failure that
alternated pass/fail across CI runs.
Add a per-inbox in-process Mutex (async-mutex, the pattern already used
in jsonl-utils and writeContextFile) so same-process writers serialize
in memory and only one reaches for the file lock at a time. The
proper-lockfile lock stays inside the mutex to preserve cross-process
safety between agent processes. Also randomize the lock backoff to
de-synchronize genuine cross-process contenders.
A teammate's report was injected into the leader's conversation as a raw <teammate_message_<nonce>> envelope and rendered verbatim as a user bubble — a large, scaffolding-heavy block on screen for what is often the biggest payload in the feature. Adopt the two-text split the notification queue already uses: the full nonce-tagged envelope still goes to the leader's model, but the user now sees a compact "● <name> reported back" line in its place. The verbatim USER bubble is suppressed for SendMessageType.Teammate exactly as it is for Cron, and coordination-error notices get the same treatment. The leader callback now delivers both the model text and a display string built in TeamManager (where the structured sender/summary live), so the UI never parses the envelope. Headless is unchanged — it ignores the extra arg.
…F doesn't cut off The leader sits idle (no Main-view output) while teammates read their files, so Phase C captures no frames until a report lands — making maxPolls the real wall-clock budget. At 80 polls (~112s) a slow second scout could exhaust it before reporting, ending the GIF mid-run. Bump to 200 polls (~5min) so the capture outlasts the slowest scout plus the combined summary and delete; the loop still exits early on `deleted` and idle polls capture no frames, so the GIF doesn't bloat.
|
Direction is sound (mirrors Claude Code's Agent Teams) and the flag isolation is clean. But the core is over-engineered: it's an in-process feature (all teammates share one Node process), yet the leader's inbox goes through lock-guarded files + ~1s polling while teammates use a plain in-memory queue — same operation, two transports. I'd route the leader inbox in-memory too (dropping 中文方向成立(对标 Claude Code 的 Agent Teams),flag 隔离也干净。但核心过度设计:明明是进程内特性(队友共享一个 Node 进程),leader 收件箱却走文件锁 + 每秒轮询,而队友走内存队列——同一操作两套传输。建议 leader 也改内存(可删掉 |
…nsient I/O - createTask claims the ID with an empty O_EXCL placeholder and fills it via temp-file + rename, so concurrent readers never see partial JSON (which the quarantine would have destroyed mid-create) - listTasks quarantines only on parse failures; transient read errors (EMFILE/EIO/EACCES) skip the file for one round instead of renaming a healthy task away, and the read fan-out is capped at 16 - updateTask / claimTask / releaseOwnedTask guard the in-lock readFile against ENOENT (resetTaskList and the quarantine rename run without per-task locks), mirroring deleteTask - cover releaseOwnedTask's three defensive branches with tests
… abort on idle agents - a message enqueued from inside the synchronous IDLE STATUS_CHANGE emit (TeamManager's flush) landed after the run loop's final empty check while `processing` was still true — enqueueMessage would not restart the loop and the message stranded in a dead queue; the loop now re-checks the queue after `processing` flips false - abort() on an idle/initializing agent only set the signal: no loop was running to observe it, so the agent never reached a terminal status and allTeammatesTerminated()-style gates never fired; abort now settles CANCELLED directly when no loop is in flight - regression tests drive the real AgentInteractive (stub model, real loop) through send-during-idle-emit and abort-while-idle
…active FakeAgent modeled a friendlier runtime than the one that ships: enqueueMessage processed inline (no queue, no processing flag, resurrecting terminal agents via unconditional RUNNING), which is exactly what masked the flush-into-dead-queue bug. It now queues while a round is in flight, drains before settling IDLE, drops messages after abort()/shutdown() like the real drained queue, and never resurrects a terminal agent.
… peer messages - spawnTeammate now checks the agent's status after spawnAgent resolves: start() reports chat-creation failure via FAILED without throwing, so the leader was told the teammate joined while sends were accepted into a queue that could never flush; a failed spawn now rolls back and surfaces the reason (with a terminal-status replay in setupEventBridge for the attach race) - shutdown_rejected now clears _shutdownPending: a teammate that declined once stayed excluded from auto-claim and kill-armed on any later "shutdown_approved" mention - peer-to-peer deliveries get a fresh-nonce envelope like leader deliveries, closing inline leader impersonation between teammates (deliberately not the leader-trust nonce, which must never reach teammate context)
The teammate ALS identity propagates into anything a teammate spawns, so prepareTools() keeps choosing the teammate exclusion set for nested agents — without WORKFLOW in it, a teammate-launched workflow re-arms the O(k^n) recursive fan-out the subagent exclusion set prevents.
…dency cycles
- task_create / task_update now project their content (subject,
description, status, owner, edges) to the AUTO classifier — the base
'' sentinel projected to an empty object, so the classifier ruled on
task_create({}) and the 'ask' override was blind; the interactive
confirmation now shows the description (truncated), since that text
is what a claiming teammate executes
- task_update rejects self-edges and dependency cycles instead of
silently persisting a graph that auto-claim can never unblock
- regression tests pin the 'ask' default and a non-empty classifier
projection for both tools
…name Nothing deletes team dirs on normal exit (only an explicit team_delete does), so every Ctrl+C, completed headless run, or crash permanently wedged the team name behind createTeamFile's wx-exclusive create, with manual rm -rf as the default recovery. team_create now records the owner identity (leadSessionId + leadPid) and, on EEXIST, reclaims the team when the recorded lead process is gone (or is this process); only a live concurrent owner keeps the name refused.
…ll/@/slash preprocessing Teammate envelopes are model-authored text already rendered as a notification line by the teammate drain, but they still flowed through the user-input preprocessing: with shell mode active a teammate report was EXECUTED as a shell command, and a leading / or an @path was reinterpreted against the leader's session. They now early-return like Notification.
… in chat history Teammate envelopes are machine-driven re-entries like Cron and Notification: user-authored UserPromptSubmit hooks must not fire on (or block) internal coordination traffic. They also never reached any chat-recording path — record them like notifications so a resumed session restores the same compact info line the live UI rendered.
…d rejections The stream-json listener voided handleTeammateApproval's promise while the handler's own error path re-issues a respond() that can reject (teammate terminated mid-request) — an unhandledRejection that can take down an SDK session. The call site now catches like its headless siblings, and the controller's catch-path respond(Cancel) is wrapped so the method never rejects out of its own error path.
|
Round 5 addressed — commits b5d6018..f3d32a9, per-thread replies cite the specific fix. All 8 blockers are fixed. Of the robustness items, everything is fixed except leader-inbox compaction (one item — the unassign resurrect — was already addressed by 74cc344 before your pass landed). The compaction thread is deliberately deferred: it collides with @pomelo-nwu's proposal to route the leader inbox in-memory, and I'd rather not rework the cursor scheme twice — both go into one follow-up issue on the leader-inbox transport. On the "for the record" list: agreed on all counts, and they go into the same follow-up issue — headless Verification: core 426 tests (team / runtime / tools / client suites) + cli 106 (useGeminiStream, permissionController) green; |
|
Thanks @pomelo-nwu — the two-transport observation is a fair hit: with every teammate in-process, lock-guarded files + ~1s polling for the leader inbox is more machinery than the job needs. The file transport does buy crash-inspectable artifacts and is what an out-of-process backend would need (the Backend interface anticipates one), but if in-process stays the only backend, routing the leader inbox in-memory and deleting mailbox.ts's locks/quarantine/polling is the right end state. I'm not swapping the transport at round 5 of this PR, though — it would invalidate the just-stabilized offset/compaction behavior wenshao also flagged. Consolidating both (your in-memory proposal + the leader-inbox compaction finding) into a follow-up issue on the leader-inbox transport, to land while the feature is still behind the experimental flag. On TeamManager tests: the coordination harness (real TeamManager + FakeBackend) covers routing, priority, claiming, shutdown protocol, broadcast, and — new this round — spawn failure; this round also added integration tests against the real agent runtime where the fakes had been flattering it. Agreed it should grow further before leaving experimental — that's in the follow-up too. |
# Conflicts: # packages/cli/src/config/config.ts # packages/core/src/config/config.ts
qwen-code-ci-bot
left a comment
There was a problem hiding this comment.
No high-confidence issues found. LGTM! ✅
The Agent Team feature is well-structured with comprehensive test coverage (261 team-specific tests passing) and thorough defensive programming. Five low-confidence findings involving timing-dependent races in auto-claim and notification ordering were identified but not confirmed with sufficient certainty to warrant blocking — they are suitable for follow-up review.
— qwen3.7-max via Qwen Code /review
DragonnZhang
left a comment
There was a problem hiding this comment.
No high-confidence issues found. The Agent Team feature is well-engineered with extensive defensive coding: nonce envelopes for anti-spoofing, proper permission defaults ('ask' for privileged sinks), path traversal prevention (sanitizeName, assertValidTaskId), concurrency controls (mutex + file locks, inbox serialization, rollback on spawn failures), resource management (message queue backpressure, metadata size limits), and thorough error handling (corrupt file quarantine, fire-and-forget with error logging). All CI checks pass. LGTM. -- claude-sonnet-4-20250514 via Qwen Code /review
Independent maintainer verification — Linux, real build + live E2ERan an independent verification pass of this PR on Linux (the author tested macOS; Windows/Linux were CI-only), covering build, the touched unit suites, deterministic flag gating, and four live end-to-end scenarios (headless + interactive TUI via tmux, in both yolo and default approval modes) against a live model. Environment: Debian Linux (kernel 6.12.63, x86_64) · Node v22.22.2 · npm 10.9.7 · tmux 3.5a · head Summary
No orphan processes or tmux panes left behind after any scenario (teammates are in-process); a follow-up prompt in the same session after team deletion behaves normally. 5. Headless E2E (yolo,
|
|
Thanks for the independent Linux verification pass, @wenshao — especially the live default-approval-mode run that surfaced both findings. Finding B (teammate identity lost on the approval-resume path) is fixed in #4979, taking the fast-follow route you suggested. The fix is essentially your second option — the teammate identity is captured while the loop frame is still live and restored across the post-approval resume boundary, generalized to the shared resume path so it covers the message tool, the task tools, and the leader-only Finding A (teammate pending approvals invisible from the Main view) we're treating as the follow-up polish you suggested. Two tiers: a cheap interim cue (a distinct tab-bar state for awaiting-approval and/or a Main-view notice line) in the near-term follow-up, and the proper fix — activating 中文说明感谢 @wenshao 的独立 Linux 验证——尤其是默认审批模式下的真实运行,发现了这两个问题。 Finding B(审批恢复路径丢失队友身份)已在 #4979 修复,采用了你建议的紧随其后的方式。修复本质上就是你给的第二个方案——在循环帧仍存活时捕获队友身份,并在审批后恢复的边界上重建,并推广到共用的恢复路径,从而一处覆盖消息工具、任务工具以及 Finding A(Main 视图看不到队友的待审批)按你建议作为后续打磨处理。分两层:近期后续里先加一个轻量提示(给待审批一个不同的 tab 栏状态,和/或 Main 视图的提示行),而真正的修复——激活 |
… SECURITY comment R1 review by @wenshao (4 [Suggestion] threads + 1 review-body comment). The fifth thread, [Critical] nested deadlock on a thunk-level limiter, was already caught and fixed in commit 0401ac88f by the dispatch-layer refactor — that commit was authored before the review posted, so the thread is on the pre-fix code; verified resolved by independent round-1 self-review fan-out that converged on the same finding from a different angle, and by a real-LLM E2E scenario (parallel-in-pipeline at concurrency=1 against qwen3-max). settleToNullArray observability + abort docs (T2 / T3 wenshao): - settled.map now logs the discarded rejection reason at debug level when a thunk rejects. Operators investigating a workflow that returned unexpected nulls can now disambiguate the four indistinguishable null paths (dispatch failure, 1000-cap, pipeline stage exception, non-JSON-serializable result) via the WORKFLOW debug logger; the contract to the script stays opaque. - Docstring now explicitly explains the abort-responsiveness path: the apparent Promise.allSettled "wait for all to complete" is in practice "wait for all to reach an abort-aware rejection" because the dispatch signal is threaded all the way down to subagent.execute, and the limiter's separate abort listener drains the not-yet-started queue instantly. env-override hard ceilings (T4 wenshao): - HARD_MAX_AGENTS_PER_RUN_CEILING = 10000 caps QWEN_CODE_MAX_WORKFLOW_AGENTS. - HARD_MAX_CONCURRENCY_CEILING = 64 caps QWEN_CODE_MAX_WORKFLOW_CONCURRENCY. - Both clamp with a debug warning rather than silently dropping the override. - Two RED-verified tests cover the over-ceiling clamp path (and a just-under value preserved). - Not a security issue (env is operator-controlled), but stops a fat-finger =999999999 from silently uncapping the run. reviveInRealm SECURITY comment (T5 wenshao): - Added a SECURITY block warning future maintainers that the revival function MUST stay inside the vm init runInContext block. JSON / Array / Object here are vm-realm globals; extracting this textually-identical helper into a host-side utility would resolve those names against the host realm and silently reopen the T1/T8/T14 escape that the revival is designed to prevent. The textual identity to a host-side util is exactly the trap. Declined this round (review-body 4 pipeline test-coverage sub-claims): - pipeline abort, parallel+pipeline shared-limiter, pipeline 1000-cap, pipeline E2E. After the dispatch-layer refactor in 0401ac88f, parallel and pipeline mechanically share the SAME countedDispatch.limiter.run path — the parallel-side abort / cap / concurrency tests cover the mechanism. Explicit per-shape sibling tests would not catch a regression that the parallel versions don't already catch. 145/145 workflow-suite tests pass; tsc + eslint clean. The 2 config tests fail locally only because the rebase pulled in #4844 (Agent Team)'s new proper-lockfile dependency which the symlinked node_modules doesn't have — CI resolves on fresh install.
Deferred hardening from the #4844 review, all behind the experimental flag: - Recognize a teammate's shutdown approval by a structured signal at send time instead of a free-text regex over its reply, so a reply that merely mentions the token in prose no longer aborts a pending teammate. - Replace the per-session envelope nonce (which the leader model could echo and leak to a teammate) with a stable tag plus structural escaping of the delimiter in untrusted bodies — nothing left to leak. - Bound the leader inbox: consumed messages are marked read so the retention compaction fires, replacing the never-compacting index cursor. No persisted-format change. - Evict per-inbox in-process locks on team teardown. Refs #4844
…4721) (#4947) * feat(core): add createConcurrencyLimiter sliding-window util A p-limit-style concurrency limiter that keeps at most `limit` thunks in flight and starts a queued thunk the instant a slot frees — so one instance can be SHARED across several fan-out calls and still hold the total in-flight count under a single cap. The existing in-repo concurrency control (memoryDiscovery.ts) is fixed-size sequential batching, not a sliding window, so this is a new primitive rather than a refactor. API: - run(thunk): schedule one thunk through the shared window; rejections propagate raw. - settleAll(thunks): batch convenience that resolves to a position-aligned Array<T|null> where a rejected thunk becomes null (errors-as-data) and the ONLY rejection is an abort of the limiter's signal — so an aborted run surfaces as a rejection rather than a silent array of nulls. Guards a non-positive-integer limit (mirrors background-tasks.ts), preserves input order, short-circuits empty input, and (with an AbortSignal) refuses to start new queued work once aborted. 13 unit tests cover the window cap, errors-as-data, order, sharing across calls, and the abort paths. * feat(core): Workflow P2 — parallel() + pipeline() + 1000-agent cap Implements the P2 phase of the dynamic-workflow port (#4721): concurrent fan-out primitives on top of P1's sequential agent(). parallel(thunks) - Runs thunks through a per-run shared sliding window (createConcurrencyLimiter, cap = max(1, min(16, cpus-2)) — the max() guards 1–2 core machines). - Resolves to a position-aligned array; a thunk that throws becomes null at its index (errors-as-data). parallel() itself only rejects on abort, so an aborted run surfaces a rejection rather than a silent array of nulls. - Rejects on a non-function element (eager promise instead of a thunk). pipeline(items, ...stages) - Parallel-of-chains: one thunk per item, all sharing the SAME window, so it is staggered (item A can be in stage 3 while item B is in stage 1) with no inter-stage barrier. Stage callbacks receive (prev, item, idx); the first stage's prev is the item itself. A stage that throws OR returns null drops that item to null and skips its remaining stages. 1000-agent-per-run cap - The orchestrator wraps this.dispatch with a counter, so EVERY agent() call — sequential, parallel, or pipeline — funnels through one chokepoint. A fan-out cannot bypass the cap; the 1001st call throws. SECURITY — vm-realm result revival (closes an uncovered escape) - vmAsync's resolve path is verbatim: it does NOT re-wrap resolved values. The host parallel/pipeline impl resolves with a HOST-realm array, so handing it to the script would reopen the T1/T8/T14 escape (out.constructor.constructor('return process')() reaches host process via the host Array.prototype chain). The vm wrapper now revives the array in-realm with JSON.parse(JSON.stringify(...)) — the same mechanism that makes `args` safe — before the script sees it. The pre-P2 escape test only probed the *Promise* (already vm-realm), not the resolved array; new tests probe the resolved array (outer + nested) and were verified to FAIL against a verbatim wrapper. Real impls are injected by the orchestrator; the sandbox keeps its throwing P1-unsupported stubs as the default when parallel/pipeline are not injected, so an un-wired sandbox still gives a clear error. Tool description updated to document the P2 surface. Full workflow suite + new orchestrator/sandbox/tool tests green; tsc + eslint clean on all touched files. * fix(core): Workflow P2 self-review — per-element revival + honest tool description Adversarial self-review (6-dimension finder fan-out) surfaced two genuine defects plus three test gaps. EAD-1 [major] — reviveInRealm did JSON.parse(JSON.stringify(WHOLE array)), so a single slot whose VALUE is non-serializable (a thunk that returns a BigInt or a circular object) threw on the entire array and REJECTED the whole parallel()/pipeline(), destroying every sibling result. That defeats errors-as-data for return values. Revive PER-ELEMENT instead: a bad slot becomes null at its index, siblings survive, and the outer array is still built in-realm so the host-process escape stays closed. Regression test: parallel([() => 'a', () => 1n, () => 'c', () => circular]) => ['a', null, 'c', null]. API-1 [major] — the WorkflowTool top-level description (passed to super()) still read "No parallel, no pipeline" while the param-schema description and the runtime both now support them. Updated to describe the P2 surface (parallel/pipeline, ≤16 in flight, ≤1000 total). Also refreshed the now-stale "scheduled for P2" messages on the un-injected fallback stubs to an accurate "unavailable: sandbox created without an implementation" wording. Test gaps closed: - TST-1: pipeline() now has a concurrency test proving it shares the SAME per-run window as parallel (peak in-flight === cap), so a pipeline impl that bypassed the shared limiter would fail. - TST-2: pipeline() staggering is now tested — item 0 reaches stage 2 long before item 1's slow stage 1 finishes, proving there is no inter-stage barrier (a stage-by-stage barrier impl would fail the <50ms threshold). - TST-3: mid-flight abort through the orchestrator is now tested (the prior test only used a pre-aborted controller), proving parallel() rejects after dispatches start rather than resolving with a silent array of nulls. 150 workflow-suite tests green; tsc + eslint clean on touched files. * feat(core): make Workflow P2 caps env-overridable Mirror the established QWEN_CODE_MAX_BACKGROUND_AGENTS / P1 QWEN_CODE_MAX_WORKFLOW_SECONDS precedent so operators can tune the P2 caps without a code change: - QWEN_CODE_MAX_WORKFLOW_AGENTS — override the per-run 1000-agent cap. - QWEN_CODE_MAX_WORKFLOW_CONCURRENCY — override the cpu-derived min(16, cpus-2) in-flight window with an explicit integer. Both use the house resolver shape (resolveMaxConcurrentBackgroundAgents): a non-integer / <1 value is rejected with a debug warning and the default is used. The agent cap default is renamed DEFAULT_MAX_AGENTS_PER_RUN and the cap message is now built from the resolved value. Resolvers take an injectable env arg for pure unit testing. Adds resolver unit tests (default / valid / invalid) plus an integration test proving QWEN_CODE_MAX_WORKFLOW_AGENTS=3 makes a 4-thunk parallel() yield exactly 3 results + 1 null at run time. 154 workflow-suite tests green. * fix(core): throttle Workflow P2 concurrency at the dispatch layer, not the thunk layer P2 self-review (independent adversarial fan-out) caught a real deadlock the mock-tested suite missed: the concurrency window was applied at the thunk level (parallel()/pipeline() scheduled their thunks through the shared limiter). A nested fan-out — e.g. `pipeline([items], item => parallel([...]))`, the canonical /deep-research shape — would have every outer slot held by an outer thunk awaiting an inner settleAll() whose thunks can never acquire a slot. pump() only re-runs from an in-flight thunk's finally, so the queue never drains: unrecoverable silent hang until the 30-min wall clock. On 1-3 core machines (limit = 1) a SINGLE nested call deadlocks; abort cannot break it because pump() is never re-invoked. Fix: the window throttles AGENT DISPATCHES, not orchestration thunks. The limiter now wraps `this.dispatch` inside countedDispatch, so only leaf agent() calls acquire a slot; parallel()/pipeline() compose promises freely via a plain Promise.allSettled + position-aligned null-map (settleToNullArray) and cannot deadlock when nested. This is also the correct "N agents in flight per run" semantics (the cap is about concurrent model calls, not orchestration depth) and makes abort prompt (dispatch slots free normally). The limiter's unused settleAll() is removed — its errors-as-data null-mapping + abort-reject moved into settleToNullArray in the orchestrator, where the batch semantics belong. The tool description's "≤16" is softened to "16 by default" now that QWEN_CODE_MAX_WORKFLOW_CONCURRENCY can raise it. Adds two RED-verified regression tests (nested parallel-in-pipeline and parallel-of-parallel, forced to concurrency=1) that deadlocked before the fix and now resolve in ms. 150 workflow-suite tests green; tsc + eslint clean. * fix(core): Workflow P2 round-2 self-review — prompt queue abort + AbortError consistency + doc accuracy Round-2 adversarial review of the post-F1 code (fresh finders + skeptics) found no new critical/major behaviour bugs (architectural convergence after the dispatch-layer concurrency fix), but surfaced two real correctness items plus a doc-accuracy cleanup pass. createConcurrencyLimiter — prompt queue abort - The limiter previously rejected queued jobs only lazily, inside pump(), which re-runs from an in-flight thunk's .finally. So if an in-flight thunk never settled (a buggy/hung future dispatcher), queued jobs would hang forever even after abort. Production today never hits this because subagent.execute always settles, but the limiter shouldn't lean on an unenforced invariant. Now an `{once: true}` 'abort' listener drains the queue the moment the signal fires. Adds a RED-confirmed regression test (limit=1, in-flight = `new Promise(()=>{})`, abort → queued must reject within 200ms). settleToNullArray — abort error type consistency - Was throwing `new Error('Workflow run aborted.')`, which `isAbortError()` (utils/errors.ts) does NOT recognise — an aborted parallel/pipeline would surface as a generic run failure. Now throws `new DOMException('Workflow run aborted.', 'AbortError')` to match the limiter, so the whole P2 abort path classifies uniformly. Doc accuracy pass (review caught 5 stale strings) - Tool descriptions accurately state the default cap is `min(16, cpus-2)` (not a flat "16"), document both env-overrides (QWEN_CODE_MAX_WORKFLOW_CONCURRENCY, QWEN_CODE_MAX_WORKFLOW_AGENTS), and note that a thunk resolving to a non-JSON-serializable value also becomes null at its index. - makeParallelImpl docstring updated: parallel() rejects on invalid input OR abort (not "only on abort" — that was contradicted by the array/function validation right above). - WorkflowTool fileoverview no longer claims "P1 sequential only". - Orchestrator.run() comment updated to describe the actual P2 signal flow (per-run limiter derived from abortOnTimeout, not P1's "sandbox-level signal intentionally not exposed"). - Wall-clock rationale loses its stale "P1 sequential" framing. 151 workflow-suite tests pass (was 150 + 1 new lazy-abort regression); tsc + eslint clean. * docs(core): Workflow P2 round-3 self-review — wall-clock honesty + cpu-floor + symmetric pipeline docs Round-3 adversarial review found one confirmed factual error in the round-2 doc rewrite plus three real consistency gaps. No new behaviour bugs; the architecture has converged after the round-1 dispatch-layer fix and the round-2 prompt-queue-abort + AbortError consistency. (1) wall-clock docstring: the round-2 rewrite claimed "even a long pipeline with the 1000-agent cap is bounded well under" 30 min. Arithmetically false: 1000 agents × 10-min subagent cap ÷ default 16-concurrency ≈ 10.5 hours, 20× the wall clock. Rewritten honestly: the wall clock is a 0-token-hang backstop, NOT a precise cost cap; for cost control point operators at the env-overridable per-run cap (QWEN_CODE_MAX_WORKFLOW_AGENTS) and concurrency window (QWEN_CODE_MAX_WORKFLOW_CONCURRENCY). (2) tool descriptions now show the actual default formula `max(1, min(16, cpus-2))`, including the outer max(1, ...) floor — without it, the displayed default would be -1 on a 1-CPU container even though the runtime clamps to 1. (3) tool descriptions now document the non-JSON-serializable→null rule for pipeline() as well as parallel() — they share the same reviveInRealm code path (per-element JSON round-trip), so the asymmetric docs were inaccurate. (4) settleToNullArray's AbortError comment is corrected: the round-2 commit overclaimed "uniform classification via isAbortError() at the WorkflowTool boundary". In reality the DOMException name is preserved at the HOST callsite inside the orchestrator, but vmAsync re-throws the script-visible rejection as a fresh `new Error(msg)` and the outer catch wraps it as WorkflowExecutionError — so isAbortError() at the tool boundary returns false either way. The DOMException is still useful as host-internal consistency, but the script-observability claim was wrong. Declined this round (intentional, documented): - F4/P2-R3-F2 wall-clock is plain Error, not AbortError — wall-clock IS a timeout, not an abort; semantically correct as-is. A unified abort surface is P3+ work. - R2-MIN-1 limiter listener leak — only triggers if signal outlives the limiter and never aborts; production caller is per-run and always aborts. - F6 new symbols not re-exported from index.ts — same internal-API decision as createConcurrencyLimiter. - Various test-vacuity nits — tests already cover real failure modes. 151 workflow-suite tests pass; tsc + eslint clean. * fix(core): Workflow P2 PR review R1 — observability + hard ceilings + SECURITY comment R1 review by @wenshao (4 [Suggestion] threads + 1 review-body comment). The fifth thread, [Critical] nested deadlock on a thunk-level limiter, was already caught and fixed in commit 0401ac88f by the dispatch-layer refactor — that commit was authored before the review posted, so the thread is on the pre-fix code; verified resolved by independent round-1 self-review fan-out that converged on the same finding from a different angle, and by a real-LLM E2E scenario (parallel-in-pipeline at concurrency=1 against qwen3-max). settleToNullArray observability + abort docs (T2 / T3 wenshao): - settled.map now logs the discarded rejection reason at debug level when a thunk rejects. Operators investigating a workflow that returned unexpected nulls can now disambiguate the four indistinguishable null paths (dispatch failure, 1000-cap, pipeline stage exception, non-JSON-serializable result) via the WORKFLOW debug logger; the contract to the script stays opaque. - Docstring now explicitly explains the abort-responsiveness path: the apparent Promise.allSettled "wait for all to complete" is in practice "wait for all to reach an abort-aware rejection" because the dispatch signal is threaded all the way down to subagent.execute, and the limiter's separate abort listener drains the not-yet-started queue instantly. env-override hard ceilings (T4 wenshao): - HARD_MAX_AGENTS_PER_RUN_CEILING = 10000 caps QWEN_CODE_MAX_WORKFLOW_AGENTS. - HARD_MAX_CONCURRENCY_CEILING = 64 caps QWEN_CODE_MAX_WORKFLOW_CONCURRENCY. - Both clamp with a debug warning rather than silently dropping the override. - Two RED-verified tests cover the over-ceiling clamp path (and a just-under value preserved). - Not a security issue (env is operator-controlled), but stops a fat-finger =999999999 from silently uncapping the run. reviveInRealm SECURITY comment (T5 wenshao): - Added a SECURITY block warning future maintainers that the revival function MUST stay inside the vm init runInContext block. JSON / Array / Object here are vm-realm globals; extracting this textually-identical helper into a host-side utility would resolve those names against the host realm and silently reopen the T1/T8/T14 escape that the revival is designed to prevent. The textual identity to a host-side util is exactly the trap. Declined this round (review-body 4 pipeline test-coverage sub-claims): - pipeline abort, parallel+pipeline shared-limiter, pipeline 1000-cap, pipeline E2E. After the dispatch-layer refactor in 0401ac88f, parallel and pipeline mechanically share the SAME countedDispatch.limiter.run path — the parallel-side abort / cap / concurrency tests cover the mechanism. Explicit per-shape sibling tests would not catch a regression that the parallel versions don't already catch. 145/145 workflow-suite tests pass; tsc + eslint clean. The 2 config tests fail locally only because the rebase pulled in #4844 (Agent Team)'s new proper-lockfile dependency which the symlinked node_modules doesn't have — CI resolves on fresh install. * fix(core): Workflow P2 PR review R2 — staggering test + revival logging + pipeline E2E R2 review by @DragonnZhang (re-review after R1 push) and @qwen-code-ci-bot (post-fix review). Three real items addressed; one style/wording inconsistency in the bot's review body declined per the round-weighted bar. T6 [Bug] staggering test deterministically fails on macOS-14 CI (DragonnZhang) - The test asserted item 0 reaches stage 2 within 50ms while item 1's stage 1 takes 120ms. That timing assumption holds only at concurrency ≥ 2. On GitHub's macos-14 runners (3 CPU cores) cpu-derived concurrency = 1, FIFO forces all stage-1 dispatches to settle before any stage-2 starts, and the ~122ms s2-of-0 timestamp blows the threshold. The test passed on my workstation but blocks the macos-latest CI matrix — root cause of the failing `Test (macos-latest, Node 22.x)` check that downgraded the prior APPROVE. - Replaced with a deterministic gate-based assertion that does NOT depend on wall-clock thresholds: force QWEN_CODE_MAX_WORKFLOW_CONCURRENCY=2, have item 1's stage 1 block on a Promise gate that only item 0's stage 2 can release. A staggered impl completes (item 0 advances while item 1 is held); a barrier impl deadlocks (item 0's stage 2 can't start until item 1's stage 1 finishes, which can't finish until item 0 reaches stage 2). Vitest timeout catches the barrier-deadlock case; the assertion `item0ReachedStage2 === true` is timing-free. T7 [Suggestion] reviveInRealm catch silently sets null with no log (qwen-code-ci-bot) - The R1 fix added debugLogger.warn for rejected thunks in settleToNullArray, but a thunk that *resolves* to a non-JSON-serializable value (BigInt / circular object) takes a different path through reviveInRealm's catch in the vm init script. Operators with debug logging on still couldn't distinguish "rejected" from "resolved-but-unserializable" — symmetric observability was missing. The R1 audit should have caught the sibling and didn't. - Added a host-side `logRevivalFailure(idx, reason)` hook to the bridge (debugLogger.warn host-side) and call it from reviveInRealm's catch with the coerced-to-string error message. The bridge contract is preserved: only primitive strings/numbers cross back; reviveInRealm itself stays inside the vm runInContext block per the SECURITY comment. T8 [Suggestion] no pipeline() end-to-end test through WorkflowTool (qwen-code-ci-bot) - This is the SAME finding wenshao raised in his R1 review-body, which I declined on a "parallel/pipeline share mechanism — symmetric tests redundant" basis. The bot's R2 raise provides specific mechanism evidence that breaks that argument: pipeline's vm wrapper uses `callPipeline.apply(null, arguments)` and `[items].concat(stages)` to spread the variadic stage list, a code path structurally distinct from parallel's single-argument call. A regression in the vm-to-host stage forwarding would not be caught by the parallel E2E. My R1 decline was based on incomplete grep — apologies, accepting now. - Added a pipeline E2E test mirroring the parallel E2E shape: full stack drive through WorkflowTool → orchestrator → sandbox revival, asserting the chained stage results `[11, 21]`. Declined this round (review-body): - qwen-code-ci-bot's workflow() stub wording inconsistency ("not supported in P1" vs the new "is unavailable: ..." on parallel/pipeline). R2 style/ nit per the round-weighted bar; no behavioural impact. 146/146 workflow-suite tests pass; tsc + eslint clean. * chore(core): Workflow P2 — dedupe WORKFLOW debugLogger + simplify revival error coercion Post-R2 /simplify pass. Two findings that are pure cleanup of code added in the R2 commit, with no scope drift: (1) The R2 fix added `createDebugLogger('WORKFLOW')` to workflow-sandbox.ts, duplicating the identical call in workflow-orchestrator.ts:21. Export the sandbox-side instance and import it in orchestrator — single source of truth, one fewer logger object retained for process lifetime. Direction is natural (orchestrator already imports from sandbox; the reverse would be circular). (2) The reviveInRealm catch coercion `String((e && e.message != null) ? e.message : e)` collapses to `String(e?.message ?? e)`. The truthy/null distinction the original drew (treating empty-string message different from the toString fallback) was not meaningful for a debug log line. Same behaviour for any realistic error; less noise to read. Deferred per the same self-review discipline that the R2 commit message documented: - `withEnv` helper to dedupe the 6-line env-var save/restore boilerplate (now 3 sites in workflow-orchestrator.test.ts): real ~10 LOC win but touches 2 pre-existing tests, out of R2 scope. - Rename `bridge.logRevivalFailure` to a generic `bridge.warn(category, msg)` for future vmAsync silent-reject logging: speculative; per the altitude analysis, "zero rename cost when the second consumer arrives" means deferring loses nothing. 146/146 workflow-suite tests pass; tsc + eslint clean.
What this PR does
Adds an experimental Agent Team mode. When enabled, the model can create a named team and spawn several sub-agents ("teammates") that work in parallel, message each other and the leader, and share a task list — and the leader collects their results into a consolidated answer. It is off by default and enabled via the
experimental.agentTeamsetting orQWEN_CODE_ENABLE_AGENT_TEAM=1. With the flag on, the model gains team coordination tools (create/delete a team, send messages to teammates, and create/update/list shared tasks), and the Agent tool gains anameparameter that spawns a long-lived named teammate instead of a one-shot sub-agent. Teammate work and messages surface in both headless and interactive (TUI) modes.Why it's needed
This re-ports the Agent Team feature onto current main. The original branch (#2886) was opened in April and fell roughly 360 commits behind after main independently rewrote the agent runtime it was built on, so it could no longer be merged cleanly — a mechanical merge produced code that did not build. This is a fresh implementation on top of the current runtime, split into small reviewable commits (foundation → message routing → teammate spawning → rendering → turn-loop integration → headless and interactive drivers). It supersedes #2886.
Reviewer Test Plan
How to verify
Enable the flag (
QWEN_CODE_ENABLE_AGENT_TEAM=1orexperimental.agentTeam: true) and give the model a task that benefits from several agents working together — for example: "Review this code for bugs with three reviewers in parallel (one on error handling, one on races, one on edge cases) and give me a consolidated report," or "Coordinate a team to build a small bookmark app." Expected: the model creates a team, spawns named teammates that run concurrently, the teammates report back to the leader as they finish, and the leader produces a single consolidated summary before the team is cleaned up. With the flag off (the default), none of the team tools are registered and behavior is unchanged.Evidence (Before & After)
See the E2E summary in the comment below. Before: the feature did not exist on main. After: the full team flow (create → spawn teammates → concurrent work → report back → consolidate → cleanup) runs end-to-end against a live model.
Tested on
macOS: built, unit + integration suites, and live headless E2E. Windows/Linux: covered by CI.
Environment
Built bundle run headless against a live model (glm-5.1); unit and integration tests via vitest.
Risk & Scope
Linked Issues
Supersedes #2886.
中文说明
这个 PR 做了什么
新增一个实验性的 Agent Team(智能体团队)模式。启用后,模型可以创建一个具名团队,并派生多个子智能体("队友")并行工作、互相以及与 leader 之间收发消息、共享任务列表,最后由 leader 汇总各队友的结果给出统一的答复。该功能默认关闭,可通过
experimental.agentTeam设置或QWEN_CODE_ENABLE_AGENT_TEAM=1环境变量启用。开启后,模型会获得团队协作工具(创建/删除团队、向队友发送消息、创建/更新/列出共享任务),并且 Agent 工具新增一个name参数,用于派生一个长期存在的具名队友,而不是一次性的子智能体。队友的工作与消息在无界面(headless)和交互式(TUI)两种模式下都会呈现。为什么需要它
本 PR 是将 Agent Team 功能重新移植到当前 main 分支。原分支(#2886)于四月开立,此后 main 独立重写了它所依赖的 agent 运行时,使其落后约 360 个提交,已无法干净合并——机械式合并会产生无法构建的代码。本 PR 是在当前运行时之上的全新实现,并拆分为多个便于审阅的小提交(基础设施 → 消息路由 → 队友派生 → 渲染 → 回合循环集成 → 无界面与交互式驱动)。它取代 #2886。
审阅者测试计划
如何验证
启用开关(
QWEN_CODE_ENABLE_AGENT_TEAM=1或experimental.agentTeam: true),给模型一个适合多个智能体协作的任务——例如:"让三位审阅者并行审查这段代码的 bug(一人看错误处理、一人看竞态、一人看边界情况),并给我一份汇总报告",或"协调一个团队构建一个小型书签应用"。预期:模型会创建团队,派生并发运行的具名队友,队友完成后向 leader 汇报,leader 在团队清理前给出一份统一的汇总。关闭开关(默认)时,团队工具不会被注册,行为保持不变。证据(前后对比)
见下方评论中的 E2E 测试摘要。之前:该功能在 main 上不存在。之后:完整的团队流程(创建 → 派生队友 → 并发工作 → 汇报 → 汇总 → 清理)可针对真实模型端到端运行。
测试平台
macOS:已构建,并通过单元 + 集成测试与真实模型的无界面 E2E。Windows/Linux:由 CI 覆盖。
风险与范围
关联 Issue
取代 #2886。