Feature Request: Port Dynamic Workflows / Ultracode from Claude Code 2.1.160 #4721

@LaZzyMan

What would you like to be added?

Port the Dynamic Workflows feature (announced by Anthropic in Claude Code 2.1.160) to qwen-code as a third tier of multi-agent execution, complementary to the existing /swarm tool (#3433) and the in-progress Agent Team (#2886).

What a dynamic workflow is

A model-authored JavaScript script that runs in a sandbox and orchestrates many subagents through a small set of primitives. The model writes the script on-the-fly for the user's request; the runtime sandboxes it; subagents fan out through the existing headless-agent path; one aggregated result returns to the main conversation.

The full API surface (all confirmed against upstream's published /deep-research workflow script and binary strings):

// Required first statement of every script
export const meta = {
  name: string,
  description: string,
  whenToUse?: string,
  phases?: Array<{ title: string, detail?: string, model?: string }>,
}

// Injected globals
phase(title: string): void
parallel<T>(thunks: Array<() => Promise<T>>): Promise<Array<T | null>>
pipeline<T>(items: T[], ...stages: Array<(prev, item: T, idx: number) => Promise<any>>): Promise<any[]>
agent(prompt: string, opts?: {
  label?: string, phase?: string, schema?: object,
  model?: string, isolation?: 'worktree' | 'remote', agentType?: string,
}): Promise<any>
log(message: string): void
workflow(nameOrRef: string | { scriptPath: string }, args?: any): Promise<any>

args: any
budget: { total: number | null, spent(): number, remaining(): number }

// Stubbed (throw) to guarantee resume determinism
Date.now(), new Date(), Math.random()

Concrete example from Anthropic's shipped /deep-research workflow:

phase('Scope')
const scope = await agent('Decompose this research question into 5 search angles...', { schema: SCOPE_SCHEMA })

const searchResults = await pipeline(
  scope.angles,
  angle => agent(SEARCH_PROMPT(angle), { phase: 'Search', schema: SEARCH_SCHEMA }),
  searchResult => parallel(novelUrls(searchResult).map(src => () =>
    agent(FETCH_PROMPT(src), { phase: 'Fetch', schema: EXTRACT_SCHEMA }))),
)

phase('Verify')
const voted = await parallel(rankedClaims.map(claim => () =>
  parallel(Array.from({ length: 3 }, (_, v) => () =>
    agent(VERIFY_PROMPT(claim, v), { phase: 'Verify', schema: VERDICT_SCHEMA })))
    .then(verdicts => ({ ...claim, survives: verdicts.filter(v => v.refuted).length < 2 }))))

phase('Synthesize')
const report = await agent('Merge confirmed claims; write report...', { schema: REPORT_SCHEMA })
return { question, ...report }
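
The non-barrier behaviour of pipeline() in the excerpt above can be pictured with a small sketch. This is not upstream's implementation: the name pipelineSketch, seeding the first stage's prev with the item itself (inferred from how the excerpt's first stage reads its argument), and the null-on-failure policy are all assumptions made for illustration.

```javascript
// Hypothetical sketch of a non-barrier pipeline: every item walks through
// all stages on its own, so a fast item can finish the last stage while a
// slow item is still in the first one. Not the upstream implementation.
async function pipelineSketch(items, ...stages) {
  const runOne = async (item, idx) => {
    let prev = item; // assumption: the first stage receives the item as `prev`
    for (const stage of stages) {
      prev = await stage(prev, item, idx);
    }
    return prev;
  };
  // Assumption: a failing item yields null instead of rejecting the whole run.
  return Promise.all(
    items.map((item, idx) => runOne(item, idx).catch(() => null)),
  );
}
```

Because there is no barrier between stages, total wall-clock time is bounded by the slowest single item's path rather than by the sum of each stage's slowest item.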

Triggers

Four invocation paths matching upstream:

  1. Keyword in prompt — including the word workflow in a single-turn user prompt opts that turn into a workflow.
  2. /effort ultracode (session-only mode) — once enabled, the model auto-spawns workflows for substantive turns until session ends.
  3. Saved slash command — workflow scripts at .qwen/workflows/ (project) or ~/.qwen/workflows/ (user) are surfaced as slash commands (the /deep-research invocation style).
  4. Direct Workflow tool call by the model with inline script or name.

Hard caps (verbatim from upstream)

  • Concurrent agents: min(16, os.cpus().length - 2) per workflow
  • Total agent calls: 1000 per workflow lifetime
  • Schema-mismatch nudges: 2 in-conversation nudges per agent (binary line 307789 — error message reads "subagent completed without calling StructuredOutput (after 2 in-conversation nudges)"). Distinct from the stall-retry counter VOK = 5 which fires on no-progress agents, not schema validation failures.
  • Single nesting level for workflow() (workflow inside a child workflow throws)
  • Same-session resume only (in initial scope; cross-session is a v1.5 candidate, see decisions below)
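
As a rough illustration of how the first two caps could be enforced together, the sketch below gates every dispatched task through one limiter. The name createLimiter, the error text, and the slot hand-off strategy are assumptions for illustration, not taken from upstream or from the qwen-code implementation.

```javascript
// Hypothetical limiter: at most `maxConcurrent` tasks in flight and at most
// `maxTotal` tasks over the limiter's lifetime.
function createLimiter(maxConcurrent, maxTotal) {
  let active = 0;
  let total = 0;
  const waiting = [];

  const release = () => {
    const next = waiting.shift();
    if (next) {
      next(); // hand the slot directly to the next waiter (active unchanged)
    } else {
      active--;
    }
  };

  return async function run(task) {
    if (++total > maxTotal) {
      throw new Error(`agent call cap exceeded (${maxTotal})`);
    }
    if (active >= maxConcurrent) {
      // Wait until a finishing task hands over its slot.
      await new Promise(resolve => waiting.push(resolve));
    } else {
      active++;
    }
    try {
      return await task();
    } finally {
      release();
    }
  };
}

// Usage with the caps listed above: 16 concurrent, 1000 total.
const runAgent = createLimiter(16, 1000);
```

Handing the slot straight to the next waiter, instead of decrementing and letting callers race for it, keeps the in-flight count from briefly exceeding the cap.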

Why is this needed?

What dynamic workflow does that swarm and Agent Team don't

| Capability | /swarm (shipped, #3433) | Agent Team (PR #2886) | Dynamic Workflow (this proposal) |
| --- | --- | --- | --- |
| Programming model | Declarative tasks[] array | Imperative team / mailbox API | Imperative JS script |
| Multi-phase orchestration | ❌ single shot | ✅ via task board state | ✅ via phase() |
| Pipeline (staggered, non-barrier) across stages | | | ✅ via pipeline() |
| Structured-output / schema-validated agents | | | ✅ via agent({ schema }) |
| Resume / cached-prefix re-entry | | | ✅ longest-unchanged-prefix |
| Programmable token budget | | | budget.total/spent/remaining |
| Nested / saved workflows | | | ✅ via workflow(name, args) |
| Inter-agent communication | | ✅ peer-to-peer mailbox | ❌ results return to script |
| Lifecycle | Ephemeral | Persistent collaboration | Ephemeral per-script |

Dynamic workflow is strictly additive:

  • Keep /swarm for the simple "fan out 5 tasks, aggregate" case where a script is overkill.
  • Keep Agent Team for persistent peer-to-peer collaboration with mailboxes.
  • Add Workflow for the rich multi-phase orchestration case — matching upstream's Agent + Workflow coexistence.

Use cases this unlocks

  • Adversarial verification (deep-research-style 3-vote claim refutation against schema-validated claims)
  • Map-reduce with typed outputs (extract findings as JSON objects, not freeform text)
  • Multi-stage pipelines without per-stage barriers (item A in stage 3 while item B is still in stage 1 — dramatically reduces wall-clock for long fan-outs)
  • Resumable long runs (pause/restart without re-burning the agents that already completed)
  • Cost-bounded loops (while (budget.remaining() > 50_000) { spawn more verifiers })
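
The cost-bounded loop in the last bullet might look like the sketch below. Only the { total, spent(), remaining() } shape comes from the API surface above; createBudget, the charge() hook, treating a null total as unlimited, and verifyWhileAffordable are hypothetical names and choices made for illustration.

```javascript
// Hypothetical budget object with the documented shape
// { total, spent(), remaining() }. The accounting is illustrative only.
function createBudget(total) {
  let spent = 0;
  return {
    total,
    spent: () => spent,
    // Assumption: a null total means "no ceiling".
    remaining: () => (total === null ? Infinity : Math.max(0, total - spent)),
    // Hypothetical runtime-side hook; scripts would not call this themselves.
    charge(tokens) { spent += tokens; },
  };
}

// Keep spawning verifiers while more than `reserve` tokens remain.
// With an unlimited budget this loop never ends, so a real script
// would also want an iteration cap.
async function verifyWhileAffordable(budget, spawnVerifier, reserve = 50_000) {
  const verdicts = [];
  while (budget.remaining() > reserve) {
    verdicts.push(await spawnVerifier());
  }
  return verdicts;
}
```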

Why the infrastructure cost is small

The implementation reuses most of what qwen-code already ships:

| Subsystem | Existing PR / file | Reuse path |
| --- | --- | --- |
| Headless subagent dispatch | #3076 #3970, packages/core/src/agents/runtime/agent-headless.ts | Each agent() → AgentHeadless.create() |
| Background-task envelope + <task-notification> | #3471 #3488 #3739, packages/core/src/agents/background-tasks.ts | Add 'workflow' to TaskKind |
| Notification routing into main loop | #3471, SendMessageType.Notification | Zero change |
| BackgroundTasksPill / dialog / live panel | #3488 #3768 #4477 | Extend KIND_NAMES with 'workflow' |
| Per-source token aggregation | UiTelemetryService.bySource | Key by phase:label |
| Background-agent resume (1068 LOC, battle-tested) | #3739 | Reusable for cross-session resume v1.5 |
| Concurrency cap pattern | #4324, QWEN_CODE_MAX_BACKGROUND_AGENTS | Workflow-scoped variant |
| MCP propagation to subagents | shared Config.getToolRegistry() | Zero change |
| Slash command registration | packages/cli/src/services/BuiltinCommandLoader.ts | Append /workflows, /effort |
| Subagent system-prompt template | existing built-in agents | New WORKFLOW_SUBAGENT_PROMPT constant |

The only genuinely new subsystem is a node:vm-based JS sandbox (~150–250 LOC) that injects the documented globals and stubs Date.now / Math.random for resume determinism. Everything else is wiring.
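
A minimal sketch of such a sandbox, assuming nothing about the real implementation beyond what the paragraph above states: runWorkflowScript, the error messages, and the async-wrapper trick are illustrative choices, and the stub for new Date() is left out for brevity.

```javascript
const vm = require('node:vm');

// Hypothetical sandbox runner: injects the given globals into a fresh
// context and makes clock/random access throw. node:vm is not a security
// boundary; this only narrows what the script can reach by name.
function runWorkflowScript(source, globals) {
  // process / require / fs / child_process are simply never injected.
  const context = vm.createContext({ ...globals });

  // Replace the non-deterministic built-ins inside the context's own realm.
  vm.runInContext(
    `Date.now = () => { throw new Error('Date.now() is disabled in workflows'); };
     Math.random = () => { throw new Error('Math.random() is disabled in workflows'); };`,
    context,
  );

  // Wrap in an async function so the script can use await and return.
  const wrapped = `(async () => {\n${source}\n})()`;
  // The timeout only covers synchronous execution, not awaited work.
  return vm.runInContext(wrapped, context, { timeout: 1000 });
}
```

A call such as runWorkflowScript("log('start'); return await agent('hello')", { log, agent }) resolves with whatever the injected agent function returns, while the script itself sees typeof process === 'undefined'.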

Additional context

Local design artifacts

A full design pass has been completed against upstream @anthropic-ai/claude-code@2.1.160, with the API surface live-verified against the actual /deep-research workflow script captured from ~/.claude/projects/<session>/workflows/scripts/ plus binary strings cross-check. The following artifacts will be committed alongside the implementation PR:

  • Main design doc (788 lines): .qwen/design/dynamic-workflow-alignment-claude-code-2.1.160.md — per-axis upstream findings, qwen-code fit matrix (11 subsystems), gap analysis, phased plan, risks
  • Live-probe delta (~280 lines): .qwen/design/dynamic-workflow-alignment-claude-code-2.1.160-liveprobe-delta.md — API surface confirmation, with binary line citations and /deep-research source code references
  • E2E test plan (812 lines): .qwen/e2e-tests/dynamic-workflow-alignment.md — per-phase scenarios with stub-server harness

Every API signature in the design is confirmed against either Anthropic's published /deep-research workflow source or the binary's literal tool-description constant.

Phased implementation plan

Each phase is independently shippable behind a feature gate:

| Phase | Scope | Est. LOC |
| --- | --- | --- |
| P1 | Minimal Workflow tool: node:vm sandbox + sequential agent() + phase() + log(); foreground; no parallel/pipeline/schema/budget/resume | ~600 |
| P2 | parallel(thunks) + pipeline(items, ...stages) + 16-concurrent / 1000-total caps + errors-as-data | ~300 |
| P3 | agent({ schema, agentType }) → forced StructuredOutput contract + 2-nudge in-conversation retry on schema validation failure; agentType resolves against the declarative-agents registry from #4821 (graceful fallback to the built-in workflow subagent if agentType is unset or fails to resolve) | ~300 |
| P4 | Extract meta ({name, description, whenToUse?, phases?[{title, detail?, model?}]}) before stripping it from the script source (replaces P1's stripExportMeta with extractAndStripMeta); /workflows slash command + phase-tree progress UI + BackgroundTasksPill KIND_NAMES extension | ~400 |
| P5 | budget global + per-phase token rollup + optional per-run token ceiling | ~200 |
| P6 | Resume via longest-unchanged-prefix cache; JSONL journal under <projectDir>/workflows/<sessionId>/ | ~400 |
| P7 (optional) | Ultracode session-mode toggle + workflowKeywordTriggerEnabled keyword trigger | ~200 |
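
The errors-as-data behaviour named in the P2 scope, together with the Promise<Array<T | null>> return type of parallel(), suggests that a failing thunk surfaces as null in its own slot rather than rejecting the whole call. A minimal sketch of that contract follows; parallelSketch and the onReject hook are hypothetical names, and no concurrency throttling is shown.

```javascript
// Hypothetical sketch of parallel(thunks) with errors-as-data: a thunk that
// rejects (or throws synchronously) becomes null at its index.
async function parallelSketch(thunks, onReject = () => {}) {
  // Wrapping each call in an async function turns sync throws into rejections.
  const settled = await Promise.allSettled(thunks.map(async thunk => thunk()));
  return settled.map((outcome, index) => {
    if (outcome.status === 'fulfilled') return outcome.value;
    onReject(index, outcome.reason); // hypothetical observability hook
    return null;
  });
}
```

Callers then filter out the nulls (or count them) instead of wrapping every fan-out in try/catch.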

Decisions needed before P1 (qwen-specific divergences from upstream)

These need maintainer sign-off because they leak into settings shape, env var names, and UI surface:

  1. JS sandbox choice: node:vm (zero dep, weak isolation; matches upstream's defense-in-depth posture, since the script has no fs/shell surface by design) vs isolated-vm (strong V8-level isolate, but adds a native dep that breaks the "pure JS, single npm install" property and requires prebuilt binaries for all platforms). Recommendation: node:vm for v1, escalate to isolated-vm only if a hostile-script threat model emerges.

  2. Keyword trigger default: upstream sets workflowKeywordTriggerEnabled = true. qwen-code users skew cost-sensitive across DashScope / OpenAI-compatible providers. Recommendation: default false on qwen-code, require explicit opt-in via settings. Diverges from upstream.

  3. Per-run token ceiling: upstream has only agent-count caps (16 concurrent / 1000 total) — no programmable token limit. Recommendation: add a qwen-only QWEN_CODE_MAX_TOKENS_PER_WORKFLOW env var, default unset, as a safety net. Diverges from upstream.

  4. Saved-workflows directory: .qwen/workflows/ + ~/.qwen/workflows/ (matches qwen convention) vs .claude/workflows/ + ~/.claude/workflows/ (matches upstream literal paths and aids portability of shared workflows across tools). Recommendation: .qwen/ paths; copy-pasted upstream workflows need a path adjustment.

  5. Cross-session resume: upstream is strictly same-session only. qwen-code already ships cross-session background-agent resume (#3739, "Add background agent resume and continuation", 1068 LOC). Recommendation: ship same-session in v1 to match upstream, extend to cross-session in v1.5 as a qwen-only improvement.

  6. Ultracode persistence semantics: upstream's ultracode: true is session-only (does not persist across sessions). qwen-code's settings layer has no "session-only key" concept today. Recommendation: match upstream — require re-toggle per session, document it.

Risks

  1. JS sandbox security: node:vm is not a true security boundary. Mitigated by the fact that the script has no fs/shell surface by design (only spawned agents do I/O); we enforce this by not injecting process / require / fs / child_process into the context. Escalation path to isolated-vm if a real attacker model emerges. Workflow scripts are model-authored, not arbitrary user input.

  2. Token cost amplification — 16× concurrency × deep nesting × shared plan billing can burn quota fast. Mitigated by agent-count caps (16 / 1000), optional QWEN_CODE_MAX_TOKENS_PER_WORKFLOW ceiling, one-time consent banner via skipWorkflowUsageWarning setting, and the keyword-trigger default flip.

  3. Subagent state leakage — subagents share Config / ToolRegistry. Concurrent agents could leak through mutable per-call state in custom MCP servers. Mitigated by auditing Config for mutable per-call state and recommending isolation: 'worktree' for workflows that mutate files concurrently.

Known P1 limitations (deferred to later phases)

Surfaced during PR #4732 R7 review by @DragonnZhang; documented here so they don't get re-raised in subsequent rounds:

  1. In-script async microtask leak after wall-clock timeout — once an in-script async loop (e.g., (async () => { while(true) await 0 })()) starts inside the node:vm context, the wall-clock Promise.race rejects user-side but the microtask loop continues consuming host microtasks. node:vm provides no mechanism to halt async execution once started. Mitigated by the 30-min default cap (QWEN_CODE_MAX_WORKFLOW_SECONDS) and the opt-in feature gate. Proper fix: migrate the sandbox to worker_threads isolation in a future phase, where worker termination drops all in-flight microtasks.

  2. No memory cap on the vm context: node:vm does not enforce a memory limit, so a script like const a=[]; while(true) a.push(new ArrayBuffer(1e8)) can OOM the host process. Operator mitigation: --max-old-space-size flag on the parent Node process. Same proper fix as (1): worker_threads isolation gives the worker its own heap with a resourceLimits.maxOldGenerationSizeMb cap.

Both are acceptable for P1 given the opt-in gate, ask permission level, and 30-min wall-clock backstop — but should be addressed alongside any future phase that loosens the gating (e.g., P7 keyword trigger / ultracode session-mode would broaden the activation surface and make stricter sandbox isolation more important).
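
The worker_threads direction named as the proper fix for both limitations could be sketched roughly as follows. runInWorker and its option names are hypothetical; only Worker, the eval and resourceLimits options, and terminate() are real Node.js APIs, and this is not the planned qwen-code design.

```javascript
const { Worker } = require('node:worker_threads');

// Hypothetical runner: executes `source` in a worker thread with its own
// heap cap, and terminates the thread when the wall-clock limit is reached.
function runInWorker(source, { timeoutMs, maxOldGenerationSizeMb }) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(source, {
      eval: true, // treat `source` as code rather than a file path
      resourceLimits: { maxOldGenerationSizeMb },
    });
    const timer = setTimeout(() => {
      // Unlike a node:vm context, the whole thread can be stopped.
      worker.terminate();
      reject(new Error(`workflow timed out after ${timeoutMs} ms`));
    }, timeoutMs);
    worker.once('message', value => { clearTimeout(timer); resolve(value); });
    worker.once('error', err => { clearTimeout(timer); reject(err); });
  });
}
```

In this sketch the script reports its result with parentPort.postMessage(); a real port would also have to proxy agent(), phase(), and the other globals across the thread boundary, which is the main cost of this approach.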

Relation to existing features (not replaced)

Related upstream ports (coordinate with)

Upstream references

Acceptance criteria


Update — 2026-06-12: P1 + P2 shipped, 2.1.168 reverse pass

Captures actual shipped state and the deltas found in the Claude Code 2.1.168 binary scan. The original plan, decisions, and risks above remain authoritative for the unshipped phases; this section adds (i) what actually shipped vs the original LOC/scope estimate, (ii) what changed upstream between 2.1.160 and 2.1.168, and (iii) confirmed adjacent-infrastructure reuse paths for P3–P7.

Shipped phases

| Phase | PR | Merged | Source LOC | Total LOC (with tests) | Plan LOC | Scope verdict |
| --- | --- | --- | --- | --- | --- | --- |
| P1 | #4732 | 2026-06-09 | ~1207 | 3112 | ~600 | On plan, 0 missing features. Tests = 1905 LOC (61% of total). |
| P2 | #4947 | 2026-06-12 06:16 UTC | ~541 | 1173 | ~300 | On plan, 0 missing features. Tests = 668 LOC (57% of total). |

LOC over-run vs the original ~600 / ~300 estimates is dominated by test coverage and review-round security hardening; no unplanned feature shipped. The "extras" below are all positive drift discovered during review.

P1 positive drift (beyond original plan):

  • 30-min async wall-clock timeout (QWEN_CODE_MAX_WORKFLOW_SECONDS, default 1800s) — catches 0-token hangs that vm timeout and the future budget cap cannot reach (T23 R2)
  • AbortSignal threading into subagent.execute so wall-clock abort propagates into in-flight subagents (T40 R4)
  • abortOnTimeout child-controller injection seam for explicit timeout coordination (T40 R4)

P2 positive drift (beyond original plan):

  • Hard ceilings on env-overridable caps: HARD_MAX_AGENTS_PER_RUN_CEILING = 10000, HARD_MAX_CONCURRENCY_CEILING = 64 — prevents fat-finger misconfiguration uncapping a runaway workflow (R1 wenshao T4)
  • Per-element vm-realm JSON revival of parallel/pipeline results instead of whole-array — closes T1/T8/T14 escape: a single non-serializable thunk result no longer wipes out sibling results (R1 self-review EAD-1)
  • Dispatch-layer concurrency throttling (not thunk-layer) — prevents nested parallel-in-pipeline deadlock (the canonical /deep-research shape); verified with a gate-based RED test on concurrency=1 (F1 fix)
  • Observability: debugLogger.warn for rejected thunks (R1), logRevivalFailure hook for non-serializable results (R2) — disambiguates "null at index" failure modes
  • Limiter prompt-queue abort listener — strengthens limiter invariant when an in-flight thunk hangs (R2)
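
The per-element revival idea from the second bullet can be illustrated with a plain JSON round-trip. This sketch leaves out the vm-realm crossing entirely; reviveEach and the shape of the logRevivalFailure callback are assumptions for illustration.

```javascript
// Hypothetical per-element revival: each result is JSON round-tripped on its
// own, so one non-serializable value cannot wipe out its siblings.
function reviveEach(results, logRevivalFailure = () => {}) {
  return results.map((value, index) => {
    try {
      const json = JSON.stringify(value);
      if (json === undefined) {
        // JSON.stringify returns undefined for functions, symbols, undefined.
        throw new TypeError('value is not JSON-serializable');
      }
      return JSON.parse(json);
    } catch (err) {
      logRevivalFailure(index, err);
      return null; // errors-as-data: only this slot is lost
    }
  });
}
```

Reviving the whole array in one JSON.stringify call would instead throw on the first bad element and discard every sibling result.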

Shipped env vars and caps

| Env var | Default | Hard ceiling | Notes |
| --- | --- | --- | --- |
| QWEN_CODE_ENABLE_WORKFLOWS | unset (off) | n/a | '1' to enable (or enableWorkflows: true setting) |
| QWEN_CODE_DISABLE_WORKFLOWS | unset | n/a | '1' is a force-disable kill switch |
| QWEN_CODE_MAX_WORKFLOW_CONCURRENCY | max(1, min(16, cpus-2)) | 64 | sliding window per run |
| QWEN_CODE_MAX_WORKFLOW_AGENTS | 1000 | 10000 | total agent() calls per run |
| QWEN_CODE_MAX_WORKFLOW_SECONDS | 1800 (30 min) | n/a | wall-clock per run |
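
One way to read the default and hard-ceiling columns is as a clamp applied when the env var is parsed. In the sketch below, resolveCap and its fallback-to-default behaviour for unparseable or non-positive values are assumptions; the shipped parsing rules may differ.

```javascript
const os = require('node:os');

// Hypothetical resolver for an env-overridable cap with a hard ceiling.
function resolveCap(rawValue, defaultValue, hardCeiling) {
  const parsed = Number.parseInt(rawValue ?? '', 10);
  // Assumption: unset, non-numeric, or non-positive values use the default.
  if (!Number.isFinite(parsed) || parsed < 1) return defaultValue;
  return Math.min(parsed, hardCeiling);
}

const defaultConcurrency = Math.max(1, Math.min(16, os.cpus().length - 2));
const concurrency = resolveCap(
  process.env.QWEN_CODE_MAX_WORKFLOW_CONCURRENCY, defaultConcurrency, 64);
const maxAgents = resolveCap(
  process.env.QWEN_CODE_MAX_WORKFLOW_AGENTS, 1000, 10000);
```

With this shape, a mistyped value such as 100000 for the agent cap resolves to 10000 rather than effectively removing the limit.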

P3+ injection seams already pre-wired in P1/P2

  • agent({schema}), agent({agentType}), agent({isolation}), agent({model}) are all STUB-THROW today — P3 replaces the throw with the real implementation, no sandbox re-opening needed
  • SandboxOptions.budget interface is wired (default spent()/remaining() throw) — P5 injects the real implementation through the existing seam
  • SandboxOptions.parallel / SandboxOptions.pipeline are populated (P2)
  • resumeFromRunId + JSONL journal: NOT pre-wired — P6 is a net-new subsystem
  • Ultracode session mode + workflowKeywordTriggerEnabled: NOT pre-wired — P7 is net-new

Claude Code 2.1.168 reverse pass — deltas vs 2.1.160 baseline

Binary strings cross-compared across 2.1.161 / 2.1.162 / 2.1.168 against the 2.1.160-documented baseline at the top of this issue.

Unchanged: concurrent cap max(1, min(16, cpus-2)), per-run agent cap 1000, schema nudge count 2 in-conversation, wall-clock 30 min, single-level workflow nesting.

New upstream features (post-2.1.160) confirmed shipped:

| Feature | Evidence in binary | Affects qwen plan |
| --- | --- | --- |
| agent({schema}) enforcement | error string: "subagent completed without calling StructuredOutput (after 2 in-conversation nudges)" | P3: contract locked, error msg should match |
| agent({agentType}) | error string: "agent({agentType}): agent type '{agentType}' not found" | P3: contract locked |
| agent({isolation:'worktree'}) | strings present | P3: match implementation |
| agent({isolation:'remote'}) | error string: "agent({isolation:'remote'}) is not available in this build" | P3: keep parity, document as not-available |
| Budget telemetry | tengu_workflow_budget_cap_exceeded | P5: telemetry name aligned |
| Resume telemetry | tengu_workflow_journal_started_hit_respawn | P6: pattern validated |
| Agent memory (2.1.168 new) | Scope for auto-loading agent memory files. 'user' - ~/.claude/agent-memory/<agentType>/, 'project' - .claude/agent-memory/<agentType>/ | NEW: not in 2.1.160 baseline; candidate for a post-P7 follow-up |

Verdict: Workflow surface is stable across 2.1.160 → 2.1.168 for everything #4721 covers. Upstream has already shipped P3 / P5 / P6 (only isolation:'remote' is gated off), giving qwen a clear contract to match. Agent-memory scoping by agentType is new in 2.1.168 — out of #4721's original scope, candidate for a post-P7 follow-up if there's appetite.
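
The P3 contract implied by the first error string (a structured result, or failure after two nudges) can be sketched as a small retry loop. The session object with a send() method, the structuredOutput field, the nudge wording, and the isValid callback are all hypothetical stand-ins; only the final error message is taken from the binary string quoted above.

```javascript
const MAX_NUDGES = 2;
// Hypothetical nudge wording; upstream's actual text is not documented here.
const NUDGE = 'Reply by calling StructuredOutput with a value that matches the schema.';

// `session.send(message)` is a hypothetical interface that resolves to
// { structuredOutput?: unknown }. `isValid` stands in for schema validation.
async function runWithSchema(session, prompt, isValid) {
  let reply = await session.send(prompt);
  for (let nudges = 0; ; nudges++) {
    const output = reply.structuredOutput;
    if (output !== undefined && isValid(output)) return output;
    if (nudges === MAX_NUDGES) break; // nudge budget exhausted
    reply = await session.send(NUDGE);
  }
  throw new Error(
    `subagent completed without calling StructuredOutput (after ${MAX_NUDGES} in-conversation nudges)`,
  );
}
```

Keeping the nudges inside the same conversation lets the subagent correct its output without re-running the work it has already done.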

Adjacent infrastructure reuse confirmed on origin/main

| Phase | Existing subsystem | Reuse path | Estimated savings |
| --- | --- | --- | --- |
| P3 schema + agentType | SubagentManager + agent-frontmatter-schema.ts (#4842 + #4996 already on main, CC 2.1.168 parity) | await config.getSubagentManager().findSubagentByName(agentType, 'project') returns SubagentConfig carrying permissionMode, maxTurns, mcpServers, hooks, tools, disallowedTools, color; merge into workflow's hardcoded disallow floor | ~300 LOC |
| P3 isolation:'worktree' | qwen-code worktree subsystem (already in use elsewhere) | Wire agent({isolation:'worktree'}) to spawn subagent in a fresh worktree; reuse cleanup helpers | (new wiring, not raw reuse) |
| P4 UI | BackgroundTasksPill + BuiltinCommandLoader | Extend TaskKind union with 'workflow', append to KIND_NAMES map, register workflowsCommand mirroring tasksCommand / hookCommand patterns | ~200 LOC |
| P5 budget | AgentStatistics per-agent token tracker | WorkflowBudgetImpl.spent() sums per-phase agent tokens; populate the existing P1 SandboxOptions.budget seam | ~100-150 LOC |
| P6 resume | jsonl-utils (writeLine/readLines/countLines) + FileHistoryService serialization pattern (#4897) | New WorkflowStateRecord JSONL type; append on each phase completion; on resume, longest-prefix-matched replay | ~200 LOC |
| P7 keyword trigger | config.ts settings layer | workflowKeywordTriggerEnabled mirroring agentTeamEnabled / forkSubagentEnabled patterns | ~50 LOC |
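
For the P6 row, one possible reading of "longest-prefix-matched replay" is sketched below: agent() calls are keyed in invocation order, cached results are replayed while the keys still match the journal, and the first mismatch forces every later call to run again. The key derivation, the record shape, and the names callKey and withPrefixCache are assumptions for illustration, not the planned WorkflowStateRecord format.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical call key: a hash of one agent() call's prompt and options.
const callKey = (prompt, opts = {}) =>
  createHash('sha256').update(JSON.stringify([prompt, opts])).digest('hex');

// Wraps a real agent function. Calls replay from `previousJournal` while the
// sequence of keys is unchanged; after the first mismatch everything re-runs.
function withPrefixCache(previousJournal, realAgent) {
  const journal = []; // records for this run, one per agent() call
  let cursor = 0;
  let prefixIntact = true;

  async function agent(prompt, opts) {
    const index = cursor++; // assigned synchronously, in call order
    const key = callKey(prompt, opts);
    const cached = previousJournal[index];
    if (prefixIntact && cached !== undefined && cached.key === key) {
      journal[index] = cached;
      return cached.result;
    }
    prefixIntact = false;
    const result = await realAgent(prompt, opts);
    journal[index] = { key, result };
    return result;
  }

  return { agent, journal };
}
```

This scheme only works if the script issues the same calls in the same order on every run, which is why the sandbox makes Date.now() and Math.random() throw. Each journal entry could be written as one JSON line per record.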

Agent Team (#4844) integration risk = LOW: Team and Workflow share the same SubagentManager registry so agentType semantics align, but their call sites (team.spawn() vs workflow.agent()) don't overlap. Documentation will need to disambiguate "which to use when".

Refined phase plan (no scope drift, refined LOC estimates)

The plan above remains the authoritative scope; this is a refined estimate based on the confirmed reuse paths. P3 ships as a single PR covering schema + agentType + isolation:'worktree' rather than splitting — the model-facing API and the sandbox-execution wiring are easier to review together than as two coupled PRs.

| Phase | Refined LOC est. (src + tests) | Net-new subsystems | Notes |
| --- | --- | --- | --- |
| P3 schema + agentType + isolation:'worktree' | ~1200-1500 | none (all reuse) | Wires existing SubagentManager, existing worktree subsystem, the new StructuredOutput contract |
| P4 /workflows + UI + extractAndStripMeta | ~700-900 | none | Extends existing BackgroundTasksPill / BuiltinCommandLoader |
| P5 budget | ~400-500 | none | Populates the existing SandboxOptions.budget seam |
| P6 resume | ~600-800 | WorkflowStateRecord JSONL type | Reuses jsonl-utils |
| P7 ultracode + keyword trigger | ~150-250 | none | Settings-only |

Remaining total ≈ 3050-3950 LOC across 5 PRs (average ~610-790/PR vs the P1+P2 average of ~2100/PR, so reviews should be easier).
