feat(computer-use): zero-config built-in via open-computer-use MCP #4590
…s, correct version comment
…across permission transitions
Regenerated schemas.ts from upstream open-computer-use@latest via scripts/sync-computer-use-schemas.ts. Key contract fixes:
- element_index: type integer → string (upstream reads via optionalString)
- x/y/from_x/from_y/to_x/to_y: type integer → number (upstream uses optionalDouble)
- scroll: adds required direction enum + requires element_index (not pages)
- click: adds optional mouse_button string enum (left/right/middle)
- Descriptions updated to upstream verbatim text (no "REQUIRED:" prefix)
Rename coerceNumericStrings → coerceTypes and add Direction 2: when schema declares type: "string" and model sends a number, stringify it (e.g. element_index: 2 → "2"). This fixes the upstream runtime error where optionalString returns nil for numeric element_index. Direction 1 (string → number for integer/number fields) is preserved unchanged for x/y coordinate fields. Update tests: element_index coercion tests now reflect string schema type; add new "coerces integer element_index to string" test cases.
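The two coercion directions described in this commit can be sketched as follows. This is only an illustration under assumptions: the `PrimitiveSchema` and `ObjectSchema` shapes and the `CLEAN_NUMERIC` pattern are hypothetical simplifications, and the real `coerceTypes` in tool.ts may differ in signature and edge-case handling.

```typescript
// Hypothetical minimal schema shapes; the real tool schemas are richer.
type PrimitiveSchema = { type: 'string' | 'integer' | 'number' };
type ObjectSchema = { properties: Record<string, PrimitiveSchema> };

// Accepts clean decimal literals only ("12", "-3.5"); rejects "", "1e3", "12px".
const CLEAN_NUMERIC = /^-?\d+(\.\d+)?$/;

function coerceTypes(
  params: Record<string, unknown>,
  schema: ObjectSchema,
): Record<string, unknown> {
  const out: Record<string, unknown> = { ...params };
  for (const [key, prop] of Object.entries(schema.properties)) {
    const value = out[key];
    if (
      prop.type === 'string' &&
      typeof value === 'number' &&
      Number.isFinite(value)
    ) {
      // Direction 2: schema declares a string, model sent a number.
      out[key] = String(value);
    } else if (
      (prop.type === 'integer' || prop.type === 'number') &&
      typeof value === 'string' &&
      CLEAN_NUMERIC.test(value)
    ) {
      // Direction 1: schema declares a number, model sent a numeric string.
      const parsed = Number(value);
      // For integer fields, only coerce when the parsed value is an integer.
      if (prop.type === 'number' || Number.isInteger(parsed)) out[key] = parsed;
    }
  }
  return out;
}
```

Values that do not match either clean transition are passed through untouched, so schema validation can still reject them.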
📋 Review Summary
This PR adds a zero-config Computer Use built-in capability to qwen-code by integrating the upstream
🔍 General Feedback
🎯 Specific Feedback
🟡 High
🟢 Medium
🔵 Low
✅ Highlights
Code Coverage Summary
CLI Package - Full Text Report
Core Package - Full Text Report
For detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run.
DragonnZhang left a comment
Thanks for the detailed implementation. The overall architecture looks reasonable to me: keeping qwen-code as a thin deferred-tool facade over the upstream open-computer-use MCP server is much simpler than reimplementing desktop automation locally. I left a few inline comments on issues I think should be fixed before this is ready.
@DragonnZhang Thanks for the careful review — all three were real and fixed in three separate commits:
All 82 tests in |
… on Finder
The previous probe called get_app_state on Finder, which has the side effect of activating the target app via upstream's unhide / open -b / AXRaise logic. Result: Finder popped to the foreground once per fresh session even when the user's task had nothing to do with it.
The doctor CLI reads TCC + runtime preflight and prints a summary to stdout, exiting silently when permissions are granted. When any permission is missing, doctor launches the onboarding window via LaunchServices (which dedups so repeated invocations focus the existing window). We parse the stdout summary and rely on doctor's own window-launching for the UX trigger; no separate spawnDoctor call is needed.
Side effect for steady-state sessions (permissions already granted): ZERO Finder activation. The probe spawns npx -y doctor once per fresh client start (~200-500ms), and that's it.
Also bumped pollIntervalMs default from 2s to 5s to amortize the npx-spawn overhead during the rare permission-grant flow.
Local Verification Report
Branch:
Focused Tests
Total: 10 files, 326 tests passed, 0 failed
Build
Code Review Notes
Verdict
PASS: All focused tests pass (326/326), build and typecheck clean. Ready to merge.
const errorText =
  returnDisplay || `Tool '${this.upstreamName}' returned isError=true`;
return {
  llmContent: llmContent || errorText,
[Suggestion] When mcpResult.isError is true and llmContent is a Part[] containing image data (e.g., a screenshot from get_app_state that returned an error), the expression llmContent || errorText evaluates to the truthy Part[], so errorText is never included in what the model sees. The model receives a screenshot without knowing the tool returned an error, and may proceed with incorrect assumptions about the desktop state.
Consider prepending an error text part when isError is true and llmContent is an array:
- llmContent: llmContent || errorText,
+ if (mcpResult.isError) {
+   const errorText =
+     returnDisplay || `Tool '${this.upstreamName}' returned isError=true`;
+   let finalLlmContent = llmContent;
+   if (Array.isArray(finalLlmContent)) {
+     finalLlmContent = [{ text: `Error: ${errorText}` }, ...finalLlmContent];
+   } else {
+     finalLlmContent = finalLlmContent || errorText;
+   }
+   return {
+     llmContent: finalLlmContent,
+     returnDisplay: errorText,
+     error: { message: errorText },
+   };
+ }
— qwen3.7-max via Qwen Code /review
export function parseDoctorStdout(stdout: string): PermissionProbeResult {
  const accessibilityGranted = /accessibility\s*=\s*granted/i.test(stdout);
  const screenRecordingGranted = /screenrecording\s*=\s*granted/i.test(stdout);
  if (!accessibilityGranted) return 'accessibility';
[Suggestion] parseDoctorStdout returns 'accessibility' when stdout is empty or doesn't match either regex. This means an unparseable doctor output (e.g., upstream format change, locale difference) sends users into a 10-minute permission poll loop with no diagnostic about what went wrong. The raw stdout is discarded without logging.
Two improvements:
- Log the raw stdout at debug level before parsing, so format changes are diagnosable.
- Consider returning 'other' when neither permission keyword is found in stdout: this would skip the poll loop (per the probe === 'other' early return at line 213) and let the actual tool call surface any real permission error, which is the same fallback behavior already used for doctor spawn failures.
— qwen3.7-max via Qwen Code /review
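A minimal sketch of the second suggestion, assuming the doctor summary uses `key = value` lines as the existing regexes imply. The `PermissionProbeResult` union below is an assumption pieced together from values mentioned in this thread ('ok', 'accessibility', 'other', and the "screenRecording" wording in the progress message); the real type may differ, and debug logging of the raw stdout is left out.

```typescript
// Assumed union; only 'accessibility', 'ok' and 'other' are confirmed by the thread.
type PermissionProbeResult = 'ok' | 'accessibility' | 'screenRecording' | 'other';

function parseDoctorStdout(stdout: string): PermissionProbeResult {
  // Detect whether doctor mentioned either permission at all, so an
  // unrecognised output format is distinguishable from "not granted".
  const mentionsAccessibility = /accessibility\s*=/i.test(stdout);
  const mentionsScreenRecording = /screenrecording\s*=/i.test(stdout);
  if (!mentionsAccessibility && !mentionsScreenRecording) return 'other';

  const accessibilityGranted = /accessibility\s*=\s*granted/i.test(stdout);
  const screenRecordingGranted = /screenrecording\s*=\s*granted/i.test(stdout);
  if (!accessibilityGranted) return 'accessibility';
  if (!screenRecordingGranted) return 'screenRecording';
  return 'ok';
}
```

With this shape, empty or reformatted doctor output falls through to 'other' and skips the poll loop instead of being read as a missing Accessibility grant.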
  return parseDoctorStdout(stdout);
} catch {
  // Spawn failed (npx missing, network down on first run, timeout, etc.)
  // OR doctor exited non-zero. Skip probe; the next real tool call
[Suggestion] The catch block swallows all errors silently — spawn failure, timeout, non-zero exit — with no logging. If npx is missing from $PATH (common in GUI-launched terminals on macOS), every doctor probe silently fails and returns 'other'. The bootstrap proceeds as if permissions are OK, and the first real tool call fails with a confusing error.
Adding a debug/warn log before returning 'other' would make this failure mode diagnosable:
- // OR doctor exited non-zero. Skip probe; the next real tool call
+ } catch (err) {
+   // Spawn failed (npx missing, network down on first run, timeout, etc.)
+   // OR doctor exited non-zero. Skip probe; the next real tool call
+   // will surface any permission error via upstream's normal error path.
+   const message = err instanceof Error ? err.message : String(err);
+   // eslint-disable-next-line no-console
+   console.debug?.(`[computer-use] doctor probe failed: ${message}`);
+   return 'other';
— qwen3.7-max via Qwen Code /review
  await writeFile(path, JSON.stringify(state, null, 2), 'utf8');
}

/**
[Suggestion] saveInstallState uses plain writeFile while the codebase has a well-tested atomicWriteJSON utility (packages/core/src/utils/atomicFileWrite.ts) that performs write-to-temp + atomic rename with EPERM retry and fsync. Every other JSON state file in the codebase uses atomicWriteJSON.
If the process crashes mid-write, installed.json can be left as partial/corrupt JSON. loadInstallState handles this gracefully (returns undefined), but the user gets an unnecessary re-prompt. Using the atomic write utility would be consistent with the rest of the codebase:
- /**
+ import { atomicWriteJSON } from '../../utils/atomicFileWrite.js';
+ export async function saveInstallState(
+   home: string = homedir(),
+   state: InstallState,
+ ): Promise<void> {
+   const path = installStatePathFor(home);
+   await mkdir(dirname(path), { recursive: true });
+   await atomicWriteJSON(path, state);
+ }
— qwen3.7-max via Qwen Code /review
command: 'npx',
args: ['-y', this.packageSpec, 'mcp'],
// Inherit env so HTTPS_PROXY etc. flow through to npx
env: { ...process.env } as Record<string, string>,
[Suggestion] The full process.env (including DASHSCOPE_API_KEY, ANTHROPIC_API_KEY, OPENAI_API_KEY, QWEN_SERVER_TOKEN, etc.) is passed to the third-party open-computer-use subprocess. While the comment justifies this for HTTPS_PROXY/PATH, the trust model here differs from user-configured MCP servers: this binary is auto-installed on first tool invocation with a single confirmation prompt.
Consider building a sanitized env that strips known secret-bearing patterns (*_API_KEY, *_SECRET, *_TOKEN, *_PASSWORD) while keeping essential vars (PATH, HOME, HTTPS_PROXY, LANG, TMPDIR, NODE_*):
// Prefix families (LC_*, NODE_*) need a wildcard to match under the ^ and $ anchors.
const SAFE_ENV_PATTERNS = /^(PATH|HOME|USER|LANG|LC_\w+|HTTPS?_PROXY|NO_PROXY|TMPDIR|NODE_\w+|TERM)$/i;
const childEnv = Object.fromEntries(
  Object.entries(process.env).filter(([k, v]) => v !== undefined && SAFE_ENV_PATTERNS.test(k))
);
The same applies to bootstrap.ts:125 where process.env is passed to the doctor probe.
— qwen3.7-max via Qwen Code /review
let mcpResult: CallToolResult;
try {
  mcpResult = await client.callTool(this.upstreamName, this.params);
} catch (err) {
[Suggestion] This catch (err) block — which fires when the transport is genuinely broken after a failed reconnect — has no test coverage. All tool.test.ts fakes return a well-formed CallToolResult; none make callTool throw. The error-message format (Computer Use tool 'X' failed: ...) that users see when the upstream binary crashes mid-session is unverified.
Consider adding a test where the fake client's callTool rejects:
it('returns an error result when client.callTool throws', async () => {
  const fake = makeFakeClient(async () => {
    throw new Error('transport exploded');
  });
  ComputerUseClient.setSharedForTest(fake);
  // ... build tool, invoke execute(), assert error format
});
Similarly, the ctx.signal.aborted check in the permission poll loop (bootstrap.ts:234) is never triggered in tests. No test calls .abort() during the poll. The abort-during-permission-wait path (user hits Ctrl+C while the onboarding window is open) is untested.
— qwen3.7-max via Qwen Code /review
}
await new Promise((resolve) => setTimeout(resolve, pollIntervalMs));
const next = await deps.probePermissions(deps.packageSpec);
if (next === 'ok' || next === 'other') return;
[Suggestion] This setTimeout is not abort-aware. When the user cancels (Ctrl+C fires ctx.signal.abort), the loop waits up to pollIntervalMs (default 5s) before noticing. Making the sleep abort-responsive would improve UX:
- if (next === 'ok' || next === 'other') return;
+ await new Promise<void>((resolve) => {
+   if (ctx.signal.aborted) return resolve();
+   const timer = setTimeout(resolve, pollIntervalMs);
+   ctx.signal.addEventListener('abort', () => { clearTimeout(timer); resolve(); }, { once: true });
+ });
— qwen3.7-max via Qwen Code /review
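One possible shape for an abort-responsive sleep, extracted into a standalone helper. The name `abortableSleep` is hypothetical and not part of the PR. Compared with the inline suggestion above, this sketch also detaches the abort listener when the timer fires normally, so repeated poll iterations do not pile up listeners on a long-lived signal.

```typescript
// Hypothetical helper; resolves after `ms` or as soon as `signal` aborts.
function abortableSleep(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve) => {
    if (signal.aborted) return resolve();
    const onAbort = () => {
      clearTimeout(timer);
      resolve();
    };
    const timer = setTimeout(() => {
      // Normal wake-up: drop the listener so it does not outlive this sleep.
      signal.removeEventListener('abort', onAbort);
      resolve();
    }, ms);
    signal.addEventListener('abort', onAbort, { once: true });
  });
}
```

In the poll loop this would replace the plain `setTimeout` promise, for example `await abortableSleep(pollIntervalMs, ctx.signal)`, followed by the existing `ctx.signal.aborted` check.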
// it will get a normal upstream error ("element_index out of range")
// and naturally re-snapshot.
await this.stop();
await this.start();
[Suggestion] The reconnect path calls this.start() without passing onProgress. During the initial doStart(), a 3-second download hint timer fires progress messages. But during reconnect, if the npx cache has been evicted (system cleanup, CI environments), the reconnect can take ~60s with zero user feedback.
Consider threading the instance-level onProgress through:
- await this.start();
+ await this.stop();
+ await this.start(this.onProgress);
— qwen3.7-max via Qwen Code /review
// packages/core/src/tools/computer-use/constants.ts. Duplicated as a
// literal here because importing TypeScript from `scripts/` into the
// package tree adds tooling complexity for a single-string lookup.
const DEFAULT_PINNED_VERSION = '0.1.51';
[Suggestion] DEFAULT_PINNED_VERSION = '0.1.51' duplicates the pin, and the script never imports PINNED_OPEN_COMPUTER_USE_VERSION from constants.ts. This contradicts both this file's own header ("The default is the currently-pinned version from constants.ts … Running with no args verifies the current pin is still in sync") and the bump procedure documented in constants.ts ("Run … it reads this constant by default") — neither is true: the script uses its own literal and unconditionally overwrites (it never verifies). Following the documented bump steps (edit constants.ts, run with no args) would regenerate schemas.ts against the stale 0.1.51 — the exact schema-drift the pin exists to prevent (and which already bit once, commit 8bb0ee189). The script already runs under tsx and imports the MCP SDK, so importing the constant is feasible; otherwise add a test asserting the two literals match.
— claude-opus-4-8 via Claude Code /qreview
 * If the retry also fails, the error is re-thrown without further
 * reconnect attempts.
 */
async callTool(
[Suggestion] callTool invokes the SDK client.callTool({ name, arguments }) with no RequestOptions — neither the AbortSignal nor a timeout. In tool.ts execute() the abort signal is threaded into runBootstrap (line 139) but not into this call (line 143), and start()/doStart() take no signal either. So once a mutating desktop action (type_text/press_key/click/drag/set_value) is dispatched it can't be cancelled with Ctrl+C/ESC, and a slow first-run npx download/connect is uninterruptible (bounded only by the SDK's implicit request timeout). Every other MCP tool honors abort — mcp-tool.ts passes { signal, timeout } to its callTool. Suggest threading signal (and an explicit timeout) from execute() → callTool() → the SDK call.
— claude-opus-4-8 via Claude Code /qreview
 * Shared singleton instance, created with default options on first
 * access. Tests can replace it via `setSharedForTest()`.
 */
static shared(): ComputerUseClient {
[Suggestion] The shared() singleton spawns the npx … open-computer-use mcp server, but nothing ever calls .stop() on it at shutdown. It is a static instance unknown to ToolRegistry/McpClientManager, so Config.shutdown() → toolRegistry.stop() doesn't reach it, and there is no registerCleanup/SIGINT/SIGTERM hook for it (the only production reference to ComputerUseClient.shared() is tool.ts:133). Cleanup then relies entirely on the upstream server exiting on stdin-EOF; if it doesn't, the npx→node→server chain orphans — potentially leaving a process that holds macOS Accessibility/Screen-Recording access alive after qwen-code exits. Suggest registering ComputerUseClient.shared().stop() (guarded by isStarted()) in the shutdown/cleanup path.
— claude-opus-4-8 via Claude Code /qreview
// NOTE: mcp-tool.ts has an analogous private transformation (transformMcpContentToParts /
// transformImageAudioBlock); those helpers are not exported so we replicate
// the pattern here. A future PR should extract a shared utility.
const llmContent = buildLlmContent(mcpResult.content, this.upstreamName);
[Suggestion] Unlike the standard MCP path (mcp-tool.ts runs results through truncateTextParts → truncateToolOutput, honoring truncateToolOutputThreshold), buildLlmContent forwards upstream text content verbatim with no length cap, and the scheduler has no truncation backstop (truncation is per-tool; computer-use never calls it). get_app_state returns an accessibility tree produced by the third-party open-computer-use process — the trust boundary this PR introduces — so a large or hostile a11y tree (or a buggy/compromised upstream) can flood the model context in a single result with none of the size protection every other tool result gets. Suggest running text parts through truncateToolOutput before returning, mirroring mcp-tool.ts (the comment at lines 155-157 already notes a shared util should be extracted).
— claude-opus-4-8 via Claude Code /qreview
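To illustrate the kind of size cap being asked for, here is a rough sketch that truncates only text parts and leaves image parts alone. It is not the codebase's `truncateToolOutput`; the `Part` shape, the `capTextParts` name, and the per-part character limit are all assumptions made for the example.

```typescript
// Hypothetical Part shape covering the text and inlineData cases discussed here.
type Part =
  | { text: string }
  | { inlineData: { mimeType: string; data: string } };

// Caps each text part at `maxChars` characters and appends a marker saying
// how much was dropped; non-text parts (e.g. screenshots) pass through as-is.
function capTextParts(parts: Part[], maxChars: number): Part[] {
  return parts.map((part): Part => {
    if (!('text' in part)) return part;
    if (part.text.length <= maxChars) return part;
    const omitted = part.text.length - maxChars;
    return {
      text: `${part.text.slice(0, maxChars)}\n[truncated ${omitted} characters]`,
    };
  });
}
```

A real fix would more likely reuse the existing truncation path and threshold setting rather than introduce a second limit.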
  { kind: 'unknown_permission', regex: /open-computer-use\s+doctor/i },
];

export function detectPermissionError(
[Suggestion] detectPermissionError (and its PATTERNS) appears dead in production: bootstrap.ts:34 imports only the PermissionErrorKind type, the real flow classifies via parseDoctorStdout/probePermissions, and tool.ts execute() never classifies the isError result — grep finds no production caller. Yet permission-detector.test.ts has 5 tests for it, so the suite reports this logic as covered while it guards nothing live (and the regexes silently rot if upstream changes its error strings). Either wire it into execute()'s isError branch (to give the model an actionable "re-grant Screen Recording" hint — the mid-session-revocation path noted at bootstrap.ts:198-207 currently has no recovery), or delete the function + PATTERNS + its tests.
— claude-opus-4-8 via Claude Code /qreview
 * spawning. `behaviors` is a queue: the i-th entry is used on the i-th
 * underlying tool invocation.
 */
class ReconnectTestClient extends ComputerUseClient {
[Suggestion] The "callTool reconnect path" tests don't exercise the production method: ReconnectTestClient overrides callTool (line 104) with a hand-copied reimplementation of the stop→start→retry-once logic, and every test drives that copy — so the real ComputerUseClient.callTool (client.ts:147-178) never runs (0% coverage on lines 148-178 when this file runs alone). A regression in the real method — dropping the single-retry guard (→ infinite reconnect), reordering stop/start, or the already-noted start()-without-onProgress at client.ts:171 — would leave these tests green. This is the safety-critical path (silent reconnect after macOS kills the binary on a TCC grant). Suggest injecting a fake inner this.client whose callTool throws Connection closed once then succeeds (with start/stop stubbed) and asserting the real callTool retries, rather than overriding the method under test.
— claude-opus-4-8 via Claude Code /qreview
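The testing approach suggested above (fake the inner client, exercise the real retry method) can be sketched with simplified stand-ins. Everything here is hypothetical: `InnerClient`, `ReconnectingClient`, and `makeFlakyInner` only mimic the stop, start, retry-once behaviour described in this thread and are not the actual ComputerUseClient or SDK types.

```typescript
// Hypothetical stand-in for the SDK client held by the wrapper.
interface InnerClient {
  callTool(name: string, args: Record<string, unknown>): Promise<string>;
}

// Simplified wrapper with the retry-once-on-disconnect logic under test.
class ReconnectingClient {
  starts = 0;
  stops = 0;
  private readonly inner: InnerClient;

  constructor(inner: InnerClient) {
    this.inner = inner;
  }

  // start/stop are stubbed as counters; a real client would spawn and kill the server.
  async start(): Promise<void> {
    this.starts += 1;
  }
  async stop(): Promise<void> {
    this.stops += 1;
  }

  async callTool(name: string, args: Record<string, unknown>): Promise<string> {
    try {
      return await this.inner.callTool(name, args);
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      if (!/Connection closed|Not connected/.test(message)) throw err;
      await this.stop();
      await this.start();
      // Single retry: a second failure propagates to the caller.
      return this.inner.callTool(name, args);
    }
  }
}

// Fake inner client that fails with "Connection closed" for the first N calls.
function makeFlakyInner(failures: number): InnerClient & { calls: number } {
  const inner = {
    calls: 0,
    async callTool(name: string): Promise<string> {
      inner.calls += 1;
      if (inner.calls <= failures) throw new Error('Connection closed');
      return `ok:${name}`;
    },
  };
  return inner;
}
```

Because the fake sits below `callTool` rather than replacing it, a regression such as dropping the single-retry guard would change the observed call count and fail the assertion.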
// Computer Use tools — built-in but backed by an upstream MCP server.
// All deferred; revealed only when the user-initiated request triggers
// a computer-use action. See packages/core/src/tools/computer-use/.
COMPUTER_USE_LIST_APPS: 'computer_use__list_apps',
[Suggestion] These 9 COMPUTER_USE_* entries (and the 9 in ToolDisplayNames) are unreferenced — grep finds zero consumers. Registration in computer-use/index.ts derives names independently from COMPUTER_USE_TOOL_NAMES in schemas.ts via the computer_use__${upstreamName} template, so this hand-maintained block is a second, unsynchronized source of truth: when the sync script adds/renames a tool, schemas.ts updates but this block silently rots with no compile error or test to catch it. Separately, the ToolDisplayNames values here are the snake-case canonical names instead of CamelCase (every other entry, e.g. CronCreate, is CamelCase), which breaks convention and makes tool-utils.ts:40 generate the nonsense alias computer_use__list_appsTool. Suggest deleting these unused constants, or adding a test asserting they equal COMPUTER_USE_TOOL_NAMES.map(n => 'computer_use__' + n).
— claude-opus-4-8 via Claude Code /qreview
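If the constants are kept, one way to avoid a second source of truth is to derive the qualified names from the upstream list instead of hand-maintaining them. The sketch below assumes the nine upstream names given in the PR description; the local `COMPUTER_USE_TOOL_NAMES` array stands in for the one in schemas.ts, and `qualify` is a hypothetical helper.

```typescript
// Stand-in for COMPUTER_USE_TOOL_NAMES in schemas.ts (names taken from the PR description).
const COMPUTER_USE_TOOL_NAMES = [
  'list_apps',
  'get_app_state',
  'click',
  'perform_secondary_action',
  'scroll',
  'drag',
  'type_text',
  'press_key',
  'set_value',
] as const;

type UpstreamName = (typeof COMPUTER_USE_TOOL_NAMES)[number];
type QualifiedName = `computer_use__${UpstreamName}`;

// Hypothetical helper applying the computer_use__<action> naming template.
const qualify = (name: UpstreamName): QualifiedName => `computer_use__${name}`;

// Derived list: adding or renaming an upstream tool updates this automatically.
const COMPUTER_USE_QUALIFIED_NAMES: readonly QualifiedName[] =
  COMPUTER_USE_TOOL_NAMES.map(qualify);
```

The template-literal type means a stale hand-written name such as a renamed tool would fail to type-check against `QualifiedName`.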
What this PR does
Adds a zero-config Computer Use built-in capability to qwen-code. Nine tools (computer_use__list_apps, computer_use__get_app_state, computer_use__click, computer_use__perform_secondary_action, computer_use__scroll, computer_use__drag, computer_use__type_text, computer_use__press_key, computer_use__set_value) are registered as deferred built-ins. On first invocation, qwen-code transparently runs npx -y open-computer-use mcp (after the standard tool-permission dialog), guides the user through macOS Accessibility / Screen Recording permissions via the upstream doctor onboarding window, and resumes the originating tool call. Subsequent calls go straight through. Controlled by a single new setting tools.computerUse.enabled (default true).
There is no MCP terminology in user-facing strings: the tools are named computer_use__<action>, the confirmation dialog talks about "Computer Use" only, and the install footprint stays under ~/.qwen/computer-use/. The upstream binary itself is fetched and run through npx (default spec open-computer-use@latest, overridable via QWEN_COMPUTER_USE_PACKAGE); we don't bundle it.
The implementation also adds a small system-prompt improvement to the shared "Deferred Tools" section (packages/core/src/core/prompts.ts): it now tells the model that calling a deferred tool without first loading its schema via tool_search will likely fail, and shows the multi-select syntax select:tool_a,tool_b,tool_c so it can batch-load related schemas in one call. This helps every deferred tool family in qwen-code (cron, monitor, MCP, computer-use), not just this one.
Why it's needed
Today, qwen-code users who want desktop-app automation have to install a third-party extension (e.g. the existing computer-use-hybrid example), follow the upstream open-computer-use README to wire up MCP config, run doctor to grant macOS permissions, then restart qwen-code. That is several manual steps that most users will not do; desktop automation effectively requires reading docs. This PR turns it into a single "yes" click: the model invokes a computer_use tool, qwen-code prompts the user once, and everything else (binary download via npx, macOS permission guide, server lifecycle) is handled automatically.
We chose to build on the upstream Swift implementation rather than reinventing it because: (a) the macOS Accessibility / Screen Recording / .app bundle / TCC integration is non-trivial and upstream solves it correctly, (b) the same upstream binary covers Linux and Windows in addition to macOS, and (c) consuming upstream as an npx package keeps our integration small and lets users update the underlying binary by bumping a single pinned spec.
Reviewer Test Plan
How to verify
Unit coverage: 74 tests across 7 files under packages/core/src/tools/computer-use/. Run cd packages/core && npx vitest run src/tools/computer-use/ (all green). Tests cover schemas, the MCP stdio client (singleton lifecycle, auto-reconnect on Connection closed / Not connected), the parameterized tool wrapper, bidirectional type coercion (string ↔ number based on schema), permission error detection, install-state persistence, and the full bootstrap state machine (install approval gate → spawn → macOS permission probe → doctor + poll → re-spawn on permission kind change).
Manual smoke (macOS only, full from-zero path):
- rm -rf ~/.qwen/computer-use ~/.npm/_npx/<hash-for-open-computer-use>. Revoke Open Computer Use.app from System Settings → Privacy & Security → Accessibility AND Screen Recording.
- […] get_app_state + click.)
- […] ~/.qwen/computer-use/installed.json gets created, the model's tool call streams "Starting Computer Use..." then "Downloading Computer Use binary (~60s)..." while npx fetches the upstream binary.
- […] Open Computer Use onboarding window opens. Grant Accessibility. The tool-call stream updates: "Now waiting for screenRecording permission. Re-opening the onboarding window...". Grant Screen Recording; macOS asks to restart Open Computer Use.app, click Restart. Our transport sees Connection closed, transparently reconnects, and retries the tool call.
- […] inlineData Parts) and click via element_index returned from get_app_state.
Behavioral sanity checks: try a get_app_state and verify the screenshot reaches the model (model can describe what's on screen). Try a click with element_index; it should succeed. Try invoking a deferred tool whose schema isn't loaded; the new prompt guidance should nudge the model to batch-load via select:.
Evidence (Before & After)
Before: no built-in Computer Use; users must install an extension and read upstream README. After: model invokes computer_use__* → single confirm → automatic install + permission flow → working. Six bugs were caught and fixed inline during real-machine smoke testing (see commit log: image content drop, qwen3.6 string-typed-integer coercion, schema/upstream contract mismatch, Screen Recording probe via image-presence heuristic, doctor re-spawn on permission-kind transition, transport auto-reconnect after macOS app restart). The implementation plan committed at docs/superpowers/plans/2026-05-28-computer-use-built-in.md documents the full journey.
Tested on
macOS verified end-to-end on Apple Silicon. Windows and Linux paths exist in the upstream binary (Go runtimes targeting UI Automation and AT-SPI2 respectively) but I did not run them locally; the wrapper and bootstrap state machine treat platforms symmetrically except for the macOS-only Screen Recording probe and doctor flow.
Environment (optional)
Local: npm run build, then run qwen-code interactively. Unit tests: cd packages/core && npx vitest run src/tools/computer-use/. Upstream binary fetched via npx -y open-computer-use@latest mcp at runtime.
Risk & Scope
- […] tools.computerUse.enabled: false for opt-out. We chose default-on because the feature is built-in and discovery is the point.
- […] packages/core/src/tools/computer-use/schemas.ts mirror the upstream open-computer-use@latest MCP tools/list output. They are regenerated by scripts/sync-computer-use-schemas.ts <packageSpec>, which should be run before each qwen-code release that bumps the upstream pin. Forgetting to sync after an upstream schema change would let model calls pass our local validation but fail at the upstream layer (this already happened once during testing; see commit 8bb0ee189).
- […] coerceTypes in tool.ts) compensates for qwen3.6 occasionally passing JSON values with the wrong primitive type (string-typed integers or vice versa). The coercion only fires on clean numeric-string ↔ number transitions; garbage values still fail Ajv validation cleanly.
- […] Connection closed / Not connected errors silently drops in-flight element_index state from the prior upstream process. The model is instructed (via the upstream schema description) to call get_app_state before any element-targeted action, so a stale index produces a normal upstream error which the model recovers from naturally. No persistent state corruption.
- […] QWEN_COMPUTER_USE_AUTO_APPROVE=1 exists for these but is not smoke-tested). Idle timeout for the spawned MCP server (resource savings) is deferred. Telemetry breakdown of bootstrap failures (network vs gatekeeper vs permission timeout) is deferred.
- […] true, the underlying binary download is interactive and bounded (one-time, after explicit user approval), no existing flags are renamed.
Linked Issues
N/A: this PR was scoped via design conversation, not a tracked issue.