feat(cli): headless / non-interactive runaway-protection guardrails (#4103)#4502
Conversation
Adds two opt-in run-level budgets and a startup safety warning for non-interactive / CI / SDK runs. All defaults preserve existing behavior; the budgets only fire when the user explicitly sets a limit. Phase 1 — surface unsafe configs and fix doc drift - New `--yolo`-without-sandbox stderr warning at startup of every non-interactive run, emitted by `getHeadlessYoloSafetyWarning` in `packages/cli/src/utils/headlessSafetyWarnings.ts`. Suppressible via `QWEN_CODE_SUPPRESS_YOLO_WARNING=1` (strict `1`/`true` match so `=0` / `=false` don't silence it). Strict env match also applied to the `SANDBOX` check so values like `SANDBOX=0` don't accidentally bypass the warning. - Gated on `!config.isInteractive()` at the gemini.tsx call site so TUI users aren't nagged. - `docs/users/configuration/settings.md`: corrected `model.skipLoopDetection` default (`true`, not `false`) and reworded the `--yolo`/sandbox section — `--yolo` does NOT auto-enable a sandbox; sandboxing must still be opted into explicitly. Phase 2 — run-level budgets with distinct exit code - `--max-wall-time` / `model.maxWallTimeSeconds`: wall-clock duration for the whole run. Flag accepts `90` (s), `30s`, `5m`, `1h`, `500ms`. Settings is plain seconds. - `--max-tool-calls` / `model.maxToolCalls`: cumulative tool executions (success + failure). Ticked BEFORE each `executeToolCall` so a budget of N caps the run at exactly N executions. - New `FatalBudgetExceededError` (exit code 55), distinct from `FatalTurnLimitedError` (53) and `FatalCancellationError` (130) so CI scripts can branch on the reason. JSON output mirrors the `handleMaxTurnsExceededError` / `handleCancellationError` envelope convention. - Enforced via `RunBudgetEnforcer` in `packages/cli/src/utils/runBudget.ts`, wired to the same `AbortController` as SIGINT so existing cancellation plumbing carries the abort. A `routeAbort` helper distinguishes budget vs. SIGINT at the abort-check sites and at the outer catch. Critical correctness fixes (informed by the #4105 review pass) - Drain-loop fall-through: the inner drain-item `for await` previously exited via `finalizeAssistantMessage(); return;`, swallowing a budget abort that fires during the last drain item and surfacing exit code 0. Now routes through `routeAbort` so exit 55 is preserved. - Settings symmetry: `maxWallTimeSeconds: 0` in settings.json is now rejected (same as `--max-wall-time 0`); the enforcer treats `<=0` as "no timer" so silent disable would be a foot-gun. `validateMaxWallTimeSetting` also rejects `Infinity` / `NaN`. - `setTimeout` overflow: both parser paths reject durations above `Math.floor((2^31 - 1) / 1000)s` (~24.8 days). Node clamps oversized delays to 1ms and fires the timer almost immediately; fail loud at startup instead. - First-fence-wins + SIGINT race: `markExceeded` no-ops if the controller was already aborted by a third party, so a budget tick arriving after user SIGINT doesn't misattribute the abort to exit code 55. - Outer catch re-routes mid-stream `AbortError`s through the budget handler so users see "Run aborted: …" instead of raw "AbortError". Tests - `runBudget.test.ts` (32 tests): parser happy / reject paths, setting validator, post-increment off-by-one, `maxToolCalls=0` meaning "disallowed", `-1` meaning unlimited, wall-clock under fake timers, `stop()` cancels pending timer, idempotent `start()`, first-fence-wins, SIGINT-race protection. - `headlessSafetyWarnings.test.ts` (7 tests): YOLO + sandbox / env matrix; strict-truthy `SANDBOX` check; suppression env. - Pre-existing suites: `nonInteractiveCli.test.ts` (46), `gemini.test.tsx` (23), `config/config.test.ts` (220), `core/utils/errors.test.ts` (12), `core/config/config.test.ts` (172) all green after picking up the new config getters / CliArgs fields. Backward compatibility - All budgets default to `-1` (unlimited); existing CLI invocations behave identically. - New stderr warning only fires in the narrow YOLO-no-sandbox case, with an explicit suppress env. - New exit code 55 is purely additive; no existing exit codes change meaning.
📋 Review SummaryThis PR implements run-level budget enforcement for headless/unattended Qwen Code sessions, addressing issue #4103. It introduces two budget types ( 🔍 General Feedback
🎯 Specific Feedback🟡 High
🟢 Medium
🔵 Low
✅ Highlights
|
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
Round-1 audit (3 angles × line-by-line + removed-behavior + cross-file) plus an open-ended design pass surfaced eight correctness issues. This commit lands all of them; the larger ACP / serve-mode structural items are documented for follow-up. Correctness fixes - headlessSafetyWarnings: `SANDBOX` env check reverted to plain truthy. The sandbox transport sets `SANDBOX` to `sandbox-exec` (macOS seatbelt) or the container name (`qwen-code-sandbox`), neither of which matches `isTruthyEnv`. The PR's strict-`1`/`true` check was emitting the "no sandbox" warning INSIDE real sandboxes. Match the rest of the codebase (sandboxConfig.ts, gemini.tsx, Footer.tsx, prompts.ts, …) which all treat any non-empty value as "sandboxed". - nonInteractiveCli main-loop abort: add `finalizeAssistantMessage()` before `routeAbort()`. The drain-item loop already had it (PR #4502 Critical bug #1); the main loop was asymmetric — stream-json consumers would see an unterminated `message_start` when a budget / SIGINT abort landed mid-stream. - nonInteractiveCli drain-loop `routeAbort`: also flush `flushQueuedNotificationsToSdk(localQueue)` and `finalizeOneShotMonitors()` before exiting. The old `return`-and- fall-through path went through the outer holdback loop, which did this flushing; switching to `routeAbort()` skipped it, so `task_started` envelopes lost their paired `task_notification`. - nonInteractiveCli catch handler: emit `adapter.emitResult({...})` BEFORE `handleBudgetExceededError`, with the budget message as `errorMessage` when budget tripped. Previously the budget handler `process.exit(55)`ed before the adapter could emit a terminal `result` envelope, so STREAM_JSON consumers never saw a stream terminator on budget exits and hung waiting for one. - runBudget: new `validateMaxToolCalls` mirrors `validateMaxWallTimeSetting`. yargs coerces non-numeric flag values (`--max-tool-calls abc`) to `NaN`, and the enforcer's `>= 0` gate treats `NaN` and negatives as "no limit", silently disabling the budget. Reject `NaN`, `Infinity`, fractional, and negative-other- than-`-1` values at both flag and settings layers. `0` remains legal (`first tick aborts`), unlike wall-time where 0 is fatal. - runBudget: new `MIN_WALL_TIME_SECONDS = 1` floor. Previously `--max-wall-time 500ms` parsed cleanly and aborted on the next event-loop tick before any model round-trip — almost certainly a typo (`5m`?) and not a useful guardrail at any rate. - nonInteractiveCli `tickToolCall`: exempt `ToolNames.STRUCTURED_OUTPUT`. Under `--json-schema` this is the terminal "I'm done" contract tool, not real work. Without the exemption a budget-edge completion is aborted as a false positive (model used N tools then emitted structured_output as call N+1 → exit 55 instead of success). - commands/serve.ts: emit the YOLO-no-sandbox warning at daemon startup when settings.json statically configures `tools.approvalMode: 'yolo'` with no `tools.sandbox` / `SANDBOX` env. The daemon can't use `getHeadlessYoloSafetyWarning` (no Config yet — sessions get their own) so we re-derive the predicate from settings. Per-session ACP override is documented as out of scope. Documentation - `docs/users/features/headless.md`: new "Scope" subsection under Run-level budgets explaining (a) `--max-tool-calls` counts top-level dispatches only — subagent / `agent` tool inner calls are not counted, (b) `structured_output` is exempt, (c) stream-json input mode resets budgets per user message, (d) `qwen serve` / ACP sessions do not currently consult budgets from settings.json. Tests - `runBudget.test.ts` grows from 32 → 41 tests: `validateMaxToolCalls` (NaN / Infinity / negatives / fractional), `parseDurationSeconds` sub-second rejection, `validateMaxWallTimeSetting` sub-second rejection. - `headlessSafetyWarnings.test.ts`: replaced the "still warns when SANDBOX is 0/false/no" case (which encoded the strict-check bug) with positive coverage for the real sandbox-set values (`sandbox-exec`, `qwen-code-sandbox`). All previously-green suites still green: cli/nonInteractiveCli (46), cli/gemini.test (23), cli/config/config.test (220), core/utils/errors (12), core/config/config.test (172). 337 tests across the touched suites. Won't-fix (out of scope, documented or pre-existing) - Unpaired `tool_use` in stream-json when a tool is aborted mid-execution — pre-existing structural gap (SIGINT mid-tool has the same outcome); PR amplifies it but doesn't introduce it. - Narrow SIGINT-vs-budget-timer race — already mitigated by `markExceeded`'s `signal.aborted` check. - `tickToolCall` increments past abort (cosmetic; only affects the `observed` value in the error envelope for a pathological caller).
Qwen Code Review (DEEP)CI-safe profile adapted from bundled Correctness / SecurityNo correctness or security issues found. The implementation is well-reasoned across all flagged focus areas:
Needs VerificationNo concerns requiring code outside the supplied context. Test Coverage
Needs Verification
Validation EvidenceMISSING — No reviewer-facing validation evidence is present in the PR body or author comments. The PR includes a "Manual smoke (recommended for reviewer)" section with suggested commands, but no observed output, command transcripts, logs, or before/after evidence from the author demonstrating that: (a) exit code 55 is actually emitted, (b) the drain-loop fall-through fix works end-to-end, (c) the YOLO-no-sandbox warning appears and is suppressible, or (d)
Maintainability / Performance
Needs Verification
Undirected Audit
Needs Verification
Qwen Code |
Round-2 audit (after round-1 commit 40ae6dd) surfaced two NEW correctness issues introduced by the round-1 catch-handler restructure, plus a handful of polish items from a parallel design pass. Correctness fixes (new bugs from R1) - nonInteractiveCli catch handler: wrap `adapter.emitResult` in try/catch. R1 moved the emit BEFORE `handleBudgetExceededError` so STREAM_JSON consumers see a terminal envelope first. But emitResult eventually hits `stdout.write`, which throws on EPIPE / ERR_STREAM_WRITE_AFTER_END when a piped consumer closes early (`qwen -p ... | head -n 1` is the common CI case). Letting that throw bubble out skipped both `handleBudgetExceededError` and `handleError`, dropping the documented exit-code-55 contract precisely when stdout was in trouble. Best-effort emit and continue to the exit handler. - nonInteractiveCli `structured_output` exemption: also require `config.getJsonSchema?.() !== undefined`. Without that guard, an MCP server registering an unrelated tool literally named `structured_output` would silently bypass `--max-tool-calls`. Also documents (in `headless.md` "Scope") the related caveat that failed Ajv-validation retries skip the tick too, so a malformed-output retry loop is NOT bounded by `--max-tool-calls` — combine with `--max-session-turns` or `--max-wall-time`. Polish - runBudget `validateMaxToolCalls` upper bound: cap at 1_000_000. `1e10` (typo for `1e1`) would otherwise parse cleanly, pass the `>= 0` gate forever, and silently disable the budget — the exact foot-gun `MAX_WALL_TIME_SECONDS` was built to prevent. Symmetry. - runBudget `parseDurationSeconds` sub-second hint: only append the "did you mean Ns?" suggestion when the input actually contained `ms`. Bare `0.5` would otherwise produce a useless "did you mean 0.5s?" suggestion. - nonInteractiveCli `routeAbort`: the `throw 'unreachable'` is only hit if `handleBudgetExceededError` / `handleCancellationError` ever becomes resumable (e.g. mocked `process.exit` in a test). Carry the original exceeded.message into the thrown Error so the outer catch's `errorMessage` field stays actionable instead of degrading to a literal "unreachable" string. - commands/serve.ts: compare `approvalMode` against `ApprovalMode.YOLO` enum instead of the string literal `'yolo'`. If the enum value is ever renamed, the startup warning stays in sync with the helper at `headlessSafetyWarnings.ts` instead of silently going dead. Documentation - `headless.md` "Scope": clarify the `structured_output` exemption is unconditional (including failed validations); add explicit note that `--max-session-turns` does NOT exempt `structured_output`, so size to `N+1` for `N` real-work turns under `--json-schema`. - `headless.md` flag table: add `1.5h` to the accepted-forms hint for `--max-wall-time` (the parser already accepts fractional units). Tests - `runBudget.test.ts`: new coverage for the `validateMaxToolCalls` ceiling. Total 42 tests across `runBudget.test.ts` (was 41), all green. cli/nonInteractiveCli, gemini.test, config/config all unchanged and still green. Won't-fix (documented above or out of scope) - ACP per-session approval-mode escalation (mid-session flip to YOLO) doesn't print the warning — daemon-level wiring; out of scope for this PR. - 1s wall-time floor vs higher (5–10s) — debatable, keeping 1s with loud sub-second rejection; can raise later without semver impact. - Integration test for the full budget-trip → catch → emitResult → exit 55 path — requires a process-exit-mocking harness; tracked as follow-up.
Round-3 audit caught two stale doc surfaces that R1's 1-second wall-time floor (and R2's `1.5h` fractional-unit addition) didn't update: - `docs/users/features/headless.md` budget table: replace stale `500ms` example with `1.5h`, add explicit "minimum 1s — sub-second values are rejected as typos" note. - `docs/users/configuration/settings.md` `model.maxWallTimeSeconds` row: same fix. Also extend `model.maxToolCalls` row with the structured_output exemption note, the `0` semantic, and the 1,000,000 ceiling that R2 added. A user copying the documented `--max-wall-time 500ms` example from either surface would hit a startup error after R1. Known follow-up (not addressed in this commit) - No test exercises the R2 `isStructuredOutputExempt` predicate end-to-end. Adding one needs the same process-exit-mocking harness called out in the R2 commit as a separate follow-up.
Round-4 final-pass audit caught four schema/help-text/JSDoc surfaces
that drifted from the validators introduced in R1 (1s wall-time floor,
24-day ceiling) and R2 (1M tool-call ceiling, structured_output
exemption, `0` sentinel).
- `runBudget.ts` `parseDurationSeconds` JSDoc: replace stale claim
that `500ms` is accepted and "sub-second precision is preserved"
with the actual contract — `[MIN_WALL_TIME_SECONDS, MAX_WALL_TIME_SECONDS]`,
ms suffix only legal when value resolves to >= 1s. Adds `1.5h` to
the accepted-forms list.
- `settingsSchema.ts` `model.maxWallTimeSeconds` description: now
documents the 1s minimum and ~24-day ceiling.
- `settingsSchema.ts` `model.maxToolCalls` description: documents the
structured_output exemption, the `0` sentinel ("no tool calls
allowed"), and the 1,000,000 ceiling.
- `vscode-ide-companion/schemas/settings.schema.json`: mirrors both
schema descriptions above so the VS Code settings UI auto-completion
matches.
- `config.ts` yargs `--max-wall-time` description: documents the 1s
floor and the ~24-day max.
- `config.ts` yargs `--max-tool-calls` description: documents the
structured_output exemption, the `0` sentinel, and the 1M ceiling.
`qwen --help` is the most-read surface for these flags; matches the
prose docs in headless.md and settings.md.
No code changes — pure doc/help-text alignment.
Post-PR multi-round self-audit completeRan 5 rounds of open-ended audits (3 parallel angles per code round + open-ended design review). Convergence reached on R5 ( 4 fix commits added on top of the original
Notable correctness fixes
Documentation work
Test state
Known follow-ups (not addressed in this PR)
PR ready for re-review. |
|
Qwen Code automated PR review did not complete successfully. See the workflow logs for details: https://github.com/QwenLM/qwen-code/actions/runs/26394607203 This is an automated message; please retry by commenting |
|
Let me verify a few cross-file concerns before producing the review. Qwen Code Review (STANDARD)What this PR does (2–3 sentences synthesized from the diff and PR title/body): 🟡 P2 — Medium
🟢 P3 — Low
Cross-file notes
✅ Highlights
Validation Evidence
Qwen Code |
wenshao
left a comment
There was a problem hiding this comment.
No issues found. LGTM! ✅ — qwen3.7-max via Qwen Code /review
本地验证报告(maintainer review)把 PR head 直接 结论✅ 建议 MERGE。PR 自带的全部单测 + 真机 5 个 manual smoke 场景在本机全部复现通过,预算超限退出码、解析校验、YOLO 警告矩阵、settings/CLI 对称都符合 PR 描述。仅有一处 PR 描述与代码 comment 的细节不一致(详见下 §1)。 验证矩阵
真机 E2E(PR 提到的 5 个 manual smoke 全部跑通)
§1. PR 描述与代码的小不一致(提示性,不阻塞)PR 描述中 "What was fixed vs. #4105 review" 列出 "Plus several [Suggestion] items: strict-truthy 但
代码实际是 实测 代码评审要点
风险与遗留
验证环境:macOS Darwin 25.4.0,Node v22.17.0,npm 11.8.0。tmux 会话 |
Re-submits a scoped subset of #4105 — split per maintainer feedback (#4103 thread) into "Phase 1 (doc + warning)" plus the smallest budget surface that's still useful (
--max-wall-time+--max-tool-calls). Closes #4103 for everything except--max-api-calls/--max-tokens/--max-budget-usd, which are deferred to follow-ups.What's in
Phase 1 — surface unsafe configs, fix doc drift
--yolo(or--approval-mode=yolo) without a sandbox. Suppressible viaQWEN_CODE_SUPPRESS_YOLO_WARNING=1(strict1/truematch —=0/=falsedon't silence). Strict env match also applied to theSANDBOXcheck.!config.isInteractive()at the call site so TUI users aren't nagged.docs/users/configuration/settings.md:model.skipLoopDetectiondefault corrected totrue(wasfalse— schema & CLI fallback have beentruesince the LLM detector was removed).--yolo↔ sandbox section rewritten:--yolodoes NOT auto-enable a sandbox.Phase 2 — opt-in run-level budgets with a distinct exit code
--max-wall-time/model.maxWallTimeSeconds: wall-clock duration for the whole run. Flag accepts90(s),30s,5m,1h,500ms. Settings is plain seconds.--max-tool-calls/model.maxToolCalls: cumulative tool executions (success + failure). Ticked before eachexecuteToolCallso a budget of N caps at exactly N executions.FatalBudgetExceededError(exit code 55), distinct fromFatalTurnLimitedError(53) andFatalCancellationError(130) so CI scripts can branch on the reason.RunBudgetEnforcerwired to the sameAbortControlleras SIGINT; arouteAborthelper distinguishes budget vs. user-cancel at every abort-check site and at the outer catch.What was fixed vs. #4105 review
All four [Critical] findings from the prior review pass:
for awaitpreviously exited viafinalizeAssistantMessage(); return;, silently swallowing a budget abort that fires during the last drain item (exit 0 instead of 55). Now routes throughrouteAbort.maxWallTimeSeconds: 0in settings.json — was silently disabling the budget (enforcer treats<= 0as "no timer") while the CLI flag was fatal.validateMaxWallTimeSettingnow mirrors the CLI rejection. Also rejectsInfinity/NaN.setTimeoutoverflow at large durations — Node clamps delays >= 2^31 ms to 1 ms (fires almost immediately). Both parser paths reject anything aboveMath.floor((2^31 - 1) / 1000)s (~24.8 days).--max-tokens(deferred).Plus several [Suggestion] items: strict-truthy
SANDBOXenv match, explicitthrow 'unreachable'afterhandleBudgetExceededErrorinrouteAbort, JSDoc file paths.What's deliberately out of scope
--max-api-calls— useful but adds risk surface (drain-turn semantics, retries) without delivering qualitatively different protection beyond--max-wall-time. Follow-up.--max-tokens— retrospective enforcement only (counts arrive after the model speaks), plus naming collision withsessionTokenLimit. Follow-up.handleCancellationError/handleMaxTurnsExceededErroris "structured only onOutputFormat.JSON". A wider change should adjust all three handlers together; not budget-specific.--max-budget-usd— needs per-provider price tables that don't exist in-tree. Explicitly deferred in Headless / non-interactive mode lacks runaway protection guardrails #4103.Test plan
runBudget.test.ts— 32 tests covering parser happy/reject paths (incl.0, negatives, garbage,>24.8d), setting validator (incl.Infinity/NaN),maxToolCalls=0= disallowed,-1= unlimited, wall-clock under fake timers,stop()cancels pending timer, idempotentstart(), first-fence-wins, SIGINT-race protection.headlessSafetyWarnings.test.ts— 7 tests covering YOLO + sandbox / env matrix, strict-truthySANDBOX, suppression env.nonInteractiveCli.test.ts(46),gemini.test.tsx(23),config/config.test.ts(220),core/utils/errors.test.ts(12),core/config/config.test.ts(172).npm run lint— cleantsc --noEmitonpackages/coreandpackages/cli— cleanManual smoke (recommended for reviewer)
qwen -p "loop forever" --yolo --max-wall-time 5s→ exit 55 after ~5s with friendly stderr message.qwen -p "do many things" --max-tool-calls 3→ exit 55 BEFORE the 4th tool runs.qwen -p "hi" --yolo→ YOLO-no-sandbox warning on stderr;QWEN_SANDBOX=1 qwen -p "hi" --yolo→ no warning.qwen -p "hi" --max-wall-time bogus→ fatal at startup, not a silent disable.echo '{"model": {"maxWallTimeSeconds": 0}}' > ~/.qwen/settings.json && qwen -p "hi"→ fatal at startup (settings symmetry).Backward compatibility
-1(unlimited); existing CLI invocations behave identically.Refs: #4103, #4105