fix(cli): headless / non-interactive runaway-protection guardrails (#4103)#4105
fix(cli): headless / non-interactive runaway-protection guardrails (#4103)#4105BZ-D wants to merge 7 commits into
Conversation
) Phase 1 of #4103: fix the two doc/behavior mismatches and surface an observable signal when an unattended run is configured unsafely. - Update `model.skipLoopDetection` default in settings.md to `true`, matching the schema and CLI fallback set in 396248e. - Rewrite the sandbox section so it no longer claims `--yolo` / `--approval-mode=yolo` auto-enables a sandbox — it doesn't; YOLO only auto-approves tool calls. - Print a one-line stderr warning at startup when a non-interactive run has approval-mode=yolo and no sandbox configured. Suppress with `QWEN_CODE_SUPPRESS_YOLO_WARNING=1`. - Add a "Safety in unattended runs" section to the Headless docs with recommended flag combinations for CI / SDK runs. Wall-clock / token / tool-call budgets are tracked separately as phase 2 of #4103.
… text Self-review pass on the warning helper and adjacent docs: - Make `QWEN_CODE_SUPPRESS_YOLO_WARNING` accept only `1` / `true`, matching the project's existing env-flag parsing in `isUnattendedMode`. Previously any truthy string (including `0` / `false`) would have suppressed the warning, which is a known foot-gun for boolean env vars. - Cover the suppression edge cases (`1`, `true`, vs `0`, `false`, `''`, `no`) in the unit test. - Reword the headless docs' phase-2 reference so it no longer claims the budget work is tracked "in #4103" — #4103 is this issue; budgets are the next phase of it. - Drop the stale "(streaming and LLM)" parenthetical from the `skipLoopDetection` schema description: the LLM-based detector was removed in 396248e, only the streaming detector remains.
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
…ax-tool-calls / --max-api-calls) (#4103) Phase 2 of #4103 — the actual unattended-run guardrails the issue called out as missing. All three budgets are opt-in (-1 default = unlimited), enforced cooperatively against the same AbortController that already carries SIGINT, and surface a structured FatalBudgetExceededError with exit code 55 (distinct from FatalTurnLimitedError's 53 and SIGINT's 130 so CI scripts can branch on the reason). Plumbing: - Config (core): adds maxWallTimeSeconds / maxToolCalls / maxApiCalls to ConfigParameters and three getters, mirroring the maxSessionTurns pattern. Default -1 everywhere. - CLI (yargs): adds --max-wall-time (duration string: 90 / 30s / 5m / 1h / 500ms), --max-tool-calls, --max-api-calls. Garbage durations are fatal at startup rather than silently disabling the guardrail. - Settings: model.maxWallTimeSeconds / model.maxToolCalls / model.maxApiCalls join the schema, all defaulting to -1. Enforcement (packages/cli/src/utils/runBudget.ts): - RunBudgetEnforcer wraps the abortController and exposes tickApiCall / tickToolCall plus a wall-clock setTimeout (unref'd so it doesn't keep the event loop alive). First-fence-wins: subsequent overruns don't clobber the original reason record. - Wired into nonInteractiveCli's main + drain turn paths: tickApiCall fires before each sendMessageStream, tickToolCall fires after each executeToolCall. A routeAbort helper distinguishes budget aborts from SIGINT cancellation at every abort-detection site, and the outer catch block re-routes mid-stream AbortErrors through the budget handler so users see "Run aborted: …" instead of a raw "AbortError". - Stop() is called from the existing finally block so daemons reusing a single process across runs don't accumulate stale timers. Tests: - 24 unit tests for runBudget.ts: parseDurationSeconds (happy + garbage cases), counter semantics (post-increment, off-by-one, =0 means disallowed), wall-clock firing under fake timers, stop() cancellation, first-fence-wins, idempotent start(). - nonInteractiveCli.test.ts config mock + gemini.test.tsx CliArgs stub pick up the new fields. No regressions in either suite. Docs: - docs/users/features/headless.md: new "Run-level budgets" table and expanded "Recommended combinations" section with concrete CI recipes. Token / dollar budgets explicitly deferred (need provider price tables) with a pointer back to the issue. - docs/users/configuration/settings.md: three new rows in the model.* table, each linking the per-invocation CLI flag. Defers `--max-tokens` and `--max-budget-usd` to a follow-up — both require provider-specific accounting and a tokens-mid-run observability hook that doesn't exist yet.
Multi-round audit pass on phase 2 (#4103). Findings + fixes: - **Off-by-one for tool calls.** `tickToolCall` ran *after* `executeToolCall`, so with `--max-tool-calls=N` the (N+1)th tool was already executed before the abort fired (the loop saw `count > limit` on the (N+1)th tick, not the Nth). Moved the tick to before the call so a budget of N caps the run at exactly N executions. Mirrors how `tickApiCall` was already wired. - **`parseDurationSeconds("0")` was a foot-gun.** It silently disabled the budget (the enforcer treats `<=0` as "no timer"), opposite of what a user typing `0` likely means. Now rejected with a friendly message pointing users to omit the flag for unlimited. Tests cover `0` / `0s` / `0ms`. - **Dead `value < 0` check** in the parser was unreachable (the regex disallows a leading sign) — removed, and the error wording clarified. - **Race in `markExceeded`.** If SIGINT aborted the shared `AbortController` first and a budget then fired in the narrow window before the abort check, `getExceeded()` would have returned a budget reason — causing `routeAbort` to emit exit code 55 ("budget exceeded") when the real cause was exit 130 (cancellation). Now `markExceeded` short-circuits if the controller was already aborted by a third party. Added a regression test. - **Removed unused `hasAnyLimit()`** — speculative API surface with no production caller; tests covered it, but no code paths skipped enforcer instantiation based on its result. - **Docs:** settings.md clarifies the CLI flag requires a positive duration; runBudget.ts docstrings updated to match the new pre-call tick semantics and the `0`-rejection rule.
…#4103) Two of the three remaining items from the issue's 中期 checklist. The third (--max-budget-usd) needs per-provider price tables and is explicitly deferred per the issue's own framing. **--max-tokens N** Cumulative total-token budget (input + output across every model call). Enforced retrospectively by polling `uiTelemetryService.getMetrics()` after each `for await` of a model stream — the run may overshoot the budget by at most one final response, which is inherent to "tokens are only counted after the model emits them". Distinct from the existing `sessionTokenLimit`, which caps next-prompt context size rather than cumulative spend. - Plumbed end-to-end: core Config (`maxTokens` param + field + getter), settings.model.maxTokens (default -1), CLI flag, RunBudgetOptions. - `RunBudgetEnforcer.tickTokens(cumulativeTotal)` checks the running total against the budget; aborts via the shared AbortController on overrun. Source of truth for the count stays in uiTelemetryService — threading a parallel counter would risk drift. - Wired into both the main turn and drain-turn loops, right after the for-await drains, before any tool dispatch or next stream. - 2 new unit tests cover the exceeded and unlimited cases. **YOLO exit-time dangerous-tool audit** Single stderr line summarising how many `shell` / `write` / `edit` tool calls were auto-approved during a YOLO run, e.g.: YOLO audit: executed 4 shell, 2 write, 1 edit tool call(s) during this run. Read-only tools (grep, glob, read_file) and meta tools (todo_write, save_memory) are intentionally out of scope — widening the list dilutes the signal. - `getDangerousToolAuditLine(config, metrics, env)` lives next to the startup YOLO warning in `headlessSafetyWarnings.ts`. Shares the `QWEN_CODE_SUPPRESS_YOLO_WARNING=1` env var so consumers opt out of both with one knob. - Emission is idempotent (`auditEmitted` latch) and fires at every exit path: success return (finally), budget abort / SIGINT (routeAbort), max-turns hits, catch-path budget re-route, catch-path generic handleError. Each handler calls `process.exit()` which bypasses the surrounding finally, hence the per-site placement. - 6 new unit tests cover the YOLO gate, the no-dangerous-call short circuit, partial-coverage counts, and suppression-env edge cases. **Docs** - `headless.md` gains a `--max-tokens` row in the flag table, a row in the run-level-budgets table, and a new "YOLO exit-time audit" subsection. - `settings.md` gains a `model.maxTokens` row. - The "Cumulative token and dollar-cost budgets are deferred" note is trimmed to just the dollar-cost / `--max-budget-usd` deferral. **What's left from the issue** Only `--max-budget-usd N` / provider-neutral budget hook remains — explicitly deferred in the issue itself ("能接入不同 provider 的价格模型时 再精确计费") and now the only outstanding 中期 item. All four 短期 items and four of five 中期 items are delivered in this branch.
Self-review pass on phase 3. `uiTelemetryService` is a process-global singleton (confirmed via inspection of `packages/core/src/telemetry/uiTelemetry.ts`), so the raw cumulative metrics it exposes leak across runs when the same process invokes `runNonInteractive` more than once (the `qwen serve` daemon, the stream-JSON SDK session loop, anything calling `runNonInteractive` multiple times). Before this fix, the second run of a daemon would see the first run's tokens / shell-write-edit calls counted against its `--max-tokens` budget and reported in its YOLO audit line. Wrong semantic — the guardrails are documented as **per-run**, not per-process. Fix: - Snapshot `(tokens, shell/write/edit counts)` at the very top of `runNonInteractive`, right after the enforcer is constructed. - `tickTokens` is now passed `(currentTotal - baselineTokens)` from each polling site; the enforcer's parameter is renamed `perRunTokens` and its abort message says "in this run" instead of "cumulative". - The YOLO audit helper now takes a `DangerousToolCounts` delta object rather than raw `SessionMetrics`; `readDangerousToolCounts` extracts the canonical shell/write/edit triple from a metrics blob and the caller subtracts the baseline before invoking the audit. Tests: - 4 new unit tests for `readDangerousToolCounts` + the daemon multi-run delta scenario in `headlessSafetyWarnings.test.ts`. - Existing audit tests rewritten to pass the new `DangerousToolCounts` shape. - `runBudget.test.ts` updated to reflect the renamed parameter; all 26 tests still pass. - 101 tests pass across the four touched suites; lint clean. CLI mode (one process per run) is unaffected — baseline is 0 there, so delta == cumulative.
wenshao
left a comment
There was a problem hiding this comment.
Additional test coverage gaps: handleBudgetExceededError, resolveMaxWallTimeSeconds, and the budget enforcement integration in nonInteractiveCli.ts all lack tests — the existing tests only add -1 (unlimited) stubs so none of the new enforcement code paths are exercised within the run loop.
| const itemToolCallRequests: ToolCallRequestInfo[] = []; | ||
| const itemApiStartTime = Date.now(); | ||
| budgetEnforcer.tickApiCall(); | ||
| if (abortController.signal.aborted) await routeAbort(); |
There was a problem hiding this comment.
[Critical] The inner drain-loop for await abort handler (in unchanged code just below this hunk) does adapter.finalizeAssistantMessage(); return; instead of await routeAbort(). This is inconsistent with the routeAbort() call added here (line 908) and with the main-turn stream loop (line 797).
If a wall-clock or other budget timer fires during the last drain item's stream, and localQueue is empty, the hold-back loop breaks without re-checking abortController.signal.aborted — the run falls through to emitResult and exits with code 0, silently swallowing the budget abort.
The same pattern should apply: replace return; with await routeAbort(); in the drain inner-loop abort handler.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
| * otherwise we fall through to the SIGINT / user-cancel path so existing | ||
| * behavior is preserved. | ||
| */ | ||
| const routeAbort = async (): Promise<never> => { |
There was a problem hiding this comment.
[Suggestion] routeAbort calls handleBudgetExceededError (which exits the process) then unconditionally falls through to handleCancellationError. The correctness depends on handleBudgetExceededError always terminating via process.exit(). If a future refactor makes budget errors resumable, the fall-through will silently misattribute budget aborts as cancellations (exit code 55 → 130), breaking CI scripts that branch on exit codes.
Consider adding an explicit return after await handleBudgetExceededError(config, exceeded) to make the mutual exclusivity obvious at the syntax level.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
| // `baselineTokens` accounts for the process-global telemetry | ||
| // singleton (see baseline snapshot comment above). | ||
| budgetEnforcer.tickTokens( | ||
| (computeUsageFromMetrics(uiTelemetryService.getMetrics()) |
There was a problem hiding this comment.
[Suggestion] computeUsageFromMetrics(uiTelemetryService.getMetrics()) is called on the hot path after every model response (including every drain turn, see line 962). Internally it computes 10+ stats fields (apiTimePercent, toolTimePercent, cacheEfficiency, etc.) that are all discarded — only .total_tokens is used. For runs with many drain turns, this adds unnecessary CPU and GC pressure.
Extract a lightweight helper that just sums tokens:
| (computeUsageFromMetrics(uiTelemetryService.getMetrics()) | |
| function getTotalTokens(metrics: SessionMetrics): number { | |
| return Object.values(metrics.models).reduce( | |
| (acc, m) => acc + m.tokens.total, 0, | |
| ); | |
| } |
— DeepSeek/deepseek-v4-pro via Qwen Code /review
| const budgetExceeded = budgetEnforcer.getExceeded(); | ||
| if (budgetExceeded) { | ||
| emitYoloAudit(); | ||
| await handleBudgetExceededError(config, budgetExceeded); |
There was a problem hiding this comment.
[Suggestion] In the catch block, when budgetEnforcer.getExceeded() is truthy, handleBudgetExceededError is called directly (triggering process.exit(55)), bypassing adapter.emitResult(). This means --output-format stream-json and --output-format json consumers receive no structured result event on budget abort — the stream just ends with a bare exit code and a single stderr line.
Consider emitting the result through the adapter before calling handleBudgetExceededError, so structured consumers get a closing result event with usage stats and error context.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
| // execute shell / write / edit tools at the current process's | ||
| // privilege level. Surface a one-line stderr warning so unattended | ||
| // CI / SDK runs have at least an observable signal. See issue #4103. | ||
| const yoloWarning = getHeadlessYoloSafetyWarning(config); |
There was a problem hiding this comment.
[Suggestion] getHeadlessYoloSafetyWarning(config) is called unconditionally — without guarding on !config.isInteractive(). An interactive --yolo user sees the stderr warning "running headless with --yolo / approval-mode=yolo and no sandbox" even though they are not running headless. The function name and warning text both imply headless-only scope, but the call site applies to all modes.
Wrap in a !config.isInteractive() guard or rename the function to getYoloNoSandboxWarning to reflect its actual scope.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
| if (config.getSandbox()) return null; | ||
| // SANDBOX is set by the sandbox transport itself (container / seatbelt | ||
| // wrapper) so any non-empty value means we're already inside a sandbox. | ||
| if (env['SANDBOX']) return null; |
There was a problem hiding this comment.
[Suggestion] The sandbox-in-container check uses if (env['SANDBOX']) — a plain truthy check. SANDBOX=false, SANDBOX=0, or SANDBOX=no would all suppress the warning even though the process is not sandboxed. This is inconsistent with the QWEN_CODE_SUPPRESS_YOLO_WARNING check above which uses isTruthyEnv (only 1 / true). Use the same strict matching pattern for consistency and to prevent trivial bypass.
| if (env['SANDBOX']) return null; | |
| if (isTruthyEnv(env['SANDBOX'])) return null; |
— DeepSeek/deepseek-v4-pro via Qwen Code /review
| env: NodeJS.ProcessEnv = process.env, | ||
| ): string | null { | ||
| if (config.getApprovalMode() !== ApprovalMode.YOLO) return null; | ||
| if (config.getSandbox()) return null; |
There was a problem hiding this comment.
[Suggestion] config.getSandbox() returns the configured sandbox (user intent), not whether the process is actually running inside one. If the sandbox fails to start but the config entry persists, this short-circuits the warning — giving a false sense of containment. The SANDBOX env var check on line 50 is the actual runtime signal, but it runs second. Consider prioritizing the runtime SANDBOX env check before the config check, or adding a comment noting this limitation.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
| /** | ||
| * Wall-clock budget for an unattended run, in seconds. `-1` (default) | ||
| * means no limit. Enforced by the CLI's non-interactive run loop — | ||
| * see RunBudgetEnforcer in `packages/cli/src/utils/runBudgetEnforcer.ts`. |
There was a problem hiding this comment.
[Suggestion] JSDoc references packages/cli/src/utils/runBudgetEnforcer.ts but the actual file is packages/cli/src/utils/runBudget.ts (no Enforcer suffix). A grep for the filename would return nothing.
| * see RunBudgetEnforcer in `packages/cli/src/utils/runBudgetEnforcer.ts`. | |
| * @see RunBudgetEnforcer in `packages/cli/src/utils/runBudget.ts`. |
— DeepSeek/deepseek-v4-pro via Qwen Code /review
Auto-derived from `packages/cli/src/config/settingsSchema.ts` via `scripts/generate-settings-schema.ts`. Picks up the four new model.* budget keys added in this branch (maxWallTimeSeconds, maxToolCalls, maxApiCalls, maxTokens) plus the rewritten skipLoopDetection description. Without this regeneration, settings.json autocomplete / validation in the VS Code companion would diverge from the actual CLI schema and silently miss the new keys.
| * `model.maxSessionTurns`). A malformed flag is fatal — we'd rather | ||
| * fail at startup than silently disable a CI guardrail. | ||
| */ | ||
| function resolveMaxWallTimeSeconds(argv: CliArgs, settings: Settings): number { |
There was a problem hiding this comment.
[Critical] resolveMaxWallTimeSeconds allows settings maxWallTimeSeconds: 0 to pass through unvalidated. RunBudgetEnforcer.start() treats <= 0 as unlimited — so 0 silently disables the wall-clock guardrail. Meanwhile, the CLI flag path (parseDurationSeconds) explicitly rejects 0 as a fatal error. This asymmetry means users who set maxWallTimeSeconds: 0 in settings.json expecting "zero seconds = abort immediately" get the opposite behavior.
| function resolveMaxWallTimeSeconds(argv: CliArgs, settings: Settings): number { | |
| const fromSettings = settings.model?.maxWallTimeSeconds; | |
| if (typeof fromSettings === 'number') { | |
| if (fromSettings === 0) { | |
| throw new Error( | |
| 'settings model.maxWallTimeSeconds: 0 would disable the budget.' | |
| ); | |
| } | |
| return fromSettings; | |
| } |
— DeepSeek/deepseek-v4-pro via Qwen Code /review
| start(): void { | ||
| if (this.wallTimer !== null) return; | ||
| if (this.maxWallTimeSeconds <= 0) return; | ||
| this.wallTimer = setTimeout(() => { |
There was a problem hiding this comment.
[Critical] setTimeout(cb, this.maxWallTimeSeconds * SECOND) will overflow when maxWallTimeSeconds * 1000 > 2,147,483,647 (~24.8 days). Node.js clamps oversized delays to 1 ms, causing the timer to fire almost immediately — aborting the run before it begins. Consider adding an upper-bound check.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
| // `baselineTokens` accounts for the process-global telemetry | ||
| // singleton (see baseline snapshot comment above). | ||
| budgetEnforcer.tickTokens( | ||
| (computeUsageFromMetrics(uiTelemetryService.getMetrics()) |
There was a problem hiding this comment.
[Critical] Token budget silently bypassed: computeUsageFromMetrics only sets total_tokens when > 0.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
| requiresRestart: false, | ||
| default: true, | ||
| description: 'Disable all loop detection checks (streaming and LLM).', | ||
| description: |
There was a problem hiding this comment.
[Suggestion] skipLoopDetection default changed from false to true — loop detection is now off by default. This is a silent behavioral regression for unattended-run users. Consider calling it out prominently in the release notes.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
| exceeded: BudgetExceeded, | ||
| ): Promise<never> { | ||
| const fatal = new FatalBudgetExceededError(exceeded.message); | ||
| if (config.getOutputFormat() === OutputFormat.JSON) { |
There was a problem hiding this comment.
[Suggestion] handleBudgetExceededError only emits structured JsonFormatter error for OutputFormat.JSON. When format is STREAM_JSON, the function falls through to plain-text stderr. Consider also including STREAM_JSON so SDK/CI consumers receive the structured envelope.
| if (config.getOutputFormat() === OutputFormat.JSON) { | |
| if ( | |
| config.getOutputFormat() === OutputFormat.JSON || | |
| config.getOutputFormat() === OutputFormat.STREAM_JSON | |
| ) { |
— DeepSeek/deepseek-v4-pro via Qwen Code /review
| if (env['SANDBOX']) return null; | ||
| // Explicit user opt-out. Match the project convention (cf. isUnattendedMode | ||
| // in core/utils/retry.ts) so `=0` / `=false` don't accidentally suppress. | ||
| if (isTruthyEnv(env['QWEN_CODE_SUPPRESS_YOLO_WARNING'])) return null; |
There was a problem hiding this comment.
[Suggestion] QWEN_CODE_SUPPRESS_YOLO_WARNING=1 suppresses both the startup YOLO safety warning AND the exit-time YOLO audit line. An operator silencing the startup noise loses the per-run audit summary. Consider separate env vars for warning vs. audit.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
…4103) (#4502) * feat(cli): headless runaway-protection guardrails (#4103) Adds two opt-in run-level budgets and a startup safety warning for non-interactive / CI / SDK runs. All defaults preserve existing behavior; the budgets only fire when the user explicitly sets a limit. Phase 1 — surface unsafe configs and fix doc drift - New `--yolo`-without-sandbox stderr warning at startup of every non-interactive run, emitted by `getHeadlessYoloSafetyWarning` in `packages/cli/src/utils/headlessSafetyWarnings.ts`. Suppressible via `QWEN_CODE_SUPPRESS_YOLO_WARNING=1` (strict `1`/`true` match so `=0` / `=false` don't silence it). Strict env match also applied to the `SANDBOX` check so values like `SANDBOX=0` don't accidentally bypass the warning. - Gated on `!config.isInteractive()` at the gemini.tsx call site so TUI users aren't nagged. - `docs/users/configuration/settings.md`: corrected `model.skipLoopDetection` default (`true`, not `false`) and reworded the `--yolo`/sandbox section — `--yolo` does NOT auto-enable a sandbox; sandboxing must still be opted into explicitly. Phase 2 — run-level budgets with distinct exit code - `--max-wall-time` / `model.maxWallTimeSeconds`: wall-clock duration for the whole run. Flag accepts `90` (s), `30s`, `5m`, `1h`, `500ms`. Settings is plain seconds. - `--max-tool-calls` / `model.maxToolCalls`: cumulative tool executions (success + failure). Ticked BEFORE each `executeToolCall` so a budget of N caps the run at exactly N executions. - New `FatalBudgetExceededError` (exit code 55), distinct from `FatalTurnLimitedError` (53) and `FatalCancellationError` (130) so CI scripts can branch on the reason. JSON output mirrors the `handleMaxTurnsExceededError` / `handleCancellationError` envelope convention. - Enforced via `RunBudgetEnforcer` in `packages/cli/src/utils/runBudget.ts`, wired to the same `AbortController` as SIGINT so existing cancellation plumbing carries the abort. A `routeAbort` helper distinguishes budget vs. SIGINT at the abort-check sites and at the outer catch. Critical correctness fixes (informed by the #4105 review pass) - Drain-loop fall-through: the inner drain-item `for await` previously exited via `finalizeAssistantMessage(); return;`, swallowing a budget abort that fires during the last drain item and surfacing exit code 0. Now routes through `routeAbort` so exit 55 is preserved. - Settings symmetry: `maxWallTimeSeconds: 0` in settings.json is now rejected (same as `--max-wall-time 0`); the enforcer treats `<=0` as "no timer" so silent disable would be a foot-gun. `validateMaxWallTimeSetting` also rejects `Infinity` / `NaN`. - `setTimeout` overflow: both parser paths reject durations above `Math.floor((2^31 - 1) / 1000)s` (~24.8 days). Node clamps oversized delays to 1ms and fires the timer almost immediately; fail loud at startup instead. - First-fence-wins + SIGINT race: `markExceeded` no-ops if the controller was already aborted by a third party, so a budget tick arriving after user SIGINT doesn't misattribute the abort to exit code 55. - Outer catch re-routes mid-stream `AbortError`s through the budget handler so users see "Run aborted: …" instead of raw "AbortError". Tests - `runBudget.test.ts` (32 tests): parser happy / reject paths, setting validator, post-increment off-by-one, `maxToolCalls=0` meaning "disallowed", `-1` meaning unlimited, wall-clock under fake timers, `stop()` cancels pending timer, idempotent `start()`, first-fence-wins, SIGINT-race protection. - `headlessSafetyWarnings.test.ts` (7 tests): YOLO + sandbox / env matrix; strict-truthy `SANDBOX` check; suppression env. - Pre-existing suites: `nonInteractiveCli.test.ts` (46), `gemini.test.tsx` (23), `config/config.test.ts` (220), `core/utils/errors.test.ts` (12), `core/config/config.test.ts` (172) all green after picking up the new config getters / CliArgs fields. Backward compatibility - All budgets default to `-1` (unlimited); existing CLI invocations behave identically. - New stderr warning only fires in the narrow YOLO-no-sandbox case, with an explicit suppress env. - New exit code 55 is purely additive; no existing exit codes change meaning. * fix(cli): address audit findings for headless guardrails (#4103, #4502) Round-1 audit (3 angles × line-by-line + removed-behavior + cross-file) plus an open-ended design pass surfaced eight correctness issues. This commit lands all of them; the larger ACP / serve-mode structural items are documented for follow-up. Correctness fixes - headlessSafetyWarnings: `SANDBOX` env check reverted to plain truthy. The sandbox transport sets `SANDBOX` to `sandbox-exec` (macOS seatbelt) or the container name (`qwen-code-sandbox`), neither of which matches `isTruthyEnv`. The PR's strict-`1`/`true` check was emitting the "no sandbox" warning INSIDE real sandboxes. Match the rest of the codebase (sandboxConfig.ts, gemini.tsx, Footer.tsx, prompts.ts, …) which all treat any non-empty value as "sandboxed". - nonInteractiveCli main-loop abort: add `finalizeAssistantMessage()` before `routeAbort()`. The drain-item loop already had it (PR #4502 Critical bug #1); the main loop was asymmetric — stream-json consumers would see an unterminated `message_start` when a budget / SIGINT abort landed mid-stream. - nonInteractiveCli drain-loop `routeAbort`: also flush `flushQueuedNotificationsToSdk(localQueue)` and `finalizeOneShotMonitors()` before exiting. The old `return`-and- fall-through path went through the outer holdback loop, which did this flushing; switching to `routeAbort()` skipped it, so `task_started` envelopes lost their paired `task_notification`. - nonInteractiveCli catch handler: emit `adapter.emitResult({...})` BEFORE `handleBudgetExceededError`, with the budget message as `errorMessage` when budget tripped. Previously the budget handler `process.exit(55)`ed before the adapter could emit a terminal `result` envelope, so STREAM_JSON consumers never saw a stream terminator on budget exits and hung waiting for one. - runBudget: new `validateMaxToolCalls` mirrors `validateMaxWallTimeSetting`. yargs coerces non-numeric flag values (`--max-tool-calls abc`) to `NaN`, and the enforcer's `>= 0` gate treats `NaN` and negatives as "no limit", silently disabling the budget. Reject `NaN`, `Infinity`, fractional, and negative-other- than-`-1` values at both flag and settings layers. `0` remains legal (`first tick aborts`), unlike wall-time where 0 is fatal. - runBudget: new `MIN_WALL_TIME_SECONDS = 1` floor. Previously `--max-wall-time 500ms` parsed cleanly and aborted on the next event-loop tick before any model round-trip — almost certainly a typo (`5m`?) and not a useful guardrail at any rate. - nonInteractiveCli `tickToolCall`: exempt `ToolNames.STRUCTURED_OUTPUT`. Under `--json-schema` this is the terminal "I'm done" contract tool, not real work. Without the exemption a budget-edge completion is aborted as a false positive (model used N tools then emitted structured_output as call N+1 → exit 55 instead of success). - commands/serve.ts: emit the YOLO-no-sandbox warning at daemon startup when settings.json statically configures `tools.approvalMode: 'yolo'` with no `tools.sandbox` / `SANDBOX` env. The daemon can't use `getHeadlessYoloSafetyWarning` (no Config yet — sessions get their own) so we re-derive the predicate from settings. Per-session ACP override is documented as out of scope. Documentation - `docs/users/features/headless.md`: new "Scope" subsection under Run-level budgets explaining (a) `--max-tool-calls` counts top-level dispatches only — subagent / `agent` tool inner calls are not counted, (b) `structured_output` is exempt, (c) stream-json input mode resets budgets per user message, (d) `qwen serve` / ACP sessions do not currently consult budgets from settings.json. Tests - `runBudget.test.ts` grows from 32 → 41 tests: `validateMaxToolCalls` (NaN / Infinity / negatives / fractional), `parseDurationSeconds` sub-second rejection, `validateMaxWallTimeSetting` sub-second rejection. - `headlessSafetyWarnings.test.ts`: replaced the "still warns when SANDBOX is 0/false/no" case (which encoded the strict-check bug) with positive coverage for the real sandbox-set values (`sandbox-exec`, `qwen-code-sandbox`). All previously-green suites still green: cli/nonInteractiveCli (46), cli/gemini.test (23), cli/config/config.test (220), core/utils/errors (12), core/config/config.test (172). 337 tests across the touched suites. Won't-fix (out of scope, documented or pre-existing) - Unpaired `tool_use` in stream-json when a tool is aborted mid-execution — pre-existing structural gap (SIGINT mid-tool has the same outcome); PR amplifies it but doesn't introduce it. - Narrow SIGINT-vs-budget-timer race — already mitigated by `markExceeded`'s `signal.aborted` check. - `tickToolCall` increments past abort (cosmetic; only affects the `observed` value in the error envelope for a pathological caller). * fix(cli): round-2 audit fixes for headless guardrails (#4103, #4502) Round-2 audit (after round-1 commit 40ae6dd) surfaced two NEW correctness issues introduced by the round-1 catch-handler restructure, plus a handful of polish items from a parallel design pass. Correctness fixes (new bugs from R1) - nonInteractiveCli catch handler: wrap `adapter.emitResult` in try/catch. R1 moved the emit BEFORE `handleBudgetExceededError` so STREAM_JSON consumers see a terminal envelope first. But emitResult eventually hits `stdout.write`, which throws on EPIPE / ERR_STREAM_WRITE_AFTER_END when a piped consumer closes early (`qwen -p ... | head -n 1` is the common CI case). Letting that throw bubble out skipped both `handleBudgetExceededError` and `handleError`, dropping the documented exit-code-55 contract precisely when stdout was in trouble. Best-effort emit and continue to the exit handler. - nonInteractiveCli `structured_output` exemption: also require `config.getJsonSchema?.() !== undefined`. Without that guard, an MCP server registering an unrelated tool literally named `structured_output` would silently bypass `--max-tool-calls`. Also documents (in `headless.md` "Scope") the related caveat that failed Ajv-validation retries skip the tick too, so a malformed-output retry loop is NOT bounded by `--max-tool-calls` — combine with `--max-session-turns` or `--max-wall-time`. Polish - runBudget `validateMaxToolCalls` upper bound: cap at 1_000_000. `1e10` (typo for `1e1`) would otherwise parse cleanly, pass the `>= 0` gate forever, and silently disable the budget — the exact foot-gun `MAX_WALL_TIME_SECONDS` was built to prevent. Symmetry. - runBudget `parseDurationSeconds` sub-second hint: only append the "did you mean Ns?" suggestion when the input actually contained `ms`. Bare `0.5` would otherwise produce a useless "did you mean 0.5s?" suggestion. - nonInteractiveCli `routeAbort`: the `throw 'unreachable'` is only hit if `handleBudgetExceededError` / `handleCancellationError` ever becomes resumable (e.g. mocked `process.exit` in a test). Carry the original exceeded.message into the thrown Error so the outer catch's `errorMessage` field stays actionable instead of degrading to a literal "unreachable" string. - commands/serve.ts: compare `approvalMode` against `ApprovalMode.YOLO` enum instead of the string literal `'yolo'`. If the enum value is ever renamed, the startup warning stays in sync with the helper at `headlessSafetyWarnings.ts` instead of silently going dead. Documentation - `headless.md` "Scope": clarify the `structured_output` exemption is unconditional (including failed validations); add explicit note that `--max-session-turns` does NOT exempt `structured_output`, so size to `N+1` for `N` real-work turns under `--json-schema`. - `headless.md` flag table: add `1.5h` to the accepted-forms hint for `--max-wall-time` (the parser already accepts fractional units). Tests - `runBudget.test.ts`: new coverage for the `validateMaxToolCalls` ceiling. Total 42 tests across `runBudget.test.ts` (was 41), all green. cli/nonInteractiveCli, gemini.test, config/config all unchanged and still green. Won't-fix (documented above or out of scope) - ACP per-session approval-mode escalation (mid-session flip to YOLO) doesn't print the warning — daemon-level wiring; out of scope for this PR. - 1s wall-time floor vs higher (5–10s) — debatable, keeping 1s with loud sub-second rejection; can raise later without semver impact. - Integration test for the full budget-trip → catch → emitResult → exit 55 path — requires a process-exit-mocking harness; tracked as follow-up. * docs: align headless guardrails examples with R1 sub-second floor Round-3 audit caught two stale doc surfaces that R1's 1-second wall-time floor (and R2's `1.5h` fractional-unit addition) didn't update: - `docs/users/features/headless.md` budget table: replace stale `500ms` example with `1.5h`, add explicit "minimum 1s — sub-second values are rejected as typos" note. - `docs/users/configuration/settings.md` `model.maxWallTimeSeconds` row: same fix. Also extend `model.maxToolCalls` row with the structured_output exemption note, the `0` semantic, and the 1,000,000 ceiling that R2 added. A user copying the documented `--max-wall-time 500ms` example from either surface would hit a startup error after R1. Known follow-up (not addressed in this commit) - No test exercises the R2 `isStructuredOutputExempt` predicate end-to-end. Adding one needs the same process-exit-mocking harness called out in the R2 commit as a separate follow-up. * docs: align JSDoc / schema / CLI help with R1+R2 validation rules Round-4 final-pass audit caught four schema/help-text/JSDoc surfaces that drifted from the validators introduced in R1 (1s wall-time floor, 24-day ceiling) and R2 (1M tool-call ceiling, structured_output exemption, `0` sentinel). - `runBudget.ts` `parseDurationSeconds` JSDoc: replace stale claim that `500ms` is accepted and "sub-second precision is preserved" with the actual contract — `[MIN_WALL_TIME_SECONDS, MAX_WALL_TIME_SECONDS]`, ms suffix only legal when value resolves to >= 1s. Adds `1.5h` to the accepted-forms list. - `settingsSchema.ts` `model.maxWallTimeSeconds` description: now documents the 1s minimum and ~24-day ceiling. - `settingsSchema.ts` `model.maxToolCalls` description: documents the structured_output exemption, the `0` sentinel ("no tool calls allowed"), and the 1,000,000 ceiling. - `vscode-ide-companion/schemas/settings.schema.json`: mirrors both schema descriptions above so the VS Code settings UI auto-completion matches. - `config.ts` yargs `--max-wall-time` description: documents the 1s floor and the ~24-day max. - `config.ts` yargs `--max-tool-calls` description: documents the structured_output exemption, the `0` sentinel, and the 1M ceiling. `qwen --help` is the most-read surface for these flags; matches the prose docs in headless.md and settings.md. No code changes — pure doc/help-text alignment. --------- Co-authored-by: 克竟 <dingbingzhi.dbz@alibaba-inc.com>
Closes #4103 (all 短期 items + four of five 中期 items; only
--max-budget-usdremains, explicitly deferred — see below).This PR delivers the full set of runaway-protection guardrails the issue identified as missing in headless / non-interactive runs (
--prompt, stdin piping, CI, SDK). Three phases, all included here so reviewers can see how the warnings, docs, and budget enforcement compose into a single story.Phase 1 — fix the doc/behavior mismatches and surface unsafe configs
model.skipLoopDetectiondocs corrected totrue(default)truesince 396248e (LLM detector removed). The settings reference still claimedfalse, misleading anyone trying to reason about the guardrail.--yolo/ sandbox section rewrittenloadSandboxConfigdoes no such thing. Replaced with an explicit warning callout.getHeadlessYoloSafetyWarninghelper inpackages/cli/src/utils/headlessSafetyWarnings.tsfires once per non-interactive run. Suppressible withQWEN_CODE_SUPPRESS_YOLO_WARNING=1(parser matches theisUnattendedModeconvention —1/trueonly, so=0doesn't accidentally silence it).skipLoopDetection's(streaming and LLM)parenthetical referenced a detector deleted in 396248e.Phase 2 — run-level budgets (three guardrails)
Opt-in budgets that abort the run with a structured
FatalBudgetExceededError(exit code 55, distinct fromFatalTurnLimitedError's 53 and SIGINT's 130 so CI scripts can branch on the reason). All default to-1(unlimited).--max-wall-timemodel.maxWallTimeSeconds90(s),30s,5m,1h,500ms. Garbage /0are fatal at startup.--max-tool-callsmodel.maxToolCalls--max-api-callsmodel.maxApiCalls--max-session-turns, every retry / drain-turn request counts.Enforcement design —
packages/cli/src/utils/runBudget.tsRunBudgetEnforcerwraps the sameAbortControllerthat already carries SIGINT, so a budget abort flows through the existing cancellation plumbing.--max-tool-calls=Ncaps the run at exactly N executions — the (N+1)th tick aborts before the work is performed.setTimeout(...).unref()so it doesn't keep the event loop alive past a successful run;stop()runs in the existingfinallyblock so daemons (e.g.qwen serve) reusing one process across many runs don't accumulate stale timers.routeAborthelper at each abort-check site checksgetExceeded()first and emitsFatalBudgetExceededErrorwhen relevant — otherwise it falls through tohandleCancellationErrorso user SIGINT still produces exit code 130.AbortErrors through the budget handler so the user seesRun aborted: …instead of a rawAbortError.markExceeded: if the controller was already aborted by a third party, the enforcer doesn't retroactively claim the abort as a budget event, preventing a wrong exit code.Phase 3 — token budget + dangerous-tool audit
--max-tokens N/model.maxTokensuiTelemetryService.getMetrics()after eachfor awaitof a model stream) — the run may overshoot by at most one final response, which is inherent to "tokens are only counted after the model emits them". Distinct fromsessionTokenLimit, which caps next-prompt context size.shell/write/edittool counts on every YOLO run, e.g.YOLO audit: executed 4 shell, 2 write, 1 edit tool call(s) during this run.Emission is idempotent (auditEmittedlatch) and wired at every exit path becauseprocess.exit()bypasses surroundingfinallyblocks. Shares theQWEN_CODE_SUPPRESS_YOLO_WARNING=1env var so consumers opt out of both warning + audit with one knob.Deliberately deferred
--max-budget-usd— requires per-provider price tables that don't exist in-tree yet. Explicitly framed as a follow-up in the issue itself (能接入不同 provider 的价格模型时再精确计费). The only outstanding 中期 item.Test plan
packages/cli/src/utils/runBudget.test.ts— 26 unit testsparseDurationSeconds: 8 happy-path inputs; 9 reject cases including0,0s,0ms, negatives, garbage.RunBudgetEnforcer: post-increment off-by-one for api/tool counters,maxToolCalls=0means disallowed,-1means unlimited, wall-clock fires under fake timers,stop()cancels pending timer, first-fence-wins, idempotentstart(), SIGINT-race protection, token budget happy / unlimited paths.packages/cli/src/utils/headlessSafetyWarnings.test.ts— 12 unit testsgetHeadlessYoloSafetyWarning(sandbox / SANDBOX / suppression / edge cases)getDangerousToolAuditLine(YOLO gate, no-dangerous-call short circuit, partial-coverage counts, suppression env edge cases)nonInteractiveCli.test.ts(42 + 1 skipped),gemini.test.tsx(14),core/client.test.ts(99) all green after picking up the new config getters / CliArgs fields.npm run lint— cleannpm run typecheck— clean modulo a pre-existingserve/server.test.tssupertest types error that also reproduces onmain.qwen -p "loop forever" --yolo --max-wall-time 5s→ exit 55 after ~5s with friendly message on stderr.qwen -p "do many things" --max-tool-calls 3→ exit 55 after the 4th tool tick (BEFORE the 4th tool runs).qwen -p "summarize this huge corpus" --max-tokens 5000→ exits with budget error after the response that pushes total past 5000.qwen -p "edit a file" --yolo→ on exit, stderr hasYOLO audit: executed 0 shell, 0 write, 1 edit tool call(s) during this run.qwen -p "hi" --yolo→ YOLO-no-sandbox warning on stderr;QWEN_SANDBOX=1 qwen -p "hi" --yolo→ no warning.qwen -p "hi" --max-wall-time bogus→ fatal at startup, not a silent disable.Backward compatibility
-1(unlimited); existing CLI invocations behave identically.