feat(core): hint to background long-running foreground bash commands#3809
Conversation
Phase D part (a) of Issue #3634. When a foreground `shell` tool call runs ≥ 60 seconds and completes (succeeds or errors), append an advisory line to the LLM-facing tool result suggesting re-running with `is_background: true` next time. Why: today a foreground bash that takes minutes (build watcher, soak test, slow npm install, polling loop) blocks the agent indefinitely. The user is already paying for the wait; the agent's next turn could have started running in parallel under `is_background: true`. Sleep interception (#3684) handled the egregious `sleep N` case at validate time; this handles the legitimate-but-long case at result time. Trade-offs: - Threshold = 60s. Half the existing 120s foreground timeout. Long enough that normal `npm install` / `pytest` runs don't trigger; short enough that the hint surfaces before the timeout hard-kills. - Advisory only — the command still runs to completion in the foreground for THIS invocation. The advice is for the agent's NEXT decision, not a corrective action on the current one. - Fires on success AND error completions. The advice is the same ("background it next time") in both cases. - Suppressed on aborted (timeout / user-cancel) — those paths already surface their own messaging and don't benefit from a "should have been background" reminder when the user / system already killed it. Implementation: - New constant `LONG_RUNNING_FOREGROUND_THRESHOLD_MS = 60000` in shell.ts, paired with the existing `DEFAULT_FOREGROUND_TIMEOUT_MS`. - Helper `buildLongRunningForegroundHint(elapsedMs)` exported so future surfaces (UI, telemetry) can render the same text without duplicating the threshold logic. - `Date.now()` bracketing around the spawn → `await resultPromise` block — mirrors what the background path already captures via `entry.startTime`. - Append happens inside the existing non-aborted result builder; zero changes to the cancel / timeout arms. Tests: 4 new cases — fires on long success, omits on short success, fires on long error completion, omits on aborted. Uses vi fake timers to drive wall-clock past the threshold without actually sleeping.
There was a problem hiding this comment.
Pull request overview
Adds an LLM-facing advisory to foreground shell tool results when a command runs for ≥ 60s and completes, nudging the agent to use is_background: true for long-running processes to avoid blocking the turn.
Changes:
- Add a 60s long-running threshold and append an advisory line to foreground (non-aborted) shell results.
- Capture wall-clock execution duration for foreground shell runs.
- Add unit tests validating the hint appears (success/error) and is suppressed on aborted results.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
packages/core/src/tools/shell.ts |
Measures foreground execution duration and appends a long-running is_background: true advisory on non-aborted completions. |
packages/core/src/tools/shell.test.ts |
Adds tests covering threshold behavior and suppression for aborted runs. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
wenshao
left a comment
There was a problem hiding this comment.
— deepseek-v4-pro via Qwen Code /review
wenshao
left a comment
There was a problem hiding this comment.
Review summary: The long-running foreground hint is well-implemented with proper test coverage, fake-timer setup, and correct gating (only non-aborted completions). Two suggestions for polish.
…truncation insertion Addresses 8 review threads on PR #3809 — 6 from /review bots, 2 from copilot — covering doc accuracy, code quality, behavioural gaps, and test coverage. **Behavioural fixes (real bugs)**: - **Suppress on external signal kills** (`result.signal != null` with `aborted: false`). `shellExecutionService` only sets `aborted` when the AbortSignal we passed was triggered, so SIGTERM from container shutdown / k8s eviction / OOM killer / sibling process-group reap falls through to the non-aborted branch. The advisory shouldn't fire there — the process didn't run to its conclusion, so "next time, background it" doesn't fit. New test pins this with `signal: 15` (SIGTERM), `aborted: false`. - **Append AFTER `truncateToolOutput`**. Previously the hint was appended inside the non-aborted result builder, which meant for long outputs it got wrapped in the "Truncated part of the output:" envelope — the LLM might read the advisory as part of the command's own output. New post-truncation insertion + test that pins ordering by mocking `truncateToolOutput` directly (real path needs `fs.writeFile` to actually succeed for the replacement branch to fire). - **Hint wording mode-aware**. The dialog mention dropped the unconditional "(footer pill + Enter)" specifics, which would mislead non-TTY users (`-p` headless / ACP / SDK consumers — no dialog or pill exists there). Now qualified as "in interactive mode the Background tasks dialog also has...". `/tasks` and the on-disk output file are mentioned without qualifier (work in any mode). **Code quality**: - **Threshold programmatically coupled to timeout**: `LONG_RUNNING_FOREGROUND_THRESHOLD_MS = Math.floor(DEFAULT_FOREGROUND_TIMEOUT_MS / 2)`. If the timeout is tuned later, the threshold tracks automatically. - **Docstring corrected**: removed the misleading "before it gets killed by the timeout" claim — the hint is on non-aborted path only, so timeout-killed commands never see it. The new docstring enumerates all suppression paths explicitly. - **Removed stale line-number reference**: comment said "mirrors the background path's `entry.startTime` capture (line ~781)" which goes stale on file edits. Now refers conceptually. **Test coverage gaps closed**: - **Off-by-one boundary**: 59_999ms → no hint. Pairs with the existing 60_000ms-exactly test (which fires) to pin the boundary tightly. A regression flipping `>=` to `>` would fail loudly. - **Timeout path explicit**: previous "aborted" test exercised user- cancel only. With `vi.useFakeTimers({ toFake: ['Date'] })`, `AbortSignal.timeout()` doesn't fake (it depends on the real timer subsystem), so `combinedSignal.aborted` stayed false. New test follows the pre-existing `should handle timeout vs user cancellation correctly` pattern: stubs `AbortSignal.timeout` + `.any` to return an already-aborted combined signal, then verifies "Command timed out after Nms" appears AND no advisory.
…ation Six suggestions from /review's third pass on PR #3809: **Real semantic fix**: - Long-run threshold now scales with the EFFECTIVE timeout, not the fixed default. A user who sets `timeout: 600_000` (10 min) gets the advisory at 5 min, not at 60s — respects the explicit timeout intent. Replaced the `LONG_RUNNING_FOREGROUND_THRESHOLD_MS` constant with a per-invocation `longRunThresholdFor(effectiveTimeout)` helper. **Debug-mode visibility**: - Debug mode previously snapshotted `returnDisplayMessage = llmContent` BEFORE the truncation + hint append, so debug-mode users saw the pre-hint content while the agent saw the advisory — agent suddenly suggesting `is_background: true` had no visible trigger in the TUI. Re-sync `returnDisplayMessage` after the hint append (debug-mode branch only) so the TUI mirrors what the agent sees. **Type-safety footgun**: - `if (typeof llmContent === 'string')` would silently drop the hint if `llmContent` ever becomes structured `Part[]`. Added an explicit `else` comment documenting the deliberate omission and the conditions under which to revisit (no string llmContent path exists today). **Style**: - Replaced the JSDoc `/** ... */` block on the (now-defunct) constant with a plain `//` comment block on the helper, matching the `DEFAULT_FOREGROUND_TIMEOUT_MS` / `OUTPUT_UPDATE_INTERVAL_MS` style. **Test hygiene**: - Wrapped both `vi.stubGlobal('AbortSignal', ...)` and `vi.spyOn(truncateToolOutput, ...)` in `try/finally` so failures during the test body don't leak the stub/spy into subsequent tests (would cause confusing cascading failures). - Dropped the internal-roadmap "Phase D part (a)" reference from the test comment — future maintainers don't have the context. **New test**: - `threshold scales with the user-supplied timeout (not the default)`: sets `timeout: 600_000`, advances 100s, verifies no hint. Pins the per-invocation coupling so a regression to a fixed constant would fail loudly here.
wenshao
left a comment
There was a problem hiding this comment.
[Suggestion] Debug mode hint re-sync path untested (shell.ts:798-805): all tests mock getDebugMode to return false. No test covers the returnDisplayMessage = llmContent re-sync when hint triggers in debug mode — the exact scenario the code comment warns about ("otherwise the agent would suddenly suggest is_background: true with no visible trigger"). Add a test with getDebugMode.mockReturnValue(true) + advanceTimersByTime(60_000) asserting result.returnDisplay contains the hint.
[Suggestion] User-custom timeout above-threshold forward test missing (shell.test.ts:1024-1041): the "threshold scales" test only verifies below-threshold (100s < 300s). Missing the paired above-threshold case — add a test with timeout: 600_000 advancing 350_000ms, asserting the hint DOES fire. This would catch a regression that changes the threshold formula.
— deepseek-v4-pro via Qwen Code /review
wenshao
left a comment
There was a problem hiding this comment.
…truncation insertion (round 4) Six suggestions from /review's pai/glm-5-fp8 pass on PR #3809: **Behavioural / UX**: - **Hint now visible in non-debug TUI too.** Previously only debug mode mirrored the hint into `returnDisplay`; non-debug users saw the agent suggest `is_background: true` with no visible trigger. Now the hint is appended to `returnDisplayMessage` in both modes (full mirror in debug, terse-append in non-debug to preserve the output-or-status form). **Test coverage**: - **Debug-mode re-sync test added.** All other long-run hint tests run with `getDebugMode → false`; this one flips it to true and asserts the hint appears in `returnDisplay` too. Pins the re-sync so a regression that drops the debug branch would fail loudly. - **Threshold-scaling positive test added.** The negative case (`timeout: 600_000`, advance 100s, no hint) was already pinned; paired now with the positive case (advance 305s, hint fires) so a regression to a fixed 60s threshold is caught at both ends. **Style / consistency**: - **`result.signal === null` (was `== null`).** Strict equality to match the rest of the file. The `signal` field is typed `number | null` so loose equality has identical semantics, but the inconsistency was noise. **Doc clarity (timing semantics)**: - **Comment explains why elapsedMs is computed BEFORE truncation.** Two reviewers disagreed on the timing — one read it as before truncation (correct, slightly under-reports), the other as after (incorrect read). The intent is to report the COMMAND's runtime, not the tool call's total time. Truncation is post-processing, not part of "agent blocking time", so excluding it is the right semantic. Inline comment now spells this out so future readers don't have to infer.
wenshao
left a comment
There was a problem hiding this comment.
C1. Critical — returnDisplay 非 debug 模式 hint 追加路径无测试覆盖
Debug 模式测试验证了 returnDisplay,但非 debug 下成功/错误路径从未断言 returnDisplay。若分隔符、文本或追加逻辑损坏无测试捕捉。returnDisplay 是 TUI 用户看到的 — 用户看不到建议违背设计目标。修复:在测试 1 加 expect(result.returnDisplay).toContain('foreground command ran for 60s'),测试 3 同理。
S8. Suggestion — 缺少功能开关
Hint 对每个满足阈值的前台调用无条件开启。若导致模型混淆或 TUI 刷屏,唯一缓解是完整回滚。修复:添加 this.config.getEnableLongRunHint?.() !== false 检查,环境变量可覆盖。
S9. Suggestion — Debug 模式 hint 块与 truncation 存在非正交耦合
Hint 块的 returnDisplayMessage = llmContent 也承担修复 debug 模式截断陈旧快照的责任。重构 hint 代码将退化 debug 模式显示。修复:将 debug 同步逻辑从 hint 块移至 truncation 块。
S10. Nice to have — 空输出静默成功时 returnDisplay 未被测试(output: '', exitCode: 0, 运行 ≥60s 时 returnDisplayMessage 初始为空,追加后仅含 hint 无分隔符)
— deepseek-v4-pro via Qwen Code /review
wenshao
left a comment
There was a problem hiding this comment.
wenshao
left a comment
There was a problem hiding this comment.
wenshao
left a comment
There was a problem hiding this comment.
Reviewing with Qwen3.6-Plus-DogFooding via Qwen Code /review.
No new findings — the debug-mode truncation overwrite issue at shell.ts:815 was already flagged in a prior review round. No blockers.
wenshao
left a comment
There was a problem hiding this comment.
No Critical or Suggestion-level issues found. All findings are Nice-to-have:
longRunThresholdFor(1)returns 0 fortimeout: 1— edge case whereelapsedMs >= 0is always true. Practically unreachable sinceAbortSignal.timeout(1)aborts most commands before completion, but aMath.max(1, ...)floor would close the gap.executionStartTimecaptured beforeawait ShellExecutionService.execute()includes spawn setup overhead — negligible in practice but inconsistent with the "spawn → settle" comment.- Excessive comment volume on
shouldAppendLongRunHint(28-line block for a 3-line boolean) — surrounding code uses 1-3 line inline comments. buildLongRunningForegroundHintexported but has no external consumers — premature per YAGNI.- Error-path test doesn't verify
returnDisplaycontains the hint.
Code logic is correct, test coverage is thorough (10 new tests covering success/error/abort/timeout/signal/boundary/truncation/scaling/debug), and all prior review suggestions have been addressed. LGTM! ✅
— pai/glm-5-fp8 via Qwen Code /review
…shold floor + observability Round 5 of PR #3809 review — 10 threads, mix of Critical and Suggestion: **Critical fixes**: 1. **Hint survives the error path** (`#OWbA`). When result.error is set, coreToolScheduler builds the model-facing functionResponse from `error.message` ONLY (not llmContent — see convertToFunctionResponse + the toolResult.error branch in scheduler:1648-1724). My hint was being silently dropped on long-command-failed cases. Now the hint is appended to error.message too so the advisory survives whichever branch the scheduler takes. 2. **Hint wording de-ambiguated** (`#OU6o`). "prefer re-running with is_background: true" was ambiguous — model could read it as "re-run THIS command in the background", which on stateful commands (DB migrations, deploys, git push) would cause double side effects. Reworded to "Next time you run a SIMILAR long-running process..." with an explicit parenthetical that warns against re-running the just-completed command. 3. **Debug observability** (`#OU6s`). Added `debugLogger.debug` at the hint decision point with elapsedMs / threshold / aborted / signal — when a user reports "my 65s command didn't get the hint" the suppression branch is now visible in DEBUG output. **Other behaviour fixes**: 4. **Threshold floor of 1000ms** (`#OU6r`). Pathological `timeout: 0` / `timeout: 1` would have given a 0-ms threshold, firing the hint on every invocation showing "ran for 0s". Floor at 1s makes that branch unreachable. 5. **`performance.now()` instead of `Date.now()`** (`#OU6v`). NTP corrections / VM clock drift between capture and read would silently make `elapsedMs` negative and skip the hint with no observable failure. Monotonic clock prevents that. 6. **Debug mode preserves truncation marker** (`#OU6w` / `#OWCq`). Previously `returnDisplayMessage = llmContent` after hint clobbered the "Output too long and was saved to: …" line appended during truncation. Switched to append-style re-sync in BOTH modes so prior content is preserved. **Test coverage gaps closed**: 7. **Non-debug returnDisplay test** (`#OWCo`). Pinned that the user TUI gets the hint in the default (non-debug) mode too. 8. **Test rename** (`#OWCl`). The "debug-mode TUI mirror" test passed in non-debug too after the recent refactor; split into two tests, one per branch. 9. **Error-path hint test**. Added a test that pins `result.error?.message` contains both the original error text AND the hint, covering the scheduler-routing-via-error.message path that was silently broken before fix #1. 10. **Test: faketimers also fakes `performance`**. Since we switched to `performance.now()`, `vi.useFakeTimers({ toFake: ['Date'] })` no longer covered the elapsed measurement; extended to `['Date', 'performance']` so the threshold tests can drive the wall-clock with `advanceTimersByTimeAsync`. #OU6t (else-comment for the type guard) was already addressed in the prior round — the explicit else-with-comment is in place; adding logging there would be noise.
PR #3809 review: the new `Math.max(MIN_LONG_RUN_THRESHOLD_MS, ...)` floor in `longRunThresholdFor` was untested — only default-timeout and large-custom-timeout cases existed. A regression that strips the floor would let `timeout: 1` produce a 0ms threshold and fire a "ran for 0s" advisory on every invocation; the test suite would not catch it. New test: build with `timeout: 1`, advance 500ms (below the 1000ms floor), resolve with `aborted: false` to isolate the threshold logic from the abort path. Asserts no hint appears. A regression that removes the floor flips the assertion to fail.
wenshao
left a comment
There was a problem hiding this comment.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
wenshao
left a comment
There was a problem hiding this comment.
No review findings. Downgraded from Approve to Comment: self-PR; GitHub does not allow approving your own PR. LGTM! ✅ — gpt-5.5-0424-global via Qwen Code /review
…ut floor comment Two of three threads from the latest /review pass on PR #3809 (the third — PR description / threshold scaling reconciliation — is fixed in the PR description update, not in code): - **`\n---\n` divider before hint in `error.message`** (`#Pt7C`). Downstream consumers of `error.message` (firePostToolUseFailureHook, telemetry grouping, SIEM alerting, hook-side error parsers) were receiving ~400 chars of advisory text mixed inline with the original error body — pattern-matching on error messages would absorb the advisory into the matched body. Added a `---` separator line so the boundary is unambiguous and split-able. - **Threshold-floor comment narrowed to `timeout: 1`** (`#Pu9o`). The comment said the floor guards `timeout: 0` / `timeout: 1`, but `validateToolParamValues` rejects `timeout <= 0` at validate time, so `timeout: 0` can't reach `longRunThresholdFor`. Updated the comment to mention only the actually-allowed pathological case (`timeout: 1` and any value `< 2` rounds to 0). Test updated to assert the `---` divider format with `toMatch`.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
… counted PR #3809 review: copilot caught that `executionStartTime` was captured BEFORE `await ShellExecutionService.execute(...)`, which meant the elapsed measurement included `getPty()` dynamic-import setup (~50-200ms on first call). The hint's "ran for Xs" reading was slightly inflated, and the comment claiming "spawn → settle" wasn't strictly accurate. Moved the capture immediately after the execute() call returns its { result, pid } handle. The pid being set by that point confirms the process has been spawned, so the subtraction is true post-spawn-to- settle. Comment updated to reflect the actual semantics. The displayed accuracy gain is small (50-200ms on a 60s+ threshold is <1%), but the comment claim now matches what the code measures. Tests unaffected — fakeTimers don't drive real dynamic imports, so the threshold tests behave identically.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…rror semantics Four copilot threads on PR #3809 — all rooted in the same observation: `ShellExecutionResult.error` is reserved for spawn/setup failures (per the field's doc comment in shellExecutionService.ts), NOT for non-zero exit codes. My existing code/tests conflated the two, making the error-path coverage less realistic and the inline comments inaccurate. **Test shape fixes**: - `appends the hint when a long-running foreground command exits with error` → `exits non-zero`. Changed `error: new Error('exit 1')` to `error: null` (the realistic shape for a non-zero exit without spawn failure). Added a comment explaining the field contract so future test authors don't repeat the conflation. - `hint survives the error path (appended to error.message)`: reframed the mock from `spawn ENOENT` (which would resolve in <1s in practice, making the long-elapsed scenario unrealistic) to `PTY initialization failed after 75s` — a slow-spawn-failure shape that COULD plausibly take 75s. Test still pins the same CODE PATH; comment now acknowledges the edge-case nature ("rare but real: PTY init dragging, remote-fs exec syscalls, security scanners interposing"). **Comment corrections**: - `returnDisplayMessage` build-order comment was misleading. It said "the hint is appended after both the truncation block and the returnDisplayMessage build" — but `returnDisplayMessage` is built BEFORE truncation. Replaced with a chronological enumeration (1. initial value, 2. truncation marker append, 3. hint append) that matches what the code actually does. - Error-path preservation comment now acknowledges the narrow applicability (spawn failures only, exit codes don't reach this branch). Code is unchanged — the path is still real, just rare.
doudouOUC
left a comment
There was a problem hiding this comment.
Two minor test coverage suggestions. Neither blocks merge — the code is correct on both paths, these are defensive test additions.
- Empty-output + long-run: All positive hint tests provide non-empty output; a successful command with empty output (e.g. disk-only side effects) that exceeds the threshold has no explicit coverage.
- Background path: No explicit assertion that the hint never appears on background results. By construction it can't, but a defensive test would catch future refactoring regressions.
from GLM-5.1
| // Advance the wall-clock past the 60s threshold. | ||
| await vi.advanceTimersByTimeAsync(60_000); | ||
| resolveShellExecution({ output: 'all green', exitCode: 0 }); | ||
| const result = await promise; |
There was a problem hiding this comment.
[Suggestion] All positive (hint-fires) test cases provide non-empty result.output. There's no test covering a successful command with empty output that exceeds the threshold — e.g. a command that only writes to disk and exits 0 after 65s. The code handles it correctly (returnDisplayMessage stays '' until the hint append, which produces just the hint with no leading newlines), but an explicit test would pin this path:
| const result = await promise; | |
| it('appends the hint when a successful foreground command with empty output runs ≥ 60s', async () => { | |
| const invocation = shellTool.build({ | |
| command: 'write-to-disk.sh', | |
| is_background: false, | |
| }); | |
| const promise = invocation.execute(mockAbortSignal); | |
| await vi.advanceTimersByTimeAsync(65_000); | |
| resolveShellExecution({ output: '', exitCode: 0 }); | |
| const result = await promise; | |
| expect(result.llmContent).toContain('foreground command ran for 65s'); | |
| expect(result.returnDisplay).toContain('foreground command ran for 65s'); | |
| }); |
This also validates that returnDisplay gets the hint even when result.output is empty (the returnDisplayMessage is '' before the append, and the hint appears without a leading \n\n).
from GLM-5.1
There was a problem hiding this comment.
Fixed in 40e2d75. Added 'appends the hint when a successful foreground command with empty output runs ≥ 60s' — the empty-output success branch where returnDisplayMessage stays '' until the hint append. Asserts both llmContent and returnDisplay carry the advisory: the user is the one who waited 65s on a tar czf / cp -r / dd style write-only command, they should see the same hint the agent does. Comment notes the realistic motivating shapes (write-only commands that exit 0 with no stdout).
There was a problem hiding this comment.
Fixed in 40e2d75. Verified the new test — output: '', exitCode: 0 with 65s elapsed correctly covers the empty-returnDisplayMessage path. The hint append uses the empty-string branch (no \\n\\n prefix), so returnDisplay is just the hint text. LGTM.
from GLM-5.1
| /PTY initialization failed after 75s\n\n---\n/, | ||
| ); | ||
| }); | ||
| }); |
There was a problem hiding this comment.
[Suggestion] The long-running hint describe block only tests the foreground (is_background: false) path. There's no explicit test asserting that the background path (is_background: true) never includes the hint. By construction the hint logic is only in executeForeground() and executeBackground() returns a completely different result shape (Background shell started...), so this can't fail today. But an explicit defensive test would guard against a future refactor that accidentally moves shared logic into a common path:
| }); | |
| it('never appends the long-run hint on background commands', async () => { | |
| const invocation = shellTool.build({ | |
| command: 'pytest -q', | |
| is_background: true, | |
| }); | |
| const result = await invocation.execute(mockAbortSignal); | |
| expect(result.llmContent).not.toContain('foreground command ran for'); | |
| expect(result.llmContent).not.toContain('is_background: true'); | |
| expect(result.llmContent).toContain('Background shell started'); | |
| }); |
from GLM-5.1
There was a problem hiding this comment.
Fixed in 40e2d75. Added 'never appends the long-run hint on background commands' as a defensive pin. By construction the hint logic only lives in executeForeground so this can't fail today, but it guards the failure mode you described — a future refactor that hoists the advisory into a shared post-execute path would accidentally tag every is_background: true launch with a nonsensical "ran for 0s, consider is_background: true" suggestion. Test asserts both the negative invariants (no foreground command ran for, no is_background: true literal — that's only in the hint text) AND the positive invariant (Background shell started is preserved).
There was a problem hiding this comment.
Fixed in 40e2d75. Verified the new test — is_background: true path returns Background shell started and correctly asserts no hint leakage. The extra is_background: true string check is a nice touch to catch hint text leaking into background results. LGTM.
from GLM-5.1
Two defensive tests for the long-running foreground hint: - empty-output success at >=60s — exercises the returnDisplayMessage='' → hint append branch (write-only commands like `tar czf` / `cp -r` produce no stdout). Asserts the user- facing returnDisplay still surfaces the advisory even when the command produced nothing else to show. - background never includes the hint — the foreground hint logic lives in executeForeground only, so today this can't fail; the test guards against a future refactor hoisting the advisory into a shared post-execute path that would tag every background launch with a nonsensical "ran for 0s, consider is_background: true" suggestion.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
doudouOUC
left a comment
There was a problem hiding this comment.
Review Summary
Thoroughly reviewed the long-running foreground bash hint implementation across 10+ review iterations. The PR is in strong shape — all issues from prior rounds have been addressed, and the two suggestions I raised (empty-output test gap + background path defensive test) were fixed in 40e2d75.
Key design points verified
- Threshold =
effectiveTimeout / 2with 1000ms floor: Per-invocation coupling to the actual timeout, not a fixed constant. Floor prevents pathologicaltimeout: 1from triggering "ran for 0s" advisory. performance.now()overDate.now(): Monotonic clock prevents negative elapsed from NTP/VM drift.- Post-truncation insertion: Hint appears outside the truncation envelope so LLM doesn't misread it as command output.
- Error-path preservation: Hint appended to both
llmContentANDerror.message(with\n---\ndelimiter) so the advisory survives the scheduler's error branch. - Suppression on aborted + external signal: Correct — "next time background it" doesn't apply when the command didn't complete under its own steam.
- Anti-re-run wording: Explicit warning against re-running the just-completed command, preventing double side effects on stateful operations.
Test coverage
17 new tests covering: positive (success, non-zero exit, empty output, scaled threshold), negative (under-threshold, aborted, timeout, external signal, off-by-one, tiny timeout floor), integration (post-truncation ordering, non-debug TUI, debug-mode TUI, error.message path, background path). All edge cases pinned with boundary tests.
No remaining issues
All review comments resolved. LGTM.
from GLM-5.1
…3809) * feat(core): hint to background long-running foreground bash commands Phase D part (a) of Issue #3634. When a foreground `shell` tool call runs ≥ 60 seconds and completes (succeeds or errors), append an advisory line to the LLM-facing tool result suggesting re-running with `is_background: true` next time. Why: today a foreground bash that takes minutes (build watcher, soak test, slow npm install, polling loop) blocks the agent indefinitely. The user is already paying for the wait; the agent's next turn could have started running in parallel under `is_background: true`. Sleep interception (#3684) handled the egregious `sleep N` case at validate time; this handles the legitimate-but-long case at result time. Trade-offs: - Threshold = 60s. Half the existing 120s foreground timeout. Long enough that normal `npm install` / `pytest` runs don't trigger; short enough that the hint surfaces before the timeout hard-kills. - Advisory only — the command still runs to completion in the foreground for THIS invocation. The advice is for the agent's NEXT decision, not a corrective action on the current one. - Fires on success AND error completions. The advice is the same ("background it next time") in both cases. - Suppressed on aborted (timeout / user-cancel) — those paths already surface their own messaging and don't benefit from a "should have been background" reminder when the user / system already killed it. Implementation: - New constant `LONG_RUNNING_FOREGROUND_THRESHOLD_MS = 60000` in shell.ts, paired with the existing `DEFAULT_FOREGROUND_TIMEOUT_MS`. - Helper `buildLongRunningForegroundHint(elapsedMs)` exported so future surfaces (UI, telemetry) can render the same text without duplicating the threshold logic. - `Date.now()` bracketing around the spawn → `await resultPromise` block — mirrors what the background path already captures via `entry.startTime`. - Append happens inside the existing non-aborted result builder; zero changes to the cancel / timeout arms. Tests: 4 new cases — fires on long success, omits on short success, fires on long error completion, omits on aborted. Uses vi fake timers to drive wall-clock past the threshold without actually sleeping. * fix(core): tighten long-run hint suppression + boundary tests + post-truncation insertion Addresses 8 review threads on PR #3809 — 6 from /review bots, 2 from copilot — covering doc accuracy, code quality, behavioural gaps, and test coverage. **Behavioural fixes (real bugs)**: - **Suppress on external signal kills** (`result.signal != null` with `aborted: false`). `shellExecutionService` only sets `aborted` when the AbortSignal we passed was triggered, so SIGTERM from container shutdown / k8s eviction / OOM killer / sibling process-group reap falls through to the non-aborted branch. The advisory shouldn't fire there — the process didn't run to its conclusion, so "next time, background it" doesn't fit. New test pins this with `signal: 15` (SIGTERM), `aborted: false`. - **Append AFTER `truncateToolOutput`**. Previously the hint was appended inside the non-aborted result builder, which meant for long outputs it got wrapped in the "Truncated part of the output:" envelope — the LLM might read the advisory as part of the command's own output. New post-truncation insertion + test that pins ordering by mocking `truncateToolOutput` directly (real path needs `fs.writeFile` to actually succeed for the replacement branch to fire). - **Hint wording mode-aware**. The dialog mention dropped the unconditional "(footer pill + Enter)" specifics, which would mislead non-TTY users (`-p` headless / ACP / SDK consumers — no dialog or pill exists there). Now qualified as "in interactive mode the Background tasks dialog also has...". `/tasks` and the on-disk output file are mentioned without qualifier (work in any mode). **Code quality**: - **Threshold programmatically coupled to timeout**: `LONG_RUNNING_FOREGROUND_THRESHOLD_MS = Math.floor(DEFAULT_FOREGROUND_TIMEOUT_MS / 2)`. If the timeout is tuned later, the threshold tracks automatically. - **Docstring corrected**: removed the misleading "before it gets killed by the timeout" claim — the hint is on non-aborted path only, so timeout-killed commands never see it. The new docstring enumerates all suppression paths explicitly. - **Removed stale line-number reference**: comment said "mirrors the background path's `entry.startTime` capture (line ~781)" which goes stale on file edits. Now refers conceptually. **Test coverage gaps closed**: - **Off-by-one boundary**: 59_999ms → no hint. Pairs with the existing 60_000ms-exactly test (which fires) to pin the boundary tightly. A regression flipping `>=` to `>` would fail loudly. - **Timeout path explicit**: previous "aborted" test exercised user- cancel only. With `vi.useFakeTimers({ toFake: ['Date'] })`, `AbortSignal.timeout()` doesn't fake (it depends on the real timer subsystem), so `combinedSignal.aborted` stayed false. New test follows the pre-existing `should handle timeout vs user cancellation correctly` pattern: stubs `AbortSignal.timeout` + `.any` to return an already-aborted combined signal, then verifies "Command timed out after Nms" appears AND no advisory. * fix(core): per-invocation long-run threshold + debug-mode + test isolation Six suggestions from /review's third pass on PR #3809: **Real semantic fix**: - Long-run threshold now scales with the EFFECTIVE timeout, not the fixed default. A user who sets `timeout: 600_000` (10 min) gets the advisory at 5 min, not at 60s — respects the explicit timeout intent. Replaced the `LONG_RUNNING_FOREGROUND_THRESHOLD_MS` constant with a per-invocation `longRunThresholdFor(effectiveTimeout)` helper. **Debug-mode visibility**: - Debug mode previously snapshotted `returnDisplayMessage = llmContent` BEFORE the truncation + hint append, so debug-mode users saw the pre-hint content while the agent saw the advisory — agent suddenly suggesting `is_background: true` had no visible trigger in the TUI. Re-sync `returnDisplayMessage` after the hint append (debug-mode branch only) so the TUI mirrors what the agent sees. **Type-safety footgun**: - `if (typeof llmContent === 'string')` would silently drop the hint if `llmContent` ever becomes structured `Part[]`. Added an explicit `else` comment documenting the deliberate omission and the conditions under which to revisit (no string llmContent path exists today). **Style**: - Replaced the JSDoc `/** ... */` block on the (now-defunct) constant with a plain `//` comment block on the helper, matching the `DEFAULT_FOREGROUND_TIMEOUT_MS` / `OUTPUT_UPDATE_INTERVAL_MS` style. **Test hygiene**: - Wrapped both `vi.stubGlobal('AbortSignal', ...)` and `vi.spyOn(truncateToolOutput, ...)` in `try/finally` so failures during the test body don't leak the stub/spy into subsequent tests (would cause confusing cascading failures). - Dropped the internal-roadmap "Phase D part (a)" reference from the test comment — future maintainers don't have the context. **New test**: - `threshold scales with the user-supplied timeout (not the default)`: sets `timeout: 600_000`, advances 100s, verifies no hint. Pins the per-invocation coupling so a regression to a fixed constant would fail loudly here. * fix(core): tighten long-run hint suppression + boundary tests + post-truncation insertion (round 4) Six suggestions from /review's pai/glm-5-fp8 pass on PR #3809: **Behavioural / UX**: - **Hint now visible in non-debug TUI too.** Previously only debug mode mirrored the hint into `returnDisplay`; non-debug users saw the agent suggest `is_background: true` with no visible trigger. Now the hint is appended to `returnDisplayMessage` in both modes (full mirror in debug, terse-append in non-debug to preserve the output-or-status form). **Test coverage**: - **Debug-mode re-sync test added.** All other long-run hint tests run with `getDebugMode → false`; this one flips it to true and asserts the hint appears in `returnDisplay` too. Pins the re-sync so a regression that drops the debug branch would fail loudly. - **Threshold-scaling positive test added.** The negative case (`timeout: 600_000`, advance 100s, no hint) was already pinned; paired now with the positive case (advance 305s, hint fires) so a regression to a fixed 60s threshold is caught at both ends. **Style / consistency**: - **`result.signal === null` (was `== null`).** Strict equality to match the rest of the file. The `signal` field is typed `number | null` so loose equality has identical semantics, but the inconsistency was noise. **Doc clarity (timing semantics)**: - **Comment explains why elapsedMs is computed BEFORE truncation.** Two reviewers disagreed on the timing — one read it as before truncation (correct, slightly under-reports), the other as after (incorrect read). The intent is to report the COMMAND's runtime, not the tool call's total time. Truncation is post-processing, not part of "agent blocking time", so excluding it is the right semantic. Inline comment now spells this out so future readers don't have to infer. * fix(core): error-path hint surfacing + clock-resilient elapsed + threshold floor + observability Round 5 of PR #3809 review — 10 threads, mix of Critical and Suggestion: **Critical fixes**: 1. **Hint survives the error path** (`#OWbA`). When result.error is set, coreToolScheduler builds the model-facing functionResponse from `error.message` ONLY (not llmContent — see convertToFunctionResponse + the toolResult.error branch in scheduler:1648-1724). My hint was being silently dropped on long-command-failed cases. Now the hint is appended to error.message too so the advisory survives whichever branch the scheduler takes. 2. **Hint wording de-ambiguated** (`#OU6o`). "prefer re-running with is_background: true" was ambiguous — model could read it as "re-run THIS command in the background", which on stateful commands (DB migrations, deploys, git push) would cause double side effects. Reworded to "Next time you run a SIMILAR long-running process..." with an explicit parenthetical that warns against re-running the just-completed command. 3. **Debug observability** (`#OU6s`). Added `debugLogger.debug` at the hint decision point with elapsedMs / threshold / aborted / signal — when a user reports "my 65s command didn't get the hint" the suppression branch is now visible in DEBUG output. **Other behaviour fixes**: 4. **Threshold floor of 1000ms** (`#OU6r`). Pathological `timeout: 0` / `timeout: 1` would have given a 0-ms threshold, firing the hint on every invocation showing "ran for 0s". Floor at 1s makes that branch unreachable. 5. **`performance.now()` instead of `Date.now()`** (`#OU6v`). NTP corrections / VM clock drift between capture and read would silently make `elapsedMs` negative and skip the hint with no observable failure. Monotonic clock prevents that. 6. **Debug mode preserves truncation marker** (`#OU6w` / `#OWCq`). Previously `returnDisplayMessage = llmContent` after hint clobbered the "Output too long and was saved to: …" line appended during truncation. Switched to append-style re-sync in BOTH modes so prior content is preserved. **Test coverage gaps closed**: 7. **Non-debug returnDisplay test** (`#OWCo`). Pinned that the user TUI gets the hint in the default (non-debug) mode too. 8. **Test rename** (`#OWCl`). The "debug-mode TUI mirror" test passed in non-debug too after the recent refactor; split into two tests, one per branch. 9. **Error-path hint test**. Added a test that pins `result.error?.message` contains both the original error text AND the hint, covering the scheduler-routing-via-error.message path that was silently broken before fix #1. 10. **Test: faketimers also fakes `performance`**. Since we switched to `performance.now()`, `vi.useFakeTimers({ toFake: ['Date'] })` no longer covered the elapsed measurement; extended to `['Date', 'performance']` so the threshold tests can drive the wall-clock with `advanceTimersByTimeAsync`. #OU6t (else-comment for the type guard) was already addressed in the prior round — the explicit else-with-comment is in place; adding logging there would be noise. * test(core): cover the MIN_LONG_RUN_THRESHOLD_MS floor branch PR #3809 review: the new `Math.max(MIN_LONG_RUN_THRESHOLD_MS, ...)` floor in `longRunThresholdFor` was untested — only default-timeout and large-custom-timeout cases existed. A regression that strips the floor would let `timeout: 1` produce a 0ms threshold and fire a "ran for 0s" advisory on every invocation; the test suite would not catch it. New test: build with `timeout: 1`, advance 500ms (below the 1000ms floor), resolve with `aborted: false` to isolate the threshold logic from the abort path. Asserts no hint appears. A regression that removes the floor flips the assertion to fail. * fix(core): structured delimiter on error.message hint + tighten timeout floor comment Two of three threads from the latest /review pass on PR #3809 (the third — PR description / threshold scaling reconciliation — is fixed in the PR description update, not in code): - **`\n---\n` divider before hint in `error.message`** (`#Pt7C`). Downstream consumers of `error.message` (firePostToolUseFailureHook, telemetry grouping, SIEM alerting, hook-side error parsers) were receiving ~400 chars of advisory text mixed inline with the original error body — pattern-matching on error messages would absorb the advisory into the matched body. Added a `---` separator line so the boundary is unambiguous and split-able. - **Threshold-floor comment narrowed to `timeout: 1`** (`#Pu9o`). The comment said the floor guards `timeout: 0` / `timeout: 1`, but `validateToolParamValues` rejects `timeout <= 0` at validate time, so `timeout: 0` can't reach `longRunThresholdFor`. Updated the comment to mention only the actually-allowed pathological case (`timeout: 1` and any value `< 2` rounds to 0). Test updated to assert the `---` divider format with `toMatch`. * fix(core): capture executionStartTime AFTER spawn so PTY import isn't counted PR #3809 review: copilot caught that `executionStartTime` was captured BEFORE `await ShellExecutionService.execute(...)`, which meant the elapsed measurement included `getPty()` dynamic-import setup (~50-200ms on first call). The hint's "ran for Xs" reading was slightly inflated, and the comment claiming "spawn → settle" wasn't strictly accurate. Moved the capture immediately after the execute() call returns its { result, pid } handle. The pid being set by that point confirms the process has been spawned, so the subtraction is true post-spawn-to- settle. Comment updated to reflect the actual semantics. The displayed accuracy gain is small (50-200ms on a 60s+ threshold is <1%), but the comment claim now matches what the code measures. Tests unaffected — fakeTimers don't drive real dynamic imports, so the threshold tests behave identically. * fix(core): align long-run hint code/tests with ShellExecutionResult.error semantics Four copilot threads on PR #3809 — all rooted in the same observation: `ShellExecutionResult.error` is reserved for spawn/setup failures (per the field's doc comment in shellExecutionService.ts), NOT for non-zero exit codes. My existing code/tests conflated the two, making the error-path coverage less realistic and the inline comments inaccurate. **Test shape fixes**: - `appends the hint when a long-running foreground command exits with error` → `exits non-zero`. Changed `error: new Error('exit 1')` to `error: null` (the realistic shape for a non-zero exit without spawn failure). Added a comment explaining the field contract so future test authors don't repeat the conflation. - `hint survives the error path (appended to error.message)`: reframed the mock from `spawn ENOENT` (which would resolve in <1s in practice, making the long-elapsed scenario unrealistic) to `PTY initialization failed after 75s` — a slow-spawn-failure shape that COULD plausibly take 75s. Test still pins the same CODE PATH; comment now acknowledges the edge-case nature ("rare but real: PTY init dragging, remote-fs exec syscalls, security scanners interposing"). **Comment corrections**: - `returnDisplayMessage` build-order comment was misleading. It said "the hint is appended after both the truncation block and the returnDisplayMessage build" — but `returnDisplayMessage` is built BEFORE truncation. Replaced with a chronological enumeration (1. initial value, 2. truncation marker append, 3. hint append) that matches what the code actually does. - Error-path preservation comment now acknowledges the narrow applicability (spawn failures only, exit codes don't reach this branch). Code is unchanged — the path is still real, just rare. * test(core): pin empty-output success + background-no-hint paths Two defensive tests for the long-running foreground hint: - empty-output success at >=60s — exercises the returnDisplayMessage='' → hint append branch (write-only commands like `tar czf` / `cp -r` produce no stdout). Asserts the user- facing returnDisplay still surfaces the advisory even when the command produced nothing else to show. - background never includes the hint — the foreground hint logic lives in executeForeground only, so today this can't fail; the test guards against a future refactor hoisting the advisory into a shared post-execute path that would tag every background launch with a nonsensical "ran for 0s, consider is_background: true" suggestion.
…wenLM#3809) * feat(core): hint to background long-running foreground bash commands Phase D part (a) of Issue QwenLM#3634. When a foreground `shell` tool call runs ≥ 60 seconds and completes (succeeds or errors), append an advisory line to the LLM-facing tool result suggesting re-running with `is_background: true` next time. Why: today a foreground bash that takes minutes (build watcher, soak test, slow npm install, polling loop) blocks the agent indefinitely. The user is already paying for the wait; the agent's next turn could have started running in parallel under `is_background: true`. Sleep interception (QwenLM#3684) handled the egregious `sleep N` case at validate time; this handles the legitimate-but-long case at result time. Trade-offs: - Threshold = 60s. Half the existing 120s foreground timeout. Long enough that normal `npm install` / `pytest` runs don't trigger; short enough that the hint surfaces before the timeout hard-kills. - Advisory only — the command still runs to completion in the foreground for THIS invocation. The advice is for the agent's NEXT decision, not a corrective action on the current one. - Fires on success AND error completions. The advice is the same ("background it next time") in both cases. - Suppressed on aborted (timeout / user-cancel) — those paths already surface their own messaging and don't benefit from a "should have been background" reminder when the user / system already killed it. Implementation: - New constant `LONG_RUNNING_FOREGROUND_THRESHOLD_MS = 60000` in shell.ts, paired with the existing `DEFAULT_FOREGROUND_TIMEOUT_MS`. - Helper `buildLongRunningForegroundHint(elapsedMs)` exported so future surfaces (UI, telemetry) can render the same text without duplicating the threshold logic. - `Date.now()` bracketing around the spawn → `await resultPromise` block — mirrors what the background path already captures via `entry.startTime`. - Append happens inside the existing non-aborted result builder; zero changes to the cancel / timeout arms. Tests: 4 new cases — fires on long success, omits on short success, fires on long error completion, omits on aborted. Uses vi fake timers to drive wall-clock past the threshold without actually sleeping. * fix(core): tighten long-run hint suppression + boundary tests + post-truncation insertion Addresses 8 review threads on PR QwenLM#3809 — 6 from /review bots, 2 from copilot — covering doc accuracy, code quality, behavioural gaps, and test coverage. **Behavioural fixes (real bugs)**: - **Suppress on external signal kills** (`result.signal != null` with `aborted: false`). `shellExecutionService` only sets `aborted` when the AbortSignal we passed was triggered, so SIGTERM from container shutdown / k8s eviction / OOM killer / sibling process-group reap falls through to the non-aborted branch. The advisory shouldn't fire there — the process didn't run to its conclusion, so "next time, background it" doesn't fit. New test pins this with `signal: 15` (SIGTERM), `aborted: false`. - **Append AFTER `truncateToolOutput`**. Previously the hint was appended inside the non-aborted result builder, which meant for long outputs it got wrapped in the "Truncated part of the output:" envelope — the LLM might read the advisory as part of the command's own output. New post-truncation insertion + test that pins ordering by mocking `truncateToolOutput` directly (real path needs `fs.writeFile` to actually succeed for the replacement branch to fire). - **Hint wording mode-aware**. The dialog mention dropped the unconditional "(footer pill + Enter)" specifics, which would mislead non-TTY users (`-p` headless / ACP / SDK consumers — no dialog or pill exists there). Now qualified as "in interactive mode the Background tasks dialog also has...". `/tasks` and the on-disk output file are mentioned without qualifier (work in any mode). **Code quality**: - **Threshold programmatically coupled to timeout**: `LONG_RUNNING_FOREGROUND_THRESHOLD_MS = Math.floor(DEFAULT_FOREGROUND_TIMEOUT_MS / 2)`. If the timeout is tuned later, the threshold tracks automatically. - **Docstring corrected**: removed the misleading "before it gets killed by the timeout" claim — the hint is on non-aborted path only, so timeout-killed commands never see it. The new docstring enumerates all suppression paths explicitly. - **Removed stale line-number reference**: comment said "mirrors the background path's `entry.startTime` capture (line ~781)" which goes stale on file edits. Now refers conceptually. **Test coverage gaps closed**: - **Off-by-one boundary**: 59_999ms → no hint. Pairs with the existing 60_000ms-exactly test (which fires) to pin the boundary tightly. A regression flipping `>=` to `>` would fail loudly. - **Timeout path explicit**: previous "aborted" test exercised user- cancel only. With `vi.useFakeTimers({ toFake: ['Date'] })`, `AbortSignal.timeout()` doesn't fake (it depends on the real timer subsystem), so `combinedSignal.aborted` stayed false. New test follows the pre-existing `should handle timeout vs user cancellation correctly` pattern: stubs `AbortSignal.timeout` + `.any` to return an already-aborted combined signal, then verifies "Command timed out after Nms" appears AND no advisory. * fix(core): per-invocation long-run threshold + debug-mode + test isolation Six suggestions from /review's third pass on PR QwenLM#3809: **Real semantic fix**: - Long-run threshold now scales with the EFFECTIVE timeout, not the fixed default. A user who sets `timeout: 600_000` (10 min) gets the advisory at 5 min, not at 60s — respects the explicit timeout intent. Replaced the `LONG_RUNNING_FOREGROUND_THRESHOLD_MS` constant with a per-invocation `longRunThresholdFor(effectiveTimeout)` helper. **Debug-mode visibility**: - Debug mode previously snapshotted `returnDisplayMessage = llmContent` BEFORE the truncation + hint append, so debug-mode users saw the pre-hint content while the agent saw the advisory — agent suddenly suggesting `is_background: true` had no visible trigger in the TUI. Re-sync `returnDisplayMessage` after the hint append (debug-mode branch only) so the TUI mirrors what the agent sees. **Type-safety footgun**: - `if (typeof llmContent === 'string')` would silently drop the hint if `llmContent` ever becomes structured `Part[]`. Added an explicit `else` comment documenting the deliberate omission and the conditions under which to revisit (no string llmContent path exists today). **Style**: - Replaced the JSDoc `/** ... */` block on the (now-defunct) constant with a plain `//` comment block on the helper, matching the `DEFAULT_FOREGROUND_TIMEOUT_MS` / `OUTPUT_UPDATE_INTERVAL_MS` style. **Test hygiene**: - Wrapped both `vi.stubGlobal('AbortSignal', ...)` and `vi.spyOn(truncateToolOutput, ...)` in `try/finally` so failures during the test body don't leak the stub/spy into subsequent tests (would cause confusing cascading failures). - Dropped the internal-roadmap "Phase D part (a)" reference from the test comment — future maintainers don't have the context. **New test**: - `threshold scales with the user-supplied timeout (not the default)`: sets `timeout: 600_000`, advances 100s, verifies no hint. Pins the per-invocation coupling so a regression to a fixed constant would fail loudly here. * fix(core): tighten long-run hint suppression + boundary tests + post-truncation insertion (round 4) Six suggestions from /review's pai/glm-5-fp8 pass on PR QwenLM#3809: **Behavioural / UX**: - **Hint now visible in non-debug TUI too.** Previously only debug mode mirrored the hint into `returnDisplay`; non-debug users saw the agent suggest `is_background: true` with no visible trigger. Now the hint is appended to `returnDisplayMessage` in both modes (full mirror in debug, terse-append in non-debug to preserve the output-or-status form). **Test coverage**: - **Debug-mode re-sync test added.** All other long-run hint tests run with `getDebugMode → false`; this one flips it to true and asserts the hint appears in `returnDisplay` too. Pins the re-sync so a regression that drops the debug branch would fail loudly. - **Threshold-scaling positive test added.** The negative case (`timeout: 600_000`, advance 100s, no hint) was already pinned; paired now with the positive case (advance 305s, hint fires) so a regression to a fixed 60s threshold is caught at both ends. **Style / consistency**: - **`result.signal === null` (was `== null`).** Strict equality to match the rest of the file. The `signal` field is typed `number | null` so loose equality has identical semantics, but the inconsistency was noise. **Doc clarity (timing semantics)**: - **Comment explains why elapsedMs is computed BEFORE truncation.** Two reviewers disagreed on the timing — one read it as before truncation (correct, slightly under-reports), the other as after (incorrect read). The intent is to report the COMMAND's runtime, not the tool call's total time. Truncation is post-processing, not part of "agent blocking time", so excluding it is the right semantic. Inline comment now spells this out so future readers don't have to infer. * fix(core): error-path hint surfacing + clock-resilient elapsed + threshold floor + observability Round 5 of PR QwenLM#3809 review — 10 threads, mix of Critical and Suggestion: **Critical fixes**: 1. **Hint survives the error path** (`#OWbA`). When result.error is set, coreToolScheduler builds the model-facing functionResponse from `error.message` ONLY (not llmContent — see convertToFunctionResponse + the toolResult.error branch in scheduler:1648-1724). My hint was being silently dropped on long-command-failed cases. Now the hint is appended to error.message too so the advisory survives whichever branch the scheduler takes. 2. **Hint wording de-ambiguated** (`#OU6o`). "prefer re-running with is_background: true" was ambiguous — model could read it as "re-run THIS command in the background", which on stateful commands (DB migrations, deploys, git push) would cause double side effects. Reworded to "Next time you run a SIMILAR long-running process..." with an explicit parenthetical that warns against re-running the just-completed command. 3. **Debug observability** (`#OU6s`). Added `debugLogger.debug` at the hint decision point with elapsedMs / threshold / aborted / signal — when a user reports "my 65s command didn't get the hint" the suppression branch is now visible in DEBUG output. **Other behaviour fixes**: 4. **Threshold floor of 1000ms** (`#OU6r`). Pathological `timeout: 0` / `timeout: 1` would have given a 0-ms threshold, firing the hint on every invocation showing "ran for 0s". Floor at 1s makes that branch unreachable. 5. **`performance.now()` instead of `Date.now()`** (`#OU6v`). NTP corrections / VM clock drift between capture and read would silently make `elapsedMs` negative and skip the hint with no observable failure. Monotonic clock prevents that. 6. **Debug mode preserves truncation marker** (`#OU6w` / `#OWCq`). Previously `returnDisplayMessage = llmContent` after hint clobbered the "Output too long and was saved to: …" line appended during truncation. Switched to append-style re-sync in BOTH modes so prior content is preserved. **Test coverage gaps closed**: 7. **Non-debug returnDisplay test** (`#OWCo`). Pinned that the user TUI gets the hint in the default (non-debug) mode too. 8. **Test rename** (`#OWCl`). The "debug-mode TUI mirror" test passed in non-debug too after the recent refactor; split into two tests, one per branch. 9. **Error-path hint test**. Added a test that pins `result.error?.message` contains both the original error text AND the hint, covering the scheduler-routing-via-error.message path that was silently broken before fix QwenLM#1. 10. **Test: faketimers also fakes `performance`**. Since we switched to `performance.now()`, `vi.useFakeTimers({ toFake: ['Date'] })` no longer covered the elapsed measurement; extended to `['Date', 'performance']` so the threshold tests can drive the wall-clock with `advanceTimersByTimeAsync`. #OU6t (else-comment for the type guard) was already addressed in the prior round — the explicit else-with-comment is in place; adding logging there would be noise. * test(core): cover the MIN_LONG_RUN_THRESHOLD_MS floor branch PR QwenLM#3809 review: the new `Math.max(MIN_LONG_RUN_THRESHOLD_MS, ...)` floor in `longRunThresholdFor` was untested — only default-timeout and large-custom-timeout cases existed. A regression that strips the floor would let `timeout: 1` produce a 0ms threshold and fire a "ran for 0s" advisory on every invocation; the test suite would not catch it. New test: build with `timeout: 1`, advance 500ms (below the 1000ms floor), resolve with `aborted: false` to isolate the threshold logic from the abort path. Asserts no hint appears. A regression that removes the floor flips the assertion to fail. * fix(core): structured delimiter on error.message hint + tighten timeout floor comment Two of three threads from the latest /review pass on PR QwenLM#3809 (the third — PR description / threshold scaling reconciliation — is fixed in the PR description update, not in code): - **`\n---\n` divider before hint in `error.message`** (`#Pt7C`). Downstream consumers of `error.message` (firePostToolUseFailureHook, telemetry grouping, SIEM alerting, hook-side error parsers) were receiving ~400 chars of advisory text mixed inline with the original error body — pattern-matching on error messages would absorb the advisory into the matched body. Added a `---` separator line so the boundary is unambiguous and split-able. - **Threshold-floor comment narrowed to `timeout: 1`** (`#Pu9o`). The comment said the floor guards `timeout: 0` / `timeout: 1`, but `validateToolParamValues` rejects `timeout <= 0` at validate time, so `timeout: 0` can't reach `longRunThresholdFor`. Updated the comment to mention only the actually-allowed pathological case (`timeout: 1` and any value `< 2` rounds to 0). Test updated to assert the `---` divider format with `toMatch`. * fix(core): capture executionStartTime AFTER spawn so PTY import isn't counted PR QwenLM#3809 review: copilot caught that `executionStartTime` was captured BEFORE `await ShellExecutionService.execute(...)`, which meant the elapsed measurement included `getPty()` dynamic-import setup (~50-200ms on first call). The hint's "ran for Xs" reading was slightly inflated, and the comment claiming "spawn → settle" wasn't strictly accurate. Moved the capture immediately after the execute() call returns its { result, pid } handle. The pid being set by that point confirms the process has been spawned, so the subtraction is true post-spawn-to- settle. Comment updated to reflect the actual semantics. The displayed accuracy gain is small (50-200ms on a 60s+ threshold is <1%), but the comment claim now matches what the code measures. Tests unaffected — fakeTimers don't drive real dynamic imports, so the threshold tests behave identically. * fix(core): align long-run hint code/tests with ShellExecutionResult.error semantics Four copilot threads on PR QwenLM#3809 — all rooted in the same observation: `ShellExecutionResult.error` is reserved for spawn/setup failures (per the field's doc comment in shellExecutionService.ts), NOT for non-zero exit codes. My existing code/tests conflated the two, making the error-path coverage less realistic and the inline comments inaccurate. **Test shape fixes**: - `appends the hint when a long-running foreground command exits with error` → `exits non-zero`. Changed `error: new Error('exit 1')` to `error: null` (the realistic shape for a non-zero exit without spawn failure). Added a comment explaining the field contract so future test authors don't repeat the conflation. - `hint survives the error path (appended to error.message)`: reframed the mock from `spawn ENOENT` (which would resolve in <1s in practice, making the long-elapsed scenario unrealistic) to `PTY initialization failed after 75s` — a slow-spawn-failure shape that COULD plausibly take 75s. Test still pins the same CODE PATH; comment now acknowledges the edge-case nature ("rare but real: PTY init dragging, remote-fs exec syscalls, security scanners interposing"). **Comment corrections**: - `returnDisplayMessage` build-order comment was misleading. It said "the hint is appended after both the truncation block and the returnDisplayMessage build" — but `returnDisplayMessage` is built BEFORE truncation. Replaced with a chronological enumeration (1. initial value, 2. truncation marker append, 3. hint append) that matches what the code actually does. - Error-path preservation comment now acknowledges the narrow applicability (spawn failures only, exit codes don't reach this branch). Code is unchanged — the path is still real, just rare. * test(core): pin empty-output success + background-no-hint paths Two defensive tests for the long-running foreground hint: - empty-output success at >=60s — exercises the returnDisplayMessage='' → hint append branch (write-only commands like `tar czf` / `cp -r` produce no stdout). Asserts the user- facing returnDisplay still surfaces the advisory even when the command produced nothing else to show. - background never includes the hint — the foreground hint logic lives in executeForeground only, so today this can't fail; the test guards against a future refactor hoisting the advisory into a shared post-execute path that would tag every background launch with a nonsensical "ran for 0s, consider is_background: true" suggestion.
Summary
Phase D part (a) of Issue #3634. When a foreground
shelltool call runs past a duration threshold and completes (succeeds or errors), the LLM-facing tool result gets an advisory line suggestingis_background: truefor similar long-running commands next time. The threshold is half the effective timeout (per-invocation, not a fixed constant), with a 1000ms floor — so a default-timeout (120s) call gets the advisory at 60s, an explicittimeout: 600_000call gets it at 300s, and pathological tiny timeouts (timeout: 1) don't surface a "ran for 0s" advisory. The advisory itself explicitly warns against re-running the just-completed command (matters for stateful operations like deploys, migrations,git push).Why this matters: today a foreground bash that takes minutes (build watcher, soak test, slow
npm install, polling loop) blocks the agent indefinitely. The user is already paying for the wait; the agent's next turn could have been running in parallel underis_background: true. Sleep interception (#3684) handled the egregioussleep Ncase at validate time; this handles the legitimate-but-long case at result time.Size: ~5 commits, 2 files (
shell.ts+shell.test.ts), 91 tests after additions. Pure additive — no existing behaviour changed.Before / After
Before (foreground command that took 90 seconds):
After:
Design notes
DEFAULT_FOREGROUND_TIMEOUT_MS = 120sthe threshold is 60s; for an explicittimeout: 600_000call it's 5 min — respects the user's signalled expectation that the command will take long. The 1000ms floor guardstimeout: 1(smallest non-rejected pathological value;timeout <= 0is rejected at validate time) so the advisory doesn't fire showing "ran for 0s".git push), retrying would cause double side effects. This guard came out of review (gpt-5.5 flagged the original "prefer re-running" wording as ambiguous).error.message.coreToolSchedulerbuilds the model-facing functionResponse fromerror.message(NOT fromllmContent) whentoolResult.erroris set. So the hint is appended to BOTHllmContentANDerror.message— with a\n---\ndivider inerror.messageso downstream consumers (firePostToolUseFailureHook, telemetry grouping, SIEM, hook parsers) have an unambiguous boundary they can split on instead of getting ~400 chars of advisory mixed inline.result.signal !== nullwithaborted: false, e.g. SIGTERM from container shutdown / k8s eviction / OOM killer). Their own messaging is enough; the process didn't run to its conclusion, so "next time, background it" doesn't apply.performance.now()instead ofDate.now()for the elapsed bracket — monotonic high-res clock, NTP corrections / VM clock drift between capture and read can't makeelapsedMsgo negative and silently skip the hint.truncateToolOutputso the hint isn't wrapped in the "Truncated part of the output:" envelope (which the LLM might misread as part of the command's own output).is_background: truewith no visible trigger). Append-style re-sync preserves the truncation marker line if both fire together.debugLogger.debugat the decision point logselapsed=Nms threshold=Mms aborted=X signal=Y → fire/suppressso support reports like "my 65s command didn't get the hint" can be diagnosed via DEBUG output.buildLongRunningForegroundHint(elapsedMs)is exported so a future UI surface or telemetry consumer can render the same text without duplicating the threshold-driven logic. Today only the LLM result uses it.Why this is foreground-only
Background commands return immediately with a shell ID — they're never blocking the agent for the threshold duration by construction. The hint is meaningful only on the path that actually blocks.
Test plan
vitest run packages/core/src/tools/shell.test.ts: 91 / 91 pass (covers threshold scaling positive + negative, threshold floor, debug + non-debug returnDisplay, error.message hint surfacing, aborted / timeout / external-signal suppression, off-by-one boundary, post-truncation insertion)tsc --build packages/cli(CI-equivalent full mode): cleannpm run build --workspace=@qwen-code/qwen-code-core: cleansleep 65 && echo done)→ verify the advisory line appears in the LLM-facing tool result; run a 5-second command → verify no advisory; run withtimeout: 600_000for 100s → verify no advisory (scaled threshold).中文版
Issue #3634 Phase D 第 (a) 部分。前台
shell工具调用运行超过阈值并完成(成功或出错)后,给 LLM 看的工具结果末尾加一行建议,提示下次用is_background: true。阈值是有效 timeout 的一半(按调用计算,不是固定常量),下限 1000ms — 默认 timeout(120s)的调用 60s 触发;显式timeout: 600_000调用 5 分钟触发;病态小 timeout(timeout: 1)不会"跑了 0 秒"误触发。建议文案显式警告不要重跑刚完成的命令(DB 迁移、部署、git push等有副作用操作要紧)。为什么:今天前台 bash 跑几分钟(build watcher、长测、慢
npm install、轮询循环)会无限阻塞 agent。用户已经在等了;agent 的下一轮其实可以在后台并行跑。Sleep 拦截(#3684)处理了恶劣的sleep N在 validate 时拒绝;这个处理合理但慢的情况在 result 时建议。体量:~5 commits(含多轮 review 修补),2 个文件,最终 91 测试。
设计要点:
DEFAULT_FOREGROUND_TIMEOUT_MS=120s→ 阈值 60s;显式timeout: 600_000→ 阈值 5 分钟,尊重用户对长命令的预期。下限 1000ms 守timeout: 1(最小未被 validate 拒绝的病态值;timeout <= 0在 validate 阶段已拒),避免"跑 0s"虚警。error.message单独追加(coreToolScheduler错误分支用error.message不用 llmContent),并加\n---\n分隔符让下游消费者(hook / 遥测 / SIEM / parser)有明确边界。result.aborted或result.signal !== null都跳过 — 命令没自己跑完)。performance.now()不用Date.now(),monotonic 高精度,NTP 校时 / VM 漂移不会让 elapsed 变负静默丢 hint。truncateToolOutput的"Truncated part of the output:" 信壳包裹,避免 LLM 误读为命令输出)。debugLogger.debug在决策点输出elapsed/threshold/aborted/signal/fire-or-suppress,线上"为啥没出现 hint"问题可调试。buildLongRunningForegroundHint— 未来 UI / 遥测复用同样文本。Related