Skip to content

fix(core): truncate model-facing tool output#4520

Open
Jerry2003826 wants to merge 21 commits into
QwenLM:mainfrom
Jerry2003826:codex/fix-tool-output-history-truncation
Open

fix(core): truncate model-facing tool output#4520
Jerry2003826 wants to merge 21 commits into
QwenLM:mainfrom
Jerry2003826:codex/fix-tool-output-history-truncation

Conversation

@Jerry2003826

@Jerry2003826 Jerry2003826 commented May 25, 2026

Copy link
Copy Markdown
Contributor

What this PR does

Moves model-facing string tool-output truncation from the shell tool into CoreToolScheduler, so any tool that returns string llmContent can be bounded before the result is recorded into conversation history.

The PR intentionally keeps the scope narrow:

  • reuses the existing truncateToolOutput() helper and temp-file behavior;
  • removes shell-local truncation so scheduler is the single string-output path;
  • appends PostToolUse hook context after raw output truncation so hook-injected metadata is not split by the head/tail truncator;
  • leaves non-string Part[] outputs on the existing path.

Why it's needed

Unbounded tool results can overflow context tokens and make the session unable to continue. Shell output already had a truncation helper, but other string-returning tools could still place large outputs directly into model history.

Reviewer Test Plan

How to verify

npm run test --workspace=packages/core -- src/core/coreToolScheduler.test.ts -t "model-facing output truncation"
npm run test --workspace=packages/core -- src/core/coreToolScheduler.test.ts src/tools/shell.test.ts -t "model-facing output truncation|appends the hint after command output is assembled"
npm run test --workspace=packages/core -- src/core/coreToolScheduler.test.ts src/tools/shell.test.ts
npx prettier --check packages/core/src/core/coreToolScheduler.ts packages/core/src/core/coreToolScheduler.test.ts packages/core/src/tools/shell.ts packages/core/src/tools/shell.test.ts
npx eslint packages/core/src/core/coreToolScheduler.ts packages/core/src/core/coreToolScheduler.test.ts packages/core/src/tools/shell.ts packages/core/src/tools/shell.test.ts
npm run typecheck --workspace=packages/core

Evidence (Before & After)

Before: string outputs from non-shell tools could enter model history without passing through the existing tool-output truncation helper.

After: CoreToolScheduler invokes truncateToolOutput() for string llmContent before converting the result into a function response. Tests cover large string output truncation, PostToolUse context placement after raw output truncation, shell long-run hint behavior after removing shell-local truncation, and non-string Part[] passthrough.

This is model-facing history behavior, so TUI screenshots are N/A.

Tested on

OS Status
macOS GitHub Actions passed
Windows Tested locally + GitHub Actions passed
Linux GitHub Actions passed

Environment (optional)

Local Windows/PowerShell checkout with repository npm workspaces. No tmux/TUI capture is included because the behavior is core scheduler/history logic rather than a visible TUI state.

Risk & Scope

  • Main risk or tradeoff: the scheduler only truncates string model-facing content; structured Part[] output remains unchanged.
  • Out of scope: new telemetry events, temp-file permission hardening, ToolResult API changes, split-budget hook-context truncation, UI changes, and unrelated refactors.
  • Breaking changes / migration notes: none expected.

Linked Issues

Fixes #4049

@Jerry2003826 Jerry2003826 marked this pull request as ready for review May 25, 2026 22:38

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] prompt_id not passed to truncateToolOutputToolOutputTruncatedEvent is constructed with prompt_id: '' (in truncation.ts:141), but scheduledCall.request.prompt_id is available here. Truncation telemetry events cannot be correlated with specific prompts or turns. Fix: add a promptId parameter to truncateToolOutput and pass it from the scheduler.

— qwen3.7-max via Qwen Code /review

}
}

if (typeof content === 'string') {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical] truncateToolOutput failure converts successful tool calls into errors

The await truncateToolOutput(...) call has no local try/catch. It sits inside an outer try whose catch treats any exception as a tool execution failure. If config.getTruncateToolOutputThreshold(), crypto.randomBytes, or logToolOutputTruncated throws, a tool that executed successfully gets reported as failed to the model — with a cryptic error that doesn't mention truncation.

Truncation is a presentation-layer concern and must never cause a successful tool call to appear as an error.

Suggested change
if (typeof content === 'string') {
if (typeof content === 'string') {
try {
const truncated = await truncateToolOutput(
this.config,
toolName,
content,
);
content = truncated.content;
contentLength = content.length;
} catch (truncErr) {
// Truncation is best-effort. If it fails, pass the original
// content through unchanged rather than failing the tool call.
console.warn(
`[CoreToolScheduler] truncation failed for ${toolName}:`,
truncErr instanceof Error ? truncErr.message : truncErr,
);
}
}

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e3075df. The scheduler now wraps truncation in a local try/catch, logs a warning on truncation failure, and preserves the original successful tool output. Added a regression test that forces threshold lookup to throw and asserts the tool call still completes as success with the original output.

if (toolResult.error === undefined) {
let content = toolResult.llmContent;
const contentLength =
let contentLength =

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] contentLength stale after hook/reminder appends when truncation doesn't fire

The constlet change only updates contentLength inside the truncation block. But content is also mutated at two earlier points (PostToolUse hook additionalContext and conditional rules/skill reminders) that do NOT update contentLength. When hooks/reminders append text but the total stays under threshold, contentLength underreports the actual content size.

Fix: add a single contentLength recalculation after ALL content mutations, just before convertToFunctionResponse:

contentLength = typeof content === 'string' ? content.length : undefined;

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e3075df. contentLength is now recomputed after post-hook context, rule/skill reminders, and truncation handling. Added a regression test where PostToolUse appends context without truncation and asserts contentLength matches the final model-facing content.

toolName,
content,
);
content = truncated.content;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] truncated.outputFile discarded — user-facing returnDisplay not updated

The scheduler only reads truncated.content and ignores truncated.outputFile. For tools that don't self-truncate (the exact target audience of this safety net), the user's TUI display (returnDisplay) never mentions that truncation occurred or where the full output was saved. Compare with shell.ts which appends "Output too long and was saved to: <path>" to the display message.

Consider surfacing the truncation in returnDisplay:

if (truncated.outputFile) {
  toolResult.returnDisplay =
    (typeof toolResult.returnDisplay === 'string' ? toolResult.returnDisplay + '\n' : '') +
    `Output too long and was saved to: ${truncated.outputFile}`;
}

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e3075df. When truncation successfully saves the full output and returnDisplay is a string, the scheduler appends Output too long and was saved to: <path>. Added regression coverage for the user-facing display path.

config: mockConfig,
onAllToolCallsComplete,
onToolCallsUpdate,
getPreferredEditor: () => 'vscode',

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] No negative test for within-threshold passthrough

The single test only verifies that oversized output IS truncated. A regression that always truncates (e.g., an off-by-one in the threshold comparison, or truncateToolOutput being called with wrong arguments) would pass the existing test undetected.

Add a second test with output under the threshold (e.g., 50 chars with threshold=100) asserting the output equals the original content exactly and does not contain [CONTENT TRUNCATED].

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in e3075df. The new passthrough regression test uses output below the threshold and asserts the model-facing output equals the original string exactly and does not contain [CONTENT TRUNCATED].

@pomelo-nwu pomelo-nwu left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Jerry2003826, thank you for your continued contributions — 9 PRs in a short time is impressive! 🎉

As we review your changes, we'd like to ask you to update each PR to follow the latest PR template on the main branch. The most important section is the Reviewer Test Plan, which significantly accelerates the review and merge process.

Specifically, for each PR please include:

  • How to verify — clear reproduction steps so a reviewer can confirm the fix/feature
  • Evidence (Before & After) — use the tmux-real-user-testing skill (or manual tmux capture) to show before/after screenshots of the TUI behavior. Side-by-side evidence makes it much faster for maintainers to validate and merge
  • Tested on — fill in the OS table (macOS / Windows / Linux)

PRs with a complete Reviewer Test Plan are prioritized for review — without it, review may be delayed.

You can see the full template at: .github/pull_request_template.md

Thanks again for your effort — looking forward to getting these merged! 🚀

中文说明

你好 @Jerry2003826,感谢你的持续贡献——短时间内提交了 9 个 PR,非常高效!🎉

在 review 过程中,我们希望你能按照 main 分支上最新的 PR 模版更新每个 PR 的描述。其中最关键的部分是 Reviewer Test Plan,它能显著加速审核和合并流程。

具体来说,请为每个 PR 补充:

  • How to verify — 清晰的复现步骤,让 reviewer 能确认修复/功能的效果
  • Evidence (Before & After) — 使用 tmux-real-user-testing skill(或手动 tmux 截取)展示修改前后的 TUI 截图对比,前后对比能让维护者更快地验证和合并
  • Tested on — 填写操作系统测试表格(macOS / Windows / Linux)

有完整 Reviewer Test Plan 的 PR 会被优先审核——缺少该部分可能会导致审核延迟。

完整模版见:.github/pull_request_template.md

再次感谢你的付出,期待尽快把这些 PR 合并!🚀

— Qwen Code

@Jerry2003826 Jerry2003826 force-pushed the codex/fix-tool-output-history-truncation branch from 61c7865 to e3075df Compare May 26, 2026 01:50
@Jerry2003826

Copy link
Copy Markdown
Contributor Author

Updated in e3075df. This addresses the remaining review items: scheduler now passes scheduledCall.request.prompt_id into truncateToolOutput, truncation failures are best-effort and preserve successful tool calls, contentLength is recomputed after final content mutations, string returnDisplay includes the saved full-output path, and small-output passthrough is covered. I also updated the PR description to the latest template with Reviewer Test Plan / Evidence / Tested on sections.

Validation run locally on Windows:

npm run test --workspace=@qwen-code/qwen-code-core -- src/utils/truncation.test.ts -t "caller prompt id"
npm run test --workspace=@qwen-code/qwen-code-core -- src/core/coreToolScheduler.test.ts -t "tool output truncation"
npm run test --workspace=@qwen-code/qwen-code-core -- src/utils/truncation.test.ts
npm run test --workspace=@qwen-code/qwen-code-core -- src/core/coreToolScheduler.test.ts
npx prettier --check packages/core/src/core/coreToolScheduler.ts packages/core/src/core/coreToolScheduler.test.ts packages/core/src/utils/truncation.ts packages/core/src/utils/truncation.test.ts
npx eslint packages/core/src/core/coreToolScheduler.ts packages/core/src/core/coreToolScheduler.test.ts packages/core/src/utils/truncation.ts packages/core/src/utils/truncation.test.ts
npm run lint --workspace=@qwen-code/qwen-code-core
npm run typecheck --workspace=@qwen-code/qwen-code-core
git diff --check

@wenshao wenshao added the type/bug Something isn't working as expected label May 26, 2026
@Jerry2003826

Copy link
Copy Markdown
Contributor Author

The prompt_id telemetry correlation issue is already fixed in e3075df.

Current code passes scheduledCall.request.prompt_id into truncateToolOutput(...), and truncateToolOutput forwards that value into ToolOutputTruncatedEvent. I also added truncation.test.ts coverage: should log truncation telemetry with the caller prompt id.

Current PR checks are green across lint, CodeQL, macOS, Ubuntu, and Windows.

@Jerry2003826

Copy link
Copy Markdown
Contributor Author

Updated the PR description to match the latest reviewer template and removed the stale/garbled details block. The code is unchanged from e3075df; current CI is green across lint, CodeQL, macOS, Ubuntu, and Windows. Ready for re-review when you have time.


if (typeof content === 'string') {
try {
const truncated = await truncateToolOutput(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] Truncation runs AFTER hook context and <system-reminder> blocks are appended

At this point, content includes the raw tool output PLUS PostToolUse hook additionalContext, conditional-rule reminders, and skill-activation notices. The head/tail truncation algorithm is structure-blind — if the combined payload exceeds the threshold, it will bisect <system-reminder> XML envelopes, producing malformed markup the model then tries to interpret. Additionally, skill activation side effects (refreshSkills, notifyChangeListeners) have already fired, but the reminder telling the model about the new skill may get truncated away.

Consider moving the truncation block to run immediately after content = toolResult.llmContent (before PostToolUse hooks, conditional rules, and skill activation), so only raw tool output is measured and truncated. Then let hooks and reminders append onto the already-truncated content — they always reach the model intact.

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in ad9ee15. I moved the truncation block to run before PostToolUse additionalContext, conditional-rule reminders, and skill-activation reminders are appended. That keeps the head/tail truncator focused on the raw tool output and lets hook/reminder metadata reach the model intact after truncation.

completedCalls[0].response.responseParts,
);

expect(output).toContain('[CONTENT TRUNCATED]');

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] contentLength not asserted when truncation fires

This test triggers truncation and checks the model-facing output, but never reads completedCalls[0].response.contentLength. The new post-truncation contentLength reassignment (line ~2983 in coreToolScheduler.ts — the reason const was changed to let) has zero behavioral verification in the truncation path. Test 4 checks contentLength but only in the no-truncation case.

Consider adding:

expect(completedCalls[0].response.contentLength).toBe(output.length);
expect(completedCalls[0].response.contentLength).toBeLessThan(5000);

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in ad9ee15. The large-output truncation regression now asserts completedCalls[0].response.contentLength === output.length, so the post-truncation recomputation is covered when truncation actually fires.

expect(completedCalls[0].response.resultDisplay).toEqual(
expect.stringContaining('Output too long and was saved to:'),
);
});

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] No test for hook context + truncation interaction

No test exercises both PostToolUse hook additionalContext AND truncation firing simultaneously. Test 4 uses hooks with threshold: 1000 and a 12-char output (no truncation fires). Test 1 triggers truncation without hooks. The integration boundary — where hook context pushes combined content over the threshold — is untested.

This matters especially if the truncation-ordering suggestion above is not adopted: hook-injected content (e.g., policy reminders) could silently disappear when the combined payload triggers truncation.

Consider adding a test with hooks enabled, output + hook context exceeding the threshold, and asserting the truncation marker is present, contentLength reflects the truncated length, and the full-output file contains the hook context.

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a regression test in ad9ee15: preserves hook context outside truncated tool output. It enables PostToolUse hooks, forces truncation, asserts the truncation marker/contentLength, and verifies the full hook context is appended after the truncated tool output instead of being bisected. Since the implementation now truncates raw tool output before appending hook/reminder metadata, the saved full-output file intentionally captures the raw tool output rather than the later hook context.

Comment thread packages/core/src/utils/truncation.ts Outdated
config: Config,
toolName: string,
content: string,
promptId = '',

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] promptId = '' default masks telemetry gaps in existing callers

shell.ts:2062 and mcp-tool.ts:438 call truncateToolOutput with 3 arguments (no promptId), so their telemetry events carry prompt_id: ''. This defeats the purpose of adding promptId for telemetry correlation — two of the three callers emit truncation events that cannot be traced to a specific prompt.

Consider either making promptId required (so the compiler enforces it), or — if the plan is to consolidate truncation at the scheduler level — removing the tool-level truncateToolOutput calls from shell.ts and mcp-tool.ts entirely. That would also eliminate the double-truncation edge case where scheduler-level truncation runs on already-truncated content after hooks push it back over the threshold.

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documented the fallback in ad9ee15. The scheduler-level path now passes scheduledCall.request.prompt_id, which covers the model-facing history truncation added by this PR. I kept promptId optional for the existing direct tool-level callers because shell/MCP execute() do not receive scheduler request context today, and removing those direct truncation guards would widen this PR by changing behavior for direct tool invocations outside the scheduler. The JSDoc now calls out that scheduler callers should pass the prompt id while direct tool implementations may omit it when they do not have that context.

@Jerry2003826 Jerry2003826 force-pushed the codex/fix-tool-output-history-truncation branch from e3075df to ad9ee15 Compare May 26, 2026 08:53
@Jerry2003826

Copy link
Copy Markdown
Contributor Author

Updated in ad9ee15 for the latest review pass.

Changes made:

  • Moved scheduler-level truncation before PostToolUse additionalContext, conditional-rule reminders, and skill-activation reminders are appended, so structure-blind head/tail truncation only applies to raw tool output.
  • Added coverage that hook context remains intact after truncation and contentLength reflects the final model-facing output.
  • Added the missing contentLength assertion to the truncation regression.
  • Documented why promptId remains optional for direct shell/MCP tool callers that do not receive scheduler prompt context, while the scheduler path passes scheduledCall.request.prompt_id.

Validation run locally on Windows:

npm run test --workspace=packages/core -- src/core/coreToolScheduler.test.ts -t "preserves hook context outside truncated tool output|truncates large model-facing tool output|reports contentLength after hook context"
npm run test --workspace=packages/core -- src/core/coreToolScheduler.test.ts
npx prettier --check packages/core/src/core/coreToolScheduler.ts packages/core/src/core/coreToolScheduler.test.ts packages/core/src/utils/truncation.ts
npx eslint packages/core/src/core/coreToolScheduler.ts packages/core/src/core/coreToolScheduler.test.ts packages/core/src/utils/truncation.ts
npm run typecheck --workspace=packages/core
npm run lint --workspace=packages/core
git diff --check

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] fs.writeFile in truncation.ts:89 lacks { mode: 0o600 } — the saved tool output files inherit the process umask (typically 0o644 on Linux), making them world-readable on shared systems. Tool output frequently contains secrets (env vars, API keys). This PR extends truncation from shell/MCP-only to ALL tools, widening the exposure. Other sensitive-file writers in this codebase use { mode: 0o600 } (see file-token-storage.ts, oauth-token-storage.ts, sessionService.ts). Consider adding { mode: 0o600 } to the writeFile call and { mode: 0o700 } to the mkdir call. Pre-existing issue, but the wider surface area makes it worth addressing in a follow-up.

— qwen3.7-max via Qwen Code /review

}
}

if (typeof content === 'string') {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical] Double truncation for shell tool outputs

shell.ts:2062 already calls truncateToolOutput on llmContent before returning to the scheduler. The scheduler then calls it a second time here. truncateAndSaveToFile produces output of approximately threshold + ~330 chars (the header text) and truncateLines + ~8 lines (header + separator), which exceeds both the char threshold and line limit on the second pass — triggering actual re-truncation with the default config (25K chars, 1000 lines).

Consequences:

  • Nested truncation headers: the model sees two [CONTENT TRUNCATED] envelopes
  • Two temp files: file B (from the second pass) contains the already-truncated content, not the original. File A (from the first pass, containing the real full output) is orphaned — the model-facing message points to B, not A
  • Duplicate returnDisplay mutation: shell.ts:2072 already appends "Output too long and was saved to: <A>", and the scheduler appends a second "Output too long and was saved to: <B>"
  • Duplicate telemetry: logToolOutputTruncated fires twice per tool call

Suggested fix: remove the tool-level truncateToolOutput calls from shell.ts (lines 2061-2076) and let the scheduler be the single truncation point. This aligns with the PR's stated goal of centralizing truncation and eliminates the dual-responsibility. mcp-tool.ts is safe because it returns Part[] (non-string), which the typeof guard skips.

Suggested change
if (typeof content === 'string') {
if (typeof content === 'string' && !content.startsWith('Tool output was too large')) {
try {
const truncated = await truncateToolOutput(

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 8ef3cb3. Shell no longer calls truncateToolOutput directly, so the scheduler is the single model-facing string-output truncation point. The scheduler also skips content that already starts with TOOL_OUTPUT_TRUNCATED_PREFIX, which protects existing direct truncation callers from being wrapped again.

if (toolResult.error === undefined) {
let content = toolResult.llmContent;
const contentLength =
let contentLength =

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] Dead initial contentLength assignment

This let contentLength is always overwritten by the identical expression at line ~2984 before it is ever read. The only early-return path between the two assignments (shouldStop) never reads contentLength.

The landmine: if a future refactor removes the recalculation at line ~2984 (thinking "contentLength is already set at the top"), contentLength silently regresses to the raw tool output size, and downstream consumers see a value that excludes truncation, hook context, and reminders.

Suggested change
let contentLength =
let contentLength: number | undefined;

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 8ef3cb3. Removed the top-of-block assignment and now declares contentLength only at the final post-mutation calculation before convertToFunctionResponse.

});
});

describe('CoreToolScheduler tool output truncation', () => {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] No test for the double-truncation scenario

All tests here use LargeOutputTool, which returns raw un-truncated content. The most dangerous production path — shell tool output exceeding the threshold — is precisely the scenario where truncateToolOutput runs twice (once in shell.ts, once in the scheduler), and it has zero coverage.

Consider adding a test where the tool's execute() returns content that already includes the "Tool output was too large" preamble (simulating what shell.ts produces after its own truncation), and verify the scheduler does not produce a second truncation header or a second file write.

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 8ef3cb3. The new regression returns already-truncated content, asserts the model-facing output is unchanged, keeps the existing returnDisplay, and verifies fs.writeFile is not called a second time.

@Jerry2003826 Jerry2003826 force-pushed the codex/fix-tool-output-history-truncation branch from ad9ee15 to 8ef3cb3 Compare May 26, 2026 10:01
@Jerry2003826

Copy link
Copy Markdown
Contributor Author

Updated in 8ef3cb3 for the latest review pass.

Changes made:

  • Removed shell-level truncateToolOutput so the scheduler is the single string-output truncation point.
  • Added an already-truncated guard in the scheduler to avoid nested truncation headers, duplicate temp files, duplicate display updates, and duplicate telemetry.
  • Tightened saved full-output temp permissions with directory mode 0o700 and file mode 0o600.
  • Removed the dead initial contentLength assignment; contentLength is now calculated once after final content mutations.
  • Updated the shell long-run hint test now that truncation is scheduler-owned.

Validation run locally on Windows:

npm run test --workspace=packages/core -- src/core/coreToolScheduler.test.ts src/utils/truncation.test.ts -t "already-truncated|line limit exceeded|binding constraint|very long lines|correct file path|path traversal"
npm run test --workspace=packages/core -- src/core/coreToolScheduler.test.ts
npm run test --workspace=packages/core -- src/utils/truncation.test.ts
npm run test --workspace=packages/core -- src/tools/shell.test.ts -t "appends the hint after command output is assembled"
npm run test --workspace=packages/core -- src/tools/shell.test.ts
npx prettier --check packages/core/src/core/coreToolScheduler.ts packages/core/src/core/coreToolScheduler.test.ts packages/core/src/utils/truncation.ts packages/core/src/utils/truncation.test.ts packages/core/src/tools/shell.ts packages/core/src/tools/shell.test.ts
npx eslint packages/core/src/core/coreToolScheduler.ts packages/core/src/core/coreToolScheduler.test.ts packages/core/src/utils/truncation.ts packages/core/src/utils/truncation.test.ts packages/core/src/tools/shell.ts packages/core/src/tools/shell.test.ts
npm run typecheck --workspace=packages/core
npm run lint --workspace=packages/core
git diff --check

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] Stale comment in shell.ts still references truncation I/O (~line 1980)

The comment block near shell.ts:~1980 reads "intentionally BEFORE the post-processing block below (truncation I/O, output-file write)" and "Truncation time is bounded by the temp-dir backend and isn't representative of the command's actual wait." Both references describe behavior removed by this PR — truncation I/O and output-file writes no longer happen in shell.ts. The PR already updated surrounding comments (returnDisplayMessage build order, long-run advisory) but missed this one.

Future readers will waste time looking for a truncation block that no longer exists in shell.ts, or may be misled about what the timing measurement accounts for.

//   - Wall-clock duration >= threshold. Measured spawn -> resultPromise
//     settle, intentionally BEFORE post-processing (attribution,
//     returnDisplay build, hint append). The hint reports how long
//     the COMMAND blocked the agent, not how long the tool call
//     spent including post-processing.

(This finding is outside the PR diff, so it's posted in the review body rather than as an inline comment.)

— qwen3.7-max via Qwen Code /review

@Jerry2003826 Jerry2003826 force-pushed the codex/fix-tool-output-history-truncation branch from 8ef3cb3 to a3bdc99 Compare May 26, 2026 10:41

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] fs.writeFile / fs.mkdir failure swallows the underlying error (truncation.ts:104-108)

The catch (_error) block inside truncateAndSaveToFile returns a generic "[Note: Could not save full output to file]" message but discards the actual error. At 3 AM investigating why truncation silently fell back on every tool call (disk full, temp dir deleted, permissions change), there would be no log, no telemetry, and no error details. Since this is a scheduler-level code path every tool hits, a systemic temp-dir problem would cascade silently.

Consider adding debugLogger.warn(\Failed to save truncated tool output to ${outputFile}: ` + (error instanceof Error ? error.message : String(error)))` inside the catch block before returning.

— qwen3.7-max via Qwen Code /review

content = truncated.content;
if (
truncated.outputFile &&
typeof toolResult.returnDisplay === 'string'

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] typeof toolResult.returnDisplay === 'string' guard not covered by tests

When truncation fires and returnDisplay is a structured type (FileDiff, TodoResultDisplay, AnsiOutputDisplay, etc.), this typeof guard prevents appending the "Output too long and was saved to: <path>" notice. No scheduler test currently exercises this branch with a non-string returnDisplay — all truncation tests use LargeOutputTool which returns a string returnDisplay.

If a future refactor were to remove this guard and unconditionally concatenate, structured display objects would be corrupted (string concat on an object produces [object Object]). No test would fail.

Consider adding a regression test where LargeOutputTool returns a structured returnDisplay (e.g., { type: 'ansi', content: '...' }) alongside oversized string content, then assert content is truncated but resultDisplay is the original structured object unchanged.

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 378c2ac. Added a scheduler regression test with a structured returnDisplay; the model-facing content is truncated while resultDisplay remains the original structured object unchanged.

Comment thread packages/core/src/utils/truncation.ts Outdated
await fs.mkdir(projectTempDir, { recursive: true });
await fs.writeFile(outputFile, content);
await fs.mkdir(projectTempDir, { recursive: true, mode: 0o700 });
await fs.writeFile(outputFile, content, { mode: 0o600 });

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] ToolOutputTruncatedEvent does not carry the saved-file path

result.outputFile is available at the call site below (around line 148) but is not forwarded into ToolOutputTruncatedEvent. When investigating a truncation event in production (e.g., "why did the model behave oddly after truncation?"), there is no way to correlate the event with the actual file that holds the full output.

Consider adding an output_file: string field to ToolOutputTruncatedEvent (in packages/core/src/telemetry/types.ts) and forwarding result.outputFile here, so post-mortem analysis can jump straight to the saved file.

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 378c2ac. ToolOutputTruncatedEvent now carries output_file, truncateToolOutput forwards the saved path, and the telemetry logger/QwenLogger coverage was updated.

}
} catch (truncationError) {
debugLogger.warn(
`Tool output truncation failed for ${toolName}: ` +

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] Truncation-failure log lacks callId / prompt_id

The warning only includes toolName. When the same tool runs many times in a session (common for Bash or read_file), a truncation-failure warning can't be correlated to the specific invocation that failed.

scheduledCall.callId and scheduledCall.request.prompt_id are both in scope here — including them makes the warning actionable when paging through logs.

Suggested change
`Tool output truncation failed for ${toolName}: ` +
debugLogger.warn(
`Tool output truncation failed for ${toolName} ` +
`(callId=${scheduledCall.callId}, prompt_id=${scheduledCall.request.prompt_id}): ` +
(truncationError instanceof Error
? truncationError.message
: String(truncationError)),
);

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 378c2ac. The truncation-failure warning now includes the scheduled request callId and prompt_id; current types store the call id on scheduledCall.request.callId.

@Jerry2003826 Jerry2003826 force-pushed the codex/fix-tool-output-history-truncation branch from a3bdc99 to 97531e2 Compare May 26, 2026 11:57
@Jerry2003826

Copy link
Copy Markdown
Contributor Author

Updated in 97531e2 for the latest review-body suggestion.

Changes made:

  • Added a TRUNCATION debug warning when saving the full truncated output file fails, including the target output path and underlying error message.
  • Kept the existing graceful fallback behavior unchanged: the tool still returns truncated content with [Note: Could not save full output to file].
  • Added regression coverage for the warning path.

Validation run locally on Windows:

npm run test --workspace=packages/core -- src/utils/truncation.test.ts -t "file write errors"
npm run test --workspace=packages/core -- src/utils/truncation.test.ts
npx prettier --check packages/core/src/utils/truncation.ts packages/core/src/utils/truncation.test.ts
npx eslint packages/core/src/utils/truncation.ts packages/core/src/utils/truncation.test.ts
git diff --check

@wenshao

wenshao commented May 26, 2026

Copy link
Copy Markdown
Collaborator

Local verification report (Linux, tmux)

Verified PR head 97531e2eb (fix(core): harden tool output truncation handling, on top of 465d240a6 fix(core): truncate model-facing tool output) in an isolated worktree against main on Linux (Node v22.22.2, npm 10.9.7) inside a dedicated tmux session (tmux: pr4520). All commands from the PR's "How to verify" section were run plus shell.test.ts (the PR removes the local truncation block from shell.ts).

Environment

Item Value
OS Linux (Debian 13, kernel 6.12)
Node v22.22.2
npm 10.9.7
Worktree pr-4520 @ 97531e2eb
Test runner vitest 3.2.4 (workspace @qwen-code/qwen-code-core)

Results

Command Outcome
npm install OK (1374 packages, prepare/build/bundle ran)
truncation.test.ts -t "caller prompt id" 1 passed, 10 skipped (11 total)
coreToolScheduler.test.ts -t "tool output truncation" 7 passed, 163 skipped (170 total)
truncation.test.ts (full) 11 / 11 passed
coreToolScheduler.test.ts (full) 170 / 170 passed (~5.4s)
shell.test.ts (full, additional — refactor surface) 194 / 194 passed
prettier --check (6 PR files) clean
eslint (6 PR files) 0 errors / 0 warnings (exit 0)
npm run lint --workspace=@qwen-code/qwen-code-core exit 0
npm run typecheck --workspace=@qwen-code/qwen-code-core (tsc --noEmit) exit 0
git diff --check (PR files) no whitespace errors

Behavior covered

The new scheduler-level describe('CoreToolScheduler tool output truncation') block (test file lines 7253–end) covers the full set of behaviors claimed by the PR description, all observed passing:

  • truncates large model-facing tool output before it enters history — exercises the central truncation path with prompt_id plumbed through.
  • does not fail a successful tool call when output truncation fails — confirms the new best-effort try { ... } catch (truncationError) handler around truncateToolOutput: a truncation/I-O failure no longer demotes a successful tool call to an error.
  • reports contentLength after hook context is appended without truncation — verifies the moved contentLength recomputation lands after hook/reminder/truncation mutations (matches the relocation seen in the diff).
  • preserves hook context outside truncated tool output — verifies the new "defer postToolUseAdditionalContext append until after truncation" ordering, so a structure-blind head/tail truncator cannot bisect hook-injected metadata.
  • does not truncate already-truncated tool output again — verifies the TOOL_OUTPUT_TRUNCATED_PREFIX guard against double-truncation. (The exported constant is now shared between truncation.ts and coreToolScheduler.ts.)
  • adds the saved full output path to string returnDisplay when truncating — verifies the user-facing recovery path: the saved temp-file path is appended to a string returnDisplay so the user can read_file the full output.

truncation.test.ts adds should log truncation telemetry with the caller prompt id covering the new promptId parameter into ToolOutputTruncatedEvent. Pre-existing tests in the file (within-threshold passthrough, write-error graceful path, sanitized filename, etc.) all continue to pass.

shell.test.ts still passes 194/194 after the local truncation block is removed from shell.ts, confirming the shell-side refactor does not regress shell behavior.

Diff-level sanity checks (read alongside the tests)

  • Truncation responsibility is centralized in CoreToolScheduler.processCompleted... — applies to all string-producing tools, not just shell.
  • Re-entrancy guard: the scheduler skips truncation when content.startsWith(TOOL_OUTPUT_TRUNCATED_PREFIX), preventing already-truncated content from being trimmed again on a re-emit path.
  • Hook ordering: postToolUseAdditionalContext is captured during the hook step but only appended after the truncation step, so the truncator only operates on raw tool output.
  • Best-effort error handling: truncation failures are logged via the debug logger and the call falls through with the original content intact.
  • Tighter temp-file permissions: mkdir(..., { mode: 0o700 }), writeFile(..., { mode: 0o600 }) — saved-output files are now mode-restricted by default.
  • Telemetry: ToolOutputTruncatedEvent carries the originating tool request's prompt_id, enabling correlation with the rest of the scheduled-call telemetry.
  • shell.ts no longer imports or calls truncateToolOutput and the long-run advisory append comment is updated to reflect that command output assembly no longer interacts with truncation.

Notes

  • PR description's "Tested on" table marks Linux as "GitHub Actions passed" without a local Linux check — this report adds Linux-local coverage.
  • No UI surface changed; no manual TUI run was required.

Conclusion

LGTM from a verification standpoint. All commands listed in the PR's reviewer test plan pass locally on Linux; the additional shell.test.ts run (added because the PR removes the in-shell truncation block) also passes. No extra failures or warnings observed outside the documented behavior. Safe to merge from this PR's own scope of changes.

Verification artifacts (logs) retained locally at /tmp/pr4520-{install,test-prompt-id,test-trunc-scheduler,test-trunc-full,test-cts-full,test-shell,prettier,eslint,lint,typecheck,diff-check}.log.

wenshao
wenshao previously approved these changes May 26, 2026
Comment thread packages/core/src/utils/truncation.ts Outdated
try {
await fs.mkdir(projectTempDir, { recursive: true });
await fs.writeFile(outputFile, content);
await fs.mkdir(projectTempDir, { recursive: true, mode: 0o700 });

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] mode: 0o700 on fs.mkdir is silently defeated by prior directory creation

logger.ts (line ~181) calls await fs.mkdir(this.qwenDir, { recursive: true }) on the same getProjectTempDir() path early in session initialization — without a restrictive mode. Since fs.mkdir with recursive: true does not modify permissions of existing directories, by the time truncateAndSaveToFile runs, the directory already exists with default permissions (typically 0o755). The 0o700 mode here is effectively dead code.

The file-level 0o600 on writeFile still works correctly for newly-created files, so the immediate data protection is intact. But directory listing by other local users remains possible on shared systems.

Consider one of:

  • (a) Add mode: 0o700 where the directory is first created (logger.ts or storage.ts)
  • (b) Add await fs.chmod(projectTempDir, 0o700) after the mkdir call here to enforce the mode regardless of who created it
  • (c) Remove the misleading mode parameter and add a comment explaining that directory permission hardening is handled elsewhere

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a90adf9. After mkdir, truncateAndSaveToFile now best-effort chmods the project temp dir to 0o700 so an existing dir with looser permissions is tightened before writing the saved output. chmod failure is logged but does not prevent saving the 0o600 file; added regression coverage for that fallback.

}

if (
typeof content === 'string' &&

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] Shell-level metadata (long-run hint, attribution warning) is subject to head/tail truncation without protection

Shell.ts appends the long-run advisory hint (~530 chars) and attributionWarning to llmContent before returning to the scheduler. The scheduler's truncation then treats this combined string as raw tool output. Unlike PostToolUse hook context — which is explicitly deferred to after truncation (lines 2901-2905) — tool-level metadata appended inside the tool implementation receives no equivalent protection.

With the default threshold (25,000 chars), the tail budget is generous enough that the hint reliably survives. The practical risk is low. However, a future reduction in default thresholds or a user-configured low threshold could bisect the hint or attribution warning mid-sentence. The old shell test that pinned the hint's survival through truncation was removed and replaced with one that doesn't exercise the truncation interaction.

Consider either:

  • (a) Documenting in shell.ts that the hint and attribution warning are intentionally subject to scheduler truncation
  • (b) Having tools expose metadata separately (e.g., on toolResult) so the scheduler can append it after truncation, similar to postToolUseAdditionalContext

— qwen3.7-max via Qwen Code /review

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handled in a90adf9 with the lower-risk option. I documented in shell.ts that the model-facing copies of the long-run hint and attribution warning are intentionally part of shell llmContent and therefore subject to scheduler truncation, while returnDisplay still keeps the full user-facing metadata. I did not add a separate tool metadata API in this PR to avoid widening the change beyond the truncation fix.

@Jerry2003826 Jerry2003826 force-pushed the codex/fix-tool-output-history-truncation branch from 97531e2 to 378c2ac Compare May 26, 2026 12:31
@Jerry2003826

Jerry2003826 commented May 26, 2026

Copy link
Copy Markdown
Contributor Author

Updated PR description to the latest template with Reviewer Test Plan, Evidence, Tested on, and the Chinese explanation section.

Also pushed 378c2ac for the latest inline feedback:

  • added structured returnDisplay regression coverage;
  • forwarded the saved truncation file path into ToolOutputTruncatedEvent.output_file and logger snapshots;
  • included callId and prompt_id in truncation-failure diagnostics.

Validation run locally on Windows:

npm run test --workspace=packages/core -- src/utils/truncation.test.ts
npm run test --workspace=packages/core -- src/core/coreToolScheduler.test.ts
npm run test --workspace=packages/core -- src/telemetry/loggers.test.ts
npx prettier --check packages/core/src/core/coreToolScheduler.ts packages/core/src/core/coreToolScheduler.test.ts packages/core/src/utils/truncation.ts packages/core/src/utils/truncation.test.ts packages/core/src/telemetry/types.ts packages/core/src/telemetry/loggers.test.ts packages/core/src/telemetry/qwen-logger/qwen-logger.ts
npx eslint packages/core/src/core/coreToolScheduler.ts packages/core/src/core/coreToolScheduler.test.ts packages/core/src/utils/truncation.ts packages/core/src/utils/truncation.test.ts packages/core/src/telemetry/types.ts packages/core/src/telemetry/loggers.test.ts packages/core/src/telemetry/qwen-logger/qwen-logger.ts
npm run typecheck --workspace=packages/core
git diff --check

@Jerry2003826

Copy link
Copy Markdown
Contributor Author

CI note: the latest Qwen Code CI run did not reach lint/test execution. The Lint job failed during actions/checkout with a GitHub 403, and the OS test jobs failed while downloading dorny/test-reporter before checkout/test steps. I do not have repository permissions to rerun the workflow; local validation commands in the comment above passed on this branch head.

@Jerry2003826

Copy link
Copy Markdown
Contributor Author

Updated in 3b375ba to scope this PR down to the minimal scheduler-level fix.

What changed:

  • Removed the extra security-hardening changes, new truncation-failure telemetry chain, OTel/RUM path-sanitization changes, ToolResult.alreadyTruncated, split-budget logic, and hook-context truncation fallback work from this PR.
  • Kept only scheduler-level truncation for string llmContent, using the existing truncateToolOutput() helper.
  • Removed shell-level truncation so shell output now uses the same scheduler path as other string-returning tools.
  • Kept Part[] outputs on the existing non-string path and pinned that behavior in a test.
  • Preserved contentLength as the original raw string tool-output length, matching the pre-existing telemetry semantics rather than switching it to post-truncation / post-hook length.

Validation:

  • npm run test --workspace=packages/core -- src/core/coreToolScheduler.test.ts -t "model-facing output truncation"
  • npm run test --workspace=packages/core -- src/core/coreToolScheduler.test.ts src/tools/shell.test.ts -t "model-facing output truncation|appends the hint after command output is assembled"
  • npm run test --workspace=packages/core -- src/core/coreToolScheduler.test.ts src/tools/shell.test.ts
  • npx prettier --check packages/core/src/core/coreToolScheduler.ts packages/core/src/core/coreToolScheduler.test.ts packages/core/src/tools/shell.ts packages/core/src/tools/shell.test.ts
  • npx eslint packages/core/src/core/coreToolScheduler.ts packages/core/src/core/coreToolScheduler.test.ts packages/core/src/tools/shell.ts packages/core/src/tools/shell.test.ts
  • npm run typecheck --workspace=packages/core

@wenshao

wenshao commented Jun 1, 2026

Copy link
Copy Markdown
Collaborator

✅ Re-verification (updated scope) — PR #4520

⚠️ This updates my earlier verification comment. Since then the PR was rescoped by a0ee305fb fix(core): scope tool output truncation to scheduler strings (+ a merge with newer main). The previous approach (a truncation.ts/telemetry rewrite + truncation inside the shell tool) was dropped; the PR is now a focused 4-file change. Re-verified on the current head 3b375ba11.

结论 / Verdict: 新方案(在 CoreToolScheduler 集中截断)构建/测试/运行时/静态检查全部通过,建议合并。 The new scheduler-centric approach passes build, tests, runtime and static checks — recommend merge.

What changed since the last verification

Old head (fe0d5275d) Current head (3b375ba11)
Scope 14 files; rewrote truncation.ts, added telemetry events/types, truncated inside the shell tool 4 files; truncation moved into CoreToolScheduler, no truncation.ts/telemetry changes
Files core utils + telemetry + tools coreToolScheduler.{ts,test}, shell.{ts,test} only

The new design: the scheduler calls the existing truncateToolOutput(config, toolName, content) on string llmContent for every tool (not just shell); shell.ts drops its own truncation; PostToolUse hook additionalContext is now appended after truncation so the head/tail truncator can't bisect hook metadata; the user-facing resultDisplay gets an "Output too long and was saved to: " note.

Method

  • pr-4520 3b375ba11 is 0 behind / 16 ahead of current origin/main (68e4819d7) — it already contains current main, so I tested it directly (true post-merge state). Delta vs origin/main = 4 files, +291/−104. Clean-rebuilt core.
  • (My local checkout's main had been stale at 1c48e4121; I synced origin/main — that's why the raw three-dot diff initially looked inflated. The real PR delta is the 4 files above.)

Build & static ✅

  • core tsc --build clean (emit + typecheck).
  • eslint on the 4 files: clean. git diff --check: no whitespace errors.

Tests ✅

Scope Result
coreToolScheduler.test.ts -t "truncat" (new scheduler-truncation cases) 8 passed
coreToolScheduler.test.ts + shell.test.ts (targeted) 377 passed (2 files)
Full packages/core suite 9786 passed, 6 skipped, 0 failed (355 files)

🎯 Runtime check (built core)

Exercised the exact function the scheduler now calls, truncateToolOutput, with a non-shell tool name (read_file) — confirming the central truncation now applies to any tool, not just shell:

[A] truncateAndSaveToFile (underlying):  318,889 chars → 2,800; full content saved to temp file; "[CONTENT TRUNCATED]" marker present
[B] truncateToolOutput (scheduler path, tool="read_file"): bounded → 2,809; full output saved to read_file_<hex>.output
[C] small content: returned unchanged, no temp file
RUNTIME CHECK: PASS

Notes for reviewers

  • The end-to-end "oversized tool result enters history truncated" path now lives in the scheduler and is covered by the +228-line coreToolScheduler.test.ts (driving the scheduler with mocked tools); exercising it through the live TUI would need an LLM-issued tool call, so it's out of scope for a no-network runtime check — hence the direct truncateToolOutput exercise above plus the unit suite.
  • Behavioural intent worth a human eye: shell-tool metadata (long-run hint, attribution warning) is now intentionally subject to scheduler truncation on the model-facing copy while the user-facing returnDisplay keeps it intact — the comments in shell.ts call this out explicitly, and shell.test.ts (now 202→ updated) covers it.

Environment note (unrelated to this PR)

This checkout's dist was drifted and local main was behind origin/main; I synced origin and clean-rebuilt core before testing. Deps unchanged.


Bottom line: the rescoped, scheduler-centric truncation builds clean, the full core suite is green (incl. the new scheduler-truncation tests), the truncation function the scheduler calls works at runtime for non-shell tools (bounded output + full-content temp file + passthrough), and lint/typecheck/whitespace are clean. 👍

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No review findings. Downgraded from Approve to Comment: CI failing: Post Coverage Comment, Test (${{ matrix.os }}, Node ${{ matrix.node-version }}), Lint, CodeQL, Classify PR. — qwen3.7-max via Qwen Code /review

wenshao
wenshao previously approved these changes Jun 1, 2026
@wenshao

wenshao commented Jun 1, 2026

Copy link
Copy Markdown
Collaborator

✅ 本地 tmux 真实构建 + 测试 + 运行时复验 — PR #4520

🔄 这是对 当前 head 3b375ba11 的一次全新独立复验。自我上一条验证(今日 09:41)以来 PR 没有新提交(head 未变),应维护者要求在合并决策前重新跑一遍,并补上更强的真实端到端运行时证据。本条更新/汇总此前结论。

A fresh, independent re-verification of the current head 3b375ba11 — there have been no new PR commits since my last report (head unchanged); re-run before the merge decision with stronger real end-to-end runtime evidence. Supersedes my earlier comments.

结论 / Verdict: 构建、类型检查、lint/格式、全部受影响测试、以及真实运行时行为全部通过,重构在构建产物中得到验证 —— 建议合并。 Build, typecheck, lint/format, all affected tests, and real runtime behaviour all pass; the refactor is verified in the built artifacts — recommend merge.

复验前提 / Setup

  • origin/main(68e4819d7)是 PR head 的祖先,即 PR 已包含当前 main,pr-4520 就是真实落地后的状态(mergeable: MERGEABLE;BLOCKED 仅为必需的 review/checks,非冲突)。
  • 隔离 git worktree,从零 npm ci --ignore-scripts → 重新构建 core → 全部在专用 tmux 会话 pr4520 内执行,完全不动你的主工作区。
Item Value
OS Linux (Debian 13, kernel 6.12)
Node / npm v22.22.2 / 10.9.7
tmux 3.5a — session pr4520, 隔离 worktree qwen-code-verify-4520
Head tested 3b375ba1127fd6c3d350f940e4d0b9f5ec6a56d1

当前改动范围(4 文件,+291/-104) / Current scope — verified against the diff, not the body

把"超大模型可见工具输出"的截断从 shell 工具上移到 CoreToolScheduler,使所有工具的大字符串输出都被截断(而不只是 shell):

  1. coreToolScheduler.ts (+34) — 对 string 类型的 llmContent 调用既有的 truncateToolOutput;并把 PostToolUse hook 的 additionalContext 延后到截断之后再追加,避免 head/tail 截断器切断 hook 注入的元数据;非字符串 Part[] 输出原样保留。
  2. shell.ts移除已重复的 in-shell 截断块(long-run 提示 / 归因警告改为在 scheduler 截断前组装,模型可见副本受 scheduler 截断,用户可见 returnDisplay 仍保留完整)。
  3. coreToolScheduler.test.ts (+228) / shell.test.ts — 配套测试。

注:truncateToolOutput / truncateAndSaveToFile 本就在 main 的 truncation.ts 中,本 PR 未改动该文件,只是复用。

1) 构建 & 类型检查 / Build & typecheck

Check Result
core 构建(build_package.js,tsc emit) ✅ PASS (exit 0)
npm run typecheck --workspace=packages/core(tsc --noEmit) ✅ PASS (exit 0)

构建产物结构性证据 / structural evidence in dist:

  • dist/.../tools/shell.jstruncateToolOutput 的引用数 = 0(已移除)。
  • dist/.../core/coreToolScheduler.jstruncateToolOutput 的引用 = 2(import + 调用,已新增)。

2) 单元测试 / Tests

Suite Result
coreToolScheduler.test.ts + shell.test.ts(全量) 377 passed / 0 failed(2 files)
3 个新增测试 -t "model-facing output truncation" ✅ 3 passed

3 个新增用例均通过:truncates large string output at the scheduler before it enters historyappends PostToolUse context after raw output truncationleaves non-string Part output on the existing Part[] path

3) Lint / 格式 / Lint & format

  • eslint(4 个改动文件)— ✅ exit 0,无告警。
  • prettier --check(4 个改动文件)— ✅ All matched files use Prettier code style!

4) 真实运行时复验(核心 PR,TUI 截图 N/A) / Real runtime exercise

直接驱动构建产物。除了跑作者用 mock 验证 scheduler 接线的单测外,这里用未 mock 的真实 truncateToolOutput 端到端跑一遍,证明机制本身真的生效:

(A) DIRECT: truncateAndSaveToFile on real oversized input
PASS  output bounded (595 < original 122889)
PASS  has truncation header + [CONTENT TRUNCATED] marker
PASS  full output saved to temp file (lossless: file === original)
PASS  small input passes through unchanged, no file written

(B) END-TO-END: real CoreToolScheduler + real (unmocked) truncateToolOutput + real MockTool
PASS  tool call status === success
PASS  MODEL-FACING history output was truncated by the scheduler (614 < original 76889)
PASS  oversized raw body did NOT enter history
PASS  contentLength records the ORIGINAL pre-truncation length (76889)
PASS  user-facing resultDisplay got "Output too long and was saved to: …"
PASS  real temp file on disk holds the COMPLETE original tool output

(C) SANITY: small tool output is NOT truncated end-to-end
PASS  passes through unchanged / no temp file / no saved-file note

===== RESULT: ALL CHECKS PASSED =====

即:超大输出在进入模型历史前被有界截断,完整内容无损落盘,小输出原样透传 —— 与 #4049 的目标一致。

备注 / Notes(均不阻塞合并 / non-blocking)

  1. PR 描述已与 diff 脱节,建议更新正文:
    • "How to verify" 里把 packages/core/src/utils/truncation.ts 列为改动文件,但当前 diff 并未改它;
    • 文档化的过滤器 -t "tool output truncation" 现在匹配 0 个测试(实际 describe 是 CoreToolScheduler model-facing output truncation),正确写法是 -t "model-facing output truncation"。建议修正,避免 reviewer 复制命令跑了个空。
  2. 早期版本附带的 temp-file 权限加固 / 新 telemetry 事件 / in-shell 截断等已全部下掉,现已收敛为最小必要改动 —— 与此前 review 的建议一致 👍
  3. "避免重复截断"是结构性达成的(shell 不再截断,仅 scheduler 截断一次),没有 sentinel 守卫,符合预期。
  4. 一个值得作者留意的行为细节:shell 的 long-run 提示与归因警告现在在模型可见路径上会受 scheduler 截断;对病态超大输出,尾部提示可能从模型可见副本中被截掉(用户可见 returnDisplay 仍完整)。代码注释已说明这是有意为之,风险低。

在隔离 worktree(origin/main 祖先 + PR head 3b375ba11)中于 tmux 会话 pr4520 内执行;未改动主工作区。Verified in an isolated worktree inside tmux session pr4520; the main checkout was left untouched.

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

R18: Rescoped PR (4 files, +291/-104 vs 14 files, +1652/-146). No review findings on the minimal scheduler-level truncation fix. Downgraded from Approve to Comment: CI failing: Post Coverage Comment, Test, Lint, CodeQL, Classify PR. 3 low-confidence observations for human review only (no try-catch around truncateToolOutput, potential double-truncation for self-truncating tools, non-string resultDisplay drops TUI notice). 205 tests pass. — qwen3.7-max via Qwen Code /review

@Jerry2003826

Copy link
Copy Markdown
Contributor Author

Updated the PR description to match the rescoped 4-file diff.

Also checked the latest CI after R18: the current run is green now.

Current status:

  • Lint: pass
  • CodeQL: pass
  • Test (macos-latest, Node 22.x): pass
  • Test (ubuntu-latest, Node 22.x): pass
  • Test (windows-latest, Node 22.x): pass

I left the three R18 low-confidence observations as non-blocking for this minimal PR:

  • truncateToolOutput() failure handling is unchanged from the existing helper semantics and can be revisited separately if maintainers want best-effort scheduler fallback.
  • explicit self-truncation signaling would require a ToolResult API surface, which was intentionally removed from this PR during rescope.
  • structured/non-string resultDisplay is preserved unchanged; only string displays get the saved-output path appended.

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No review findings. Downgraded from Approve to Comment: CI failing: Post Coverage Comment, Test (${{ matrix.os }}, Node ${{ matrix.node-version }}), Lint, CodeQL, Classify PR. — qwen-latest-series-invite-beta-v38 via Qwen Code /review

@Jerry2003826 Jerry2003826 force-pushed the codex/fix-tool-output-history-truncation branch from 3b375ba to 2ea9955 Compare June 1, 2026 17:11
BZ-D
BZ-D previously approved these changes Jun 2, 2026

@BZ-D BZ-D left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the diff and ran the targeted scheduler, shell, and truncation tests; no blocking issues found.

wenshao
wenshao previously approved these changes Jun 2, 2026

@DragonnZhang DragonnZhang left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — PR #4520 (commit 2ea9955)

Direction: Sound. Scheduler-level truncation is the correct place to bound model-facing string output, and the rescope to 5 files keeps the change focused.

CI: all_pass (17/17 checks).

Deterministic checks: tsc=0, eslint=0.

Findings: 1 new Suggestion (combined guard re-truncation edge case). All other observations are already covered by the extensive prior review thread (40+ existing comments).

One inline comment posted.

— qwen-code via Qwen Code /review

const shouldTruncateCombinedContent =
content.length > combinedThreshold ||
content.split('\n').length > combinedLines;
if (shouldTruncateCombinedContent) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] Combined guard can re-truncate already-truncated content

When raw tool output exceeds the default threshold, the first truncateToolOutput call (line ~2916) wraps it in a truncation envelope (starting with "Tool output was too large and has been truncated.") and saves the full output to a temp file. After PostToolUse hook context is appended, this combined guard checks whether the total exceeds combinedThreshold (2x default) and calls truncateToolOutput a second time.

In the edge case where both raw output AND hook context independently exceed the threshold, the combined guard re-truncates content that already contains a truncation envelope. The result:

  1. Nested truncation envelopes — the model sees two layers of "Tool output was too large..." markers
  2. Two temp files saved to disk (with two different random names)
  3. Two "Output too long and was saved to: ..." lines accumulated in resultDisplay
  4. The second head/tail split may bisect the first envelope's marker text

The test "guards the final combined output after PostToolUse context is appended" only exercises the path where the first truncation does NOT fire (rawOutput = 'A'.repeat(90) < threshold = 100), so this interaction is untested.

Consider tracking whether the first pass already truncated (the first call's outputFile is set) and skipping the combined guard in that case, or add a test that exercises both passes firing simultaneously.

— qwen-code via Qwen Code /review

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a811f425d. The final combined guard now skips the second truncation pass when the raw tool output was already truncated and persisted by the first pass. That avoids nested truncation envelopes, duplicate temp files, and duplicate resultDisplay save notices while still appending the PostToolUse context.

Added coverage:

  • does not re-truncate already-truncated output after PostToolUse context is appended

Validated:

  • npm run test --workspace=packages/core -- src/core/coreToolScheduler.test.ts
  • npx prettier --check packages/core/src/core/coreToolScheduler.ts packages/core/src/core/coreToolScheduler.test.ts
  • npx eslint packages/core/src/core/coreToolScheduler.ts packages/core/src/core/coreToolScheduler.test.ts

@Jerry2003826 Jerry2003826 dismissed stale reviews from wenshao and BZ-D via a811f42 June 8, 2026 09:06
content,
);
content = truncated.content;
if (truncated.outputFile && typeof resultDisplay === 'string') {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] toolOutputAlreadyTruncated flag is gated on typeof resultDisplay === 'string', but content = truncated.content on line 3174 runs unconditionally. When a tool returns string llmContent that gets truncated with an outputFile, but a non-string resultDisplay (e.g., FileDiff from edit tool, AgentResultDisplay from agent tool), the flag stays false even though content has already been truncated. The combined-content guard at line ~3272 will then re-truncate already-truncated content when PostToolUse context or conditional-rule reminders are appended — the exact double-truncation scenario this commit aims to prevent.

Decouple the flag from the display type check:

Suggested change
if (truncated.outputFile && typeof resultDisplay === 'string') {
if (truncated.outputFile) {
toolOutputAlreadyTruncated = true;
if (typeof resultDisplay === 'string') {
resultDisplay +=
(resultDisplay ? '\n' : '') +
`Output too long and was saved to: ${truncated.outputFile}`;
}
}

The regression test covers the string-display path but does not exercise a structured returnDisplay with large string llmContent + PostToolUse hook context. Consider adding a test case for that edge case as well.

— qwen3.7-plus via Qwen Code /review

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handled in the current branch. toolOutputAlreadyTruncated is now set whenever the first truncation pass persists raw string output to an outputFile, independent of whether resultDisplay is a string. The display save notice remains string-only.

Added a regression with structured FileDiff result display plus PostToolUse additional context. It asserts the raw output truncation runs once, the appended context is preserved, and the structured display object is left unchanged.

Validated:

  • npm run test --workspace=packages/core -- src/core/coreToolScheduler.test.ts -t "structured result display"
  • npm run test --workspace=packages/core -- src/core/coreToolScheduler.test.ts
  • npx prettier --check packages/core/src/core/coreToolScheduler.ts packages/core/src/core/coreToolScheduler.test.ts
  • npx eslint packages/core/src/core/coreToolScheduler.ts packages/core/src/core/coreToolScheduler.test.ts

…-history-truncation

# Conflicts:
#	packages/core/src/core/coreToolScheduler.ts
@Jerry2003826 Jerry2003826 force-pushed the codex/fix-tool-output-history-truncation branch from 37a65fe to efa179a Compare June 9, 2026 08:00
wenshao
wenshao previously approved these changes Jun 9, 2026

@DragonnZhang DragonnZhang left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — PR #4520 (commit efa179a)

Direction: Sound. The toolOutputAlreadyTruncated guard added in a811f425d correctly prevents the combined guard from re-truncating content that was already persisted by the first truncation pass. This closes the nested-truncation-envelope concern from the previous review.

CI: all_pass (24/24 checks).

Detected issues: None beyond the extensively-discussed backlog (39 stale inline comments, all addressed by subsequent commits). The rescoped 4-file core change is focused and well-tested.

Incremental since R19 (commit 2ea9955): +8 lines in coreToolScheduler.ts (the toolOutputAlreadyTruncated flag + combined-guard skip), +54 lines of regression tests. No regressions in the diff.

— qwen3-coder via Qwen Code /review

@Jerry2003826

Copy link
Copy Markdown
Contributor Author

Addressed the latest truncation review batch in fafe8ed93.

What changed:

  • Truncate pure text Part[] outputs before they enter model-facing history, while keeping non-text/multimodal Part[] on the existing path.
  • Added a shared scheduler truncation helper with in-memory fallback when persistence/truncation unexpectedly fails.
  • Covered the combined PostToolUse/context path with the split budget fallback, so fallback does not use the full raw-output budget.
  • Sanitized generated output filenames for Windows-invalid characters.
  • Extended tool_output_truncated telemetry with call_id, output_file_saved, and sanitized save error metadata.
  • Added direct utility tests for in-memory truncation, formatting, filename sanitization, and telemetry error redaction.

Validated locally:

  • npm run test --workspace=packages/core -- src/utils/truncation.test.ts src/core/coreToolScheduler.test.ts src/telemetry/loggers.test.ts
  • npm run lint --workspace=packages/core
  • npm run typecheck --workspace=packages/core
  • npx prettier --check packages/core/src/utils/truncation.ts packages/core/src/utils/truncation.test.ts packages/core/src/core/coreToolScheduler.ts packages/core/src/core/coreToolScheduler.test.ts packages/core/src/telemetry/types.ts packages/core/src/telemetry/loggers.test.ts packages/core/src/telemetry/qwen-logger/qwen-logger.ts

@Jerry2003826

Copy link
Copy Markdown
Contributor Author

Pushed additional review follow-ups in f949b3379 and 81498ed70.

Changes added:

  • Documented TOOL_OUTPUT_TRUNCATED_PREFIX as the scheduler sentinel to avoid accidental UX-only edits breaking double-truncation detection.
  • Added a serialized-size estimate to the non-text content truncation skip debug log.
  • Parameterized the truncation envelope text for custom contentLabel values instead of hardcoding output in the body.
  • Added telemetry for unexpected truncateToolOutput() failures that fall back to in-memory truncation, including call id, save failure metadata, and sanitized error text.
  • Added/updated regression assertions for the new label-aware envelope and non-text skip debug log.

Validation:

  • npm run test --workspace=packages/core -- src/utils/truncation.test.ts src/core/coreToolScheduler.test.ts src/telemetry/loggers.test.ts
  • npm run lint --workspace=packages/core
  • npm run typecheck --workspace=packages/core
  • npx prettier --check packages/core/src/utils/truncation.ts packages/core/src/utils/truncation.test.ts packages/core/src/core/coreToolScheduler.ts packages/core/src/core/coreToolScheduler.test.ts

One intentional behavior note: I kept structured resultDisplay values intact rather than converting them into strings to append the saved-output notice. The model-facing content still carries the truncation envelope, and string/undefined displays get the notice; preserving structured displays avoids breaking file diff / todo / agent result rendering.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

type/bug Something isn't working as expected

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] 工具输出未截断导致 Context Token 溢出,Session 无法继续

7 participants