fix(edit): preserve file encoding on read/edit/restore (GB18030, UTF-8 BOM) #1518
Merged
added 2 commits
May 21, 2026 22:25
…8 BOM)

read_file / edit_file / multi_edit / applyEditBlock all hard-coded UTF-8 in readFileSync + writeFileSync. On Chinese Windows hosts most project files are GB18030; lossy UTF-8 decode mangled the bytes so the model's SEARCH text never byte-matched, edit_file silently failed, and the model fell back to PowerShell + git-replay (token-expensive and lossy for any user manual edits in between).

The shell stdout path already handled this via smartDecodeOutput's UTF-8-fatal → GB18030 fallback. Same pattern, applied to the filesystem path, plus encode-back via iconv-lite so the file stays in its original encoding. UTF-8 BOM round-trips too. Snapshot records prevEncoding so restoreSnapshots writes back correctly.

Closes #1445
The CI Ubuntu build flagged the previous platform gate: GB18030 files are portable, and Linux developers can have CN-Windows-origin files in their projects. The gate was a copy of smartDecodeOutput's cmd.exe-specific behavior, which doesn't apply to file-on-disk encoding detection.
esengine added a commit that referenced this pull request on May 22, 2026
…se (#1565)

* chore(release): 0.49.0 - static-history TUI, queued steers, Bing default, lifecycle plans

  Headline themes:
  - TUI: Static-history renderer is the only path; virtual-viewport layers removed (#1529 stages 1-4)
  - Chat: queued mid-turn steer handling so input mid-render doesn't drop or fight the live frame (#1501)
  - Web search: default switches to Bing; dashboard engine switcher; Mojeek dropped (#1558)
  - Plans: lifecycle evidence summaries surface why a plan is ready to accept (#1500)
  - Desktop: native OS notifications for approvals + completion (#1519)
  - i18n: CLI command output (/mcp /sessions /prune /theme) + approval-prompt labels translated (#1524, #1560)
  - Security: SSRF block in web_fetch (#1544), edit-snapshot path containment (#1454), shell redirect sandbox (#1457), Task integrity guardrail (#1516)
  - Tools: per-turn dispatch-rate limit (#1356); run_command discourages shell-based edits (#1514)
  - Client: DeepSeek 429 → concurrency-limit hint (#1526); timeoutMs honored with AbortSignal (#1535); --no-proxy opt-out for direct route (#1507)
  - Files: read/edit/restore preserves source encoding (GB18030 / UTF-8 BOM) (#1518)
  - Context: pinned constraints survive folds + full tail capture (#1515, #1552)
  - Refactor: lifecycle risk policy extracted into its own module (#1557)

  See CHANGELOG for the full list.

* fix(context): align fold summary prefix with main agent for cache reuse

  The summarizer call was sending a bespoke "You compress conversation history" system prompt and no tools, guaranteeing a 0% cache hit against the main agent's just-cached prefix. Reshape the request so system + tools + head bytes mirror the live agent's last call; the only novel bytes are the trailing summarize instruction.

  Skill-pin handling now collects bodies read-only instead of stubbing mid-head, so the cache prefix stays unbroken. The summarize instruction names pinned skills so the model knows not to paraphrase their bodies (which we append verbatim regardless).

  Measured on a real session at 48.7K prompt tokens:
    OLD shape: 0.0% cache hit → $0.145 per fold
    NEW shape: 99.6% cache hit → $0.015 per fold
    saving: 89.6% per fold

* tools: add fold-cache shape + live benchmarks

  bench-fold-cache-shape.mjs replays real session jsonls, simulates OLD vs NEW summary-call shapes at the fold point, and reports byte-level shared-prefix with the main agent's preceding request. Pure local, no API required.

  bench-fold-cache-live.mjs sends one priming + two summary calls to DeepSeek and reports prompt_cache_hit_tokens / cost for each shape. Used to confirm the shape change actually translates to API-side cache hits.

---------

Co-authored-by: reasonix <reasonix@deepseek.com>
Summary
Preserve original file encoding (GB18030, UTF-8 BOM) on the read/edit/write/snapshot/restore path. Closes #1445.
The bug chain reported in #1445
User on Chinese Windows 11, YOLO mode:
- read_file / edit_file return mangled bytes or fail SEARCH on GB18030 files (most CN Windows projects)
- git show HEAD:<file> + replay edits from conversation history

Root cause is layer 1. The other layers are the model improvising around the silent corruption.
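The layer-1 corruption can be reproduced in a few lines. This is a minimal sketch assuming Node.js, not code from the PR: the four bytes are the GB18030 encoding of 中文, and every name below (gb18030Bytes, lossy, rewritten, isValidUtf8) is illustrative.

```typescript
import { Buffer } from "node:buffer";

// "中文" as stored on disk in GB18030 (these two characters are D6 D0 and CE C4).
const gb18030Bytes = Buffer.from([0xd6, 0xd0, 0xce, 0xc4]);

// What a hard-coded toString("utf8") produces: invalid sequences are
// silently replaced with U+FFFD instead of raising an error.
const lossy = gb18030Bytes.toString("utf8");

// Writing the lossy text back as UTF-8 cannot reproduce the original bytes.
const rewritten = Buffer.from(lossy, "utf8");

// A fatal decoder turns the silent corruption into a detectable failure,
// which is the signal a fallback to another encoding can key off.
function isValidUtf8(buf: Uint8Array): boolean {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(buf);
    return true;
  } catch {
    return false;
  }
}

console.log({
  searchWouldMatch: lossy.includes("中文"),
  bytesSurviveRewrite: rewritten.equals(gb18030Bytes),
  validUtf8: isValidUtf8(gb18030Bytes),
});
```

All three reported values are false for these bytes: a SEARCH string containing the real characters cannot match the decoded text, and a write-back would destroy the file's content.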
Why this happened
Every filesystem call in the edit pipeline hard-coded UTF-8:
- src/tools/filesystem.ts:283 (read_file): raw.toString("utf8")
- src/tools/filesystem.ts:656 (write_file): writeFile(abs, content, "utf8")
- src/tools/fs/edit.ts:16,32 (applyEdit): read + write
- src/tools/fs/edit.ts:86,126,132 (applyMultiEdit): read, write, rollback-write
- src/code/edit-blocks.ts:160,187,213,243,275: applyEditBlock content, applyEditBlock write, toWholeFileEditBlock, snapshotBeforeEdits, restoreSnapshots

The shell stdout path had already solved the equivalent problem: smartDecodeOutput at src/tools/shell/exec.ts:177 does UTF-8-fatal → GB18030 fallback on Windows. The filesystem path just never used the same approach.

What this PR does
New src/code/file-encoding.ts with two pure helpers:
- decodeFileBuffer(buf) → { text, encoding: 'utf8' | 'utf8-bom' | 'gb18030' }
  - EF BB BF (the UTF-8 BOM), stripped from returned text
  - ([…] smartDecodeOutput's safety net so we never throw on read)
- encodeFile(text, encoding) → Buffer
  - Buffer.from
  - iconv-lite (new dep: 0-dep, MIT, ~50KB, 30M weekly downloads)

All 12 call sites above swap to these helpers. The edit pipeline now:
- decodeFileBuffer → text string + detected encoding
- encodeFile(after, encoding) so the file keeps its original encoding

EditSnapshot gains a prevEncoding?: FileEncoding field so restoreSnapshots round-trips correctly when undoing edits on GB18030 / BOM files. The field is optional for backward compatibility with any persisted older snapshots; restore defaults to UTF-8 when absent.

What this PR does NOT do
- […] lineEndingOf stays the same (orthogonal to encoding).
- […] src/code/checkpoints.ts covers that surface; a per-turn auto-snapshot hook is a separate PR if desired. With the root encoding bug fixed, edit_file shouldn't fail in the first place, which removes the trigger for the entire model-fallback chain the user described.

Test Plan
- tests/file-encoding.test.ts (8 cases): plain UTF-8, multi-byte UTF-8, BOM strip + report, GB18030 fallback (Windows-only), empty file, round-trip in each encoding
- tests/edit-blocks.test.ts (3 new cases): edit GB18030 file stays GB18030 (asserts UTF-8 decode fails on the result); edit BOM file preserves the BOM bytes; snapshot+restore round-trips through GB18030
- npx vitest run tests/file-encoding.test.ts tests/edit-blocks.test.ts tests/filesystem-tools.test.ts tests/comment-policy.test.ts: 168 pass
- npx tsc --noEmit: clean
- npx biome check on touched files: clean
- npm run verify via pre-push hook: green
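The detect-then-encode-back behavior these cases exercise might look roughly like the sketch below. It is not the PR's file-encoding.ts: the helper names and the FileEncoding union come from the description above, but the bodies are an assumption. In particular, GB18030 decoding here uses TextDecoder (which needs a Node build with full ICU), and the third encodeFile parameter is a hypothetical injection point standing in for the iconv-lite call so the block stays dependency-free.

```typescript
import { Buffer } from "node:buffer";

type FileEncoding = "utf8" | "utf8-bom" | "gb18030";

const UTF8_BOM = Buffer.from([0xef, 0xbb, 0xbf]);

// Detection order assumed here: BOM first, then strict UTF-8, then GB18030.
function decodeFileBuffer(buf: Buffer): { text: string; encoding: FileEncoding } {
  if (buf.length >= 3 && buf.subarray(0, 3).equals(UTF8_BOM)) {
    // Strip the BOM from the text but remember it in the reported encoding.
    return { text: buf.subarray(3).toString("utf8"), encoding: "utf8-bom" };
  }
  try {
    // fatal: true throws on invalid sequences instead of emitting U+FFFD.
    const text = new TextDecoder("utf-8", { fatal: true }).decode(buf);
    return { text, encoding: "utf8" };
  } catch {
    // Assumption: the "gb18030" label is available (full-ICU Node build).
    return { text: new TextDecoder("gb18030").decode(buf), encoding: "gb18030" };
  }
}

// encodeGb18030 is a hypothetical parameter for this sketch only; the PR's
// helper takes (text, encoding) and uses iconv-lite internally.
function encodeFile(
  text: string,
  encoding: FileEncoding,
  encodeGb18030?: (text: string) => Buffer,
): Buffer {
  switch (encoding) {
    case "utf8":
      return Buffer.from(text, "utf8");
    case "utf8-bom":
      // Re-attach the BOM that decodeFileBuffer stripped.
      return Buffer.concat([UTF8_BOM, Buffer.from(text, "utf8")]);
    case "gb18030":
      if (!encodeGb18030) throw new Error("no GB18030 encoder supplied");
      return encodeGb18030(text);
  }
}

// Round trip on a BOM file: edit the text, write back, BOM bytes survive.
const original = Buffer.concat([UTF8_BOM, Buffer.from("hello", "utf8")]);
const decoded = decodeFileBuffer(original);
const edited = encodeFile(decoded.text.replace("hello", "world"), decoded.encoding);
console.log(decoded.encoding, edited.subarray(0, 3).equals(UTF8_BOM));
```

With iconv-lite installed, the GB18030 branch could be wired up as encodeFile(text, "gb18030", (t) => iconv.encode(t, "gb18030")), since iconv.encode returns a Buffer.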