
fix(edit): preserve file encoding on read/edit/restore (GB18030, UTF-8 BOM) #1518

Merged
esengine merged 2 commits into main from fix/file-encoding-edit-roundtrip
May 22, 2026

@esengine

Owner

Summary

Preserve original file encoding (GB18030, UTF-8 BOM) on the read/edit/write/snapshot/restore path. Closes #1445.

The bug chain reported in #1445

User on Chinese Windows 11, YOLO mode:

  1. read_file / edit_file return mangled bytes or fail SEARCH on GB18030 files (most CN Windows projects)
  2. Model falls back to PowerShell — same encoding issue, also fails
  3. Model's last resort: git show HEAD:<file> + replay edits from conversation history
  4. That last resort is token-expensive AND lossy — any user manual edits between the start of the turn and the failure are silently overwritten

Root cause is layer 1. The other layers are the model improvising around the silent corruption.

Why this happened

Every filesystem call in the edit pipeline hard-coded UTF-8:

The shell stdout path had already solved the equivalent problem: smartDecodeOutput at src/tools/shell/exec.ts:177 does UTF-8-fatal → GB18030 fallback on Windows. The filesystem path just never used the same approach.
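The difference between lossy and fatal UTF-8 decoding is what makes the fallback possible. The snippet below is only an illustration, not code from the PR: the helper name `tryStrictUtf8` is hypothetical, and it uses the standard `TextDecoder` rather than whatever `smartDecodeOutput` does internally.

```typescript
// GB18030 bytes for "中文" (D6 D0 CE C4). These bytes are not valid UTF-8.
const gbBytes = new Uint8Array([0xd6, 0xd0, 0xce, 0xc4]);

// Lossy decode (the TextDecoder default): invalid sequences become U+FFFD,
// so the original text is unrecoverable from the resulting string.
const lossy = new TextDecoder("utf-8").decode(gbBytes);

// Fatal decode: invalid input throws instead of being silently replaced.
// Hypothetical helper that returns null so a caller can try another encoding.
function tryStrictUtf8(bytes: Uint8Array): string | null {
  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    return null;
  }
}

const strict = tryStrictUtf8(gbBytes);

// A SEARCH string taken from the real file content can never match the mangled text.
const searchMatches = lossy.includes("中文");

console.log({
  lossyHasReplacementChars: lossy.includes("\uFFFD"),
  searchMatches,
  strictDecodeRejected: strict === null,
});
```

With a lossy decode the failure is silent, which is why SEARCH never matched; with a fatal decode the failure is observable and can trigger a second attempt in another encoding.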

What this PR does

New src/code/file-encoding.ts with two pure helpers:

  • decodeFileBuffer(buf) → { text, encoding: 'utf8' | 'utf8-bom' | 'gb18030' }
    • UTF-8 BOM detected by leading EF BB BF, stripped from returned text
    • UTF-8 attempted in fatal mode; falls through to GB18030 fallback on Windows
    • Last-resort lossy UTF-8 with replacement chars (matches smartDecodeOutput's safety net so we never throw on read)
  • encodeFile(text, encoding) → Buffer
    • UTF-8 / UTF-8-BOM via Buffer.from
    • GB18030 via iconv-lite (new dep — 0-dep, MIT, ~50KB, 30M weekly downloads)
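The decode order in the bullets above can be sketched roughly as follows. This is a minimal sketch, not the PR's implementation: `decodeFileBufferSketch` and `hasUtf8Bom` are hypothetical names, the input is a plain `Uint8Array` rather than a `Buffer`, and it assumes the runtime's built-in `TextDecoder` recognizes the `gb18030` label (Node builds with full ICU do). The encode side is omitted because it depends on iconv-lite.

```typescript
type FileEncoding = "utf8" | "utf8-bom" | "gb18030";

interface Decoded {
  text: string;
  encoding: FileEncoding;
}

const UTF8_BOM = [0xef, 0xbb, 0xbf];

function hasUtf8Bom(bytes: Uint8Array): boolean {
  return bytes.length >= 3 && UTF8_BOM.every((b, i) => bytes[i] === b);
}

// Hypothetical stand-in for decodeFileBuffer; the real helper may differ.
function decodeFileBufferSketch(bytes: Uint8Array): Decoded {
  // 1. UTF-8 BOM: report it and strip it from the returned text.
  if (hasUtf8Bom(bytes)) {
    const text = new TextDecoder("utf-8").decode(bytes.subarray(3));
    return { text, encoding: "utf8-bom" };
  }

  // 2. Strict UTF-8: throws on any invalid sequence.
  try {
    const text = new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return { text, encoding: "utf8" };
  } catch {
    // Not valid UTF-8; try the next candidate.
  }

  // 3. GB18030 fallback. The constructor throws if the runtime lacks the
  //    decoder, and decode() throws on bytes that are not valid GB18030.
  try {
    const text = new TextDecoder("gb18030", { fatal: true }).decode(bytes);
    return { text, encoding: "gb18030" };
  } catch {
    // Fall through to the safety net.
  }

  // 4. Last resort: lossy UTF-8 with replacement characters, so reads never throw.
  return { text: new TextDecoder("utf-8").decode(bytes), encoding: "utf8" };
}

const bomResult = decodeFileBufferSketch(new Uint8Array([0xef, 0xbb, 0xbf, 0x68, 0x69]));
const gbResult = decodeFileBufferSketch(new Uint8Array([0xd6, 0xd0, 0xce, 0xc4]));
console.log(bomResult.encoding, gbResult.encoding);
```

Checking the BOM before the strict decode matters: `TextDecoder` strips a leading BOM on its own by default, so without the explicit check the BOM would be lost and could not be written back.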

All 12 call sites above swap to these helpers. The edit pipeline now:

  1. Reads bytes → decodes via decodeFileBuffer → text string + detected encoding
  2. Applies SEARCH/REPLACE on the text (existing logic unchanged)
  3. Writes via encodeFile(after, encoding) so the file keeps its original encoding
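The three steps can be illustrated in memory, without touching the filesystem. This is a sketch under stated assumptions rather than the PR's code: `encodeFileSketch` and `editBytesSketch` are hypothetical names, only the UTF-8 and UTF-8 BOM cases are handled directly, and the GB18030 encoder is an injected callback standing in for the iconv-lite call the PR uses.

```typescript
type FileEncoding = "utf8" | "utf8-bom" | "gb18030";
type LegacyEncoder = (text: string) => Uint8Array;

const BOM_BYTES = new Uint8Array([0xef, 0xbb, 0xbf]);

// Hypothetical stand-in for encodeFile. GB18030 is delegated to a caller-supplied
// encoder because the standard TextEncoder only produces UTF-8.
function encodeFileSketch(
  text: string,
  encoding: FileEncoding,
  gb18030Encoder?: LegacyEncoder,
): Uint8Array {
  if (encoding === "gb18030") {
    if (!gb18030Encoder) throw new Error("no GB18030 encoder supplied");
    return gb18030Encoder(text);
  }
  const body = new TextEncoder().encode(text);
  if (encoding === "utf8") return body;
  // utf8-bom: put the three BOM bytes back in front of the encoded text.
  const out = new Uint8Array(BOM_BYTES.length + body.length);
  out.set(BOM_BYTES, 0);
  out.set(body, BOM_BYTES.length);
  return out;
}

// Decode, apply one SEARCH/REPLACE, and re-encode in the detected encoding.
function editBytesSketch(bytes: Uint8Array, search: string, replace: string): Uint8Array {
  // Step 1: decode and remember the encoding (only BOM vs. plain UTF-8 here).
  const hasBom =
    bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf;
  const encoding: FileEncoding = hasBom ? "utf8-bom" : "utf8";
  const before = new TextDecoder("utf-8", { fatal: true }).decode(
    hasBom ? bytes.subarray(3) : bytes,
  );

  // Step 2: text-level edit; a function replacer avoids "$" pattern expansion.
  if (!before.includes(search)) throw new Error("SEARCH text not found");
  const after = before.replace(search, () => replace);

  // Step 3: write back in the original encoding.
  return encodeFileSketch(after, encoding);
}

const original = encodeFileSketch("const a = 1;\r\n", "utf8-bom");
const edited = editBytesSketch(original, "a = 1", "a = 2");
console.log(Array.from(edited.subarray(0, 3)).map((b) => b.toString(16)));
```

Because the encoding travels alongside the text from step 1 to step 3, the SEARCH/REPLACE logic in step 2 never needs to know which encoding the file uses.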

EditSnapshot gains a prevEncoding?: FileEncoding field so restoreSnapshots round-trips correctly when undoing edits on GB18030 / BOM files. Field is optional for backward compat with any persisted older snapshots — restore defaults to UTF-8 when absent.
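A minimal sketch of the backward-compatible default, assuming a trimmed-down snapshot shape. `EditSnapshotSketch`, its `path` and `prevContent` fields, and `encodingForRestore` are hypothetical; only the optional `prevEncoding` field and the UTF-8 default come from the description above.

```typescript
type FileEncoding = "utf8" | "utf8-bom" | "gb18030";

// Hypothetical, reduced snapshot shape; the real EditSnapshot has its own fields.
interface EditSnapshotSketch {
  path: string;
  prevContent: string;
  prevEncoding?: FileEncoding; // absent on snapshots persisted before this change
}

// Pick the encoding to write back with when undoing an edit.
function encodingForRestore(snapshot: EditSnapshotSketch): FileEncoding {
  return snapshot.prevEncoding ?? "utf8";
}

// An older persisted snapshot has no prevEncoding field at all.
const legacy: EditSnapshotSketch = JSON.parse('{"path":"a.txt","prevContent":"x"}');
const current: EditSnapshotSketch = {
  path: "b.txt",
  prevContent: "y",
  prevEncoding: "gb18030",
};

console.log(encodingForRestore(legacy), encodingForRestore(current));
```

Keeping the field optional means older snapshots restore exactly as they did before, since UTF-8 was the only behavior available when they were written.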

What this PR does NOT do

  • Doesn't change line-ending handling (CRLF/LF detection at lineEndingOf stays the same — orthogonal to encoding).
  • Doesn't convert files. A GB18030 file stays GB18030. Normalizing all files to UTF-8 is out of scope; that's a destructive workflow we shouldn't ship silently.
  • Doesn't detect arbitrary encodings (UTF-16, Big5, Shift-JIS, etc.). UTF-8 and GB18030 cover the actual reported failures; adding more without real demand is over-fitting. If those hit later, the helper is the right place to grow.
  • Doesn't address the secondary request in #1445, "read_file / edit_file failures cause the edits made during a conversation turn to be lost" ("per-prompt undo of all changes including user manual edits"). The existing checkpoint infrastructure in src/code/checkpoints.ts covers that surface; a per-turn auto-snapshot hook is a separate PR if desired. With the root encoding bug fixed, edit_file shouldn't fail in the first place, which removes the trigger for the entire model-fallback chain the user described.

Test Plan

  • tests/file-encoding.test.ts — 8 cases covering plain UTF-8, multi-byte UTF-8, BOM strip + report, GB18030 fallback (Windows-only), empty file, round-trip in each encoding
  • tests/edit-blocks.test.ts — 3 new cases: edit GB18030 file stays GB18030 (asserts UTF-8 decode fails on the result); edit BOM file preserves the BOM bytes; snapshot+restore round-trips through GB18030
  • npx vitest run tests/file-encoding.test.ts tests/edit-blocks.test.ts tests/filesystem-tools.test.ts tests/comment-policy.test.ts — 168 pass
  • npx tsc --noEmit — clean
  • npx biome check on touched files — clean
  • npm run verify via pre-push hook — green

reasonix added 2 commits May 21, 2026 22:25
…8 BOM)

read_file / edit_file / multi_edit / applyEditBlock all hard-coded
UTF-8 in readFileSync + writeFileSync. On Chinese Windows hosts most
project files are GB18030; lossy UTF-8 decode mangled the bytes so
the model's SEARCH text never byte-matched, edit_file silently
failed, and the model fell back to PowerShell + git-replay (token-
expensive and lossy for any user manual edits in between).

The shell stdout path already handled this via smartDecodeOutput's
UTF-8-fatal → GB18030 fallback. Same pattern, applied to the
filesystem path, plus encode-back via iconv-lite so the file stays
in its original encoding. UTF-8 BOM round-trips too. Snapshot
records prevEncoding so restoreSnapshots writes back correctly.

Closes #1445

CI ubuntu build flagged the previous platform gate. GB18030 files
are portable: Linux developers can have CN-Windows-origin files in
their projects. The gate was a copy of smartDecodeOutput's
cmd.exe-specific behavior, which doesn't apply to file-on-disk
encoding detection.
@esengine esengine merged commit 1024a0c into main May 22, 2026
4 checks passed
@esengine esengine deleted the fix/file-encoding-edit-roundtrip branch May 22, 2026 05:36
esengine added a commit that referenced this pull request May 22, 2026
…se (#1565)

* chore(release): 0.49.0 — static-history TUI, queued steers, Bing default, lifecycle plans

Headline themes:
- TUI: Static-history renderer is the only path; virtual-viewport layers removed (#1529 stages 1-4)
- Chat: queued mid-turn steer handling so input mid-render doesn't drop or fight the live frame (#1501)
- Web search: default switches to Bing; dashboard engine switcher; Mojeek dropped (#1558)
- Plans: lifecycle evidence summaries surface why a plan is ready to accept (#1500)
- Desktop: native OS notifications for approvals + completion (#1519)
- i18n: CLI command output (/mcp /sessions /prune /theme) + approval-prompt labels translated (#1524, #1560)
- Security: SSRF block in web_fetch (#1544), edit-snapshot path containment (#1454), shell redirect sandbox (#1457), Task integrity guardrail (#1516)
- Tools: per-turn dispatch-rate limit (#1356); run_command discourages shell-based edits (#1514)
- Client: DeepSeek 429 → concurrency-limit hint (#1526); timeoutMs honored with AbortSignal (#1535); --no-proxy opt-out for direct route (#1507)
- Files: read/edit/restore preserves source encoding (GB18030 / UTF-8 BOM) (#1518)
- Context: pinned constraints survive folds + full tail capture (#1515, #1552)
- Refactor: lifecycle risk policy extracted into its own module (#1557)

See CHANGELOG for the full list.

* fix(context): align fold summary prefix with main agent for cache reuse

The summarizer call was sending a bespoke "You compress conversation
history" system prompt and no tools, guaranteeing a 0% cache hit
against the main agent's just-cached prefix. Reshape the request so
system + tools + head bytes mirror the live agent's last call — the
only novel bytes are the trailing summarize instruction.

Skill-pin handling now collects bodies read-only instead of stubbing
mid-head, so the cache prefix stays unbroken. The summarize
instruction names pinned skills so the model knows not to paraphrase
their bodies (which we append verbatim regardless).

Measured on a real session at 48.7K prompt tokens:
  OLD shape: 0.0% cache hit  → $0.145 per fold
  NEW shape: 99.6% cache hit → $0.015 per fold
  saving: 89.6% per fold

* tools: add fold-cache shape + live benchmarks

bench-fold-cache-shape.mjs replays real session jsonls, simulates
OLD vs NEW summary-call shapes at the fold point, and reports
byte-level shared-prefix with the main agent's preceding request.
Pure local — no API required.

bench-fold-cache-live.mjs sends one priming + two summary calls to
DeepSeek and reports prompt_cache_hit_tokens / cost for each shape.
Used to confirm the shape change actually translates to API-side
cache hits.

---------

Co-authored-by: reasonix <reasonix@deepseek.com>