fix(patch): unescape \t/\r in new_string when file region uses real control bytes (#33733)#33799
Merged
Conversation
…lized matches When the patch tool matches via the escape_normalized strategy, old_string contains literal \t, \n, \r sequences that get unescaped to match real control characters in the file. However, new_string was written as-is, leaving literal backslash sequences in the output. Add _unescape_common_sequences() helper and apply it to new_string when the matching strategy is escape_normalized. This ensures LLM-generated tab/newline sequences become real bytes in the patched file. Fixes #33733
…33733) Extends @liuhao1024's escape-normalized fix so the patch tool also recovers when old_string carries a real tab byte and matches via the `exact` strategy — which is the headline reproduction in the issue and the most common case in practice (LLMs frequently get old_string right because they re-read the file, but still serialize new_string's tabs as two-character `\t`). Instead of gating on the match strategy, decide per-sequence by looking at the *matched region of the file*: only convert `\t` -> tab and `\r` -> CR when the file region we're replacing actually contains the corresponding control byte. That mirrors the region-based heuristic in `_detect_escape_drift` and keeps legitimate writes of the literal two-character string `"\t"` (e.g. patching `sep = "\t"` in Python source) untouched — those files have a backslash+t in the matched region, not a real tab, so new_string passes through verbatim. `\n` is still excluded because newlines serialize correctly through JSON and unescaping would corrupt source escape sequences far more often than help. E2E verified against the live `patch` tool: tab-indented file + literal `\t` in new_string under both `exact` (Variant 1) and `escape_normalized` (Variant 2) strategies now produces real tab bytes; a Python source line containing `sep = "\t"` (legitimate literal backslash-t) survives a patch unchanged. Tests updated to cover both strategies and the legitimate-literal case, and to assert that `\n` is intentionally preserved. Refs #33733
Contributor
🔎 Lint report:
|
This was referenced May 28, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Patch tool no longer corrupts tab-indented files when the LLM serializes tabs as the two-character sequence
\tinnew_string. Salvages @liuhao1024's #33737 and widens it to cover the issue's headline reproduction (matched viaexactstrategy, notescape_normalized).Fixes #33733.
Root cause
LLMs frequently produce
\t(backslash + t, two characters) in JSON tool-call arguments where they meant a real tab byte. The patch tool wrote those literal characters straight to disk._strategy_escape_normalizedalready unescapedold_stringfor matching;new_stringhad no equivalent treatment.What changed vs #33737
#33737 gated the unescape on
strategy_name == "escape_normalized". The issue's repro has a real tab inold_string, so the match fires underexact— the gate never opens. This PR decides per-sequence by inspecting the matched region of the file: only convert\t-> tab (or\r-> CR) when the file region actually contains the corresponding control byte. Mirrors_detect_escape_drift's region-based heuristic and avoids touching legitimate writes of the literal two-character string"\t"(e.g.sep = "\t"in Python source).\nis intentionally excluded — newlines serialize correctly through JSON and unescaping would corrupt source escape sequences far more often than help.Changes
tools/fuzzy_match.py: replace_unescape_common_sequenceswith_maybe_unescape_new_string(new_string, content, matches); call it unconditionally fromfuzzy_find_and_replace(the helper itself bails out cheaply when there's nothing to do).tests/tools/test_fuzzy_match.py: rewriteTestEscapeNormalizedNewStringto cover bothexactandescape_normalizedstrategies, the legitimate-literal preservation case, and the\n-not-unescaped invariant.Validation
E2E against the real
patchtool, fixed code on disk:\t\t\t"\t"literal\+ t\+ t\+ t (new word)\+ t preserved ✓\'/\"drift guard'\'-drifted\'-driftedTargeted tests: 220 passing across
test_fuzzy_match.py,test_patch_failure_tracking.py,test_patch_parser.py,test_file_operations*.py,test_file_tools.py.Credit
Cherry-picked @liuhao1024's commit from #33737 with authorship preserved; follow-up commit widens the fix. Closes #33737 (superseded) and #33745 (closed by author).
Infographic