Skip to content

fix(patch): unescape \t/\r in new_string when file region uses real control bytes (#33733)#33799

Merged
teknium1 merged 2 commits into
mainfrom
hermes/hermes-6ef29853
May 28, 2026
Merged

fix(patch): unescape \t/\r in new_string when file region uses real control bytes (#33733)#33799
teknium1 merged 2 commits into
mainfrom
hermes/hermes-6ef29853

Conversation

@teknium1

@teknium1 teknium1 commented May 28, 2026

Copy link
Copy Markdown
Contributor

Summary

Patch tool no longer corrupts tab-indented files when the LLM serializes tabs as the two-character sequence \t in new_string. Salvages @liuhao1024's #33737 and widens it to cover the issue's headline reproduction (matched via exact strategy, not escape_normalized).

Fixes #33733.

Root cause

LLMs frequently produce \t (backslash + t, two characters) in JSON tool-call arguments where they meant a real tab byte. The patch tool wrote those literal characters straight to disk. _strategy_escape_normalized already unescaped old_string for matching; new_string had no equivalent treatment.

What changed vs #33737

#33737 gated the unescape on strategy_name == "escape_normalized". The issue's repro has a real tab in old_string, so the match fires under exact — the gate never opens. This PR decides per-sequence by inspecting the matched region of the file: only convert \t -> tab (or \r -> CR) when the file region actually contains the corresponding control byte. Mirrors _detect_escape_drift's region-based heuristic and avoids touching legitimate writes of the literal two-character string "\t" (e.g. sep = "\t" in Python source). \n is intentionally excluded — newlines serialize correctly through JSON and unescaping would corrupt source escape sequences far more often than help.

Changes

  • tools/fuzzy_match.py: replace _unescape_common_sequences with _maybe_unescape_new_string(new_string, content, matches); call it unconditionally from fuzzy_find_and_replace (the helper itself bails out cheaply when there's nothing to do).
  • tests/tools/test_fuzzy_match.py: rewrite TestEscapeNormalizedNewString to cover both exact and escape_normalized strategies, the legitimate-literal preservation case, and the \n-not-unescaped invariant.

Validation

E2E against the real patch tool, fixed code on disk:

variant file has old_string new_string result
1 — issue headline real tab real tab literal \t real tab written ✓
2 — both literal real tab literal \t literal \t real tab written ✓
3 — legit "\t" literal \ + t \ + t \ + t (new word) \ + t preserved ✓
4 — passthrough none none none no-op ✓
5 — \'/\" drift guard real ' \'-drifted \'-drifted drift error still fires ✓

Targeted tests: 220 passing across test_fuzzy_match.py, test_patch_failure_tracking.py, test_patch_parser.py, test_file_operations*.py, test_file_tools.py.

Credit

Cherry-picked @liuhao1024's commit from #33737 with authorship preserved; follow-up commit widens the fix. Closes #33737 (superseded) and #33745 (closed by author).

Infographic

patch-tool-tab-byte-recovery

liuhao1024 and others added 2 commits May 28, 2026 02:58
…lized matches

When the patch tool matches via the escape_normalized strategy, old_string
contains literal \t, \n, \r sequences that get unescaped to match real
control characters in the file. However, new_string was written as-is,
leaving literal backslash sequences in the output.

Add _unescape_common_sequences() helper and apply it to new_string when
the matching strategy is escape_normalized. This ensures LLM-generated
tab/newline sequences become real bytes in the patched file.

Fixes #33733
…33733)

Extends @liuhao1024's escape-normalized fix so the patch tool also
recovers when old_string carries a real tab byte and matches via the
`exact` strategy — which is the headline reproduction in the issue and
the most common case in practice (LLMs frequently get old_string right
because they re-read the file, but still serialize new_string's tabs as
two-character `\t`).

Instead of gating on the match strategy, decide per-sequence by looking
at the *matched region of the file*: only convert `\t` -> tab and
`\r` -> CR when the file region we're replacing actually contains the
corresponding control byte. That mirrors the region-based heuristic in
`_detect_escape_drift` and keeps legitimate writes of the literal
two-character string `"\t"` (e.g. patching `sep = "\t"` in Python
source) untouched — those files have a backslash+t in the matched
region, not a real tab, so new_string passes through verbatim. `\n` is
still excluded because newlines serialize correctly through JSON and
unescaping would corrupt source escape sequences far more often than
help.

E2E verified against the live `patch` tool: tab-indented file + literal
`\t` in new_string under both `exact` (Variant 1) and `escape_normalized`
(Variant 2) strategies now produces real tab bytes; a Python source line
containing `sep = "\t"` (legitimate literal backslash-t) survives a
patch unchanged.

Tests updated to cover both strategies and the legitimate-literal case,
and to assert that `\n` is intentionally preserved.

Refs #33733
@github-actions

Copy link
Copy Markdown
Contributor

🔎 Lint report: hermes/hermes-6ef29853 vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9543 on HEAD, 9543 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 5028 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

@teknium1 teknium1 merged commit 78be458 into main May 28, 2026
25 checks passed
@teknium1 teknium1 deleted the hermes/hermes-6ef29853 branch May 28, 2026 10:27
@alt-glitch alt-glitch added P2 Medium — degraded but workaround exists type/bug Something isn't working comp/tools Tool registry, model_tools, toolsets tool/file File tools (read, write, patch, search) labels May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/tools Tool registry, model_tools, toolsets P2 Medium — degraded but workaround exists tool/file File tools (read, write, patch, search) type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

patch tool: new_string escape sequences (\t) get written literally

3 participants