worktree: Fix binary files misdetected as UTF-16 #50890

Merged
probably-neb merged 4 commits into zed-industries:main from notJoon:fix/binary-file-detection-wav
Mar 17, 2026
@notJoon (Contributor) commented Mar 6, 2026

Closes #50785

Opening a .wav file (PCM 16-bit) caused Zed to freeze because the binary detection heuristic in analyze_byte_content misidentified it as UTF-16LE text. The heuristic decides on UTF-16 encoding solely by checking whether null bytes are skewed toward even or odd positions. PCM 16-bit audio with small positive sample values (stored little-endian) produces byte pairs like [sample, 0x00], creating an alternating null pattern at odd positions that is indistinguishable from BOM-less UTF-16LE by position alone.
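To make the collision concrete, here is a minimal sketch of the position-based signal. The helpers `null_counts`, `utf16le`, and `pcm16le` are hypothetical names for illustration and are not taken from Zed's `analyze_byte_content`; they only show that ASCII-range UTF-16LE text and quiet PCM audio can produce the same null-byte distribution.

```rust
/// Hypothetical helper (not Zed's code): counts null bytes found at even
/// and at odd byte offsets, returned as `(even, odd)`.
fn null_counts(bytes: &[u8]) -> (usize, usize) {
    let mut even = 0;
    let mut odd = 0;
    for (i, &b) in bytes.iter().enumerate() {
        if b == 0 {
            if i % 2 == 0 {
                even += 1;
            } else {
                odd += 1;
            }
        }
    }
    (even, odd)
}

/// Encodes text as BOM-less UTF-16LE bytes.
fn utf16le(text: &str) -> Vec<u8> {
    text.encode_utf16().flat_map(|unit| unit.to_le_bytes()).collect()
}

/// Encodes 16-bit PCM samples as little-endian bytes, the layout used by
/// PCM WAV sample data.
fn pcm16le(samples: &[i16]) -> Vec<u8> {
    samples.iter().flat_map(|sample| sample.to_le_bytes()).collect()
}
```

For example, `null_counts(&utf16le("hello"))` and `null_counts(&pcm16le(&[3, 7, 1, 12, 5]))` both report every null byte at an odd offset, so a check that looks only at positions cannot tell the two inputs apart.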

Why not just add more binary headers?

The initial approach (32d8bd7) was to add audio format signatures (RIFF, OGG, FLAC, MP3) to the list of known binary headers. While this solved the reported .wav case, any binary format containing small 16-bit values (audio, images, or arbitrary data) would still be misclassified. Adding headers is an endless game that cannot cover unknown or uncommon formats.

Changes

  • Adds is_plausible_utf16_text as a secondary validation: when the null byte skew suggests UTF-16, decode the bytes and count code units that fall in the C0/C1 control character ranges (U+0000–U+001F, U+007F–U+009F, excluding common whitespace) or form unpaired surrogates. Real UTF-16 text has near-zero such characters. I've set the threshold at 2%. Note that this is an empirically derived value, not based on any formal standard.
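The validation described above could look roughly like the sketch below. Only the function name and the 2% threshold come from this PR description; the signature, the little-endian assumption, the choice of tab, line feed, and carriage return as "common whitespace", and the handling of empty input are assumptions made for illustration, so the merged code may differ.

```rust
/// Sketch only: reconstructed from the PR description, not the merged code.
/// Assumes BOM-less little-endian input and a 2% threshold.
fn is_plausible_utf16_text(bytes: &[u8]) -> bool {
    // Group the bytes into 16-bit little-endian code units; a trailing odd
    // byte is ignored.
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    if units.is_empty() {
        // Assumption: with nothing to validate, do not claim UTF-16.
        return false;
    }

    let mut suspicious = 0usize;
    for decoded in char::decode_utf16(units.iter().copied()) {
        match decoded {
            // Assumed set of "common whitespace" that is not penalized.
            Ok('\t' | '\n' | '\r') => {}
            // `char::is_control` covers U+0000..=U+001F and U+007F..=U+009F.
            Ok(c) if c.is_control() => suspicious += 1,
            Ok(_) => {}
            // An `Err` from `decode_utf16` is an unpaired surrogate.
            Err(_) => suspicious += 1,
        }
    }

    // Plausible text if at most 2% of the code units are suspicious.
    suspicious * 100 <= units.len() * 2
}
```

Under these assumptions, ordinary text encoded as UTF-16LE passes, while a run of small PCM samples such as 3, 7, 1, 12 decodes almost entirely to control characters and is rejected. Integer arithmetic is used for the ratio to avoid floating-point comparisons.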

Before fix

[Screenshot, 2026-03-06 9:00:07 PM]

After fix
[Screenshot, 2026-03-06 1:17:43 AM]

Before you mark this PR as ready for review, make sure that you have:

  • Added solid test coverage and/or screenshots from doing manual testing
  • Done a self-review taking into account security and performance aspects
  • Aligned any UI changes with the UI checklist

Release Notes:

  • Fixed binary files (e.g. WAV) being misdetected as UTF-16 text, causing Zed to freeze.

@cla-bot added the cla-signed label (the user has signed the Contributor License Agreement) Mar 6, 2026
@zed-community-bot added the first contribution label (the author's first pull request to Zed; applied automatically via GitHub Actions) Mar 6, 2026
@notJoon marked this pull request as ready for review March 6, 2026 12:01
@zelenenka added the guild label (pull requests by someone in Zed Guild; applied automatically via GitHub Actions) Mar 16, 2026
@probably-neb enabled auto-merge (squash) March 17, 2026 02:40
@probably-neb merged commit f7ec531 into zed-industries:main Mar 17, 2026
48 checks passed
@notJoon deleted the fix/binary-file-detection-wav branch March 17, 2026 05:16
@yeskunall (Member) commented

Hey @notJoon -- missed your first PR here: #51108 but congratulations on your 1st and 2nd contribution to Zed! 💖

Development

Successfully merging this pull request may close these issues.

Accidentally opening .wav causes freeze/crash.
