worktree: Fix binary files misdetected as UTF-16 #50890
Merged
probably-neb merged 4 commits into zed-industries:main on Mar 17, 2026
Closes #50785
Opening a .wav file (PCM 16-bit) caused Zed to freeze because the binary detection heuristic in analyze_byte_content misidentified it as UTF-16LE text. The heuristic determines UTF-16 encoding solely by checking whether null bytes are skewed toward even or odd positions. PCM 16-bit audio with small sample values produces bytes like [sample, 0x00], creating an alternating null pattern at odd positions that is indistinguishable from BOM-less UTF-16LE by position alone.

Why not just add more binary headers?
The initial approach (32d8bd7) was to add audio format signatures (RIFF, OGG, FLAC, MP3) to the list of known binary headers. While this solved the reported .wav case, any binary format containing small 16-bit values (audio, images, or arbitrary data) would still be misclassified. Adding headers is an endless game that cannot cover unknown or uncommon formats.

Changes
Add is_plausible_utf16_text as a secondary validation: when the null byte skew suggests UTF-16, decode the bytes and count code units that fall in the C0/C1 control character ranges (U+0000–U+001F, U+007F–U+009F, excluding common whitespace) or form unpaired surrogates. Real UTF-16 text has near-zero such characters. I've set the threshold at 2%. Note that this is an empirically derived value, not based on any formal standard.

Before fix
After fix
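For readers who want to see the idea in code, here is a minimal sketch of the two-stage check described under Changes. It is not the code from this PR: the function names null_skew_suggests_utf16le and is_plausible_utf16le_text are hypothetical, the skew test is simplified to "more nulls at odd than at even positions", and "common whitespace" is assumed to mean tab, LF, and CR. Only the 2% threshold and the categories of suspicious code units come from the description above.

```rust
/// Hypothetical, simplified stand-in for the position-only heuristic:
/// reports UTF-16LE when null bytes sit mostly at odd offsets.
fn null_skew_suggests_utf16le(bytes: &[u8]) -> bool {
    let (mut even_nulls, mut odd_nulls) = (0usize, 0usize);
    for (i, &b) in bytes.iter().enumerate() {
        if b == 0 {
            if i % 2 == 0 {
                even_nulls += 1;
            } else {
                odd_nulls += 1;
            }
        }
    }
    odd_nulls > even_nulls
}

/// Hypothetical sketch of the secondary validation: decode the bytes as
/// UTF-16LE and reject the input when more than 2% of the code units are
/// control characters (other than tab, LF, CR) or unpaired surrogates.
fn is_plausible_utf16le_text(bytes: &[u8]) -> bool {
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    if units.is_empty() {
        return false;
    }

    let mut suspicious = 0usize;
    for decoded in char::decode_utf16(units.iter().copied()) {
        match decoded {
            Ok(c) => {
                let cp = c as u32;
                // Assumption: only these three count as "common whitespace".
                let is_common_whitespace = matches!(c, '\t' | '\n' | '\r');
                // C0 controls (U+0000..=U+001F) and DEL plus C1 (U+007F..=U+009F).
                let is_control = cp <= 0x1F || (0x7F..=0x9F).contains(&cp);
                if is_control && !is_common_whitespace {
                    suspicious += 1;
                }
            }
            // decode_utf16 yields Err for every unpaired surrogate.
            Err(_) => suspicious += 1,
        }
    }

    // Integer form of: suspicious / units.len() <= 0.02
    suspicious * 100 <= units.len() * 2
}
```

With these two functions, a buffer of small 16-bit PCM samples passes the skew test but fails the plausibility test, because almost every sample decodes to a C0 control character. Genuine UTF-16LE text such as "Hello, world!\n" passes both.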

Release Notes: