Special Character Combinations Display as Garbled Text under UTF-16 LE BOM #3367
Description
The issue occurs when bat outputs the word combination 加上一个 (which translates as "plus one more") from a file encoded as UTF-16 LE with BOM. After some investigation, I found that certain common characters also exhibit this problem:
- 上 U+4E0A
- 一 U+4E00
- 伊 U+4F0A
- 刀 U+5200
Upon closer inspection, I found that this issue is closely related to the Unicode encoding of these character combinations. In Little Endian, the actual storage representation of 上一 is 0x0A4E004E. I suspect that 0A is interpreted as LF (line feed) and, together with the following 00, is treated as something else.
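To make the byte layout concrete, here is a small sketch that prints the UTF-16 LE bytes of the four characters listed above and flags any byte equal to 0x0A (ASCII LF). The helper name `utf16le_bytes` is only illustrative and not part of bat or any other tool.

```python
# Illustrative only: show the UTF-16 LE bytes of each character from the report
# and flag any byte equal to 0x0A (ASCII line feed).
chars = "上一伊刀"


def utf16le_bytes(ch):
    """Return the UTF-16 LE encoding of a single character (no BOM)."""
    return ch.encode("utf-16le")


for ch in chars:
    raw = utf16le_bytes(ch)
    has_lf = 0x0A in raw  # True when one of the two bytes is 0x0A
    print(f"U+{ord(ch):04X} {ch} -> {raw.hex(' ')}  contains 0x0A: {has_lf}")
```

Only 上 (0a 4e) and 伊 (0a 4f) carry a 0x0A byte, which matches the places where the output breaks.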
Unfortunately, cat, less, tail, awk, and sed all share this issue, whereas ripgrep handles the file correctly.
Given that bat claims to natively support UTF-8 as well as UTF-16, I believe this is a bug worth reporting.
As for why I save files as UTF-16 LE with BOM: it stems from compatibility issues with the Windows operating system. Some .ini files and Windows Registry export files use UCS-2 LE BOM encoding (a subset of UTF-16), and I view these files in the terminal.
What steps will reproduce the bug?
- Create a text file with the following content and save it with UTF-16 LE with BOM encoding:
上一伊刀
You can use this python3 script to create it:
import codecs

# Write the BOM followed by the four test characters as UTF-16 LE.
with open("utf16le-bom.txt", "wb") as f:
    f.write(codecs.BOM_UTF16_LE)
    f.write("上一伊刀".encode("utf-16le"))
- Use bat to view the file.
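The UTF-16 BE file that appears later in this report can be produced the same way. This is a sketch that assumes the filename utf16be-bom.txt from the later output; the helper `write_utf16be_bom` is a name made up for this example.

```python
import codecs


def write_utf16be_bom(path, text):
    """Write `text` as UTF-16 BE, preceded by the big-endian BOM (FE FF)."""
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF16_BE)
        f.write(text.encode("utf-16be"))


# Filename assumed from the bat invocation shown further down.
write_utf16be_bom("utf16be-bom.txt", "上一伊刀")
```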
What happens?
The content is displayed across three lines, with garbled text starting from the second character:
$ bat --style=header utf16le-bom.txt
File: utf16le-bom.txt <UTF-16LE>
上�
O
�
$ bat --plain utf16le-bom.txt | hexyl
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ ff fe 0a 4e 00 4e 0a 4f ┊ 00 52 │××_N0N_O┊0R │
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘
$ hexyl utf16le-bom.txt
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ ff fe 0a 4e 00 4e 0a 4f ┊ 00 52 │××_N0N_O┊0R │
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘
$ cat utf16le-bom.txt
��
NN
OR
$ rg . utf16le-bom.txt
1:上一伊刀
Additionally, if the file is encoded as UTF-16 BE with BOM, the result is:
$ bat --style=header utf16be-bom.txt
File: utf16be-bom.txt <UTF-16BE>
上
一伊
刀
What did you expect to happen instead?
The file should display normally, showing the four characters 上一伊刀, as it does when opened in VSCode.
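The difference between the observed and the expected result can be sketched in a few lines. This is a hypothesis about the cause, not code taken from bat: it assumes a reader that cuts the raw bytes after every 0x0A before decoding, and compares that with decoding the whole buffer first. Both function names are invented for the example.

```python
import codecs

text = "上一伊刀"
data = codecs.BOM_UTF16_BE + text.encode("utf-16be")


def split_bytes_then_decode(raw):
    """Hypothetical byte-oriented reader: cut after each 0x0A byte, then decode."""
    payload = raw[len(codecs.BOM_UTF16_BE):]  # drop the 2-byte BOM
    # bytes.splitlines splits on raw 0x0A / 0x0D bytes, regardless of encoding.
    pieces = payload.splitlines(keepends=True)
    # Each piece keeps its trailing 0x0A, which here is really the low byte
    # of a character, so the pieces still decode as UTF-16 BE.
    return [p.decode("utf-16be", errors="replace") for p in pieces]


def decode_then_split(raw):
    """Expected behaviour: decode the whole buffer first, then split into lines."""
    return raw.decode("utf-16").splitlines()


print(split_bytes_then_decode(data))
print(decode_then_split(data))
```

Under this assumption the first function yields three pieces (上, 一伊, 刀), the same grouping as the UTF-16 BE output above, while the second yields the single line 上一伊刀.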
How did you install bat?
The issue was verified on both Ubuntu (WSL2) and Windows:
- Download bat_0.25.0_amd64.deb from the GitHub release and install it with dpkg -i
- winget install sharkdp.bat jftuga.less
bat version and environment
Output from `bat --diagnostic`
Software version
bat 0.25.0 (25f4f96)
Operating system
Linux 6.6.87.2-microsoft-standard-WSL2
Command-line
bat --diagnostic
Environment variables
BAT_CACHE_PATH=<not set>
BAT_CONFIG_PATH=<not set>
BAT_OPTS=<not set>
BAT_PAGER=<not set>
BAT_PAGING=<not set>
BAT_STYLE=<not set>
BAT_TABS=<not set>
BAT_THEME=<not set>
COLORTERM=<not set>
LANG=zh_CN.UTF-8
LC_ALL=<not set>
LESS=<not set>
MANPAGER=<not set>
NO_COLOR=<not set>
PAGER=<not set>
SHELL=/bin/bash
TERM=xterm-256color
XDG_CACHE_HOME=<not set>
XDG_CONFIG_HOME=<not set>
System Config file
Could not read contents of '/etc/bat/config': No such file or directory (os error 2).
Config file
Could not read contents of '/home/enihsyou/.config/bat/config': No such file or directory (os error 2).
Custom assets metadata
Could not read contents of '/home/enihsyou/.cache/bat/metadata.yaml': No such file or directory (os error 2).
Custom assets
'/home/enihsyou/.cache/bat' not found
Compile time information
- Profile: release
- Target triple: x86_64-unknown-linux-gnu
- Family: unix
- OS: linux
- Architecture: x86_64
- Pointer width: 64
- Endian: little
- CPU features: fxsr,sse,sse2
- Host: x86_64-unknown-linux-gnu
Less version
> less --version
less 590 (GNU regular expressions)
Copyright (C) 1984-2021 Mark Nudelman
less comes with NO WARRANTY, to the extent permitted by law.
For information about the terms of redistribution,
see the file named README in the less distribution.
Home page: https://greenwoodsoftware.com/less