Special Character Combinations Display as Garbled Text under UTF-16 LE BOM #3367

@enihsyou

Description

The issue occurs when bat prints the phrase 加上一个 (roughly "plus one more") from a file encoded as UTF-16 LE with BOM. After some investigation, I found that several other common characters trigger the same problem:

  • 上 U+4E0A
  • 一 U+4E00
  • 伊 U+4F0A
  • 刀 U+5200

On closer inspection, the issue is tied to the byte values these characters take in UTF-16. In little-endian order, 上一 is stored as the byte sequence 0A 4E 00 4E. I suspect that 0A is interpreted as LF (line feed) and, together with the following 00, is treated as something else.
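To make the byte layout concrete, here is a small Python sketch that prints the UTF-16 LE bytes of the characters listed above. It only illustrates the encoding; it says nothing about how bat itself reads the file.

```python
# Encode the pair from the report and show its raw UTF-16 LE bytes.
pair = "上一".encode("utf-16-le")
print(pair.hex(" "))  # 0a 4e 00 4e

# Show each affected character with its code point and byte layout.
for ch in "上一伊刀":
    units = ch.encode("utf-16-le")
    if 0x0A in units:
        note = "contains 0x0A (LF)"
    elif 0x00 in units:
        note = "contains 0x00 (NUL)"
    else:
        note = ""
    print(f"{ch} U+{ord(ch):04X} -> {units.hex(' ')} {note}")
```

Characters whose code point ends in 0A (such as 上 and 伊) put a 0x0A byte first in little-endian order, which is the byte a line-oriented tool would treat as a newline.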

Unfortunately, cat, less, tail, awk, and sed all share this issue, while ripgrep handles the file correctly.
Given that bat claims to natively support UTF-8 as well as UTF-16, I believe this is a bug worth reporting.

As for why I save files as UTF-16 LE with BOM: it is for compatibility with Windows. Some .ini files and exported Windows Registry files use UCS-2 LE BOM encoding (a subset of UTF-16), and I view these files in the terminal.

What steps will reproduce the bug?

  1. Create a text file with the following content and save it with UTF-16 LE with BOM encoding:
上一伊刀

You can use this Python 3 script to create it:

import codecs

# Write the UTF-16 LE BOM (FF FE), then the four characters as UTF-16 LE.
with open("utf16le-bom.txt", "wb") as f:
    f.write(codecs.BOM_UTF16_LE)
    f.write("上一伊刀".encode("utf-16le"))
  2. Use bat to view the file.

What happens?

The content is displayed across three lines, with garbled text starting from the second character:

$ bat --style=header utf16le-bom.txt
File: utf16le-bom.txt   <UTF-16LE>
上�
੎O


$ bat --plain utf16le-bom.txt | hexyl
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ ff fe 0a 4e 00 4e 0a 4f ┊ 00 52                   │××_N0N_O┊0R      │
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘

$ hexyl utf16le-bom.txt
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ ff fe 0a 4e 00 4e 0a 4f ┊ 00 52                   │××_N0N_O┊0R      │
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘

$ cat utf16le-bom.txt
��
NN
OR

$ rg . utf16le-bom.txt
1:上一伊刀
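
The three-line output is consistent with the file being split on the raw byte 0x0A before it is decoded. The following Python sketch shows what such a byte-level split does to this file. It is an illustration of the suspected mechanism only, not bat's actual code path, which I have not inspected.

```python
import codecs

# Same bytes as utf16le-bom.txt: BOM followed by the four characters.
data = codecs.BOM_UTF16_LE + "上一伊刀".encode("utf-16-le")

# Naive approach: treat every 0x0A byte as a line terminator.
chunks = data.split(b"\n")

for chunk in chunks:
    parity = "odd length" if len(chunk) % 2 else "even length"
    print(chunk.hex(" "), "->", parity)
```

The split yields three chunks, and two of them have an odd number of bytes, so they can no longer be decoded as whole UTF-16 code units.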

Additionally, when the same content is encoded as UTF-16 BE with BOM, the result is:

$ bat --style=header utf16be-bom.txt
File: utf16be-bom.txt   <UTF-16BE>

一伊

What did you expect to happen instead?

The file should display normally, showing the four characters 上一伊刀, as it does when opened in VSCode.
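
For comparison, here is a minimal Python sketch of the behavior I would expect: decode the whole buffer as UTF-16 first, then split into lines on the decoded text. The helper name `read_lines` is hypothetical and used only for this example; this is not a proposal for how bat should implement it.

```python
import codecs

def read_lines(data: bytes) -> list[str]:
    # The "utf-16" codec consumes the BOM and picks the byte order from it.
    text = data.decode("utf-16")
    # Split on line boundaries only after decoding, so a 0x0A byte inside
    # a code unit is never mistaken for a newline.
    return text.splitlines()

le = codecs.BOM_UTF16_LE + "上一伊刀".encode("utf-16-le")
be = codecs.BOM_UTF16_BE + "上一伊刀".encode("utf-16-be")

print(read_lines(le))
print(read_lines(be))
```

Both the little-endian and big-endian inputs come back as a single line containing the four characters.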

How did you install bat?

The issue was verified on both Ubuntu (WSL2) and Windows:

  • Download bat_0.25.0_amd64.deb from the GitHub release and install it with dpkg -i
  • winget install sharkdp.bat jftuga.less

bat version and environment

Output from `bat --diagnostic`

Software version

bat 0.25.0 (25f4f96)

Operating system

Linux 6.6.87.2-microsoft-standard-WSL2

Command-line

bat --diagnostic 

Environment variables

BAT_CACHE_PATH=<not set>
BAT_CONFIG_PATH=<not set>
BAT_OPTS=<not set>
BAT_PAGER=<not set>
BAT_PAGING=<not set>
BAT_STYLE=<not set>
BAT_TABS=<not set>
BAT_THEME=<not set>
COLORTERM=<not set>
LANG=zh_CN.UTF-8
LC_ALL=<not set>
LESS=<not set>
MANPAGER=<not set>
NO_COLOR=<not set>
PAGER=<not set>
SHELL=/bin/bash
TERM=xterm-256color
XDG_CACHE_HOME=<not set>
XDG_CONFIG_HOME=<not set>

System Config file

Could not read contents of '/etc/bat/config': No such file or directory (os error 2).

Config file

Could not read contents of '/home/enihsyou/.config/bat/config': No such file or directory (os error 2).

Custom assets metadata

Could not read contents of '/home/enihsyou/.cache/bat/metadata.yaml': No such file or directory (os error 2).

Custom assets

'/home/enihsyou/.cache/bat' not found

Compile time information

  • Profile: release
  • Target triple: x86_64-unknown-linux-gnu
  • Family: unix
  • OS: linux
  • Architecture: x86_64
  • Pointer width: 64
  • Endian: little
  • CPU features: fxsr,sse,sse2
  • Host: x86_64-unknown-linux-gnu

Less version

> less --version 
less 590 (GNU regular expressions)
Copyright (C) 1984-2021  Mark Nudelman

less comes with NO WARRANTY, to the extent permitted by law.
For information about the terms of redistribution,
see the file named README in the less distribution.
Home page: https://greenwoodsoftware.com/less

Labels

bug: Something isn't working
