UTF16 Encode Decode Tool for Text and Files

This UTF-16 encoder and decoder is built for practical troubleshooting rather than theory alone. The live workflow supports both text and files.

About the UTF-16 Encoder / Decoder

The UTF-16 Encoder / Decoder converts text to UTF-16 byte sequences and decodes UTF-16 encoded data back to readable Unicode characters. UTF-16 is the native string encoding in Windows, Java, JavaScript, and .NET, which makes this tool useful for debugging file I/O, inter-process communication, and API payloads in those ecosystems.

How to Use

To encode: Paste your text into the input field and click Encode. The tool outputs the UTF-16 byte sequence in hex, showing each code unit as a 2-byte (4 hex character) value.
To decode: Paste a UTF-16 hex byte sequence into the input field and click Decode to recover the original text.
Select byte order — UTF-16 LE (Little Endian) or UTF-16 BE (Big Endian) — to match your source system.
Toggle BOM inclusion on or off. Most files should include a BOM; programmatic byte arrays often omit it.
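
The same encode and decode steps can be reproduced outside the tool. The following is a minimal Python sketch, not the tool's own implementation; the function names and the space-separated uppercase hex format are assumptions chosen for illustration, and `bytes.hex(" ")` needs Python 3.8 or later.

```python
import codecs


def encode_utf16_hex(text: str, little_endian: bool = True,
                     include_bom: bool = False) -> str:
    """Return the UTF-16 bytes of `text` as space-separated hex pairs."""
    # The explicit -le / -be codecs never write a BOM on their own.
    data = text.encode("utf-16-le" if little_endian else "utf-16-be")
    if include_bom:
        bom = codecs.BOM_UTF16_LE if little_endian else codecs.BOM_UTF16_BE
        data = bom + data
    return data.hex(" ").upper()


def decode_utf16_hex(hex_text: str, little_endian: bool = True) -> str:
    """Decode hex pairs back to text; a leading BOM overrides `little_endian`."""
    data = bytes.fromhex(hex_text)  # whitespace between pairs is ignored
    if data.startswith(codecs.BOM_UTF16_LE):
        data, little_endian = data[2:], True
    elif data.startswith(codecs.BOM_UTF16_BE):
        data, little_endian = data[2:], False
    return data.decode("utf-16-le" if little_endian else "utf-16-be")


print(encode_utf16_hex("A"))                    # 41 00
print(encode_utf16_hex("A", include_bom=True))  # FF FE 41 00
print(decode_utf16_hex("FE FF 00 41"))          # A
```

Letting a BOM override the requested byte order is a design choice for this sketch; a stricter decoder might instead report the mismatch.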

How UTF-16 Encoding Works

UTF-16 represents Unicode code points using one or two 16-bit code units (2 or 4 bytes):

BMP characters (U+0000–U+FFFF, excluding the surrogate range U+D800–U+DFFF): Encoded as a single 16-bit code unit whose value equals the code point. Most Latin, Cyrillic, Greek, Arabic, Hebrew, CJK, and Hiragana/Katakana characters fall in this range.
Supplementary characters (U+10000–U+10FFFF): Encoded as a surrogate pair of two 16-bit code units. Subtracting 0x10000 from the code point leaves a 20-bit value; the high surrogate (U+D800–U+DBFF) carries its upper 10 bits and the low surrogate (U+DC00–U+DFFF) carries its lower 10 bits. Most emoji and many historic scripts require surrogate pairs.
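
The surrogate-pair arithmetic is short enough to write out. This is a sketch of the standard calculation in Python; the function names are illustrative, and the type hints assume Python 3.9 or later.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a valid (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)   # U+1F600, an emoji
print(f"{high:04X} {low:04X}")           # D83D DE00
print(hex(from_surrogate_pair(high, low)))  # 0x1f600
```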

Little Endian vs Big Endian

UTF-16 is a 16-bit encoding and each code unit can be stored with the low byte first (LE) or the high byte first (BE):

UTF-16 LE: Low byte stored first. Default on Windows and x86 systems. Example: A (U+0041) → 41 00.
UTF-16 BE: High byte stored first. Used in network protocols (big-endian is network byte order), and it is the order Java's UTF-16 charset assumes when decoding data that has no BOM. Example: A (U+0041) → 00 41.
BOM (Byte Order Mark): The character U+FEFF placed at the start of a file signals the byte order. The bytes FF FE indicate LE; FE FF indicate BE. Without a BOM, the receiver must know or guess the byte order.
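
BOM detection comes down to comparing the first two bytes. The sketch below is one possible Python helper; its name and its `default` parameter are assumptions for illustration, and the fallback used when no BOM is present should be set from what is known about the data source.

```python
import codecs


def detect_utf16_byte_order(data: bytes, default: str = "be") -> tuple[str, bytes]:
    """Return (byte_order, payload) where payload has any BOM removed."""
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "le", data[2:]
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "be", data[2:]
    # No BOM: the caller's default is only a guess.
    return default, data


order, payload = detect_utf16_byte_order(b"\xff\xfe\x41\x00")
print(order, payload.decode(f"utf-16-{order}"))  # le A
```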

UTF-16 vs UTF-8

ASCII content — UTF-8 uses 1 byte per ASCII character; UTF-16 always uses 2. For English-heavy text, UTF-8 files are roughly half the size of UTF-16.
CJK content — CJK characters use 3 bytes in UTF-8 but only 2 in UTF-16 (for BMP characters). UTF-16 is more compact for Chinese, Japanese, and Korean text.
Web compatibility — UTF-8 is the web standard. UTF-16 is the internal string format for Windows, Java, JavaScript engines, and .NET. Cross-boundary data should be converted to UTF-8.
Null bytes — UTF-16 contains null bytes for ASCII characters (e.g., A → 41 00), which breaks C-string handling. UTF-8 never contains null bytes except for the NUL character itself.
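
The size and null-byte differences are easy to check directly. This is a small Python sketch with an illustrative helper name; sizes are measured without a BOM.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Byte counts for the same text in UTF-8 and UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


print(encoded_sizes("hello"))               # {'utf-8': 5, 'utf-16': 10}
print(encoded_sizes("\u65e5\u672c\u8a9e"))  # three CJK characters: {'utf-8': 9, 'utf-16': 6}

# ASCII text in UTF-16 contains zero bytes; the UTF-8 form does not.
print(b"\x00" in "A".encode("utf-16-le"))   # True
print(b"\x00" in "A".encode("utf-8"))       # False
```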

Common UTF-16 Issues

Incorrect byte order: Reading UTF-16 LE data as UTF-16 BE (or vice versa) produces completely garbled output. Always check for a BOM or confirm the byte order from the data source documentation.
Broken surrogate pairs: Cutting a UTF-16 string between the two code units of a surrogate pair leaves an invalid lone surrogate, and cutting a byte buffer at an odd offset splits a code unit and misaligns everything after it. Slice on code point boundaries, and keep byte offsets even.
Missing BOM on file read: Without a BOM, text editors and parsers have to guess the encoding and may treat a UTF-16 file as binary or as a different encoding. Include a BOM in UTF-16 files unless the consuming format already specifies the byte order.
Confusing UTF-16 with UCS-2: UCS-2 is a fixed-width 2-byte encoding that cannot represent supplementary characters. Legacy systems that assume UCS-2 can silently corrupt emoji and other supplementary characters. UTF-16 supersedes UCS-2.
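
One way to catch truncated or badly sliced data is to scan the code units for surrogates that have no partner. The following Python sketch assumes the byte order is already known; `code_units` and `find_lone_surrogates` are hypothetical helper names, not part of any library.

```python
import struct


def code_units(data: bytes, little_endian: bool = True) -> list[int]:
    """Unpack raw UTF-16 bytes into a list of 16-bit code units."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    fmt = ("<" if little_endian else ">") + "H" * (len(data) // 2)
    return list(struct.unpack(fmt, data))


def find_lone_surrogates(units: list[int]) -> list[int]:
    """Return indexes of surrogates that are not part of a valid pair."""
    bad = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:                      # high surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2                                 # valid pair, skip both
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:                    # low surrogate alone
            bad.append(i)
        i += 1
    return bad


data = "a\U0001F600b".encode("utf-16-le")              # 4 code units, 8 bytes
print(find_lone_surrogates(code_units(data)))          # []
print(find_lone_surrogates(code_units(data[:4])))      # [1]
```

The second call shows a slice that ends between the two halves of the pair: the high surrogate at index 1 is left without its low surrogate.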

Frequently Asked Questions

Why does Windows use UTF-16 internally?: Windows adopted Unicode in the early 1990s when Unicode was a fixed 16-bit standard (what became UCS-2). The Win32 API was built around 2-byte characters. When Unicode expanded beyond 65,536 code points, Windows extended to UTF-16 while keeping the 2-byte API boundary. The Windows kernel, NTFS filenames, and the Win32 API all operate in UTF-16 LE natively.
Does JavaScript use UTF-16?: Yes. JavaScript strings are sequences of UTF-16 code units. String.prototype.length counts code units, not characters — a single emoji may have a length of 2. Use Array.from(str).length or the string iterator to count actual Unicode characters (code points). String.fromCodePoint() and codePointAt() handle supplementary characters correctly.
How do I convert UTF-16 to UTF-8 in code?: In Python: data.decode('utf-16').encode('utf-8') on a bytes object (a Python 3 str holds code points rather than UTF-16 code units, so text.encode('utf-8') is all that is needed once the data is decoded). In Java: new String(bytes, StandardCharsets.UTF_16).getBytes(StandardCharsets.UTF_8). In .NET: Encoding.UTF8.GetBytes(Encoding.Unicode.GetString(bytes)); note that Encoding.Unicode in .NET means UTF-16 LE.
What is a surrogate pair and when does it matter?: A surrogate pair is two UTF-16 code units that together encode a supplementary Unicode character (U+10000 and above). It matters whenever you count characters, slice strings, or process text character by character — operations that treat a surrogate pair as two separate items will corrupt the character. Modern languages handle this automatically in string iterators, but low-level byte manipulation requires explicit surrogate-pair awareness.
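
Two of the answers above can be tried in a few lines of Python. This is a sketch with illustrative function names: one helper transcodes UTF-16 bytes to UTF-8, the other counts UTF-16 code units so the result can be compared with the code point count.

```python
def utf16_to_utf8(data: bytes, encoding: str = "utf-16") -> bytes:
    """Transcode UTF-16 bytes to UTF-8.

    The "utf-16" codec consumes a leading BOM and uses it to pick the
    byte order; pass "utf-16-le" or "utf-16-be" for data without a BOM.
    """
    return data.decode(encoding).encode("utf-8")


def utf16_length(text: str) -> int:
    """Number of UTF-16 code units needed to represent `text`."""
    return len(text.encode("utf-16-le")) // 2


print(utf16_to_utf8(b"\xff\xfe\x41\x00"))             # b'A'
print(utf16_to_utf8(b"\x3d\xd8\x00\xde", "utf-16-le"))  # b'\xf0\x9f\x98\x80'

emoji = "\U0001F600"
print(len(emoji), utf16_length(emoji))                # 1 2
```

The last line prints one code point but two code units, which is the same gap that makes a single emoji report a length of 2 in JavaScript.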
