About the UTF-16 Encoder / Decoder
The UTF-16 Encoder / Decoder converts text to UTF-16 byte sequences and decodes UTF-16 encoded data back to readable Unicode characters. UTF-16 is the native string encoding in Windows, Java, JavaScript, and .NET, which makes this tool useful for debugging file I/O, inter-process communication, and API payloads in those ecosystems.
How to Use
- To encode: Paste your text into the input field and click Encode. The tool outputs the UTF-16 byte sequence in hex, showing each code unit as a 2-byte (4 hex character) value.
- To decode: Paste a UTF-16 hex byte sequence into the input field and click Decode to recover the original text.
- Select byte order — UTF-16 LE (Little Endian) or UTF-16 BE (Big Endian) — to match your source system.
- Toggle BOM inclusion on or off. Most files should include a BOM; programmatic byte arrays often omit it.
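The steps above can be approximated in a few lines of Python. This is only a sketch of the behavior described, not the tool's actual implementation; the function names `encode_utf16_hex` and `decode_utf16_hex` and their parameters are hypothetical.

```python
import codecs

def encode_utf16_hex(text: str, byte_order: str = "le", include_bom: bool = False) -> str:
    """Return the UTF-16 bytes of `text` as space-separated uppercase hex pairs."""
    codec = "utf-16-le" if byte_order == "le" else "utf-16-be"
    data = text.encode(codec)  # the explicit -le/-be codecs do not write a BOM
    if include_bom:
        bom = codecs.BOM_UTF16_LE if byte_order == "le" else codecs.BOM_UTF16_BE
        data = bom + data
    return " ".join(f"{b:02X}" for b in data)

def decode_utf16_hex(hex_string: str, byte_order: str = "le") -> str:
    """Decode hex pairs back to text, dropping a leading BOM if one is present."""
    data = bytes.fromhex(hex_string)  # whitespace between pairs is ignored
    codec = "utf-16-le" if byte_order == "le" else "utf-16-be"
    text = data.decode(codec)
    return text[1:] if text.startswith("\ufeff") else text

print(encode_utf16_hex("A"))                    # little endian, no BOM
print(encode_utf16_hex("A", "be", True))        # big endian with BOM
print(decode_utf16_hex("FF FE 41 00"))          # BOM is stripped on decode
```

Keeping the byte order and the BOM as separate options mirrors the two controls described in the list above.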
How UTF-16 Encoding Works
UTF-16 represents Unicode code points using one or two 16-bit code units (2 or 4 bytes):
- BMP characters (U+0000–U+FFFF, excluding the surrogate range U+D800–U+DFFF) — Encoded as a single 16-bit code unit. The code unit value equals the code point. Most Latin, Cyrillic, Greek, Arabic, Hebrew, CJK, and Hiragana/Katakana characters fall in this range.
- Supplementary characters (U+10000–U+10FFFF) — Encoded as a surrogate pair: two 16-bit code units. The encoder first subtracts 0x10000 from the code point, leaving a 20-bit value. The high surrogate (U+D800–U+DBFF) carries the upper 10 bits; the low surrogate (U+DC00–U+DFFF) carries the lower 10 bits. Most emoji and many historic scripts require surrogate pairs.
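The surrogate-pair arithmetic can be written out directly. The following is a minimal sketch; the helper names `to_surrogate_pair` and `from_surrogate_pair` are illustrative and not part of any library.

```python
from typing import Tuple

def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low

def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(0x1F600)   # U+1F600 GRINNING FACE
print(f"{high:04X} {low:04X}")           # D83D DE00
```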
Little Endian vs Big Endian
UTF-16 uses 16-bit code units, and each code unit can be stored with the low byte first (LE) or the high byte first (BE):
- UTF-16 LE — Low byte stored first. Default on Windows and x86 systems. Example: A (U+0041) → 41 00.
- UTF-16 BE — High byte stored first. Used in network protocols and some Unix systems, and assumed by Java's UTF-16 charset when it decodes data that has no BOM. Example: A (U+0041) → 00 41.
- BOM (Byte Order Mark) — The character U+FEFF placed at the start of a file signals the byte order: FF FE indicates LE; FE FF indicates BE. Without a BOM, the receiver must know or guess the byte order.
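BOM sniffing as described above might look like the following sketch. The function names and the `default_order` fallback are assumptions made for illustration. The check deliberately ignores the UTF-32 LE BOM (FF FE 00 00), which starts with the same two bytes.

```python
import codecs
from typing import Optional, Tuple

def detect_utf16_byte_order(data: bytes) -> Tuple[Optional[str], bytes]:
    """Return (byte order or None, payload with any BOM removed)."""
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "le", data[2:]
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "be", data[2:]
    return None, data                          # no BOM: caller must decide

def decode_utf16(data: bytes, default_order: str = "le") -> str:
    """Decode UTF-16 bytes, trusting a BOM when present and a default otherwise."""
    order, payload = detect_utf16_byte_order(data)
    order = order or default_order
    return payload.decode("utf-16-le" if order == "le" else "utf-16-be")

print(decode_utf16(b"\xff\xfeA\x00"))                  # BOM says LE
print(decode_utf16(b"\xfe\xff\x00A"))                  # BOM says BE
print(decode_utf16(b"\x00A", default_order="be"))      # no BOM, caller supplies BE
```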
UTF-16 vs UTF-8
- ASCII content — UTF-8 uses 1 byte per ASCII character; UTF-16 always uses 2. For English-heavy text, UTF-8 files are roughly half the size of UTF-16.
- CJK content — CJK characters use 3 bytes in UTF-8 but only 2 in UTF-16 (for BMP characters). UTF-16 is more compact for Chinese, Japanese, and Korean text.
- Web compatibility — UTF-8 is the web standard. UTF-16 is the internal string format for Windows, Java, JavaScript engines, and .NET. Cross-boundary data should be converted to UTF-8.
- Null bytes — UTF-16 contains null bytes for ASCII characters (e.g., A → 41 00), which breaks C-string handling. UTF-8 never contains null bytes except for the NUL character itself.
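The size and null-byte differences are easy to check empirically. This sketch uses a hypothetical helper, `encoded_sizes`, and two short sample strings chosen only for illustration.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of `text` in UTF-8 and in UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }

print(encoded_sizes("Hello, world"))   # ASCII: UTF-16 is twice the size
print(encoded_sizes("你好世界"))        # BMP CJK: UTF-16 is smaller

# ASCII text encoded as UTF-16 contains 0x00 bytes; the UTF-8 form does not.
print(b"\x00" in "A".encode("utf-16-le"))
print(b"\x00" in "A".encode("utf-8"))
```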
Common UTF-16 Issues
- Incorrect byte order — Reading UTF-16 LE data as UTF-16 BE (or vice versa) produces completely garbled output. Always check for a BOM or confirm the byte order from the data source documentation.
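- Broken surrogate pairs — Slicing UTF-16 data between the two code units of a surrogate pair leaves an invalid lone surrogate, and slicing at an odd byte offset splits a code unit itself. Counting code units instead of bytes avoids the second problem but not the first; use code-point-aware string operations, or check whether the cut falls directly after a high surrogate.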
- Missing BOM on file read — When a UTF-16 file has no BOM, text editors and parsers may misread it as binary or as a different encoding. Include a BOM in UTF-16 files that other programs will open.
- Confusing UTF-16 with UCS-2 — UCS-2 is a fixed-width 2-byte encoding that cannot represent supplementary characters. Legacy systems using UCS-2 can silently corrupt emoji or other supplementary symbols. UTF-16 supersedes UCS-2 completely.
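Several of these failure modes can be reproduced and guarded against in a few lines. The sketch below assumes UTF-16 LE input; `truncate_utf16_le` and `fits_in_ucs2` are hypothetical helpers, not library functions.

```python
def truncate_utf16_le(data: bytes, max_bytes: int) -> bytes:
    """Cut UTF-16 LE bytes without splitting a code unit or a surrogate pair."""
    end = min(max_bytes, len(data))
    end -= end % 2                                   # stay on a code-unit boundary
    if end >= 2:
        last_unit = int.from_bytes(data[end - 2:end], "little")
        if 0xD800 <= last_unit <= 0xDBFF:            # dangling high surrogate
            end -= 2
    return data[:end]

def fits_in_ucs2(text: str) -> bool:
    """True if every character is in the BMP, so UCS-2 can represent it."""
    return all(ord(ch) <= 0xFFFF for ch in text)

text = "Hi \U0001F600"
data = text.encode("utf-16-le")

# Wrong byte order: the bytes decode without error, but to unrelated characters.
print(data.decode("utf-16-be", errors="replace") == text)

# A naive cut after 8 bytes leaves a lone high surrogate, which fails to decode.
try:
    data[:8].decode("utf-16-le")
except UnicodeDecodeError as exc:
    print("naive slice failed:", exc.reason)

print(truncate_utf16_le(data, 8).decode("utf-16-le"))   # the emoji is dropped whole
print(fits_in_ucs2(text))                                # the emoji needs UTF-16
```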
Frequently Asked Questions
- Why does Windows use UTF-16 internally?
- Windows adopted Unicode in the early 1990s when Unicode was a fixed 16-bit standard (what became UCS-2). The Win32 API was built around 2-byte characters. When Unicode expanded beyond 65,536 code points, Windows extended to UTF-16 while keeping the 2-byte API boundary. The Windows kernel, NTFS filenames, and the Win32 API all operate in UTF-16 LE natively.
- Does JavaScript use UTF-16?
- Yes. JavaScript strings are sequences of UTF-16 code units. String.prototype.length counts code units, not characters, so a single emoji may have a length of 2. Use Array.from(str).length or the string iterator to count actual Unicode characters (code points). String.fromCodePoint() and codePointAt() handle supplementary characters correctly.
- How do I convert UTF-16 to UTF-8 in code?
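- In Python: data.decode('utf-16').encode('utf-8'), where data is a bytes object holding the UTF-16 input; the 'utf-16' codec reads the byte order from the BOM, and the resulting Python 3 str is a sequence of code points regardless of how the interpreter stores it internally. In Java: new String(bytes, StandardCharsets.UTF_16).getBytes(StandardCharsets.UTF_8). In .NET: Encoding.UTF8.GetBytes(Encoding.Unicode.GetString(bytes)) (note that Encoding.Unicode in .NET means UTF-16 LE).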
- What is a surrogate pair and when does it matter?
- A surrogate pair is two UTF-16 code units that together encode a supplementary Unicode character (U+10000 and above). It matters whenever you count characters, slice strings, or process text character by character: operations that treat a surrogate pair as two separate items will corrupt the character. Many modern languages offer code-point-aware string iterators, but indexing by code unit and low-level byte manipulation still require explicit surrogate-pair awareness.
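As a Python sketch of the conversion and counting questions above: the helper names are hypothetical, and `byte_order=None` is an assumed convention meaning "read the byte order from the BOM".

```python
from typing import Optional

def utf16_to_utf8(data: bytes, byte_order: Optional[str] = None) -> bytes:
    """Re-encode UTF-16 bytes as UTF-8.

    byte_order: "le", "be", or None to let the codec read the BOM.
    """
    codec = {"le": "utf-16-le", "be": "utf-16-be", None: "utf-16"}[byte_order]
    return data.decode(codec).encode("utf-8")

def count_utf16_code_units(text: str) -> int:
    """What JavaScript's String.prototype.length would report for this text."""
    return len(text.encode("utf-16-le")) // 2

def count_code_points(text: str) -> int:
    """Python's len() on a str already counts code points."""
    return len(text)

emoji = "\U0001F600"
print(utf16_to_utf8(b"\xff\xfeA\x00"))                        # BOM-prefixed LE input
print(utf16_to_utf8(emoji.encode("utf-16-be"), "be").hex())   # 4-byte UTF-8 sequence
print(count_utf16_code_units(emoji), count_code_points(emoji))
```

Passing an explicit byte order is the safer choice for BOM-less input, since the plain 'utf-16' codec then falls back to the platform's native order.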