
UTF-8 Byte by Byte: How Characters Become Bytes

A visual, byte-level walkthrough of UTF-8 encoding showing exactly how code points map to 1-4 bytes.

The Four Ranges of UTF-8

UTF-8 is a variable-length encoding that represents every Unicode code point using 1 to 4 bytes. The number of bytes depends on the code point range:

Bytes  Code point range      Byte pattern                          Bits for code point
1      U+0000 .. U+007F      0xxxxxxx                              7
2      U+0080 .. U+07FF      110xxxxx 10xxxxxx                     11
3      U+0800 .. U+FFFF      1110xxxx 10xxxxxx 10xxxxxx            16
4      U+10000 .. U+10FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   21
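The range table can be turned into a small lookup. The sketch below is illustrative only: `utf8Length` is a hypothetical helper name, not part of any library, and it checks itself against Node's `Buffer.byteLength` for one sample character per range.

```javascript
// Hypothetical helper: number of UTF-8 bytes needed for one code point.
function utf8Length(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("not a Unicode code point");
  }
  if (codePoint <= 0x7F) return 1;   // 0xxxxxxx
  if (codePoint <= 0x7FF) return 2;  // 110xxxxx 10xxxxxx
  if (codePoint <= 0xFFFF) return 3; // 1110xxxx 10xxxxxx 10xxxxxx
  return 4;                          // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
}

// One sample per range: A, é, 漢, 🌍 (written as escapes to avoid ambiguity).
for (const ch of ["A", "\u00E9", "\u6F22", "\u{1F30D}"]) {
  const cp = ch.codePointAt(0);
  // Prints the helper's answer next to what Node's encoder actually produces.
  console.log(ch, utf8Length(cp), Buffer.byteLength(ch, "utf8"));
}
```

The two numbers printed on each line should agree, which is a quick way to confirm the range boundaries.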

The first 128 code points (ASCII) use a single byte identical to ASCII itself. This backward compatibility was a deliberate design decision by Ken Thompson and Rob Pike — any valid ASCII file is automatically valid UTF-8.

The 2-byte range covers Latin extended characters, Greek, Cyrillic, Arabic, and Hebrew. Most European languages fit entirely within 1-2 bytes per character.

Bit-Level Anatomy of UTF-8

Let's trace exactly how a code point becomes bytes. Take é (U+00E9, Latin small letter e with acute), which falls in the 2-byte range (U+0080..U+07FF):

U+00E9 in binary: 000 1110 1001  (11 bits)

Split into groups:    [00011] [101001]
Insert into template: 110xxxxx 10xxxxxx
Result:               11000011 10101001
Hex:                  0xC3     0xA9

Now consider 漢 (U+6F22, CJK ideograph), which falls in the 3-byte range:

U+6F22 in binary: 0110 1111 0010 0010  (16 bits)

Split into groups:    [0110] [111100] [100010]
Insert into template: 1110xxxx 10xxxxxx 10xxxxxx
Result:               11100110 10111100 10100010
Hex:                  0xE6     0xBC     0xA2
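The two traces above follow the same shift-and-mask recipe, which can be sketched in a few lines. `encodeCodePoint` and `hex` are hypothetical names chosen for this illustration. In practice `Buffer.from(str, "utf8")` or `TextEncoder` does this work.

```javascript
// Hypothetical encoder: turns one code point into an array of UTF-8 byte values.
function encodeCodePoint(cp) {
  if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
    throw new RangeError("not an encodable code point");
  }
  if (cp <= 0x7F) return [cp];
  if (cp <= 0x7FF) {
    // 5 high bits go into the leading byte, 6 low bits into the continuation byte.
    return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
  }
  if (cp <= 0xFFFF) {
    // 4 + 6 + 6 bits.
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  }
  // 3 + 6 + 6 + 6 bits.
  return [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
}

// Format byte values the way the traces above do.
const hex = (bytes) =>
  bytes.map((b) => "0x" + b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

console.log(hex(encodeCodePoint(0x00E9))); // 0xC3 0xA9
console.log(hex(encodeCodePoint(0x6F22))); // 0xE6 0xBC 0xA2
```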

The key insight: the leading byte's high bits tell the decoder exactly how many bytes follow. A byte starting with 0 is a standalone ASCII byte. A byte starting with 110 means “read 1 more continuation byte.” Starting with 1110 means “read 2 more.” And 11110 means “read 3 more.” Continuation bytes always start with 10.
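A decoder can act on those prefixes directly. The following is a minimal sketch rather than a production decoder: `sequenceLength` and `decodeAt` are hypothetical names, and the code assumes otherwise well-formed input, so it does not reject overlong forms or encoded surrogates.

```javascript
// Hypothetical helper: total length of a sequence, judged by its first byte.
function sequenceLength(byte) {
  if ((byte & 0x80) === 0x00) return 1; // 0xxxxxxx
  if ((byte & 0xE0) === 0xC0) return 2; // 110xxxxx
  if ((byte & 0xF0) === 0xE0) return 3; // 1110xxxx
  if ((byte & 0xF8) === 0xF0) return 4; // 11110xxx
  return 0; // 10xxxxxx (continuation byte) or an invalid leading byte
}

// Decode the single code point that starts at `offset`.
function decodeAt(bytes, offset) {
  const len = sequenceLength(bytes[offset]);
  if (len === 0) throw new Error("not a leading byte");
  if (len === 1) return { codePoint: bytes[offset], length: 1 };
  // Keep only the payload bits of the leading byte (5, 4, or 3 of them).
  let cp = bytes[offset] & (0xFF >> (len + 1));
  for (let i = 1; i < len; i++) {
    const b = bytes[offset + i];
    if ((b & 0xC0) !== 0x80) throw new Error("expected a continuation byte");
    cp = (cp << 6) | (b & 0x3F); // append 6 payload bits
  }
  return { codePoint: cp, length: len };
}

const decoded = decodeAt([0xE6, 0xBC, 0xA2], 0);
console.log(decoded.codePoint.toString(16), decoded.length); // 6f22 3
```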

Why CJK = 3 Bytes and Emoji = 4 Bytes

One of the most common questions: why do Chinese, Japanese, and Korean characters take 3 bytes in UTF-8? The answer lies in code point allocation:

Script                        Range               UTF-8 bytes  Example
ASCII / Latin basic           U+0000..U+007F      1 byte       A = 0x41
Latin extended / Cyrillic     U+0080..U+07FF      2 bytes      é = 0xC3 0xA9
CJK Ideographs                U+4E00..U+9FFF      3 bytes      漢 = 0xE6 0xBC 0xA2
Emoji / supplementary planes  U+10000..U+10FFFF   4 bytes      🌍 = 0xF0 0x9F 0x8C 0x8D

The CJK Unified Ideographs block spans U+4E00 to U+9FFF (over 20,000 characters), placing them squarely in the 3-byte zone. This means a Chinese or Japanese text file in UTF-8 can be up to 50% larger than the same file in a dedicated CJK encoding like GB2312 or Shift_JIS, which use 2 bytes per ideograph (3 bytes versus 2).

Most emoji live in the Supplementary Multilingual Plane (U+10000..U+1FFFF, with the main emoji blocks at U+1F000 and above), which requires 4 bytes. A single emoji like 🌍 (U+1F30D) takes 4 bytes in UTF-8, 4 bytes in UTF-32, and also 4 bytes in UTF-16, where it is stored as a surrogate pair of two 16-bit code units.
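The sizes are easy to check in Node.js. This is a small sketch using only built-in string and `Buffer` methods. The variable name `globe` is just for illustration.

```javascript
const globe = "\u{1F30D}"; // 🌍

console.log(Buffer.byteLength(globe, "utf8"));    // 4
console.log(Buffer.byteLength(globe, "utf16le")); // 4
console.log(globe.length);                        // 2 (UTF-16 code units)
console.log([...globe].length);                   // 1 (code point)

// The two UTF-16 code units that make up the surrogate pair.
console.log(
  globe.charCodeAt(0).toString(16),
  globe.charCodeAt(1).toString(16)
); // d83c df0d
```

Note that `length` counts UTF-16 code units, which is why a single emoji reports a length of 2 in JavaScript.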

Self-Synchronizing: UTF-8's Killer Feature

UTF-8 has a remarkable property: you can jump to any arbitrary byte in a stream and determine whether you are at the start of a character or in the middle of one. This is called self-synchronization.

The rules are simple: if a byte starts with 0, it's a single-byte character. If it starts with 10, it's a continuation byte — scan backward to find the leading byte. If it starts with 11, it's the leading byte of a multi-byte sequence.

Byte pattern    Meaning
0xxxxxxx        Single-byte character (ASCII)
10xxxxxx        Continuation byte (never a start)
110xxxxx        Start of 2-byte sequence
1110xxxx        Start of 3-byte sequence
11110xxx        Start of 4-byte sequence

This design means that if a single byte is corrupted or lost, the damage stays local: typically only the character containing that byte is destroyed, because the decoder can resynchronize at the next leading byte. Compare this with Shift_JIS, where losing a single byte can cause a whole run of subsequent characters to be misinterpreted (the “mojibake cascade” problem).
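Resynchronizing takes only a few lines. The sketch below assumes well-formed UTF-8 and uses a hypothetical helper name, `characterStart`. It steps backward over continuation bytes until it reaches a byte that can start a character.

```javascript
// Hypothetical helper: from any byte offset, find where the enclosing character starts.
function characterStart(bytes, offset) {
  let i = offset;
  // Continuation bytes look like 10xxxxxx; a valid sequence has at most 3 of them,
  // so never step back more than 3 positions.
  while (i > 0 && offset - i < 3 && (bytes[i] & 0xC0) === 0x80) {
    i--;
  }
  return i;
}

const buf = Buffer.from("a\u6F22b", "utf8"); // bytes: 61 E6 BC A2 62
console.log(characterStart(buf, 3)); // 1 (offset 3 is the last byte of 漢)
console.log(characterStart(buf, 4)); // 4 (offset 4 is the ASCII "b")
```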

Another benefit: you can search for an ASCII substring (like / in a file path) using simple byte comparison without any risk of false matches inside multi-byte characters, because every byte of a multi-byte sequence has its high bit set. This is not safe with Shift_JIS, where a trail byte can coincidentally equal an ASCII byte value (e.g., the infamous 0x5C backslash problem).
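A quick illustration of the difference, as a sketch. The path string is made up, and the Shift_JIS bytes for ソ (0x83 0x5C) are written out by hand because Node's `Buffer` does not encode Shift_JIS.

```javascript
// A made-up path mixing ASCII, CJK, and an emoji.
const path = Buffer.from("/home/\u6F22\u5B57/\u{1F30D}.txt", "utf8");

// Plain byte scan for "/" (0x2F): it can only ever match real slashes.
const slashOffsets = [];
for (let i = 0; i < path.length; i++) {
  if (path[i] === 0x2F) slashOffsets.push(i);
}
console.log(slashOffsets); // [ 0, 5, 12 ]

// Katakana ソ (U+30BD) in both encodings.
const soUtf8 = Buffer.from("\u30BD", "utf8"); // E3 82 BD
const soShiftJis = Buffer.from([0x83, 0x5C]); // hand-written Shift_JIS bytes

console.log(soUtf8.includes(0x5C));     // false: no byte below 0x80
console.log(soShiftJis.includes(0x5C)); // true: the trail byte looks like a backslash
```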

UTF-8 vs UTF-16: Size Comparison

Which encoding is more space-efficient depends entirely on the content:

Content type                 UTF-8 bytes    UTF-16 bytes            Winner
ASCII text (English code)    1 per char     2 per char              UTF-8 (50% smaller)
European text (Latin ext.)   1-2 per char   2 per char              UTF-8 (slightly smaller)
CJK text (Chinese/Japanese)  3 per char     2 per char              UTF-16 (33% smaller)
Emoji-heavy text             4 per char     4 per char (surrogate)  Tie
Mixed (HTML with CJK)        ~2.2 avg       2 per char (+ BOM)      Close to tie

For web content, UTF-8 almost always wins because HTML markup, CSS, JavaScript, and URLs are ASCII-heavy. Even a Japanese web page has so much ASCII in its markup that UTF-8 tends to be comparable to or smaller than UTF-16. This is one reason the WHATWG HTML specification requires documents to be encoded as UTF-8.

UTF-16 remains more compact for in-memory processing of CJK-heavy text, and it is the internal string format of Java, JavaScript, and Windows, which adopted 16-bit code units in the 1990s when Unicode was still expected to fit in 16 bits. However, for storage and network transfer, UTF-8 has become the universal standard: over 98% of web pages use UTF-8 as of 2024.

// Quick size comparison in Node.js:
const text = "漢字とASCII mixed テキスト";
console.log(Buffer.byteLength(text, "utf8"));    // 33 bytes (7 Japanese chars x 3 + 12 ASCII chars x 1)
console.log(Buffer.byteLength(text, "utf16le")); // 38 bytes (19 chars x 2)
// UTF-8 still wins here: the 12 ASCII characters outnumber the 7 Japanese ones.
// UTF-16 only pulls ahead once 3-byte characters outnumber ASCII characters.

const code = "function hello() { return 42; }";
console.log(Buffer.byteLength(code, "utf8"));    // 31 bytes
console.log(Buffer.byteLength(code, "utf16le")); // 62 bytes
// UTF-8 wins decisively for ASCII
