Surrogate Pairs: Why JavaScript Strings Break on Emoji

How UTF-16 surrogate pairs work, why they affect JavaScript/Java/C#, and how to handle them correctly.

The BMP Boundary: U+FFFF

When Unicode was first designed in the late 1980s, its designers believed 65,536 code points (16 bits) would be enough for every character in every language. This original range, U+0000 to U+FFFF, is called the Basic Multilingual Plane (BMP).

They were wrong. As more scripts, historic characters, and eventually emoji were added, Unicode expanded to 17 planes totaling 1,114,112 code points (U+0000 to U+10FFFF). The 16 additional planes beyond the BMP are called Supplementary Planes. Characters in these planes have code points above U+FFFF and require a special encoding trick in UTF-16.

Plane	Range	Name	Example characters
0 (BMP)	U+0000..U+FFFF	Basic Multilingual Plane	A, é, 漢, ♠
1 (SMP)	U+10000..U+1FFFF	Supplementary Multilingual	🌍, 𝄞, 𐐀
2 (SIP)	U+20000..U+2FFFF	Supplementary Ideographic	𠮟, 𠀀
3-16	U+30000..U+10FFFF	Tertiary Ideographic (3), unassigned (4-13), Special-purpose (14), Private Use (15-16)	𰀀 (Ext. G)

Languages like JavaScript, Java, and C# chose UTF-16 as their internal string encoding when 16 bits seemed sufficient. When Unicode outgrew 16 bits, these languages had to accommodate supplementary characters without changing their fundamental string type. The solution was surrogate pairs.

How Surrogate Pairs Encode Supplementary Characters

UTF-16 represents code points above U+FFFF using a pair of 16-bit code units called a surrogate pair. The algorithm is precise:

Given code point U (where U > 0xFFFF):

1. Subtract 0x10000:  U' = U - 0x10000
   (result is 0x00000 .. 0xFFFFF, exactly 20 bits)

2. Split into two 10-bit halves:
   High 10 bits: H = (U' >> 10) + 0xD800   → range 0xD800..0xDBFF
   Low 10 bits:  L = (U' & 0x3FF) + 0xDC00  → range 0xDC00..0xDFFF

Example: 🌍 = U+1F30D
   U' = 0x1F30D - 0x10000 = 0xF30D
   H  = (0xF30D >> 10) + 0xD800 = 0x3C + 0xD800 = 0xD83C
   L  = (0xF30D & 0x3FF) + 0xDC00 = 0x30D + 0xDC00 = 0xDF0D

So 🌍 in UTF-16: 0xD83C 0xDF0D
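
The arithmetic above can be sketched in JavaScript as follows. This is a minimal illustration, not a library API: toSurrogatePair and fromSurrogatePair are hypothetical helper names introduced here, and the range check is an assumption about how invalid input should be handled.

```javascript
// Sketch of the surrogate pair arithmetic described above.
// toSurrogatePair / fromSurrogatePair are hypothetical helper names.
function toSurrogatePair(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("expected a code point in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;      // 20-bit value, 0x00000..0xFFFFF
  const high = (offset >> 10) + 0xD800;    // top 10 bits    -> 0xD800..0xDBFF
  const low = (offset & 0x3FF) + 0xDC00;   // bottom 10 bits -> 0xDC00..0xDFFF
  return [high, low];
}

// Inverse operation: recombine the two 10-bit halves and add 0x10000 back.
function fromSurrogatePair(high, low) {
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

const [globeHigh, globeLow] = toSurrogatePair(0x1F30D);
console.log(globeHigh.toString(16), globeLow.toString(16));       // d83c df0d
console.log(String.fromCharCode(globeHigh, globeLow) === "🌍");   // true
console.log(fromSurrogatePair(globeHigh, globeLow).toString(16)); // 1f30d
```

In practice String.fromCodePoint() and codePointAt() perform this conversion for you; the sketch only makes the bit manipulation visible.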

Unicode permanently reserved the range U+D800 to U+DFFF (2,048 code points) exclusively for surrogates. These code points are not Unicode scalar values: they never represent a character on their own and exist only as encoding artifacts in UTF-16. In a well-formed pair, the high surrogate (U+D800..U+DBFF) always comes first, followed by the low surrogate (U+DC00..U+DFFF).

This means UTF-16 can encode all 1,112,064 Unicode scalar values (code points other than surrogates): 63,488 BMP code points directly (65,536 minus the 2,048 surrogates), plus 1,048,576 supplementary code points via surrogate pairs (1,024 high surrogates × 1,024 low surrogates).
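
As a quick sanity check, the counts in this paragraph can be recomputed from the range boundaries. The variable names below are illustrative only.

```javascript
// Recompute the UTF-16 capacity figures from the range boundaries.
const surrogateCount = 0xDFFF - 0xD800 + 1;               // 2,048 reserved surrogates
const bmpEncodable = 0x10000 - surrogateCount;            // 63,488 BMP code points
const supplementaryCount = 0x400 * 0x400;                 // 1,024 x 1,024 = 1,048,576
const totalEncodable = bmpEncodable + supplementaryCount; // 1,112,064
const totalCodePoints = 0x10FFFF + 1;                     // 1,114,112 (17 planes)

console.log(totalEncodable);                                      // 1112064
console.log(totalCodePoints - surrogateCount === totalEncodable); // true
```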

Why "🌍".length === 2 in JavaScript

JavaScript strings are sequences of UTF-16 code units. The .length property counts these code units, not characters or code points:

Expression	Value	Explanation
"A".length	1	BMP character = 1 code unit
"漢".length	1	BMP character = 1 code unit
"🌍".length	2	Supplementary = surrogate pair = 2 code units
"🌍"[0]	"\uD83C"	High surrogate (not a valid character)
"🌍"[1]	"\uDF0D"	Low surrogate (not a valid character)
"🌍".charCodeAt(0)	0xD83C	High surrogate numeric value
"🌍".codePointAt(0)	0x1F30D	Actual code point (ES6+)

This is not a “bug” — it is the fundamental reality of how JavaScript stores strings. Every string operation that uses index-based access (bracket notation, charAt, charCodeAt, substring, slice) operates on UTF-16 code units, not code points.

ES6 introduced code-point-aware alternatives: codePointAt(), String.fromCodePoint(), and the string iterator ([...str] or for...of). These correctly handle surrogate pairs as single units.

// ES6 code point iteration
const str = "A🌍B";

// WRONG: index-based
for (let i = 0; i < str.length; i++) {
  console.log(str[i]);
}
// Output: "A", "\uD83C", "\uDF0D", "B"  (4 iterations, broken emoji)

// RIGHT: iterator-based
for (const ch of str) {
  console.log(ch);
}
// Output: "A", "🌍", "B"  (3 iterations, correct)
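
The other two ES6 additions, codePointAt() and String.fromCodePoint(), can be sketched the same way. One caveat worth showing: codePointAt() still takes a code unit index, so pointing it at the middle of a pair returns the lone low surrogate. The variable names here are illustrative.

```javascript
const sample = "A🌍B";

// Index 1 is the high surrogate, so codePointAt decodes the whole pair.
console.log(sample.codePointAt(1).toString(16)); // 1f30d

// Index 2 is the low surrogate: there is no pair to decode from here.
console.log(sample.codePointAt(2).toString(16)); // df0d

// fromCodePoint builds the surrogate pair; fromCharCode truncates to 16 bits.
console.log(String.fromCodePoint(0x1F30D) === "🌍");     // true
console.log(String.fromCharCode(0x1F30D) === "\uF30D");  // true (not the emoji)

// Collect code points by combining the string iterator with codePointAt(0).
const codePoints = Array.from(sample, (ch) => ch.codePointAt(0));
console.log(codePoints.map((cp) => cp.toString(16))); // [ '41', '1f30d', '42' ]
```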

Iterating Correctly Over Strings

Here are the safe and unsafe ways to iterate over JavaScript strings containing surrogate pairs:

Method	Surrogate-safe?	Notes
for (i=0; i<str.length; i++)	No	Iterates UTF-16 code units
str.split('')	No	Splits at code unit boundaries
for (const ch of str)	Yes	Uses string iterator (ES6)
[...str]	Yes	Spread uses string iterator
Array.from(str)	Yes	Uses string iterator
str.match(/./gsu)	Yes	With 'u' flag, . matches code points
Intl.Segmenter	Yes	Also handles grapheme clusters

The u flag on regular expressions is critical. Without it, /./g matches individual code units, splitting surrogate pairs. With /./gu, the dot matches full code points. Similarly, \u{1F30D} syntax (requiring the u flag) lets you match supplementary characters directly in regex patterns.
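
A short sketch of the difference the u flag makes, using the same sample string as the rest of this article (variable names are illustrative):

```javascript
const text = "A🌍B";

// Without the u flag, the dot matches one UTF-16 code unit at a time.
const codeUnitMatches = text.match(/./g);
console.log(codeUnitMatches.length); // 4 (the emoji is split in two)

// With the u flag, the dot matches one code point at a time.
const codePointMatches = text.match(/./gu);
console.log(codePointMatches.length); // 3
console.log(codePointMatches[1] === "🌍"); // true

// \u{...} escapes are only interpreted as code points when u is set.
const hasGlobe = /\u{1F30D}/u.test(text);
console.log(hasGlobe); // true
```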

// String reversal: a classic surrogate pair trap
const str = "A🌍B";

// WRONG: breaks surrogate pair
str.split('').reverse().join('');
// → "B\uDF0D\uD83CA" (corrupted: the reversed surrogates typically render as replacement chars)

// RIGHT: spread preserves pairs
[...str].reverse().join('');
// → "B🌍A" (correct)

// String length: count code points instead of code units
[...str].length;  // 3 (correct)
str.length;       // 4 (counts code units)

Practical Bugs from Surrogate Pairs

Surrogate pair bugs are among the most common Unicode issues in production software. Here are real-world scenarios:

Database truncation: A VARCHAR(100) column that counts UTF-16 code units will truncate "99 chars + 🌍" at the high surrogate, producing a lone surrogate that poisons downstream processing.
JSON encoding: The RFC 8259 grammar permits lone surrogate escapes (\uD83C without a following low surrogate), but the RFC warns that the behavior of software receiving them is unpredictable. Some parsers reject them; others silently produce U+FFFD.
Substring extraction: str.substring(0, 2) on "A🌍B" returns "A" + the high surrogate of 🌍, a corrupted string that may render as A�.
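Twitter/SMS character limits: Length limits are rarely defined in UTF-16 code units. Twitter's documented counting works on NFC-normalized text with per-range weights, so an emoji's contribution to the limit is not its .length of 2 in JavaScript, and validating input with .length gives the wrong answer.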
Cursor movement in text editors: Pressing the right arrow key should skip over both code units of a surrogate pair. Many custom text input implementations get this wrong, placing the cursor between the high and low surrogate.
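
For the truncation scenario in the list above, one possible guard is to back off by one code unit whenever the cut would land between the two halves of a pair. This is a minimal sketch: truncateToCodeUnits is a hypothetical helper name, and it assumes the limit is expressed in UTF-16 code units.

```javascript
// Truncate to at most maxUnits UTF-16 code units without splitting a pair.
// truncateToCodeUnits is a hypothetical helper, not a built-in.
function truncateToCodeUnits(str, maxUnits) {
  if (str.length <= maxUnits) return str;

  let end = maxUnits;
  const last = str.charCodeAt(end - 1); // last unit that would be kept
  const next = str.charCodeAt(end);     // first unit that would be dropped
  const lastIsHigh = last >= 0xD800 && last <= 0xDBFF;
  const nextIsLow = next >= 0xDC00 && next <= 0xDFFF;

  // The cut falls inside a pair: drop the high surrogate as well.
  if (end > 0 && lastIsHigh && nextIsLow) end -= 1;

  return str.slice(0, end);
}

const padded = "a".repeat(99) + "🌍";                  // 101 code units
console.log(padded.slice(0, 100).length);              // 100 (ends in a lone surrogate)
console.log(truncateToCodeUnits(padded, 100).length);  // 99  (emoji dropped whole)
```

The result can be one unit shorter than the limit, which is usually preferable to storing a lone surrogate.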

The safest approach: always use code-point-aware APIs (for...of, codePointAt, String.fromCodePoint) when processing text, and use Intl.Segmenter when counting user-visible characters.
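
To illustrate why Intl.Segmenter is the right tool for user-visible characters, here is a small sketch comparing the three ways of counting. countGraphemes is a hypothetical helper name, the "en" locale is an arbitrary choice, and the code assumes a runtime that ships Intl.Segmenter.

```javascript
// Count user-perceived characters (grapheme clusters).
// countGraphemes is a hypothetical helper; assumes Intl.Segmenter is available.
function countGraphemes(str, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(str)) count += 1;
  return count;
}

// A flag emoji is two regional indicator code points, each a surrogate pair.
const flag = "\u{1F1EF}\u{1F1F5}"; // 🇯🇵
console.log(flag.length);          // 4 code units
console.log([...flag].length);     // 2 code points
console.log(countGraphemes(flag)); // 1 grapheme cluster

// Combining marks show the same effect entirely within the BMP.
const accented = "e\u0301";            // e + combining acute accent
console.log(accented.length);          // 2
console.log(countGraphemes(accented)); // 1
```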

// Safe substring that never splits surrogate pairs
function safeSubstring(str, start, end) {
  const chars = [...str];
  return chars.slice(start, end).join('');
}

safeSubstring("A🌍B", 0, 2); // "A🌍" (correct)
"A🌍B".substring(0, 2);      // "A\uD83C" (broken!)

// Detect if a string contains lone surrogates
function hasLoneSurrogates(str) {
  return /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/.test(str);
}
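
Once lone surrogates are detected, a common follow-up is to replace them with U+FFFD before the string leaves your system. The sketch below reuses the detection pattern above with the g flag; replaceLoneSurrogates is a hypothetical helper name. Newer runtimes also provide String.prototype.isWellFormed() and toWellFormed() for the same purpose, so treat this as a fallback illustration.

```javascript
// Same pattern as the detector above, with the g flag so every match is replaced.
const LONE_SURROGATE =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g;

// replaceLoneSurrogates is a hypothetical helper, not a built-in.
function replaceLoneSurrogates(str) {
  return str.replace(LONE_SURROGATE, "\uFFFD");
}

console.log(replaceLoneSurrogates("A\uD83C") === "A\uFFFD");            // true
console.log(replaceLoneSurrogates("A🌍B") === "A🌍B");                  // true (valid pair untouched)
console.log(replaceLoneSurrogates("\uDF0D\uD83C") === "\uFFFD\uFFFD");  // true (reversed pair)
```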
