Surrogate Pairs: Why JavaScript Strings Break on Emoji
How UTF-16 surrogate pairs work, why they affect JavaScript/Java/C#, and how to handle them correctly.
The BMP Boundary: U+FFFF
When Unicode was first designed in the late 1980s, the consortium believed 65,536 code points (16 bits) would be enough for every character in every language. This original range — U+0000 to U+FFFF — is called the Basic Multilingual Plane (BMP).
They were wrong. As more scripts, historic characters, and eventually emoji were added, Unicode expanded to 17 planes totaling 1,114,112 code points (U+0000 to U+10FFFF). The 16 additional planes beyond the BMP are called Supplementary Planes. Characters in these planes have code points above U+FFFF and require a special encoding trick in UTF-16.
| Plane | Range | Name | Example characters |
|---|---|---|---|
| 0 (BMP) | U+0000..U+FFFF | Basic Multilingual Plane | A, é, 漢, ♠ |
| 1 (SMP) | U+10000..U+1FFFF | Supplementary Multilingual | 🌍, 𝄞, 𐐀 |
| 2 (SIP) | U+20000..U+2FFFF | Supplementary Ideographic | 𠮟, 𠀀 |
| 3-16 | U+30000..U+10FFFF | Tertiary Ideographic (3), unassigned (4-13), Special-purpose (14), Private Use (15-16) | 𰀀 (Ext. G) |
Languages like JavaScript, Java, and C# chose UTF-16 as their internal string encoding when 16 bits seemed sufficient. When Unicode outgrew 16 bits, these languages had to accommodate supplementary characters without changing their fundamental string type. The solution was surrogate pairs.
How Surrogate Pairs Encode Supplementary Characters
UTF-16 represents code points above U+FFFF using a pair of 16-bit code units called a surrogate pair. The algorithm is precise:
Given code point U (where U > 0xFFFF):
1. Subtract 0x10000: U' = U - 0x10000 (result is 0x00000..0xFFFFF, exactly 20 bits)
2. Split into two 10-bit halves:
   High 10 bits: H = (U' >> 10) + 0xD800 → range 0xD800..0xDBFF
   Low 10 bits: L = (U' & 0x3FF) + 0xDC00 → range 0xDC00..0xDFFF
Example: 🌍 = U+1F30D
   U' = 0x1F30D - 0x10000 = 0xF30D
   H = (0xF30D >> 10) + 0xD800 = 0x3C + 0xD800 = 0xD83C
   L = (0xF30D & 0x3FF) + 0xDC00 = 0x30D + 0xDC00 = 0xDF0D
So 🌍 in UTF-16: 0xD83C 0xDF0D
Unicode permanently reserved the range U+D800 to U+DFFF (2,048 code points) exclusively for surrogates. These code points are never assigned to characters and are not Unicode scalar values; they exist only as encoding artifacts in UTF-16. In a well-formed pair, the high surrogate (U+D800..U+DBFF) always comes first, followed by the low surrogate (U+DC00..U+DFFF).
This means UTF-16 can encode all 1,112,064 Unicode scalar values: 63,488 BMP code points directly (65,536 minus the 2,048 surrogates), plus 1,048,576 supplementary code points via surrogate pairs (1,024 high surrogates × 1,024 low surrogates).
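The arithmetic above can be sketched in a few lines of JavaScript. This is a minimal illustration rather than a production codec: the function names toSurrogatePair and fromSurrogatePair are hypothetical helpers introduced here, and fromSurrogatePair assumes its arguments already form a well-formed pair.

```javascript
// Hypothetical helpers for illustration; not part of any standard library.

// Encode a supplementary code point (U+10000..U+10FFFF) as [high, low].
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point is not in a supplementary plane");
  }
  const offset = codePoint - 0x10000;    // 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

// Decode a well-formed [high, low] pair back to its code point.
function fromSurrogatePair(high, low) {
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

const [high, low] = toSurrogatePair(0x1F30D);
console.log(high.toString(16), low.toString(16));       // d83c df0d
console.log(String.fromCharCode(high, low) === "🌍");   // true
console.log(fromSurrogatePair(high, low).toString(16)); // 1f30d
```

In real code, String.fromCodePoint() and codePointAt() already do this work; the sketch only makes the bit manipulation visible.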
Why "🌍".length === 2 in JavaScript
JavaScript strings are sequences of UTF-16 code units. The .length property counts these code units, not characters or code points:
| Expression | Value | Explanation |
|---|---|---|
| "A".length | 1 | BMP character = 1 code unit |
| "漢".length | 1 | BMP character = 1 code unit |
| "🌍".length | 2 | Supplementary = surrogate pair = 2 code units |
| "🌍"[0] | "\uD83C" | High surrogate (not a valid character) |
| "🌍"[1] | "\uDF0D" | Low surrogate (not a valid character) |
| "🌍".charCodeAt(0) | 0xD83C | High surrogate numeric value |
| "🌍".codePointAt(0) | 0x1F30D | Actual code point (ES6+) |
This is not a “bug” — it is the fundamental reality of how JavaScript stores strings. Every string operation that uses index-based access (bracket notation, charAt, charCodeAt, substring, slice) operates on UTF-16 code units, not code points.
ES6 introduced code-point-aware alternatives: codePointAt(), String.fromCodePoint(), and the string iterator ([...str] or for...of). These correctly handle surrogate pairs as single units.
// ES6 code point iteration
const str = "A🌍B";
// WRONG: index-based
for (let i = 0; i < str.length; i++) {
  console.log(str[i]);
}
// Output: "A", "\uD83C", "\uDF0D", "B" (4 iterations, broken emoji)
// RIGHT: iterator-based
for (const ch of str) {
  console.log(ch);
}
// Output: "A", "🌍", "B" (3 iterations, correct)
Iterating Correctly Over Strings
Here are the safe and unsafe ways to iterate over JavaScript strings containing surrogate pairs:
| Method | Surrogate-safe? | Notes |
|---|---|---|
| for (i=0; i<str.length; i++) | No | Iterates UTF-16 code units |
| str.split('') | No | Splits at code unit boundaries |
| for (const ch of str) | Yes | Uses string iterator (ES6) |
| [...str] | Yes | Spread uses string iterator |
| Array.from(str) | Yes | Uses string iterator |
| str.match(/./gsu) | Yes | With 'u' flag, . matches code points |
| Intl.Segmenter | Yes | Also handles grapheme clusters |
The u flag on regular expressions is critical. Without it, /./g matches individual code units, splitting surrogate pairs. With /./gu, the dot matches full code points. Similarly, \u{1F30D} syntax (requiring the u flag) lets you match supplementary characters directly in regex patterns.
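A small sketch of the difference the u flag makes, using the same sample string as the rest of this article. The variable names are illustrative only, and the s flag assumes an ES2018-capable engine.

```javascript
const sample = "A🌍B";

// Without the u flag, the dot matches one UTF-16 code unit at a time.
const codeUnits = sample.match(/./gs);
console.log(codeUnits.length); // 4

// With the u flag, the dot matches one full code point.
const codePoints = sample.match(/./gsu);
console.log(codePoints.length); // 3

// \u{...} is only read as a code point escape when the u flag is set.
console.log(/\u{1F30D}/u.test(sample)); // true

// A character class treats 🌍 as a single member only with the u flag.
console.log(/^[🌍]$/u.test("🌍")); // true
console.log(/^[🌍]$/.test("🌍"));  // false (the class holds two separate code units)
```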
// String reversal: a classic surrogate pair trap
const str = "A🌍B";
// WRONG: breaks surrogate pair
str.split('').reverse().join('');
// → "B\uDF0D\uD83CA" (corrupted: the reversed surrogates usually render as replacement characters)
// RIGHT: spread preserves pairs
[...str].reverse().join('');
// → "B🌍A" (correct)
// String length: count actual characters
[...str].length; // 3 (correct)
str.length; // 4 (counts code units)
Practical Bugs from Surrogate Pairs
Surrogate pair bugs are among the most common Unicode issues in production software. Here are real-world scenarios:
- Database truncation: A VARCHAR(100) column that counts UTF-16 code units will truncate "99 chars + 🌍" at the high surrogate, producing a lone surrogate that poisons downstream processing.
- JSON encoding: A lone surrogate escape (\uD83C without a following low surrogate) is permitted by the JSON grammar, but RFC 8259 warns that the behavior of software receiving such strings is unpredictable. Some parsers reject them; others silently produce U+FFFD.
- Substring extraction: str.substring(0, 2) on "A🌍B" returns "A" + the high surrogate of 🌍, a corrupted string that may render as A�.
- Twitter/SMS character limits: Platform limits are not defined in terms of JavaScript's .length. Twitter, for example, applies its own weighted count to the normalized text rather than counting UTF-16 code units, so a client-side check based on .length can disagree with what the service enforces.
- Cursor movement in text editors: Pressing the right arrow key should skip over both code units of a surrogate pair. Many custom text input implementations get this wrong, placing the cursor between the high and low surrogate.
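For the truncation scenario in the list above, one possible guard is to back up by one code unit whenever the cut would land between a high and a low surrogate. The sketch below assumes the limit is expressed in UTF-16 code units; truncateCodeUnits is a hypothetical helper name, not a built-in.

```javascript
// Hypothetical helper: truncate to at most maxUnits UTF-16 code units
// without leaving a lone high surrogate at the end of the result.
function truncateCodeUnits(str, maxUnits) {
  if (str.length <= maxUnits) return str;
  let end = maxUnits;
  const last = str.charCodeAt(end - 1); // last unit that would be kept
  const next = str.charCodeAt(end);     // first unit that would be dropped
  // If the cut falls between a high and a low surrogate, back up one unit.
  if (last >= 0xD800 && last <= 0xDBFF && next >= 0xDC00 && next <= 0xDFFF) {
    end -= 1;
  }
  return str.slice(0, end);
}

const text = "x".repeat(99) + "🌍";               // 101 code units
console.log(text.slice(0, 100).length);           // 100, ends in a lone high surrogate
console.log(truncateCodeUnits(text, 100).length); // 99, the emoji is dropped whole
```

This only protects surrogate pairs. It can still cut through a multi-code-point grapheme cluster such as a flag or a skin-tone sequence.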
The safest approach: always use code-point-aware APIs (for...of, codePointAt, String.fromCodePoint) when processing text, and use Intl.Segmenter when counting user-visible characters.
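As a rough illustration of the last point, the sketch below counts user-visible characters with Intl.Segmenter and compares the result with code unit and code point counts. It assumes a runtime that ships Intl.Segmenter with full ICU data; countGraphemes is a hypothetical helper name, and the "en" locale is an arbitrary default.

```javascript
// Hypothetical helper: count user-perceived characters (grapheme clusters).
function countGraphemes(str, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(str)) count++;
  return count;
}

const flag = "🇯🇵"; // two regional-indicator code points, shown as one flag
console.log(flag.length);          // 4 code units
console.log([...flag].length);     // 2 code points
console.log(countGraphemes(flag)); // 1 grapheme cluster
```

Code-point iteration is enough to avoid splitting surrogate pairs, but only grapheme segmentation matches what a user would call one character.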
// Safe substring that never splits surrogate pairs
function safeSubstring(str, start, end) {
const chars = [...str];
return chars.slice(start, end).join('');
}
safeSubstring("A🌍B", 0, 2); // "A🌍" (correct)
"A🌍B".substring(0, 2); // "A\uD83C" (broken!)
// Detect if a string contains lone surrogates
function hasLoneSurrogates(str) {
return /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/.test(str);
}