Characters Are a Lie: Understanding Grapheme Clusters
Why string.length gives wrong answers, what grapheme clusters really are, and how Intl.Segmenter counts them correctly.
The "One Character" Illusion
How many characters are in this string: 👨‍👩‍👧‍👦?
Most people answer "one" because it looks like a single family emoji. But ask JavaScript, and you get three different answers:
| Method | Result | What it counts |
|---|---|---|
| "👨‍👩‍👧‍👦".length | 11 | UTF-16 code units |
| [..."👨‍👩‍👧‍👦"].length | 7 | Code points |
| Intl.Segmenter | 1 | Grapheme clusters (visual characters) |
The family emoji is composed of 7 code points: four person emoji (man, woman, girl, boy) joined by three Zero Width Joiners (ZWJ, U+200D). In UTF-16, each of the four emoji lies above U+FFFF and takes two code units (a surrogate pair), while each ZWJ takes one, so .length returns 4 × 2 + 3 = 11.
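The three counts can be reproduced in a few lines. This is a minimal sketch: the helper name countAll is hypothetical, and the emoji is written with escapes so that each code point is visible.

```javascript
// Family emoji: man, ZWJ, woman, ZWJ, girl, ZWJ, boy.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

// Hypothetical helper: returns all three counts for a string.
function countAll(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    codeUnits: str.length,                         // UTF-16 code units
    codePoints: [...str].length,                   // string iteration yields code points
    graphemes: [...segmenter.segment(str)].length, // grapheme clusters
  };
}

console.log(countAll(family));
// { codeUnits: 11, codePoints: 7, graphemes: 1 }
```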
Three Different Counts
Understanding the three counts is fundamental to working with Unicode correctly:
| Unit | What it is | Example: 👍🏽 |
|---|---|---|
| UTF-16 code units | What .length counts. Includes surrogate pairs. | 4 units |
| Code points | Basic Unicode unit (U+XXXX). What [...str] gives. | 2 points |
| Grapheme clusters | What humans see as "one character". | 1 cluster |
👍🏽 is two code points: 👍 (U+1F44D) + skin tone modifier 🏽 (U+1F3FD). Each is above U+FFFF, so each takes two UTF-16 code units. But it renders as one grapheme cluster.
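To see the individual code points inside a string, one option is to iterate it and format each value. The helper toCodePoints below is a hypothetical name used only for illustration.

```javascript
// Thumbs up (U+1F44D) followed by the medium skin tone modifier (U+1F3FD).
const thumbsUp = "\u{1F44D}\u{1F3FD}";

// Hypothetical helper: formats each code point of a string as U+XXXX.
function toCodePoints(str) {
  return [...str].map(
    (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(toCodePoints(thumbsUp)); // [ 'U+1F44D', 'U+1F3FD' ]
console.log(thumbsUp.length);        // 4 (two surrogate pairs)
```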
Flag Emoji: Regional Indicator Math
Flag emoji are pairs of Regional Indicator symbols forming one grapheme cluster:
| Flag | Code points | Meaning |
|---|---|---|
| 🇯🇵 | U+1F1EF + U+1F1F5 | Regional J + P |
| 🇺🇸 | U+1F1FA + U+1F1F8 | Regional U + S |
| 🇬🇧 | U+1F1EC + U+1F1E7 | Regional G + B |
Three flags, but "🇯🇵🇺🇸🇬🇧".length returns 12 (6 code points × 2 code units each). Of the three counting methods, only Intl.Segmenter identifies 3 grapheme clusters.
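The "math" is a fixed offset: Regional Indicator A sits at U+1F1E6 and the other letters follow in alphabetical order. A minimal sketch, assuming a two-letter uppercase region code as input (flagFromRegionCode is a hypothetical helper name):

```javascript
// Regional Indicator Symbol Letter A; B through Z follow consecutively.
const REGIONAL_INDICATOR_A = 0x1f1e6;

// Hypothetical helper: maps a two-letter uppercase region code to its flag sequence.
function flagFromRegionCode(code) {
  if (!/^[A-Z]{2}$/.test(code)) {
    throw new RangeError("expected two uppercase ASCII letters");
  }
  return [...code]
    .map((letter) =>
      // 65 is the char code of "A", so "A" maps to offset 0.
      String.fromCodePoint(REGIONAL_INDICATOR_A + letter.charCodeAt(0) - 65)
    )
    .join("");
}

const flags = ["JP", "US", "GB"].map(flagFromRegionCode).join("");
const flagSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

console.log(flags.length);                             // 12
console.log([...flagSegmenter.segment(flags)].length); // 3
```

Whether a given pair actually renders as a flag depends on the font and platform. The segmenter pairs Regional Indicators regardless of whether the pair names a real region.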
When Normalization Changes the Count
The string café can be encoded two ways in Unicode:
| Form | Code points | Representation |
|---|---|---|
| NFC (composed) | c a f é (4 CPs) | é = U+00E9 |
| NFD (decomposed) | c a f e ◌́ (5 CPs) | e + combining acute = U+0065 U+0301 |
Both look identical and both contain 4 grapheme clusters. But the code point count differs (4 vs 5), and so does .length.
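The difference is easy to observe with String.prototype.normalize. In this sketch the two forms are written with escapes so the encoding is unambiguous, and countGraphemes is a hypothetical helper.

```javascript
const composed = "caf\u00E9";    // NFC: é as a single code point
const decomposed = "cafe\u0301"; // NFD: e followed by a combining acute accent

// Hypothetical helper: counts grapheme clusters with Intl.Segmenter.
function countGraphemes(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(str)].length;
}

console.log(composed === decomposed);                  // false
console.log(composed.length, decomposed.length);       // 4 5
console.log(composed.normalize("NFD") === decomposed); // true
console.log(decomposed.normalize("NFC") === composed); // true
console.log(countGraphemes(composed), countGraphemes(decomposed)); // 4 4
```

Normalizing both sides to the same form before comparing avoids treating visually identical strings as different.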
Practical Consequences
Getting character counting wrong causes real bugs:
- String truncation: Cutting at .length / 2 can split a surrogate pair, leaving a lone surrogate that is typically displayed or encoded as U+FFFD.
- Cursor movement: The cursor should skip the entire grapheme cluster, not individual code points.
- Input validation: "Max 10 characters" should count grapheme clusters, not .length.
- String reversal: [...str].reverse().join("") breaks ZWJ sequences and flag emoji.
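The truncation and reversal cases can be handled by working on grapheme clusters rather than code units or code points. This is a sketch under that assumption. The helper names toGraphemes, truncateGraphemes and reverseGraphemes are hypothetical.

```javascript
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Hypothetical helper: splits a string into an array of grapheme clusters.
function toGraphemes(str) {
  return Array.from(graphemeSegmenter.segment(str), (s) => s.segment);
}

// Keeps at most `max` grapheme clusters, never cutting inside one.
function truncateGraphemes(str, max) {
  return toGraphemes(str).slice(0, max).join("");
}

// Reverses the order of grapheme clusters, leaving each cluster intact.
function reverseGraphemes(str) {
  return toGraphemes(str).reverse().join("");
}

const sample = "ab\u{1F1EF}\u{1F1F5}"; // "ab" followed by the Japan flag

// Naive slicing by code units cuts the flag's first surrogate pair in half.
console.log(sample.slice(0, 3).charCodeAt(2).toString(16)); // d83c (lone high surrogate)
console.log(truncateGraphemes(sample, 2)); // ab

// Naive code point reversal swaps the two Regional Indicators (P + J, not J + P).
console.log([...sample].reverse().join("") === "\u{1F1F5}\u{1F1EF}ba"); // true
console.log(reverseGraphemes(sample) === "\u{1F1EF}\u{1F1F5}ba");       // true
```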
The Solution: Intl.Segmenter
Intl.Segmenter (available in all modern browsers since 2024) correctly segments text by grapheme cluster boundaries:
const segmenter = new Intl.Segmenter(undefined, {
  granularity: "grapheme"
});
// The é here is the precomposed form (U+00E9).
const text = "👨‍👩‍👧‍👦🇯🇵café";
const segments = [...segmenter.segment(text)];
console.log(segments.length); // 6 grapheme clusters (correct!)
// Compare:
console.log(text.length); // 19 (UTF-16 code units: 11 + 4 + 4)
console.log([...text].length); // 13 (code points: 7 + 2 + 4)

This tool uses Intl.Segmenter internally. Each cell in the grid represents one grapheme cluster. Click any cell to see its internal structure.
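For cursor movement, the object returned by segment() also offers containing(), which looks up the cluster that covers a given code-unit offset. A minimal sketch of a "move right" step follows. The helper nextBoundary is hypothetical and ignores editor concerns such as bidirectional text.

```javascript
const cursorSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Hypothetical helper: given a code-unit offset, returns the offset just past
// the grapheme cluster that contains it.
function nextBoundary(str, offset) {
  const seg = cursorSegmenter.segment(str).containing(offset);
  if (seg === undefined) return str.length; // offset is out of range
  return seg.index + seg.segment.length;
}

const line = "a\u{1F44D}\u{1F3FD}b"; // "a", thumbs up with skin tone, "b"

console.log(nextBoundary(line, 0)); // 1 (past "a")
console.log(nextBoundary(line, 1)); // 5 (past the whole emoji cluster)
console.log(nextBoundary(line, 3)); // 5 (an offset inside the cluster resolves to its end)
console.log(nextBoundary(line, 5)); // 6 (past "b")
```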