
Characters Are a Lie: Understanding Grapheme Clusters

Why string.length gives surprising answers, what grapheme clusters really are, and how Intl.Segmenter counts what readers actually see.

The "One Character" Illusion

How many characters are in this string: 👨‍👩‍👧‍👦?

Most people answer “one”, because it looks like a single family emoji. But ask JavaScript, and you get three different answers:

| Method | Result | What it counts |
| --- | --- | --- |
| "👨‍👩‍👧‍👦".length | 11 | UTF-16 code units |
| [..."👨‍👩‍👧‍👦"].length | 7 | Code points |
| Intl.Segmenter | 1 | Grapheme clusters (visual characters) |

The family emoji is composed of 7 code points: four person emoji (man, woman, girl, boy) joined by three Zero Width Joiners (ZWJ, U+200D). In UTF-16, each code point above U+FFFF takes two code units (a surrogate pair), while each ZWJ takes one, so .length returns 4 × 2 + 3 × 1 = 11.
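The three counts can be reproduced with a minimal sketch like the one below. The variable names are illustrative, and the emoji is spelled out with escape sequences so that the otherwise invisible joiners show up in the source.

```javascript
// Family emoji written with escapes so the Zero Width Joiners are visible.
const ZWJ = "\u200D";
const family = ["\u{1F468}", "\u{1F469}", "\u{1F467}", "\u{1F466}"].join(ZWJ);

// 1. UTF-16 code units: what .length reports.
const codeUnits = family.length;

// 2. Code points: the string iterator walks code points, not code units.
const codePoints = [...family].length;

// 3. Grapheme clusters: what a reader perceives as one character.
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
const graphemes = [...segmenter.segment(family)].length;

console.log(codeUnits, codePoints, graphemes); // 11 7 1
```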

Three Different Counts

Understanding the three counts is fundamental to working with Unicode correctly:

| Unit | What it is | Example: 👍🏽 |
| --- | --- | --- |
| UTF-16 code units | What .length counts. Includes surrogate pairs. | 4 units |
| Code points | Basic Unicode unit (U+XXXX). What [...str] gives. | 2 points |
| Grapheme clusters | What humans see as "one character". | 1 cluster |

👍🏽 is two code points: 👍 (U+1F44D) + skin tone modifier 🏽 (U+1F3FD). Each is above U+FFFF, so each takes two UTF-16 code units. But it renders as one grapheme cluster.
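To see where the numbers come from, the sketch below lists the code points of the thumbs-up sequence and derives the surrogate pair for U+1F44D by hand. The helper name toSurrogatePair is hypothetical; the arithmetic is the standard UTF-16 encoding rule.

```javascript
const thumbs = "\u{1F44D}\u{1F3FD}"; // thumbs up + medium skin tone modifier

// Each array element is one code point; .length on it gives its code units.
const breakdown = [...thumbs].map((ch) => ({
  codePoint: "U+" + ch.codePointAt(0).toString(16).toUpperCase(),
  codeUnits: ch.length,
}));
console.log(breakdown);
// [ { codePoint: 'U+1F44D', codeUnits: 2 }, { codePoint: 'U+1F3FD', codeUnits: 2 } ]

// Hypothetical helper: split a code point above U+FFFF into its surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;     // 20 bits remain
  const high = 0xd800 + (offset >> 10);   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f44d);
console.log(high.toString(16), low.toString(16)); // d83d dc4d
console.log(thumbs.charCodeAt(0) === high, thumbs.charCodeAt(1) === low); // true true
```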

Flag Emoji: Regional Indicator Math

Flag emoji are pairs of Regional Indicator symbols forming one grapheme cluster:

| Flag | Code points | Meaning |
| --- | --- | --- |
| 🇯🇵 | U+1F1EF + U+1F1F5 | Regional J + P |
| 🇺🇸 | U+1F1FA + U+1F1F8 | Regional U + S |
| 🇬🇧 | U+1F1EC + U+1F1E7 | Regional G + B |

Three flags, but "🇯🇵🇺🇸🇬🇧".length returns 12: six code points at two code units each. Of the three counting methods, only Intl.Segmenter reports 3 grapheme clusters.
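The "math" is a fixed offset: Regional Indicator letters occupy U+1F1E6 (A) through U+1F1FF (Z), so each ASCII letter maps to its indicator by adding the distance from "A". The following is a rough sketch; flagFromCountryCode is a hypothetical helper, and whether the result renders as a flag depends on the platform's fonts.

```javascript
const REGIONAL_INDICATOR_A = 0x1f1e6; // U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A

// Hypothetical helper: map a two-letter code such as "JP" to its indicator pair.
function flagFromCountryCode(code) {
  if (!/^[A-Za-z]{2}$/.test(code)) {
    throw new RangeError("expected two ASCII letters, got: " + code);
  }
  return [...code.toUpperCase()]
    .map((letter) =>
      String.fromCodePoint(REGIONAL_INDICATOR_A + letter.charCodeAt(0) - 65)
    )
    .join("");
}

const jp = flagFromCountryCode("JP");
console.log([...jp].map((c) => c.codePointAt(0).toString(16))); // [ '1f1ef', '1f1f5' ]
console.log(jp.length); // 4

// Three flags: 12 code units, 6 code points, 3 grapheme clusters.
const flags = ["JP", "US", "GB"].map(flagFromCountryCode).join("");
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
const flagCount = [...segmenter.segment(flags)].length;
console.log(flags.length, [...flags].length, flagCount); // 12 6 3
```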

When Normalization Changes the Count

The string café can be encoded two ways in Unicode:

| Form | Code points | Representation |
| --- | --- | --- |
| NFC (composed) | c a f é (4 CPs) | é = U+00E9 |
| NFD (decomposed) | c a f e ◌́ (5 CPs) | e + combining acute = U+0065 U+0301 |

Both look identical, and both contain 4 grapheme clusters. But the code point count differs (4 vs 5), and so does .length (also 4 vs 5, since every code point here fits in a single UTF-16 code unit).
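A short sketch of the difference, using the built-in String.prototype.normalize to convert between the two forms. The countGraphemes helper is an illustrative name, not part of any API.

```javascript
const composed = "caf\u00E9";    // é as a single code point (NFC)
const decomposed = "cafe\u0301"; // e followed by a combining acute accent (NFD)

// The two strings render the same but are not equal as code unit sequences.
console.log(composed === decomposed);            // false
console.log(composed.length, decomposed.length); // 4 5

// normalize() converts between the forms, which makes comparison reliable.
console.log(composed.normalize("NFD") === decomposed); // true
console.log(decomposed.normalize("NFC") === composed); // true

// Grapheme counting gives the same answer for both forms.
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
const countGraphemes = (s) => [...segmenter.segment(s)].length;
console.log(countGraphemes(composed), countGraphemes(decomposed)); // 4 4
```

Normalizing both sides to the same form before comparing or storing user input is a common way to avoid mismatches between visually identical strings.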

Practical Consequences

Getting character counting wrong causes real bugs:

  • String truncation: Cutting at .length / 2 can split a surrogate pair, leaving a lone surrogate that typically displays as the replacement character U+FFFD.
  • Cursor movement: Should skip the entire grapheme cluster, not individual code points.
  • Input validation: “Max 10 characters” should count grapheme clusters, not .length.
  • String reversal: [...str].reverse().join("") breaks ZWJ sequences and flag emoji.
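Grapheme-aware versions of truncation, validation, and reversal can be sketched as follows. All four function names are hypothetical helpers built on Intl.Segmenter, and the sketch ignores concerns such as bidirectional text that a production implementation might need to handle.

```javascript
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Split a string into an array of grapheme clusters.
function toGraphemes(str) {
  return Array.from(graphemeSegmenter.segment(str), (s) => s.segment);
}

// Keep at most `max` grapheme clusters; never cuts inside a cluster.
function truncateGraphemes(str, max) {
  return toGraphemes(str).slice(0, max).join("");
}

// "Max N characters" measured in grapheme clusters rather than code units.
function isWithinLimit(str, max) {
  return toGraphemes(str).length <= max;
}

// Reverse cluster by cluster so flags and ZWJ sequences stay intact.
function reverseGraphemes(str) {
  return toGraphemes(str).reverse().join("");
}

const sample = "a\u{1F1EF}\u{1F1F5}b"; // "a", the Japan flag, "b"
console.log(truncateGraphemes(sample, 2) === "a\u{1F1EF}\u{1F1F5}"); // true
console.log(isWithinLimit(sample, 3), sample.length);                // true 6
console.log(reverseGraphemes(sample) === "b\u{1F1EF}\u{1F1F5}a");    // true

// The code point reversal swaps the two indicators, so they now spell "PJ".
console.log([...sample].reverse().join("") === "b\u{1F1F5}\u{1F1EF}a"); // true
```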

The Solution: Intl.Segmenter

Intl.Segmenter (available in all modern browsers since 2024) correctly segments text by grapheme cluster boundaries:

const segmenter = new Intl.Segmenter(undefined, {
  granularity: "grapheme"
});

const text = "👨‍👩‍👧‍👦🇯🇵café"; // café in composed (NFC) form
const segments = [...segmenter.segment(text)];
console.log(segments.length); // 6 grapheme clusters: family, flag, c, a, f, é

// Compare:
console.log(text.length);      // 19 (UTF-16 code units: 11 + 4 + 4)
console.log([...text].length); // 13 (code points: 7 + 2 + 4)

This tool uses Intl.Segmenter internally. Each cell in the grid represents one grapheme cluster; click any cell to see its internal structure.
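A per-cluster breakdown of that kind could be produced along these lines. This is only a sketch of the idea, not the tool's actual code; describeClusters and the shape of the returned objects are assumptions made for illustration.

```javascript
// Hypothetical helper: list each grapheme cluster with its offset and code points.
function describeClusters(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text), ({ segment, index }) => ({
    cluster: segment,
    index, // start offset within the string, in UTF-16 code units
    codeUnits: segment.length,
    codePoints: [...segment].map(
      (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
    ),
  }));
}

// "é" in decomposed form followed by a thumbs up with a skin tone modifier.
const rows = describeClusters("e\u0301\u{1F44D}\u{1F3FD}");
for (const row of rows) {
  console.log(row.index, row.codeUnits, row.codePoints.join(" "));
}
// 0 2 U+0065 U+0301
// 2 4 U+1F44D U+1F3FD
```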
