Emoji Under the Hood: ZWJ Sequences, Skin Tones, and Flag Math
How complex emoji are built from multiple code points using ZWJ, variation selectors, and regional indicators.
Simple Emoji: One Code Point, One Glyph
The simplest emoji are single Unicode code points. Characters like 😀 (U+1F600), ❤ (U+2764), and ☀ (U+2600) each occupy exactly one code point. However, “simple” is relative: even a single emoji can occupy more than one unit in memory.
Emoji at or below U+FFFF (like ☀ at U+2600) fit in a single UTF-16 code unit, so "☀".length returns 1. But most modern emoji live above U+FFFF in the Supplementary Multilingual Plane. 😀 at U+1F600 requires a surrogate pair in UTF-16, so "😀".length returns 2 even though it is a single code point.
This distinction matters for any code that indexes into strings. Even the most basic emoji can trip up naive string handling if it falls in the supplementary planes. The key insight: one visual character does not mean one unit of storage.
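The difference between code units and code points can be sketched in a few lines of TypeScript. The helper name `measure` is hypothetical; it uses only the standard `length` property and `Array.from`, which iterates a string by code point.

```typescript
// Hypothetical helper: reports a string's size at two levels of abstraction.
function measure(s: string): { codeUnits: number; codePoints: number } {
  return {
    codeUnits: s.length,              // UTF-16 code units
    codePoints: Array.from(s).length, // string iteration walks code points
  };
}

const sun = "\u2600";         // U+2600, inside the BMP
const grinning = "\u{1F600}"; // U+1F600, supplementary plane

console.log(measure(sun));      // { codeUnits: 1, codePoints: 1 }
console.log(measure(grinning)); // { codeUnits: 2, codePoints: 1 }

// Indexing by code unit lands on half of the surrogate pair.
console.log(grinning.charCodeAt(0).toString(16));   // d83d (high surrogate)
console.log(grinning.codePointAt(0)!.toString(16)); // 1f600
```

Neither number is a grapheme count yet; that needs a segmenter, as the later sections show.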
Text vs Emoji Presentation: VS15 and VS16
Some code points have dual lives. The character ☺ (U+263A) existed in Unicode long before emoji; it was a plain text symbol. When emoji arrived, the same code point gained a colorful emoji rendering. Unicode resolves the ambiguity with two invisible variation selectors:
| Selector | Code point | Effect | Example |
|---|---|---|---|
| VS15 (text) | U+FE0E | Force monochrome/text presentation | ☺︎ |
| VS16 (emoji) | U+FE0F | Force colorful emoji presentation | ☺️ |
The string ☺︎☺️ contains two visually distinct characters, but their base code point is identical; only the trailing variation selector differs. Without a variation selector, the default presentation depends on the platform and context. On most phones, ☺ defaults to emoji style; in many terminal emulators, it defaults to text style.
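As a rough illustration, both presentations can be built by appending a selector to the same base. The function name `withPresentation` is made up for this sketch, and how the result actually renders still depends on the font and platform.

```typescript
const VS15 = "\uFE0E"; // text presentation selector
const VS16 = "\uFE0F"; // emoji presentation selector

// Hypothetical helper: request a presentation style for a base character.
function withPresentation(base: string, style: "text" | "emoji"): string {
  return base + (style === "text" ? VS15 : VS16);
}

const smiley = "\u263A"; // U+263A
const asText = withPresentation(smiley, "text");
const asEmoji = withPresentation(smiley, "emoji");

// Same base code point, different trailing selector.
console.log(asText.codePointAt(0) === asEmoji.codePointAt(0)); // true
console.log(asText === asEmoji);                               // false
console.log(asText.length, asEmoji.length);                    // 2 2
```

Because the two strings are not equal, any comparison or search that should treat them as the same symbol has to normalize the selector first.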
VS16 (U+FE0F) is especially common in ZWJ sequences. The rainbow flag 🏳️‍🌈 contains a VS16 after the white flag to ensure it renders in emoji style before the ZWJ joins it with the rainbow. Stripping VS16 can break the entire sequence.
Skin Tone Modifiers: Fitzpatrick Scale in Unicode
Unicode 8.0 introduced five skin tone modifiers based on the Fitzpatrick dermatological scale. These are code points U+1F3FB through U+1F3FF, placed immediately after a compatible base emoji to change its skin color:
| Modifier | Code point | Fitzpatrick type | Example |
|---|---|---|---|
| 🏻 | U+1F3FB | Type 1-2 (light) | 👍🏻 |
| 🏼 | U+1F3FC | Type 3 (medium-light) | 👍🏼 |
| 🏽 | U+1F3FD | Type 4 (medium) | 👍🏽 |
| 🏾 | U+1F3FE | Type 5 (medium-dark) | 👍🏾 |
| 🏿 | U+1F3FF | Type 6 (dark) | 👍🏿 |
A skin-toned emoji is two code points forming one grapheme cluster. 👍🏽 = U+1F44D (thumbs up) + U+1F3FD (medium skin tone). Both code points are in the supplementary planes, so each requires a surrogate pair in UTF-16: "👍🏽".length returns 4, [..."👍🏽"].length returns 2, but Intl.Segmenter correctly reports 1.
Not every emoji supports skin tones. Applying a modifier to an incompatible base (like a car or a pizza) simply renders the modifier as a separate colored square. The full list of compatible bases is defined in Unicode's emoji-data.txt under the Emoji_Modifier_Base property.
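One way to apply a modifier only where it is valid is to test the base against the `Emoji_Modifier_Base` property. The sketch below assumes a runtime with Unicode property escapes (ES2018 or later); the names `SKIN_TONES` and `applySkinTone` are hypothetical, and the helper handles only single-code-point bases.

```typescript
// The five Fitzpatrick modifiers, U+1F3FB through U+1F3FF.
const SKIN_TONES = {
  light: 0x1f3fb,
  mediumLight: 0x1f3fc,
  medium: 0x1f3fd,
  mediumDark: 0x1f3fe,
  dark: 0x1f3ff,
} as const;

type SkinTone = keyof typeof SKIN_TONES;

// Matches exactly one code point that has the Emoji_Modifier_Base property.
const MODIFIER_BASE = new RegExp("^\\p{Emoji_Modifier_Base}$", "u");

// Hypothetical helper: append a modifier only when the base accepts one.
function applySkinTone(base: string, tone: SkinTone): string {
  if (!MODIFIER_BASE.test(base)) return base; // e.g. a car or a pizza
  return base + String.fromCodePoint(SKIN_TONES[tone]);
}

const thumbsUp = "\u{1F44D}";
const pizza = "\u{1F355}";

const toned = applySkinTone(thumbsUp, "medium");
console.log(toned.length);             // 4 UTF-16 code units
console.log(Array.from(toned).length); // 2 code points
console.log(applySkinTone(pizza, "medium") === pizza); // true, left unchanged
```

Bases that carry a VS16 or sit inside a ZWJ sequence need extra handling that this sketch leaves out.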
ZWJ Sequences: Gluing Emoji Together
The Zero Width Joiner (U+200D) is an invisible character that “glues” emoji together into a single grapheme cluster. The family emoji 👨‍👩‍👧‍👦 is constructed from four individual emoji connected by three ZWJ characters:
👨‍👩‍👧‍👦 = 👨 + ZWJ + 👩 + ZWJ + 👧 + ZWJ + 👦
= U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466
= 7 code points → 11 UTF-16 code units → 1 grapheme cluster
ZWJ sequences power a huge variety of modern emoji. Profession emoji combine a person with a tool: 👩‍🚀 (woman + ZWJ + rocket), 👨‍💻 (man + ZWJ + laptop), 👩‍🔬 (woman + ZWJ + microscope). Couple emoji combine two people with a heart: 👩‍❤️‍👨. The rainbow flag combines a white flag with a rainbow: 🏳️‍🌈 = 🏳 + VS16 + ZWJ + 🌈.
What happens when a platform does not recognize a particular ZWJ sequence? The fallback is graceful: the individual component emoji are shown side by side. This means new ZWJ combinations can be proposed and used before they are officially standardized; older systems simply display the components.
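Composing and decomposing a ZWJ sequence is plain string joining and splitting. The helper names `joinWithZwj` and `splitOnZwj` below are invented for this sketch; splitting approximates what a renderer without the combined glyph falls back to.

```typescript
const ZWJ = "\u200D";

// Hypothetical helpers for composing and decomposing ZWJ sequences.
const joinWithZwj = (...parts: string[]): string => parts.join(ZWJ);
const splitOnZwj = (sequence: string): string[] => sequence.split(ZWJ);

const man = "\u{1F468}";
const woman = "\u{1F469}";
const girl = "\u{1F467}";
const boy = "\u{1F466}";

const family = joinWithZwj(man, woman, girl, boy);
console.log(Array.from(family).length); // 7 code points
console.log(family.length);             // 11 UTF-16 code units

// The components a fallback renderer would show side by side.
console.log(splitOnZwj(family).length); // 4
```

Joining arbitrary emoji this way always yields a string, but only sequences the platform's font supports render as a single glyph.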
Flag Emoji: Regional Indicator Pairs
Country flag emoji are not standalone characters. They are pairs of Regional Indicator Symbols (U+1F1E6 through U+1F1FF), a set of 26 characters that map to the letters A through Z. Two regional indicators together form a flag based on the ISO 3166-1 alpha-2 country code:
| Flag | Indicators | Country code | UTF-16 length |
|---|---|---|---|
| 🇯🇵 | 🇯 (U+1F1EF) + 🇵 (U+1F1F5) | JP (Japan) | 4 |
| 🇺🇸 | 🇺 (U+1F1FA) + 🇸 (U+1F1F8) | US (USA) | 4 |
| 🇬🇧 | 🇬 (U+1F1EC) + 🇧 (U+1F1E7) | GB (UK) | 4 |
Each regional indicator is in the supplementary plane (above U+FFFF), so each needs a surrogate pair in UTF-16. One flag = 2 code points = 4 UTF-16 code units. Three flags side by side: "🇯🇵🇺🇸🇬🇧".length returns 12, but there are only 3 grapheme clusters.
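The "flag math" is a fixed offset from ASCII letters to regional indicators. A minimal sketch follows, with a hypothetical `flagFromCountryCode` helper; it checks only that the input is two ASCII letters, not that the code is an assigned ISO 3166-1 entry.

```typescript
const REGIONAL_INDICATOR_A = 0x1f1e6; // U+1F1E6 corresponds to the letter A
const LATIN_CAPITAL_A = 0x41;

// Hypothetical helper: "JP" becomes U+1F1EF U+1F1F5.
function flagFromCountryCode(code: string): string {
  const upper = code.toUpperCase();
  if (!/^[A-Z]{2}$/.test(upper)) {
    throw new RangeError(`Expected two ASCII letters, got "${code}"`);
  }
  const indicators = Array.from(
    upper,
    (letter) => REGIONAL_INDICATOR_A + (letter.charCodeAt(0) - LATIN_CAPITAL_A)
  );
  return String.fromCodePoint(...indicators);
}

const japan = flagFromCountryCode("JP");
console.log(japan.codePointAt(0)!.toString(16)); // 1f1ef
console.log(japan.codePointAt(2)!.toString(16)); // 1f1f5
console.log(japan.length);                       // 4
```

Whether the resulting pair displays as a flag is up to the platform; unsupported pairs typically show as two letter-like symbols.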
This pairing system means that splitting or concatenating flags carelessly can create unexpected results. Regional indicators pair up from left to right, so if you cut 🇯🇵🇺🇸 in the middle of a flag, for example by dropping the leading 🇯, the remaining indicators re-pair as 🇵🇺 followed by a lone 🇸 (the flag of an unintended country, or an unrecognized pair). This is why grapheme-cluster-aware splitting is essential when handling text containing flags.
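The hazard of an off-by-one cut, and the grapheme-aware alternative, can be sketched as follows. The `graphemes` helper is hypothetical, and the example assumes a runtime that ships `Intl.Segmenter`.

```typescript
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Hypothetical helper: split a string into user-perceived characters.
function graphemes(s: string): string[] {
  return Array.from(segmenter.segment(s), (part) => part.segment);
}

// JP followed by US, written as four regional indicators.
const flags = String.fromCodePoint(0x1f1ef, 0x1f1f5, 0x1f1fa, 0x1f1f8);

// Dropping one code point shifts the pairing: P+U, then a lone S.
const offByOne = Array.from(flags).slice(1).join("");
console.log(graphemes(offByOne).length); // 2

// Cutting on grapheme boundaries keeps each pair intact.
const [first, second] = graphemes(flags);
console.log(first === String.fromCodePoint(0x1f1ef, 0x1f1f5));  // true
console.log(second === String.fromCodePoint(0x1f1fa, 0x1f1f8)); // true
```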
Why emoji.length Is Always Surprising
Bringing it all together, here is what JavaScript's .length reports for various emoji constructions:
| Construction | Emoji | .length | Code points | Grapheme clusters |
|---|---|---|---|---|
| Simple (BMP) | ☀ | 1 | 1 | 1 |
| Simple (SMP) | 😀 | 2 | 1 | 1 |
| With VS16 | ☺️ | 2 | 2 | 1 |
| Skin tone | 👍🏽 | 4 | 2 | 1 |
| Flag | 🇯🇵 | 4 | 2 | 1 |
| ZWJ family | 👨‍👩‍👧‍👦 | 11 | 7 | 1 |
| ZWJ flag | 🏳️‍🌈 | 6 | 4 | 1 |
The pattern is clear: something that looks like one character can have a .length anywhere from 1 to 11 or more. The reliable way to count “characters” as users perceive them is to count grapheme clusters, for example with Intl.Segmenter.
// Emoji-aware character count: one unit per grapheme cluster
const count = (s: string): number =>
  [...new Intl.Segmenter(undefined, { granularity: "grapheme" }).segment(s)].length;
count("👨‍👩‍👧‍👦"); // 1
count("🇯🇵🇺🇸🇬🇧"); // 3
count("☺︎☺️"); // 2
count("👍🏽"); // 1
String reversal is another common pitfall. [..."👨‍👩‍👧‍👦"].reverse().join("") produces a different sequence because it reverses the individual code points (including the ZWJ characters), destroying the intended grouping. Flag emoji fare even worse: reversing 🇯🇵 at the code point level yields 🇵🇯, a different regional indicator pair that no longer represents Japan. Always operate on grapheme clusters, not raw code points or code units.
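A grapheme-aware reversal might look like the sketch below. `reverseGraphemes` is a hypothetical helper, and the example again assumes `Intl.Segmenter` is available in the runtime.

```typescript
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Hypothetical helper: reverse by grapheme cluster instead of by code point.
function reverseGraphemes(s: string): string {
  return Array.from(graphemeSegmenter.segment(s), (part) => part.segment)
    .reverse()
    .join("");
}

const japanFlag = String.fromCodePoint(0x1f1ef, 0x1f1f5);      // J + P
const thumbsUpMedium = String.fromCodePoint(0x1f44d, 0x1f3fd); // base + modifier
const input = japanFlag + "a" + thumbsUpMedium;

// Clusters swap places, but each one keeps its internal order.
console.log(reverseGraphemes(input) === thumbsUpMedium + "a" + japanFlag); // true

// The naive code point reversal turns J+P into P+J.
const naive = Array.from(japanFlag).reverse().join("");
console.log(naive === japanFlag); // false
```

The same pattern (segment, operate on the array, join) also works for truncation and for per-character limits in form validation.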