🧬

Emoji Under the Hood: ZWJ Sequences, Skin Tones, and Flag Math

How complex emoji are built from multiple code points using ZWJ, variation selectors, and regional indicators.

Simple Emoji: One Code Point, One Glyph

The simplest emoji are single Unicode code points. Characters like 😀 (U+1F600), ❤ (U+2764), and ☀ (U+2600) each occupy exactly one code point. However, “simple” is relative: even a single emoji can occupy more than one unit in memory.

Emoji in the Basic Multilingual Plane, at or below U+FFFF (like ☀ at U+2600), fit in a single UTF-16 code unit, so "☀".length returns 1. But most modern emoji live above U+FFFF in the Supplementary Multilingual Plane. 😀 at U+1F600 requires a surrogate pair in UTF-16, so "😀".length returns 2 even though it is a single code point.

This distinction matters for any code that indexes into strings. Even the most basic emoji can trip up naive string handling if it falls in the supplementary planes. The key insight: one visual character does not mean one unit of storage.
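
To make the code-unit versus code-point distinction concrete, here is a minimal sketch. The variable names are illustrative only, and the strings are written as escapes so the code points are explicit.

```typescript
// Assumed sample strings, written as escapes so the code points are explicit.
const sun = "\u{2600}";   // ☀ in the Basic Multilingual Plane
const grin = "\u{1F600}"; // 😀 in the Supplementary Multilingual Plane

// .length and numeric indexing count UTF-16 code units.
console.log(sun.length, grin.length); // 1 2

// charCodeAt(0) sees only the high surrogate; codePointAt(0) decodes the pair.
console.log(grin.charCodeAt(0).toString(16));   // d83d
console.log(grin.codePointAt(0)!.toString(16)); // 1f600

// Slicing by code unit cuts the pair in half and leaves a lone surrogate.
const broken = grin.slice(0, 1);
console.log(broken === "\ud83d"); // true

// Array.from iterates by code point, so the pair stays together.
console.log(Array.from(grin).length); // 1
```

Any API that takes a numeric index (slice, substring, charAt) works in code units, so an index computed from what the user sees can land in the middle of a pair.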

Text vs Emoji Presentation: VS15 and VS16

Some code points have dual lives. The character ☺ (U+263A) existed in Unicode long before emoji, as a plain text symbol. When emoji arrived, the same code point gained a colorful emoji rendering. Unicode resolves the ambiguity with two invisible variation selectors:

| Selector | Code point | Effect | Example |
| --- | --- | --- | --- |
| VS15 (text) | U+FE0E | Force monochrome/text presentation | ☺︎ |
| VS16 (emoji) | U+FE0F | Force colorful emoji presentation | ☺️ |

The string ☺︎☺️ contains two visually distinct characters, but their base code point is identical; only the trailing variation selector differs. Without a variation selector, the default presentation depends on the platform and context. On most phones, ☺ defaults to emoji style; in many terminal emulators, it defaults to text style.

VS16 (U+FE0F) is especially common in ZWJ sequences. The rainbow flag 🏳️‍🌈 contains a VS16 after the white flag to ensure it renders in emoji style before the ZWJ joins it with the rainbow. Stripping VS16 can break the entire sequence.
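
The selector mechanics can be sketched as below. `withPresentation` is a hypothetical helper, not a standard API, and it assumes its input is a single base code point that has both a text and an emoji presentation.

```typescript
const VS15 = "\u{FE0E}"; // text presentation selector
const VS16 = "\u{FE0F}"; // emoji presentation selector

// Hypothetical helper: drop any existing selector, then append the requested one.
// Assumes `base` is a single dual-presentation code point such as U+263A.
function withPresentation(base: string, style: "text" | "emoji"): string {
  const bare = base.replace(/[\uFE0E\uFE0F]/g, "");
  return bare + (style === "text" ? VS15 : VS16);
}

const smile = "\u{263A}"; // ☺
const textSmile = withPresentation(smile, "text");
const emojiSmile = withPresentation(smile, "emoji");
console.log(textSmile === emojiSmile);            // false
console.log(textSmile.length, emojiSmile.length); // 2 2

// Removing VS16 from the rainbow flag changes the stored sequence.
const rainbowFlag = "\u{1F3F3}\u{FE0F}\u{200D}\u{1F308}";
const stripped = rainbowFlag.replace(VS16, "");
console.log(rainbowFlag.length, stripped.length); // 6 5
```

Whether the stripped sequence still renders as a single flag depends on the platform's font and shaping engine, which is why "normalizing away" selectors is risky.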

Skin Tone Modifiers: Fitzpatrick Scale in Unicode

Unicode 8.0 introduced five skin tone modifiers based on the Fitzpatrick dermatological scale. These are code points U+1F3FB through U+1F3FF, placed immediately after a compatible base emoji to change its skin color:

| Modifier | Code point | Fitzpatrick type | Example |
| --- | --- | --- | --- |
| 🏻 | U+1F3FB | Type 1-2 (light) | 👍🏻 |
| 🏼 | U+1F3FC | Type 3 (medium-light) | 👍🏼 |
| 🏽 | U+1F3FD | Type 4 (medium) | 👍🏽 |
| 🏾 | U+1F3FE | Type 5 (medium-dark) | 👍🏾 |
| 🏿 | U+1F3FF | Type 6 (dark) | 👍🏿 |

A skin-toned emoji is two code points forming one grapheme cluster. 👍🏽 = U+1F44D (thumbs up) + U+1F3FD (medium skin tone). Both code points lie in the supplementary plane, so each needs a surrogate pair in UTF-16: "👍🏽".length returns 4, [..."👍🏽"].length returns 2, but Intl.Segmenter correctly reports 1.

Not every emoji supports skin tones. Applying a modifier to an incompatible base (like a car or a pizza) typically renders the modifier as a separate colored square next to the base. The full list of compatible bases is defined in Unicode’s emoji-data.txt under the Emoji_Modifier_Base property.
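
A minimal sketch of applying a modifier only to compatible bases follows. `SKIN_TONES` and `applySkinTone` are hypothetical names. The check relies on the `\p{Emoji_Modifier_Base}` Unicode property escape, which JavaScript regular expressions support with the `u` flag, and it assumes the base is a single code point with no trailing VS16.

```typescript
// Hypothetical lookup table for the five modifiers.
const SKIN_TONES = {
  light: "\u{1F3FB}",
  mediumLight: "\u{1F3FC}",
  medium: "\u{1F3FD}",
  mediumDark: "\u{1F3FE}",
  dark: "\u{1F3FF}",
} as const;

type SkinTone = keyof typeof SKIN_TONES;

// Matches exactly one code point that has the Emoji_Modifier_Base property.
const MODIFIER_BASE = new RegExp("^\\p{Emoji_Modifier_Base}$", "u");

// Assumption: `base` is a single code point without a trailing VS16.
// Incompatible bases are returned unchanged instead of gaining a stray swatch.
function applySkinTone(base: string, tone: SkinTone): string {
  return MODIFIER_BASE.test(base) ? base + SKIN_TONES[tone] : base;
}

const thumbsUp = "\u{1F44D}"; // 👍
const pizza = "\u{1F355}";    // 🍕
console.log(applySkinTone(thumbsUp, "medium") === "\u{1F44D}\u{1F3FD}"); // true
console.log(applySkinTone(pizza, "medium") === pizza);                   // true
```

The result depends on the Unicode version shipped with the JavaScript engine, so very recent modifier bases may not be recognized on older runtimes.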

ZWJ Sequences: Gluing Emoji Together

The Zero Width Joiner (U+200D) is an invisible character that “glues” emoji together into a single grapheme cluster. The family emoji 👨‍👩‍👧‍👦 is constructed from four individual emoji connected by three ZWJ characters:

👨‍👩‍👧‍👦 = 👨 + ZWJ + 👩 + ZWJ + 👧 + ZWJ + 👦
     = U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466
     = 7 code points → 11 UTF-16 code units → 1 grapheme cluster

ZWJ sequences power a huge variety of modern emoji. Profession emoji combine a person with a tool: 👩‍🚀 (woman + ZWJ + rocket), 👨‍💻 (man + ZWJ + laptop), 👩‍🔬 (woman + ZWJ + microscope). Couple emoji combine two people with a heart: 👩‍❤️‍👨. The rainbow flag combines a white flag with a rainbow: 🏳️‍🌈 = 🏳 + VS16 + ZWJ + 🌈.

What happens when a platform does not recognize a particular ZWJ sequence? The fallback is graceful: the individual component emoji are shown side by side. This means new ZWJ combinations can be proposed and used before they are officially standardized; older systems simply display the components.
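
The construction above can be reproduced in a few lines. The helper names (`joinWithZwj`, `splitOnZwj`, `toCodePoints`) are illustrative, not part of any library.

```typescript
const ZWJ = "\u{200D}";

// Glue components into one ZWJ sequence, or take a sequence apart again.
const joinWithZwj = (...parts: string[]): string => parts.join(ZWJ);
const splitOnZwj = (sequence: string): string[] => sequence.split(ZWJ);

// Format each code point as U+XXXX for inspection.
const toCodePoints = (s: string): string[] =>
  Array.from(s, (ch) => "U+" + ch.codePointAt(0)!.toString(16).toUpperCase().padStart(4, "0"));

// man + woman + girl + boy
const family = joinWithZwj("\u{1F468}", "\u{1F469}", "\u{1F467}", "\u{1F466}");

console.log(toCodePoints(family).join(" "));
// U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466
console.log(Array.from(family).length); // 7 code points
console.log(family.length);             // 11 UTF-16 code units
console.log(splitOnZwj(family).length); // 4 components
```

Splitting on U+200D is roughly what an unsupported platform shows: the components, side by side.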

Flag Emoji: Regional Indicator Pairs

Country flag emoji are not standalone characters. They are pairs of Regional Indicator Symbols (U+1F1E6 through U+1F1FF), a set of 26 characters that map to the letters A through Z. Two regional indicators together form a flag based on the ISO 3166-1 alpha-2 country code:

| Flag | Indicators | Country code | UTF-16 length |
| --- | --- | --- | --- |
| 🇯🇵 | 🇯 (U+1F1EF) + 🇵 (U+1F1F5) | JP (Japan) | 4 |
| 🇺🇸 | 🇺 (U+1F1FA) + 🇸 (U+1F1F8) | US (USA) | 4 |
| 🇬🇧 | 🇬 (U+1F1EC) + 🇧 (U+1F1E7) | GB (UK) | 4 |

Each regional indicator is in the supplementary plane (above U+FFFF), so each needs a surrogate pair in UTF-16. One flag = 2 code points = 4 UTF-16 code units. Three flags side by side: "🇯🇵🇺🇸🇬🇧".length returns 12, but there are only 3 grapheme clusters.

This pairing system means that careless slicing can create unexpected results. Indicators pair up from left to right, so if a cut lands in the middle of a flag the rest of the string re-pairs. Drop the leading 🇯 from 🇯🇵🇺🇸 and the remaining 🇵, 🇺 and 🇸 are read as the pair 🇵🇺 followed by a lone 🇸: an unintended or unrecognized pair instead of the US flag. This is why grapheme-cluster-aware splitting is essential when handling text containing flags.
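
The “flag math” is a fixed offset between the Latin capital letters and the regional indicators. The sketch below uses hypothetical helper names and only checks that the input is two ASCII letters; it does not verify that the code is an assigned ISO 3166-1 entry.

```typescript
const REGIONAL_INDICATOR_A = 0x1f1e6; // 🇦
const LATIN_CAPITAL_A = 0x41;         // "A"

// "JP" -> U+1F1EF U+1F1F5. Assumption: any two ASCII letters are accepted.
function countryCodeToFlag(code: string): string {
  if (!/^[A-Za-z]{2}$/.test(code)) {
    throw new RangeError("expected a two-letter country code");
  }
  return Array.from(code.toUpperCase(), (letter) =>
    String.fromCodePoint(letter.charCodeAt(0) - LATIN_CAPITAL_A + REGIONAL_INDICATOR_A)
  ).join("");
}

// Inverse mapping: regional indicators back to ASCII letters.
function indicatorsToLetters(indicators: string): string {
  return Array.from(indicators, (ri) =>
    String.fromCharCode(ri.codePointAt(0)! - REGIONAL_INDICATOR_A + LATIN_CAPITAL_A)
  ).join("");
}

const japanAndUsa = countryCodeToFlag("JP") + countryCodeToFlag("US");
console.log(japanAndUsa.length);               // 8
console.log(indicatorsToLetters(japanAndUsa)); // JPUS

// Dropping the first code point shifts the pairing: P+U, then a lone S.
const shifted = Array.from(japanAndUsa).slice(1).join("");
console.log(indicatorsToLetters(shifted));     // PUS
```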

Why emoji.length Is Always Surprising

Bringing it all together, here is what JavaScript’s .length reports for various emoji constructions:

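| Emoji | Visual | .length | Code points | Grapheme clusters |
| --- | --- | --- | --- | --- |
| Simple (BMP) | ☀ | 1 | 1 | 1 |
| Simple (SMP) | 😀 | 2 | 1 | 1 |
| With VS16 | ☺️ | 2 | 2 | 1 |
| Skin tone | 👍🏽 | 4 | 2 | 1 |
| Flag | 🇯🇵 | 4 | 2 | 1 |
| ZWJ family | 👨‍👩‍👧‍👦 | 11 | 7 | 1 |
| ZWJ flag | 🏳️‍🌈 | 6 | 4 | 1 |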

The pattern is clear: strings that each look like one character have a .length anywhere from 1 to 11 or more. The reliable way to count “characters” as users perceive them is to count grapheme clusters, which Intl.Segmenter does.

// Emoji-aware character count: counts grapheme clusters, not code units
const count = (s: string) =>
  [...new Intl.Segmenter(undefined, { granularity: "grapheme" }).segment(s)].length;

count("👨‍👩‍👧‍👦");  // 1
count("🇯🇵🇺🇸🇬🇧"); // 3
count("☺︎☺️");    // 2
count("👍🏽");    // 1
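
To reproduce the three numeric columns of the table for any string, a small helper along these lines may be useful. `measure` and `EmojiMeasure` are hypothetical names, and the sketch assumes a runtime that provides Intl.Segmenter.

```typescript
interface EmojiMeasure {
  codeUnits: number;  // what .length reports
  codePoints: number; // what string iteration sees
  graphemes: number;  // what the user perceives
}

// One shared segmenter; grapheme granularity is stated explicitly.
const clusterSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

function measure(s: string): EmojiMeasure {
  return {
    codeUnits: s.length,
    codePoints: Array.from(s).length,
    graphemes: Array.from(clusterSegmenter.segment(s)).length,
  };
}

// Rainbow flag: white flag + VS16 + ZWJ + rainbow.
console.log(measure("\u{1F3F3}\u{FE0F}\u{200D}\u{1F308}"));
// { codeUnits: 6, codePoints: 4, graphemes: 1 }

// Thumbs up + medium skin tone.
console.log(measure("\u{1F44D}\u{1F3FD}"));
// { codeUnits: 4, codePoints: 2, graphemes: 1 }
```

Creating the segmenter once and reusing it avoids rebuilding it on every call, which matters if the count runs on each keystroke.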

String reversal is another common pitfall. [..."👨‍👩‍👧‍👦"].reverse().join("") reverses the individual code points (including the ZWJ characters), so the result is a different sequence that platforms typically do not recognize as a single emoji. Flag emoji suffer even worse: reversing 🇯🇵 at the code point level yields the indicator pair P + J, which no longer means Japan (PJ is not an assigned country code, so it typically renders as two letter symbols rather than a flag). Always operate on grapheme clusters, not raw code points or code units.
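
A grapheme-aware reversal can be sketched with the same Intl.Segmenter API. `reverseGraphemes` is a hypothetical helper name; the naive version is included only for comparison.

```typescript
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Reverse the order of grapheme clusters, leaving each cluster intact.
function reverseGraphemes(s: string): string {
  return Array.from(graphemeSegmenter.segment(s), (part) => part.segment)
    .reverse()
    .join("");
}

// Naive version for comparison: reverses individual code points.
const reverseCodePoints = (s: string): string => Array.from(s).reverse().join("");

const japan = "\u{1F1EF}\u{1F1F5}"; // regional indicators J + P
const usa = "\u{1F1FA}\u{1F1F8}";   // regional indicators U + S

console.log(reverseGraphemes(japan + usa) === usa + japan);    // true
console.log(reverseCodePoints(japan) === "\u{1F1F5}\u{1F1EF}"); // true: P + J, no longer Japan
```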
