Unicode Normalization: NFC, NFD, NFKC, NFKD Demystified

Why the same-looking text can have different bytes, when each normalization form matters, and how to see the differences visually.

The Problem: "café" Encoded Two Ways

Type café on two different computers and you might get two completely different byte sequences — even though the text looks identical on screen. The letter é can be represented as a single code point U+00E9 (LATIN SMALL LETTER E WITH ACUTE), or as two code points: U+0065 (e) followed by U+0301 (COMBINING ACUTE ACCENT).

This means 'caf\u00E9' === 'cafe\u0301' evaluates to false in JavaScript, even though both strings render identically. File systems, databases, and search engines all face this problem: the same human-readable text can have multiple valid binary representations. A user searching for “café” might miss results stored in the other form.

Unicode normalization exists to solve exactly this problem. It defines deterministic algorithms for converting equivalent representations into a single form, so that equivalent text compares as equal once both sides are normalized.
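
To make the difference visible, the two spellings can be dumped code point by code point. This is a minimal sketch; `codePoints` is a hypothetical helper name, not a built-in.

```typescript
// Hypothetical helper: format each code point of a string as U+XXXX.
function codePoints(s: string): string[] {
  return Array.from(s, (ch) => {
    const hex = ch.codePointAt(0)!.toString(16).toUpperCase();
    // Left-pad to at least four hex digits, the usual U+ notation.
    return "U+" + (hex.length >= 4 ? hex : "0000".slice(hex.length) + hex);
  });
}

const precomposed = "caf\u00E9"; // é as one code point
const combining = "cafe\u0301";  // e followed by a combining acute accent

console.log(codePoints(precomposed).join(" ")); // U+0063 U+0061 U+0066 U+00E9
console.log(codePoints(combining).join(" "));   // U+0063 U+0061 U+0066 U+0065 U+0301
console.log(precomposed === combining);         // false
```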

Canonical Equivalence: NFC and NFD

The two most common normalization forms handle canonical equivalence — sequences that represent the exact same abstract character:

Form	Name	Strategy	Result for café
NFC	Canonical Decomposition + Composition	Compose into fewest code points	c a f é (4 CPs)
NFD	Canonical Decomposition	Decompose into base + combining marks	c a f e ◌́ (5 CPs)

NFD breaks every precomposed character into its base character plus combining marks. The é (U+00E9) becomes e (U+0065) + combining acute accent (U+0301). Characters that are already in their most decomposed form remain unchanged. NFD also applies canonical ordering to ensure combining marks appear in a deterministic order.
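
Canonical ordering is easiest to see with a base letter that carries two marks. The sketch below assumes the standard combining classes for these marks: dot below (U+0323, class 220) sorts before dot above (U+0307, class 230). The variable names are illustrative.

```typescript
// The same letter with the same two marks, typed in opposite orders.
const dotAboveFirst = "q\u0307\u0323"; // q + dot above + dot below
const dotBelowFirst = "q\u0323\u0307"; // q + dot below + dot above

console.log(dotAboveFirst === dotBelowFirst); // false: different code point order

// NFD reorders the marks by combining class, so both inputs converge.
const orderedA = dotAboveFirst.normalize("NFD");
const orderedB = dotBelowFirst.normalize("NFD");

console.log(orderedA === orderedB);           // true
console.log(orderedA === "q\u0323\u0307");    // true: dot below now comes first
```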

NFC first performs the full NFD decomposition, then recomposes the result using canonical composition rules. The net effect is that text ends up in precomposed form wherever Unicode defines a suitable precomposed character, which usually means the fewest code points (a small set of composition exclusions stays decomposed). NFC is by far the most common normalization form on the web: the W3C recommends NFC for web content, and most text produced by keyboards and input methods is already in NFC. The notable exception is macOS, whose HFS+ file system historically stored filenames in a variant of NFD.

// NFC and NFD produce the same visual result
const nfc = "caf\u00E9";           // "café" — 4 code points
const nfd = "cafe\u0301";          // "café" — 5 code points

nfc === nfd;                        // false!
nfc.normalize("NFC") === nfd.normalize("NFC");  // true
nfc.normalize("NFD") === nfd.normalize("NFD");  // true

Compatibility Equivalence: NFKC and NFKD

Beyond canonical equivalence, Unicode defines compatibility equivalence: characters that are semantically similar but not identical. These are characters that were given separate code points for historical or formatting reasons but are considered “the same” for many practical purposes.

Original	NFKC/NFKD result	Relationship
ﬁ (U+FB01)	fi	Ligature → component letters
ﬄ (U+FB04)	ffl	Ligature → component letters
㍻ (U+337B)	平成	Square composition → characters
㌔ (U+3314)	キロ	Square composition → characters
㍑ (U+3351)	リットル	Square composition → characters
① (U+2460)	1	Enclosed numeral → digit
Ｈ (U+FF28)	H	Fullwidth → ASCII
ℌ (U+210C)	H	Letterlike symbol → letter

NFKD (Compatibility Decomposition) performs canonical decomposition plus compatibility decomposition, breaking apart ligatures, square compositions, and other compatibility characters. NFKC (Compatibility Decomposition + Composition) does the same decomposition and then recomposes using canonical composition — like NFC applied after NFKD.

Compatibility normalization is lossy: the round-trip ㍻ → 平成 → ㍻ is impossible because NFKC/NFKD destroys the information that the original was a single square composition character. Use compatibility normalization for search and matching, but preserve the original form for display.
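
The loss can be checked directly. In this sketch (variable names are illustrative), each compatibility character from the table is folded with NFKC, and then every normalization form is tried on the result to see whether any of them brings the original back.

```typescript
// Compatibility characters from the table above.
const samples = ["\uFB01", "\u337B", "\u2460", "\uFF28", "\u210C"];
const forms = ["NFC", "NFD", "NFKC", "NFKD"];

const results = samples.map((original) => {
  const folded = original.normalize("NFKC");
  // Can any normalization form turn the folded text back into the original?
  const recoverable = forms.some((form) => folded.normalize(form) === original);
  return { original, folded, recoverable };
});

for (const r of results) {
  console.log(`${r.original} -> ${r.folded} (recoverable: ${r.recoverable})`);
}
// Every line reports recoverable: false, so keep the original if you need it.
```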

The 4-Form Comparison Matrix

The four normalization forms can be organized along two axes: canonical vs. compatibility decomposition, and whether recomposition is applied:

	Canonical only	Canonical + Compatibility
Composed	NFC	NFKC
Decomposed	NFD	NFKD

Here is how each example character transforms under each form:

Input	NFC	NFD	NFKC	NFKD
é (U+00E9)	é (U+00E9)	e◌́ (U+0065 U+0301)	é (U+00E9)	e◌́ (U+0065 U+0301)
ﬁ (U+FB01)	ﬁ (U+FB01)	ﬁ (U+FB01)	fi (U+0066 U+0069)	fi (U+0066 U+0069)
㍻ (U+337B)	㍻ (U+337B)	㍻ (U+337B)	平成	平成
Å (U+00C5)	Å (U+00C5)	A◌̊ (U+0041 U+030A)	Å (U+00C5)	A◌̊ (U+0041 U+030A)
Å (U+212B)	Å (U+00C5)	A◌̊ (U+0041 U+030A)	Å (U+00C5)	A◌̊ (U+0041 U+030A)

Notice the last two rows: U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE) and U+212B (ANGSTROM SIGN) are canonically equivalent. NFC maps both to U+00C5, NFD maps both to U+0041 U+030A, and the compatibility forms follow the same pattern. This is a case where two distinct code points represent the same abstract character.
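
The last two rows, together with the already decomposed spelling, can be verified in a few lines. This is a small sketch; the variable names are illustrative.

```typescript
// Three spellings of Å.
const letterRing = "\u00C5";   // LATIN CAPITAL LETTER A WITH RING ABOVE
const angstromSign = "\u212B"; // ANGSTROM SIGN
const decomposed = "A\u030A";  // A + COMBINING RING ABOVE

const spellings = [letterRing, angstromSign, decomposed];

// NFC maps all three to U+00C5; NFD maps all three to U+0041 U+030A.
const allNfcEqual = spellings.every((s) => s.normalize("NFC") === "\u00C5");
const allNfdEqual = spellings.every((s) => s.normalize("NFD") === "A\u030A");

console.log(letterRing === angstromSign); // false before normalization
console.log(allNfcEqual);                 // true
console.log(allNfdEqual);                 // true
```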

Practical Impact: Comparison, Databases, and Security

Normalization issues surface in surprisingly many real-world systems:

String comparison: Without normalization, === fails on visually identical strings. Always normalize before comparing user input.
Database uniqueness: A UNIQUE constraint on a username field will allow both café (NFC) and café (NFD) as separate entries. PostgreSQL and SQLite do not normalize by default.
Search and indexing: Search engines often apply NFKC-style normalization during analysis so that ﬁnd matches find. In Elasticsearch, for example, this is available through the ICU analysis plugin rather than the default analyzer.
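Security: Look-alike (homoglyph) attacks can exploit normalization differences. An attacker might register a username or domain using compatibility characters that normalize to a known brand name, so identifiers should be compared in normalized form. SASLprep (RFC 4013) applied NFKC to usernames and passwords; the PRECIS profiles that replaced it (RFC 8265, built on the RFC 8264 framework) specify NFC instead.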
File systems: macOS HFS+ stores filenames in a modified NFD form, while most Linux systems store bytes as-is. Copying files between systems can create duplicates that look identical but differ at the byte level.

// Always normalize before comparing
function safeEquals(a: string, b: string): boolean {
  return a.normalize("NFC") === b.normalize("NFC");
}

// For search: use NFKC to catch compatibility variants
function normalizeForSearch(s: string): string {
  return s.normalize("NFKC").toLowerCase();
}

normalizeForSearch("ﬁnd");  // "find"
normalizeForSearch("㍻");   // "平成"
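
The uniqueness problem can be sketched without a real database. The class below is a hypothetical in-memory stand-in for a UNIQUE constraint: it keys every name by its NFC form, so the two spellings of café collide as they should. In a real system the same idea would mean normalizing before the value reaches the database.

```typescript
// Hypothetical in-memory stand-in for a UNIQUE constraint on usernames.
class UsernameRegistry {
  private readonly taken = new Set<string>();

  // Returns true if the name was free, false if an equivalent name exists.
  register(name: string): boolean {
    const key = name.normalize("NFC"); // compare on the canonical form
    if (this.taken.has(key)) {
      return false;
    }
    this.taken.add(key);
    return true;
  }
}

const registry = new UsernameRegistry();
const firstAttempt = registry.register("caf\u00E9");   // NFC spelling
const secondAttempt = registry.register("cafe\u0301"); // NFD spelling

console.log(firstAttempt);  // true
console.log(secondAttempt); // false: same name after normalization
```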

When to Use Which Form

Choosing the right normalization form depends on your use case:

Use case	Recommended form	Why
Web content / HTML	NFC	W3C recommendation; usually the most compact representation
Database storage	NFC	Consistent canonical form; preserves compatibility characters such as ligatures
String comparison	NFC or NFD	Either works as long as both sides agree
Search / indexing	NFKC	Catches ligatures, fullwidth, and other compatibility variants
Username / password	NFC (PRECIS, RFC 8265)	Current PRECIS profiles specify NFC; the older SASLprep (RFC 4013) used NFKC
Display / rendering	Preserve original	Do not apply compatibility normalization to display text; you lose formatting intent

The golden rule: normalize early, normalize consistently. Pick one form for your system and apply it at the boundary where text enters. NFC is the safe default for most applications. Switch to NFKC only when you specifically need to collapse compatibility variants (search, security checks).
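
One way to apply this rule is a single entry-point function that every incoming string passes through. The sketch below is an assumption about how such a boundary might look: `StoredText`, `ingest`, and `isNfc` are hypothetical names. It stores NFC text for display and storage, and derives a separate NFKC key for search.

```typescript
// Hypothetical record shape: canonical text plus a derived search key.
interface StoredText {
  text: string;      // NFC: safe to store, compare, and display
  searchKey: string; // NFKC + lowercase: for matching only
}

// Normalize once, at the point where text enters the system.
function ingest(raw: string): StoredText {
  return {
    text: raw.normalize("NFC"),
    searchKey: raw.normalize("NFKC").toLowerCase(),
  };
}

// Cheap check for whether a string is already in NFC.
function isNfc(s: string): boolean {
  return s === s.normalize("NFC");
}

const record = ingest("\uFB01nd cafe\u0301"); // ligature + decomposed é
console.log(record.text === "\uFB01nd caf\u00E9"); // true: ligature kept, é composed
console.log(record.searchKey === "find caf\u00E9"); // true: ligature folded for search
console.log(isNfc("cafe\u0301"));                   // false
```

Keeping the two fields separate means search can be forgiving while the stored text still reflects what the user typed, apart from canonical composition.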

Be cautious with NFKC/NFKD in contexts where formatting matters. Converting ㍻ (a single-character representation of the Japanese era name “Heisei”) to 平成 is helpful for search, but the original compact form carries distinct semantic meaning in Japanese typography. Similarly, converting ﬁ to fi is correct for search but may affect typographic rendering in some fonts.
