Unicode Normalization: NFC, NFD, NFKC, NFKD Demystified
Why the same-looking text can have different bytes, when each normalization form matters, and how to see the differences visually.
The Problem: "café" Encoded Two Ways
Type café on two different computers and you might get two completely different byte sequences — even though the text looks identical on screen. The letter é can be represented as a single code point U+00E9 (LATIN SMALL LETTER E WITH ACUTE), or as two code points: U+0065 (e) followed by U+0301 (COMBINING ACUTE ACCENT).
This means 'caf\u00E9' === 'cafe\u0301' evaluates to false in JavaScript, even though both strings render identically. File systems, databases, and search engines all face this problem: the same human-readable text can have multiple valid binary representations. A user searching for “café” might miss results stored in the other form.
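To see the difference concretely, a small helper can list the code points behind each string. This is a minimal sketch: `toCodePoints` is a hypothetical helper name, not part of any library, and it relies only on standard string methods.

```typescript
// Hypothetical helper: list each code point of a string as "U+XXXX".
function toCodePoints(s: string): string[] {
  // Array.from iterates by code point, so surrogate pairs stay together.
  return Array.from(s, (ch) => {
    const hex = (ch.codePointAt(0) as number).toString(16).toUpperCase();
    return "U+" + (hex.length < 4 ? "0".repeat(4 - hex.length) + hex : hex);
  });
}

const precomposed: string = "caf\u00E9"; // é as one code point
const decomposed: string = "cafe\u0301"; // e followed by a combining acute accent

console.log(toCodePoints(precomposed)); // ["U+0063", "U+0061", "U+0066", "U+00E9"]
console.log(toCodePoints(decomposed)); // ["U+0063", "U+0061", "U+0066", "U+0065", "U+0301"]
console.log(precomposed === decomposed); // false
```

Printing code points like this is often the quickest way to confirm that two "identical" strings really differ in representation.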
Unicode normalization exists to solve exactly this problem. It defines deterministic algorithms for converting between equivalent representations, ensuring that text which looks the same actually compares as equal.
Canonical Equivalence: NFC and NFD
The two most common normalization forms handle canonical equivalence — sequences that represent the exact same abstract character:
| Form | Name | Strategy | café |
|---|---|---|---|
| NFC | Canonical Decomposition + Composition | Compose into fewest code points | c a f é (4 CPs) |
| NFD | Canonical Decomposition | Decompose into base + combining marks | c a f e ◌́ (5 CPs) |
NFD breaks every precomposed character into its base character plus combining marks. The é (U+00E9) becomes e (U+0065) + combining acute accent (U+0301). Characters that are already in their most decomposed form remain unchanged. NFD also applies canonical ordering to ensure combining marks appear in a deterministic order.
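The canonical ordering step can be seen with two combining marks that have no precomposed form. The sketch below uses only the standard `String.prototype.normalize`; the variable names are illustrative.

```typescript
// Two combining marks on "q": dot above (U+0307) and dot below (U+0323).
// Unicode has no precomposed "q" with either dot, so the marks stay separate.
const aboveThenBelow: string = "q\u0307\u0323";
const belowThenAbove: string = "q\u0323\u0307";

console.log(aboveThenBelow === belowThenAbove); // false: raw order differs

// Canonical ordering sorts marks by combining class:
// dot below (class 220) comes before dot above (class 230).
console.log(aboveThenBelow.normalize("NFD") === belowThenAbove); // true
console.log(belowThenAbove.normalize("NFD") === belowThenAbove); // true: already ordered
```

Because both inputs end up in the same order, the two spellings compare as equal once they are normalized.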
NFC first performs the full NFD decomposition, then recomposes the result using canonical composition rules. The net effect is that each character is generally represented by the fewest code points available, although a small set of composition exclusions stays decomposed. NFC is by far the most common normalization form on the web: the W3C recommends NFC for all web content, and most text produced by modern operating systems and input methods is already in NFC. The notable exception is macOS, whose HFS+ file system historically stored filenames in a variant of NFD.
// NFC and NFD produce the same visual result
const nfc = "caf\u00E9"; // "café" — 4 code points
const nfd = "cafe\u0301"; // "café" — 5 code points
nfc === nfd; // false!
nfc.normalize("NFC") === nfd.normalize("NFC"); // true
nfc.normalize("NFD") === nfd.normalize("NFD"); // true
Compatibility Equivalence: NFKC and NFKD
Beyond canonical equivalence, Unicode defines compatibility equivalence: characters that are semantically similar but not identical. These are characters that were given separate code points for historical or formatting reasons but are considered “the same” for many practical purposes.
| Original | NFKC/NFKD result | Relationship |
|---|---|---|
| ﬁ (U+FB01) | fi | Ligature → component letters |
| ﬄ (U+FB04) | ffl | Ligature → component letters |
| ㍻ (U+337B) | 平成 | Square composition → characters |
| ㌔ (U+3314) | キロ | Square composition → characters |
| ㍑ (U+3351) | リットル | Square composition → characters |
| ① (U+2460) | 1 | Enclosed numeral → digit |
| Ｈ (U+FF28) | H | Fullwidth → ASCII |
| ℌ (U+210C) | H | Letterlike symbol → letter |
NFKD (Compatibility Decomposition) performs canonical decomposition plus compatibility decomposition, breaking apart ligatures, square compositions, and other compatibility characters. NFKC (Compatibility Decomposition + Composition) does the same decomposition and then recomposes using canonical composition — like NFC applied after NFKD.
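A few of the mappings in the table can be checked directly. This is a small sketch using the standard `normalize` method; the inputs are written as escapes so the exact code points are unambiguous, and the variable names are illustrative.

```typescript
const ligature: string = "\uFB01"; // LATIN SMALL LIGATURE FI
const fullwidthH: string = "\uFF28"; // FULLWIDTH LATIN CAPITAL LETTER H
const circledOne: string = "\u2460"; // CIRCLED DIGIT ONE

// Canonical forms leave compatibility characters alone.
console.log(ligature.normalize("NFC") === ligature); // true

// Compatibility forms replace them with their plain equivalents.
console.log(ligature.normalize("NFKC")); // "fi"
console.log(fullwidthH.normalize("NFKC")); // "H"
console.log(circledOne.normalize("NFKD")); // "1"

// NFKC and NFKD differ only in the final composition step.
const fullwidthEAcute: string = "\uFF45\u0301"; // fullwidth e + combining acute
console.log(fullwidthEAcute.normalize("NFKD") === "e\u0301"); // true
console.log(fullwidthEAcute.normalize("NFKC") === "\u00E9"); // true
```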
Compatibility normalization is lossy: the round-trip ㍻ → 平成 → ㍻ is impossible because NFKC/NFKD destroys the information that the original was a single square composition character. Use compatibility normalization for search and matching, but preserve the original form for display.
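One way to follow that advice is to store the original text next to a separate normalized key. The sketch below is an illustration under stated assumptions: `IndexedEntry`, `indexEntry`, and `matches` are hypothetical names, and a real search index would do considerably more than a substring check.

```typescript
interface IndexedEntry {
  display: string; // original text, untouched, used for rendering
  searchKey: string; // NFKC-normalized and lowercased, used for matching only
}

function indexEntry(text: string): IndexedEntry {
  return { display: text, searchKey: text.normalize("NFKC").toLowerCase() };
}

function matches(entry: IndexedEntry, query: string): boolean {
  // Normalize the query the same way as the stored key.
  return entry.searchKey.includes(query.normalize("NFKC").toLowerCase());
}

// U+337B (square era name Heisei) followed by U+5143 U+5E74
const entry = indexEntry("\u337B\u5143\u5E74");

console.log(matches(entry, "\u5E73\u6210")); // true: query is the two-character spelling
console.log(entry.display === "\u337B\u5143\u5E74"); // true: original form preserved
```

Keeping both fields avoids the lossy round-trip: the key can be regenerated at any time, while the display text is never rewritten.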
The 4-Form Comparison Matrix
The four normalization forms can be organized along two axes: canonical vs. compatibility decomposition, and whether recomposition is applied:
| | Canonical only | Canonical + Compatibility |
|---|---|---|
| Composed | NFC | NFKC |
| Decomposed | NFD | NFKD |
Here is how a single example character transforms under each form:
| Input | NFC | NFD | NFKC | NFKD |
|---|---|---|---|---|
| é (U+00E9) | é (U+00E9) | e◌́ (U+0065 U+0301) | é (U+00E9) | e◌́ (U+0065 U+0301) |
| ﬁ (U+FB01) | ﬁ (U+FB01) | ﬁ (U+FB01) | fi (U+0066 U+0069) | fi (U+0066 U+0069) |
| ㍻ (U+337B) | ㍻ (U+337B) | ㍻ (U+337B) | 平成 | 平成 |
| Å (U+00C5) | Å (U+00C5) | A◌̊ (U+0041 U+030A) | Å (U+00C5) | A◌̊ (U+0041 U+030A) |
| Å (U+212B) | Å (U+00C5) | A◌̊ (U+0041 U+030A) | Å (U+00C5) | A◌̊ (U+0041 U+030A) |
Notice the last two rows: U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE) and U+212B (ANGSTROM SIGN) are canonically equivalent. NFC maps both to U+00C5, NFD maps both to U+0041 U+030A, and the compatibility forms follow the same pattern. This is a case where two distinct code points represent the same abstract character.
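Rows of the matrix can be reproduced with a short helper that applies all four forms to one input. This is a sketch: `describeForms` is a hypothetical helper, and it prints unpadded hex code points rather than the U+XXXX notation used in the table.

```typescript
type Form = "NFC" | "NFD" | "NFKC" | "NFKD";
const FORMS: Form[] = ["NFC", "NFD", "NFKC", "NFKD"];

// Hypothetical helper: hex code points of `s` under each normalization form.
function describeForms(s: string): Record<Form, string> {
  const result = {} as Record<Form, string>;
  for (const form of FORMS) {
    result[form] = Array.from(s.normalize(form), (ch) =>
      (ch.codePointAt(0) as number).toString(16).toUpperCase()
    ).join(" ");
  }
  return result;
}

console.log(describeForms("\u212B")); // ANGSTROM SIGN
// { NFC: "C5", NFD: "41 30A", NFKC: "C5", NFKD: "41 30A" }

console.log(describeForms("\uFB01")); // LATIN SMALL LIGATURE FI
// { NFC: "FB01", NFD: "FB01", NFKC: "66 69", NFKD: "66 69" }
```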
Practical Impact: Comparison, Databases, and Security
Normalization issues surface in surprisingly many real-world systems:
- String comparison: Without normalization, === fails on visually identical strings. Always normalize before comparing user input.
- Database uniqueness: A UNIQUE constraint on a username field will allow both café (NFC) and café (NFD) as separate entries. PostgreSQL and SQLite do not normalize by default.
- Search and indexing: Elasticsearch (through its ICU analysis plugin) and other search engines can apply NFKC-based normalization in their analyzers to ensure ﬁnd (written with the ﬁ ligature) matches find.
- Security: Homoglyph attacks can exploit normalization differences. An attacker might register a domain using compatibility characters that normalize to a known brand name. NFKC was used by SASLprep (RFC 4013) for username and password preparation; its successor PRECIS (RFC 8264, with the username and password profiles in RFC 8265) specifies NFC for those profiles.
- File systems: macOS HFS+ stores filenames in a modified NFD form, while most Linux systems store bytes as-is. Copying files between systems can create duplicates that look identical but differ at the byte level.
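The uniqueness and file-system cases above come down to the same check: do two raw strings collapse to one value after normalization? A minimal sketch follows; `findNormalizationCollisions` is a hypothetical helper, and NFC is assumed as the comparison form.

```typescript
// Hypothetical helper: group strings that differ in raw form but are
// identical once normalized to NFC.
function findNormalizationCollisions(names: string[]): string[][] {
  const groups = new Map<string, string[]>();
  for (const name of names) {
    const key = name.normalize("NFC");
    const group = groups.get(key) ?? [];
    group.push(name);
    groups.set(key, group);
  }
  // Keep only groups with more than one distinct raw spelling.
  return Array.from(groups.values()).filter((g) => new Set(g).size > 1);
}

const filenames = ["caf\u00E9.txt", "cafe\u0301.txt", "menu.txt"];
const collisions = findNormalizationCollisions(filenames);

console.log(collisions.length); // 1
console.log(collisions[0].length); // 2: the NFC and NFD spellings of the same name
```

Running a check like this before adding a UNIQUE constraint or syncing directories between systems can surface look-alike duplicates early.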
// Always normalize before comparing
function safeEquals(a: string, b: string): boolean {
  return a.normalize("NFC") === b.normalize("NFC");
}
// For search: use NFKC to catch compatibility variants
function normalizeForSearch(s: string): string {
  return s.normalize("NFKC").toLowerCase();
}
normalizeForSearch("\uFB01nd"); // "find" (the input starts with the ﬁ ligature, U+FB01)
normalizeForSearch("㍻"); // "平成"
When to Use Which Form
Choosing the right normalization form depends on your use case:
| Use case | Recommended form | Why |
|---|---|---|
| Web content / HTML | NFC | W3C recommendation; smallest representation |
| Database storage | NFC | Consistent canonical form; preserves formatting characters |
| String comparison | NFC or NFD | Either works as long as both sides agree |
| Search / indexing | NFKC | Catches ligatures, fullwidth, and other compatibility variants |
| Username / password | NFC (PRECIS) | The RFC 8265 username and password profiles specify NFC; the older SASLprep (RFC 4013) used NFKC |
| Display / rendering | Preserve original | Do not normalize display text; you lose formatting intent |
The golden rule: normalize early, normalize consistently. Pick one form for your system and apply it at the boundary where text enters. NFC is the safe default for most applications. Switch to NFKC only when you specifically need to collapse compatibility variants (search, security checks).
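Normalizing at the boundary can be sketched as a single entry function that every incoming string passes through. The names `normalizeInput`, `storedTags`, and `addTag` are hypothetical, and the in-memory Set stands in for whatever storage a real system uses.

```typescript
// Hypothetical boundary function: every string entering the system goes through here.
function normalizeInput(raw: string): string {
  return raw.normalize("NFC");
}

// Hypothetical in-memory store; values are normalized once, on the way in.
const storedTags = new Set<string>();

function addTag(raw: string): boolean {
  const tag = normalizeInput(raw);
  if (storedTags.has(tag)) return false; // duplicate after normalization
  storedTags.add(tag);
  return true;
}

console.log(addTag("caf\u00E9")); // true: first insert
console.log(addTag("cafe\u0301")); // false: same text, supplied in NFD
console.log(storedTags.size); // 1
```

Because everything downstream of `normalizeInput` sees a single form, later comparisons and lookups can use plain equality.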
Be cautious with NFKC/NFKD in contexts where formatting matters. Converting ㍻ (a single-character representation of the Japanese era name “Heisei”) to 平成 is helpful for search, but the original compact form carries distinct semantic meaning in Japanese typography. Similarly, converting the ligature ﬁ (U+FB01) to the two letters fi is correct for search but may affect typographic rendering in some fonts.