Unicode Normalization: NFC, NFD, NFKC, NFKD Demystified

Why the same-looking text can have different bytes, when each normalization form matters, and how to see the differences visually.

The Problem: "café" Encoded Two Ways

Type café on two different computers and you might get two completely different byte sequences — even though the text looks identical on screen. The letter é can be represented as a single code point U+00E9 (LATIN SMALL LETTER E WITH ACUTE), or as two code points: U+0065 (e) followed by U+0301 (COMBINING ACUTE ACCENT).

This means 'caf\u00E9' === 'cafe\u0301' evaluates to false in JavaScript, even though both strings render identically. File systems, databases, and search engines all face this problem: the same human-readable text can have multiple valid binary representations. A user searching for “café” might miss results stored in the other form.

Unicode normalization exists to solve exactly this problem. It defines deterministic algorithms that convert equivalent sequences to a single representation, so that equivalent text compares as equal once both sides have been normalized to the same form.
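To make the difference between the two encodings visible, the following sketch lists the code points of each string. The helper name `describeCodePoints` is an assumption introduced for illustration; it is not part of any standard API.

```typescript
// Hypothetical helper: list a string's code points as "U+XXXX" labels.
function describeCodePoints(text: string): string[] {
  return Array.from(text, (ch) => {
    const hex = (ch.codePointAt(0) as number).toString(16).toUpperCase();
    // Left-pad to at least four hex digits, as in Unicode notation.
    return "U+" + "0000".slice(hex.length) + hex;
  });
}

const precomposed: string = "caf\u00E9"; // é as a single code point
const combining: string = "cafe\u0301";  // e followed by a combining acute accent

console.log(describeCodePoints(precomposed).join(" ")); // U+0063 U+0061 U+0066 U+00E9
console.log(describeCodePoints(combining).join(" "));   // U+0063 U+0061 U+0066 U+0065 U+0301
console.log(precomposed === combining);                 // false
```

Both strings render as café, yet the second one carries an extra code point, which is why a raw comparison fails.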

Canonical Equivalence: NFC and NFD

The two most common normalization forms handle canonical equivalence — sequences that represent the exact same abstract character:

| Form | Name | Strategy | café |
| --- | --- | --- | --- |
| NFC | Canonical Decomposition + Composition | Compose into fewest code points | c a f é (4 CPs) |
| NFD | Canonical Decomposition | Decompose into base + combining marks | c a f e ◌́ (5 CPs) |

NFD breaks every precomposed character into its base character plus combining marks. The é (U+00E9) becomes e (U+0065) + combining acute accent (U+0301). Characters that are already in their most decomposed form remain unchanged. NFD also applies canonical ordering to ensure combining marks appear in a deterministic order.
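Canonical ordering is easiest to see with two marks that attach to different sides of the base letter. The sketch below uses q with a dot above (U+0307, combining class 230) and a dot below (U+0323, combining class 220); the variable names are illustrative only.

```typescript
// Same base letter and marks, typed in two different orders.
const aboveThenBelow: string = "q\u0307\u0323"; // dot above, then dot below
const belowThenAbove: string = "q\u0323\u0307"; // dot below, then dot above

// The raw strings differ because the code point order differs.
console.log(aboveThenBelow === belowThenAbove); // false

// NFD sorts adjacent marks by combining class (220 before 230),
// so both inputs end up as q + U+0323 + U+0307.
const orderedA = aboveThenBelow.normalize("NFD");
const orderedB = belowThenAbove.normalize("NFD");
console.log(orderedA === orderedB);         // true
console.log(orderedA === "q\u0323\u0307");  // true
```

Marks that share a combining class are not reordered, because their relative order can change how the text renders.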

NFC first performs the full NFD decomposition, then recomposes the result using canonical composition rules. The net effect is that text generally ends up with the fewest code points its canonical equivalents allow (a small set of composition exclusions stays decomposed). NFC is by far the most common normalization form on the web: the W3C recommends NFC for web content, and most keyboards and input methods produce NFC text by default. The notable exception is macOS, whose HFS+ file system historically stored filenames in a variant of NFD.

// NFC and NFD produce the same visual result
const nfc = "caf\u00E9";           // "café" — 4 code points
const nfd = "cafe\u0301";          // "café" — 5 code points

nfc === nfd;                        // false!
nfc.normalize("NFC") === nfd.normalize("NFC");  // true
nfc.normalize("NFD") === nfd.normalize("NFD");  // true

Compatibility Equivalence: NFKC and NFKD

Beyond canonical equivalence, Unicode defines compatibility equivalence: characters that are semantically similar but not identical. These are characters that were given separate code points for historical or formatting reasons but are considered “the same” for many practical purposes.

| Original | NFKC/NFKD result | Relationship |
| --- | --- | --- |
| ﬁ (U+FB01) | fi | Ligature → component letters |
| ﬄ (U+FB04) | ffl | Ligature → component letters |
| ㍻ (U+337B) | 平成 | Square composition → characters |
| ㌔ (U+3314) | キロ | Square composition → characters |
| ㍑ (U+3351) | リットル | Square composition → characters |
| ① (U+2460) | 1 | Enclosed numeral → digit |
| Ｈ (U+FF28) | H | Fullwidth → ASCII |
| ℌ (U+210C) | H | Letterlike symbol → letter |

NFKD (Compatibility Decomposition) performs canonical decomposition plus compatibility decomposition, breaking apart ligatures, square compositions, and other compatibility characters. NFKC (Compatibility Decomposition + Composition) does the same decomposition and then recomposes using canonical composition — like NFC applied after NFKD.

Compatibility normalization is lossy: the round trip 平成 → ㍻ is impossible because NFKC/NFKD discards the information that the original was a single square composition character. Use compatibility normalization for search and matching, but preserve the original form for display.
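A short sketch of this lossiness, using only the built-in `String.prototype.normalize`. The variable names are illustrative.

```typescript
const ligature: string = "\uFB01";  // LATIN SMALL LIGATURE FI
const eraSquare: string = "\u337B"; // SQUARE ERA NAME HEISEI

// Canonical forms leave compatibility characters untouched.
console.log(ligature.normalize("NFC") === ligature); // true

// Compatibility forms replace them with plain equivalents.
console.log(ligature.normalize("NFKC"));   // "fi"
console.log(eraSquare.normalize("NFKC"));  // "平成"
console.log("\uFF28".normalize("NFKC"));   // "H" (from the fullwidth H)

// Once collapsed, no normalization form restores the original character.
const collapsed = eraSquare.normalize("NFKC");
const allForms: Array<"NFC" | "NFD" | "NFKC" | "NFKD"> = ["NFC", "NFD", "NFKC", "NFKD"];
const recoverable = allForms.some((form) => collapsed.normalize(form) === eraSquare);
console.log(recoverable); // false
```

This is why a system that needs the original glyph later has to store it separately from any NFKC-derived key.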

The 4-Form Comparison Matrix

The four normalization forms can be organized along two axes: canonical vs. compatibility decomposition, and whether recomposition is applied:

| | Canonical only | Canonical + Compatibility |
| --- | --- | --- |
| Composed | NFC | NFKC |
| Decomposed | NFD | NFKD |

Here is how a single example character transforms under each form:

| Input | NFC | NFD | NFKC | NFKD |
| --- | --- | --- | --- | --- |
| é (U+00E9) | é (U+00E9) | e◌́ (U+0065 U+0301) | é (U+00E9) | e◌́ (U+0065 U+0301) |
| ﬁ (U+FB01) | ﬁ (U+FB01) | ﬁ (U+FB01) | fi (U+0066 U+0069) | fi (U+0066 U+0069) |
| ㍻ (U+337B) | ㍻ (U+337B) | ㍻ (U+337B) | 平成 | 平成 |
| Å (U+00C5) | Å (U+00C5) | A◌̊ (U+0041 U+030A) | Å (U+00C5) | A◌̊ (U+0041 U+030A) |
| Å (U+212B) | Å (U+00C5) | A◌̊ (U+0041 U+030A) | Å (U+00C5) | A◌̊ (U+0041 U+030A) |

Notice the last two rows: U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE) and U+212B (ANGSTROM SIGN) are canonically equivalent. NFC maps both to U+00C5, NFD maps both to U+0041 U+030A, and the compatibility forms follow the same pattern. This is a case where two distinct code points represent the same abstract character.
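Rows like these can be reproduced for any input. The sketch below builds one row of the matrix; `normalizationRow` and `toHex` are hypothetical helper names introduced here, not library functions.

```typescript
type NormalizationForm = "NFC" | "NFD" | "NFKC" | "NFKD";
const FORMS: NormalizationForm[] = ["NFC", "NFD", "NFKC", "NFKD"];

// Render a string as space-separated "U+XXXX" labels.
function toHex(text: string): string {
  return Array.from(text, (ch) => {
    const hex = (ch.codePointAt(0) as number).toString(16).toUpperCase();
    return "U+" + "0000".slice(hex.length) + hex;
  }).join(" ");
}

// Build one row of the comparison matrix for a given input.
function normalizationRow(input: string): Record<NormalizationForm, string> {
  const row = {} as Record<NormalizationForm, string>;
  for (const form of FORMS) {
    row[form] = toHex(input.normalize(form));
  }
  return row;
}

const angstrom = normalizationRow("\u212B"); // ANGSTROM SIGN
console.log(angstrom.NFC);  // U+00C5
console.log(angstrom.NFD);  // U+0041 U+030A

const fiLigature = normalizationRow("\uFB01");
console.log(fiLigature.NFC);  // U+FB01
console.log(fiLigature.NFKC); // U+0066 U+0069
```

Note that U+212B never survives any of the four forms, while the ligature only changes under the compatibility forms.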

Practical Impact: Comparison, Databases, and Security

Normalization issues surface in surprisingly many real-world systems:

  • String comparison: Without normalization, === fails on visually identical strings. Always normalize before comparing user input.
  • Database uniqueness: A UNIQUE constraint on a username field will allow both café (NFC) and café (NFD) as separate entries. PostgreSQL and SQLite do not normalize by default.
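  • Search and indexing: Search engines such as Elasticsearch can apply NFKC-based normalization in their analyzers (in Elasticsearch, through the ICU analysis plugin) so that ﬁnd, typed with the ﬁ ligature, matches find.
  • Security: Homoglyph attacks can exploit normalization differences. An attacker might register a domain using compatibility characters that normalize to a known brand name. Identifier standards address this directly: the older SASLprep profile (RFC 4013) applied NFKC to usernames and passwords, while its successor PRECIS (RFC 8264, with username and password profiles in RFC 8265) applies NFC and restricts or maps compatibility characters separately.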
  • File systems: macOS HFS+ stores filenames in a modified NFD form, while most Linux systems store bytes as-is. Copying files between systems can create duplicates that look identical but differ at the byte level.

// Always normalize before comparing
function safeEquals(a: string, b: string): boolean {
  return a.normalize("NFC") === b.normalize("NFC");
}

// For search: use NFKC to catch compatibility variants
function normalizeForSearch(s: string): string {
  return s.normalize("NFKC").toLowerCase();
}

normalizeForSearch("\uFB01nd");  // "find" (input starts with the ﬁ ligature)
normalizeForSearch("㍻");   // "平成"
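
The database uniqueness problem from the list above can be sketched without a real database. `UsernameRegistry` below is a hypothetical in-memory stand-in for a UNIQUE column, assuming the application normalizes keys to NFC before every lookup and insert.

```typescript
// Hypothetical stand-in for a UNIQUE constraint: keys are stored in NFC.
class UsernameRegistry {
  private readonly taken = new Set<string>();

  // Returns true if the name was free and is now reserved,
  // false if a canonically equivalent name already exists.
  register(username: string): boolean {
    const key = username.normalize("NFC");
    if (this.taken.has(key)) {
      return false;
    }
    this.taken.add(key);
    return true;
  }
}

const registry = new UsernameRegistry();
console.log(registry.register("caf\u00E9"));  // true  (precomposed é)
console.log(registry.register("cafe\u0301")); // false (same name in decomposed form)
console.log(registry.register("cafe"));       // true  (a genuinely different name)
```

In a real system the same idea would apply at the point where the value is written, so that the database constraint only ever sees one form.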

When to Use Which Form

Choosing the right normalization form depends on your use case:

| Use case | Recommended form | Why |
| --- | --- | --- |
| Web content / HTML | NFC | W3C recommendation; compact representation |
| Database storage | NFC | Consistent canonical form; preserves compatibility characters |
| String comparison | NFC or NFD | Either works as long as both sides agree |
| Search / indexing | NFKC | Catches ligatures, fullwidth, and other compatibility variants |
| Username / password | NFC via PRECIS profiles (RFC 8265) | PRECIS username and password profiles apply NFC; the older SASLprep (RFC 4013) used NFKC |
| Display / rendering | Preserve original | Do not normalize display text; you lose formatting intent |

The golden rule: normalize early, normalize consistently. Pick one form for your system and apply it at the boundary where text enters. NFC is the safe default for most applications. Switch to NFKC only when you specifically need to collapse compatibility variants (search, security checks).

Be cautious with NFKC/NFKD in contexts where formatting matters. Converting ㍻ (a single-character representation of the Japanese era name “Heisei”) to 平成 is helpful for search, but the original compact form carries distinct semantic meaning in Japanese typography. Similarly, converting ﬁ to fi is correct for search but may affect typographic rendering in some fonts.
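
One way to follow both pieces of advice is to normalize once at the input boundary and keep the original alongside the derived forms. The sketch below is a minimal illustration; the `StoredText` shape and the `ingest` and `matchesQuery` functions are assumptions made for this example.

```typescript
// Hypothetical record shape: one input, three representations.
interface StoredText {
  original: string;  // exactly what the user entered, kept for display
  canonical: string; // NFC, for storage keys and equality checks
  searchKey: string; // NFKC + lowercase, used for matching only
}

// Normalize once, at the boundary where text enters the system.
function ingest(raw: string): StoredText {
  return {
    original: raw,
    canonical: raw.normalize("NFC"),
    searchKey: raw.normalize("NFKC").toLowerCase(),
  };
}

// Queries go through the same NFKC + lowercase pipeline as the stored key.
function matchesQuery(record: StoredText, query: string): boolean {
  return record.searchKey.includes(query.normalize("NFKC").toLowerCase());
}

const entry = ingest("Of\uFB01ce"); // "Office" written with the ﬁ ligature
console.log(entry.searchKey);               // "office"
console.log(entry.original === "Of\uFB01ce"); // true, ligature kept for display
console.log(matchesQuery(entry, "OFFICE")); // true
```

The search key is derived data and can be rebuilt at any time, while the original cannot be recovered once it is discarded.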
