Han Unification: How Unicode Merged 100,000 CJK Characters

How the IRG decided which characters from Japan, China, Taiwan, and Korea are 'the same,' with a tool to check any character's source.

The Problem: Same Origin, Different Shapes

Chinese characters (漢字/汉字) are used across four major writing systems: Chinese (simplified and traditional), Japanese, Korean, and Vietnamese (historically). Over centuries, the same character evolved different shapes in each region.

Consider the character meaning “bone”: in Japan it is written as 骨 with a slightly different stroke structure than the Chinese form. Should Unicode assign them the same code point or different ones?

The Unicode Consortium chose Han Unification: characters that share the same origin and meaning are assigned a single code point, even if their glyphs differ across regions. The font and language tag determine which visual form is rendered.

| Concept | Example | Code point |
| --- | --- | --- |
| Unified | 繋 (JP) vs 繋 (CN) | Same: U+7E4B |
| Not unified | 渡 vs 度 | Different: U+6E21 vs U+5EA6 |
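A minimal sketch of what "the font and language tag determine the form" means in practice on a web page. The helper name `taggedSpan` and the choice of language tags are illustrative assumptions, not part of any library; which glyph actually appears still depends on the fonts available to the reader.

```javascript
// Hypothetical helper: wrap one character in a span carrying a BCP 47 language tag.
function taggedSpan(char, lang) {
  return `<span lang="${lang}">${char}</span>`;
}

const bone = "\u9AA8"; // 骨: a single code point shared by every region

// Same code point each time; only the language tag differs.
const variants = ["ja", "zh-Hans", "zh-Hant", "ko"].map((lang) => taggedSpan(bone, lang));

console.log(variants.join("\n"));
console.log(bone.codePointAt(0).toString(16).toUpperCase()); // 9AA8
```

The text content of all four spans is identical, which is why untagged text gives the renderer nothing to choose a regional form by.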

The Source Separation Rule

The core principle governing Han Unification is the Source Separation Rule: if two characters were encoded as separate code points in any of the source standards (national character sets from China, Japan, Korea, Taiwan, etc.), they must remain separate in Unicode.

Conversely, if a character appears in multiple source standards and the IRG (Ideographic Research Group, formerly the Ideographic Rapporteur Group) judges the forms to be the same abstract character, it is unified into one code point.

Each unified character carries source references (also called IRG source flags) that record which national standards include it:

| Source prefix | Standard | Country/Region |
| --- | --- | --- |
| G | GB 2312 / GB 18030 / etc. | China (PRC) |
| J | JIS X 0208 / JIS X 0213 | Japan |
| K | KS X 1001 / KS X 1002 | Korea |
| T | CNS 11643 | Taiwan |
| V | TCVN | Vietnam |
| H | HKSCS | Hong Kong |

This tool shows the IRG source flags for every CJK character, letting you trace exactly which standards contributed each code point.

Reading IRG Source Flags

When you inspect a CJK character in this tool, you may see source data like G0-3A3A J0-3441 T1-4E5B K0-7956. Here is how to decode these:

| Flag | Meaning |
| --- | --- |
| G0-3A3A | GB 2312 row 26, col 26 (China source) |
| J0-3441 | JIS X 0208 row 20, col 33 (Japan source) |
| T1-4E5B | CNS 11643 plane 1, row 46, col 59 (Taiwan source) |
| K0-7956 | KS X 1001 row 89, col 54 (Korea source) |

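The digit after the letter identifies the specific standard (or, for CNS 11643, the plane) within that region: G0 = GB 2312, G1 = GB 12345, J0 = JIS X 0208, J1 = JIS X 0212, T1 = CNS 11643 plane 1, and so on. The four hex digits are the two-byte code in that standard; subtracting 0x20 from each byte gives the row and column.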
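The row and column values can be derived mechanically from a flag. The following is a sketch only: the function name `decodeSourceFlag` is hypothetical, and it assumes the simple "letter, digit, hyphen, four hex digits" shape shown above with a 94x94 layout where each byte is the row or column number plus 0x20. Source references in other shapes are returned as `null` rather than guessed at.

```javascript
// Decode a source reference such as "J0-3441" into its parts.
// Assumption: two-byte code in a 94x94 layout (byte = row or column + 0x20).
function decodeSourceFlag(flag) {
  const match = /^([A-Z])([0-9A-Z])-([0-9A-F]{2})([0-9A-F]{2})$/.exec(flag);
  if (!match) return null; // shapes other than X#-HHHH are not handled here
  const [, region, set, hi, lo] = match;
  return {
    region,                        // "G", "J", "K", "T", ...
    set,                           // which standard or plane within the region
    row: parseInt(hi, 16) - 0x20,  // first byte minus 0x20
    col: parseInt(lo, 16) - 0x20,  // second byte minus 0x20
  };
}

console.log(decodeSourceFlag("J0-3441")); // { region: 'J', set: '0', row: 20, col: 33 }
```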

A character with sources from all four major regions (G, J, K, T) is strongly unified: all four national standards agreed this was one character. A character with only one source (e.g., only J) may be Japan-specific.
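As a rough illustration of that check, the sketch below takes a space-separated source string and reports which region prefixes appear in it. The names `REGION_BY_PREFIX`, `sourceRegions`, and `hasAllFourMajorSources` are made up for this example, and the prefix list is limited to the six prefixes in the table above.

```javascript
// Region prefixes taken from the source-prefix table; other prefixes are ignored.
const REGION_BY_PREFIX = {
  G: "China (PRC)", J: "Japan", K: "Korea",
  T: "Taiwan", V: "Vietnam", H: "Hong Kong",
};

// Return the distinct known region prefixes found in a source string.
function sourceRegions(sources) {
  const prefixes = sources.trim().split(/\s+/).map((flag) => flag[0]);
  return [...new Set(prefixes)].filter((p) => p in REGION_BY_PREFIX);
}

// True when G, J, K and T are all present.
function hasAllFourMajorSources(sources) {
  const found = sourceRegions(sources);
  return ["G", "J", "K", "T"].every((p) => found.includes(p));
}

console.log(hasAllFourMajorSources("G0-3A3A J0-3441 T1-4E5B K0-7956")); // true
console.log(sourceRegions("J0-3441"));                                  // [ 'J' ]
```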

CJK Extensions A through I

The original CJK Unified Ideographs block (U+4E00–U+9FFF) holds 20,992 characters from the most common national standards. But this was not nearly enough. Unicode has added extensions over the decades:

| Block | Range | Count | Year | Note |
| --- | --- | --- | --- | --- |
| CJK Unified | U+4E00–9FFF | 20,992 | 1993 | Core set (GB, JIS, KS, CNS) |
| Extension A | U+3400–4DBF | 6,592 | 1999 | Rare characters from CNS, JIS X 0213 |
| Extension B | U+20000–2A6DF | 42,720 | 2001 | Historic, variant, rare |
| Extension C | U+2A700–2B73F | 4,154 | 2009 | Additional rare characters |
| Extension D | U+2B740–2B81F | 222 | 2010 | Urgent additions |
| Extension E | U+2B820–2CEAF | 5,762 | 2015 | Continued expansion |
| Extension F | U+2CEB0–2EBEF | 7,473 | 2017 | Further additions |
| Extension G | U+30000–3134F | 4,939 | 2020 | First block in the Tertiary Ideographic Plane |
| Extension H | U+31350–323AF | 4,192 | 2022 | Continued expansion |
| Extension I | U+2EBF0–2EE5F | 622 | 2023 | CJK ideographs for personal names |
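A small lookup sketch for finding which of these blocks a character falls in. The names `CJK_BLOCKS` and `cjkBlockOf` are hypothetical; the ranges are the block boundaries as listed in the Unicode block list (Extension I ends at U+2EE5F there), and the function only tests block membership, not whether a code point is actually assigned.

```javascript
// Block boundaries for the unified ideograph blocks (inclusive).
const CJK_BLOCKS = [
  { name: "CJK Unified Ideographs", start: 0x4e00, end: 0x9fff },
  { name: "Extension A", start: 0x3400, end: 0x4dbf },
  { name: "Extension B", start: 0x20000, end: 0x2a6df },
  { name: "Extension C", start: 0x2a700, end: 0x2b73f },
  { name: "Extension D", start: 0x2b740, end: 0x2b81f },
  { name: "Extension E", start: 0x2b820, end: 0x2ceaf },
  { name: "Extension F", start: 0x2ceb0, end: 0x2ebef },
  { name: "Extension G", start: 0x30000, end: 0x3134f },
  { name: "Extension H", start: 0x31350, end: 0x323af },
  { name: "Extension I", start: 0x2ebf0, end: 0x2ee5f },
];

// Return the block name for the first code point of `char`, or null.
function cjkBlockOf(char) {
  const cp = char.codePointAt(0);
  const block = CJK_BLOCKS.find((b) => cp >= b.start && cp <= b.end);
  return block ? block.name : null;
}

console.log(cjkBlockOf("\u9AA8"));    // CJK Unified Ideographs
console.log(cjkBlockOf("\u{20000}")); // Extension B
```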

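Extensions B through F and Extension I live in the Supplementary Ideographic Plane (SIP, plane 2), while Extensions G and H live in the Tertiary Ideographic Plane (TIP, plane 3). All of these lie outside the Basic Multilingual Plane, so they require surrogate pairs in UTF-16. This means "𠀀".length in JavaScript returns 2, not 1.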
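The difference between code units and code points can be seen directly. This is a short sketch using U+20000, the first code point of Extension B; `countCodePoints` is a hypothetical helper name.

```javascript
const ch = "\u{20000}"; // first code point of Extension B

console.log(ch.length);                      // 2 (UTF-16 code units)
console.log([...ch].length);                 // 1 (string iteration goes by code point)
console.log(ch.codePointAt(0).toString(16)); // 20000
console.log(ch.charCodeAt(0).toString(16));  // d840 (high surrogate)
console.log(ch.charCodeAt(1).toString(16));  // dc00 (low surrogate)

// Count code points rather than UTF-16 code units.
function countCodePoints(text) {
  return [...text].length;
}

console.log(countCodePoints("\u9AA8" + ch)); // 2
```

Code that slices or truncates strings by `length` can split a surrogate pair, so iterating by code point is the safer default for text that may contain extension characters.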

The Controversy

Han Unification remains one of Unicode's most debated decisions. Critics argue:

  • Loss of distinction: Japanese and Chinese readers may expect different stroke forms for the same code point. Relying on lang attributes and fonts is fragile.
  • Font dependency: Without correct language tagging, a CJK character may render in the “wrong” regional form, confusing readers.
  • Philosophical objection: Some scholars argue that regional variants have diverged enough to be distinct characters, not merely glyph variants.

Defenders counter:

  • Precedent: Latin ‘a’ renders differently across fonts (serif vs sans-serif) without getting separate code points. Regional CJK glyph variation is analogous.
  • Practicality: Without unification, CJK blocks would be 3–4x larger, making interoperability harder.
  • IVS escape valve: Ideographic Variation Sequences allow specifying exact glyph forms when needed.
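To make the IVS point concrete, here is a sketch of how such a sequence looks at the string level: a base ideograph followed by a variation selector from the range U+E0100 to U+E01EF. The base character 葛 (U+845B) and the selector U+E0100 are chosen only for illustration; the code does not check whether the sequence is registered in the Ideographic Variation Database, and `stripVariationSelectors` is a hypothetical helper.

```javascript
const VS17 = 0xe0100;  // first selector in the Variation Selectors Supplement
const base = "\u845B"; // 葛
const withIvs = base + String.fromCodePoint(VS17);

console.log([...withIvs].length); // 2 code points: base + selector
console.log(withIvs.length);      // 3 UTF-16 code units (the selector needs a surrogate pair)

// Remove ideographic variation selectors, e.g. before comparing or searching text.
function stripVariationSelectors(text) {
  return text.replace(/[\u{E0100}-\u{E01EF}]/gu, "");
}

console.log(stripVariationSelectors(withIvs) === base); // true
```

A renderer without a matching font entry falls back to the base glyph, so the selector refines the shape without changing which character is encoded.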

CJK Compatibility Ideographs: The Exceptions

Despite the unification philosophy, Unicode does include some duplicated CJK characters in the CJK Compatibility Ideographs block (U+F900–U+FAFF). These exist for round-trip compatibility with source standards that encoded the same character twice.

For example, U+F900 is a CJK Compatibility Ideograph that duplicates U+8C48 (豈). Because it has a canonical singleton decomposition, NFC normalization (like NFD) replaces the compatibility ideograph with the unified form:

// CJK Compatibility Ideograph → Unified form
"\uF900".normalize("NFC")
// → "豈" (U+8C48)

// Check if a character is in the compatibility block.
// The escape is used because a pasted literal may already have been normalized.
const cp = "\uF900".codePointAt(0); // 0xF900
const isCompat = cp >= 0xF900 && cp <= 0xFAFF;
There are 472 CJK Compatibility Ideographs in this block. Most exist because the Korean KS X 1001 standard encoded characters with more than one reading once per reading, and round-trip compatibility with that standard required preserving the duplicates.
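A block-range test is only an approximation: twelve code points in the compatibility block (U+FA0E and U+FA0F among them) are ordinary unified ideographs with no decomposition. As a sketch of an alternative, the check below asks the normalizer instead. The function name `isReplacedByNfc` is hypothetical, and the result depends on the Unicode data shipped with the JavaScript engine.

```javascript
// True when NFC normalization replaces the character with something else,
// which is the case for compatibility ideographs with a singleton decomposition.
function isReplacedByNfc(char) {
  return char.normalize("NFC") !== char;
}

console.log(isReplacedByNfc("\uF900")); // true: normalizes to U+8C48
console.log(isReplacedByNfc("\u8C48")); // false: already the unified form
console.log(isReplacedByNfc("\uFA0E")); // false: in the block, but a unified ideograph
```

Because normalization silently rewrites these characters, text that must round-trip to a legacy standard should be stored without normalizing it first.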
