Han Unification: How Unicode Merged 100,000 CJK Characters
How the IRG decided which characters from Japan, China, Taiwan, and Korea are 'the same,' with a tool to check any character's source.
The Problem: Same Origin, Different Shapes
Chinese characters (漢字/汉字) are used across four major writing systems: Chinese (simplified and traditional), Japanese, Korean, and Vietnamese (historically). Over centuries, the same character evolved different shapes in each region.
Consider the character meaning “bone”: in Japan it is written as 骨 with a slightly different stroke structure than the Chinese form. Should Unicode assign them the same code point or different ones?
The Unicode Consortium chose Han Unification: characters that share the same origin and meaning are assigned a single code point, even if their glyphs differ across regions. The font and language tag determine which visual form is rendered.
| Concept | Example | Code point |
|---|---|---|
| Unified | 繋 (JP) vs 繋 (CN) | Same: U+7E4B |
| Not unified | 渡 vs 度 | Different: U+6E21 vs U+5EA6 |
The Source Separation Rule
The core principle governing Han Unification is the Source Separation Rule: if two characters were encoded as separate code points in any of the source standards (national character sets from China, Japan, Korea, Taiwan, etc.), they must remain separate in Unicode.
Conversely, if a character appears in multiple source standards and the IRG (Ideographic Research Group, formerly the Ideographic Rapporteur Group) judges those entries to be the same abstract character, it is unified into one code point.
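The separation test can be sketched in a few lines. This is a toy model only: the standard names (`STD_A`, `STD_B`, `STD_C`), the shape labels, and the function `mustStaySeparate` are hypothetical placeholders, not real Unihan data or an IRG procedure.

```javascript
// Hypothetical data: each source standard maps to the set of shapes it
// encodes at distinct positions. These are placeholders, not real Unihan data.
const sources = {
  STD_A: new Set(["shape1", "shape2"]), // encodes both shapes separately
  STD_B: new Set(["shape1"]),
  STD_C: new Set(["shape2", "shape3"]),
};

// Two shapes must stay separate if any single source encodes both of them.
function mustStaySeparate(a, b, sourceMap) {
  if (a === b) return false;
  return Object.values(sourceMap).some((set) => set.has(a) && set.has(b));
}

console.log(mustStaySeparate("shape1", "shape2", sources)); // true (STD_A has both)
console.log(mustStaySeparate("shape1", "shape3", sources)); // false (no source has both)
```

A `false` result only means the rule does not forbid unification; whether two shapes are actually the same character is still a judgment call for the IRG.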
Each unified character carries source references (also called IRG source flags) that record which national standards include it:
| Source prefix | Standard | Country/Region |
|---|---|---|
| G | GB 2312 / GB 18030 / etc. | China (PRC) |
| J | JIS X 0208 / JIS X 0213 | Japan |
| K | KS X 1001 / KS X 1002 | Korea |
| T | CNS 11643 | Taiwan |
| V | TCVN | Vietnam |
| H | HKSCS | Hong Kong |
This tool shows the IRG source flags for every CJK character, letting you trace exactly which standards contributed each code point.
Reading IRG Source Flags
When you inspect a CJK character in this tool, you may see source data like G0-3A3A J0-3441 T1-4E5B K0-7956. Here is how to decode these:
| Flag | Meaning |
|---|---|
| G0-3A3A | GB 2312 row 26, col 26 (China source) |
| J0-3441 | JIS X 0208 row 20, col 33 (Japan source) |
| T1-4E5B | CNS 11643 plane 1, row 46, col 59 (Taiwan source) |
| K0-7956 | KS X 1001 row 89, col 54 (Korea source) |
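The character after the region letter identifies which standard within that region's set is being cited: G0 = GB 2312, G1 = GB 12345, J0 = JIS X 0208, J1 = JIS X 0212, and so on. The four hexadecimal digits give the position: subtracting 0x20 from each byte yields the row and column, so 0x34 0x41 becomes row 20, col 33.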
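As a rough illustration of that arithmetic, the sketch below decodes flags of the letter, digit, hyphen, four-hex-digit shape used in the examples above. The function name `decodeSourceFlag` and the returned field names are assumptions made for this example, and other source reference formats are deliberately rejected rather than guessed at.

```javascript
// Decode source references of the form <letter><digit>-<4 hex digits>,
// matching the four examples above. Any other format returns null.
function decodeSourceFlag(flag) {
  const m = /^([GJKT])([0-9A-F])-([0-9A-F]{2})([0-9A-F]{2})$/.exec(flag);
  if (m === null) return null;
  const [, region, set, hi, lo] = m;
  // Each byte is offset by 0x20 from its 1-based row or column number.
  const row = parseInt(hi, 16) - 0x20;
  const col = parseInt(lo, 16) - 0x20;
  if (row < 1 || row > 94 || col < 1 || col > 94) return null;
  return { region, set, row, col };
}

console.log(decodeSourceFlag("J0-3441")); // { region: 'J', set: '0', row: 20, col: 33 }
console.log(decodeSourceFlag("T1-4E5B")); // { region: 'T', set: '1', row: 46, col: 59 }
```

For T sources the `set` field is the CNS 11643 plane number; for the others it selects the standard, as described above.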
A character with sources from all four major regions (G, J, K, T) is strongly unified: all four national standards agreed this was one character. A character with only one source (e.g., only J) may be Japan-specific.
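A small sketch of that check, assuming the source data arrives as a space-separated string in the format shown earlier. The function name `sourceCoverage` and its return shape are hypothetical, and only flags of the letter-plus-digit form (such as `G0-` or `T1-`) are recognized.

```javascript
// Report which of the four major region prefixes (G, J, K, T) appear in a
// space-separated source string. Only "<letter><digit>-..." flags are counted.
function sourceCoverage(sourceString) {
  const major = ["G", "J", "K", "T"];
  const flags = sourceString.trim().split(/\s+/).filter(Boolean);
  const seen = new Set();
  for (const flag of flags) {
    const m = /^([GJKT])\d-/.exec(flag);
    if (m !== null) seen.add(m[1]);
  }
  const present = major.filter((p) => seen.has(p));
  const missing = major.filter((p) => !seen.has(p));
  return { present, missing, allFour: missing.length === 0 };
}

console.log(sourceCoverage("G0-3A3A J0-3441 T1-4E5B K0-7956").allFour); // true
console.log(sourceCoverage("J0-3441").missing); // [ 'G', 'K', 'T' ]
```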
CJK Extensions A through I
The original CJK Unified Ideographs block (U+4E00–U+9FFF) holds 20,992 characters from the most common national standards. But this was not nearly enough. Unicode has added extensions over the decades:
| Block | Range | Count | Year | Note |
|---|---|---|---|---|
| CJK Unified | U+4E00–9FFF | 20,992 | 1993 | Core set (GB, JIS, KS, CNS) |
| Extension A | U+3400–4DBF | 6,592 | 1999 | Rare characters from CNS, JIS X 0213 |
| Extension B | U+20000–2A6DF | 42,720 | 2001 | Historic, variant, rare |
| Extension C | U+2A700–2B73F | 4,154 | 2009 | Additional rare characters |
| Extension D | U+2B740–2B81F | 222 | 2010 | Urgent additions |
| Extension E | U+2B820–2CEAF | 5,762 | 2015 | Continued expansion |
| Extension F | U+2CEB0–2EBEF | 7,473 | 2017 | Further additions |
| Extension G | U+30000–3134F | 4,939 | 2020 | First block in the Tertiary Ideographic Plane |
| Extension H | U+31350–323AF | 4,192 | 2022 | Continued expansion |
| Extension I | U+2EBF0–2EE5F | 622 | 2023 | CJK ideographs for personal names |
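A lookup over these ranges might look like the sketch below. The names `CJK_BLOCKS` and `cjkBlockOf` are made up for this example; the ranges are the Unicode block boundaries (Extension I's block spans U+2EBF0–U+2EE5F), and a block can contain code points that are not yet assigned.

```javascript
// Block boundaries for the unified ideograph blocks: [name, first, last].
const CJK_BLOCKS = [
  ["CJK Unified Ideographs", 0x4E00, 0x9FFF],
  ["Extension A", 0x3400, 0x4DBF],
  ["Extension B", 0x20000, 0x2A6DF],
  ["Extension C", 0x2A700, 0x2B73F],
  ["Extension D", 0x2B740, 0x2B81F],
  ["Extension E", 0x2B820, 0x2CEAF],
  ["Extension F", 0x2CEB0, 0x2EBEF],
  ["Extension I", 0x2EBF0, 0x2EE5F],
  ["Extension G", 0x30000, 0x3134F],
  ["Extension H", 0x31350, 0x323AF],
];

// Return the block name for the first code point of `char`, or null.
function cjkBlockOf(char) {
  const cp = char.codePointAt(0);
  const hit = CJK_BLOCKS.find(([, first, last]) => cp >= first && cp <= last);
  return hit ? hit[0] : null;
}

console.log(cjkBlockOf("\u9AA8"));    // CJK Unified Ideographs
console.log(cjkBlockOf("\u{20000}")); // Extension B
console.log(cjkBlockOf("A"));         // null
```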
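Extensions B through F and Extension I live in the Supplementary Ideographic Plane (SIP, Plane 2), while Extensions G and H live in the Tertiary Ideographic Plane (TIP, Plane 3). All of them lie above U+FFFF, so each character requires a surrogate pair in UTF-16. This means "𠀀".length (U+20000) in JavaScript returns 2, not 1.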
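To make the surrogate-pair arithmetic concrete, here is a minimal sketch. The helper `toSurrogatePair` is a hypothetical name; it applies the standard UTF-16 encoding formula and is checked against what the JavaScript string actually stores.

```javascript
// Split a supplementary code point (above U+FFFF) into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

const ch = "\u{20000}"; // first code point of Extension B
console.log(ch.length);      // 2 (UTF-16 code units)
console.log([...ch].length); // 1 (code points)
console.log(toSurrogatePair(0x20000).map((u) => u.toString(16))); // [ 'd840', 'dc00' ]
console.log(ch.charCodeAt(0) === 0xD840 && ch.charCodeAt(1) === 0xDC00); // true
```

Iterating with `for...of` or spreading the string walks code points rather than code units, which is usually what CJK text processing needs.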
The Controversy
Han Unification remains one of Unicode's most debated decisions. Critics argue:
- Loss of distinction: Japanese and Chinese readers may expect different stroke forms for the same code point. Relying on lang attributes and fonts is fragile.
- Font dependency: Without correct language tagging, a CJK character may render in the “wrong” regional form, confusing readers.
- Philosophical objection: Some scholars argue that regional variants have diverged enough to be distinct characters, not merely glyph variants.
Defenders counter:
- Precedent: Latin “a” renders differently across fonts (serif vs sans-serif) without getting separate code points. Regional CJK glyph variation is analogous.
- Practicality: Without unification, CJK blocks would be 3–4x larger, making interoperability harder.
- IVS escape valve: Ideographic Variation Sequences allow specifying exact glyph forms when needed.
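For readers unfamiliar with the mechanism, an IVS is a base ideograph followed by a variation selector in the range U+E0100–U+E01EF (VS17 through VS256). The sketch below only inspects that structure. The function `parseIVS` is a hypothetical helper, the sample sequence is used purely for illustration, and whether a given sequence is registered in the Ideographic Variation Database (or supported by the active font) is not checked.

```javascript
// Recognize a two-code-point sequence whose second element is one of the
// variation selectors VS17..VS256 (U+E0100..U+E01EF).
function parseIVS(str) {
  const cps = Array.from(str, (c) => c.codePointAt(0));
  if (cps.length !== 2) return null;
  const [base, selector] = cps;
  if (selector < 0xE0100 || selector > 0xE01EF) return null;
  return {
    base: base.toString(16).toUpperCase(),
    selector: selector.toString(16).toUpperCase(),
    vsNumber: selector - 0xE0100 + 17,
  };
}

const sample = "\u845B\u{E0100}"; // U+845B followed by VS17 (illustrative)
console.log(parseIVS(sample)); // { base: '845B', selector: 'E0100', vsNumber: 17 }
console.log(sample.length);    // 3 (one BMP unit plus a surrogate pair)
console.log(parseIVS("\u845B")); // null
```

Software that ignores the selector still sees the unified base character, so an IVS degrades to the default glyph rather than to missing text.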
CJK Compatibility Ideographs: The Exceptions
Despite the unification philosophy, Unicode does include some duplicated CJK characters in the CJK Compatibility Ideographs block (U+F900–U+FAFF). These exist for round-trip compatibility with source standards that encoded the same character twice.
For example, U+F91D (欄) is a CJK Compatibility Ideograph that duplicates U+6B04 (欄). Because the mapping is a canonical decomposition, every normalization form, including NFC, replaces the compatibility ideograph with the unified form:
// CJK Compatibility Ideograph → unified form
"\uF91D".normalize("NFC");
// → "欄" (U+6B04)
// Check whether a character is in the compatibility block.
// The escape is used because a pasted literal may already have been normalized.
const cp = "\uF91D".codePointAt(0); // 0xF91D
const isCompat = cp >= 0xF900 && cp <= 0xFAFF; // true
There are 472 CJK Compatibility Ideographs in this block. Most exist because the Korean KS X 1001 standard encoded some characters more than once (one entry per Korean reading), and round-trip compatibility required preserving them.
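Because normalization silently replaces these characters, it can be useful to detect them before text is normalized. The sketch below is one possible approach; `findCompatibilityIdeographs` is a hypothetical helper. It combines the block range check with a normalization check, since a few code points in the block (U+FA0E, for instance) have no decomposition and are left unchanged.

```javascript
// List compatibility ideographs in `text` that NFC would rewrite,
// together with the unified code point each one maps to.
function findCompatibilityIdeographs(text) {
  const hits = [];
  for (const ch of text) { // iterates by code point
    const cp = ch.codePointAt(0);
    const inBlock = cp >= 0xF900 && cp <= 0xFAFF;
    const normalized = ch.normalize("NFC");
    if (inBlock && normalized !== ch) {
      hits.push({
        from: cp.toString(16).toUpperCase(),
        to: normalized.codePointAt(0).toString(16).toUpperCase(),
      });
    }
  }
  return hits;
}

console.log(findCompatibilityIdeographs("A\uF91D\uFA0E\u6B04"));
// [ { from: 'F91D', to: '6B04' } ]
```

The exact mappings come from the Unicode data bundled with the JavaScript engine, so results for recently added characters may vary between runtimes.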