
Han Unification: How Unicode Merged 100,000 CJK Characters

How the IRG decided which characters from Japan, China, Taiwan, and Korea are 'the same,' with a tool to check any character's source.

The Problem: Same Origin, Different Shapes

Chinese characters (漢字/汉字) are used across four major writing systems: Chinese (simplified and traditional), Japanese, Korean, and Vietnamese (historically). Over centuries, the same character evolved different shapes in each region.

Consider the character meaning “bone”: in Japan it is written as 骨 with a slightly different stroke structure than the Chinese form. Should Unicode assign them the same code point or different ones?

The Unicode Consortium chose Han Unification: characters that share the same origin and meaning are assigned a single code point, even if their glyphs differ across regions. The font and language tag determine which visual form is rendered.

Concept	Example	Code point
Unified	繋 (JP) vs 繋 (CN)	Same: U+7E4B
Not unified	渡 vs 度	Different: U+6E21 vs U+5EA6
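
Whether two characters share a code point can be checked directly in code. The following is a minimal sketch; `toUPlus` is a made-up helper name for illustration, not part of any library.

```javascript
// Format a character's first code point as U+XXXX.
// `toUPlus` is a hypothetical helper, not an existing API.
function toUPlus(ch) {
  return "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
}

// Two different characters have two different code points.
console.log(toUPlus("\u6E21")); // U+6E21
console.log(toUPlus("\u5EA6")); // U+5EA6

// A unified character is stored as one code point regardless of which
// regional font later draws it, so the stored strings compare equal.
const jp = "\u7E4B"; // as stored in a Japanese document
const cn = "\u7E4B"; // as stored in a Chinese document
console.log(jp === cn); // true
```

The regional difference exists only at rendering time, which is why string comparison, search, and sorting treat both as the same text.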


The Source Separation Rule

The core principle governing Han Unification is the Source Separation Rule: if two characters were encoded as separate code points in any of the source standards (national character sets from China, Japan, Korea, Taiwan, etc.), they must remain separate in Unicode.

Conversely, if a character appears in multiple source standards and the IRG (Ideographic Research Group, formerly the Ideographic Rapporteur Group) judges the forms to share the same abstract shape, they are unified into one code point.

Each unified character carries source references (also called IRG source flags) that record which national standards include it:

Source prefix	Standard	Country/Region
G	GB 2312 / GB 18030 / etc.	China (PRC)
J	JIS X 0208 / JIS X 0213	Japan
K	KS X 1001 / KS X 1002	Korea
T	CNS 11643	Taiwan
V	TCVN	Vietnam
H	HKSCS	Hong Kong

This tool shows the IRG source flags for every CJK character, letting you trace exactly which standards contributed each code point.


Reading IRG Source Flags

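When you inspect a CJK character in this tool, you may see source data like G0-3A3A J0-3441 T1-4E5B K0-7956. Each byte of the hexadecimal value is the row or column number plus 0x20, so 0x34 0x41 means row 20, column 33. Here is how to decode these: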

Flag	Meaning
G0-3A3A	GB 2312 row 26, col 26 (China source)
J0-3441	JIS X 0208 row 20, col 33 (Japan source)
T1-4E5B	CNS 11643 plane 1, row 46, col 59 (Taiwan source)
K0-7956	KS X 1001 row 89, col 54 (Korea source)

The digit after the letter identifies which of that region's standards the code comes from: G0 = GB 2312, G1 = GB 12345, J0 = JIS X 0208, J1 = JIS X 0212, and so on.
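
The decoding can be sketched in a few lines of JavaScript. This assumes the simple form used in the examples here: one letter, one index character, a hyphen, and a four-digit hex value from a 94×94 character set. Other source formats are out of scope, and `decodeSourceFlag` is an illustrative name rather than an existing API.

```javascript
// Decode a source reference such as "J0-3441" into its parts.
// Assumption: the value is a 4-hex-digit code in a 94x94 set, where each
// byte is the row or column number plus 0x20.
function decodeSourceFlag(flag) {
  const match = /^([A-Z])([0-9A-F])-([0-9A-F]{4})$/.exec(flag);
  if (!match) return null; // formats not covered by this sketch
  const [, prefix, index, hex] = match;
  const value = parseInt(hex, 16);
  return {
    prefix,                      // region letter, e.g. "J"
    index,                       // which standard within the region
    row: (value >> 8) - 0x20,    // high byte minus 0x20
    col: (value & 0xff) - 0x20,  // low byte minus 0x20
  };
}

const decoded = decodeSourceFlag("J0-3441");
console.log(decoded.row, decoded.col); // 20 33
```

For Taiwanese sources the index doubles as the CNS 11643 plane number, so the same function covers T1-4E5B without a special case.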

A character with sources from all four major regions (G, J, K, T) appears in the national standards of all four, and every one of those entries maps to this single code point. A character with only one source (e.g., only J) may be specific to that region.
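
As a rough illustration, the regions behind a character can be summarized from its list of source flags. The sketch below assumes each flag starts with the letter prefix followed by a digit, as in the examples above. The prefix map mirrors the source table in this article, and `sourceRegions` is a hypothetical helper name.

```javascript
// Prefix-to-region map, following the source table in this article.
const REGION_BY_PREFIX = {
  G: "China (PRC)",
  J: "Japan",
  K: "Korea",
  T: "Taiwan",
  V: "Vietnam",
  H: "Hong Kong",
};

// Summarize which regions contributed a character.
// Flags that do not look like "<letters><digit>-..." are ignored.
function sourceRegions(flags) {
  const prefixes = new Set();
  for (const flag of flags) {
    const match = /^([A-Z]+)\d/.exec(flag);
    if (match) prefixes.add(match[1]);
  }
  const regions = [...prefixes]
    .filter((p) => p in REGION_BY_PREFIX)
    .map((p) => REGION_BY_PREFIX[p]);
  const inAllFour = ["G", "J", "K", "T"].every((p) => prefixes.has(p));
  return { regions, inAllFour };
}

console.log(sourceRegions(["G0-3A3A", "J0-3441", "T1-4E5B", "K0-7956"]).inAllFour); // true
console.log(sourceRegions(["J0-3441"]).regions); // [ 'Japan' ]
```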

CJK Extensions A through I

The original CJK Unified Ideographs block (U+4E00–U+9FFF) has 20,992 code points, 20,902 of which were filled by the initial repertoire drawn from the most common national standards. This was not nearly enough, so Unicode has added extensions over the decades:

Block	Range	Count	Year	Note
CJK Unified	U+4E00–9FFF	20,992	1993	Core set (GB, JIS, KS, CNS)
Extension A	U+3400–4DBF	6,592	1999	Rare characters from CNS, JIS X 0213
Extension B	U+20000–2A6DF	42,720	2001	Historic, variant, rare
Extension C	U+2A700–2B73F	4,154	2009	Additional rare characters
Extension D	U+2B740–2B81F	222	2010	Urgent additions
Extension E	U+2B820–2CEAF	5,762	2015	Continued expansion
Extension F	U+2CEB0–2EBEF	7,473	2017	Further additions
Extension G	U+30000–3134F	4,939	2020	First block in the Tertiary Ideographic Plane (Plane 3)
Extension H	U+31350–323AF	4,192	2022	Continued expansion
Extension I	U+2EBF0–2EE5F	622	2023	CJK ideographs for personal names

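Extensions B through F and Extension I live in the Supplementary Ideographic Plane (SIP, Plane 2), while Extensions G and H live in the Tertiary Ideographic Plane (TIP, Plane 3). Every character outside the Basic Multilingual Plane requires a surrogate pair in UTF-16, so "𠀀".length in JavaScript returns 2, not 1.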
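
Since supplementary-plane ideographs occupy two UTF-16 code units, code that counts or classifies CJK characters is safer working on code points. The sketch below looks up the block for a character. The ranges follow the Unicode block definitions (Extension I ends at U+2EE5F), and `cjkBlockOf` is an illustrative name.

```javascript
// Block ranges per the Unicode block definitions.
const CJK_BLOCKS = [
  ["CJK Unified Ideographs", 0x4e00, 0x9fff],
  ["Extension A", 0x3400, 0x4dbf],
  ["Extension B", 0x20000, 0x2a6df],
  ["Extension C", 0x2a700, 0x2b73f],
  ["Extension D", 0x2b740, 0x2b81f],
  ["Extension E", 0x2b820, 0x2ceaf],
  ["Extension F", 0x2ceb0, 0x2ebef],
  ["Extension I", 0x2ebf0, 0x2ee5f],
  ["Extension G", 0x30000, 0x3134f],
  ["Extension H", 0x31350, 0x323af],
];

// Return the block name for the first code point of `ch`, or null.
function cjkBlockOf(ch) {
  const cp = ch.codePointAt(0); // reads a full surrogate pair if present
  const hit = CJK_BLOCKS.find(([, start, end]) => cp >= start && cp <= end);
  return hit ? hit[0] : null;
}

const extB = "\u{20000}"; // the first Extension B character
console.log(extB.length);      // 2 (UTF-16 code units)
console.log([...extB].length); // 1 (code points)
console.log(cjkBlockOf(extB)); // Extension B
```

Iterating with the spread operator or `for...of` walks the string by code point, which avoids splitting a surrogate pair in half.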

The Controversy

Han Unification remains one of Unicode's most debated decisions. Critics argue:

Loss of distinction: Japanese and Chinese readers may expect different stroke forms for the same code point. Relying on lang attributes and fonts is fragile.
Font dependency: Without correct language tagging, a CJK character may render in the “wrong” regional form, confusing readers.
Philosophical objection: Some scholars argue that regional variants have diverged enough to be distinct characters, not merely glyph variants.

Defenders counter:

Precedent: Latin ‘a’ renders differently across fonts (serif vs sans-serif) without getting separate code points. Regional CJK glyph variation is analogous.
Practicality: Without unification, CJK blocks would be 3–4x larger, making interoperability harder.
IVS escape valve: Ideographic Variation Sequences allow specifying exact glyph forms when needed.
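
To make the IVS mechanism concrete: an Ideographic Variation Sequence is a base ideograph followed by a variation selector from the range U+E0100–U+E01EF. The sketch below pairs each base character with a trailing selector, if any. `splitIvs` is a hypothetical helper, and the sample sequence is assumed to be registered for illustration only. Which sequences a font actually supports depends on the font.

```javascript
// Split text into base characters, attaching an ideographic variation
// selector (U+E0100..U+E01EF) to the character that precedes it.
function splitIvs(text) {
  const result = [];
  for (const ch of text) { // iterates by code point
    const cp = ch.codePointAt(0);
    const isSelector = cp >= 0xe0100 && cp <= 0xe01ef;
    if (isSelector && result.length > 0) {
      result[result.length - 1].selector = cp;
    } else {
      result.push({ base: ch, selector: null });
    }
  }
  return result;
}

// U+845B followed by the first ideographic variation selector.
// That this exact pair is registered is an assumption of the example.
const ivs = "\u845B\u{E0100}";
console.log(ivs.length);      // 3 (one BMP unit plus a surrogate pair)
console.log([...ivs].length); // 2 (base plus selector)
console.log(splitIvs(ivs)[0].selector.toString(16)); // e0100
```

A renderer that does not recognize the sequence is expected to ignore the selector and draw the base character's default glyph, so the text remains readable.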

CJK Compatibility Ideographs: The Exceptions

Despite the unification philosophy, Unicode does include some duplicated CJK characters in the CJK Compatibility Ideographs block (U+F900–U+FAFF). These exist for round-trip compatibility with source standards that encoded the same character twice.

For example, U+F91D is a CJK Compatibility Ideograph that duplicates U+6B04 (欄). Because compatibility ideographs carry canonical decompositions, NFC normalization maps the compatibility ideograph to the unified form:

// CJK Compatibility Ideograph → Unified form
"\uF91D".normalize("NFC")
// → "欄" (U+6B04)

// Check if a character is in the compatibility block:
const cp = "\uF91D".codePointAt(0); // 0xF91D
const isCompat = cp >= 0xF900 && cp <= 0xFAFF; // true

There are 472 CJK Compatibility Ideographs in this block. Most exist because the Korean KS X 1001 standard encoded hanja with more than one reading once per reading, and round-trip conversion required preserving each copy.
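
One practical use is scanning text for compatibility ideographs before normalizing it, since normalization replaces them silently. The sketch below reports each such character and its NFC result. `findCompatIdeographs` is an illustrative name. The range check alone would over-report, because a dozen code points in U+F900–U+FAFF are actually unified ideographs with no decomposition, so the function keeps only characters that NFC really changes.

```javascript
// Report characters in U+F900..U+FAFF that NFC replaces with another
// code point. Unified ideographs that happen to sit in this block
// normalize to themselves and are skipped.
function findCompatIdeographs(text) {
  const found = [];
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (cp < 0xf900 || cp > 0xfaff) continue;
    const nfc = ch.normalize("NFC");
    if (nfc !== ch) {
      found.push({ from: cp, to: nfc.codePointAt(0) });
    }
  }
  return found;
}

const hits = findCompatIdeographs("A\uF900B");
console.log(hits.map((h) => h.from.toString(16) + " -> " + h.to.toString(16)));
// [ 'f900 -> 8c48' ]
```

The sketch covers only the BMP block discussed here. Unicode also has a CJK Compatibility Ideographs Supplement in Plane 2, which would need its own range.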
