Learn Unicode

Interactive guides with live examples. Each article links to the Unicode Viewer tool so you can explore the concepts hands-on.

Fundamentals

🔤

Characters Are a Lie: Understanding Grapheme Clusters

Why string.length gives wrong answers, what grapheme clusters really are, and how Intl.Segmenter counts user-perceived characters correctly.
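
As a quick taste of the problem, the sketch below compares string.length (UTF-16 code units) with a grapheme count from Intl.Segmenter. The helper name countGraphemes and the sample strings are illustrative choices, not taken from the article, and the results assume a runtime whose ICU data supports emoji ZWJ sequences.

```javascript
// Count user-perceived characters (grapheme clusters) instead of code units.
// countGraphemes is a hypothetical helper name used only for this sketch.
function countGraphemes(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

// Family emoji: MAN + ZWJ + WOMAN + ZWJ + GIRL
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
// "e" followed by COMBINING ACUTE ACCENT
const accented = "e\u0301";

console.log(family.length, countGraphemes(family));     // 8 1
console.log(accented.length, countGraphemes(accented)); // 2 1
```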

📦

UTF-8 Byte by Byte: How Characters Become Bytes

A visual, byte-level walkthrough of UTF-8 encoding showing exactly how code points map to 1-4 bytes.
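
To make the 1-to-4-byte range concrete, here is a minimal sketch that uses the standard TextEncoder API to print the UTF-8 bytes of one character from each length class. The helper name utf8Hex is an assumption made for this example.

```javascript
// Return the UTF-8 bytes of a string as space-separated uppercase hex.
// utf8Hex is a hypothetical helper name used only for this sketch.
function utf8Hex(text) {
  return [...new TextEncoder().encode(text)]
    .map((byte) => byte.toString(16).toUpperCase().padStart(2, "0"))
    .join(" ");
}

console.log(utf8Hex("A"));         // 41           (U+0041, 1 byte)
console.log(utf8Hex("\u00E9"));    // C3 A9        (U+00E9, 2 bytes)
console.log(utf8Hex("\u20AC"));    // E2 82 AC     (U+20AC, 3 bytes)
console.log(utf8Hex("\u{1F600}")); // F0 9F 98 80  (U+1F600, 4 bytes)
```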

🔄

Unicode Normalization: NFC, NFD, NFKC, NFKD Demystified

Why the same-looking text can have different bytes, when each normalization form matters, and how to see the differences visually.
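
A small illustration of the idea, using the built-in String.prototype.normalize: two strings that render identically compare as different until they are brought to the same normalization form. The sample characters are chosen for this sketch only.

```javascript
const composed = "\u00E9";    // é as a single precomposed code point
const decomposed = "e\u0301"; // "e" + COMBINING ACUTE ACCENT

// Same appearance, different code points, so plain comparison fails.
console.log(composed === decomposed);                  // false

// Canonical forms make them comparable.
console.log(composed.normalize("NFD") === decomposed); // true
console.log(decomposed.normalize("NFC") === composed); // true

// Compatibility forms additionally fold presentation variants,
// such as the "fi" ligature U+FB01.
console.log("\uFB01".normalize("NFKC") === "fi");      // true
console.log("\uFB01".normalize("NFC") === "\uFB01");   // true (NFC leaves it alone)
```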

🧩

Surrogate Pairs: Why JavaScript Strings Break on Emoji

How UTF-16 surrogate pairs work, why they affect JavaScript/Java/C#, and how to handle them correctly.
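
The arithmetic behind a surrogate pair can be sketched in a few lines. The function name toSurrogates is hypothetical, and the code assumes its input is a code point above U+FFFF.

```javascript
// Split a supplementary code point (U+10000..U+10FFFF) into
// its UTF-16 high and low surrogate code units.
// toSurrogates is a hypothetical helper name used only for this sketch.
function toSurrogates(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xd800 + (offset >> 10);   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return [high, low];
}

const grin = "\u{1F600}"; // GRINNING FACE

console.log(grin.length);                       // 2 (code units)
console.log([...grin].length);                  // 1 (code points)
console.log(grin.charCodeAt(0).toString(16));   // d83d (high surrogate only)
console.log(grin.codePointAt(0).toString(16));  // 1f600
console.log(toSurrogates(0x1f600).map((u) => u.toString(16))); // [ 'd83d', 'de00' ]
```

Iterating with for...of or the spread operator walks code points rather than code units, which is usually the safer default when a string may contain emoji.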

Encoding & Legacy

🆚

Shift_JIS vs CP932: The Encoding Everyone Confuses

The precise technical differences between Shift_JIS and CP932 (Windows-31J), with byte-level evidence.

〜

The Wave Dash Problem: 〜 vs ～ and 7 Other Mapping Conflicts

Complete reference on the 7 JIS-Unicode mapping discrepancies with an interactive toggle to see both variants.

📜

Legacy Encoding Survival Guide: From ASCII to GB18030

A practical overview of 20+ character encodings across languages, how they relate, and how to identify them.

CJK

🇺🇳

Han Unification: How Unicode Merged 100,000 CJK Characters

How the IRG decided which characters from Japan, China, Taiwan, and Korea are 'the same,' with a tool to check any character's source.

✍️

IVS: How Unicode Represents 47 Versions of the Same Kanji

Understanding Ideographic Variation Sequences and Standardized Variation Sequences, with live font rendering of all registered variants.

🎨

Why One Font Isn't Enough: CJK Variant Coverage Across Fonts

How different CJK fonts implement different IVD collections, why a single font can't show every registered variant, and how this site combines three fonts to render every IVS faithfully.

📊

JIS Levels and Kuten Codes: Japan's Character Classification System

How Japan classifies kanji into 4 levels across JIS X 0208 and JIS X 0213, with kuten positional codes.
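
For readers new to kuten codes, the following sketch shows the conventional arithmetic that turns a JIS X 0208 row-cell (kuten) position into its JIS and Shift_JIS byte values. The function names are hypothetical, and the code assumes a single 94 x 94 plane; it does not cover JIS X 0213 plane 2 or vendor extensions.

```javascript
// Validate a kuten position on a 94 x 94 plane.
function checkKuten(row, cell) {
  if (row < 1 || row > 94 || cell < 1 || cell > 94) {
    throw new RangeError("row and cell must be in 1..94");
  }
}

// Kuten -> 2-byte JIS code: add 0x20 to both row and cell.
function kutenToJis(row, cell) {
  checkKuten(row, cell);
  return ((row + 0x20) << 8) | (cell + 0x20);
}

// Kuten -> 2-byte Shift_JIS code: two rows share one lead byte.
function kutenToShiftJis(row, cell) {
  checkKuten(row, cell);
  let lead = ((row - 1) >> 1) + 0x81;
  if (lead > 0x9f) lead += 0x40;     // skip 0xA0..0xDF (half-width katakana)
  let trail;
  if (row % 2 === 1) {
    trail = cell + 0x3f;
    if (trail >= 0x7f) trail += 1;   // 0x7F is never used as a trail byte
  } else {
    trail = cell + 0x9e;
  }
  return (lead << 8) | trail;
}

// 亜 sits at kuten 16-01, the first Level 1 kanji.
console.log(kutenToJis(16, 1).toString(16));      // 3021
console.log(kutenToShiftJis(16, 1).toString(16)); // 889f
console.log(kutenToShiftJis(1, 1).toString(16));  // 8140 (ideographic space)
```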

Security & Edge Cases

🔍

Unicode Homoglyph Attacks: When Characters Lie About Who They Are

How visually identical characters from different scripts enable phishing and spoofing, and how to detect them.
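
One simple detection heuristic is to flag strings that mix scripts. The sketch below uses Unicode property escapes in regular expressions for this; the function name scriptsUsed, the three scripts checked, and the sample strings are assumptions for illustration, and a production check would cover more scripts and follow published confusable data.

```javascript
// Report which of a few commonly confused scripts appear in a string.
// scriptsUsed is a hypothetical helper name used only for this sketch.
function scriptsUsed(text) {
  const scripts = new Set();
  for (const ch of text) {
    if (/\p{Script=Latin}/u.test(ch)) scripts.add("Latin");
    else if (/\p{Script=Cyrillic}/u.test(ch)) scripts.add("Cyrillic");
    else if (/\p{Script=Greek}/u.test(ch)) scripts.add("Greek");
  }
  return [...scripts];
}

const genuine = "paypal";
const spoofed = "p\u0430ypal"; // second letter is CYRILLIC SMALL LETTER A (U+0430)

console.log(genuine === spoofed);  // false, although they look alike
console.log(scriptsUsed(genuine)); // [ 'Latin' ]
console.log(scriptsUsed(spoofed)); // [ 'Latin', 'Cyrillic' ]
```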

👻

Invisible Characters: Zero-Width Spaces, Bidi Overrides, and Hidden Text

A catalog of invisible Unicode characters that can break text or hide inside it, with a tool to reveal them.
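
As a rough illustration of revealing hidden characters, this sketch replaces every character in the Unicode Format category (Cf) with a visible U+XXXX tag. The function name revealInvisibles is hypothetical. Cf also includes legitimate characters such as the zero-width joiner used in emoji, so treat a match as something to inspect, not proof of tampering.

```javascript
// Replace format characters (general category Cf) with a visible marker.
// revealInvisibles is a hypothetical helper name used only for this sketch.
function revealInvisibles(text) {
  return text.replace(/\p{Cf}/gu, (ch) => {
    const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
    return `<U+${hex}>`;
  });
}

// ZERO WIDTH SPACE hidden inside a word
console.log(revealInvisibles("ad\u200Bmin"));    // ad<U+200B>min
// RIGHT-TO-LEFT OVERRIDE at the start of a file name
console.log(revealInvisibles("\u202Egpj.exe"));  // <U+202E>gpj.exe
```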

🧬

Emoji Under the Hood: ZWJ Sequences, Skin Tones, and Flag Math

How complex emoji are built from multiple code points using ZWJ, variation selectors, and regional indicators.
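
The "flag math" can be sketched directly: each letter of a two-letter region code maps to a regional indicator symbol starting at U+1F1E6. The helper names flagEmoji and codePoints are assumptions for this example, and whether the pair renders as a flag depends on the font and platform.

```javascript
// Build a flag sequence from a two-letter region code such as "JP".
// flagEmoji and codePoints are hypothetical helper names for this sketch.
function flagEmoji(regionCode) {
  return [...regionCode.toUpperCase()]
    .map((c) => String.fromCodePoint(0x1f1e6 + c.charCodeAt(0) - 0x41))
    .join("");
}

// List the code points of a string in U+XXXX notation.
function codePoints(text) {
  return [...text].map(
    (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase()
  );
}

console.log(codePoints(flagEmoji("JP"))); // [ 'U+1F1EF', 'U+1F1F5' ]

// A skin-tone emoji is a base emoji followed by a modifier code point.
const thumbsMedium = "\u{1F44D}\u{1F3FD}";
console.log(codePoints(thumbsMedium));    // [ 'U+1F44D', 'U+1F3FD' ]
```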

⚖️

WHATWG vs Unicode.org: Why Browsers and Standards Disagree on Encoding

A cross-encoding survey of mapping discrepancies between web standards and official Unicode/national standards.