
Legacy Encoding Survival Guide: From ASCII to GB18030

A practical overview of 20+ character encodings across languages, how they relate, and how to identify them.

The Encoding Family Tree

Before Unicode, every language needed its own encoding. The result was a tangled tree of standards, each solving the same problem differently:

Era	Encoding	Characters
1963	ASCII	128 (English only)
1978+	CJK double-byte	6,000-20,000 (East Asian)
1987	ISO 8859-1 (Latin-1)	256 (Western European)
1990s	Windows code pages	256 per page (regional)
1991+	Unicode (UTF-8/16)	150,000+ (all languages)

This tool shows encoding bytes for 20+ legacy encodings simultaneously, letting you compare how the same character is represented across different systems.

See encoding bytes across languages

Single-Byte Encodings

Single-byte encodings map each byte (0x00-0xFF) to one character. They share ASCII in the lower half (0x00-0x7F) but differ in the upper half (0x80-0xFF):

Encoding	Region	Upper half contains
ISO 8859-1 / Latin-1	Western Europe	àáâãäå, ñ, ü, ß, etc.
Windows-1252	Western Europe	Same, plus €, “ ”, —, etc.
ISO 8859-2	Central Europe	ą, ć, č, ě, ł, ő, ž, etc.
Windows-1251	Cyrillic	А-Я, а-я, etc.
KOI8-U	Ukrainian	Cyrillic (different order)
ISO 8859-7	Greek	Α-Ω, α-ω, etc.
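
To make the shared lower half and diverging upper half concrete, here is a minimal sketch that decodes a single byte under each encoding in the table. It assumes Python's standard codec names for these encodings; the helper `decode_everywhere` is a hypothetical name used for illustration only.

```python
# Python codec names for the encodings in the table above.
SINGLE_BYTE_CODECS = ["latin_1", "cp1252", "iso8859_2", "cp1251", "koi8_u", "iso8859_7"]


def decode_everywhere(byte_value):
    """Decode one byte under each codec; None where the byte is unassigned."""
    raw = bytes([byte_value])
    result = {}
    for codec in SINGLE_BYTE_CODECS:
        try:
            result[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            result[codec] = None
    return result


if __name__ == "__main__":
    # 0x41 is in the ASCII half, 0xE9 is in the upper half.
    for value in (0x41, 0xE9):
        for codec, ch in decode_everywhere(value).items():
            print(f"0x{value:02X} as {codec:10} -> {ch!r}")
```

Byte 0x41 decodes to the same letter everywhere, while 0xE9 is a Latin letter in the Western and Central European encodings, a Cyrillic letter in Windows-1251 and KOI8-U, and a Greek letter in ISO 8859-7.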

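The WHATWG Encoding Standard treats the label iso-8859-1 as an alias for windows-1252, so browsers decode both the same way. The two differ only in the range 0x80-0x9F: ISO 8859-1 assigns C1 control codes there, while Windows-1252 assigns most of those bytes to typographic characters such as €, “ ”, and —.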
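
The difference in the 0x80-0x9F range can be checked with a short sketch. Note one assumption: Python's `latin_1` codec is the strict ISO 8859-1 mapping, not the WHATWG alias, which is what makes the comparison possible. The helper `decode_both` is a hypothetical name.

```python
import unicodedata


def decode_both(byte_value):
    """Return the byte decoded as strict ISO 8859-1 and as Windows-1252."""
    raw = bytes([byte_value])
    return raw.decode("latin_1"), raw.decode("cp1252")


if __name__ == "__main__":
    for value in (0x80, 0x93, 0x94, 0x97):
        latin, win = decode_both(value)
        # Category "Cc" marks a control character.
        print(
            f"0x{value:02X}: latin_1 -> U+{ord(latin):04X} "
            f"({unicodedata.category(latin)}), cp1252 -> {win!r}"
        )
```

A few bytes in this range (0x81, for example) are unassigned in Python's `cp1252` codec and raise `UnicodeDecodeError`, so the sketch only samples bytes that are defined in both.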

Compare Western encodings

East Asian Double-Byte Encodings

CJK languages need thousands of characters, far more than a single byte can address, so their encodings use two or more bytes per character:

Encoding	Language	Standard
Shift_JIS / CP932	Japanese	JIS X 0208 / Windows-31J
EUC-JP	Japanese	JIS X 0208 (Unix)
ISO-2022-JP	Japanese	JIS X 0208 (Email)
Big5	Chinese (Traditional)	Taiwan standard
GBK / GB18030	Chinese (Simplified)	China standard
EUC-KR / CP949	Korean	KS X 1001 / UHC

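The same CJK character usually gets a different byte sequence in each encoding family, although closely related encodings (such as GBK and GB18030) share bytes for the characters they have in common. This tool shows them all side by side.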
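
A comparison like this can be approximated offline with Python's built-in codecs. The sketch below is not the tool's implementation; the codec list and the helper `encode_everywhere` are assumptions chosen to mirror the table above.

```python
# Python codec names roughly matching the encodings in the table above.
CJK_CODECS = ["shift_jis", "euc_jp", "iso2022_jp", "big5", "gbk", "gb18030", "euc_kr"]


def encode_everywhere(ch):
    """Return hex bytes for ch under each codec; None if it is not encodable."""
    result = {}
    for codec in CJK_CODECS:
        try:
            result[codec] = ch.encode(codec).hex()
        except UnicodeEncodeError:
            result[codec] = None
    return result


if __name__ == "__main__":
    for ch in ("\u65e5", "\ud55c"):  # a CJK ideograph and a Hangul syllable
        print(f"U+{ord(ch):04X}")
        for codec, hex_bytes in encode_everywhere(ch).items():
            print(f"  {codec:10} {hex_bytes}")
```

A `None` entry means the character is missing from that encoding's repertoire, which is itself a useful hint when identifying where a piece of text came from.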

Compare CJK encodings for a single character

Auto-Detection: How the Tool Picks Encodings

This tool automatically detects which encoding groups are relevant for each character using two methods:

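For CJK characters: The Unihan IRG Source database (88,000+ characters) identifies which national standards include each character. This is more accurate than a simple encodability check, which over-reports: GB18030, for example, can encode every Unicode character, so it would match any input.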
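
As a rough illustration of the IRG-source lookup, the sketch below parses a few lines in the Unihan tab-separated format (code point, field, value) and maps the `kIRG_*Source` fields to language groups. The field-to-group mapping, the helper name `parse_irg_sources`, and the inline sample are assumptions for illustration; the tool's actual logic is not described in detail here, and the sample values should be checked against the real Unihan data before being relied on.

```python
# Hypothetical mapping from Unihan IRG source fields to language groups.
SOURCE_FIELDS = {
    "kIRG_GSource": "Chinese (Simplified)",
    "kIRG_TSource": "Chinese (Traditional)",
    "kIRG_JSource": "Japanese",
    "kIRG_KSource": "Korean",
}

# A small sample in the Unihan file format, not the full database.
SAMPLE = "\n".join([
    "# code point, field, value",
    "U+4E00\tkIRG_GSource\tG0-523B",
    "U+4E00\tkIRG_JSource\tJ0-306C",
    "U+4E00\tkIRG_TSource\tT1-4421",
    "U+4E00\tkTotalStrokes\t1",
])


def parse_irg_sources(text):
    """Map each character to the set of language groups whose sources list it."""
    groups_by_char = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        codepoint, field, _value = line.split("\t", 2)
        group = SOURCE_FIELDS.get(field)
        if group is None:
            continue  # not one of the IRG source fields we track
        ch = chr(int(codepoint[2:], 16))  # "U+4E00" -> chr(0x4E00)
        groups_by_char.setdefault(ch, set()).add(group)
    return groups_by_char


if __name__ == "__main__":
    for ch, groups in parse_irg_sources(SAMPLE).items():
        print(f"U+{ord(ch):04X}: {sorted(groups)}")
```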

For other scripts: Encodability checks against the broadest encoding in each group (e.g., Windows-1252 for Western, Windows-1251 for Cyrillic).
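
The encodability check can be sketched as a trial encode against one probe encoding per group. The Western and Cyrillic probes come from the text above; using `cp1253` for Greek, skipping ASCII input, and the names `GROUP_PROBES` and `relevant_groups` are assumptions for illustration.

```python
# One probe codec per group. Only Western and Cyrillic are named in the text;
# the Greek probe is an assumption.
GROUP_PROBES = {
    "Western": "cp1252",
    "Cyrillic": "cp1251",
    "Greek": "cp1253",
}


def relevant_groups(ch):
    """Return the groups whose probe encoding can represent ch."""
    if ch.isascii():
        # Assumption: ASCII encodes everywhere, so it says nothing about a group.
        return []
    groups = []
    for group, codec in GROUP_PROBES.items():
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            continue  # ch is not in this group's repertoire
        groups.append(group)
    return groups


if __name__ == "__main__":
    for ch in ("\u00e9", "\u0416", "\u03bb", "A"):
        print(f"U+{ord(ch):04X} -> {relevant_groups(ch)}")
```

Symbols that several code pages share, such as €, can match more than one group, so the result is a list rather than a single label.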

See auto-detected Greek encoding
