WHATWG vs Unicode.org: Why Browsers and Standards Disagree on Encoding

A cross-encoding survey of mapping discrepancies between web standards and official Unicode/national standards.

Why Discrepancies Exist

When converting legacy encodings to Unicode, multiple organizations created their own mapping tables independently. The official Unicode.org tables followed national standards strictly, while Microsoft and later WHATWG followed what browsers actually implemented.

WHATWG’s fundamental principle: “Don’t break existing web content.” This means ratifying browser behavior even when it contradicts official standards.

Japanese: 7 JIS Discrepancies

This is the best-known mapping conflict: the same Shift_JIS byte pair maps to different Unicode code points depending on which table is used.

| Shift_JIS bytes | Unicode.org | WHATWG (Microsoft) |
|---|---|---|
| 81 5F | \ (U+005C) | ＼ (U+FF3C) |
| 81 60 | 〜 (U+301C) | ～ (U+FF5E) |
| 81 61 | ‖ (U+2016) | ∥ (U+2225) |
| 81 7C | − (U+2212) | － (U+FF0D) |
| 81 91 | ¢ (U+00A2) | ￠ (U+FFE0) |
| 81 92 | £ (U+00A3) | ￡ (U+FFE1) |
| 81 CA | ¬ (U+00AC) | ￢ (U+FFE2) |

This tool lets you toggle between both mappings in Settings to see the difference.
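The seven pairs can also be written down as a small lookup, which makes the two columns easy to compare programmatically. The following is a minimal Python sketch, not part of the tool itself: the names `JIS_DISCREPANCIES`, `decode_pair`, and `matches_cp932` are hypothetical, and it assumes Python's built-in `cp932` codec follows the Microsoft table, which should match the WHATWG column for these seven positions.

```python
# Seven Shift_JIS byte pairs whose Unicode mapping differs between the
# Unicode.org table and the WHATWG (Microsoft) table.
# Values: (Unicode.org code point, WHATWG code point).
JIS_DISCREPANCIES = {
    "815F": (0x005C, 0xFF3C),  # reverse solidus vs fullwidth reverse solidus
    "8160": (0x301C, 0xFF5E),  # wave dash vs fullwidth tilde
    "8161": (0x2016, 0x2225),  # double vertical line vs parallel to
    "817C": (0x2212, 0xFF0D),  # minus sign vs fullwidth hyphen-minus
    "8191": (0x00A2, 0xFFE0),  # cent sign vs fullwidth cent sign
    "8192": (0x00A3, 0xFFE1),  # pound sign vs fullwidth pound sign
    "81CA": (0x00AC, 0xFFE2),  # not sign vs fullwidth not sign
}


def decode_pair(hex_bytes: str, mapping: str = "whatwg") -> str:
    """Return the character for one of the seven byte pairs under the chosen mapping."""
    if mapping not in ("whatwg", "unicode_org"):
        raise ValueError("mapping must be 'whatwg' or 'unicode_org'")
    unicode_org, whatwg = JIS_DISCREPANCIES[hex_bytes.upper()]
    return chr(whatwg if mapping == "whatwg" else unicode_org)


def matches_cp932(hex_bytes: str) -> bool:
    """Check that Python's cp932 codec agrees with the WHATWG column (an assumption)."""
    decoded = bytes.fromhex(hex_bytes).decode("cp932")
    return decoded == decode_pair(hex_bytes, "whatwg")


if __name__ == "__main__":
    for pair in JIS_DISCREPANCIES:
        org = decode_pair(pair, "unicode_org")
        web = decode_pair(pair, "whatwg")
        print(f"{pair}: U+{ord(org):04X} vs U+{ord(web):04X}")
```

Because 0x5C on its own also decodes to a backslash-like character, the lookup is keyed on bytes rather than on decoded text; translating already-decoded text between the two mappings would be ambiguous for U+005C.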

Chinese: Big5 and GB18030

Big5 (Traditional Chinese): WHATWG merges CP950 and HKSCS into a hybrid table. Six characters appear at more than one byte position in that table, and the position chosen when encoding them differs between WHATWG and other implementations.

GB18030 (Simplified Chinese): The byte pair 0xA3 0xA0 maps to U+3000 (ideographic space) in WHATWG, but the official GB18030 standard maps it to U+E5E5 (a private-use character). This was a deliberate web-compatibility fix from 2002.
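If text has been decoded with a table that follows the official standard, one way to line it up with browser output is a single post-decode substitution. The sketch below is Python and purely illustrative: `normalize_gb18030_to_whatwg` is a hypothetical helper, and it assumes every U+E5E5 in the input came from the 0xA3 0xA0 byte pair rather than from some other private-use convention.

```python
OFFICIAL_PUA = "\ue5e5"   # what the official GB18030 table yields for 0xA3 0xA0
WHATWG_SPACE = "\u3000"   # what WHATWG yields for the same bytes


def normalize_gb18030_to_whatwg(text: str) -> str:
    """Replace U+E5E5 with U+3000 so the result matches browser decoding.

    Assumption: the text was decoded with an official-standard table, so
    U+E5E5 can only have come from the byte pair 0xA3 0xA0.
    """
    return text.replace(OFFICIAL_PUA, WHATWG_SPACE)
```

For example, `normalize_gb18030_to_whatwg("中\ue5e5文")` returns the same string with an ideographic space in the middle.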

Korean: EUC-KR Scope Expansion

WHATWG’s “EUC-KR” is actually Windows CP949/UHC, covering all 11,172 Hangul syllables — far more than the original KS X 1001 standard’s ~2,350.

Notably, the Unicode.org KSX1001.TXT was created from Microsoft’s UHC mapping (stated in the file header), so there is no WHATWG vs Unicode.org discrepancy for Korean — unlike JIS.
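The 11,172 figure is the full block of precomposed Hangul syllables in Unicode (U+AC00 through U+D7A3), which is 19 initial consonants × 21 vowels × 28 finals. The Python sketch below checks that arithmetic and counts how many of those syllables a given codec encodes as exactly two bytes. The helper names are hypothetical. Note that Python's codec names are not WHATWG labels: the sketch assumes Python's `cp949` is the closest built-in equivalent of what WHATWG calls EUC-KR.

```python
HANGUL_FIRST = 0xAC00  # first precomposed Hangul syllable
HANGUL_LAST = 0xD7A3   # last precomposed Hangul syllable


def hangul_syllable_count() -> int:
    """Size of the Unicode Hangul Syllables range."""
    return HANGUL_LAST - HANGUL_FIRST + 1


def count_two_byte(codec: str) -> int:
    """Count Hangul syllables that the named Python codec encodes as two bytes."""
    count = 0
    for code_point in range(HANGUL_FIRST, HANGUL_LAST + 1):
        try:
            if len(chr(code_point).encode(codec)) == 2:
                count += 1
        except UnicodeEncodeError:
            # Syllable is not representable in this codec at all.
            pass
    return count


if __name__ == "__main__":
    print(hangul_syllable_count())   # size of the Unicode range
    print(count_two_byte("cp949"))   # UHC coverage
```

Calling `count_two_byte("euc_kr")` runs the same count against Python's stricter KS X 1001 codec; the result should land near the ~2,350 figure above, though that is not verified here.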

Western: ISO 8859-1 → Windows-1252

WHATWG’s most sweeping decision: the labels iso-8859-1, latin1, and ascii (among others) all resolve to Windows-1252. As a result, 27 of the 32 bytes in the 0x80-0x9F range decode as typographic characters instead of C1 controls.

| Byte | ISO 8859-1 | Windows-1252 (WHATWG) |
|---|---|---|
| 0x80 | C1 control | € (U+20AC) |
| 0x93 | C1 control | “ (U+201C) |
| 0x94 | C1 control | ” (U+201D) |
| 0x97 | C1 control | — (U+2014) |
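Both halves of this behavior, label resolution and the 0x80-0x9F remapping, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Encoding Standard's algorithm. `WINDOWS_1252_LABELS` lists only a subset of the labels, and `resolve_label`, `decode_whatwg_1252`, and `typographic_bytes` are hypothetical helpers. Python's `cp1252` codec rejects the five unassigned bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D), so the sketch falls back to the same-valued C1 control for those, as WHATWG does.

```python
from typing import List, Optional

# A subset of the labels that resolve to windows-1252 (not the full list).
WINDOWS_1252_LABELS = {
    "ascii", "us-ascii", "iso-8859-1", "iso8859-1",
    "latin1", "l1", "cp1252", "windows-1252", "x-cp1252",
}


def resolve_label(label: str) -> Optional[str]:
    """Return 'windows-1252' for a known label, else None (simplified matching)."""
    normalized = label.strip().lower()
    return "windows-1252" if normalized in WINDOWS_1252_LABELS else None


def decode_whatwg_1252(data: bytes) -> str:
    """Decode bytes the way the windows-1252 column above describes."""
    chars = []
    for byte in data:
        try:
            chars.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            # Unassigned in Python's cp1252: keep the same-valued C1 control.
            chars.append(chr(byte))
    return "".join(chars)


def typographic_bytes() -> List[int]:
    """Bytes in 0x80-0x9F that do NOT decode to the same-valued C1 control."""
    return [b for b in range(0x80, 0xA0) if decode_whatwg_1252(bytes([b])) != chr(b)]


if __name__ == "__main__":
    sample = b"\x93quoted\x94 \x97 \x80"
    print(repr(sample.decode("latin-1")))      # strict ISO 8859-1: C1 controls
    print(repr(decode_whatwg_1252(sample)))    # browser-style: curly quotes, dash, euro
    print(len(typographic_bytes()))
```

Decoding per byte is slow but keeps the fallback rule explicit; a production decoder would use a 256-entry lookup table instead.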

The Common Pattern

Across all encodings, the pattern is the same:

| Source | Approach | Priority |
|---|---|---|
| Official standards | Follow national/ISO specifications | Correctness |
| Microsoft/Browser | Follow Windows code page behavior | Compatibility |
| WHATWG | Ratify what browsers actually do | Web content |

WHATWG standardized what browsers had been doing for decades. The result is pragmatic: “the standard” and “browser behavior” now agree, at the cost of diverging from the original national standards.
