Shift_JIS vs CP932: The Encoding Everyone Confuses

The precise technical differences between Shift_JIS and CP932 (Windows-31J), with byte-level evidence.

Historical Context: How Two Encodings Diverged

Shift_JIS was created in 1982 as a collaboration between ASCII Corporation and Microsoft to encode Japanese characters on personal computers. It interleaves single-byte characters (ASCII-compatible codes and JIS X 0201 half-width katakana) with double-byte JIS X 0208 characters, placing the double-byte lead bytes in ranges that JIS X 0201 left unused (0x81-0x9F and 0xE0-0xEF) so that existing single-byte text stays valid.

Microsoft adopted Shift_JIS for MS-DOS and Windows but gradually extended it with additional characters that customers demanded. This extended version was internally called “Code Page 932” (CP932) and was later registered with IANA under the name Windows-31J. The problem: Microsoft never clearly distinguished its extensions from the original standard, and the name “Shift_JIS” became ambiguous.

Name | Standard body | Base | Extensions
Shift_JIS | JIS (JIS X 0208) | JIS X 0201 + JIS X 0208 | None
CP932 / Windows-31J | Microsoft / IANA | Shift_JIS | NEC specials, NEC selection of IBM, IBM extensions
Shift_JIS-2004 | JIS (JIS X 0213) | JIS X 0201 + JIS X 0213 | Level 3 and Level 4 kanji

In practice, when software claims to use “Shift_JIS,” it almost always means CP932. The pure JIS-standard Shift_JIS is rarely encountered in the wild. This creates a persistent source of confusion in encoding detection and conversion.

Row Coverage Comparison

Shift_JIS and CP932 share the same fundamental structure: single-byte characters (0x00-0x7F and 0xA1-0xDF for half-width katakana) plus double-byte characters arranged in “rows” (ku) and “cells” (ten) from JIS X 0208. The difference lies in which rows are populated:

Row (ku) | Shift_JIS (JIS X 0208) | CP932 (Windows-31J)
1-8 | Symbols, numerals, Latin, kana | Same
9-12 | Unassigned | Unassigned
13 | Unassigned | NEC special characters (①②③ etc.)
14 | Unassigned | Unassigned
15 | Unassigned | Unassigned
16-84 | JIS Level 1+2 kanji | Same
85-88 | Unassigned | Unassigned
89-92 | Unassigned | NEC selection of IBM extensions
93-94 | Unassigned | Unassigned
95-114 | Not in JIS X 0208 | User-defined area (mapped to the Unicode Private Use Area)
115-119 | Not in JIS X 0208 | IBM extensions
120 | Not in JIS X 0208 | Unassigned

The core rows 1-8 and 16-84 are identical between the two encodings. The divergence occurs in the “gaps” — rows that JIS X 0208 left unassigned but Microsoft filled with vendor extensions. This is why most everyday Japanese text works identically in both encodings, but certain special characters fail when a strict Shift_JIS decoder encounters CP932-only characters.
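The row and cell of any double-byte sequence can be computed directly from its two bytes, which makes it easy to check which region of the table a character falls in. The following is a minimal sketch of the usual Shift_JIS arithmetic; the function names `sjis_to_kuten` and `in_vendor_extension_row` are made up for this illustration and are not part of any library.

```python
def sjis_to_kuten(lead, trail):
    """Convert a Shift_JIS/CP932 double-byte pair to (ku, ten)."""
    if not (0x81 <= lead <= 0x9F or 0xE0 <= lead <= 0xFC):
        raise ValueError("not a double-byte lead byte")
    if not (0x40 <= trail <= 0xFC) or trail == 0x7F:
        raise ValueError("not a valid trail byte")
    # Each lead byte covers two rows; the second lead range is offset.
    base = 0x81 if lead <= 0x9F else 0xC1
    ku = (lead - base) * 2 + 1
    if trail >= 0x9F:
        # Upper half of the trail range belongs to the even row.
        ku += 1
        ten = trail - 0x9E
    else:
        # Lower half skips the unused trail byte 0x7F.
        ten = trail - 0x3F - (1 if trail > 0x7F else 0)
    return ku, ten


def in_vendor_extension_row(ku):
    """True for rows that CP932 populates but JIS X 0208 does not."""
    return ku == 13 or 89 <= ku <= 92 or ku >= 95


ku, ten = sjis_to_kuten(0x87, 0x40)   # the bytes of the circled digit one
print(ku, ten, in_vendor_extension_row(ku))
```

Running the same function over every double-byte pair in a file gives a quick inventory of which rows the file actually uses.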

NEC Special Characters (Row 13)

Row 13 is the most notorious CP932 extension. NEC introduced these characters for their PC-9801 series in the 1980s, and Microsoft incorporated them into CP932. They include circled numbers, Roman numerals, unit symbols, and other frequently requested characters:

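Byte range | Characters | Unicode mapping
0x8740-0x8753 | ①②③...⑳ (circled numbers) | U+2460-U+2473
0x8754-0x875D | Ⅰ Ⅱ Ⅲ ... Ⅹ (uppercase Roman numerals) | U+2160-U+2169
0x875F-0x8775 | ㍉ ㌔ ㌢ ㍍ ㌘ ㌧ ㌃ ㌶ etc. (unit symbols) | U+3349, U+3314, ...
0x877E, 0x878D-0x878F | ㍻, ㍾ ㍽ ㍼ (era names) | U+337B, U+337E-U+337C
0x8780-0x878C | 〝 〟 № ㏍ ℡ ㊤ ㊥ ㊦ ㊧ ㊨ ㈱ ㈲ ㈹ (abbreviations and marks) | U+301D, U+301F, U+2116, ...
0x8790-0x879C | ≒ ≡ ∫ ∮ ∑ √ ⊥ ∠ ∟ ⊿ ∵ ∩ ∪ (mathematical symbols) | U+2252, U+2261, ...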

These characters are absent from standard Shift_JIS. A document containing ① (circled digit one, bytes 0x87 0x40) will decode perfectly in CP932 but produce an error or replacement character in a strict Shift_JIS decoder. This is the single most common cause of encoding issues when exchanging Japanese text between Windows and Unix/Mac systems.

The circled numbers are especially problematic because they are extensively used in Japanese business documents, legal texts, and everyday writing. Users are often shocked to learn that these “basic” characters are actually vendor extensions.
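One way to find such characters before exporting text to a strict consumer is to try each character against both codecs. This is a rough sketch that relies on Python's built-in `shift_jis` and `cp932` codecs; the helper name `cp932_only_chars` is hypothetical.

```python
def cp932_only_chars(text):
    """Return the characters that cp932 can encode but shift_jis cannot."""
    found = []
    for ch in text:
        try:
            ch.encode("shift_jis")
            continue  # fine in strict Shift_JIS, nothing to report
        except UnicodeEncodeError:
            pass
        try:
            ch.encode("cp932")
        except UnicodeEncodeError:
            continue  # not representable in either encoding
        found.append(ch)
    return found


print(cp932_only_chars("第①章"))
```

The check is per character, so it flags anything that exists only on the CP932 side, including characters whose Unicode mapping differs between the two codecs.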

IBM Extensions and the Duplicate Problem

CP932 includes two sets of IBM-originated characters, and this creates a unique problem: some characters have two different byte sequences that map to the same Unicode code point.

Source | Rows | Count | Examples
NEC selection of IBM ext. | 89-92 | ~374 chars | 髙 (0xEEE0), 﨑 (0xED95)
IBM extensions | 115-119 | ~388 chars | 髙 (0xFBFC), 﨑 (0xFAB1)

The character 髙 (taka, the “tall” variant of 高) can be encoded as either the NEC-row byte sequence (0xEEE0) or the IBM-row byte sequence (0xFBFC) in CP932. Both decode to U+9AD9 in Unicode. However, when converting from Unicode back to CP932, the encoder must choose one representation. Microsoft's conversion tables favor the IBM extension rows over the NEC selection rows, so bytes taken from the NEC selection rows do not survive a round trip through Unicode, and other implementations may choose differently.

This duplication causes subtle bugs: byte-level string comparison may consider two CP932 strings “different” even though they contain identical text. Hash-based lookups, deduplication, and binary search can all break. The solution is to always normalize to Unicode for comparison, never compare raw CP932 bytes directly.

// The duplicate encoding problem:
// 髙 (U+9AD9) in CP932:
//   NEC selection, row 92:  0xEEE0
//   IBM extension, row 118: 0xFBFC
// Both are valid CP932, both map to the same Unicode character.
//
// Byte comparison: 0xEEE0 ≠ 0xFBFC → "different"
// Unicode comparison: U+9AD9 === U+9AD9 → "same"
//
// Always convert to Unicode before comparing!

WHATWG Reality: The Web Treats Shift_JIS as CP932

The WHATWG Encoding Standard, which governs how web browsers handle character encodings, made a pragmatic decision: it defines a single encoding named Shift_JIS whose decoder follows CP932 (Windows-31J) behavior, and labels such as windows-31j resolve to that same encoding. When a web page declares charset=Shift_JIS, browsers decode it using CP932 rules.

Label in HTML | Actual decoder used | Standard
Shift_JIS | Windows-31J (CP932) | WHATWG
shift_jis | Windows-31J (CP932) | WHATWG
windows-31j | Windows-31J (CP932) | WHATWG
csshiftjis | Windows-31J (CP932) | WHATWG
ms_kanji | Windows-31J (CP932) | WHATWG
x-sjis | Windows-31J (CP932) | WHATWG

This means on the web, the distinction between Shift_JIS and CP932 is effectively erased. All six labels above trigger the same decoder. The WHATWG made this choice because virtually all “Shift_JIS” content on the web is actually CP932, and using a strict Shift_JIS decoder would break millions of pages containing NEC special characters.
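Code that processes scraped web content outside a browser can approximate this behavior by resolving every WHATWG Shift_JIS label to a CP932 codec. The sketch below uses Python's `cp932` codec as a stand-in; it is an approximation, since Python's codec and the WHATWG decoder are not guaranteed to agree on every edge case, and the helper names are hypothetical.

```python
# Labels the WHATWG Encoding Standard assigns to its Shift_JIS encoding.
WHATWG_SHIFT_JIS_LABELS = frozenset({
    "csshiftjis", "ms932", "ms_kanji", "shift-jis",
    "shift_jis", "sjis", "windows-31j", "x-sjis",
})


def python_codec_for_web_label(label):
    """Map a charset label from HTML or HTTP to a Python codec name."""
    normalized = label.strip(" \t\n\f\r").lower()
    if normalized in WHATWG_SHIFT_JIS_LABELS:
        return "cp932"  # decode the way browsers do
    return normalized   # other labels are passed through unchanged


def decode_web_bytes(data, label):
    """Decode bytes fetched from the web, replacing undecodable input."""
    return data.decode(python_codec_for_web_label(label), errors="replace")


print(decode_web_bytes(b"\x87\x40", "Shift_JIS"))
```

Passing unknown labels through unchanged is a simplification; a fuller version would cover the rest of the WHATWG label table as well.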

However, this web-centric unification does not apply everywhere. Email (MIME), programming languages, and database systems may still distinguish between the two. Python's shift_jis codec is stricter than its cp932 codec. Java's Shift_JIS maps to JIS X 0208, while MS932 maps to CP932. These differences matter when processing data outside the browser.

# Python encoding behavior difference:
text = "①"  # Circled digit one (U+2460)

# CP932: works fine
print(text.encode("cp932"))       # b'\x87@' (bytes 0x87 0x40)

# Strict Shift_JIS: fails, because ① is not in JIS X 0208
try:
    text.encode("shift_jis")
except UnicodeEncodeError as exc:
    print("shift_jis cannot encode it:", exc.reason)

# Java equivalents:
# Charset.forName("Shift_JIS")  → JIS X 0208 based
# Charset.forName("MS932")      → CP932 / Windows-31J

Unicode Mapping Variants

Even when both Shift_JIS and CP932 contain the same character, they sometimes map to different Unicode code points. The most famous example is the wave dash problem:

JIS character | JIS X 0208 → Unicode | CP932 → Unicode | Difference
Wave dash (1-33) | U+301C 〜 | U+FF5E ~ | WAVE DASH vs FULLWIDTH TILDE
Double vertical line (1-34) | U+2016 ‖ | U+2225 ∥ | DOUBLE VERTICAL LINE vs PARALLEL TO
Minus sign (1-61) | U+2212 − | U+FF0D - | MINUS SIGN vs FULLWIDTH HYPHEN-MINUS
Cent sign (1-81) | U+00A2 ¢ | U+FFE0 ¢ | CENT SIGN vs FULLWIDTH CENT SIGN
Pound sign (1-82) | U+00A3 £ | U+FFE1 £ | POUND SIGN vs FULLWIDTH POUND SIGN
Not sign (2-44) | U+00AC ¬ | U+FFE2 ¬ | NOT SIGN vs FULLWIDTH NOT SIGN
EM dash (1-29) | U+2014 — | U+2015 ― | EM DASH vs HORIZONTAL BAR

These seven mapping discrepancies (sometimes grouped under the name “wave dash problem”) cause round-trip conversion failures. If you convert text from CP932 to Unicode and then to JIS-standard Shift_JIS (or vice versa), these characters either fail to encode or change identity. The wave dash (〜 vs ~) is the most visible and the most frequently encountered in practice: it appears in countless Japanese price ranges, dates, and expressions.

The root cause is that Microsoft and the JIS committee independently chose different Unicode code points for the same visual character. Neither choice is “wrong” — they simply reflect different mapping philosophies. Microsoft preferred fullwidth compatibility forms; JIS preferred semantically precise code points.
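The divergence is easy to observe by decoding the same bytes with both codecs. The following sketch uses Python's built-in `shift_jis` and `cp932` codecs and covers six of the affected characters; the dash is left out because implementations disagree on which side maps it to which code point. The names `AFFECTED`, `mapping_differences`, and `to_jis_style` are illustrative only.

```python
# Byte pairs for six of the affected JIS X 0208 characters.
AFFECTED = {
    "wave dash": b"\x81\x60",
    "double vertical line": b"\x81\x61",
    "minus sign": b"\x81\x7c",
    "cent sign": b"\x81\x91",
    "pound sign": b"\x81\x92",
    "not sign": b"\x81\xca",
}

# CP932-style code point -> JIS-style code point.
CP932_TO_JIS = str.maketrans({
    "\uff5e": "\u301c",  # FULLWIDTH TILDE -> WAVE DASH
    "\u2225": "\u2016",  # PARALLEL TO -> DOUBLE VERTICAL LINE
    "\uff0d": "\u2212",  # FULLWIDTH HYPHEN-MINUS -> MINUS SIGN
    "\uffe0": "\u00a2",  # FULLWIDTH CENT SIGN -> CENT SIGN
    "\uffe1": "\u00a3",  # FULLWIDTH POUND SIGN -> POUND SIGN
    "\uffe2": "\u00ac",  # FULLWIDTH NOT SIGN -> NOT SIGN
})


def mapping_differences():
    """Decode each byte pair with both codecs and return both results."""
    return {name: (raw.decode("shift_jis"), raw.decode("cp932"))
            for name, raw in AFFECTED.items()}


def to_jis_style(text):
    """Rewrite CP932-style code points so strict Shift_JIS can encode them."""
    return text.translate(CP932_TO_JIS)


for name, (jis, ms) in mapping_differences().items():
    print(f"{name}: shift_jis U+{ord(jis):04X}, cp932 U+{ord(ms):04X}")
```

Applying a translation such as `to_jis_style` before encoding is a lossy normalization: it assumes every fullwidth form in the text originated from CP932 rather than being intended as a distinct character.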
