
Unicode Homoglyph Attacks: When Characters Lie About Who They Are

How visually identical characters from different scripts enable phishing and spoofing, and how to detect them.

The Threat: Characters That Look Alike

Unicode contains roughly 150,000 characters from more than 160 scripts. Many of these characters are visually identical or nearly identical, despite being distinct code points from different writing systems. These are called homoglyphs.

The most notorious example: Latin a (U+0061) and Cyrillic а (U+0430). They are pixel-perfect identical in most fonts, but they are distinct Unicode characters from different scripts.

Looks like	Latin	Cyrillic	Greek
A	U+0041 A	U+0410 А	U+0391 Α
a	U+0061 a	U+0430 а	—
B	U+0042 B	U+0412 В	U+0392 Β
E	U+0045 E	U+0415 Е	U+0395 Ε
o	U+006F o	U+043E о	U+03BF ο
p	U+0070 p	U+0440 р	U+03C1 ρ
x	U+0078 x	U+0445 х	U+03C7 χ

This is not a bug: these are legitimately different characters that share the same visual form, often because the scripts descend from a common ancestor (both Latin and Cyrillic derive from Greek). But attackers exploit this resemblance.


Multi-Script Homoglyphs in Depth

The problem extends far beyond Latin/Cyrillic. Unicode contains homoglyphs across many script pairs:

Visual	Scripts involved	Code points
1 / l / I	Digit / Latin lower / Latin upper	U+0031, U+006C, U+0049
0 / O / О	Digit / Latin / Cyrillic	U+0030, U+004F, U+041E
н / H	Cyrillic lower / Latin upper	U+043D, U+0048
ν / v	Greek lower / Latin lower	U+03BD, U+0076
ℓ / l	Script small L / Latin L	U+2113, U+006C
⁰ / ° / o	Superscript 0 / Degree / Latin o	U+2070, U+00B0, U+006F
ー / — / ─	Katakana / Em dash / Box drawing	U+30FC, U+2014, U+2500

Fullwidth characters add another dimension: Ａ (U+FF21, FULLWIDTH LATIN CAPITAL LETTER A) looks similar to A (U+0041) in some contexts. Mathematical alphanumeric symbols (U+1D400–U+1D7FF) provide yet more lookalikes: 𝐀 (U+1D400, MATHEMATICAL BOLD CAPITAL A).

Real-World Attacks

Homoglyph attacks have caused real damage:

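IDN homograph attacks: Registering domains like аpple.com (with Cyrillic а) that can display identically to apple.com in browser address bars. Browsers responded with IDN display policies that fall back to showing the Punycode (xn--) form of a label when it looks suspicious, for example when it mixes scripts.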
Source code attacks (Trojan Source): Inserting Bidi override characters or homoglyphs in source code to make malicious logic appear benign during code review. A 2021 Cambridge paper demonstrated how this could inject vulnerabilities invisible to human reviewers.
Phishing emails: Spoofing sender names or URLs using mixed-script homoglyphs that pass visual inspection.
Social media impersonation: Creating usernames that look identical to legitimate accounts using Cyrillic or Greek substitutions.

// These strings look identical but are different:
const latin  = "apple";      // All Latin
const mixed  = "аpple";      // Cyrillic а + Latin pple

latin === mixed              // false
latin.length === mixed.length // true (both 5)

// Code point comparison reveals the difference:
latin.codePointAt(0)  // 97  (U+0061 Latin Small Letter A)
mixed.codePointAt(0)  // 1072 (U+0430 Cyrillic Small Letter A)
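
To inspect a whole string rather than a single position, a small helper can list every code point. This is a minimal sketch; the name codePoints is hypothetical and not part of any library.

```javascript
// Hypothetical helper: list each character's code point in U+XXXX form.
function codePoints(str) {
  // Spreading a string iterates by code point, so characters outside the
  // Basic Multilingual Plane (such as U+1D400) are not split into surrogates.
  return [...str].map(ch => {
    const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
    return `U+${hex}`;
  });
}

console.log(codePoints("apple"));       // first entry is U+0061 (Latin a)
console.log(codePoints("\u0430pple"));  // first entry is U+0430 (Cyrillic а)
```

Writing the Cyrillic letter as the escape \u0430 keeps the example unambiguous even when the source file itself is viewed in a font that hides the difference.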

Detection Methods

Several approaches exist to detect homoglyph attacks:

Script mixing detection: Flag strings that contain characters from multiple scripts (e.g., Latin mixed with Cyrillic). Unicode Technical Standard #39 (UTS #39, Unicode Security Mechanisms) defines mixed-script detection.
Confusable detection: UTS #39 also publishes a confusables.txt data file that maps each confusable character to a prototype. Applying that mapping to a string, with NFD normalization before and after, yields its “skeleton”, a canonical form for comparison. Two different strings with the same skeleton are confusable.
Single-script enforcement: Require that identifiers (usernames, domains) use characters from only one script.
Visual inspection tools: Use tools like this one to reveal the actual code points behind text that looks suspicious.

// UTS #39 confusable skeleton (conceptual; skeleton() is a placeholder, not a built-in):
skeleton("аpple")  // "apple" (Cyrillic а maps to Latin a)
skeleton("apple")  // "apple" (already Latin)

// If skeletons match, strings are confusable:
skeleton("аpple") === skeleton("apple")  // true, so confusable

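// Script detection via Unicode property escapes (ES2018+).
// Simplified: only three scripts are checked; digits, punctuation,
// and characters from other scripts are ignored rather than counted as Latin.
const SCRIPT_TESTS = {
  Latin: /\p{Script=Latin}/u,
  Cyrillic: /\p{Script=Cyrillic}/u,
  Greek: /\p{Script=Greek}/u,
};

function getScripts(str) {
  const found = new Set();
  for (const ch of str) {
    for (const [name, re] of Object.entries(SCRIPT_TESTS)) {
      if (re.test(ch)) found.add(name);
    }
  }
  return [...found];
}

getScripts("аpple")  // ["Cyrillic", "Latin"]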
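
The skeleton comparison can be sketched in runnable form as follows. This is an illustration under stated assumptions, not an implementation of UTS #39: SAMPLE_PROTOTYPES is a tiny hand-picked table standing in for the full confusables.txt data, and the names sketchSkeleton and looksConfusable are hypothetical.

```javascript
// ASSUMPTION: a small hand-picked sample. A real check would load the
// full confusables.txt data published with UTS #39.
const SAMPLE_PROTOTYPES = new Map([
  ["\u0430", "a"], // Cyrillic а
  ["\u0435", "e"], // Cyrillic е
  ["\u043E", "o"], // Cyrillic о
  ["\u0440", "p"], // Cyrillic р
  ["\u0445", "x"], // Cyrillic х
  ["\u03BF", "o"], // Greek ο
]);

// Normalize to NFD, swap each character for its prototype when the
// sample table has one, then normalize to NFD again.
function sketchSkeleton(str) {
  const mapped = [...str.normalize("NFD")]
    .map(ch => SAMPLE_PROTOTYPES.get(ch) ?? ch)
    .join("");
  return mapped.normalize("NFD");
}

// Two *different* strings with the same skeleton are treated as confusable.
function looksConfusable(a, b) {
  return a !== b && sketchSkeleton(a) === sketchSkeleton(b);
}

console.log(looksConfusable("\u0430pple", "apple")); // true
console.log(looksConfusable("apple", "apple"));      // false (same string)
console.log(looksConfusable("apple", "apply"));      // false
```

Comparing skeletons rather than raw strings means a lookup against a list of protected names (for example, existing usernames) can catch substitutions that mixed-script detection alone would miss, such as a name written entirely in lookalike Cyrillic letters.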

Normalization as a Partial Defense

Unicode normalization (especially NFKC) can collapse some homoglyphs but not all:

Homoglyph pair	NFKC helps?	Reason
Ａ (U+FF21) vs A (U+0041)	Yes	NFKC maps fullwidth to ASCII
ⅰ (U+2170) vs i (U+0069)	Yes	NFKC decomposes Roman numerals
а (U+0430) vs a (U+0061)	No	Different scripts, not compatibility equivalents
Α (U+0391) vs A (U+0041)	No	Greek and Latin are distinct
𝐀 (U+1D400) vs A (U+0041)	Yes	NFKC maps math alphanumerics

NFKC normalization is a useful first pass — it eliminates compatibility variants and fullwidth forms. But cross-script homoglyphs (Latin vs Cyrillic vs Greek) survive normalization because they are not compatibility equivalents. They are genuinely different characters that just happen to look the same.
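
The rows of the table above can be checked directly with the built-in String.prototype.normalize. The helper name collapsesUnderNfkc is hypothetical and exists only for this sketch.

```javascript
// Each pair: a lookalike character and the ASCII letter it resembles.
const pairs = [
  ["\uFF21", "A"],    // Fullwidth Ａ
  ["\u2170", "i"],    // Small Roman numeral ⅰ
  ["\u{1D400}", "A"], // Mathematical bold 𝐀
  ["\u0430", "a"],    // Cyrillic а
  ["\u0391", "A"],    // Greek Α
];

// True when NFKC normalization turns the lookalike into the ASCII letter.
function collapsesUnderNfkc(lookalike, ascii) {
  return lookalike.normalize("NFKC") === ascii;
}

for (const [lookalike, ascii] of pairs) {
  console.log(lookalike, ascii, collapsesUnderNfkc(lookalike, ascii));
}
// The first three pairs print true; the Cyrillic and Greek pairs print false.
```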

For robust defense, you need both normalization and confusable detection. This tool lets you see exactly which code points are behind any text, making it easy to spot when characters are not what they appear to be.

