Unicode Homoglyph Attacks: When Characters Lie About Who They Are

How visually identical characters from different scripts enable phishing and spoofing, and how to detect them.

The Threat: Characters That Look Alike

Unicode defines roughly 150,000 characters across more than 160 scripts. Many of these characters are visually identical or nearly identical, despite being distinct code points from different writing systems. These are called homoglyphs.

The most notorious example: Latin a (U+0061) and Cyrillic а (U+0430). They are pixel-perfect identical in most fonts, but they are distinct Unicode characters from different scripts.

Looks like | Latin    | Cyrillic | Greek
A          | U+0041 A | U+0410 А | U+0391 Α
a          | U+0061 a | U+0430 а | -
B          | U+0042 B | U+0412 В | U+0392 Β
E          | U+0045 E | U+0415 Е | U+0395 Ε
o          | U+006F o | U+043E о | U+03BF ο
p          | U+0070 p | U+0440 р | U+03C1 ρ
x          | U+0078 x | U+0445 х | U+03C7 χ

This is not a bug. These are legitimately distinct characters whose glyphs share the same visual form, in many cases because the Latin and Cyrillic alphabets both descend from Greek. Attackers exploit the resemblance.

Multi-Script Homoglyphs in Depth

The problem extends far beyond Latin/Cyrillic. Unicode contains homoglyphs across many script pairs:

Visual     | Scripts involved                   | Code points
1 / l / I  | Digit / Latin lower / Latin upper  | U+0031, U+006C, U+0049
0 / O / О  | Digit / Latin / Cyrillic           | U+0030, U+004F, U+041E
н / H      | Cyrillic lower / Latin upper       | U+043D, U+0048
ν / v      | Greek lower / Latin lower          | U+03BD, U+0076
ℓ / l      | Script small L / Latin L           | U+2113, U+006C
⁰ / ° / o  | Superscript 0 / Degree / Latin o   | U+2070, U+00B0, U+006F
ー / — / ─ | Katakana / Em dash / Box drawing   | U+30FC, U+2014, U+2500

Fullwidth characters add another dimension: Ａ (U+FF21, FULLWIDTH LATIN CAPITAL LETTER A) looks similar to A (U+0041) in some contexts. Mathematical alphanumeric symbols (U+1D400–U+1D7FF) provide yet more lookalikes, such as 𝐀 (U+1D400, MATHEMATICAL BOLD CAPITAL A).
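To make these lookalikes concrete, here is a minimal sketch that lists the code point behind each character of a string. The helper name inspectCodePoints is hypothetical, chosen only for this illustration.

```javascript
// Hypothetical helper: list the code point behind every character of a string.
function inspectCodePoints(text) {
  const rows = [];
  // for...of iterates by code point, so characters outside the BMP
  // (such as U+1D400) are not split into two UTF-16 surrogate halves.
  for (const ch of text) {
    const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
    rows.push({ char: ch, codePoint: `U+${hex}` });
  }
  return rows;
}

// Three different characters that can all pass for a capital "A":
// Latin A, fullwidth A, and mathematical bold A.
console.log(inspectCodePoints("A\uFF21\u{1D400}"));
// code points: U+0041, U+FF21, U+1D400
```

Iterating with for...of rather than by index matters here: indexing by UTF-16 unit would report the mathematical bold A as two surrogate values instead of one code point.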

Real-World Attacks

Homoglyph attacks have caused real damage:

  • IDN homograph attacks: Registering domains like аpple.com (with Cyrillic а) that display identically to apple.com in browser address bars. This led to the development of IDN display policies in browsers.
  • Source code attacks (Trojan Source): Inserting Bidi override characters or homoglyphs in source code to make malicious logic appear benign during code review. A 2021 Cambridge paper demonstrated how this could inject vulnerabilities invisible to human reviewers.
  • Phishing emails: Spoofing sender names or URLs using mixed-script homoglyphs that pass visual inspection.
  • Social media impersonation: Creating usernames that look identical to legitimate accounts using Cyrillic or Greek substitutions.
// These strings look identical but are different:
const latin = "apple";  // all Latin
const mixed = "аpple";  // Cyrillic а (U+0430) + Latin "pple"

console.log(latin === mixed);               // false
console.log(latin.length === mixed.length); // true (both 5)

// Comparing code points reveals the difference:
console.log(latin.codePointAt(0)); // 97   (U+0061 LATIN SMALL LETTER A)
console.log(mixed.codePointAt(0)); // 1072 (U+0430 CYRILLIC SMALL LETTER A)
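
The Trojan Source item above also mentions Bidi override characters. As a rough sketch of how a review tool might surface them, the following scans source text for the Bidi embedding, override, and isolate controls (U+202A to U+202E and U+2066 to U+2069). The function name findBidiControls and the shape of its return value are assumptions made for this example, not part of any existing tool.

```javascript
// Bidi embedding/override controls (U+202A-U+202E) and isolates (U+2066-U+2069).
const BIDI_CONTROLS = /[\u202A-\u202E\u2066-\u2069]/gu;

// Hypothetical helper: report where Bidi control characters occur in source text.
function findBidiControls(source) {
  const findings = [];
  source.split("\n").forEach((line, index) => {
    for (const match of line.matchAll(BIDI_CONTROLS)) {
      const hex = match[0].codePointAt(0).toString(16).toUpperCase();
      findings.push({
        line: index + 1,
        column: match.index + 1, // 1-based, counted in UTF-16 code units
        codePoint: `U+${hex}`,
      });
    }
  });
  return findings;
}

// A string literal hiding a RIGHT-TO-LEFT OVERRIDE (U+202E):
const suspicious = 'const role = "user\u202E";\nconst ok = true;';
console.log(findBidiControls(suspicious));
```

A scanner like this only reports positions; deciding whether a given control character is legitimate (for example, in right-to-left text inside a comment) is left to the reviewer.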

Detection Methods

Several approaches exist to detect homoglyph attacks:

  • Script mixing detection: Flag strings that contain characters from multiple scripts (e.g., Latin mixed with Cyrillic). Unicode Technical Standard #39 (UTS #39, Unicode Security Mechanisms) defines mixed-script detection.
  • Confusable detection: UTS #39 also publishes a confusables.txt data file that maps each confusable character to a prototype. Applying that mapping to a whole string, together with normalization, yields the string's “skeleton”. Two strings with the same skeleton are confusable.
  • Single-script enforcement: Requiring that identifiers (usernames, domains) use characters from only one script.
  • Visual inspection tools: Using tools like this one to reveal the actual code points behind text that looks suspicious.
// UTS #39 confusable skeleton (conceptual; skeleton() is not a built-in function):
//   skeleton("аpple") → "apple"   Cyrillic а maps to Latin a
//   skeleton("apple") → "apple"   already Latin
//
// If the skeletons match, the strings are confusable:
//   skeleton("аpple") === skeleton("apple")   true → confusable!

// Script detection (simplified: only three scripts are named).
// Unicode property escapes (ES2018) read the real Script property,
// so no hand-written code point ranges are needed.
function getScripts(str) {
  return [...new Set(
    [...str].map(ch => {
      if (/\p{Script=Cyrillic}/u.test(ch)) return "Cyrillic";
      if (/\p{Script=Greek}/u.test(ch)) return "Greek";
      if (/\p{Script=Latin}/u.test(ch)) return "Latin";
      return "Other"; // digits, punctuation, and every other script
    })
  )];
}
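
The skeleton comparison described above can be sketched in runnable form. This is only an approximation under stated assumptions: the PROTOTYPES map is a small hand-picked subset based on the table at the top of this article, standing in for the full confusables.txt data, and the function loosely follows the normalize, map, normalize outline rather than implementing every step of UTS #39. The names PROTOTYPES and areConfusable are hypothetical.

```javascript
// Hand-picked subset for illustration only; a real implementation would
// load the full confusables.txt data published with UTS #39.
const PROTOTYPES = new Map([
  ["\u0430", "a"], // Cyrillic а
  ["\u043E", "o"], // Cyrillic о
  ["\u0440", "p"], // Cyrillic р
  ["\u0445", "x"], // Cyrillic х
  ["\u0410", "A"], // Cyrillic А
  ["\u0391", "A"], // Greek Α
  ["\u03BF", "o"], // Greek ο
  ["\u03C1", "p"], // Greek ρ
]);

// Simplified skeleton: NFD, replace each character by its prototype, NFD again.
function skeleton(text) {
  const mapped = [...text.normalize("NFD")]
    .map(ch => PROTOTYPES.get(ch) ?? ch)
    .join("");
  return mapped.normalize("NFD");
}

// Two strings are treated as confusable when their skeletons are equal.
function areConfusable(a, b) {
  return skeleton(a) === skeleton(b);
}

console.log(areConfusable("\u0430pple", "apple")); // true  (Cyrillic а vs Latin a)
console.log(areConfusable("apple", "apply"));      // false (genuinely different)
```

Note that a skeleton is meant only for comparison. It should not be shown to users or stored as the "cleaned" form of the input.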

Normalization as a Partial Defense

Unicode normalization (especially NFKC) can collapse some homoglyphs but not all:

Homoglyph pair             | NFKC helps? | Reason
Ａ (U+FF21) vs A (U+0041)  | Yes         | NFKC maps fullwidth to ASCII
ⅰ (U+2170) vs i (U+0069)   | Yes         | NFKC decomposes Roman numerals
а (U+0430) vs a (U+0061)   | No          | Different scripts, not compatibility equivalents
Α (U+0391) vs A (U+0041)   | No          | Greek and Latin are distinct
𝐀 (U+1D400) vs A (U+0041)  | Yes         | NFKC maps math alphanumerics

NFKC normalization is a useful first pass: it eliminates compatibility variants such as fullwidth forms and mathematical alphanumerics. But cross-script homoglyphs (Latin vs Cyrillic vs Greek) survive normalization because they are not compatibility equivalents. They are genuinely different characters that just happen to look the same.
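
The table rows can be checked directly with the built-in String.prototype.normalize. The helper name collapsesUnderNfkc is hypothetical; the sketch simply tests whether a lookalike becomes its ASCII target after NFKC.

```javascript
// Hypothetical helper: does NFKC turn the lookalike into the target character?
function collapsesUnderNfkc(lookalike, target) {
  return lookalike.normalize("NFKC") === target;
}

const samples = [
  ["\uFF21", "A"],    // fullwidth A
  ["\u2170", "i"],    // small Roman numeral one
  ["\u{1D400}", "A"], // mathematical bold capital A
  ["\u0430", "a"],    // Cyrillic а
  ["\u0391", "A"],    // Greek capital Alpha
];

for (const [lookalike, target] of samples) {
  const hex = lookalike.codePointAt(0).toString(16).toUpperCase();
  console.log(`U+${hex} -> ${target}: ${collapsesUnderNfkc(lookalike, target)}`);
}
// The first three report true; the Cyrillic and Greek letters report false.
```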

For robust defense, you need both normalization and confusable detection. This tool lets you see exactly which code points are behind any text, making it easy to spot when characters are not what they appear to be.
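
As one possible way to layer these defenses for identifiers such as usernames, the sketch below applies NFKC first and then enforces the single-script rule from the Detection Methods section. It is a starting point under stated assumptions, not a complete policy: only three scripts are named (everything else is reported as "Other"), the name vetIdentifier and its return shape are hypothetical, and a skeleton comparison against existing names would still be needed to catch whole-script lookalikes.

```javascript
// Scripts named explicitly in this sketch; anything else becomes "Other".
const NAMED_SCRIPTS = ["Latin", "Cyrillic", "Greek"].map(
  name => [name, new RegExp(`\\p{Script=${name}}`, "u")]
);
// Digits, punctuation, and combining marks belong to no particular script.
const NEUTRAL = /[\p{Script=Common}\p{Script=Inherited}]/u;

// Hypothetical helper: normalize, then report which scripts remain.
function vetIdentifier(raw) {
  const normalized = raw.normalize("NFKC");
  const scripts = new Set();
  for (const ch of normalized) {
    if (NEUTRAL.test(ch)) continue;
    const hit = NAMED_SCRIPTS.find(([, re]) => re.test(ch));
    scripts.add(hit ? hit[0] : "Other");
  }
  return { normalized, scripts: [...scripts], singleScript: scripts.size <= 1 };
}

// Fullwidth A is folded to ASCII by NFKC, leaving a single-script string:
console.log(vetIdentifier("\uFF21pple"));
// Cyrillic а survives NFKC, so the mixed-script check has to catch it:
console.log(vetIdentifier("\u0430pple"));
```

Running normalization before the script check keeps compatibility variants from being reported as a separate problem, so the script check only has to deal with the cross-script cases that normalization cannot fix.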
