Unicode Homoglyph Attacks: When Characters Lie About Who They Are
How visually identical characters from different scripts enable phishing and spoofing, and how to detect them.
The Threat: Characters That Look Alike
Unicode contains over 150,000 characters from more than 160 scripts. Many of these characters are visually identical or nearly identical, despite being completely different code points from different writing systems. These are called homoglyphs.
The most notorious example: Latin a (U+0061) and Cyrillic а (U+0430). They are pixel-perfect identical in most fonts, but they are distinct Unicode characters from different scripts.
| Looks like | Latin | Cyrillic | Greek |
|---|---|---|---|
| A | U+0041 A | U+0410 А | U+0391 Α |
| a | U+0061 a | U+0430 а | — |
| B | U+0042 B | U+0412 В | U+0392 Β |
| E | U+0045 E | U+0415 Е | U+0395 Ε |
| o | U+006F o | U+043E о | U+03BF ο |
| p | U+0070 p | U+0440 р | U+03C1 ρ |
| x | U+0078 x | U+0445 х | U+03C7 χ |
This is not a bug: these are legitimately distinct characters whose glyphs coincide, often because the scripts share a common ancestry. Attackers exploit that resemblance.
Multi-Script Homoglyphs in Depth
The problem extends far beyond Latin/Cyrillic. Lookalikes occur across many script pairs, among symbols and digits, and even within a single script:
| Visual | Scripts involved | Code points |
|---|---|---|
| 1 / l / I | Digit / Latin lower / Latin upper | U+0031, U+006C, U+0049 |
| 0 / O / О | Digit / Latin / Cyrillic | U+0030, U+004F, U+041E |
| н / H | Cyrillic lower / Latin upper | U+043D, U+0048 |
| ν / v | Greek lower / Latin lower | U+03BD, U+0076 |
| ℓ / l | Script small L / Latin L | U+2113, U+006C |
| ⁰ / ° / o | Superscript 0 / Degree / Latin o | U+2070, U+00B0, U+006F |
| ー / — / ─ | Katakana / Em dash / Box drawing | U+30FC, U+2014, U+2500 |
Fullwidth characters add another dimension: Ａ (U+FF21, FULLWIDTH LATIN CAPITAL LETTER A) looks similar to A (U+0041) in some contexts. Mathematical alphanumeric symbols (U+1D400–U+1D7FF) provide yet more lookalikes: 𝐀 (U+1D400, MATHEMATICAL BOLD CAPITAL A).
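To check which of these lookalikes a string actually contains, it can help to list its code points. The following is a minimal sketch; `describeCodePoints` is a hypothetical helper name, not part of any library. It iterates by code point rather than by UTF-16 unit, so characters above U+FFFF (such as the mathematical alphanumerics) are reported as a single entry.

```javascript
// Hypothetical helper: list each character of a string as "U+XXXX".
// Spreading a string iterates by code point, so surrogate pairs stay together.
function describeCodePoints(str) {
  return [...str].map(ch =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Fullwidth A followed by ASCII A, written with escapes to avoid ambiguity.
const fullwidthThenAscii = describeCodePoints("\uFF21A");

// Mathematical bold capital A occupies two UTF-16 units but one code point.
const mathBold = describeCodePoints("\u{1D400}");
```

Here `fullwidthThenAscii` holds two entries with different code points, even though both characters read as "A" on screen.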
Real-World Attacks
Homoglyph attacks have caused real damage:
- IDN homograph attacks: Registering domains like аpple.com (with Cyrillic а) that display identically to apple.com in browser address bars. This led to the development of IDN display policies in browsers.
- Source code attacks (Trojan Source): Inserting Bidi override characters or homoglyphs in source code to make malicious logic appear benign during code review. A 2021 Cambridge paper demonstrated how this could inject vulnerabilities invisible to human reviewers.
- Phishing emails: Spoofing sender names or URLs using mixed-script homoglyphs that pass visual inspection.
- Social media impersonation: Creating usernames that look identical to legitimate accounts using Cyrillic or Greek substitutions.
// These strings look identical but are different:
const latin = "apple"; // All Latin
const mixed = "аpple"; // Cyrillic а + Latin pple
latin === mixed // false
latin.length === mixed.length // true (both 5)
// Comparing code points reveals the difference:
latin.codePointAt(0) // 97 (U+0061 Latin Small Letter A)
mixed.codePointAt(0) // 1072 (U+0430 Cyrillic Small Letter A)
Detection Methods
Several approaches exist to detect homoglyph attacks:
- Script mixing detection: Flag strings that contain characters from multiple scripts (e.g., Latin mixed with Cyrillic). Unicode Technical Standard #39 (UTS #39, Unicode Security Mechanisms) defines mixed-script detection.
- Confusable detection: UTS #39 also publishes a confusables.txt file that maps each confusable character to a prototype. Applying those mappings to a whole string yields its “skeleton”, a canonical form for comparison. Two strings with the same skeleton are confusable.
- Single-script enforcement: Requiring that identifiers (usernames, domains) use characters from only one script.
- Visual inspection tools: Using tools like this one to reveal the actual code points behind text that looks suspicious.
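The script-mixing check from the first bullet can be sketched with JavaScript's Unicode property escapes (`\p{Script=...}` with the `u` flag), which read the real Script property rather than guessing from block ranges. The helper names `scriptsIn` and `isMixedScript` are assumptions made for this example, and only three scripts are checked. This is an illustration of the idea, not the full UTS #39 algorithm.

```javascript
// Scripts this sketch looks for; a fuller check would cover many more.
const SCRIPTS = ["Latin", "Cyrillic", "Greek"];

// One regex per script, e.g. /\p{Script=Cyrillic}/u.
const SCRIPT_PATTERNS = SCRIPTS.map(name => [
  name,
  new RegExp(`\\p{Script=${name}}`, "u"),
]);

// Hypothetical helper: which of the listed scripts occur in the string.
// Digits and punctuation have Script=Common, so they match none of these.
function scriptsIn(str) {
  return SCRIPT_PATTERNS
    .filter(([, pattern]) => pattern.test(str))
    .map(([name]) => name);
}

// Hypothetical helper: true when more than one listed script is present.
function isMixedScript(str) {
  return scriptsIn(str).length > 1;
}

const plainDomain = isMixedScript("apple.com");        // all Latin plus "."
const spoofedDomain = isMixedScript("\u0430pple.com"); // Cyrillic а + Latin
```

Because the dot in a domain name belongs to the Common script, it does not trigger a false positive in this sketch. Note that a string written entirely in one non-Latin script would not be flagged, which is why mixed-script checks are usually paired with confusable detection.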
// Unicode UTS #39 confusable skeleton (conceptual; skeleton() is not a built-in):
//   skeleton("аpple") → "apple"   (Cyrillic а maps to Latin a)
//   skeleton("apple") → "apple"   (already Latin)
// If skeletons match, the strings are confusable:
//   skeleton("аpple") === skeleton("apple")   // true → confusable!

// Script detection (simplified: block ranges stand in for the Unicode Script property):
function getScripts(str) {
  return [...new Set(
    [...str].map(ch => {
      const cp = ch.codePointAt(0);
      if (cp >= 0x0400 && cp <= 0x04FF) return "Cyrillic";
      if (cp >= 0x0370 && cp <= 0x03FF) return "Greek";
      return "Latin"; // everything else is treated as Latin in this sketch
    })
  )];
}
Normalization as a Partial Defense
Unicode normalization (especially NFKC) can collapse some homoglyphs but not all:
| Homoglyph pair | NFKC helps? | Reason |
|---|---|---|
| Ａ (U+FF21) vs A (U+0041) | Yes | NFKC maps fullwidth to ASCII |
| ⅰ (U+2170) vs i (U+0069) | Yes | NFKC decomposes Roman numerals |
| а (U+0430) vs a (U+0061) | No | Different scripts, not compatibility equivalents |
| Α (U+0391) vs A (U+0041) | No | Greek and Latin are distinct |
| 𝐀 (U+1D400) vs A (U+0041) | Yes | NFKC maps math alphanumerics |
NFKC normalization is a useful first pass — it eliminates compatibility variants and fullwidth forms. But cross-script homoglyphs (Latin vs Cyrillic vs Greek) survive normalization because they are not compatibility equivalents. They are genuinely different characters that just happen to look the same.
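The rows of the table above can be checked directly with the built-in `String.prototype.normalize`. In the sketch below, the `pairs` array and the `nfkcCollapses` helper are names chosen for this example. Characters are written as escapes so the code points are unambiguous.

```javascript
// Pairs from the table: [lookalike, ASCII character it resembles].
const pairs = [
  ["\uFF21", "A"],    // FULLWIDTH LATIN CAPITAL LETTER A
  ["\u2170", "i"],    // SMALL ROMAN NUMERAL ONE
  ["\u{1D400}", "A"], // MATHEMATICAL BOLD CAPITAL A
  ["\u0430", "a"],    // CYRILLIC SMALL LETTER A
  ["\u0391", "A"],    // GREEK CAPITAL LETTER ALPHA
];

// Hypothetical helper: true when NFKC turns the lookalike into the target.
function nfkcCollapses(lookalike, target) {
  return lookalike.normalize("NFKC") === target;
}

// One boolean per row, in the same order as `pairs`.
const results = pairs.map(([lookalike, target]) =>
  nfkcCollapses(lookalike, target)
);
```

The first three entries of `results` are `true` and the last two are `false`, matching the "NFKC helps?" column for those rows.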
For robust defense, you need both normalization and confusable detection. This tool lets you see exactly which code points are behind any text, making it easy to spot when characters are not what they appear to be.
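Combining the two steps might look like the sketch below. It is a toy under stated assumptions: the `CONFUSABLES` map is a small hand-picked subset for illustration, not the contents of confusables.txt, and `toySkeleton` and `looksConfusable` are hypothetical names. The real UTS #39 skeleton algorithm differs in detail, and a production implementation would load the full data file.

```javascript
// Hand-picked cross-script lookalikes (assumption: illustrative subset only).
const CONFUSABLES = new Map([
  ["\u0430", "a"], // Cyrillic а
  ["\u043E", "o"], // Cyrillic о
  ["\u0440", "p"], // Cyrillic р
  ["\u0445", "x"], // Cyrillic х
  ["\u03BF", "o"], // Greek ο
  ["\u0410", "A"], // Cyrillic А
  ["\u0391", "A"], // Greek Α
]);

// Step 1: NFKC folds compatibility variants (fullwidth, math alphanumerics).
// Step 2: the map folds cross-script lookalikes that NFKC leaves alone.
function toySkeleton(str) {
  const normalized = str.normalize("NFKC");
  return [...normalized].map(ch => CONFUSABLES.get(ch) || ch).join("");
}

// Two different strings are flagged when their toy skeletons match.
function looksConfusable(a, b) {
  return a !== b && toySkeleton(a) === toySkeleton(b);
}

const spoofDetected = looksConfusable("\u0430pple", "apple"); // Cyrillic а
const sameString = looksConfusable("apple", "apple");
```

Running NFKC first means the map only has to cover the cross-script cases, which keeps the two defenses separate and easier to reason about.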