24

I've read about how Zalgo text works, and I'm looking to learn how a chat or forum software could prevent that kind of annoyance. More precisely, what is the complete set of Unicode combining characters that needs to:

a) either be stripped, assuming chat participants are to use only languages that don't require combining marks (i.e. you could write "fiancé" with a combining mark, but you'd be a bit Zalgo'ed yourself if you insisted on doing so); or,

b) be reduced to a maximum of 8 consecutive characters (the maximum encountered in actual languages)?

EDIT: In the meantime I found a completely differently phrased question ("How to protect against... diacritics?"), which is essentially the same as this one. I made its title more explicit so others will find it as well.

18
  • 6
    Why should chat or forum software prevent vertical rubbish automatically when it cannot do the same with horizontal rubbish? Commented Mar 9, 2014 at 1:14
  • 29
    Y̒̌͛́̓̀͊ͫ͌ͦo͊͂ͤ̊̒̆͊ͪ̋ͯͥ͌ͧ͑̑̂͐͗̏u̇̽̿͋̋́̅̐̄ͮ̿͆̚r͊ͥͣ͂̑ͩ̒̑̋̊̅ ͬ̔̍̾̓ͩ̇͒ͯ͗͐͐ͧ̍͊̚c͋̈́̂̽ͬ͒͊ͣͤ͊̋͛̿͒̚̚oͩͫ͛̂̄̐̽̑ͬ͑̍̃ͯm̉̈́̾ͨ̆̊ͨͪ͌mͫ̾͋ͨͤ̈́͑́͐́eͮ͐̍̌ͬ͛̃̃̿ͪ̌͂n̊͋ͫ͆t̊͊ͪ̌́͆ ̎̔̉ͮ̋̋͐̐ͮ͛̈̆̉̈́ͣ̎̐̏̚i͆̌͆̃̾̽ͥ̎͊́s̑̌̓̆͊́ͦ͆̍̇̌̀̈̓̈́ͪ̚ ̍̀͌ͩͮ́̿́̓̈́̍ͣ̔ṁ̋̑̉ͤi̒̌̿̔ͣ̇͐ͭͫͬ̎͊ͬ͊̓s͗̽ͦ̄͋ͤ͆͊ͬ̈́̂̌ͦ͒̈́̓ͪ̏gͣ͆̃͛ͨͩ̚u͆͆̄ͬ̍ͯiͬͩ̎̑d̍ͩ̐ͫ̍e͗ͪ̀ͥͨ̀͌̒ͦͩͣ̎ͯ͂̔ͤd̆ͭ͆.̑̃͂̆̀̈́̽ͭ̂ͮ̓ If that was not demo enough, here's why: 1) a crap regular comment affects only itself, while a Zalgo one affects others. 2) Because it is possible to automatically filter out Zalgo, while automatically filtering out low-quality comments requires developing general AI. Commented Mar 9, 2014 at 1:38
  • 4
    @DanDascalescu I vote to keep open, if only for your enlightening demonstrative comment. Finding this kind of thing puts a smile on my face and that's worth more to me than normalizing SO. Commented Mar 9, 2014 at 8:04
  • 3
    I really think this should be reopened. This question can be answered with code and I posted some code as an answer. Commented Mar 9, 2014 at 12:30
  • 14
    You cannot prevent Zalgo... Ḧ̛̪̠́̌ͦ̔̄̐̓͗ͭ̒̀͗́̚ͅE̻̪͇͓͓͖͕̖͓̘͚̰̺͔̻̬͙͑͂̑ͫͧ̊̏ͨ͛ͯ̅̋͑ͤͤ̅̒͘͞ͅͅ ̧̢̡̩̥̯̤͚̤͍͓͙̳̞̦̓̓̇ͧ̎̐̓ͤ̀͜ͅC̦̫̗̠̝̅̀ͨ̊̕͝͝ͅŌ̷̝̝̰̞͓͎̫̖͚̲̟̽ͫ́͛̋̍̒ͦ̊̂̈ͤ͆͒͞ͅṂ̴̠̠̜̣̹ͥ̓̇͐̇ͬͣ͆̆̈́̚͡͝Ē̵̳̞̝̙͕ͬ͒ͮ̀͑͊̎͑̔̀̕͜͞Ş̶̡̛̠̠͙̱̣̝͔̻̻̩̬ͮ͑̀̒͂̐̑̋̚͘ Commented May 18, 2015 at 0:00

6 Answers 6

19

Assuming you're very serious about this and want a technical solution you could do as follows:

  1. Split the incoming text into smaller units (words or sentences);
  2. Render each unit on the server with your font of choice (with a huge line height and lots of space below the baseline where the Zalgo "noise" would go);
  3. Train a machine learning algorithm to judge if it looks too "dark" and "busy";
  4. If the algorithm's confidence is low defer to human moderators.

This could be fun to implement but in practice it would likely be better to go to step four straight away.

Edit: Here's a more practical, if blunt, solution in Python 2.7. Unicode characters classified as "Mark, nonspacing" and "Mark, enclosing" appear to be the main tools used to create the Zalgo effect. Unlike the above idea this won't try to determine the "aesthetics" of the text but will instead simply remove all such characters. (Needless to say, this will trash text in many, many languages. Read on for a better solution.) To filter out more character categories add them to ZALGO_CHAR_CATEGORIES.

#!/usr/bin/env python
import unicodedata
import codecs

ZALGO_CHAR_CATEGORIES = ['Mn', 'Me']

with codecs.open("zalgo", 'r', 'utf-8') as infile:
    for line in infile:
        print ''.join([c for c in unicodedata.normalize('NFD', line) if unicodedata.category(c) not in ZALGO_CHAR_CATEGORIES]),

Example input:

1
H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡
2
H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡
3

Output:

1
How does Zalgo text work?
2
How does Zalgo text work?
3
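For readers on Python 3, the same filter can be sketched as a small function instead of a file loop. The name strip_marks is an assumption made for illustration and is not part of the original script; the category list is the one used above.

```python
import unicodedata

# Same categories as above: "Mark, nonspacing" and "Mark, enclosing".
ZALGO_CHAR_CATEGORIES = {"Mn", "Me"}


def strip_marks(text):
    """Return text with every Mn/Me character removed (hypothetical helper)."""
    # NFD splits precomposed letters such as "é" into a base letter plus a
    # combining mark, so ordinary accents are removed as well.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        c for c in decomposed if unicodedata.category(c) not in ZALGO_CHAR_CATEGORIES
    )


if __name__ == "__main__":
    print(strip_marks("Z\u0351\u0317a\u0335l\u0309g\u0310o\u0354"))  # Zalgo
    print(strip_marks("fianc\u00e9"))  # fiance
```

As with the Python 2.7 version, this is deliberately blunt: it also strips legitimate accents and the marks used in keycap emoji.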

Finally, if you're looking to detect, rather than unconditionally remove, Zalgo text, you could perform character frequency analysis. The program below does that for each line of the input file. The function is_zalgo calculates a "Zalgo score" for each word of the string it is given (the score is the number of potential Zalgo characters divided by the total number of characters in the word). It then checks whether the third quartile of the words' scores is greater than THRESHOLD. With THRESHOLD equal to 0.5, this roughly asks whether at least one word in four consists of more than 50% Zalgo characters. (The THRESHOLD of 0.5 was guessed and may require adjustment for real-world use.) This type of algorithm probably offers the best payoff for the coding effort.

#!/usr/bin/env python
from __future__ import division
import unicodedata
import codecs
import numpy

ZALGO_CHAR_CATEGORIES = ['Mn', 'Me']
THRESHOLD = 0.5
DEBUG = True

def is_zalgo(s):
    if len(s) == 0:
        return False
    word_scores = []
    for word in s.split():
        cats = [unicodedata.category(c) for c in word]
        score = sum([cats.count(banned) for banned in ZALGO_CHAR_CATEGORIES]) / len(word)
        word_scores.append(score)
    total_score = numpy.percentile(word_scores, 75)
    if DEBUG:
        print total_score
    return total_score > THRESHOLD

with codecs.open("zalgo", 'r', 'utf-8') as infile:
    for line in infile:
        print is_zalgo(unicodedata.normalize('NFD', line)), "\t", line

Sample output:

0.911483990148
True    Señor, could you or your fiancé explain, H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡

0.333333333333
False   Příliš žluťoučký kůň úpěl ďábelské ódy.  
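A minimal Python 3 sketch of the same detector, assuming you would rather avoid the numpy dependency. It swaps numpy.percentile for statistics.quantiles with method="inclusive", which should use the same linear interpolation as numpy's default, and it adds guards for empty and single-word input that the original does not have. The helper word_score is a name introduced here for illustration.

```python
import statistics
import unicodedata

ZALGO_CHAR_CATEGORIES = {"Mn", "Me"}
THRESHOLD = 0.5  # guessed, exactly as in the original program


def word_score(word):
    """Fraction of the characters in word that are potential Zalgo marks."""
    marks = sum(1 for c in word if unicodedata.category(c) in ZALGO_CHAR_CATEGORIES)
    return marks / len(word)


def is_zalgo(text, threshold=THRESHOLD):
    """Return True if the third quartile of per-word scores exceeds threshold."""
    words = unicodedata.normalize("NFD", text).split()
    if not words:
        return False
    scores = [word_score(w) for w in words]
    if len(scores) == 1:
        # statistics.quantiles needs at least two data points on older Pythons.
        third_quartile = scores[0]
    else:
        third_quartile = statistics.quantiles(scores, n=4, method="inclusive")[2]
    return third_quartile > threshold


if __name__ == "__main__":
    noisy = " ".join(ch + "\u0301" * 5 for ch in "abc")
    print(is_zalgo(noisy))  # True
    print(is_zalgo("P\u0159\u00edli\u0161 \u017elu\u0165ou\u010dk\u00fd k\u016f\u0148"))  # False
```

The caveat raised in the comments still applies: scoring per whitespace-separated word is a poor fit for scripts that do not separate words with spaces.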

5 Comments

Appreciate the elaborate solution, but I was looking for a simple character range regular expression, or a library like strip-combining marks.
I wasn't quite sure how serious you were about looking for a solution (i.e., if you wanted something that's fun to play with vs. something you could plug in a forum today). I implemented two more practical solutions in Python; it was a fun little bit of research to figure this stuff out. Since this question is on hold right now I can't add my code as a separate answer, so I added it here.
I have (professionally) come across international text VALIDLY containing characters belonging to the two character classes you are banning, and please be aware that a word in CJK easily consists of a SINGLE character (and also be aware that in several languages words may NOT be separated by non-word characters).
@WalterTross: "Banned" is a misnomer in the case of the second code snippet because it doesn't actually ban those marks. I'll change that.
@DanDascalescu Given that Regex is one of the ways in which Zalgo texts were generated, I would advise against trying so....stackoverflow.com/a/1732454/1808494
14

Make the box overflow:hidden. It doesn't actually disable Zalgo text, but it prevents it from damaging other comments.

.comment {
  /* the overflow: hidden is what prevents one comment's combining marks from affecting its siblings */
  overflow: hidden;
  /* the padding gives space for any legitimate combining marks */
  padding: 0.5em;
  /* the rest are just to visually divide the three comments */
  border: solid 1px #ccc;
  margin-top: -1px;
  margin-bottom: -1px;
}
<div class=comment>The below comment looks awful.</div>
<div class=comment>H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡</div>
<div class=comment>The above comment looks awful.</div>

2 Comments

Highly practical suggestion. Validation measures such as ''.join((c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn')) are resource intensive and the opposite of subtle.
I think you mean "awful".
6

A related question was asked before: https://stackoverflow.com/questions/5073191/how-is-zalgo-text-implemented but it's interesting to go into prevention here.

In terms of preventing this you can choose several strategies:

  1. prevent combining diacritics entirely (and piss off many international users),
  2. filter out combining characters using whitelisting or blacklisting (and piss off a smaller percentage of international users),
  3. prevent a certain number of combining characters (and piss off an even smaller percentage of users),
  4. have a healthy moderator community (with all the downsides that has, see your question as an example here)
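Strategy 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name limit_marks is made up here, the default of 8 is the figure quoted in the question, and a "combining character" is taken to mean any code point whose general category starts with "M".

```python
import unicodedata


def limit_marks(text, max_run=8):
    """Keep at most max_run consecutive combining marks (hypothetical helper)."""
    kept = []
    run = 0  # length of the current run of combining marks
    # Decompose first so that precomposed letters count their accents too.
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch).startswith("M"):
            run += 1
            if run > max_run:
                continue  # drop marks beyond the allowed run length
        else:
            run = 0
        kept.append(ch)
    # Recompose so that ordinary accented letters come back in NFC form.
    return unicodedata.normalize("NFC", "".join(kept))


if __name__ == "__main__":
    stacked = "e" + "\u0301" * 20
    print(len(stacked), len(unicodedata.normalize("NFD", limit_marks(stacked))))
```

Because marks are capped per run rather than stripped, text such as "fiancé" passes through unchanged while tall stacks are cut down to a bounded height.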

3 Comments

"with all the downsides that has, see your question as an example here" - priceless :)
The smallest unit of text that is usually zalgoed is a line. Rather than the absolute number of combining characters you could look at their density (percentage) in each line.
@nwk good trick, but I was thinking to disallow successive combining characters (meaning you can only reach a certain height/depth)
4

You can get rid of Zalgo text in your application using strip-combining-marks by Mathias Bynens.

The module strip-combining-marks is available for browsers (via Bower) and Node.js applications (via npm).

Here is an example of how to use it with npm:

var stripCombiningMarks = require("strip-combining-marks");
var zalgoText = 'U̼̥̻̮͍͖n͠i͏c̯̮o̬̝̠͉̤d͖͟e̫̟̗͟ͅ';
var strippedText = stripCombiningMarks(zalgoText); // "Unicode"

2 Comments

For anyone coming here via Google, be aware that strip-combining-marks will trash some valid emojis. It turns out the blue and white number emojis use combining marks... emojipedia.org/keycap-digit-one
This could also ruin other valid Unicode characters that use combining marks. Quoth the Unicode FAQ, "...unless a precomposed character is used, it is encoded as U+0301 COMBINING ACUTE ACCENT. Similarly, the U+0308 COMBINING DIAERESIS may be used for diaeresis, trema, umlaut, as well as other, possibly unrelated uses."
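The keycap caveat can be illustrated in Python (this is not the strip-combining-marks library itself, only a sketch of the general idea). It removes combining marks except for a small allowlist; the allowlist contents and the name strip_marks_except_allowed are assumptions chosen for this example.

```python
import unicodedata

# Hypothetical allowlist: the two marks that appear in keycap emoji sequences.
ALLOWED_MARKS = {
    "\ufe0f",  # VARIATION SELECTOR-16 (general category Mn)
    "\u20e3",  # COMBINING ENCLOSING KEYCAP (general category Me)
}


def strip_marks_except_allowed(text):
    """Drop every combining mark that is not explicitly allowlisted."""
    return "".join(
        c
        for c in text
        if c in ALLOWED_MARKS or not unicodedata.category(c).startswith("M")
    )


if __name__ == "__main__":
    keycap_one = "1\ufe0f\u20e3"
    print(strip_marks_except_allowed(keycap_one) == keycap_one)  # True
    print(strip_marks_except_allowed("U\u033c\u0325n\u0360i"))  # Uni
```

No normalization is applied here, so precomposed letters such as "é" survive. An allowlist alone does not stop someone from stacking the allowed marks, so it would likely need to be paired with a limit on run length.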
2

Using PHP and the mindset of a demolition worker, you can get rid of the Zalgo with the iconv function. Of course, that will also kill every other character that ISO-8859-1 cannot represent.

$unZalgoText = iconv("UTF-8", "ISO-8859-1//IGNORE", $zalgoText);
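To see how much collateral damage this approach causes, here is a rough Python sketch of the same idea: encode to ISO-8859-1 and silently drop whatever does not fit. It is meant as an analogy rather than a byte-for-byte equivalent of the PHP call, and the name to_latin1_only is made up for this example.

```python
def to_latin1_only(text):
    """Drop every character that ISO-8859-1 cannot encode (hypothetical helper)."""
    return text.encode("iso-8859-1", errors="ignore").decode("iso-8859-1")


if __name__ == "__main__":
    # Combining marks are outside ISO-8859-1, so the Zalgo noise disappears.
    print(to_latin1_only("Z\u0351a\u0335lgo"))  # Zalgo
    # Precomposed "é" (U+00E9) is part of ISO-8859-1 and survives.
    print(to_latin1_only("fianc\u00e9"))  # fiancé
    # Czech letters such as "ž", "ť" and "č" are not, so they vanish entirely.
    print(to_latin1_only("\u017elu\u0165ou\u010dk\u00fd"))  # luouký
```

Note that the base letter itself is lost whenever the precomposed character falls outside ISO-8859-1, which is usually worse than losing only the accent.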


1

Using RegExp to limit excessive combining marks

function removeExcessiveMarks(string) {
  // Keep the first two combining marks of each run and drop the rest.
  return string.replaceAll(/([\p{Mc}\p{Me}\p{Mn}]{2})[\p{Mc}\p{Me}\p{Mn}]+/gu, "$1");
}

Result

removeExcessiveMarks("Z͎̠͗ͣḁ̵͙̑l͖͙̫̲̉̃ͦ̾͊ͬ̀g͔̤̞͓̐̓̒̽o͓̳͇̔ͥ"); // Z͗ͣȃ̵l̉̃g̐̓o̔ͥ

Some unit tests

it("Should work with a zalgo text", () => {
  expect(removeExcessiveMarks("Z̸a̸͆l̸͆͐g̸͆͐̓o̸͆͐̓̈́")).toBe("Z̸a̸͆l̸͆g̸͆o̸͆");
});

it("Should work with arabic letter beh", () => {
  expect(removeExcessiveMarks("بٍٍّ")).toBe("بٍّ");
  expect(removeExcessiveMarks("ب\u0651\u0650\u0652\u0650")).toBe("ب\u0651\u0650");
});

it('Should work with "e" combined with 3 accents (total 3 accents)', () => {
  // Combining "e" with Grave Accent (U+0300), Acute Accent (U+0301) and Tilde (U+0303)
  expect(removeExcessiveMarks("e\u0300\u0301\u0303")).toBe("e\u0300\u0301");
});

it('Should work with "è" combined with 2 accents (total 3 accents)', () => {
  // Combining "è" with Acute Accent (U+0301) and Tilde (U+0303)
  expect(removeExcessiveMarks("è\u0301\u0303")).toBe("è\u0301\u0303");
});

Explanation

This regex handles combining characters defined by the "Mark" Unicode general category. These are characters that typically modify the preceding base character:

  • Mn (Non-Spacing): Usually used for accents and diacritics. Example: è (e + U+0300)
  • Mc (Spacing Combining): Usually used for marks that take up horizontal space, such as many Indic vowel signs. Example: Devanagari vowel sign AA का (क + U+093E)
  • Me (Enclosing): Marks that surround the base character. Example: Combining Enclosing Circle a⃝ (a + U+20DD)

In some languages and writing systems, multiple diacritics are combined to accurately represent sounds and pronunciations. Although the number of combining marks that can be stacked is virtually unlimited, more than 2 consecutive marks are rarely legitimate in real-world text. Feel free to raise the {2} quantifier in the RegExp if your users need more.

Note that emoji modifiers are kept as they belong to the Sk (Symbol, Modifier) category.
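If you want to check which general category a given code point falls into, Python's unicodedata module reports it directly. The sketch below is only a quick lookup aid; the sample code points are picked for illustration (U+093E is my own choice of a Spacing Combining mark), and the helper name describe is an assumption.

```python
import unicodedata


def describe(chars):
    """Return (code point, general category, name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in chars
    ]


if __name__ == "__main__":
    samples = [
        "\u0300",      # combining grave accent
        "\u093e",      # Devanagari vowel sign AA
        "\u20dd",      # combining enclosing circle
        "\U0001f3fb",  # emoji skin tone modifier
    ]
    for code_point, category, name in describe(samples):
        print(code_point, category, name)
```

On a reasonably recent Python this should report Mn, Mc, Me and Sk respectively, which is why the regex leaves emoji skin tone modifiers alone.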

Unicode normalization

For more consistent results for end users, I recommend applying the regex to the decomposed form (NFD) and then recomposing with NFC:

function removeExcessiveMarks(string) {
  return string
    .normalize("NFD") // Decompose
    .replaceAll(/([\p{Mc}\p{Me}\p{Mn}]{2})[\p{Mc}\p{Me}\p{Mn}]+/gu, "$1")
    .normalize("NFC"); // Recompose
}
it('Should work with "e" combined with 3 accents (total 3 accents)', () => {
  // Combining "e" with Grave Accent (U+0300), Acute Accent (U+0301) and Tilde (U+0303)
  expect(removeExcessiveMarks("e\u0300\u0301\u0303")).toBe("è\u0301");
});

it('Should work with "è" combined with 2 accents (total 3 accents)', () => {
  // Combining "è" with Acute Accent (U+0301) and Tilde (U+0303)
  expect(removeExcessiveMarks("è\u0301\u0303")).toBe("è\u0301");
});

