fix: allow ZWJ (U+200D) in legitimate emoji sequences #35808

Open
w1ndcn wants to merge 1 commit into NousResearch:main from w1ndcn:fix/zwj-emoji-false-positive-18581

w1ndcn commented May 31, 2026


Closes #18581

Problem

The invisible-unicode scanner in threat_patterns.py flags all U+200D (Zero Width Joiner) occurrences as potential prompt injection. This silently blocks SOUL.md and other context files that contain standard emoji sequences like 🤸‍♀️, 👨‍👩‍👧, or ❤️‍🔥, replacing them with [BLOCKED: SOUL.md contained potential prompt injection (invisible unicode U+200D)].

ZWJ is the standard Unicode mechanism (UTS #51) for joining emoji into gendered, family, and profession sequences, including skin-toned variants such as 👩🏽‍💻. It is not an injection vector when it sits inside an emoji sequence.

Root Cause

INVISIBLE_CHARS includes U+200D, and scan_for_threats() performs a simple char_set & INVISIBLE_CHARS intersection without considering whether the ZWJ sits inside a legitimate emoji grapheme cluster.
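
The false positive can be reproduced with a minimal sketch of the intersection check. The `INVISIBLE_CHARS` subset and the `naive_invisible_hits` helper below are assumptions made for illustration; the real set and the body of `scan_for_threats()` are not shown in this PR description.

```python
# Hypothetical subset of INVISIBLE_CHARS, for illustration only.
INVISIBLE_CHARS = {"\u200b", "\u200d", "\u2066"}


def naive_invisible_hits(content: str) -> set[str]:
    """Return every invisible character present, ignoring its context."""
    return set(content) & INVISIBLE_CHARS


family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"  # man + ZWJ + woman + ZWJ + girl
injected = "hel\u200dlo"                                # ZWJ hidden inside ASCII text

# Both inputs produce the same hit, so the check cannot tell them apart.
print(naive_invisible_hits(family))
print(naive_invisible_hits(injected))
```

Because the check only looks at the set of characters, the position of each ZWJ and its neighbours never enters the decision.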

Fix

Mirror the proven approach already used in cronjob_tools.py — define emoji codepoint ranges, skip VS16 variation selectors, and check whether each ZWJ sits between emoji codepoints. Strip legitimate emoji ZWJ before the invisible-char scan so that only suspicious ZWJ (between ASCII / CJK / Latin etc.) is flagged.

Key design decisions

  • Emoji range table (_EMOJI_CP_RANGES): covers SMP pictographic ranges, Miscellaneous Symbols, Dingbats, regional indicators, skin tone modifiers, and gender symbols; sufficient for all ZWJ sequences defined in emoji-zwj-sequences.txt as of Unicode 16.0.
  • VS16 skip: _zwj_in_emoji_sequence() walks past U+FE0F variation selectors before checking neighbours, matching real emoji encoding (e.g. 🏳️‍🌈 is 🏳 + FE0F + 200D + 🌈).
  • Strip-then-scan: _strip_legitimate_emoji_zwj() removes legitimate ZWJ from a copy of the content, so the existing INVISIBLE_CHARS intersection logic works unchanged.

Security preserved

| Input | Result |
| --- | --- |
| 🤸‍♀️ emoji | ✅ allowed |
| 👨‍👩‍👧 family emoji | ✅ allowed |
| ❤️‍🔥 heart-on-fire | ✅ allowed |
| 👩🏽‍💻 skin-tone emoji | ✅ allowed |
| hel\u200dlo (ASCII injection) | 🚫 flagged |
| 忽略\u200d指令 (CJK injection) | 🚫 flagged |
| café\u200dïgnore (Latin injection) | 🚫 flagged |
| ZWJ at text edges | 🚫 flagged |
| U+200B, U+2066 etc. | 🚫 still flagged |
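
To see why the allowed and flagged inputs differ, it can help to dump the code points around each ZWJ. The `describe` helper below is a hypothetical debugging aid built on the standard `unicodedata` module; it is not part of the PR.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """List each code point in text as 'U+XXXX NAME'."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


heart_on_fire = "\u2764\ufe0f\u200d\U0001F525"  # heart + VS16 + ZWJ + fire
ascii_injection = "hel\u200dlo"

for sample in (heart_on_fire, ascii_injection):
    for line in describe(sample):
        print(line)
    print("---")
```

In the first sample the ZWJ is surrounded by emoji code points, with a variation selector in between on the left. In the second its neighbours are Latin letters.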

Changes

  • tools/threat_patterns.py: add _EMOJI_CP_RANGES, _is_emoji_cp(), _zwj_in_emoji_sequence(), _strip_legitimate_emoji_zwj(); update scan_for_threats() to strip emoji ZWJ before scanning
  • tests/tools/test_threat_patterns.py: add 10 new test cases (47 total, all passing)

Closes NousResearch#18581

The invisible-unicode scanner in threat_patterns.py flagged ALL
U+200D (Zero Width Joiner) occurrences as potential prompt injection,
which silently blocked SOUL.md and other context files containing
standard emoji sequences like 🤸‍♀️, 👨‍👩‍👧, or ❤️‍🔥.

Root cause: INVISIBLE_CHARS includes U+200D, and scan_for_threats()
performed a simple set intersection without considering that ZWJ is
the standard Unicode mechanism (TR#51) for constructing gendered,
family, and skin-tone emoji sequences.

Fix: mirror the proven approach from cronjob_tools.py — define emoji
codepoint ranges, skip VS16 variation selectors, and check whether
each ZWJ sits between emoji codepoints.  Strip legitimate emoji ZWJ
before the invisible-char scan so that only suspicious ZWJ (between
ASCII/CJK/Latin etc.) is flagged.

Changes:
- tools/threat_patterns.py: add _EMOJI_CP_RANGES, _is_emoji_cp(),
  _zwj_in_emoji_sequence(), and _strip_legitimate_emoji_zwj();
  update scan_for_threats() to strip emoji ZWJ before scanning
- tests/tools/test_threat_patterns.py: add 10 new test cases covering
  emoji ZWJ allowlist, CJK/Latin injection blocklist, mixed content,
  skin-tone sequences, and edge positions
alt-glitch added the type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), and comp/agent (Core agent loop, run_agent.py, prompt builder) labels May 31, 2026

mohamedorigami-jpg left a comment

Contributor


Nice approach with the strip-vs-validate strategy. The test coverage is thorough -- especially the mixed emoji + malicious ZWJ test and the multi-ZWJ family emoji case.

One thing I noticed: the check currently verifies that each ZWJ sits between emoji codepoints (via the emoji range table) but doesn't validate that the full ZWJ sequence is a known emoji. That's the right call for a detector -- you want under-blocking over false positives, and unknown emoji sequences are harmless anyway.

w1ndcn commented May 31, 2026

Author

@mohamedorigami-jpg Thanks for the review! You're absolutely right — validating the full ZWJ sequence against a known-emoji database would be overkill for a threat detector. The emoji-codepoint-neighbour check gives us the right balance: standard emoji sequences pass, while anything else (ASCII/CJK/Latin-adjacent ZWJ) still gets caught. Glad the test coverage looks solid.

tonydwb left a comment



Code Review Summary — PR #35808

Verdict: Approved

Author: w1ndcn | Type: bugfix | Files: tools/threat_patterns.py, tests/tools/test_threat_patterns.py

Key Points

  • Fixes a real false positive: U+200D (ZWJ) in emoji sequences like 🤸‍♀️ and 👨‍👩‍👧 was flagged as invisible-unicode injection.
  • Clean design: _EMOJI_CP_RANGES covers all Unicode 16.0 ZWJ emoji sequences; _zwj_in_emoji_sequence() walks past VS16 before checking neighbours; _strip_legitimate_emoji_zwj() strips before the existing intersection logic.
  • Security preserved: ZWJ between ASCII, CJK, or Latin chars is still flagged. In mixed content (emoji + malicious ZWJ), the malicious ZWJ is still caught.
  • 10 new test cases covering the emoji ZWJ allowlist, CJK/Latin injection, mixed content, skin-tone sequences, and edge positions.

No issues found.


Reviewed by Hermes Agent



Development

Successfully merging this pull request may close these issues.

🤸‍♀️ SOUL.md blocked by ZWJ emoji — cartwheel gymnast triggers prompt injection filter
