feat: add homoglyph obfuscation prompts to smuggling probe #1660

Open
dentity007 wants to merge 2 commits into NVIDIA:main from NathanMaine:feat/smuggling-homoglyph-obfuscation
@dentity007

Adds smuggling.HomoglyphObfuscation, a probe with 5 prompts that use Unicode homoglyphs (visually similar characters from different scripts) to disguise trigger words in bypass requests. For example, Cyrillic 'a' (U+0430) replaces Latin 'a' in "jailbreak", making the token sequence different while the text remains human-readable.

Second decomposed contribution from PR #1619. The smuggling module's docstring describes exactly this technique: "swapping letters out for unusual unicode representations of the same letters." Uses mitigation.MitigationBypass detector. Set to active = False since these are domain-specific.

Homoglyph scripts used: Cyrillic (U+0430, U+043E, U+0456), Latin alpha (U+0251), Turkish dotless i (U+0131)
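
As a quick illustration of why these substitutions matter, the sketch below prints the Unicode names of the code points listed above and shows that a disguised word compares unequal to the original even though it renders almost identically. The grouping of each code point under a Latin letter and the names `HOMOGLYPHS`, `plain`, and `disguised` are assumptions made for this example and are not taken from the PR's code.

```python
import unicodedata

# Code points named in the PR description, grouped by the Latin letter
# each one resembles. The grouping itself is an illustrative assumption.
HOMOGLYPHS = {
    "a": ["\u0430", "\u0251"],  # Cyrillic a, Latin alpha
    "o": ["\u043e"],            # Cyrillic o
    "i": ["\u0456", "\u0131"],  # Cyrillic/Ukrainian i, Turkish dotless i
}

for latin, lookalikes in HOMOGLYPHS.items():
    for ch in lookalikes:
        print(f"{latin!r} -> U+{ord(ch):04X} {unicodedata.name(ch)}")

plain = "jailbreak"
# Swap every Latin 'a' for Cyrillic 'а' (U+0430).
disguised = plain.replace("a", "\u0430")

print(plain == disguised)            # the strings differ at the code point level
print(len(plain), len(disguised))    # same number of characters
print(len(plain.encode("utf-8")), len(disguised.encode("utf-8")))  # different byte lengths
```

Because the underlying bytes differ, a tokenizer or a keyword filter matching on the ASCII spelling will generally see a different sequence than a human reader does.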

Files:

  • garak/probes/smuggling.py : new HomoglyphObfuscation class
  • garak/data/smuggling_homoglyph_5.txt : 5 prompts with embedded Unicode homoglyphs
  • tests/probes/test_probes_smuggling.py : 4 tests (count, uniqueness, non-ASCII verification, active=False)

Add smuggling.HomoglyphObfuscation with 5 prompts that use Unicode
homoglyphs (Cyrillic, Latin alpha, Turkish dotless i) to disguise
trigger words in bypass requests. Tests whether input-side content
filters catch visually identical character substitutions from different
scripts. Uses mitigation.MitigationBypass detector. Set to active=False
(domain-specific).

Signed-off-by: Nathan Maine <nathan@dentity.cloud>
Collaborator

@jmartin-tech jmartin-tech left a comment

This is a great added technique. I would suggest expanding it to perform inline substitution instead of just using a set of hardcoded sample prompts.

The idea I am suggesting would programmatically replace characters during prompt initialization to actually mimic the smuggling aspect of the technique. This could be further enhanced to accept a configuration map of character replacements that could be expanded or reduced to broaden resiliency testing.

Address review feedback on PR NVIDIA#1660:

- Change tier from COMPETE_WITH_SOTA to INFORMATIONAL
- Replace static prompt loading with programmatic substitution via
  homoglyph_replace() function applied to garak payloads
- Add configurable DEFAULT_HOMOGLYPH_MAP (20 Latin-to-Cyrillic/Turkish/
  Ukrainian mappings) overridable via homoglyph_map config parameter
- Load payloads from garak.payloads system (harmful_behaviors default)
- Keep static prompts as additional payloads through same pipeline
- Update tests: 9 tests covering substitution function, probe loading,
  tier, determinism, custom maps, non-ASCII verification

Signed-off-by: Nathan Maine <nathan@dentity.cloud>
@dentity007
Author

Thanks for the review. Both changes addressed:

  • Tier adjusted to INFORMATIONAL
  • Replaced the static prompt approach with programmatic substitution. The probe now loads payloads from garak's payload system (harmful_behaviors by default), applies character-by-character homoglyph replacement via a configurable DEFAULT_HOMOGLYPH_MAP (20 Latin-to-Cyrillic/Turkish/Ukrainian mappings), and generates obfuscated prompts at initialization. The map is overridable via the homoglyph_map config parameter so the substitution set can be expanded or reduced. The original 5 static prompts are still loaded as additional payloads and go through the same substitution pipeline.

Tests updated: 9 tests covering probe loading, substitution function behavior (determinism, custom maps, non-mapped character preservation), non-ASCII verification, tier, and inactive flag.
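
For readers who want a feel for the approach described above, here is a minimal sketch of character-by-character substitution with an overridable map. It is not the code from this PR: the signature of `homoglyph_replace`, the three-entry `EXAMPLE_HOMOGLYPH_MAP` (a stand-in for the 20-entry `DEFAULT_HOMOGLYPH_MAP`), and the sample payload string are all assumptions chosen for illustration, and no garak APIs are used.

```python
from typing import Mapping, Optional

# Hypothetical three-entry stand-in for the PR's DEFAULT_HOMOGLYPH_MAP.
# Only code points mentioned in the PR description are used here.
EXAMPLE_HOMOGLYPH_MAP = {
    "a": "\u0430",  # Cyrillic a
    "o": "\u043e",  # Cyrillic o
    "i": "\u0456",  # Cyrillic/Ukrainian i
}


def homoglyph_replace(text: str, mapping: Optional[Mapping[str, str]] = None) -> str:
    """Replace each mapped character; leave unmapped characters untouched.

    The signature is assumed for this sketch, not copied from the PR.
    """
    table = EXAMPLE_HOMOGLYPH_MAP if mapping is None else mapping
    return "".join(table.get(ch, ch) for ch in text)


# Placeholder payload; the PR draws real payloads from garak's payload system.
payload = "ignore prior instructions"
obfuscated = homoglyph_replace(payload)

# Checks in the spirit of the tests listed above.
assert obfuscated == homoglyph_replace(payload)   # deterministic
assert len(obfuscated) == len(payload)            # one-for-one substitution
assert not obfuscated.isascii()                   # non-ASCII verification
assert homoglyph_replace("xyz") == "xyz"          # unmapped characters preserved

# A custom map narrows the substitution set to a single character.
dotless_only = homoglyph_replace(payload, {"i": "\u0131"})
print(obfuscated)
print(dotless_only)
```

Passing the map as a parameter rather than hardcoding it is what lets the substitution set be expanded or reduced without touching the replacement logic.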
