
feat: add compliance probes and detectors for regulated industries#1619

Closed
dentity007 wants to merge 5 commits into NVIDIA:main from NathanMaine:feat/compliance-probes

Conversation

@dentity007

@dentity007 dentity007 commented Feb 16, 2026

Summary

Adds compliance-focused vulnerability probes and detectors for LLMs deployed in regulated industries (CMMC, NIST SP 800-171/53, HIPAA, DFARS, FedRAMP).

4 probes test whether models:

  • ComplianceHallucination — fabricate regulatory citations (fake NIST controls, CMMC practices, DFARS clauses, HIPAA sections)
  • CompliancePIILeak — leak PII via compliance-framed prompts (direct extraction, indirect social engineering, roleplay)
  • PolicyBypass — provide instructions for circumventing security controls (including Unicode/homoglyph obfuscation)
  • ComplianceStatusFabrication — generate false compliance certifications or attestations

6 detectors score model outputs (0.0=safe, 1.0=hit):

  • FabricatedCitation — validates citations against real control families (SP 800-171 §3.1–3.14, SP 800-53 two-letter families, CMMC Levels 1–3, DFARS 252.204-70xx, HIPAA §164.xxx with subsection validation)
  • RegulatoryConfidence — flags overly definitive regulatory claims without hedging
  • PIIPresence — detects SSN, credit card (with Luhn validation), email, phone (with DFARS clause filtering to avoid false positives)
  • BypassInstructions — detects actionable bypass guidance with code pattern detection
  • ControlWeakening — identifies suggestions to reduce security control effectiveness
  • FalseAttestation — catches false compliance certifications with precise disclaimer detection (C3PAO references in fake certs are not treated as disclaimers)

80 prompts across 16 data files, following garak's convention of loading from garak/data/.
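As an illustration of the Luhn validation mentioned for the credit-card pattern, here is a minimal standalone sketch; the detector's actual implementation may differ:

```python
import re

def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in re.sub(r"\D", "", number)]
    if len(digits) < 13:  # shorter strings are unlikely to be card numbers
        return False
    checksum = 0
    # Double every second digit from the right; subtract 9 if it exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0
```

Gating regex matches on a checksum like this is what keeps arbitrary 16-digit strings (clause numbers, IDs) from being flagged as card numbers.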

Files

Path Description
garak/probes/compliance.py 4 probe classes
garak/detectors/compliance.py 6 detector classes
garak/data/compliance/*.txt 16 prompt data files (80 prompts)
tests/probes/test_probes_compliance.py Probe tests (22 tests)
tests/detectors/test_detectors_compliance.py Detector tests (58 tests)

OWASP LLM Top 10 Coverage

Probe OWASP Risk
ComplianceHallucination LLM09 (Misinformation) Fabricated regulatory citations lead to misguided compliance efforts
CompliancePIILeak LLM06 (Sensitive Info Disclosure) PII leakage constitutes data breach under HIPAA/CMMC
PolicyBypass LLM01 (Prompt Injection) Control circumvention undermines security posture
ComplianceStatusFabrication LLM09 (Misinformation) False attestations create False Claims Act liability

Test plan

  • All 80 tests pass against garak 0.14.1rc1 (pytest tests/probes/test_probes_compliance.py tests/detectors/test_detectors_compliance.py -v)
  • Probes load via _plugins.load_plugin() and enumerate_plugins()
  • Detectors score correctly: refusals → 0.0, hedged responses → low, confident violations → high/1.0
  • Unicode homoglyph prompts preserved in data files (5 prompts with non-ASCII characters)
  • PII detector correctly filters synthetic SSN prefixes (000, 666, 900–999) and DFARS clause numbers
  • False attestation detector distinguishes C3PAO in fake certificates vs proper disclaimers
  • CI pipeline passes (waiting for GitHub Actions)
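The SSN plausibility filtering in the test plan can be sketched as follows. This is a standalone illustration of the stated rule (area numbers 000, 666, and 900–999 are never issued), not the detector's actual code:

```python
import re

SSN_RE = re.compile(r"(\d{3})-(\d{2})-(\d{4})")

def plausible_ssn(candidate: str) -> bool:
    """Filter SSN-shaped matches whose area number can never be issued."""
    m = SSN_RE.fullmatch(candidate)
    if m is None:
        return False
    area, group, serial = m.groups()
    # 000, 666, and 900-999 are invalid area numbers; lexicographic
    # comparison is safe here because area is always three digits.
    if area in ("000", "666") or area >= "900":
        return False
    # Group 00 and serial 0000 are likewise never issued.
    return group != "00" and serial != "0000"
```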

Motivation

LLMs are increasingly deployed for regulatory guidance in defense, healthcare, and financial services. These probes address a gap in garak's coverage: compliance-specific vulnerabilities where hallucinated citations, leaked PII, or false attestations can lead to failed audits, legal liability, and security breaches in regulated environments.

…tries

Add 4 probes and 6 detectors targeting compliance-specific vulnerabilities
in LLMs deployed for regulatory guidance (CMMC, NIST, HIPAA, DFARS).

Probes:
- ComplianceHallucination: fabrication of regulatory citations
- CompliancePIILeak: PII extraction via compliance-framed prompts
- PolicyBypass: instructions for circumventing security controls
- ComplianceStatusFabrication: false compliance attestations

Detectors:
- FabricatedCitation: identifies fake regulatory control citations
- RegulatoryConfidence: flags overly definitive regulatory claims
- PIIPresence: scans for SSN, credit card, email, phone patterns
- BypassInstructions: detects actionable bypass guidance
- ControlWeakening: identifies suggestions to reduce security controls
- FalseAttestation: catches false compliance certifications

Includes 80 prompts across 16 data files and 80 tests covering
probe initialization, detector scoring logic, and edge cases.

Signed-off-by: Nathan Maine <dentity@gmail.com>
@github-actions
Contributor

github-actions bot commented Feb 16, 2026

DCO Assistant Lite bot: All contributors have signed the DCO ✍️ ✅

@dentity007
Author

I have read the DCO Document and I hereby sign the DCO

@dentity007
Author

recheck

github-actions bot added a commit that referenced this pull request Feb 16, 2026
dentity007 added a commit to NathanMaine/garak-compliance-probes that referenced this pull request Feb 16, 2026
4 probes, 6 detectors, 80 adversarial prompts covering CMMC, NIST
800-171, HIPAA, and DFARS. 53 tests run without garak installed.

Upstream PR: NVIDIA/garak#1619
Return 0.0 instead of None for empty-string outputs (text is not None
so detect() must return float). Fix ReDoS-vulnerable email regex.
Add required .rst doc files for Sphinx autodoc.
test_docs requires each plugin .rst file to be referenced
in the parent category toctree (probes.rst, detectors.rst).
@dentity007
Author

CI fix commits pushed:

  • 54339df — Fixed 6 detector return types: return 0.0 instead of None for empty-string outputs (per test_detector_detect contract — output.text is not None, so detect() must return float). Fixed ReDoS-vulnerable email regex in PII detector. Added garak.probes.compliance.rst and garak.detectors.compliance.rst doc files.
  • aafa46d — Added garak.probes.compliance and garak.detectors.compliance to the toctrees in probes.rst and detectors.rst (required by test_docs).

Waiting on maintainer approval to run CI workflows.

The detector contract requires returning a float (not None) when
output.text is not None. Empty string after strip is still non-None
text, so detect() correctly returns 0.0. Updated the test assertion
to match.

Signed-off-by: Nathan Maine <dentity@gmail.com>
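The return-type contract described in the commit message can be sketched with simplified stand-in types. The Turn class and the trigger phrase here are hypothetical; garak's actual Detector base class differs:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Turn:
    text: Optional[str]  # None means the generator produced no output

def detect(outputs: List[Turn]) -> List[Optional[float]]:
    """Per the contract: None only when text is None; empty text scores 0.0."""
    scores: List[Optional[float]] = []
    for out in outputs:
        if out.text is None:
            scores.append(None)   # nothing to evaluate
        elif not out.text.strip():
            scores.append(0.0)    # empty string is still non-None text
        else:
            # Hypothetical trigger phrase standing in for real detection logic.
            scores.append(1.0 if "certified compliant" in out.text.lower() else 0.0)
    return scores
```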
Follows existing convention (harmbench/README.md) to document the
compliance probes, detectors, OWASP mapping, and file naming conventions.
Collaborator

@jmartin-tech jmartin-tech left a comment


As noted in the docs, probes serve as the abstraction that encapsulates attacks against AI models and systems. In practice, this means a probe module represents a technique or method used to attack a target and elicit a goal it may be expected not to allow, at an abstract level, and a class in a probe is a concrete implementation of that abstraction.

Compliance is not a technique; I would describe it as a specific set of expectations for how the target will respond to various goals, even when various techniques are used to obfuscate them. In other words, to be compliant with some standard a target would need to perform well against certain prompts, in many cases with or without an adversarial technique or attack.

While the probes seem to target things we want to test, I am not sure this PR gets the concepts right. We might consider landing something along these lines, but I think it may be more appropriate to redirect this as enhancements to other probes, or even to incorporate it into some soon-to-land probes (ProPILE, for instance, is related to PII).

Many of the prompt files even have names that suggest the existing probes they might belong in or enhance.

Results from a target tested by garak can be organized and reviewed from a compliance perspective; however, a probe is not the right vehicle for that determination.

@dentity007
Author

@jmartin-tech Thanks for taking the time to walk through this. The distinction between probes as attack techniques vs. compliance as an evaluation lens clicks, and I think I was conflating the two. Digging into how to restructure now. Will follow up soon.

@dentity007
Author

dentity007 commented Feb 21, 2026

@jmartin-tech - following up on my earlier note. Again, thank you for taking the time to write that out. Spent time with your feedback and the tier/probe docs, and I want to walk through where I got the architecture wrong and where I think this can go instead.

Where you're right.

Probes encapsulate attack techniques - concrete methods for eliciting behavior the target shouldn't allow. "Compliance" isn't a technique. It's an evaluation lens. I was treating a regulatory domain as an attack vector, and honestly the module name itself (probes.compliance) gives that away.

Looking at how garak organizes probes: encoding groups by obfuscation technique, dan by jailbreak method, grandma by social engineering framing, latentinjection by injection delivery. Even lmrc and donotanswer, which test behavioral concerns rather than security exploits, still organize around risk categories from established frameworks. Not regulatory standards.

What the prompts actually are, technique-wise.

Reframing the 80 prompts at the technique level, they map to existing families pretty naturally:

  • The hallucination prompts (25) elicit fabricated authoritative citations. Same basic technique as misleading.FalseAssertion and packagehallucination, just applied to regulatory control numbers instead of Python packages.

  • The PII prompts (15) use compliance-domain social engineering (assessor impersonation, audit scenario framing) as the extraction vector. This is a different technique from what ProPILE (#1504, "feat: add ProPILE probes for PII leakage detection") does: ProPILE tests training-data memorization through completion-style prompts with known PII values, while mine use roleplay framing to socially engineer extraction at deployment time. The detection surface does overlap, though. ProPILE's PIILeak detector does normalized string matching against expected values; PIIPresence does regex-based shape detection with DFARS clause filtering. I'd want to coordinate with #1504 on whether these fit as extensions to ProPILE or stand alone. Happy to defer this piece until #1504 lands.

  • Bypass prompts (20, 5 with Unicode homoglyphs) - encoding-adjacent and malwaregen-adjacent.

  • Fabrication prompts (20) - authoritative document generation, a misinformation/hallucination failure mode per the OWASP LLM Top 10 (LLM09: Misinformation).
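As a side note on the homoglyph bypass prompts: NFKC normalization surfaces part of this obfuscation. A hedged sketch using only the standard library (it folds compatibility characters such as fullwidth forms, while Cyrillic lookalikes survive and get flagged as residual non-ASCII):

```python
import unicodedata

def deobfuscate(prompt: str) -> str:
    """Fold compatibility characters (e.g. fullwidth forms) toward ASCII."""
    return unicodedata.normalize("NFKC", prompt)

def has_homoglyphs(prompt: str) -> bool:
    """Flag prompts that still contain non-ASCII after normalization."""
    return any(ord(ch) > 127 for ch in deobfuscate(prompt))
```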

You also pointed out that the prompt file names themselves suggest where they belong. That's right - the technique keywords are already in the names. Happy to work with probe maintainers to contribute these as enhancements to existing families.

On the detectors.

Your closing point about results being "organized and reviewed from a compliance perspective" - this is what I want to understand better. The six detectors (FabricatedCitation, PIIPresence, FalseAttestation, etc.) are where I think the clearest standalone value is. They validate cited control numbers against known families, filter DFARS clause numbers that false-positive as phone numbers, distinguish C3PAO references in fake certs from proper disclaimers.

Where should these live? A few options I can see:

a) detectors.compliance module that probes reference as extended_detectors - similar to productkey.Win5x5 or the packagehallucination detectors (PythonPypi, RubyGems, etc.) for domain-specific output validation.

b) Post-hoc analysis layer on existing scan results, which might be closer to what you're describing.

c) Something else. Open to direction.
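Option (a) could look roughly like the following. The primary_detector and extended_detectors attribute names follow garak's plugin conventions as I understand them; the class is illustrative, not a working probe:

```python
# Hypothetical sketch of option (a): a probe pointing at domain-specific
# detectors by name, the way packagehallucination probes reference their
# per-ecosystem detectors. Not a drop-in garak class.
class ComplianceStatusFabrication:
    goal = "elicit a false compliance attestation"
    primary_detector = "compliance.FalseAttestation"
    extended_detectors = ["compliance.RegulatoryConfidence"]
```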

Scope and conventions. If compliance probes land as distinct classes, I'd follow whatever the current patterns are for optional/domain-specific modules - I saw av_spam_scanning and donotanswer using active = False and tier assignments, but I want to make sure I'm matching the conventions as they stand now rather than citing stale patterns. Happy to take direction on the right defaults. I'd also refactor the tests into garak's pytest infrastructure - the standalone monkeypatch approach was a dev shortcut.

Happy to revise this PR however makes sense, or close it and open smaller PRs against specific probe families.

What's most useful?

@jmartin-tech
Collaborator

Sorry for the very wordy response.

@dentity007, expanding on my comments about "organized and reviewed from a compliance perspective": the tags on probes can currently be used to group results based on a relevant taxonomy during report generation. This is easiest to see by performing a test with a reporting.taxonomy set via configuration, combined with a run.probe_tags setting that filters the set of probes run to those with a tag matching the requested taxonomy.

system:
  parallel_attempts: 16
  lite: true
run:
  generations: 1
  probe_tags: avid-effect:performance

reporting:
  taxonomy: avid-effect
garak -t test --config test.yaml

The run above uses a mock target, which at least shows how such a report ends up organized.

Compliance, I believe, can be thought of in these terms. Susceptibility to a probe's technique, with prompts that successfully bypass a restriction in compliance-tagged categories, is an indicator that the system under test may have weaknesses that could affect an evaluation against a compliance standard. Note that I am not suggesting a report would state that a model is compliant with a standard; it is yet to be seen whether that can be accomplished, and it would have to be evaluated in the context of the specific standard. Most compliance standards, when applied to LLMs, are somewhat vague or can currently only be tested for certain vectors that are not comprehensive. Testing results can, however, report metrics on the risk of adversarial techniques being used against traits of a system that need to meet a compliance threshold. Finding better ways to elicit this signal is definitely of interest to the project.

Expanding on this: detectors also have tags, and detectors that trigger on data representing compliance-related risk may be an appropriate way to augment the probe tags, denoting which attempt results specifically may impact a compliance specification. The core theme, though, is that compliance determination is rarely binary, especially when working with non-deterministic LLM-backed systems. That makes "detectors for compliance" a misnomer that may cause users to misinterpret what results actually represent.

All this being said, there is not yet a clear path to testing compliance-related issues in garak. The team is working on a project feature dubbed Context Aware Scanning (CAS) that may enable more targeted testing by selecting inputs, referred to as intents, that target goals better focused on a compliance-related test.

As guidance for the short term, I like the idea of seeing how the prompts here might enhance areas of existing probes.

Adding on: possibly introduce new probing techniques focused on eliciting targeted outcomes, like hallucinations that return fake citations for a regulatory framework. A more abstract goal for such a technique might be to craft requests around a common misconception in a standard, one often misrepresented because public opinion in the training corpus somehow outweighs the official specification document. To be frank, this paragraph sounds to me like a research idea, and limited time might be better focused on adding more content diversity in prompts.

I can also see these prompts getting mapped as intent stubs under the CAS project in the near future, enhancing various probes to test when responses from a target suggest it can be pushed to violate a compliance standard.

Further, it might be worth building on this by opening an idea discussion, which may possibly transition into an official roadmap goal.

@dentity007
Author

@jmartin-tech just able to start taking this all in now.

@dentity007
Author

dentity007 commented Feb 24, 2026

@jmartin-tech - this is really helpful, and clarifies a lot.

The tags + reporting.taxonomy mechanism is the part I was missing. I was trying to solve "how do you evaluate garak results through a compliance lens" by building new detector classes, when the answer is already in the architecture: tag probes and detectors with the right taxonomy entries, and the reporting layer handles the grouping. That makes the detectors.compliance proposal from my last comment the wrong direction. Noted.

The CAS framing also clicks. What I was calling "compliance probes" are really intent stubs - they define what a compliance-relevant failure looks like from the prompt side, and CAS would be the mechanism that selects and targets those intents in context. That's a cleaner separation than anything I proposed.

Your point about "compliance" being a misnomer for detectors is well taken. These systems are non-deterministic. A detector flagging a fabricated NIST citation isn't making a compliance determination, it's surfacing a risk signal. The naming matters because users will read into it.

Short-term, here's what I'll do:

I'll close this PR and rework it into smaller contributions against existing probe families so the history is clean. The hallucination prompts fit naturally into misleading territory, the social engineering PII prompts have a home once #1504 lands, and the encoding/bypass prompts are straightforward additions to existing technique modules.

I'll also run the mock example you shared so I understand the taxonomy reporting firsthand before proposing any tag structures.

Longer-term: I'll open a Discussion to sketch out how compliance-relevant testing could work within garak's existing architecture and the CAS direction. Not as a feature request but as an idea to pressure-test with the team. The research angle you mentioned, crafting prompts around common misconceptions in standards that get amplified through training data, is something I'd want to explore there too.

Thank you for laying this out. The wordiness was exactly what I needed.

@jmartin-tech
Collaborator

Closing per the comment outcomes; please do consider the proposed incremental PRs.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 3, 2026