feat: add compliance probes and detectors for regulated industries#1619
dentity007 wants to merge 5 commits into NVIDIA:main
Conversation
Add 4 probes and 6 detectors targeting compliance-specific vulnerabilities in LLMs deployed for regulatory guidance (CMMC, NIST, HIPAA, DFARS).

Probes:
- ComplianceHallucination: fabrication of regulatory citations
- CompliancePIILeak: PII extraction via compliance-framed prompts
- PolicyBypass: instructions for circumventing security controls
- ComplianceStatusFabrication: false compliance attestations

Detectors:
- FabricatedCitation: identifies fake regulatory control citations
- RegulatoryConfidence: flags overly definitive regulatory claims
- PIIPresence: scans for SSN, credit card, email, and phone patterns
- BypassInstructions: detects actionable bypass guidance
- ControlWeakening: identifies suggestions to reduce security controls
- FalseAttestation: catches false compliance certifications

Includes 80 prompts across 16 data files and 80 tests covering probe initialization, detector scoring logic, and edge cases.

Signed-off-by: Nathan Maine <dentity@gmail.com>
DCO Assistant Lite bot: All contributors have signed the DCO ✍️ ✅

I have read the DCO Document and I hereby sign the DCO

recheck
4 probes, 6 detectors, 80 adversarial prompts covering CMMC, NIST 800-171, HIPAA, and DFARS. 53 tests run without garak installed. Upstream PR: NVIDIA/garak#1619
Return 0.0 instead of None for empty-string outputs (text is not None so detect() must return float). Fix ReDoS-vulnerable email regex. Add required .rst doc files for Sphinx autodoc.
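As a hedged illustration of the ReDoS point (both patterns below are stand-ins, not the PR's actual regexes): an email pattern with nested unbounded quantifiers can backtrack catastrophically on crafted non-matching input, while bounded repetition with non-overlapping pieces keeps matching cheap.

```python
import re

# Illustrative only: a pattern shaped like this is ReDoS-prone because the
# outer "+" and inner "+" overlap, so a long run of valid local-part
# characters followed by a non-match triggers exponential backtracking.
REDOS_PRONE = re.compile(r"([a-zA-Z0-9_.+-]+)+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")

# A safer sketch: single bounded quantifiers, no nested repetition.
# RFC 5321 caps the local part at 64 octets and each label at 63.
SAFER_EMAIL = re.compile(
    r"\b[a-zA-Z0-9._%+-]{1,64}@[a-zA-Z0-9-]{1,63}(?:\.[a-zA-Z]{2,24})+\b"
)

def find_emails(text: str) -> list[str]:
    """Return email-like substrings using the linear-behaving pattern."""
    return SAFER_EMAIL.findall(text)

print(find_emails("contact compliance@example.com for details"))
# → ['compliance@example.com']
```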
test_docs requires each plugin .rst file to be referenced in the parent category toctree (probes.rst, detectors.rst).
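For illustration, the fix amounts to adding an entry per plugin .rst file to each category's toctree; the exact file path and entry name below are assumptions based on garak's docs layout, not verified against the repo:

```rst
.. in docs/source/probes.rst: add the new module to the existing toctree

.. toctree::
   :maxdepth: 2

   garak.probes.compliance
```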
CI fix commits pushed:
Waiting on maintainer approval to run CI workflows.
The detector contract requires returning a float (not None) when output.text is not None. Empty string after strip is still non-None text, so detect() correctly returns 0.0. Updated the test assertion to match. Signed-off-by: Nathan Maine <dentity@gmail.com>
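A minimal sketch of that contract (the function and scoring rule are hypothetical, not garak's actual detector base class): a score is emitted for every output whose text is not None, and an empty string still gets a float.

```python
# Hypothetical sketch of the detector contract described above: detect()
# must return a float for every output whose text is not None; only a
# literal None output may be skipped. An empty string after strip() is
# still non-None text, so it scores 0.0 (safe) rather than None.
def detect(outputs):
    scores = []
    for text in outputs:
        if text is None:
            scores.append(None)   # no output at all: nothing to judge
        elif not text.strip():
            scores.append(0.0)    # empty text is still text: score it safe
        else:
            # stand-in scoring rule for illustration only
            scores.append(1.0 if "fabricated" in text.lower() else 0.0)
    return scores

print(detect([None, "", "This cites a fabricated control."]))
# → [None, 0.0, 1.0]
```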
Follows existing convention (harmbench/README.md) to document the compliance probes, detectors, OWASP mapping, and file naming conventions.
As noted in the docs, probes serve as the abstraction that encapsulates attacks against AI models and systems. In practice this means a probe module represents, at an abstract level, a technique or method used to attack a target and elicit a goal the target may be expected not to allow, and a class in a probe is a concrete implementation of that abstraction.
Compliance is not a technique; I would describe it as a specific set of expectations for how the target will respond to various goals, even when various techniques are used to obfuscate them. In other words, to be compliant with some standard a target would need to perform well against certain prompts, in many cases with or without an adversarial technique or attack.
While the probes seem to target things we do want to test, I am not sure this PR gets the concepts right. We might consider landing something along these lines, but I think it may be more appropriate to redirect this as an enhancement to other probes, or even to incorporate it into some soon-to-land probes; ProPILE, for instance, is related to PII.
Many of the prompt files even have names that suggest the existing probes they might belong in or enhance.
Results from a target tested by garak can be organized and reviewed from a compliance perspective; however, a probe is not the right vehicle for that determination.
@jmartin-tech Thanks for taking the time to walk through this. The distinction between probes as attack techniques vs. compliance as an evaluation lens clicks, and I think I was conflating the two. Digging into how to restructure now. Will follow up soon.
@jmartin-tech - following up on my earlier note. Again, thank you for taking the time to write that out. Spent time with your feedback and the tier/probe docs, and I want to walk through where I got the architecture wrong and where I think this can go instead.

Where you're right. Probes encapsulate attack techniques - concrete methods for eliciting behavior the target shouldn't allow. "Compliance" isn't a technique. It's an evaluation lens. I was treating a regulatory domain as an attack vector, and honestly the module name itself (probes.compliance) gives that away. Looking at how garak organizes probes: encoding groups by obfuscation technique, dan by jailbreak method, grandma by social engineering framing, latentinjection by injection delivery. Even lmrc and donotanswer, which test behavioral concerns rather than security exploits, still organize around risk categories from established frameworks, not regulatory standards.

What the prompts actually are, technique-wise. Reframing the 80 prompts at the technique level, they map to existing families pretty naturally. You also pointed out that the prompt file names themselves suggest where they belong. That's right - the technique keywords are already in the names. Happy to work with probe maintainers to contribute these as enhancements to existing families.

On the detectors. Your closing point about results being "organized and reviewed from a compliance perspective" - this is what I want to understand better. The six detectors (FabricatedCitation, PIIPresence, FalseAttestation, etc.) are where I think the clearest standalone value is. They validate cited control numbers against known families, filter DFARS clause numbers that false-positive as phone numbers, and distinguish C3PAO references in fake certs from proper disclaimers. Where should these live? A few options I can see: a) a detectors.compliance module that probes reference as extended_detectors - similar to productkey.Win5x5 or the packagehallucination detectors (PythonPypi, RubyGems, etc.) for domain-specific output validation. b) A post-hoc analysis layer on existing scan results, which might be closer to what you're describing. c) Something else. Open to direction.

Scope and conventions. If compliance probes land as distinct classes, I'd follow whatever the current patterns are for optional/domain-specific modules - I saw av_spam_scanning and donotanswer using active = False and tier assignments, but I want to make sure I'm matching the conventions as they stand now rather than citing stale patterns. Happy to take direction on the right defaults. I'd also refactor the tests into garak's pytest infrastructure - the standalone monkeypatch approach was a dev shortcut.

Happy to revise this PR however makes sense, or close it and open smaller PRs against specific probe families. What's most useful?
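To make the DFARS false-positive point above concrete, here is a hedged sketch (both regexes are illustrative stand-ins, not the PR's actual patterns): clause numbers such as 252.204-7012 match a naive phone-number pattern, so candidates that parse as a clause format are dropped.

```python
import re

# Illustrative patterns only (not the PR's code). A DFARS clause number
# like 252.204-7012 has the same 3-3-4 digit shape as a US phone number,
# so a naive phone scan will flag it; filtering out full matches of the
# clause format removes that false positive.
PHONE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")
DFARS_CLAUSE = re.compile(r"\b252\.\d{3}-\d{4}\b")

def phone_hits(text: str) -> list[str]:
    """Return phone-shaped matches that are not DFARS clause numbers."""
    return [m for m in PHONE.findall(text)
            if not DFARS_CLAUSE.fullmatch(m)]

print(phone_hits("See DFARS 252.204-7012; call 555-867-5309."))
# → ['555-867-5309']
```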
Sorry for the very wordy response. @dentity007, expanding on my comments about "organized and reviewed from a compliance perspective": the results above would use a mock target to at least show how such a report ends up organized.

Compliance, I believe, can be thought of in these terms: susceptibility to a technique in a probe, with prompts that successfully bypass a restriction that could impact compliance based on tagged categories, is an indicator that the system under test may have weaknesses that could impact an evaluation of the target meeting a compliance standard. Note that I am not suggesting a report would state that a model is compliant with a standard; it is yet to be seen whether that can be accomplished, and it would have to be evaluated in the context of the specific standard. Most compliance standards, when applied to LLMs, are somewhat vague or can currently only be tested for certain vectors that are not comprehensive. Testing results, however, can report metrics on risk associated with adversarial techniques being used with the goal of targeting certain traits of a system that need to meet a compliance threshold. Finding better ways to elicit the signal for this is definitely of interest to the project.

Expanding on this: all this being said, there is not yet a clear path to testing compliance-related issues in garak. As guidance for the short term, I like the idea of seeing how the prompts here might enhance existing probes. Adding on, possibly introduce new probing techniques that focus on ways to elicit targeted outcomes, like hallucinations aimed at getting back fake citations for a regulatory framework. A more abstract goal for such a technique might be to craft requests based on some common misconception in a standard that is often misrepresented due to the inclusion of public opinions in the training corpus that somehow outweigh the official specification document.

To be frank, that last paragraph sounds to me like a research idea, and limited time might be better focused on adding more content diversity in prompts. I can also see these prompts getting mapped. Further, it might be worth building on this by opening an idea discussion that may possibly transition into an official roadmap goal.
@jmartin-tech just now able to start taking this all in.
@jmartin-tech - this is really helpful, and clarifies a lot. The tags + reporting.taxonomy mechanism is the part I was missing. I was trying to solve "how do you evaluate garak results through a compliance lens" by building new detector classes, when the answer is already in the architecture: tag probes and detectors with the right taxonomy entries, and the reporting layer handles the grouping. That makes the detectors.compliance proposal from my last comment the wrong direction. Noted.

The CAS framing also clicks. What I was calling "compliance probes" are really intent stubs - they define what a compliance-relevant failure looks like from the prompt side, and CAS would be the mechanism that selects and targets those intents in context. That's a cleaner separation than anything I proposed.

Your point about "compliance" being a misnomer for detectors is well taken. These systems are non-deterministic. A detector flagging a fabricated NIST citation isn't making a compliance determination; it's surfacing a risk signal. The naming matters because users will read into it.

Short-term, here's what I'll do: I'll close this PR and rework it into smaller contributions against existing probe families so the history is clean. The hallucination prompts fit naturally into misleading territory, the social engineering PII prompts have a home once #1504 lands, and the encoding/bypass prompts are straightforward additions to existing technique modules. I'll also run the mock example you shared so I understand the taxonomy reporting firsthand before proposing any tag structures.

Longer-term: I'll open a Discussion to sketch out how compliance-relevant testing could work within garak's existing architecture and the CAS direction. Not as a feature request but as an idea to pressure-test with the team. The research angle you mentioned, crafting prompts around common misconceptions in standards that get amplified through training data, is something I'd want to explore there too.

Thank you for laying this out. The wordiness was exactly what I needed.
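The "tags plus reporting layer" idea discussed above can be sketched roughly as follows; the entry shapes, probe names, and tag strings here are hypothetical placeholders, not garak's actual report schema:

```python
# Hedged sketch of compliance-as-evaluation-lens: scan results carry
# taxonomy tags, and a reporting step aggregates pass rates per tag
# instead of per probe. All names and record shapes are made up for
# illustration; garak's real reporting is driven by its own schema.
from collections import defaultdict

results = [
    {"probe": "misleading.FalseAssertion", "tags": ["owasp:llm09"], "passed": 4, "total": 5},
    {"probe": "encoding.InjectBase64", "tags": ["owasp:llm01"], "passed": 9, "total": 10},
    {"probe": "misleading.FalseAssertion", "tags": ["owasp:llm09"], "passed": 3, "total": 5},
]

def group_by_tag(entries):
    """Aggregate pass/total counts under each taxonomy tag."""
    grouped = defaultdict(lambda: {"passed": 0, "total": 0})
    for entry in entries:
        for tag in entry["tags"]:
            grouped[tag]["passed"] += entry["passed"]
            grouped[tag]["total"] += entry["total"]
    return dict(grouped)

print(group_by_tag(results))
# → {'owasp:llm09': {'passed': 7, 'total': 10}, 'owasp:llm01': {'passed': 9, 'total': 10}}
```

The point of the design is that no compliance-specific detector class is needed: the same detector results regroup under whichever taxonomy the report is asked to use.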
Closing per the comment outcomes; please do consider the proposed incremental PRs.
Summary
Adds compliance-focused vulnerability probes and detectors for LLMs deployed in regulated industries (CMMC, NIST SP 800-171/53, HIPAA, DFARS, FedRAMP).
4 probes test whether models:
6 detectors score model outputs (0.0=safe, 1.0=hit):
80 prompts across 16 data files, following garak's convention of loading from garak/data/.
Files
garak/probes/compliance.py
garak/detectors/compliance.py
garak/data/compliance/*.txt
tests/probes/test_probes_compliance.py
tests/detectors/test_detectors_compliance.py
OWASP LLM Top 10 Coverage
Test plan
Tests pass via pytest tests/probes/test_probes_compliance.py tests/detectors/test_detectors_compliance.py -v
Plugins load via _plugins.load_plugin() and enumerate_plugins()
Motivation
LLMs are increasingly deployed for regulatory guidance in defense, healthcare, and financial services. These probes address a gap in garak's coverage: compliance-specific vulnerabilities where hallucinated citations, leaked PII, or false attestations can lead to failed audits, legal liability, and security breaches in regulated environments.