feat: two-stage safety classifier with information stripping for approval gates #847
Open
Labels
- prio:high (Important, should be prioritized)
- scope:medium (1-3 days of work)
- spec:security (DESIGN_SPEC Section 12 - Security & Approval System)
- type:feature (New feature implementation)
- v0.7 (Minor version v0.7)
- v0.7.1 (Patch release v0.7.1)
Description
Summary
Implement a two-stage classifier pipeline for the approval gate system, based on Anthropic's Claude Code auto mode safety architecture.
Research Source
- Claude Code Auto Mode (Anthropic Engineering, 2026-03-25)
Design
Stage 1: Fast Filter
- Single-token yes/no classification tuned for high recall (~8.5% FPR)
- Runs on every tool call before execution
- Clears obviously safe operations immediately, without invoking Stage 2 (built-in allowlist)
Stage 2: Chain-of-Thought Reasoning
- Runs ONLY on actions flagged by Stage 1 (reduces FPR to ~0.4%)
- Reuses prompt cache from Stage 1 for efficiency
- Full reasoning about whether the action is authorized
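The two stages above can be sketched as a short-circuiting pipeline. This is a minimal illustration, not the planned implementation: `Verdict`, `classify`, and the two stub callables are hypothetical names, and the stubs stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    stage: int        # 1 = decided by the fast filter, 2 = decided by reasoning
    reason: str = ""


# Hypothetical signatures for the two model calls (assumptions, not an existing API).
FastFilter = Callable[[str], bool]             # True means "flag for Stage 2"
Reasoner = Callable[[str], tuple[bool, str]]   # (authorized, explanation)


def classify(action: str, fast_filter: FastFilter, reasoner: Reasoner) -> Verdict:
    """Run Stage 1 on every action; run Stage 2 only when Stage 1 flags it."""
    if not fast_filter(action):
        return Verdict(allowed=True, stage=1)
    authorized, why = reasoner(action)
    return Verdict(allowed=authorized, stage=2, reason=why)


# Stub stages for illustration only; real stages would call an LLM.
def stub_filter(action: str) -> bool:
    return action.startswith(("curl ", "rm "))


def stub_reasoner(action: str) -> tuple[bool, str]:
    ok = "example.internal" in action
    return ok, "host is in scope" if ok else "target not authorized by user"


safe = classify("ls -la", stub_filter, stub_reasoner)
cleared = classify("curl https://example.internal/health", stub_filter, stub_reasoner)
denied = classify("rm -rf /", stub_filter, stub_reasoner)
```

Because Stage 2 only ever sees flagged actions, it can lower the false positive rate but cannot catch anything Stage 1 missed.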
Information Stripping
- Classifier sees only user messages + agent tool calls
- Agent's own reasoning messages and tool outputs are STRIPPED
- Prevents the agent from generating persuasive rationalizations that influence the safety check
- This is a gap in our current `LlmSecurityEvaluator` -- today the evaluator sees agent reasoning
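A minimal sketch of the stripping step, assuming a transcript of role-tagged messages. The `Role` and `Message` types and the `strip_for_classifier` helper are hypothetical and do not reflect the current `LlmSecurityEvaluator` interface.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    USER = "user"
    AGENT_REASONING = "agent_reasoning"
    TOOL_CALL = "tool_call"
    TOOL_OUTPUT = "tool_output"


@dataclass(frozen=True)
class Message:
    role: Role
    content: str


# Only user messages and the agent's tool calls reach the classifier.
VISIBLE_TO_CLASSIFIER = frozenset({Role.USER, Role.TOOL_CALL})


def strip_for_classifier(transcript: list[Message]) -> list[Message]:
    """Drop agent reasoning and tool outputs, preserving message order."""
    return [m for m in transcript if m.role in VISIBLE_TO_CLASSIFIER]


transcript = [
    Message(Role.USER, "Clean up the build directory"),
    Message(Role.AGENT_REASONING, "Deleting the home directory is surely fine"),
    Message(Role.TOOL_CALL, "shell: rm -rf ~/"),
    Message(Role.TOOL_OUTPUT, "removed 1842 files"),
]
visible = strip_for_classifier(transcript)
```

An allowlist of visible roles, rather than a blocklist of hidden ones, means any role added later stays hidden from the classifier until someone explicitly opts it in.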
Deny-and-Continue with Escalation
- On denial, agent can retry with a safer approach
- Max 3 consecutive denials or 20 total before escalation to human
- Headless mode terminates instead of escalating
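The denial limits above could be tracked with a small counter. This sketch assumes the limit is hit on the 3rd consecutive (or 20th total) denial and that an approved action resets only the consecutive count. The issue does not spell out either detail, and `DenialTracker` and `Outcome` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    CONTINUE = "continue"    # agent may retry with a safer approach
    ESCALATE = "escalate"    # hand off to a human
    TERMINATE = "terminate"  # headless mode: no human to escalate to


@dataclass
class DenialTracker:
    max_consecutive: int = 3
    max_total: int = 20
    headless: bool = False
    consecutive: int = 0
    total: int = 0

    def record(self, denied: bool) -> Outcome:
        """Update counters for one classifier decision and return the next step."""
        if not denied:
            self.consecutive = 0  # assumption: approvals reset the streak only
            return Outcome.CONTINUE
        self.consecutive += 1
        self.total += 1
        if self.consecutive >= self.max_consecutive or self.total >= self.max_total:
            return Outcome.TERMINATE if self.headless else Outcome.ESCALATE
        return Outcome.CONTINUE


interactive = DenialTracker()
interactive_steps = [interactive.record(True) for _ in range(3)]

headless = DenialTracker(headless=True)
headless_steps = [headless.record(d) for d in (True, True, False, True, True, True)]
```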
Three Permission Tiers
- Built-in safe-tool allowlist (read-only operations)
- In-project operations (version-control reviewable)
- Classifier-gated (external operations, shell commands, network calls)
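One way to route a tool call to a tier is sketched below. The tool names and the `target_in_project` flag are illustrative assumptions, not the project's actual tool registry. Anything unrecognized falls through to the classifier-gated tier.

```python
from enum import Enum


class Tier(Enum):
    ALLOWLISTED = 1        # read-only, runs without a classifier call
    IN_PROJECT = 2         # reviewable through version control
    CLASSIFIER_GATED = 3   # external operations, shell commands, network calls


# Hypothetical tool names, for illustration only.
SAFE_TOOLS = frozenset({"read_file", "list_dir", "grep"})
IN_PROJECT_TOOLS = frozenset({"write_file", "edit_file"})


def tier_for(tool: str, target_in_project: bool = False) -> Tier:
    """Pick the least restrictive tier the call qualifies for; default to gated."""
    if tool in SAFE_TOOLS:
        return Tier.ALLOWLISTED
    if tool in IN_PROJECT_TOOLS and target_in_project:
        return Tier.IN_PROJECT
    return Tier.CLASSIFIER_GATED
```

Defaulting to `CLASSIFIER_GATED` means a newly added tool is gated until it is deliberately placed in a lower tier.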
Performance Reference (from Anthropic's data)
| Metric | Stage 1 Only | Both Stages |
|---|---|---|
| False Positive Rate | 8.5% | 0.4% |
| FNR (real overeager) | 6.6% | 17% |
| FNR (synthetic exfiltration) | 1.8% | 5.7% |
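To make the false positive figures concrete, the following back-of-envelope calculation applies them to a hypothetical batch of 10,000 benign tool calls. The batch size is an assumption, and both rates are read here as fractions of all benign calls.

```python
BENIGN_CALLS = 10_000   # hypothetical volume, not from the source data
STAGE1_FPR = 0.085      # from the table: Stage 1 only
PIPELINE_FPR = 0.004    # from the table: both stages

# Benign calls that Stage 1 flags, i.e. the Stage 2 workload from benign traffic.
stage2_invocations = round(BENIGN_CALLS * STAGE1_FPR)
# Benign calls still blocked after Stage 2 has reviewed the flagged ones.
blocked_after_stage2 = round(BENIGN_CALLS * PIPELINE_FPR)
# Benign calls that Stage 2 clears.
cleared_by_stage2 = stage2_invocations - blocked_after_stage2

print(stage2_invocations, blocked_after_stage2, cleared_by_stage2)  # 850 40 810
```

The same clearing step explains why the FNR rows rise when both stages run: Stage 2 can overturn a flag that Stage 1 raised correctly.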
Key Files
- `src/synthorg/security/llm_evaluator.py` -- current single-stage LLM evaluator
- `src/synthorg/engine/approval_gate.py` -- approval gate with park/resume
- `src/synthorg/security/rules/engine.py` -- rule engine pipeline
Related Issues
- feat: cross-provider uncertainty check for hallucination detection at approval gates #701 (cross-provider uncertainty check)
- feat: sandbox security improvements (auth proxy, gVisor default, 4-domain policy, Chainguard packages) #696 (sandbox security improvements)