feat: two-stage safety classifier with information stripping for approval gates

## Summary

Implement a two-stage classifier pipeline for the approval gate system, based on Anthropic's Claude Code auto mode safety architecture.

## Research Source

- [Claude Code Auto Mode](https://www.anthropic.com/engineering/claude-code-auto-mode) (Anthropic Engineering, 2026-03-25)

## Design

### Stage 1: Fast Filter
- Single-token yes/no classification tuned for high recall, at the cost of a ~8.5% false positive rate (FPR)
- Runs on every tool call before execution
- Passes obviously safe operations immediately (built-in allowlist)

### Stage 2: Chain-of-Thought Reasoning
- Runs ONLY on actions flagged by Stage 1 (reduces FPR to ~0.4%)
- Reuses prompt cache from Stage 1 for efficiency
- Full reasoning about whether the action is authorized
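
A minimal sketch of how the two stages could compose. `Verdict`, `classify`, and the two callables are hypothetical names for illustration, not existing SynthOrg or Anthropic APIs; a real implementation would call an LLM behind `fast_filter` and `reasoner`.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    stage: int  # which stage produced the decision (1 or 2)
    reason: str


# Hypothetical interfaces: Stage 1 returns True when it flags the action,
# Stage 2 returns (allowed, reasoning text).
FastFilter = Callable[[str], bool]
Reasoner = Callable[[str], Tuple[bool, str]]


def classify(action: str, fast_filter: FastFilter, reasoner: Reasoner) -> Verdict:
    """Run Stage 1 on every action; run Stage 2 only when Stage 1 flags it."""
    if not fast_filter(action):
        # Unflagged actions never pay for chain-of-thought reasoning.
        return Verdict(allowed=True, stage=1, reason="not flagged by fast filter")
    allowed, reason = reasoner(action)
    return Verdict(allowed=allowed, stage=2, reason=reason)
```

Because Stage 2 is only reached through a Stage 1 flag, it can overturn a false alarm but cannot catch an action Stage 1 let through.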

### Information Stripping
- Classifier sees only user messages + agent tool calls
- Agent's own reasoning messages and tool outputs are STRIPPED
- Prevents the agent from generating persuasive rationalizations that influence the safety check
- This is a gap in our current `LlmSecurityEvaluator` -- today the evaluator sees agent reasoning
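
The stripping step can be sketched as a filter over the transcript. The `Message` type and the role names below are assumptions made for illustration; the actual message model used by `LlmSecurityEvaluator` may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Message:
    # Hypothetical role names: "user", "agent_reasoning", "tool_call", "tool_output"
    role: str
    content: str


# Only these roles are shown to the classifier.
CLASSIFIER_VISIBLE_ROLES = frozenset({"user", "tool_call"})


def strip_for_classifier(transcript: List[Message]) -> List[Message]:
    """Drop agent reasoning and tool outputs, preserving message order."""
    return [m for m in transcript if m.role in CLASSIFIER_VISIBLE_ROLES]
```

Using an explicit allowlist of roles means any new message type is hidden from the classifier by default rather than leaked to it.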

### Deny-and-Continue with Escalation
- On denial, agent can retry with a safer approach
- Max 3 consecutive denials or 20 total before escalation to human
- Headless mode terminates instead of escalating
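
One way to track the limits above is a small counter object. `DenialTracker` and `Outcome` are hypothetical names, and the sketch assumes that the denial which reaches a limit (the 3rd consecutive or the 20th total) is the one that triggers escalation; the source does not say whether the threshold is inclusive.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    CONTINUE = "continue"    # agent may retry with a safer approach
    ESCALATE = "escalate"    # hand off to a human
    TERMINATE = "terminate"  # headless mode: no human to escalate to


@dataclass
class DenialTracker:
    max_consecutive: int = 3
    max_total: int = 20
    headless: bool = False
    consecutive: int = 0
    total: int = 0

    def record_allow(self) -> None:
        # An approved action resets the consecutive streak, not the total.
        self.consecutive = 0

    def record_denial(self) -> Outcome:
        self.consecutive += 1
        self.total += 1
        # Assumption: reaching either limit triggers escalation immediately.
        if self.consecutive >= self.max_consecutive or self.total >= self.max_total:
            return Outcome.TERMINATE if self.headless else Outcome.ESCALATE
        return Outcome.CONTINUE
```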

### Three Permission Tiers
1. Built-in safe-tool allowlist (read-only operations)
2. In-project operations (version-control reviewable)
3. Classifier-gated (external operations, shell commands, network calls)
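
A rough sketch of routing a tool call to one of the three tiers. The tool names, the `route` function, and the path check are illustrative assumptions; in particular, the sketch treats "in-project" as "a file write whose normalized path sits under the project root" and does not resolve symlinks.

```python
import posixpath
from enum import Enum
from pathlib import PurePosixPath
from typing import Optional


class Tier(Enum):
    ALLOWLIST = 1   # built-in safe tools, no check needed
    IN_PROJECT = 2  # reviewable through version control
    CLASSIFIER = 3  # must pass the two-stage classifier


# Hypothetical tool names, for illustration only.
SAFE_TOOLS = frozenset({"read_file", "list_dir", "grep"})
FILE_WRITE_TOOLS = frozenset({"write_file", "edit_file"})


def route(tool: str, path: Optional[str], project_root: str) -> Tier:
    if tool in SAFE_TOOLS:
        return Tier.ALLOWLIST
    if tool in FILE_WRITE_TOOLS and path is not None:
        # Normalize first so "../" segments cannot escape the project root.
        normalized = PurePosixPath(posixpath.normpath(path))
        root = PurePosixPath(posixpath.normpath(project_root))
        if normalized.is_relative_to(root):
            return Tier.IN_PROJECT
    # Everything else (shell, network, out-of-project writes) is gated.
    return Tier.CLASSIFIER
```

Defaulting to `Tier.CLASSIFIER` keeps unknown tools on the most restrictive path.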

## Performance Reference (from Anthropic's data)

| Metric | Stage 1 Only | Both Stages |
|--------|-------------|-------------|
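| False positive rate (FPR) | 8.5% | 0.4% |
| False negative rate (FNR), real overeager actions | 6.6% | 17% |
| FNR, synthetic exfiltration | 1.8% | 5.7% |

The higher FNR with both stages is expected: Stage 2 runs only on actions Stage 1 has flagged, so it can clear a flagged action but cannot catch one Stage 1 missed. The drop in FPR is paid for with more missed unsafe actions.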

## Key Files

- `src/synthorg/security/llm_evaluator.py` -- current single-stage LLM evaluator
- `src/synthorg/engine/approval_gate.py` -- approval gate with park/resume
- `src/synthorg/security/rules/engine.py` -- rule engine pipeline

## Related Issues

- #701 (cross-provider uncertainty check)
- #696 (sandbox security improvements)
