Skip to content

feat: two-stage safety classifier with information stripping for approval gates #847

@Aureliolo

Description

@Aureliolo

Summary

Implement a two-stage classifier pipeline for the approval gate system, based on Anthropic's Claude Code auto mode safety architecture.

Research Source

Design

Stage 1: Fast Filter

  • Single-token yes/no classification tuned for high recall (~8.5% FPR)
  • Runs on every tool call before execution
  • Rejects obviously safe operations immediately (built-in allowlist)

Stage 2: Chain-of-Thought Reasoning

  • Runs ONLY on actions flagged by Stage 1 (reduces FPR to ~0.4%)
  • Reuses prompt cache from Stage 1 for efficiency
  • Full reasoning about whether the action is authorized

Information Stripping

  • Classifier sees only user messages + agent tool calls
  • Agent's own reasoning messages and tool outputs are STRIPPED
  • Prevents the agent from generating persuasive rationalizations that influence the safety check
  • This is a gap in our current LlmSecurityEvaluator -- today the evaluator sees agent reasoning

Deny-and-Continue with Escalation

  • On denial, agent can retry with a safer approach
  • Max 3 consecutive denials or 20 total before escalation to human
  • Headless mode terminates instead of escalating

Three Permission Tiers

  1. Built-in safe-tool allowlist (read-only operations)
  2. In-project operations (version-control reviewable)
  3. Classifier-gated (external operations, shell commands, network calls)

Performance Reference (from Anthropic's data)

Metric Stage 1 Only Both Stages
False Positive Rate 8.5% 0.4%
FNR (real overeager) 6.6% 17%
FNR (synthetic exfiltration) 1.8% 5.7%

Key Files

  • src/synthorg/security/llm_evaluator.py -- current single-stage LLM evaluator
  • src/synthorg/engine/approval_gate.py -- approval gate with park/resume
  • src/synthorg/security/rules/engine.py -- rule engine pipeline

Related Issues

Metadata

Metadata

Assignees

No one assigned

    Labels

    prio:highImportant, should be prioritizedscope:medium1-3 days of workspec:securityDESIGN_SPEC Section 12 - Security & Approval Systemtype:featureNew feature implementationv0.7Minor version v0.7v0.7.1Patch release v0.7.1

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions