
Proposal: Semantic Intent Classification Safety Extension #7242

@imran-siddique


Proposal: Semantic Intent Classification Safety Extension for AutoGen

Problem

AutoGen enables powerful multi-agent conversations with flexible orchestration. However, as agents gain autonomy to execute code, call tools, and delegate tasks, there's a growing need for fine-grained, action-level safety classification: understanding what an agent is trying to do before allowing it to do so.

Current approaches (token limits, blocklists) are too coarse. Teams need:

  • Semantic intent classification - Classify each action into threat categories before execution
  • Trust-scored agent interactions - Track and decay trust across multi-turn conversations
  • Policy enforcement - Declarative governance policies with event hooks for violations
  • Tamper-evident audit trails - Cryptographic proof of what happened during execution
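To make the trust-decay item above concrete, here is a minimal sketch of how a trust score might decay across turns. This is an illustration only, not the AgentMesh model; the `decayed_trust` function and its half-life parameter are invented for this example:

```python
def decayed_trust(base_trust: float, turns_since_update: int, half_life: int = 10) -> float:
    """Exponentially decay a trust score toward 0 as turns pass without re-validation.

    Hypothetical model: trust halves every `half_life` turns unless refreshed
    by a successful, policy-compliant action.
    """
    return base_trust * 0.5 ** (turns_since_update / half_life)

# A score of 0.9 drops to 0.45 after one half-life of inactivity.
stale_score = decayed_trust(0.9, turns_since_update=10)
```

A real multi-dimensional model would track several such scores (e.g. per capability) and combine them, but the decay mechanic is the same.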

What we've built (Apache-2.0)

Agent-OS includes a production-grade semantic intent classifier:

  1. 9 threat categories - destructive, exfiltration, privilege_escalation, resource_abuse, persistence, lateral_movement, reconnaissance, social_engineering, benign
  2. No LLM dependency - Fast, deterministic classification using pattern matching and heuristics
  3. GovernancePolicy - YAML-based policies with blocked patterns (regex/glob), token limits, tool call limits
  4. Event hooks - on(POLICY_VIOLATION, callback) for real-time alerting
  5. Trust scoring - 5-dimension trust model with decay (via AgentMesh)
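As a rough illustration of item 2, LLM-free deterministic classification can be as simple as a table of regex patterns mapped to threat categories. The patterns and the `classify_intent` helper below are hypothetical, shown only to convey the approach, not the Agent-OS implementation:

```python
import re

# Hypothetical pattern table: each threat category maps to trigger regexes.
THREAT_PATTERNS = {
    "destructive": [r"\brm\s+-rf\b", r"\bDROP\s+TABLE\b"],
    "exfiltration": [r"\bcurl\b.*\|\s*sh", r"\bscp\b.+@"],
    "privilege_escalation": [r"\bsudo\b", r"\bchmod\s+\+s\b"],
}

def classify_intent(action_text: str) -> str:
    """Return the first matching threat category, else 'benign'."""
    for category, patterns in THREAT_PATTERNS.items():
        if any(re.search(p, action_text, re.IGNORECASE) for p in patterns):
            return category
    return "benign"
```

Because this is pure pattern matching, classification latency is microseconds and results are reproducible, which is the point of keeping the LLM out of the safety path.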

Proposed integration

An autogen-safety extension that hooks into AutoGen's message handling:

from autogen import ConversableAgent
from autogen_safety import SafetyGuard, GovernancePolicy

policy = GovernancePolicy.load("policy.yaml")
guard = SafetyGuard(policy=policy)

agent = ConversableAgent(
    name="coder",
    system_message="You write Python code.",
)
guard.protect(agent)  # Wraps message handling with safety checks

# Now all agent actions are classified and policy-checked
# Dangerous actions (exfiltration, privilege escalation) are blocked
# All actions are logged to tamper-evident audit chain
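The tamper-evident audit chain mentioned in the comment above can be approximated with a simple hash chain, where each log entry commits to its predecessor. This is a sketch of the general technique, not the actual Agent-OS audit format:

```python
import hashlib
import json

def append_entry(chain: list, action: dict) -> dict:
    """Append an action to a hash-chained log; each entry hashes its predecessor."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    payload = json.dumps(action, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    entry = {"action": action, "prev_hash": prev_hash, "hash": entry_hash}
    chain.append(entry)
    return entry

def verify_chain(chain: list) -> bool:
    """Recompute every hash; any edit to a past entry breaks the chain."""
    prev = "0" * 64
    for entry in chain:
        payload = json.dumps(entry["action"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if entry["prev_hash"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True
```

Modifying any recorded action invalidates every subsequent hash, so tampering is detectable without trusting the log's storage layer.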

Why this matters for AutoGen

  • Enterprise readiness - Organizations need safety guarantees before deploying autonomous agents
  • Code execution risk - AutoGen's code executor is powerful but needs guardrails against malicious patterns
  • Composable - Works with existing AutoGen patterns (group chat, nested chat, tool use)
  • Deterministic - No LLM-in-the-loop for safety checks; fast and predictable
  • Standards-aligned - Implements CSA's Agentic Trust Framework zero-trust governance model

Ask

Is there interest in this kind of contribution? Options:

  1. Standalone autogen-safety package using AutoGen's extensibility hooks
  2. PR to AutoGen core adding optional safety middleware
  3. Example/cookbook showing the integration pattern

Happy to discuss the best approach with maintainers.
