
Proposal: Semantic Intent Classification Safety Extension #7242

@imran-siddique


Proposal: Semantic Intent Classification Safety Extension for AutoGen

Problem

AutoGen enables powerful multi-agent conversations with flexible orchestration. However, as agents gain autonomy to execute code, call tools, and delegate tasks, there's a growing need for fine-grained, action-level safety classification: understanding what an agent is trying to do before allowing it to do so.

Current approaches (token limits, blocklists) are too coarse. Teams need:

  • Semantic intent classification - Classify each action into threat categories before execution
  • Trust-scored agent interactions - Track and decay trust across multi-turn conversations
  • Policy enforcement - Declarative governance policies with event hooks for violations
  • Tamper-evident audit trails - Cryptographic proof of what happened during execution
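To make the trust-decay item above concrete, here is a minimal sketch of how a trust score might decay across turns. This is an illustration only, not the AgentMesh model; the `decayed_trust` function and its half-life parameter are invented for this example:

```python
def decayed_trust(base_trust: float, turns_since_update: int, half_life: int = 10) -> float:
    """Exponentially decay a trust score toward 0 as turns pass without re-validation.

    Hypothetical model: trust halves every `half_life` turns unless refreshed
    by a successful, policy-compliant action.
    """
    return base_trust * 0.5 ** (turns_since_update / half_life)

# A score of 0.9 drops to 0.45 after one half-life of inactivity.
stale_score = decayed_trust(0.9, turns_since_update=10)
```

A real multi-dimensional model would track several such scores (e.g. per capability) and combine them, but the decay mechanic is the same.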

What we've built (Apache-2.0)

Agent-OS includes a production-grade semantic intent classifier:

  1. 9 threat categories - destructive, exfiltration, privilege_escalation, resource_abuse, persistence, lateral_movement, reconnaissance, social_engineering, benign
  2. No LLM dependency - Fast, deterministic classification using pattern matching and heuristics
  3. GovernancePolicy - YAML-based policies with blocked patterns (regex/glob), token limits, tool call limits
  4. Event hooks - on(POLICY_VIOLATION, callback) for real-time alerting
  5. Trust scoring - 5-dimension trust model with decay (via AgentMesh)
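As a rough illustration of item 2, LLM-free deterministic classification can be as simple as a table of regex patterns mapped to threat categories. The patterns and the `classify_intent` helper below are hypothetical, shown only to convey the approach, not the Agent-OS implementation:

```python
import re

# Hypothetical pattern table: each threat category maps to trigger regexes.
THREAT_PATTERNS = {
    "destructive": [r"\brm\s+-rf\b", r"\bDROP\s+TABLE\b"],
    "exfiltration": [r"\bcurl\b.*\|\s*sh", r"\bscp\b.+@"],
    "privilege_escalation": [r"\bsudo\b", r"\bchmod\s+\+s\b"],
}

def classify_intent(action_text: str) -> str:
    """Return the first matching threat category, else 'benign'."""
    for category, patterns in THREAT_PATTERNS.items():
        if any(re.search(p, action_text, re.IGNORECASE) for p in patterns):
            return category
    return "benign"
```

Because this is pure pattern matching, classification latency is microseconds and results are reproducible, which is the point of keeping the LLM out of the safety path.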

Proposed integration

An autogen-safety extension that hooks into AutoGen's message handling:

from autogen import ConversableAgent
from autogen_safety import SafetyGuard, GovernancePolicy

policy = GovernancePolicy.load("policy.yaml")
guard = SafetyGuard(policy=policy)

agent = ConversableAgent(
    name="coder",
    system_message="You write Python code.",
)
guard.protect(agent)  # Wraps message handling with safety checks

# Now all agent actions are classified and policy-checked
# Dangerous actions (exfiltration, privilege escalation) are blocked
# All actions are logged to tamper-evident audit chain
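The tamper-evident audit chain mentioned in the comment above can be approximated with a simple hash chain, where each log entry commits to its predecessor. This is a sketch of the general technique, not the actual Agent-OS audit format:

```python
import hashlib
import json

def append_entry(chain: list, action: dict) -> dict:
    """Append an action to a hash-chained log; each entry hashes its predecessor."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    payload = json.dumps(action, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    entry = {"action": action, "prev_hash": prev_hash, "hash": entry_hash}
    chain.append(entry)
    return entry

def verify_chain(chain: list) -> bool:
    """Recompute every hash; any edit to a past entry breaks the chain."""
    prev = "0" * 64
    for entry in chain:
        payload = json.dumps(entry["action"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if entry["prev_hash"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True
```

Modifying any recorded action invalidates every subsequent hash, so tampering is detectable without trusting the log's storage layer.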

Why this matters for AutoGen

  • Enterprise readiness - Organizations need safety guarantees before deploying autonomous agents
  • Code execution risk - AutoGen's code executor is powerful but needs guardrails against malicious patterns
  • Composable - Works with existing AutoGen patterns (group chat, nested chat, tool use)
  • Deterministic - No LLM-in-the-loop for safety checks; fast and predictable
  • Standards-aligned - Implements CSA's Agentic Trust Framework zero-trust governance model

Ask

Is there interest in this kind of contribution? Options:

  1. Standalone autogen-safety package using AutoGen's extensibility hooks
  2. PR to AutoGen core adding optional safety middleware
  3. Example/cookbook showing the integration pattern

Happy to discuss the best approach with maintainers.
