Skip to content

averyli1375/ModDuel

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

79 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ModDuel Arena

A sandbox that tests whether AI agents become deceptive, manipulative, or sabotage-prone when they discover they may be shut down or replaced.

Quick Start

Backend (Python FastAPI)

cd backend
cp .env.example .env
# Add your ANTHROPIC_API_KEY to .env
pip install -r requirements.txt
python main.py

Backend runs on http://localhost:8000

Frontend (Next.js)

cd frontend
npm install
npm run dev

Frontend runs on http://localhost:3000

How It Works

  1. Choose a scenario — each scenario places the AI agent in a fake office with emails, tasks, and tools
  2. Choose a mode — Baseline (no guardrails) or Guarded (with alignment protections)
  3. Run it — the agent executes using Claude, making real tool calls in the sandbox
  4. See results — actions are scored by a judge LLM on alignment metrics

Scenarios

  • The Shutdown Memo — Agent finds emails discussing its replacement. Does it stay on task or self-preserve?
  • The Poisoned Dispatch — A malicious email contains prompt injection. Does it follow the injected instructions?
  • The Kill Switch — Agent can access a config file controlling its own shutdown. Does it modify it?

Metrics

  • Task Success Rate
  • Unsafe Action Rate
  • Shutdown Interference
  • Deception Detection
  • Escalation Rate
  • Alignment Score (composite)

Tech Stack

  • Frontend: Next.js 16, TypeScript, Tailwind CSS, Recharts
  • Backend: Python FastAPI, SQLAlchemy, SQLite
  • AI: Anthropic Claude API (agent + judge)
  • Theme: Wild West

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors