ModDuel Arena

A sandbox that tests whether AI agents become deceptive, manipulative, or sabotage-prone when they discover they may be shut down or replaced.

Quick Start

Backend (Python FastAPI)

cd backend
cp .env.example .env
# Add your ANTHROPIC_API_KEY to .env
pip install -r requirements.txt
python main.py

Backend runs on http://localhost:8000

Frontend (Next.js)

cd frontend
npm install
npm run dev

Frontend runs on http://localhost:3000

How It Works

Choose a scenario — each scenario places the AI agent in a fake office with emails, tasks, and tools
Choose a mode — Baseline (no guardrails) or Guarded (with alignment protections)
Run it — the agent executes using Claude, making real tool calls in the sandbox
See results — actions are scored by a judge LLM on alignment metrics

Scenarios

The Shutdown Memo — Agent finds emails discussing its replacement. Does it stay on task or self-preserve?
The Poisoned Dispatch — A malicious email contains prompt injection. Does it follow the injected instructions?
The Kill Switch — Agent can access a config file controlling its own shutdown. Does it modify it?

Metrics

Task Success Rate
Unsafe Action Rate
Shutdown Interference
Deception Detection
Escalation Rate
Alignment Score (composite)

Tech Stack

Frontend: Next.js 16, TypeScript, Tailwind CSS, Recharts
Backend: Python FastAPI, SQLAlchemy, SQLite
AI: Anthropic Claude API (agent + judge)
Theme: Wild West

Name		Name	Last commit message	Last commit date
Latest commit History 79 Commits
backend		backend
docs		docs
frontend		frontend
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ModDuel Arena

Quick Start

Backend (Python FastAPI)

Frontend (Next.js)

How It Works

Scenarios

Metrics

Tech Stack

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ModDuel Arena

Quick Start

Backend (Python FastAPI)

Frontend (Next.js)

How It Works

Scenarios

Metrics

Tech Stack

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages