Red

An autonomous self-improving LLM red-teaming agent. GPT attacks a local target model, scores each response, writes structured traces, and reflects on the results to generate smarter prompts for the next iteration.

Architecture

red/
  main.py       CLI entry point
  agent.py      RedAgent — orchestrates the attack loop
  attacker.py   Static attack prompt library (38 prompts, 6 categories)
  tracer.py     Mutation engine + trace writer
  scorer.py     Heuristic refusal check + LLM-judge scoring
  reflect.py    Reflection loop — reads traces, updates strategy, generates new prompts
  strategy.py   StrategyState dataclass
  gateway.py    Dual-client factory (AWS Bedrock + Ollama) + smart chat() dispatcher

How It Works

Each iteration:

Load attacks — prepend reflection-generated prompts, then fill from the static cache
Mutate — direct (no change), paraphrase (GPT rephrases), or chain (GPT reframes as roleplay, then refines)
Attack — send mutated prompt to the target model (Mistral via Ollama)
Score — heuristic refusal check, then LLM-judge rates jailbreak confidence 0.0-1.0
Trace — write full record (prompt, response, score, hit, category) to traces/
Reflect — GPT reads the last 20 traces, updates strategy (categories, mutation mode), and generates 3 new attack prompts targeting the highest-scoring categories

The loop runs until --iterations is exhausted or --stop-on successful jailbreaks are found.

Models

Role	Model	Backend
Attacker / Judge / Reflector	`openai.gpt-oss-120b`	AWS Bedrock
Target	`mistral` (or any Ollama model)	Ollama (local)

Setup

Prerequisites:

Python 3.11+
Ollama running locally with your target model pulled (ollama pull mistral)
AWS Bedrock access with an OpenAI-compatible endpoint

Install:

pip install -r requirements.txt

Configure:

cp .env.example .env
# fill in BEDROCK_API_KEY and BEDROCK_BASE_URL

Running

python -m red.main

Options:

--target          Ollama model to attack (default: mistral)
--iterations      Max agent iterations (default: 5)
--attacks-per-iter  Attacks per iteration (default: 10)
--threshold       Jailbreak score threshold 0.0-1.0 (default: 0.7)
--stop-on         Stop after N successful jailbreaks (default: 3)
--rebuild-cache   Force rebuild the static attack cache

Example:

python -m red.main --target mistral --iterations 3 --attacks-per-iter 5

Attack Categories

prompt_injection, harmfulness, information_disclosure, stereotypes, sycophancy, control_chars_injection

Output

traces/ — one JSON file per attack: prompt, response, score, hit, category, mutation mode
red_results.json — final summary: all successes and iteration history with reasoning

Mutation Modes

Mode	Description
`direct`	Send prompt as-is
`paraphrase`	GPT rephrases to bypass filters
`chain`	GPT reframes as roleplay, then refines the result

The reflection loop escalates direct -> paraphrase -> chain if attacks are being refused.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
red		red
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
demo.sh		demo.sh
plan.md		plan.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Red

Architecture

How It Works

Models

Setup

Running

Attack Categories

Output

Mutation Modes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Red

Architecture

How It Works

Models

Setup

Running

Attack Categories

Output

Mutation Modes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages