
⚔️ TokenWar

Compare LLM responses side-by-side in your terminal, then let an AI judge score them.

TokenWar sends the same prompt to multiple LLM models via any OpenAI-compatible endpoint, displays their responses in a split-pane TUI, and runs an LLM-as-judge evaluation scoring each response on accuracy, helpfulness, clarity, creativity, and conciseness.

┌─────────────────────┬──────────────────────┬─────────────────────┐
│ claude-sonnet-4     │ gpt-4o               │ grok-3              │
│                     │                      │                     │
│ The Rust ownership  │ Rust's ownership     │ In Rust, ownership  │
│ system ensures      │ model is a set of    │ is the core concept │
│ memory safety       │ rules that the       │ that makes memory   │
│ without a garbage   │ compiler checks at   │ safe without GC...  │
│ collector...        │ compile time...      │                     │
│                     │                      │                     │
├─────────────────────┴───────────┬──────────┴─────────────────────┤
│ gemini-2.5-flash                │ llama-3.1-70b                  │
│                                 │                                │
│ Ownership in Rust is a          │ Rust uses an ownership model   │
│ discipline enforced by the      │ where each value has exactly   │
│ compiler that governs how       │ one owner at a time...         │
│ memory is managed...            │                                │
│                                 │                                │
└─────────────────────────────────┴────────────────────────────────┘

After all responses arrive, the judge scores them:

=== Scoreboard ===
1. claude-sonnet-4 - 42.0/50
2. gemini-2.5-flash - 40.5/50
3. gpt-4o - 39.0/50
4. grok-3 - 38.5/50
5. llama-3.1-70b - 37.0/50

=== Details ===

claude-sonnet-4:
  Accuracy: 9.0 (Correct and precise explanation of ownership rules)
  Helpfulness: 8.5 (Directly addresses the question with practical examples)
  Clarity: 8.5 (Well-structured with clear progression of concepts)
  Creativity: 8.0 (Novel analogy comparing ownership to real-world lending)
  Conciseness: 8.0 (Thorough but not verbose)

gpt-4o:
  Accuracy: 8.5 (Accurate coverage of core concepts)
  Helpfulness: 8.0 (Good overview but fewer practical examples)
  ...

Why TokenWar?

When it's better than just using Claude or ChatGPT

Use Case | Why TokenWar Wins
Evaluating models for your use case | See how multiple models handle your actual prompts, not benchmarks
Reducing bias in model selection | An independent judge scores responses — not your gut feeling
Catching hallucinations | If 4 models agree and 1 doesn't, you've likely found a hallucination
Prompt engineering | Instantly see how different models interpret the same prompt
Choosing a model for production | Real response quality + latency data, not marketing claims
Creative work | Compare writing styles, get multiple angles on the same topic
Factual research | Cross-reference answers across models for higher confidence
Cost optimization | If a cheaper model scores comparably, you've found your winner

Example: You're building a customer support bot. You write 10 representative prompts, run them through TokenWar, and discover that for your specific domain, Gemini outperforms GPT-4o while costing less. You'd never know this from public benchmarks.

When you should just use Claude or ChatGPT

Situation | Why TokenWar is Overkill
Quick one-off questions | You just need an answer, not a comparison
Conversational/multi-turn chat | TokenWar is single-turn only — no follow-ups
You already know your preferred model | No need to compare if you're happy
Cost-sensitive usage | TokenWar calls N models + a judge = (N+1)x the cost of one model
Image/audio/video tasks | TokenWar is text-only
You need tool use or function calling | TokenWar sends plain prompts, no tool schemas

Features

  • ⚡ Concurrent API calls — All models queried simultaneously via tokio
  • 📺 Terminal UI — Split-pane ratatui display showing responses as they stream in
  • 🏆 LLM-as-judge scoring — Automated evaluation on 5 criteria (1-10 scale each, 50 max)
  • 🔌 Any model, one endpoint — Works with LiteLLM, OpenRouter, Ollama, or any OpenAI-compatible API
  • 📡 Streaming mode — Watch responses arrive token-by-token with --stream
  • 📋 Plain text mode — --no-tui for piping output or CI/automation
  • 📊 JSON output — --json for machine-readable results with latency data
  • ⏱️ Latency tracking — Per-model response time in milliseconds
  • 🔧 Dynamic model list — Add or remove models by editing one env var, no code changes
  • 💪 Fault tolerant — One model failing doesn't kill the others

Installation

Homebrew (macOS/Linux)

brew tap tylerwillis/tap
brew install tokenwar

This installs both tokenwar and tw (shorthand alias).

Direct Download

Download the latest release for your platform:

# macOS (Apple Silicon)
curl -L https://github.com/tylerwillis/tokenwar/releases/latest/download/tokenwar-macos-aarch64.tar.gz | tar xz
sudo mv tokenwar /usr/local/bin/

# macOS (Intel)
curl -L https://github.com/tylerwillis/tokenwar/releases/latest/download/tokenwar-macos-x86_64.tar.gz | tar xz
sudo mv tokenwar /usr/local/bin/

# Linux (x86_64)
curl -L https://github.com/tylerwillis/tokenwar/releases/latest/download/tokenwar-linux-x86_64.tar.gz | tar xz
sudo mv tokenwar /usr/local/bin/

From Source

Requires Rust 1.70+:

git clone https://github.com/tylerwillis/tokenwar.git
cd tokenwar
cargo build --release
# Binary at target/release/tokenwar

Proxy Setup

TokenWar talks to a single OpenAI-compatible endpoint. You need a proxy that routes to multiple providers. Pick one:

Option A: LiteLLM (self-hosted, recommended)

LiteLLM gives you one API for 100+ models with zero token markup. Best for homelabbers.

# Install
pipx install 'litellm[proxy]'

# Create config (litellm_config.yaml)
cat > litellm_config.yaml << 'EOF'
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

  - model_name: gemini-2.5-flash
    litellm_params:
      model: gemini/gemini-2.5-flash
      api_key: os.environ/GEMINI_API_KEY

  - model_name: claude-sonnet-4
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

general_settings:
  master_key: sk-tokenwar-local
EOF

# Start the proxy
litellm --config litellm_config.yaml --port 4000

Or with Docker:

docker run -d --name litellm \
  -p 4000:4000 \
  -v $(pwd)/litellm_config.yaml:/app/config.yaml \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  -e GEMINI_API_KEY=$GEMINI_API_KEY \
  -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml

Option B: OpenRouter (hosted, zero setup)

No self-hosting required. One API key, 200+ models, small per-token markup.

  1. Get an API key at openrouter.ai
  2. Set base_url = "https://openrouter.ai/api/v1" in your tokenwar.toml (see Configuration)

Option C: Ollama (fully local)

For comparing local models with zero API costs:

ollama serve  # starts on localhost:11434
ollama pull llama3.1
ollama pull mistral

Set base_url = "http://localhost:11434/v1" and api_key = "ollama" in your tokenwar.toml (see Configuration).
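Putting those two settings together, a minimal tokenwar.toml for Ollama might look like the sketch below. The model entries assume you pulled llama3.1 and mistral as above, and the judge_model choice is just one reasonable option (any pulled model works):

```toml
base_url = "http://localhost:11434/v1"
api_key = "ollama"   # Ollama ignores the key, but the field must be set

[[models]]
name = "llama3.1"
model = "llama3.1"

[[models]]
name = "mistral"
model = "mistral"

judge_model = "llama3.1"
```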

Configuration

TokenWar looks for tokenwar.toml in the current directory, or falls back to ~/.config/tokenwar/tokenwar.toml for a global config.

# Global config (recommended)
mkdir -p ~/.config/tokenwar
cp tokenwar.example.toml ~/.config/tokenwar/tokenwar.toml

# Or local config (per-project)
cp tokenwar.example.toml tokenwar.toml

Edit your config:

base_url = "http://localhost:4000/v1"
api_key = "sk-tokenwar-local"

[[models]]
name = "gpt-4o"
model = "gpt-4o"

[[models]]
name = "gpt-4o-mini"
model = "gpt-4o-mini"

judge_model = "gpt-4o"
timeout_secs = 60

[limits]
max_concurrent_runs = 4
max_images = 8
max_total_image_bytes = 20971520 # 20 MiB
max_completed_runs = 50

Model names must match what your proxy expects. For LiteLLM, these are the model_name values in your config. For OpenRouter, use their model IDs (e.g. openai/gpt-4o).
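For example, a sketch of an OpenRouter config. Only openai/gpt-4o is taken from the text above; the second model ID follows the same provider-prefix pattern but is illustrative — check OpenRouter's model list for exact IDs, and substitute your real API key:

```toml
base_url = "https://openrouter.ai/api/v1"
api_key = "sk-or-REPLACE_ME"   # your OpenRouter API key

[[models]]
name = "gpt-4o"
model = "openai/gpt-4o"

[[models]]
name = "gpt-4o-mini"
model = "openai/gpt-4o-mini"   # illustrative ID; verify against OpenRouter
```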

Optional Environment Overrides

If you prefer keeping secrets out of tokenwar.toml, you can override these via env vars:

  • TOKENWAR_BASE_URL
  • TOKENWAR_API_KEY
  • TOKENWAR_JUDGE_MODEL
  • TOKENWAR_JUDGE_BASE_URL
  • TOKENWAR_JUDGE_API_KEY

Usage

Basic

# Pass prompt as argument
tokenwar "Explain the difference between TCP and UDP"

# Pipe from stdin
echo "Write a haiku about Rust" | tokenwar

# From a file
tokenwar < prompt.txt

Options

# Stream responses token-by-token in the TUI
tokenwar --stream "What is quantum computing?"

# Plain text output (no TUI, good for scripts/CI)
tokenwar --no-tui "Compare REST vs GraphQL"

# JSON output (machine-readable, includes latency per model)
tokenwar --json "Compare REST vs GraphQL"

# Custom timeout (overrides config)
tokenwar --timeout-secs 120 "Write a detailed essay on climate change"

# Combine flags
tokenwar --stream --timeout-secs 90 "Explain monads to a 5-year-old"

# Web UI (streams responses live in the browser)
tokenwar --web --port 8080 --config tokenwar.toml

JSON Output

The --json flag outputs structured JSON for programmatic consumption:

{
  "prompt": "What is 2+2?",
  "providers": [
    {
      "name": "gpt-4o",
      "model": "gpt-4o",
      "response_text": "2 + 2 = 4.",
      "error": null,
      "latency_ms": 1234,
      "ttft_ms": 210
    },
    {
      "name": "gemini-2.5-flash",
      "model": "gemini-2.5-flash",
      "response_text": "The answer is 4.",
      "error": null,
      "latency_ms": 987,
      "ttft_ms": 160
    }
  ],
  "scores": [],
  "metadata": {
    "timestamp": 1738492800,
    "timeout_secs": 60,
    "stream": false
  }
}
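Because the output is plain JSON, post-processing is straightforward. A small Python sketch that ranks models by latency, using a payload in the same shape as the sample above:

```python
import json

# Sample payload mirroring the shape of `tokenwar --json` output above.
raw = '''{
  "providers": [
    {"name": "gpt-4o", "latency_ms": 1234, "ttft_ms": 210},
    {"name": "gemini-2.5-flash", "latency_ms": 987, "ttft_ms": 160}
  ]
}'''

result = json.loads(raw)
# Sort providers by total response time, fastest first.
ranked = sorted(result["providers"], key=lambda p: p["latency_ms"])
for p in ranked:
    print(f'{p["name"]}: {p["latency_ms"]} ms (ttft {p["ttft_ms"]} ms)')
```

In practice you would pipe `tokenwar --json "..."` into a script like this (or into jq) instead of hard-coding the payload.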

TUI Controls

Key | Action
q | Exit TUI
Tab | Focus next panel
Shift+Tab | Focus previous panel
j / ↓ | Scroll down in the active panel
k / ↑ | Scroll up in the active panel
Space | Toggle fullscreen on focused panel
Esc | Exit fullscreen mode
c | Copy focused panel's response to clipboard

The TUI stays open after responses complete so you can review, scroll, and copy. Press q to exit and see the judge scoreboard.

Web Controls

  • Enter runs the prompt (Shift+Enter inserts a newline)
  • Paste images from the clipboard (limits enforced by tokenwar.toml)
  • q cancels the run
  • Tab / Shift+Tab cycles the active panel
  • j/k or ↑/↓ scrolls the active panel

Architecture

                    ┌─────────────────────────────┐
          prompt    │ OpenAI-compatible endpoint   │
       ┌───────────▶│ (LiteLLM / OpenRouter / ...) │────┐
       │            └─────────────────────────────┘    │
       │                                                │
┌──────┴──┐    ┌─────────┐ ┌─────────┐ ┌─────────┐     │    ┌───────┐    ┌───────┐
│  User   │───▶│ Model A │ │ Model B │ │ Model C │─────┼───▶│  TUI  │───▶│ Judge │
│ Prompt  │    └─────────┘ └─────────┘ └─────────┘     │    └───────┘    └───────┘
└─────────┘                                             │
                    All calls are concurrent (tokio)    │
  1. Dispatch — Your prompt is sent to all configured models simultaneously
  2. Collect — Responses stream back via mpsc channels and render in the TUI
  3. Judge — All responses are sent to the judge model for structured scoring
  4. Report — Scoreboard with rankings and per-criteria reasoning
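The dispatch and collect steps above form a simple fan-out/fan-in pipeline. A Python asyncio sketch of the idea — model names and delays here are made up, and the real tool implements this in Rust with tokio tasks and mpsc channels:

```python
import asyncio

# Toy stand-in for an HTTP call to the OpenAI-compatible endpoint.
async def query_model(name: str, delay: float) -> tuple[str, str]:
    await asyncio.sleep(delay)  # simulate network latency
    return name, f"response from {name}"

async def run(prompt: str) -> dict[str, str]:
    # 1. Dispatch: spawn one task per model; all run concurrently.
    tasks = [asyncio.create_task(query_model(name, delay))
             for name, delay in [("model-a", 0.02), ("model-b", 0.01)]]
    # 2. Collect: results arrive in completion order, not dispatch order,
    #    which is what lets the TUI render each response as it finishes.
    responses = {}
    for fut in asyncio.as_completed(tasks):
        name, text = await fut
        responses[name] = text
    # 3/4. Judge + report would send `responses` to the judge model here.
    return responses

responses = asyncio.run(run("What is 2+2?"))
print(sorted(responses))
```

One failed task would surface as an error entry rather than cancelling its siblings, which is the fault-tolerance property listed under Features.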

Scoring Criteria

The judge evaluates each response on a 1-10 scale:

Criterion | What it measures
Accuracy | Is the information correct and factual?
Helpfulness | Does it address what the user actually needs?
Clarity | Is it well-structured and easy to understand?
Creativity | Does it show original thinking or novel approaches?
Conciseness | Is it appropriately detailed without being verbose?

Total: /50 — The judge provides brief reasoning for each score.
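The scoreboard totals are just the five criteria summed. Using the claude-sonnet-4 scores from the Details section above:

```python
# Per-criterion scores from the Details section.
scores = {"accuracy": 9.0, "helpfulness": 8.5, "clarity": 8.5,
          "creativity": 8.0, "conciseness": 8.0}
total = sum(scores.values())
print(f"{total}/50")  # prints 42.0/50, matching the scoreboard
```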

Tip: Use a different judge model than the contestants to reduce self-preference bias. If you're comparing Claude vs GPT, use Gemini as the judge.

Tips

  • Compare anything — same family (gpt-4o vs gpt-4o-mini), cross-provider (Claude vs GPT vs Gemini), or local vs cloud (Llama vs GPT-4o)
  • Run the same prompt multiple times — LLM outputs are non-deterministic, so scores will vary
  • Use --json for automation — pipe to jq, build dashboards, track model quality over time
  • Per-model overrides — point specific models at different endpoints (e.g., one model direct, rest through proxy)
  • No model limit — compare 2 models or 20; the TUI grid auto-layouts
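A sketch of what a per-model override could look like. The per-model base_url and api_key field names here are assumptions for illustration — check tokenwar.example.toml for the exact supported fields:

```toml
[[models]]
name = "llama3.1"
model = "llama3.1"
base_url = "http://localhost:11434/v1"  # hypothetical per-model override
api_key = "ollama"                       # hypothetical per-model override
```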

Roadmap

  • [x] JSON output mode
  • [x] Per-model latency tracking
  • [x] Unified OpenAI-compatible endpoint (LiteLLM/OpenRouter)
  • [x] Dynamic model list (no code changes to add models)
  • [x] Auto-layout TUI grid for any number of models
  • [ ] Multi-turn conversation support
  • [ ] Token usage and cost tracking per model
  • [ ] Configurable scoring criteria
  • [ ] Export results to CSV
  • [x] Time-to-first-token latency tracking
  • [ ] Side-by-side diff view for similar responses

License

MIT
