Compare LLM responses side-by-side in your terminal, then let an AI judge score them.
TokenWar sends the same prompt to multiple LLM models via any OpenAI-compatible endpoint, displays their responses in a split-pane TUI, and runs an LLM-as-judge evaluation scoring each response on accuracy, helpfulness, clarity, creativity, and conciseness.
┌─────────────────────┬──────────────────────┬─────────────────────┐
│ claude-sonnet-4 │ gpt-4o │ grok-3 │
│ │ │ │
│ The Rust ownership │ Rust's ownership │ In Rust, ownership │
│ system ensures │ model is a set of │ is the core concept │
│ memory safety │ rules that the │ that makes memory │
│ without a garbage │ compiler checks at │ safe without GC... │
│ collector... │ compile time... │ │
│ │ │ │
├─────────────────────┴───────────┬──────────┴─────────────────────┤
│ gemini-2.5-flash │ llama-3.1-70b │
│ │ │
│ Ownership in Rust is a │ Rust uses an ownership model │
│ discipline enforced by the │ where each value has exactly │
│ compiler that governs how │ one owner at a time... │
│ memory is managed... │ │
│ │ │
└─────────────────────────────────┴────────────────────────────────┘
After all responses arrive, the judge scores them:
=== Scoreboard ===
1. claude-sonnet-4 - 42.0/50
2. gemini-2.5-flash - 40.5/50
3. gpt-4o - 39.0/50
4. grok-3 - 38.5/50
5. llama-3.1-70b - 37.0/50
=== Details ===
claude-sonnet-4:
Accuracy: 9.0 (Correct and precise explanation of ownership rules)
Helpfulness: 8.5 (Directly addresses the question with practical examples)
Clarity: 8.5 (Well-structured with clear progression of concepts)
Creativity: 8.0 (Novel analogy comparing ownership to real-world lending)
Conciseness: 8.0 (Thorough but not verbose)
gpt-4o:
Accuracy: 8.5 (Accurate coverage of core concepts)
Helpfulness: 8.0 (Good overview but fewer practical examples)
...
| Use Case | Why TokenWar Wins |
|---|---|
| Evaluating models for your use case | See how multiple models handle your actual prompts, not benchmarks |
| Reducing bias in model selection | An independent judge scores responses — not your gut feeling |
| Catching hallucinations | If 4 models agree and 1 doesn't, the outlier is likely hallucinating |
| Prompt engineering | Instantly see how different models interpret the same prompt |
| Choosing a model for production | Real response quality + latency data, not marketing claims |
| Creative work | Compare writing styles, get multiple angles on the same topic |
| Factual research | Cross-reference answers across models for higher confidence |
| Cost optimization | If a cheaper model scores comparably, you've found your winner |
Example: You're building a customer support bot. You write 10 representative prompts, run them through TokenWar, and discover that for your specific domain, Gemini outperforms GPT-4o while costing less. You'd never know this from public benchmarks.
| Situation | Why TokenWar is Overkill |
|---|---|
| Quick one-off questions | You just need an answer, not a comparison |
| Conversational/multi-turn chat | TokenWar is single-turn only — no follow-ups |
| You already know your preferred model | No need to compare if you're happy |
| Cost-sensitive usage | TokenWar calls N models + a judge = (N+1)x the cost of one model |
| Image/audio/video tasks | TokenWar is text-only |
| You need tool use or function calling | TokenWar sends plain prompts, no tool schemas |
- ⚡ Concurrent API calls — All models queried simultaneously via tokio
- 📺 Terminal UI — Split-pane ratatui display showing responses as they stream in
- 🏆 LLM-as-judge scoring — Automated evaluation on 5 criteria (1-10 scale each, 50 max)
- 🔌 Any model, one endpoint — Works with LiteLLM, OpenRouter, Ollama, or any OpenAI-compatible API
- 📡 Streaming mode — Watch responses arrive token-by-token with `--stream`
- 📋 Plain text mode — `--no-tui` for piping output or CI/automation
- 📊 JSON output — `--json` for machine-readable results with latency data
- ⏱️ Latency tracking — Per-model response time in milliseconds
- 🔧 Dynamic model list — Add or remove models by editing one env var, no code changes
- 💪 Fault tolerant — One model failing doesn't kill the others
brew tap tylerwillis/tap
brew install tokenwar

This installs both `tokenwar` and `tw` (shorthand alias).
Download the latest release for your platform:
# macOS (Apple Silicon)
curl -L https://github.com/tylerwillis/tokenwar/releases/latest/download/tokenwar-macos-aarch64.tar.gz | tar xz
sudo mv tokenwar /usr/local/bin/
# macOS (Intel)
curl -L https://github.com/tylerwillis/tokenwar/releases/latest/download/tokenwar-macos-x86_64.tar.gz | tar xz
sudo mv tokenwar /usr/local/bin/
# Linux (x86_64)
curl -L https://github.com/tylerwillis/tokenwar/releases/latest/download/tokenwar-linux-x86_64.tar.gz | tar xz
sudo mv tokenwar /usr/local/bin/

Requires Rust 1.70+:
git clone https://github.com/tylerwillis/tokenwar.git
cd tokenwar
cargo build --release
# Binary at target/release/tokenwar

TokenWar talks to a single OpenAI-compatible endpoint. You need a proxy that routes to multiple providers. Pick one:
LiteLLM gives you one API for 100+ models with zero token markup. Best for homelabbers.
# Install
pipx install 'litellm[proxy]'
# Create config (litellm_config.yaml)
cat > litellm_config.yaml << 'EOF'
model_list:
- model_name: gpt-4o
litellm_params:
model: openai/gpt-4o
api_key: os.environ/OPENAI_API_KEY
- model_name: gpt-4o-mini
litellm_params:
model: openai/gpt-4o-mini
api_key: os.environ/OPENAI_API_KEY
- model_name: gemini-2.5-flash
litellm_params:
model: gemini/gemini-2.5-flash
api_key: os.environ/GEMINI_API_KEY
- model_name: claude-sonnet-4
litellm_params:
model: anthropic/claude-sonnet-4-20250514
api_key: os.environ/ANTHROPIC_API_KEY
general_settings:
master_key: sk-tokenwar-local
EOF
# Start the proxy
litellm --config litellm_config.yaml --port 4000

Or with Docker:
docker run -d --name litellm \
-p 4000:4000 \
-v $(pwd)/litellm_config.yaml:/app/config.yaml \
-e OPENAI_API_KEY=$OPENAI_API_KEY \
-e GEMINI_API_KEY=$GEMINI_API_KEY \
-e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml

No self-hosting required. One API key, 200+ models, small per-token markup.
- Get an API key at openrouter.ai
- Set `base_url = "https://openrouter.ai/api/v1"` in your `tokenwar.toml` (see Configuration)
For comparing local models with zero API costs:
ollama serve # starts on localhost:11434
ollama pull llama3.1
ollama pull mistral

Set `base_url = "http://localhost:11434/v1"` and `api_key = "ollama"` in your `tokenwar.toml` (see Configuration).
TokenWar looks for tokenwar.toml in the current directory, or falls back to ~/.config/tokenwar/tokenwar.toml for a global config.
# Global config (recommended)
mkdir -p ~/.config/tokenwar
cp tokenwar.example.toml ~/.config/tokenwar/tokenwar.toml
# Or local config (per-project)
cp tokenwar.example.toml tokenwar.toml

Edit your config:
base_url = "http://localhost:4000/v1"
api_key = "sk-tokenwar-local"
[[models]]
name = "gpt-4o"
model = "gpt-4o"
[[models]]
name = "gpt-4o-mini"
model = "gpt-4o-mini"
judge_model = "gpt-4o"
timeout_secs = 60
[limits]
max_concurrent_runs = 4
max_images = 8
max_total_image_bytes = 20971520 # 20 MiB
max_completed_runs = 50

Model names must match what your proxy expects. For LiteLLM, these are the `model_name` values in your config. For OpenRouter, use their model IDs (e.g. `openai/gpt-4o`).
If you prefer keeping secrets out of tokenwar.toml, you can override these via env vars:
- `TOKENWAR_BASE_URL`
- `TOKENWAR_API_KEY`
- `TOKENWAR_JUDGE_MODEL`
- `TOKENWAR_JUDGE_BASE_URL`
- `TOKENWAR_JUDGE_API_KEY`
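The override behavior (an env var, when set, wins over the TOML value) can be sketched as below; `TOKENWAR_BASE_URL` is the real variable name, while the fallback value stands in for whatever your config file contains:

```rust
use std::env;

fn main() {
    // Value loaded from tokenwar.toml (illustrative).
    let from_config = String::from("http://localhost:4000/v1");

    // Env var takes precedence when present; otherwise fall back to the file.
    let base_url = env::var("TOKENWAR_BASE_URL").unwrap_or(from_config);

    println!("using base_url = {base_url}");
}
```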
# Pass prompt as argument
tokenwar "Explain the difference between TCP and UDP"
# Pipe from stdin
echo "Write a haiku about Rust" | tokenwar
# From a file
tokenwar < prompt.txt

# Stream responses token-by-token in the TUI
tokenwar --stream "What is quantum computing?"
# Plain text output (no TUI, good for scripts/CI)
tokenwar --no-tui "Compare REST vs GraphQL"
# JSON output (machine-readable, includes latency per model)
tokenwar --json "Compare REST vs GraphQL"
# Custom timeout (overrides config)
tokenwar --timeout-secs 120 "Write a detailed essay on climate change"
# Combine flags
tokenwar --stream --timeout-secs 90 "Explain monads to a 5-year-old"
# Web UI (streams responses live in the browser)
tokenwar --web --port 8080 --config tokenwar.toml

The `--json` flag outputs structured JSON for programmatic consumption:
{
"prompt": "What is 2+2?",
"providers": [
{
"name": "gpt-4o",
"model": "gpt-4o",
"response_text": "2 + 2 = 4.",
"error": null,
"latency_ms": 1234,
"ttft_ms": 210
},
{
"name": "gemini-2.5-flash",
"model": "gemini-2.5-flash",
"response_text": "The answer is 4.",
"error": null,
"latency_ms": 987,
"ttft_ms": 160
}
],
"scores": [],
"metadata": {
"timestamp": 1738492800,
"timeout_secs": 60,
"stream": false
}
}

| Key | Action |
|---|---|
| `q` | Exit TUI |
| `Tab` | Focus next panel |
| `Shift+Tab` | Focus previous panel |
| `j` / `↓` | Scroll down in the active panel |
| `k` / `↑` | Scroll up in the active panel |
| `Space` | Toggle fullscreen on focused panel |
| `Esc` | Exit fullscreen mode |
| `c` | Copy focused panel's response to clipboard |
The TUI stays open after responses complete so you can review, scroll, and copy. Press q to exit and see the judge scoreboard.
- Enter runs the prompt (Shift+Enter inserts a newline)
- Paste images from the clipboard (limits enforced by `tokenwar.toml`)
- q cancels the run
- Tab / Shift+Tab cycles the active panel
- j/k or ↑/↓ scrolls the active panel
┌─────────────────────────────┐
prompt │ OpenAI-compatible endpoint │
┌───────────▶│ (LiteLLM / OpenRouter / ...) │────┐
│ └─────────────────────────────┘ │
│ │
┌──────┴──┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ ┌───────┐ ┌───────┐
│ User │───▶│ Model A │ │ Model B │ │ Model C │─────┼───▶│ TUI │───▶│ Judge │
│ Prompt │ └─────────┘ └─────────┘ └─────────┘ │ └───────┘ └───────┘
└─────────┘ │
All calls are concurrent (tokio) │
- Dispatch — Your prompt is sent to all configured models simultaneously
- Collect — Responses stream back via mpsc channels and render in the TUI
- Judge — All responses are sent to the judge model for structured scoring
- Report — Scoreboard with rankings and per-criteria reasoning
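The dispatch-and-collect flow above can be sketched with std-library threads and an mpsc channel. This is a simplified stand-in for the tokio tasks the tool actually uses, and the model names and canned responses are purely illustrative:

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // Illustrative model list; the real names come from tokenwar.toml.
    let models = ["gpt-4o", "gemini-2.5-flash", "claude-sonnet-4"];
    let (tx, rx) = mpsc::channel();

    // Dispatch: one worker per model, all running concurrently.
    for model in models {
        let tx = tx.clone();
        thread::spawn(move || {
            // Stand-in for the HTTP call to the OpenAI-compatible endpoint.
            let response = format!("{model}: <response text>");
            tx.send(response).unwrap();
        });
    }
    drop(tx); // close the channel so the receiver loop terminates

    // Collect: responses arrive in completion order, as they do in the TUI.
    let responses: Vec<String> = rx.iter().collect();
    assert_eq!(responses.len(), models.len());
    for r in &responses {
        println!("{r}");
    }
}
```

The same shape carries over to tokio: replace `thread::spawn` with spawned tasks and the blocking channel with an async one, and the collect loop becomes the TUI's render loop.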
The judge evaluates each response on a 1-10 scale:
| Criterion | What it measures |
|---|---|
| Accuracy | Is the information correct and factual? |
| Helpfulness | Does it address what the user actually needs? |
| Clarity | Is it well-structured and easy to understand? |
| Creativity | Does it show original thinking or novel approaches? |
| Conciseness | Is it appropriately detailed without being verbose? |
Total: /50 — The judge provides brief reasoning for each score.
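A model's total is simply the sum of its five criterion scores. As a sanity check, the claude-sonnet-4 scores from the example output earlier work out like this:

```rust
fn main() {
    // Criterion scores (1-10 each): accuracy, helpfulness, clarity,
    // creativity, conciseness -- taken from the example output above.
    let scores = [9.0_f64, 8.5, 8.5, 8.0, 8.0];
    let total: f64 = scores.iter().sum();
    assert_eq!(total, 42.0); // matches "claude-sonnet-4 - 42.0/50"
    println!("{total:.1}/50"); // prints "42.0/50"
}
```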
Tip: Use a different judge model than the contestants to reduce self-preference bias. If you're comparing Claude vs GPT, use Gemini as the judge.
- Compare anything — same family (gpt-4o vs gpt-4o-mini), cross-provider (Claude vs GPT vs Gemini), or local vs cloud (Llama vs GPT-4o)
- Run the same prompt multiple times — LLM outputs are non-deterministic, so scores will vary
- Use `--json` for automation — pipe to `jq`, build dashboards, track model quality over time
- Per-model overrides — point specific models at different endpoints (e.g., one model direct, rest through proxy)
- No model limit — compare 2 models or 20; the TUI grid auto-layouts
- JSON output mode
- Per-model latency tracking
- Unified OpenAI-compatible endpoint (LiteLLM/OpenRouter)
- Dynamic model list (no code changes to add models)
- Auto-layout TUI grid for any number of models
- Multi-turn conversation support
- Token usage and cost tracking per model
- Configurable scoring criteria
- Export results to CSV
- Time-to-first-token latency tracking
- Side-by-side diff view for similar responses
MIT