
⚔️ TokenWar

Compare LLM responses side-by-side in your terminal, then let an AI judge score them.

TokenWar sends the same prompt to multiple LLM models via any OpenAI-compatible endpoint, displays their responses in a split-pane TUI, and runs an LLM-as-judge evaluation scoring each response on accuracy, helpfulness, clarity, creativity, and conciseness.

┌─────────────────────┬──────────────────────┬─────────────────────┐
│ claude-sonnet-4     │ gpt-4o               │ grok-3              │
│                     │                      │                     │
│ The Rust ownership  │ Rust's ownership     │ In Rust, ownership  │
│ system ensures      │ model is a set of    │ is the core concept │
│ memory safety       │ rules that the       │ that makes memory   │
│ without a garbage   │ compiler checks at   │ safe without GC...  │
│ collector...        │ compile time...      │                     │
│                     │                      │                     │
├─────────────────────┴───────────┬──────────┴─────────────────────┤
│ gemini-2.5-flash                │ llama-3.1-70b                  │
│                                 │                                │
│ Ownership in Rust is a          │ Rust uses an ownership model   │
│ discipline enforced by the      │ where each value has exactly   │
│ compiler that governs how       │ one owner at a time...         │
│ memory is managed...            │                                │
│                                 │                                │
└─────────────────────────────────┴────────────────────────────────┘

After all responses arrive, the judge scores them:

=== Scoreboard ===
1. claude-sonnet-4 - 42.0/50
2. gemini-2.5-flash - 40.5/50
3. gpt-4o - 39.0/50
4. grok-3 - 38.5/50
5. llama-3.1-70b - 37.0/50

=== Details ===

claude-sonnet-4:
  Accuracy: 9.0 (Correct and precise explanation of ownership rules)
  Helpfulness: 8.5 (Directly addresses the question with practical examples)
  Clarity: 8.5 (Well-structured with clear progression of concepts)
  Creativity: 8.0 (Novel analogy comparing ownership to real-world lending)
  Conciseness: 8.0 (Thorough but not verbose)

gpt-4o:
  Accuracy: 8.5 (Accurate coverage of core concepts)
  Helpfulness: 8.0 (Good overview but fewer practical examples)
  ...

Why TokenWar?

When it's better than just using Claude or ChatGPT

Use Case | Why TokenWar Wins
Evaluating models for your use case | See how multiple models handle your actual prompts, not benchmarks
Reducing bias in model selection | An independent judge scores responses — not your gut feeling
Catching hallucinations | If 4 models agree and 1 doesn't, you've likely found a hallucination
Prompt engineering | Instantly see how different models interpret the same prompt
Choosing a model for production | Real response quality + latency data, not marketing claims
Creative work | Compare writing styles, get multiple angles on the same topic
Factual research | Cross-reference answers across models for higher confidence
Cost optimization | If a cheaper model scores comparably, you've found your winner

Example: You're building a customer support bot. You write 10 representative prompts, run them through TokenWar, and discover that for your specific domain, Gemini outperforms GPT-4o while costing less. You'd never know this from public benchmarks.

When you should just use Claude or ChatGPT

Situation | Why TokenWar is Overkill
Quick one-off questions | You just need an answer, not a comparison
Conversational/multi-turn chat | TokenWar is single-turn only — no follow-ups
You already know your preferred model | No need to compare if you're happy
Cost-sensitive usage | TokenWar calls N models + a judge = (N+1)x the cost of one model
Image/audio/video tasks | TokenWar is text-only
You need tool use or function calling | TokenWar sends plain prompts, no tool schemas

Features

  • ⚡ Concurrent API calls — All models queried simultaneously via tokio
  • 📺 Terminal UI — Split-pane ratatui display showing responses as they stream in
  • 🏆 LLM-as-judge scoring — Automated evaluation on 5 criteria (1-10 scale each, 50 max)
  • 🔌 Any model, one endpoint — Works with LiteLLM, OpenRouter, Ollama, or any OpenAI-compatible API
  • 📡 Streaming mode — Watch responses arrive token-by-token with --stream
  • 📋 Plain text mode — --no-tui for piping output or CI/automation
  • 📊 JSON output — --json for machine-readable results with latency data
  • ⏱️ Latency tracking — Per-model response time in milliseconds
  • 🔧 Dynamic model list — Add or remove models by editing one env var, no code changes
  • 💪 Fault tolerant — One model failing doesn't kill the others

Installation

Homebrew (macOS/Linux)

brew tap tylerwillis/tap
brew install tokenwar

This installs both tokenwar and tw (shorthand alias).

Direct Download

Download the latest release for your platform:

# macOS (Apple Silicon)
curl -L https://github.com/tylerwillis/tokenwar/releases/latest/download/tokenwar-macos-aarch64.tar.gz | tar xz
sudo mv tokenwar /usr/local/bin/

# macOS (Intel)
curl -L https://github.com/tylerwillis/tokenwar/releases/latest/download/tokenwar-macos-x86_64.tar.gz | tar xz
sudo mv tokenwar /usr/local/bin/

# Linux (x86_64)
curl -L https://github.com/tylerwillis/tokenwar/releases/latest/download/tokenwar-linux-x86_64.tar.gz | tar xz
sudo mv tokenwar /usr/local/bin/

From Source

Requires Rust 1.70+:

git clone https://github.com/tylerwillis/tokenwar.git
cd tokenwar
cargo build --release
# Binary at target/release/tokenwar

Proxy Setup

TokenWar talks to a single OpenAI-compatible endpoint. You need a proxy that routes to multiple providers. Pick one:

Option A: LiteLLM (self-hosted, recommended)

LiteLLM gives you one API for 100+ models with zero token markup. Best for homelabbers.

# Install
pipx install 'litellm[proxy]'

# Create config (litellm_config.yaml)
cat > litellm_config.yaml << 'EOF'
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY

  - model_name: gemini-2.5-flash
    litellm_params:
      model: gemini/gemini-2.5-flash
      api_key: os.environ/GEMINI_API_KEY

  - model_name: claude-sonnet-4
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

general_settings:
  master_key: sk-tokenwar-local
EOF

# Start the proxy
litellm --config litellm_config.yaml --port 4000

Or with Docker:

docker run -d --name litellm \
  -p 4000:4000 \
  -v $(pwd)/litellm_config.yaml:/app/config.yaml \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  -e GEMINI_API_KEY=$GEMINI_API_KEY \
  -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml

Option B: OpenRouter (hosted, zero setup)

No self-hosting required. One API key, 200+ models, small per-token markup.

  1. Get an API key at openrouter.ai
  2. Set base_url = "https://openrouter.ai/api/v1" in your tokenwar.toml (see Configuration)

Option C: Ollama (fully local)

For comparing local models with zero API costs:

ollama serve  # starts on localhost:11434
ollama pull llama3.1
ollama pull mistral

Set base_url = "http://localhost:11434/v1" and api_key = "ollama" in your tokenwar.toml (see Configuration).
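Putting those two settings together, a minimal tokenwar.toml for Ollama might look like the sketch below. The model entries assume you pulled llama3.1 and mistral as above, and the judge_model choice is just one reasonable option (any pulled model works):

```toml
base_url = "http://localhost:11434/v1"
api_key = "ollama"   # Ollama ignores the key, but the field must be set

[[models]]
name = "llama3.1"
model = "llama3.1"

[[models]]
name = "mistral"
model = "mistral"

judge_model = "llama3.1"
```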

Configuration

TokenWar looks for tokenwar.toml in the current directory, or falls back to ~/.config/tokenwar/tokenwar.toml for a global config.

# Global config (recommended)
mkdir -p ~/.config/tokenwar
cp tokenwar.example.toml ~/.config/tokenwar/tokenwar.toml

# Or local config (per-project)
cp tokenwar.example.toml tokenwar.toml

Edit your config:

base_url = "http://localhost:4000/v1"
api_key = "sk-tokenwar-local"

[[models]]
name = "gpt-4o"
model = "gpt-4o"

[[models]]
name = "gpt-4o-mini"
model = "gpt-4o-mini"

judge_model = "gpt-4o"
timeout_secs = 60

[limits]
max_concurrent_runs = 4
max_images = 8
max_total_image_bytes = 20971520 # 20 MiB
max_completed_runs = 50

Model names must match what your proxy expects. For LiteLLM, these are the model_name values in your config. For OpenRouter, use their model IDs (e.g. openai/gpt-4o).
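For example, a sketch of an OpenRouter config. Only openai/gpt-4o is taken from the text above; the second model ID follows the same provider-prefix pattern but is illustrative — check OpenRouter's model list for exact IDs, and substitute your real API key:

```toml
base_url = "https://openrouter.ai/api/v1"
api_key = "sk-or-REPLACE_ME"   # your OpenRouter API key

[[models]]
name = "gpt-4o"
model = "openai/gpt-4o"

[[models]]
name = "gpt-4o-mini"
model = "openai/gpt-4o-mini"   # illustrative ID; verify against OpenRouter
```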

Optional Environment Overrides

If you prefer keeping secrets out of tokenwar.toml, you can override these via env vars:

  • TOKENWAR_BASE_URL
  • TOKENWAR_API_KEY
  • TOKENWAR_JUDGE_MODEL
  • TOKENWAR_JUDGE_BASE_URL
  • TOKENWAR_JUDGE_API_KEY

Usage

Basic

# Pass prompt as argument
tokenwar "Explain the difference between TCP and UDP"

# Pipe from stdin
echo "Write a haiku about Rust" | tokenwar

# From a file
tokenwar < prompt.txt

Options

# Stream responses token-by-token in the TUI
tokenwar --stream "What is quantum computing?"

# Plain text output (no TUI, good for scripts/CI)
tokenwar --no-tui "Compare REST vs GraphQL"

# JSON output (machine-readable, includes latency per model)
tokenwar --json "Compare REST vs GraphQL"

# Custom timeout (overrides config)
tokenwar --timeout-secs 120 "Write a detailed essay on climate change"

# Combine flags
tokenwar --stream --timeout-secs 90 "Explain monads to a 5-year-old"

# Web UI (streams responses live in the browser)
tokenwar --web --port 8080 --config tokenwar.toml

JSON Output

The --json flag outputs structured JSON for programmatic consumption:

{
  "prompt": "What is 2+2?",
  "providers": [
    {
      "name": "gpt-4o",
      "model": "gpt-4o",
      "response_text": "2 + 2 = 4.",
      "error": null,
      "latency_ms": 1234,
      "ttft_ms": 210
    },
    {
      "name": "gemini-2.5-flash",
      "model": "gemini-2.5-flash",
      "response_text": "The answer is 4.",
      "error": null,
      "latency_ms": 987,
      "ttft_ms": 160
    }
  ],
  "scores": [],
  "metadata": {
    "timestamp": 1738492800,
    "timeout_secs": 60,
    "stream": false
  }
}
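Because the output is plain JSON, post-processing is straightforward. A small Python sketch that ranks models by latency, using a payload in the same shape as the sample above:

```python
import json

# Sample payload mirroring the shape of `tokenwar --json` output above.
raw = '''{
  "providers": [
    {"name": "gpt-4o", "latency_ms": 1234, "ttft_ms": 210},
    {"name": "gemini-2.5-flash", "latency_ms": 987, "ttft_ms": 160}
  ]
}'''

result = json.loads(raw)
# Sort providers by total response time, fastest first.
ranked = sorted(result["providers"], key=lambda p: p["latency_ms"])
for p in ranked:
    print(f'{p["name"]}: {p["latency_ms"]} ms (ttft {p["ttft_ms"]} ms)')
```

In practice you would pipe `tokenwar --json "..."` into a script like this (or into jq) instead of hard-coding the payload.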

TUI Controls

Key | Action
q | Exit TUI
Tab | Focus next panel
Shift+Tab | Focus previous panel
j / ↓ | Scroll down in the active panel
k / ↑ | Scroll up in the active panel
Space | Toggle fullscreen on focused panel
Esc | Exit fullscreen mode
c | Copy focused panel's response to clipboard

The TUI stays open after responses complete so you can review, scroll, and copy. Press q to exit and see the judge scoreboard.

Web Controls

  • Enter runs the prompt (Shift+Enter inserts a newline)
  • Paste images from the clipboard (limits enforced by tokenwar.toml)
  • q cancels the run
  • Tab / Shift+Tab cycles the active panel
  • j/k or ↑/↓ scrolls the active panel

Architecture

                    ┌─────────────────────────────┐
          prompt    │ OpenAI-compatible endpoint   │
       ┌───────────▶│ (LiteLLM / OpenRouter / ...) │────┐
       │            └─────────────────────────────┘    │
       │                                                │
┌──────┴──┐    ┌─────────┐ ┌─────────┐ ┌─────────┐     │    ┌───────┐    ┌───────┐
│  User   │───▶│ Model A │ │ Model B │ │ Model C │─────┼───▶│  TUI  │───▶│ Judge │
│ Prompt  │    └─────────┘ └─────────┘ └─────────┘     │    └───────┘    └───────┘
└─────────┘                                             │
                    All calls are concurrent (tokio)    │
  1. Dispatch — Your prompt is sent to all configured models simultaneously
  2. Collect — Responses stream back via mpsc channels and render in the TUI
  3. Judge — All responses are sent to the judge model for structured scoring
  4. Report — Scoreboard with rankings and per-criteria reasoning
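The dispatch and collect steps above form a simple fan-out/fan-in pipeline. A Python asyncio sketch of the idea — model names and delays here are made up, and the real tool implements this in Rust with tokio tasks and mpsc channels:

```python
import asyncio

# Toy stand-in for an HTTP call to the OpenAI-compatible endpoint.
async def query_model(name: str, delay: float) -> tuple[str, str]:
    await asyncio.sleep(delay)  # simulate network latency
    return name, f"response from {name}"

async def run(prompt: str) -> dict[str, str]:
    # 1. Dispatch: spawn one task per model; all run concurrently.
    tasks = [asyncio.create_task(query_model(name, delay))
             for name, delay in [("model-a", 0.02), ("model-b", 0.01)]]
    # 2. Collect: results arrive in completion order, not dispatch order,
    #    which is what lets the TUI render each response as it finishes.
    responses = {}
    for fut in asyncio.as_completed(tasks):
        name, text = await fut
        responses[name] = text
    # 3/4. Judge + report would send `responses` to the judge model here.
    return responses

responses = asyncio.run(run("What is 2+2?"))
print(sorted(responses))
```

One failed task would surface as an error entry rather than cancelling its siblings, which is the fault-tolerance property listed under Features.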

Scoring Criteria

The judge evaluates each response on a 1-10 scale:

Criterion | What it measures
Accuracy | Is the information correct and factual?
Helpfulness | Does it address what the user actually needs?
Clarity | Is it well-structured and easy to understand?
Creativity | Does it show original thinking or novel approaches?
Conciseness | Is it appropriately detailed without being verbose?

Total: /50 — The judge provides brief reasoning for each score.
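The scoreboard totals are just the five criteria summed. Using the claude-sonnet-4 scores from the Details section above:

```python
# Per-criterion scores from the Details section.
scores = {"accuracy": 9.0, "helpfulness": 8.5, "clarity": 8.5,
          "creativity": 8.0, "conciseness": 8.0}
total = sum(scores.values())
print(f"{total}/50")  # prints 42.0/50, matching the scoreboard
```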

Tip: Use a different judge model than the contestants to reduce self-preference bias. If you're comparing Claude vs GPT, use Gemini as the judge.

Tips

  • Compare anything — same family (gpt-4o vs gpt-4o-mini), cross-provider (Claude vs GPT vs Gemini), or local vs cloud (Llama vs GPT-4o)
  • Run the same prompt multiple times — LLM outputs are non-deterministic, so scores will vary
  • Use --json for automation — pipe to jq, build dashboards, track model quality over time
  • Per-model overrides — point specific models at different endpoints (e.g., one model direct, rest through proxy)
  • No model limit — compare 2 models or 20; the TUI grid auto-layouts
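A sketch of what a per-model override could look like. The per-model base_url and api_key field names here are assumptions for illustration — check tokenwar.example.toml for the exact supported fields:

```toml
[[models]]
name = "llama3.1"
model = "llama3.1"
base_url = "http://localhost:11434/v1"  # hypothetical per-model override
api_key = "ollama"                       # hypothetical per-model override
```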

Roadmap

  • [x] JSON output mode
  • [x] Per-model latency tracking
  • [x] Unified OpenAI-compatible endpoint (LiteLLM/OpenRouter)
  • [x] Dynamic model list (no code changes to add models)
  • [x] Auto-layout TUI grid for any number of models
  • [ ] Multi-turn conversation support
  • [ ] Token usage and cost tracking per model
  • [ ] Configurable scoring criteria
  • [ ] Export results to CSV
  • [x] Time-to-first-token latency tracking
  • [ ] Side-by-side diff view for similar responses

License

MIT
