Quickstart
From zero to your first benchmark report in 3 minutes.
Run evaly demo to browse 4 real benchmark case studies in your browser — no API keys needed.
Install
pip install evalytic
Requires Python 3.10+. Installs the CLI, VLM judge, and Rich terminal output (~5 MB).
Setup
One API key, two commands. Get your FAL_KEY from
fal.ai/dashboard/keys ($10 free credit),
then:
# Set your API key
export FAL_KEY=your_fal_key
# Write config (judge uses fal.ai too — no other keys needed)
cat > evalytic.toml << 'EOF'
[keys]
fal = "${FAL_KEY}"
[bench]
judge = "fal/gemini-2.5-flash"
EOF
FAL_KEY handles both image generation and VLM judging (via fal/gemini-2.5-flash).
No separate Gemini or OpenAI key required.
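The `${FAL_KEY}` placeholder in evalytic.toml is resolved from your environment at load time. A minimal sketch of how that kind of substitution works, using the standard library rather than Evalytic's own config loader:

```python
import os

# Stand-in for the [keys] entry in evalytic.toml.
config_line = 'fal = "${FAL_KEY}"'

# Normally set in your shell via `export FAL_KEY=...`.
os.environ["FAL_KEY"] = "fal_demo_123"

# os.path.expandvars substitutes $VAR and ${VAR} from the environment.
resolved = os.path.expandvars(config_line)
print(resolved)  # fal = "fal_demo_123"
```

If FAL_KEY is unset, expandvars leaves the placeholder untouched, which is a useful signal that the export step was skipped.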
Interactive wizard (alternative)
Prefer a guided setup? Run evaly init — it walks you through use case selection, API key collection, and config generation:
$ evaly init
? What do you want to evaluate? [text2img / img2img / both]
? fal.ai API key: fal_...  validated
.env written (1 key)
evalytic.toml written
Ready! Run: evaly bench -y
Note: evaly init is interactive and requires a TTY. For CI/CD or agentic workflows, use the manual setup above.
First Benchmark
evaly bench -y
Zero arguments needed. Smart defaults: generates one image with Flux Schnell, scores it with Gemini, prints a terminal report. Done in ~15 seconds.
$ evaly bench -y

Evalytic Bench
Models: flux-schnell | Prompts: 1 | Dimensions: auto
Est. cost: ~$0.01

Generating... flux-schnell: 1/1
Scoring...    fal/gemini-2.5-flash: 2/2

Rankings
┌────────────────┬─────────────────┬──────────────────┬─────────┐
│ Model          │ visual_quality  │ prompt_adherence │ Overall │
├────────────────┼─────────────────┼──────────────────┼─────────┤
│ flux-schnell   │ 4.5             │ 4.0              │ 4.2     │
└────────────────┴─────────────────┴──────────────────┴─────────┘

Cost: $0.004 gen + $0.000 judge = $0.004 total
Evalytic generated an image with Flux Schnell, scored it with Gemini, and printed the results — all from a single command.
Compare Models
Create a prompts.json file with your test prompts:
[
"A photorealistic cat on a windowsill at sunset",
"A modern minimalist logo for 'ACME Corp'",
"Product photo: white sneakers on marble",
"A watercolor painting of a mountain landscape"
]
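If you generate prompts.json programmatically, it is worth checking the shape shown above (a flat JSON array of strings) before starting a paid run. A small validation sketch:

```python
import json
from pathlib import Path

prompts = [
    "A photorealistic cat on a windowsill at sunset",
    "A modern minimalist logo for 'ACME Corp'",
]
Path("prompts.json").write_text(json.dumps(prompts, indent=2))

# Re-read and validate: a flat JSON array of non-empty strings.
loaded = json.loads(Path("prompts.json").read_text())
assert isinstance(loaded, list), "prompts.json must be a JSON array"
assert all(isinstance(p, str) and p.strip() for p in loaded), \
    "every prompt must be a non-empty string"
print(f"{len(loaded)} prompts OK")
```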
Then run a multi-model benchmark with an HTML report:
evaly bench \
-m flux-schnell -m flux-dev -m flux-pro \
-p prompts.json \
-o report.html \
--review
The --review flag opens an interactive HTML report in your browser with side-by-side image comparison, per-dimension scores, radar charts, and cost breakdown.
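Before launching a larger sweep, you can ballpark the generation bill yourself: models × prompts × per-image price. A rough sketch using the flux-schnell and flux-pro prices quoted in the case studies below; the flux-dev price here is a placeholder assumption, not a published figure:

```python
# Per-image generation prices in USD.
# flux-schnell and flux-pro: quoted on this page; flux-dev: assumed for illustration.
PRICE_PER_IMAGE = {
    "flux-schnell": 0.003,
    "flux-dev": 0.025,   # placeholder assumption
    "flux-pro": 0.05,
}

n_prompts = 4  # entries in prompts.json above
total = sum(price * n_prompts for price in PRICE_PER_IMAGE.values())
print(f"~${total:.2f} for {n_prompts * len(PRICE_PER_IMAGE)} images")
```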
Score an Existing Image
Already have images? Score one directly against its prompt, with no generation step:
evaly eval --image photo.jpg --prompt "A product photo of sneakers"
Only needs a judge key (FAL_KEY or GEMINI_API_KEY). See evaly eval for full reference.
Real Results
These benchmarks were run with Evalytic — same CLI you just installed.
Do I really need the flagship model?
Schnell scores 4.3 at $0.003/img. Pro scores 4.7 at $0.05. Is 0.4 points worth 16× the cost? 3 Flux models compared.
See benchmark →
Is my product photo still my product?
AI edits warp shapes, lose logos, change colors. Input fidelity scoring catches every drift. seedream-edit leads at 5.0/5.
See benchmark →
Why do users say "that's not me"?
Face edits lose identity. ArcFace + VLM judges agree (r=0.99). flux-dev-i2i scores 0.04 face similarity — unusable.
See benchmark →
Local Metrics
Sharpness (Variance of Laplacian) is built-in and always active — no extra install. For heavier deterministic metrics, install the metrics extra:
pip install "evalytic[metrics]" # adds CLIP Score, LPIPS, ArcFace (~2 GB)
Once installed, CLIP (text2img) and LPIPS (img2img) are auto-enabled alongside sharpness. Use --no-metrics to disable all, or --metrics face to add ArcFace for identity preservation.
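Identity-preservation scores like the 0.04 face similarity quoted above are typically a cosine similarity between face-embedding vectors: near 1.0 means the same identity, near 0 means unrelated faces. A generic sketch of that comparison in plain Python — toy vectors standing in for real ArcFace embeddings, not Evalytic's pipeline:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 4-dim "embeddings"; real ArcFace embeddings are 512-dim.
source_face = [0.2, 0.9, 0.1, 0.4]
well_preserved = [0.21, 0.88, 0.12, 0.41]  # nearly the same direction
identity_lost = [0.9, -0.2, 0.4, -0.1]     # very different direction

print(round(cosine_similarity(source_face, well_preserved), 2))  # 1.0
print(round(cosine_similarity(source_face, identity_lost), 2))   # 0.0
```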
Use --no-judge to skip the VLM judge and rely only on local metrics (sharpness, CLIP, LPIPS). No judge API key is required, and results are deterministic and free.
evaly bench -m flux-schnell -p "A cat" --no-judge -y
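For intuition on the built-in sharpness metric: variance of the Laplacian measures how strongly pixel values change relative to their neighbours, so crisp edges score high and smooth gradients score low. A minimal pure-Python sketch of the idea, not Evalytic's actual implementation:

```python
def laplacian_variance(img):
    """img: 2-D list of grayscale values. Higher variance = sharper edges."""
    h, w = len(img), len(img[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # 4-neighbour Laplacian kernel: [[0,1,0],[1,-4,1],[0,1,0]]
            lap = (img[y - 1][x] + img[y + 1][x]
                   + img[y][x - 1] + img[y][x + 1]
                   - 4 * img[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)

sharp = [[0, 0, 255, 255]] * 4    # hard vertical edge
blurry = [[0, 85, 170, 255]] * 4  # smooth gradient across the same range
print(laplacian_variance(sharp) > laplacian_variance(blurry))  # True
```

Production implementations usually run the same kernel via OpenCV (cv2.Laplacian) for speed, but the score means the same thing.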