Quickstart
From zero to your first benchmark report in 3 minutes.
Run evaly demo to browse 4 real benchmark case studies in your browser — no API keys needed.
Install
pip install evalytic
Requires Python 3.10+. Installs the CLI, VLM judge, and Rich terminal output (~5 MB).
Setup
One API key, two commands. Get your FAL_KEY from
fal.ai/dashboard/keys ($10 free credit),
then:
# Set your API key
export FAL_KEY=your_fal_key
# Write config (judge uses fal.ai too — no other keys needed)
cat > evalytic.toml << 'EOF'
[keys]
fal = "${FAL_KEY}"
[bench]
judge = "fal/gemini-2.5-flash"
EOF
FAL_KEY handles both image generation and VLM judging (via fal/gemini-2.5-flash).
No separate Gemini or OpenAI key required.
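The `${FAL_KEY}` placeholder in evalytic.toml is resolved from your environment at load time. A minimal sketch of how that kind of substitution works, using the standard library rather than Evalytic's own config loader:

```python
import os

# Stand-in for the [keys] entry in evalytic.toml.
config_line = 'fal = "${FAL_KEY}"'

# Normally set in your shell via `export FAL_KEY=...`.
os.environ["FAL_KEY"] = "fal_demo_123"

# os.path.expandvars substitutes $VAR and ${VAR} from the environment.
resolved = os.path.expandvars(config_line)
print(resolved)  # fal = "fal_demo_123"
```

If FAL_KEY is unset, expandvars leaves the placeholder untouched, which is a useful signal that the export step was skipped.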
Interactive wizard (alternative)
Prefer a guided setup? Run evaly init — it walks you through use case selection, API key collection, and config generation:
$ evaly init
? What do you want to evaluate? [text2img / img2img / both]
? fal.ai API key: fal_...  validated
.env written (1 key)
evalytic.toml written
Ready! Run: evaly bench -y
Note: evaly init is interactive and requires a TTY. For CI/CD or agentic workflows, use the manual setup above.
First Benchmark
evaly bench -y
Zero arguments needed. Smart defaults: generates one image with Flux Schnell, scores it with Gemini, prints a terminal report. Done in ~15 seconds.
$ evaly bench -y

Evalytic Bench
Models: flux-schnell | Prompts: 1 | Dimensions: auto
Est. cost: ~$0.01

Generating... flux-schnell: 1/1
Scoring...    fal/gemini-2.5-flash: 2/2

Rankings
┌────────────────┬─────────────────┬──────────────────┬─────────┐
│ Model          │ visual_quality  │ prompt_adherence │ Overall │
├────────────────┼─────────────────┼──────────────────┼─────────┤
│ flux-schnell   │ 4.5             │ 4.0              │ 4.2     │
└────────────────┴─────────────────┴──────────────────┴─────────┘

Cost: $0.004 gen + $0.000 judge = $0.004 total
Evalytic generated an image with Flux Schnell, scored it with Gemini, and printed the results — all from a single command.
Compare Models
Create a prompts.json file with your test prompts:
[
"A photorealistic cat on a windowsill at sunset",
"A modern minimalist logo for 'ACME Corp'",
"Product photo: white sneakers on marble",
"A watercolor painting of a mountain landscape"
]
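If you generate prompts.json programmatically, it is worth checking the shape shown above (a flat JSON array of strings) before starting a paid run. A small validation sketch:

```python
import json
from pathlib import Path

prompts = [
    "A photorealistic cat on a windowsill at sunset",
    "A modern minimalist logo for 'ACME Corp'",
]
Path("prompts.json").write_text(json.dumps(prompts, indent=2))

# Re-read and validate: a flat JSON array of non-empty strings.
loaded = json.loads(Path("prompts.json").read_text())
assert isinstance(loaded, list), "prompts.json must be a JSON array"
assert all(isinstance(p, str) and p.strip() for p in loaded), \
    "every prompt must be a non-empty string"
print(f"{len(loaded)} prompts OK")
```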
Then run a multi-model benchmark with an HTML report:
evaly bench \
-m flux-schnell -m flux-dev -m flux-pro \
-p prompts.json \
-o report.html \
--review
The --review flag opens an interactive HTML report in your browser with side-by-side image comparison, per-dimension scores, radar charts, and cost breakdown.
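Before launching a larger sweep, you can ballpark the generation bill yourself: models × prompts × per-image price. A rough sketch using the flux-schnell and flux-pro prices quoted in the case studies below; the flux-dev price here is a placeholder assumption, not a published figure:

```python
# Per-image generation prices in USD.
# flux-schnell and flux-pro: quoted on this page; flux-dev: assumed for illustration.
PRICE_PER_IMAGE = {
    "flux-schnell": 0.003,
    "flux-dev": 0.025,   # placeholder assumption
    "flux-pro": 0.05,
}

n_prompts = 4  # entries in prompts.json above
total = sum(price * n_prompts for price in PRICE_PER_IMAGE.values())
print(f"~${total:.2f} for {n_prompts * len(PRICE_PER_IMAGE)} images")
```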
Score an Existing Image
Already have images? Score one directly against its prompt, with no generation step:
evaly eval --image photo.jpg --prompt "A product photo of sneakers"
Only needs a judge key (FAL_KEY or GEMINI_API_KEY). See evaly eval for full reference.
Real Results
These benchmarks were run with Evalytic — same CLI you just installed.
Do I really need the flagship model?
Schnell scores 4.3 at $0.003/img. Pro scores 4.7 at $0.05. Is 0.4 points worth 16× the cost? 3 Flux models compared.
See benchmark →
Is my product photo still my product?
AI edits warp shapes, lose logos, change colors. Input fidelity scoring catches every drift. seedream-edit leads at 5.0/5.
See benchmark →
Why do users say "that's not me"?
Face edits lose identity. ArcFace + VLM judges agree (r=0.99). flux-dev-i2i scores 0.04 face similarity — unusable.
See benchmark →
Local Metrics
Sharpness (Variance of Laplacian) is built-in and always active — no extra install. For heavier deterministic metrics, install the metrics extra:
pip install "evalytic[metrics]" # adds CLIP Score, LPIPS, ArcFace (~2 GB)
Once installed, CLIP (text2img) and LPIPS (img2img) are auto-enabled alongside sharpness. Use --no-metrics to disable all, or --metrics face to add ArcFace for identity preservation.
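Identity-preservation scores like the 0.04 face similarity quoted above are typically a cosine similarity between face-embedding vectors: near 1.0 means the same identity, near 0 means unrelated faces. A generic sketch of that comparison in plain Python — toy vectors standing in for real ArcFace embeddings, not Evalytic's pipeline:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 4-dim "embeddings"; real ArcFace embeddings are 512-dim.
source_face = [0.2, 0.9, 0.1, 0.4]
well_preserved = [0.21, 0.88, 0.12, 0.41]  # nearly the same direction
identity_lost = [0.9, -0.2, 0.4, -0.1]     # very different direction

print(round(cosine_similarity(source_face, well_preserved), 2))  # 1.0
print(round(cosine_similarity(source_face, identity_lost), 2))   # 0.0
```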
Use --no-judge to skip the VLM judge and rely only on local metrics (sharpness, CLIP, LPIPS). No judge API key is required, and results are deterministic and free.
evaly bench -m flux-schnell -p "A cat" --no-judge -y
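For intuition on the built-in sharpness metric: variance of the Laplacian measures how strongly pixel values change relative to their neighbours, so crisp edges score high and smooth gradients score low. A minimal pure-Python sketch of the idea, not Evalytic's actual implementation:

```python
def laplacian_variance(img):
    """img: 2-D list of grayscale values. Higher variance = sharper edges."""
    h, w = len(img), len(img[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # 4-neighbour Laplacian kernel: [[0,1,0],[1,-4,1],[0,1,0]]
            lap = (img[y - 1][x] + img[y + 1][x]
                   + img[y][x - 1] + img[y][x + 1]
                   - 4 * img[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)

sharp = [[0, 0, 255, 255]] * 4    # hard vertical edge
blurry = [[0, 85, 170, 255]] * 4  # smooth gradient across the same range
print(laplacian_variance(sharp) > laplacian_variance(blurry))  # True
```

Production implementations usually run the same kernel via OpenCV (cv2.Laplacian) for speed, but the score means the same thing.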