Evals for visual AI. Automated quality evaluation for AI-generated images and video.
Know if your AI-generated visuals are good — before your users tell you they're not.
```shell
pip install evalytic

evaly bench \
  -m flux-schnell -m flux-dev -m flux-pro \
  -p "A photorealistic cat on a windowsill" \
  -o report.html --yes
```

Evalytic benchmarks AI image generation models by generating images, scoring them with VLM judges (Gemini, GPT, Claude, Ollama), and producing rich reports — all in one command.
- Model Selection — Compare Flux Schnell vs Dev vs Pro with real prompts
- Prompt Optimization — Measure how well models follow your prompts
- Regression Detection — Catch quality drops when models update
- CI/CD Quality Gate — Block deploys when image quality falls below threshold
- 7 Semantic Dimensions — `visual_quality`, `prompt_adherence`, `text_rendering`, `input_fidelity`, `transformation_quality`, `artifact_detection`, `identity_preservation`
- Consensus Judging — Multi-judge scoring with automatic agreement analysis
Try the demo (no API key needed):

```shell
pip install evalytic

evaly demo            # Opens showcase with 4 real benchmark case studies
evaly demo face       # Face identity preservation comparison
evaly demo flagship   # Flux Schnell vs Dev vs Pro cost/quality
```

Set your API keys:

```shell
export FAL_KEY=your_fal_key            # fal.ai for image generation
export GEMINI_API_KEY=your_gemini_key  # Default judge
```

Then run a benchmark:

```shell
# Single model benchmark
evaly bench -m flux-schnell -p "A cat sitting on a windowsill" --yes

# Compare models with HTML report
evaly bench -m flux-schnell -m flux-dev -m flux-pro \
  -p prompts.json -o report.html --review

# img2img benchmark
evaly bench -m flux-kontext -m seedream-edit -m reve-edit \
  -p prompts.json --input product.jpg --yes

# Score an existing image
evaly eval --image output.png --prompt "A sunset over mountains"

# CI/CD quality gate
evaly gate --report report.json --threshold 3.5
```

| Command | Description |
|---|---|
| `evaly demo` | Browse real benchmark showcases (no API key needed) |
| `evaly bench` | Generate, score, and report in one command |
| `evaly eval` | Score a single image without generation |
| `evaly gate` | CI/CD quality gate with pass/fail exit codes |
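Because `evaly gate` exits with a pass/fail code, it can gate a CI pipeline directly. A hypothetical GitHub Actions step is sketched below — the workflow step name and the assumption that `bench` can write a JSON report via `-o report.json` are illustrative, not documented behavior:

```yaml
# Hypothetical CI step: fail the job when average image quality drops below 3.5.
- name: Image quality gate
  run: |
    evaly bench -m flux-schnell -p prompts.json -o report.json --yes
    evaly gate --report report.json --threshold 3.5
  env:
    FAL_KEY: ${{ secrets.FAL_KEY }}
    GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
```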
Any VLM that can analyze images works as a judge:

```shell
evaly bench -m flux-schnell -p "A cat" -j gemini-2.5-flash             # Default
evaly bench -m flux-schnell -p "A cat" -j gemini-2.5-pro               # Gemini Pro
evaly bench -m flux-schnell -p "A cat" -j openai/gpt-5.2               # OpenAI
evaly bench -m flux-schnell -p "A cat" -j anthropic/claude-sonnet-4-6  # Anthropic
evaly bench -m flux-schnell -p "A cat" -j ollama/qwen2.5-vl:7b         # Local
```

Use multiple judges for more reliable scores:

```shell
evaly bench -m flux-schnell -p "A cat" \
  --judges "gemini-2.5-flash,openai/gpt-5.2"
```

Two judges score in parallel. If they disagree, a third breaks the tie.
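The tie-break idea can be pictured with a short sketch. This is an illustration of the general pattern only — the function name, the 1–5 score scale, and the 1.0-point disagreement threshold are assumptions, not evalytic's actual implementation:

```python
from statistics import median

# Illustrative only: hypothetical two-judge consensus with a third tie-breaker.
# The score scale and threshold are assumed, not taken from evalytic.
DISAGREEMENT_THRESHOLD = 1.0  # max score gap before a tie-breaker is called


def consensus_score(judge_a: float, judge_b: float, tie_breaker) -> float:
    """Average two judge scores; on disagreement, take the median of three."""
    if abs(judge_a - judge_b) <= DISAGREEMENT_THRESHOLD:
        return (judge_a + judge_b) / 2
    # Judges disagree: call a third judge and let the median decide.
    third = tie_breaker()
    return median([judge_a, judge_b, third])


# Agreeing judges: simple average.
print(consensus_score(4.0, 4.5, lambda: 0.0))   # 4.25
# Disagreeing judges: the third judge's score pulls the median.
print(consensus_score(2.0, 5.0, lambda: 4.5))   # 4.5
```

The median is a common choice here because it discards a single outlier judge rather than averaging it in.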
Local deterministic metrics are auto-enabled when installed:

```shell
pip install "evalytic[metrics]"  # CLIP Score + LPIPS + ArcFace + NIMA (~2GB)
pip install "evalytic[ocr]"      # OCR accuracy for text rendering prompts
pip install "evalytic[all]"      # Everything
```

Run without VLM judges (free, deterministic):

```shell
evaly bench -m flux-schnell -p "A cat" --no-judge
```

Create `evalytic.toml` in your project root:
```toml
[keys]
fal = "your_fal_key"
gemini = "your_gemini_key"

[bench]
judge = "gemini-2.5-flash"
dimensions = ["visual_quality", "prompt_adherence"]
concurrency = 4
```

Full docs at docs.evalytic.ai.
MIT