Parent Epic
Part of #257821 — Extend @kbn/evals with advanced evaluation capabilities
Summary
Add Kibana Lens trend dashboards as an additive layer on top of the existing evals Kibana plugin. The evals plugin already provides runs list, run detail, and trace waterfall. This phase adds historical trend analysis and cross-model comparison visualizations using Kibana's native Lens capabilities.
Context
The kibana-evaluations data stream already contains all score data. The evals plugin provides a React UI for browsing runs. What's missing is trend analysis over time — how are scores changing across builds, branches, and models?
Lens Visualizations to Create
- Score Trend: line chart of mean evaluator scores over time (x: @timestamp, y: evaluator.score, breakdown: evaluator.name)
- Model Comparison: bar chart comparing mean scores across task.model.id for the same dataset
- Evaluator Heatmap: heatmap of evaluator.score by example.dataset.name x evaluator.name
- Pass Rate Over Runs: area chart of pass rates (score >= threshold) across run_id
- Token Usage: stacked bar of input/output/cached tokens per dataset
- Latency Distribution: histogram of latency evaluator scores
- Regression Highlight: metric showing the score delta between the latest two runs, with color coding
- Per-Suite Breakdown: table of datasets with mean/median/stdDev/min/max per evaluator
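To pin down what "pass rate" and "regression highlight" mean in the list above, here is a minimal TypeScript sketch of the arithmetic. It is not how Lens would compute these (Lens would use its own formulas and aggregations). The function names, the `tolerance` dead band, and the color names are assumptions made for illustration only.

```typescript
// Fraction of scores at or above the threshold (0 when there are no scores).
function passRate(scores: number[], threshold: number): number {
  if (scores.length === 0) return 0;
  const passed = scores.filter((s) => s >= threshold).length;
  return passed / scores.length;
}

// Arithmetic mean, defined as 0 for an empty list to keep the sketch total.
function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

type DeltaColor = 'green' | 'red' | 'gray';

// Difference in mean score between the previous run and the latest run,
// plus a color for the metric panel. `tolerance` is an assumed dead band
// so that small fluctuations are not flagged as a regression.
function regressionHighlight(
  previous: number[],
  latest: number[],
  tolerance = 0.01
): { delta: number; color: DeltaColor } {
  const delta = mean(latest) - mean(previous);
  let color: DeltaColor = 'gray';
  if (delta > tolerance) color = 'green';
  if (delta < -tolerance) color = 'red';
  return { delta, color };
}
```

For example, `passRate([0.9, 0.4, 0.7, 1.0], 0.7)` counts three of four scores as passing, and `regressionHighlight([0.8, 0.8], [0.5, 0.7])` reports a negative delta colored red.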
Dashboard
A single "Evals Trends" dashboard combining the above, with controls for:
- Run ID selector (dropdown)
- Model filter
- Date range
- Dataset filter
- Evaluator name filter
Bootstrap CLI Command
node scripts/evals dashboard # Create/update saved objects in Kibana
node scripts/evals dashboard --delete # Remove saved objects
Uses the Kibana saved objects API to import NDJSON with data view, visualizations, and dashboard. All objects use a deterministic ID prefix (evals-) for idempotent updates.
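The deterministic-ID idea could be sketched roughly as follows. This is an illustration under assumptions, not the planned implementation: the `SavedObjectSketch` shape, the `evalsId` and `toNdjson` helpers, the slugs, and the reference names are hypothetical, and the attributes are placeholders (a real Lens or dashboard saved object carries much more state).

```typescript
// Minimal saved-object shape for this sketch (hypothetical; real objects have more fields).
interface SavedObjectSketch {
  id: string;
  type: string;
  attributes: Record<string, unknown>;
  references: Array<{ id: string; name: string; type: string }>;
}

const ID_PREFIX = 'evals-';

// The same slug always yields the same ID, so a repeated import
// overwrites the existing object instead of creating a duplicate.
function evalsId(slug: string): string {
  return `${ID_PREFIX}${slug}`;
}

// NDJSON is one JSON document per line.
function toNdjson(objects: SavedObjectSketch[]): string {
  return objects.map((o) => JSON.stringify(o)).join('\n');
}

const dataView: SavedObjectSketch = {
  id: evalsId('data-view'),
  type: 'index-pattern', // saved-object type used for data views
  attributes: { title: 'kibana-evaluations', timeFieldName: '@timestamp' },
  references: [],
};

const scoreTrend: SavedObjectSketch = {
  id: evalsId('score-trend'),
  type: 'lens',
  attributes: { title: 'Score Trend' }, // placeholder: Lens state omitted
  references: [{ id: dataView.id, name: 'data-view', type: 'index-pattern' }],
};

const dashboard: SavedObjectSketch = {
  id: evalsId('trends-dashboard'),
  type: 'dashboard',
  attributes: { title: 'Evals Trends' }, // placeholder: panel layout omitted
  references: [{ id: scoreTrend.id, name: 'panel-score-trend', type: 'lens' }],
};

const ndjson = toNdjson([dataView, scoreTrend, dashboard]);
```

A string like `ndjson` would then be uploaded as a file to the saved objects import endpoint (`POST /api/saved_objects/_import` with `overwrite=true`). Because the IDs are stable, `--delete` could presumably target the same set of IDs.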
Files to Create
kbn-evals/src/dashboard/saved_objects.ts (generates NDJSON)
kbn-evals/src/cli/commands/dashboard.ts
Dependencies
- Independent: reads the existing kibana-evaluations data stream
- More valuable after Phases 1-2 add more evaluator data
Acceptance Criteria
- node scripts/evals dashboard creates all saved objects
- --delete cleanly removes all created saved objects