[kbn/evals] Phase 4: Kibana Lens trend dashboards for evaluation results #257825

@patrykkopycinski

Parent Epic

Part of #257821 — Extend @kbn/evals with advanced evaluation capabilities

Summary

Add Kibana Lens trend dashboards as an additive layer on top of the existing evals Kibana plugin. The plugin already provides a runs list, a run detail view, and a trace waterfall. This phase adds historical trend analysis and cross-model comparison visualizations built with Kibana's native Lens capabilities.

Context

The kibana-evaluations data stream already contains all score data, and the evals plugin provides a React UI for browsing individual runs. What's missing is trend analysis over time: how scores change across builds, branches, and models.

Lens Visualizations to Create

  1. Score Trend — Line chart of mean evaluator scores over time (x: @timestamp, y: evaluator.score, breakdown: evaluator.name)
  2. Model Comparison — Bar chart comparing mean scores across task.model.id for the same dataset
  3. Evaluator Heatmap — Heatmap of evaluator.score by example.dataset.name x evaluator.name
  4. Pass Rate Over Runs — Area chart of pass rates (score >= threshold) across run_id
  5. Token Usage — Stacked bar of input/output/cached tokens per dataset
  6. Latency Distribution — Histogram of latency evaluator scores
  7. Regression Highlight — Metric showing score delta between latest two runs with color coding
  8. Per-Suite Breakdown — Table of datasets with mean/median/stdDev/min/max per evaluator
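
To make the derived metrics in items 4, 7, and 8 concrete, here is a minimal sketch of the arithmetic behind them. It is not how Lens computes these values internally. The `ScoreDoc` shape, the function names, and the use of population standard deviation are assumptions made for illustration.

```typescript
// Hypothetical, flattened view of one scored example. The real documents
// use dotted fields such as evaluator.score and example.dataset.name.
interface ScoreDoc {
  runId: string;
  dataset: string;
  evaluator: string;
  score: number;
}

interface Summary {
  mean: number;
  median: number;
  stdDev: number;
  min: number;
  max: number;
}

// Item 8: mean, median, population standard deviation, min, and max.
function summarize(scores: number[]): Summary {
  if (scores.length === 0) {
    throw new Error("summarize requires at least one score");
  }
  const sorted = [...scores].sort((a, b) => a - b);
  const n = sorted.length;
  const mean = sorted.reduce((sum, s) => sum + s, 0) / n;
  const mid = Math.floor(n / 2);
  const median = n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  const variance = sorted.reduce((sum, s) => sum + (s - mean) * (s - mean), 0) / n;
  return { mean, median, stdDev: Math.sqrt(variance), min: sorted[0], max: sorted[n - 1] };
}

// Item 4: fraction of scores at or above the threshold.
function passRate(scores: number[], threshold: number): number {
  if (scores.length === 0) return 0;
  return scores.filter((s) => s >= threshold).length / scores.length;
}

// Item 7: mean score of the latest run minus the mean of the run before it.
// `runOrder` lists run IDs oldest first; returns undefined when fewer than
// two runs exist or either run has no scores.
function regressionDelta(docs: ScoreDoc[], runOrder: string[]): number | undefined {
  if (runOrder.length < 2) return undefined;
  const previous = runOrder[runOrder.length - 2];
  const latest = runOrder[runOrder.length - 1];
  const meanFor = (runId: string): number | undefined => {
    const scores = docs.filter((d) => d.runId === runId).map((d) => d.score);
    return scores.length > 0 ? summarize(scores).mean : undefined;
  };
  const before = meanFor(previous);
  const after = meanFor(latest);
  return before === undefined || after === undefined ? undefined : after - before;
}
```

A negative `regressionDelta` would map to the "regression" color in item 7, a positive one to the "improvement" color; the exact thresholds are left open by this issue.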

Dashboard

A single "Evals Trends" dashboard combining the above, with controls for:

  • Run ID selector (dropdown)
  • Model filter
  • Date range
  • Dataset filter
  • Evaluator name filter
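
As a rough illustration of what these controls do to the underlying query, the sketch below turns a set of control selections into an Elasticsearch `bool` filter. The field names come from the visualization list above. The `ControlSelections` shape and `buildControlQuery` are hypothetical, and the `term` clauses assume the filtered fields are mapped as `keyword`.

```typescript
// Hypothetical selection state for the five dashboard controls.
interface ControlSelections {
  runId?: string;
  modelId?: string;
  dataset?: string;
  evaluatorName?: string;
  from?: string; // ISO 8601 timestamp or date math such as "now-30d"
  to?: string;
}

type QueryClause =
  | { term: Record<string, string> }
  | { range: Record<string, { gte?: string; lte?: string }> };

// Build a bool/filter query containing one clause per active control.
// Unset or empty controls contribute no clause, so they match everything.
function buildControlQuery(sel: ControlSelections): { bool: { filter: QueryClause[] } } {
  const filter: QueryClause[] = [];
  const addTerm = (field: string, value?: string): void => {
    if (value !== undefined && value !== "") {
      filter.push({ term: { [field]: value } });
    }
  };

  addTerm("run_id", sel.runId);
  addTerm("task.model.id", sel.modelId);
  addTerm("example.dataset.name", sel.dataset);
  addTerm("evaluator.name", sel.evaluatorName);

  if (sel.from !== undefined || sel.to !== undefined) {
    const bounds: { gte?: string; lte?: string } = {};
    if (sel.from !== undefined) bounds.gte = sel.from;
    if (sel.to !== undefined) bounds.lte = sel.to;
    filter.push({ range: { "@timestamp": bounds } });
  }

  return { bool: { filter } };
}
```

For example, `buildControlQuery({ modelId: "model-a", from: "now-30d" })` yields a filter with one `term` clause on `task.model.id` and one `range` clause on `@timestamp`.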

Bootstrap CLI Command

node scripts/evals dashboard          # Create/update saved objects in Kibana
node scripts/evals dashboard --delete  # Remove saved objects

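The command uses the Kibana saved objects API to import an NDJSON payload containing the data view, the visualizations, and the dashboard. Every object gets a deterministic ID with the evals- prefix, so re-running the command updates the existing objects instead of creating duplicates.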
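
A minimal sketch of how the NDJSON generation could work, assuming IDs are derived by slugifying each object's title. The helper names (`deterministicId`, `toNdjson`, `buildSavedObjects`), the stripped-down `attributes`, the reference names, and the data view title are all illustrative assumptions; real Lens and dashboard saved objects carry much larger payloads.

```typescript
// Minimal saved-object shape for illustration only.
interface SavedObject {
  type: string;
  id: string;
  attributes: Record<string, unknown>;
  references: Array<{ type: string; id: string; name: string }>;
}

const ID_PREFIX = "evals-";

// Derive a stable ID from a human-readable name, so the same name always
// yields the same ID and a re-import overwrites rather than duplicates.
function deterministicId(name: string): string {
  const slug = name
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
  return ID_PREFIX + slug;
}

// NDJSON is one JSON document per line.
function toNdjson(objects: SavedObject[]): string {
  return objects.map((o) => JSON.stringify(o)).join("\n");
}

// Hypothetical subset: one data view, one Lens visualization, one dashboard.
function buildSavedObjects(): SavedObject[] {
  const dataViewId = deterministicId("data view");
  const trendId = deterministicId("Score Trend");
  return [
    {
      type: "index-pattern", // saved-object type Kibana uses for data views
      id: dataViewId,
      // Title pattern is an assumption based on the data stream name.
      attributes: { title: "kibana-evaluations", timeFieldName: "@timestamp" },
      references: [],
    },
    {
      type: "lens",
      id: trendId,
      attributes: { title: "Score Trend" },
      // Reference name is a placeholder, not a verified Lens convention.
      references: [{ type: "index-pattern", id: dataViewId, name: "data-view-ref" }],
    },
    {
      type: "dashboard",
      id: deterministicId("Evals Trends"),
      attributes: { title: "Evals Trends" },
      references: [{ type: "lens", id: trendId, name: "panel-ref-0" }],
    },
  ];
}
```

The resulting string could then be uploaded as a multipart `file` field (with an `.ndjson` filename) to `POST /api/saved_objects/_import?overwrite=true`, sent with a `kbn-xsrf` header. Because the IDs are stable, `--delete` could plausibly rebuild the same list and remove each object by type and ID rather than searching for them.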

Files to Create

  • kbn-evals/src/dashboard/saved_objects.ts (generates NDJSON)
  • kbn-evals/src/cli/commands/dashboard.ts

Dependencies

  • Independent — reads existing kibana-evaluations data stream
  • More valuable after Phases 1-2 add more evaluator data

Acceptance Criteria

  • node scripts/evals dashboard creates all saved objects
  • Dashboard loads in Kibana with working controls
  • Visualizations render correctly with real eval data
  • --delete cleanly removes all created saved objects
  • Saved objects use deterministic IDs (re-runnable without duplicates)

Labels

  • Team:agent-builder
  • enhancement
  • kbn-evals