## Overview
Extend `@kbn/evals` with capabilities ported from cursor-plugin-evals to provide comprehensive evaluation infrastructure for Agent Builder development. This brings advanced evaluators, CI quality gates, security red-teaming, trend dashboards, and auto-test generation natively into the Kibana evaluation framework.
## Motivation
`@kbn/evals` currently has a strong foundation (Playwright runner, score repository, paired t-tests, criteria/correctness/groundedness/trace evaluators). However, it lacks several capabilities that are critical for mature LLM agent evaluation:
- Trajectory evaluation — Did the agent use the right tools in the right order?
- Conversation coherence — Does multi-turn quality hold up?
- Multi-judge panels — Reduce single-judge bias
- Security testing — Prompt injection, privilege escalation detection
- CI quality gates — Automated pass/fail enforcement in Buildkite
- Trend analysis — Score drift and regression detection over time
- Auto-test generation — Reduce manual dataset creation effort
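To make the first gap concrete, here is a minimal sketch of what a trajectory evaluator could look like. The function name, the `ToolCall` shape, and the scoring rule are illustrative assumptions, not the real `@kbn/evals` API: it treats the expected tool sequence as a subsequence of the agent's actual tool calls, tolerating extra calls in between.

```typescript
// ASSUMPTION: illustrative shape only; the real @kbn/evals evaluator
// interfaces may differ. This shows the core scoring idea for
// trajectory evaluation: did the agent call the right tools in order?

interface ToolCall {
  name: string;
}

// Returns the fraction of expected tools found, in order, within the
// actual trajectory (subsequence match: extra tool calls are tolerated).
function scoreTrajectory(expected: string[], actual: ToolCall[]): number {
  let matched = 0;
  for (const call of actual) {
    if (matched < expected.length && call.name === expected[matched]) {
      matched++;
    }
  }
  return expected.length === 0 ? 1 : matched / expected.length;
}
```

A score of 1 means the full expected sequence was observed in order; partial credit falls out naturally, which is useful when plotting score drift over time.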
## Architecture
All new capabilities extend the existing `@kbn/evals` patterns:
- Evaluators follow the `Evaluator<TExample, TTaskOutput>` factory pattern
- Scores flow to the same `kibana-evaluations` ES data stream
- CLI commands use `@kbn/dev-cli-runner`
- No new external npm dependencies beyond what Kibana already uses
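As a rough illustration of the factory pattern named above, the sketch below defines a standalone `Evaluator<TExample, TTaskOutput>` interface and a factory that closes over its configuration. The interface shape, the example types, and `createExactMatchEvaluator` are all assumptions for illustration, not the actual `@kbn/evals` exports.

```typescript
// ASSUMPTION: a simplified stand-in for the Evaluator<TExample, TTaskOutput>
// pattern; the real interface in @kbn/evals may carry more fields.

interface Evaluator<TExample, TTaskOutput> {
  name: string;
  evaluate(example: TExample, output: TTaskOutput): { score: number };
}

interface QaExample {
  question: string;
  expectedAnswer: string;
}

interface QaOutput {
  answer: string;
}

// Factory pattern: configuration (here, a normalizer) is captured in a
// closure, so suites can compose differently-configured instances.
function createExactMatchEvaluator(
  normalize: (s: string) => string = (s) => s.trim().toLowerCase()
): Evaluator<QaExample, QaOutput> {
  return {
    name: 'exact-match',
    evaluate: (example, output) => ({
      score: normalize(output.answer) === normalize(example.expectedAnswer) ? 1 : 0,
    }),
  };
}
```

The payoff of this pattern is that every new evaluator (trajectory, coherence, multi-judge) plugs into the same runner and score pipeline without new wiring.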
## Child Issues

| Phase | Issue | Depends On | Status |
| --- | --- | --- | --- |
| Phase 1: New Evaluators | #257822 | None | Not started |
| Phase 2: CI Quality Gates | #257823 | Phase 1 | Not started |
| Phase 3: Red-Teaming | #257824 | Phase 1 | Not started |
| Phase 4: Lens Dashboards | #257825 | Independent | Not started |
| Phase 5: Auto-Generation | #257826 | Phase 1 + 2 | Not started |
## Dependency Graph
```
Phase 1 (evaluators) ─── no dependencies
│
├──> Phase 2 (CI gates) ─── depends on Phase 1
│ │
│ └──> Phase 5 (auto-gen) ─── depends on Phase 1 + 2
│
├──> Phase 3 (red-team) ─── depends on Phase 1
│
└──> Phase 4 (dashboards) ─── independent (reads existing data)
```
## Companion: Cursor Plugin
The agent-builder-skill-dev Cursor plugin provides IDE-level helpers (skills, rules, knowledge docs) that wrap `@kbn/evals` CLI and APIs. The plugin is being updated in parallel to use the evals plugin API and leverage new evaluators as they land.
## Key Design Principles
- Follow existing `@kbn/evals` patterns (factory functions, Scout fixtures, `@kbn/dev-cli-runner`)
- No external npm dependencies beyond what Kibana already uses
- Evaluators are composable — suites pick which to run
- All scores flow to the same `kibana-evaluations` ES data stream
- Each phase ships as an independent PR
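The composability principle above can be sketched as a suite that simply maps a chosen list of evaluators over one example/output pair and collects their scores. The `runSuite` helper and the `Score` shape are hypothetical; in practice this wiring lives inside the Playwright runner and the scores flow on to the `kibana-evaluations` data stream.

```typescript
// ASSUMPTION: a minimal stand-in for suite composition; names and shapes
// are illustrative, not the real @kbn/evals API.

interface Score {
  evaluator: string;
  value: number;
}

interface SuiteEvaluator<E, O> {
  name: string;
  evaluate(example: E, output: O): number;
}

// A suite is just an array of evaluators: each suite picks which to run,
// and every evaluator's score is recorded under its own name.
function runSuite<E, O>(
  evaluators: Array<SuiteEvaluator<E, O>>,
  example: E,
  output: O
): Score[] {
  return evaluators.map((ev) => ({
    evaluator: ev.name,
    value: ev.evaluate(example, output),
  }));
}
```

Because each evaluator is an independent value, a security-focused suite and a correctness-focused suite can share some evaluators and omit others without code changes.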