## Overview
Extend `@kbn/evals` with capabilities ported from cursor-plugin-evals to provide comprehensive evaluation infrastructure for Agent Builder development. This brings advanced evaluators, CI quality gates, security red-teaming, trend dashboards, and auto-test generation natively into the Kibana evaluation framework.
## Motivation
`@kbn/evals` currently has a strong foundation (Playwright runner, score repository, paired t-tests, criteria/correctness/groundedness/trace evaluators). However, it lacks several capabilities that are critical for mature LLM agent evaluation:
- Trajectory evaluation — Did the agent use the right tools in the right order?
- Conversation coherence — Does multi-turn quality hold up?
- Multi-judge panels — Reduce single-judge bias
- Security testing — Prompt injection, privilege escalation detection
- CI quality gates — Automated pass/fail enforcement in Buildkite
- Trend analysis — Score drift and regression detection over time
- Auto-test generation — Reduce manual dataset creation effort
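To make the first gap concrete, here is a minimal sketch of what a trajectory evaluator could look like. The function name, the `ToolCall` shape, and the scoring rule are illustrative assumptions, not the real `@kbn/evals` API: it treats the expected tool sequence as a subsequence of the agent's actual tool calls, tolerating extra calls in between.

```typescript
// ASSUMPTION: illustrative shape only; the real @kbn/evals evaluator
// interfaces may differ. This shows the core scoring idea for
// trajectory evaluation: did the agent call the right tools in order?

interface ToolCall {
  name: string;
}

// Returns the fraction of expected tools found, in order, within the
// actual trajectory (subsequence match: extra tool calls are tolerated).
function scoreTrajectory(expected: string[], actual: ToolCall[]): number {
  let matched = 0;
  for (const call of actual) {
    if (matched < expected.length && call.name === expected[matched]) {
      matched++;
    }
  }
  return expected.length === 0 ? 1 : matched / expected.length;
}
```

A score of 1 means the full expected sequence was observed in order; partial credit falls out naturally, which is useful when plotting score drift over time.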
## Architecture
All new capabilities extend the existing `@kbn/evals` patterns:
- Evaluators follow the `Evaluator<TExample, TTaskOutput>` factory pattern
- Scores flow to the same `kibana-evaluations` ES data stream
- CLI commands use `@kbn/dev-cli-runner`
- No new external npm dependencies beyond what Kibana already uses
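As a rough illustration of the factory pattern named above, the sketch below defines a standalone `Evaluator<TExample, TTaskOutput>` interface and a factory that closes over its configuration. The interface shape, the example types, and `createExactMatchEvaluator` are all assumptions for illustration, not the actual `@kbn/evals` exports.

```typescript
// ASSUMPTION: a simplified stand-in for the Evaluator<TExample, TTaskOutput>
// pattern; the real interface in @kbn/evals may carry more fields.

interface Evaluator<TExample, TTaskOutput> {
  name: string;
  evaluate(example: TExample, output: TTaskOutput): { score: number };
}

interface QaExample {
  question: string;
  expectedAnswer: string;
}

interface QaOutput {
  answer: string;
}

// Factory pattern: configuration (here, a normalizer) is captured in a
// closure, so suites can compose differently-configured instances.
function createExactMatchEvaluator(
  normalize: (s: string) => string = (s) => s.trim().toLowerCase()
): Evaluator<QaExample, QaOutput> {
  return {
    name: 'exact-match',
    evaluate: (example, output) => ({
      score: normalize(output.answer) === normalize(example.expectedAnswer) ? 1 : 0,
    }),
  };
}
```

The payoff of this pattern is that every new evaluator (trajectory, coherence, multi-judge) plugs into the same runner and score pipeline without new wiring.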
## Child Issues

| Phase | Issue | Depends On | Status |
| --- | --- | --- | --- |
| Phase 1: New Evaluators | #257822 | None | Not started |
| Phase 2: CI Quality Gates | #257823 | Phase 1 | Not started |
| Phase 3: Red-Teaming | #257824 | Phase 1 | Not started |
| Phase 4: Lens Dashboards | #257825 | Independent | Not started |
| Phase 5: Auto-Generation | #257826 | Phase 1 + 2 | Not started |
## Dependency Graph
```
Phase 1 (evaluators) ─── no dependencies
│
├──> Phase 2 (CI gates) ─── depends on Phase 1
│ │
│ └──> Phase 5 (auto-gen) ─── depends on Phase 1 + 2
│
├──> Phase 3 (red-team) ─── depends on Phase 1
│
└──> Phase 4 (dashboards) ─── independent (reads existing data)
```
## Companion: Cursor Plugin
The agent-builder-skill-dev Cursor plugin provides IDE-level helpers (skills, rules, knowledge docs) that wrap `@kbn/evals` CLI and APIs. The plugin is being updated in parallel to use the evals plugin API and leverage new evaluators as they land.
## Key Design Principles
- Follow existing `@kbn/evals` patterns (factory functions, Scout fixtures, `@kbn/dev-cli-runner`)
- No external npm dependencies beyond what Kibana already uses
- Evaluators are composable — suites pick which to run
- All scores flow to the same `kibana-evaluations` ES data stream
- Each phase ships as an independent PR
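The composability principle above can be sketched as a suite that simply maps a chosen list of evaluators over one example/output pair and collects their scores. The `runSuite` helper and the `Score` shape are hypothetical; in practice this wiring lives inside the Playwright runner and the scores flow on to the `kibana-evaluations` data stream.

```typescript
// ASSUMPTION: a minimal stand-in for suite composition; names and shapes
// are illustrative, not the real @kbn/evals API.

interface Score {
  evaluator: string;
  value: number;
}

interface SuiteEvaluator<E, O> {
  name: string;
  evaluate(example: E, output: O): number;
}

// A suite is just an array of evaluators: each suite picks which to run,
// and every evaluator's score is recorded under its own name.
function runSuite<E, O>(
  evaluators: Array<SuiteEvaluator<E, O>>,
  example: E,
  output: O
): Score[] {
  return evaluators.map((ev) => ({
    evaluator: ev.name,
    value: ev.evaluate(example, output),
  }));
}
```

Because each evaluator is an independent value, a security-focused suite and a correctness-focused suite can share some evaluators and omit others without code changes.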