kbn/evals <-> Agent Builder eval feature completeness #255820

## Overview 

This epic contains feature requests for the `kbn/evals` framework to support specific evaluation needs for Agent Builder use cases. 

Today, Agent Builder evaluations run on the [@kbn/evals](https://github.com/elastic/kibana/tree/1c2092dd738aef688bdbb465ef619022a5f679c2/x-pack/platform/packages/shared/kbn-evals) framework. The Observability and Security teams actively develop the framework, focusing primarily on running evaluations in CI and improving the Kibana developer experience. The Agent Builder tests that run on it live in the [kbn-evals-suite-agent-builder](https://github.com/elastic/kibana/tree/1c2092dd738aef688bdbb465ef619022a5f679c2/x-pack/platform/packages/shared/agent-builder/kbn-evals-suite-agent-builder) package.

The DS team currently relies on the `kbn/evals` framework for offline experimentation and benchmarking. While functional, the framework has several limitations:

- Stability and usability concerns. The framework is under active development, and there have been stability issues in the past. 
- Missing primitives. It lacks essential abstractions such as experiments, runs, and datasets.
- Limited tracing and visualization. Native tracing visualization within Elastic does not match the capabilities of specialized solutions such as LangSmith, Arize, and Opik, which are critical for effective experimentation. As an interim measure, the team relies on Arize Phoenix. However, that setup has significant stability issues and cannot be used reliably at scale, and local environments suffer from data persistence and consistency problems.

The `kbn/evals` framework is owned and maintained by the o11y team. The goal is to incorporate the missing functionality over time and ultimately eliminate the dependency on external solutions like Arize Phoenix. This epic describes feature requests from the Agent Builder team to help make `kbn/evals` a more fully functional solution for offline and online evaluations.
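
To make the "missing primitives" point concrete, here is a minimal sketch of how datasets, runs, and experiments could relate to one another. Every name in it (`Dataset`, `Run`, `Experiment`, `executeRun`, `meanScore`) is hypothetical and chosen for illustration only. None of it is part of the current `kbn/evals` API, and it is not a proposed design.

```typescript
// Hypothetical shapes for illustration; none of these exist in kbn/evals today.
interface Example {
  input: string;
  expected: string;
}

// A dataset is a named, reusable collection of examples.
interface Dataset {
  name: string;
  examples: Example[];
}

// A run is one execution of a task over a dataset, with one score per example.
interface Run {
  id: string;
  datasetName: string;
  scores: number[];
}

// An experiment groups runs so that they can be compared.
interface Experiment {
  name: string;
  runs: Run[];
}

type Task = (input: string) => string;
type Evaluator = (output: string, expected: string) => number;

// Execute the task on every example and score each output.
function executeRun(id: string, dataset: Dataset, task: Task, evaluate: Evaluator): Run {
  const scores = dataset.examples.map((ex) => evaluate(task(ex.input), ex.expected));
  return { id, datasetName: dataset.name, scores };
}

// Average score of a run; an empty run scores 0.
function meanScore(run: Run): number {
  if (run.scores.length === 0) return 0;
  return run.scores.reduce((sum, s) => sum + s, 0) / run.scores.length;
}

// Toy usage: an upper-casing "agent" scored by exact match.
const dataset: Dataset = {
  name: 'toy',
  examples: [
    { input: 'a', expected: 'A' },
    { input: 'b', expected: 'x' },
  ],
};
const exactMatch: Evaluator = (output, expected) => (output === expected ? 1 : 0);
const upperCase: Task = (input) => input.toUpperCase();

const experiment: Experiment = {
  name: 'baseline',
  runs: [executeRun('run-1', dataset, upperCase, exactMatch)],
};
```

Keeping the dataset separate from the run is what would let several runs in one experiment be compared on identical inputs, which is the kind of benchmarking workflow described above.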
