Overview
This epic contains feature requests for the kbn/evals framework to support specific evaluation needs for Agent Builder use cases.
Today, Agent Builder evaluations run on the @kbn/evals framework. The framework is actively developed by the Observability and Security teams, who focus primarily on running evaluations in CI and improving the Kibana developer experience. The Agent Builder tests running on this framework are located here.
The DS team currently relies on the kbn/evals framework for offline experimentation and benchmarking. While functional, this framework presents several limitations:
- Stability and usability concerns. The framework is under active development, and there have been stability issues in the past.
- Missing primitives. It lacks essential abstractions such as experiments, runs, and datasets.
- Limited tracing and visualization. Native support for tracing visualization within Elastic does not match the capabilities of specialized solutions like LangSmith, Arize, and Opik, tools that are critical for effective experimentation. As an interim measure, the team relies on Arize Phoenix. However, this setup has significant stability issues and cannot be used reliably at scale, and local environments suffer from data persistence and consistency issues.
The kbn/evals framework is owned and maintained by the o11y team. The goal is to incorporate the missing functionality over time and ultimately eliminate the dependency on external solutions like Arize Phoenix. This epic describes feature requests from the Agent Builder team to help make kbn/evals a more fully functional solution for offline and online evaluations.